Forem: Alex Delov

llm-nano-vm v0.8.0 — deterministic FSM runtime for LLM pipelines, now with output validation and per-step timeouts

Alex Delov — Sat, 23 May 2026 04:36:37 +0000

PyPI: pip install llm-nano-vm

GitHub: http://github.com/Ale007XD/nano_vm

MCP gateway: http://github.com/Ale007XD/nano-vm-mcp

I've been building a deterministic FSM execution kernel for LLM workflows. v0.8.0 just shipped to PyPI. Here's what it is, what's new, and where it's going.

What it is

Most LLM frameworks treat the model as the orchestrator. nano-vm flips that: the runtime is the orchestrator, the model is just one step in a deterministic graph.

δ(S, E) → S'

Current state + validated event = next state. The model cannot skip steps, reorder them, or escape guardrails. The FSM is the source of truth.
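
To make the transition rule concrete, here is a minimal standalone sketch of a table-driven δ. It does not use the nano-vm API; the TRANSITIONS table and the delta function are hypothetical names chosen for illustration, with states borrowed from the refund example.

```python
from typing import Dict, Tuple

# Hypothetical transition table: (state, validated event) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("analyze", "yes"): "process_refund",
    ("analyze", "no"): "reject",
}


def delta(state: str, event: str) -> str:
    """Return the next state, or raise if the pair is not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # Anything outside the table is rejected instead of being improvised.
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

Because the table is plain data, every reachable state can be enumerated before anything runs.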

Four step types: llm, tool, condition, parallel. Programs are plain Python dicts. No DSL parser, no heavy framework magic, and zero dependency overhead.

program = Program.from_dict({
    "name": "customer_refund",
    "steps": [
        {
            "id": "analyze",
            "type": "llm",
            "prompt": "Valid refund? Reply 'yes' or 'no'.\nRequest: $user_input",
            "output_key": "decision",
            "allowed_outputs": ["yes", "no"],   # ← v0.8.0
        },
        {
            "id": "guardrail",
            "type": "condition",
            "condition": "'yes' in '$decision'",
            "then": "process_refund",
            "otherwise": "reject",
        },
        {"id": "process_refund", "type": "tool", "tool": "issue_refund",   "is_terminal": True},
        {"id": "reject",         "type": "tool", "tool": "send_rejection", "is_terminal": True},
    ],
})

The guardrail step cannot be bypassed regardless of what the model returns.

What's new in v0.8.0

allowed_outputs — LLM enum guard

Validates the model's raw output against an explicit list before the value touches anything downstream.

{
    "id": "classify",
    "type": "llm",
    "prompt": "Classify. Reply ONLY with: refund / query / other",
    "allowed_outputs": ["refund", "query", "other"],
    "on_error": "skip",   # → falls back to "refund" (first element) on mismatch
}

Three policies on mismatch: fail (default, trace → FAILED), skip (substitute allowed_outputs[0]), retry (retry up to max_retries, then FAILED).
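
The three policies can be sketched as a small standalone function. This is an illustration of the behavior described above, not the library's implementation: guard_output and call_model are hypothetical names, the match is assumed to be an exact string comparison, and a ValueError stands in for the trace moving to FAILED.

```python
from typing import Callable, Sequence


def guard_output(
    call_model: Callable[[], str],
    allowed_outputs: Sequence[str],
    on_error: str = "fail",
    max_retries: int = 2,
) -> str:
    """Return a value that is in allowed_outputs, or raise ValueError."""
    # Only the "retry" policy gets extra attempts.
    attempts = 1 + (max_retries if on_error == "retry" else 0)
    for _ in range(attempts):
        raw = call_model()
        if raw in allowed_outputs:  # assumption: exact match on the raw string
            return raw
    if on_error == "skip":
        return allowed_outputs[0]  # substitute the first allowed value
    raise ValueError(f"model output not in {list(allowed_outputs)}")
```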

timeout_seconds + on_timeout — per-step LLM timeout

Prevents a hung API call from stalling the entire FSM.

{
    "id": "analyze",
    "type": "llm",
    "timeout_seconds": 5.0,
    "on_timeout": "fallback",   # → falls back to allowed_outputs[0] or ''
}

Two policies: fail (default) and fallback. Both features are independent and composable — you can use either or both on any llm step.
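
One way to get this behavior in plain asyncio is to wrap the model call in asyncio.wait_for. The sketch below is an assumption about how such a guard could look, not the runtime's code; call_with_timeout and call_model are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def call_with_timeout(
    call_model: Callable[[], Awaitable[str]],
    timeout_seconds: float,
    on_timeout: str = "fail",
    allowed_outputs: Sequence[str] = (),
) -> str:
    """Await the model call, applying the on_timeout policy if it hangs."""
    try:
        return await asyncio.wait_for(call_model(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        if on_timeout == "fallback":
            # Fall back to the first allowed output, or '' if none is set.
            return allowed_outputs[0] if allowed_outputs else ""
        raise  # "fail": let the timeout propagate to the caller
```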

What it can do right now

Suspend / resume. When any tool returns "PENDING", the FSM moves to SUSPENDED and persists the cursor. It can resume on any external event (webhook, approval, settlement): RUNNING → SUSPENDED → RUNNING → SUCCESS.
Condition branching with ASTEngine. eval() is gone. Conditions are parsed into a validated JSON AST and evaluated by a sandboxed interpreter. No Python builtins are accessible. Method calls (.lower() etc.) raise ASTEvalError at parse time instead of silently returning False.
GDPR tombstoning. Sensitive values stored as CapabilityRef tokens (vault://secret/). On erasure event: ref tombstoned, all projections return [REDACTED_TOMBSTONE], hash chain stays valid.
GovernanceEnvelope. Every successful step produces an immutable, append-only audit record: execution_id, step_id, policy_hash, canonical_snapshot_hash, sanitized payload.
MCP gateway (nano-vm-mcp). Exposes run_program, get_trace, list_programs etc. over stdio or SSE transport with bearer auth and SQLite WAL persistence. Works with Claude Desktop and any MCP client.
Budget guardrails. max_steps, max_tokens, max_stalled_steps — FSM halts with BUDGET_EXCEEDED or STALLED before the next step, not after.
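
The suspend / resume lifecycle from the list above can be sketched in a few lines. This is a toy model, not the nano-vm API: the Run class is hypothetical, each step is reduced to a zero-argument callable, and resume() assumes the external event settles the pending step so the cursor moves past it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Run:
    tools: List[Callable[[], str]]  # one callable per step, in order
    cursor: int = 0
    status: str = "RUNNING"
    history: List[str] = field(default_factory=lambda: ["RUNNING"])

    def _set(self, status: str) -> None:
        self.status = status
        self.history.append(status)

    def advance(self) -> str:
        """Run steps until one returns "PENDING" or the program ends."""
        while self.cursor < len(self.tools):
            if self.tools[self.cursor]() == "PENDING":
                self._set("SUSPENDED")  # cursor stays on the pending step
                return self.status
            self.cursor += 1
        self._set("SUCCESS")
        return self.status

    def resume(self) -> str:
        """Continue after an external event (webhook, approval, settlement)."""
        if self.status != "SUSPENDED":
            raise RuntimeError("only a SUSPENDED run can be resumed")
        self._set("RUNNING")
        self.cursor += 1  # assumption: the event settled the pending step
        return self.advance()
```

In a real deployment the cursor and status would be written to durable storage at the point where this sketch only updates fields in memory.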

Benchmark — v0.8.0 (WSL2 · Python 3.12 · MockAdapter · 3×5×10k)
10/10 PASS · 1,096,500 ops · 0 violations
Scenario                   Mean TPS    p95
Refund pipeline            2,200/s     123 ms
Double-execution guard     2,800/s     69 ms
Budget enforcement         2,400/s     97 ms
Parallel throughput        1,000/s     196 ms
MCP store round-trip       11,000/s    0.13 ms
GovernanceEnvelope         2,100/s     108 ms
Crash consistency          11/s        115 ms
Replay equivalence         1,300/s     164 ms
Adversarial retries        2,600/s     87 ms
Long-horizon (1k steps)    95/s        11,887 ms

BM-INT-07 (Crash consistency): crash_rate=100% hash_match=100% — replay after simulated crash produces identical trace hash every time.

BM-INT-10 (Memory footprint): peak RSS 76.5 MB, alloc 3.62 MB for 1,000-step programs — no memory leaks detected.

Validated on real payment APIs

Two PoCs, both 9/9 tests passing with mock adapters:
MoMo Payment API v4 — 3-way condition branch, HMAC-SHA256 IPN verification, polling loop with retry, next_step/is_terminal DSL.
Stripe Payment API v1 — 3DS flow (REQUIRES_ACTION sentinel), refund pipeline with LLM classifier, webhook verification. Found and fixed two bugs in the process: "PENDING" sentinel collision (Stripe was returning it as a domain status, triggering FSM suspend), and silent ASTEvalError for .lower() in condition expressions.

What's coming next
Phase 0 (Immediate): ProgramValidator: static analysis at Program build time. It catches missing then/otherwise/next_step targets, unreachable steps, and cycles. Currently these fail at runtime; for LLM-generated workflows, static analysis is a must.
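
As a rough idea of what such a check involves, here is a standalone sketch that covers two of the three checks (missing targets and unreachable steps) over the step dicts shown earlier. ProgramValidator is not released yet, so validate_program is a hypothetical name, and the sketch assumes that a non-terminal step without explicit targets falls through to the next step in the list. Cycle detection is left out to keep it short.

```python
from typing import Dict, List


def validate_program(steps: List[dict]) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    ids = [s["id"] for s in steps]
    known = set(ids)
    problems: List[str] = []
    edges: Dict[str, List[str]] = {}

    for i, step in enumerate(steps):
        targets = [step[k] for k in ("then", "otherwise", "next_step") if k in step]
        for t in targets:
            if t not in known:
                problems.append(f"{step['id']}: unknown target {t!r}")
        # Assumption: fall through to the next step when no target is given.
        if not targets and not step.get("is_terminal") and i + 1 < len(steps):
            targets = [ids[i + 1]]
        edges[step["id"]] = [t for t in targets if t in known]

    # Depth-first walk from the first step to find unreachable steps.
    seen, stack = set(), ids[:1]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges[node])
    problems += [f"{i}: unreachable" for i in ids if i not in seen]
    return problems
```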

Phase 1 (Gateway Correctness): StateContext persistence between MCP calls in SQLite WAL. Right now, if the gateway process restarts after /create but before polling completes, you get a new requestId — which is a real financial duplicate risk. Closing this with an execution_contexts table + upsert on every step. Up next: TRACE projection to SQLite, GovernedToolExecutor (policy-level tool capability enforcement), idempotency_store, and native vm.step() MCP wiring.
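
A minimal sketch of the upsert idea, using only the standard sqlite3 module. Only the table name execution_contexts comes from the plan above; the column names, the helper functions, and the rule that the first stored request_id wins are assumptions made for illustration.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # has no effect on in-memory DBs
    conn.execute(
        "CREATE TABLE IF NOT EXISTS execution_contexts ("
        " execution_id TEXT PRIMARY KEY,"
        " step_cursor INTEGER NOT NULL,"
        " request_id TEXT)"
    )
    return conn


def save_step(conn: sqlite3.Connection, execution_id: str,
              step_cursor: int, request_id: Optional[str]) -> None:
    """Upsert the context; an already stored request_id is never replaced."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO execution_contexts (execution_id, step_cursor, request_id)"
            " VALUES (?, ?, ?)"
            " ON CONFLICT(execution_id) DO UPDATE SET"
            " step_cursor = excluded.step_cursor,"
            " request_id = COALESCE(execution_contexts.request_id, excluded.request_id)",
            (execution_id, step_cursor, request_id),
        )


def load_context(conn: sqlite3.Connection, execution_id: str):
    """Return (step_cursor, request_id), or None for an unknown execution."""
    return conn.execute(
        "SELECT step_cursor, request_id FROM execution_contexts"
        " WHERE execution_id = ?",
        (execution_id,),
    ).fetchone()
```

After a restart, a gateway built this way would read the stored request_id back instead of minting a new one.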

Phase 2 (Dev Agent): nano-vm-dev-agent — the FSM runtime managing its own development stack (read_repo_files → generate_patch(llm) → run_mypy → run_pytest → write_repo_files). DA-1 milestone is done (12/12 tests). DA-2 will be the first live run against a real sprint task (StateContext persistence). Still working on search_code and reproduce_bug tool-functions before launching live.

Phase 3 (Observability): OpenTelemetry span per FSM step + incremental counters in Trace (llm_calls, tool_calls, retries_total).

Install
pip install llm-nano-vm==0.8.0

pip install "llm-nano-vm[litellm]==0.8.0" # LiteLLM provider support (quoted so shells such as zsh do not expand the brackets)

pip install nano-vm-mcp # MCP gateway

LLMs are completely optional. The runtime works perfectly fine as a pure, lightweight deterministic workflow engine.

Questions / feedback welcome!

Models shouldn't have execution authority. Why we built a deterministic FSM runtime for AI agents.

Alex Delov — Thu, 21 May 2026 04:49:39 +0000

Modern agent frameworks implicitly treat a probabilistic model as an execution authority. That is acceptable for read-only tasks (e.g., summarizing logs or searching the web). But once an agent can mutate external state — payments, databases, infrastructure, PII — the architecture becomes fundamentally unsafe.

When preparing our internal agents (PlanBot, SkillBot) for white-label distribution, we realized we needed to change the control plane. nano-vm does not attempt to make the model trustworthy. Instead, it assumes model output is untrusted intent and constrains its blast radius through strict deterministic execution semantics.

The Runtime Guarantees (Not just another wrapper)

We built nano-vm — a deterministic FSM runtime for stateful AI systems. The value isn't just in having an FSM; the value is that the execution graph is finite, verifiable, and known ahead of time.

The runtime enforces:

Deterministic transition graph: Execution graph cannot self-modify at runtime.
Compile-time ordering: Step order is fixed when the program is built, so a reorder_steps attack is structurally impossible.
Capability gating: Strictly bounded side-effects.
Replay resistance: Idempotency boundaries built into the state transitions.
Immutable auditability: Cryptographic history of every action.

ASTEngine: Limitation as a Security Property

In most agent runtimes, the execution loop is essentially: prompt -> JSON -> dynamic dispatch -> side-effect.

We completely removed eval(). Conditions and side-effects are evaluated by a sandboxed DeterministicSanitizer using an isolated ASTEngine. It supports basic operators (==, contains, $var.field) but completely lacks loops or system calls.

The policy layer is intentionally less expressive than Python. That limitation is a security property, not a missing feature. Loop exhaustion and ReDoS attacks are structurally impossible.
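
To illustrate why a restricted evaluator is easy to reason about, here is a small whitelist interpreter built on Python's standard ast module. It is a stand-in for the idea, not the project's ASTEngine: it parses Python expression syntax rather than the JSON AST described above, variables are written without the $ prefix, and ConditionError and eval_condition are hypothetical names.

```python
import ast


class ConditionError(Exception):
    """Raised for any construct outside the whitelist."""


def eval_condition(expr: str, variables: dict) -> bool:
    """Evaluate a comparison using only whitelisted node types."""
    tree = ast.parse(expr, mode="eval")
    return bool(_eval(tree.body, variables))


def _eval(node: ast.AST, variables: dict):
    if isinstance(node, ast.Constant) and isinstance(node.value, (str, int, float, bool)):
        return node.value
    if isinstance(node, ast.Name) and node.id in variables:
        return variables[node.id]
    if isinstance(node, ast.Compare) and len(node.ops) == 1:
        left = _eval(node.left, variables)
        right = _eval(node.comparators[0], variables)
        op = node.ops[0]
        if isinstance(op, ast.Eq):
            return left == right
        if isinstance(op, ast.NotEq):
            return left != right
        if isinstance(op, ast.In):
            return left in right
    # Calls, attribute access, loops, lambdas and everything else end up here.
    raise ConditionError(f"unsupported expression: {type(node).__name__}")
```

There is no branch for ast.Call or ast.Attribute, so an expression such as decision.lower() == 'yes' is rejected loudly rather than evaluated.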

Sabotage Mode: Demonstrating Failure Semantics

To demonstrate the runtime under adversarial conditions, we built a 7-step fintech pipeline (PDF invoice -> Stripe test-mode adapter) with an integrated Sabotage Mode. Rather than a happy-path demo, the UI exposes 5 injectors that each trigger a specific adversarial failure.

1. tool_injection (Capability boundary violation)
Proposed tool invocations are treated as untrusted intent. If the LLM attempts to initiate an unauthorized wire_transfer($50,000), the ExecutionVM resolves the request against a compile-time capability snapshot. The transition is rejected before any external side-effect layer becomes reachable. Zero side effects reach the network.

2. double_exec (Replay & Idempotency)
External side-effects are executed through idempotent adapters keyed by execution_id, so internal state recovery can be replayed deterministically without duplicating external mutations. Once the FSM reaches a terminal state (SUCCESS or FAILED), that state is absorbing (δ(SUCCESS|FAILED, *) = NOP). Replays are silently dropped.

3. corrupt_hash
Tampering with the validation hash instantly throws the FSM into a FAILED state, resulting in a zeroed envelope chain. The audit trail cannot be silently broken.
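
The first two injectors (tool_injection and double_exec) come down to two checks in front of every side effect. The sketch below combines them in one hypothetical GuardedExecutor class. It is not the ExecutionVM: the allowed set stands in for the compile-time capability snapshot, and keying results by (execution_id, tool_name) is an assumption made here, since the post only says adapters are keyed by execution_id.

```python
from typing import Callable, Dict, FrozenSet, Tuple


class CapabilityError(Exception):
    """Raised when a proposed tool is outside the capability snapshot."""


class GuardedExecutor:
    def __init__(self, tools: Dict[str, Callable[..., object]],
                 allowed: FrozenSet[str]) -> None:
        self._tools = dict(tools)
        self._allowed = frozenset(allowed)  # fixed when the executor is built
        self._done: Dict[Tuple[str, str], object] = {}

    def run(self, execution_id: str, tool_name: str, **kwargs: object) -> object:
        # 1. Capability check happens before the tool is even looked up.
        if tool_name not in self._allowed:
            raise CapabilityError(f"{tool_name!r} is not in the capability snapshot")
        # 2. Idempotency: a replay returns the stored result, no second call.
        key = (execution_id, tool_name)
        if key not in self._done:
            self._done[key] = self._tools[tool_name](**kwargs)
        return self._done[key]
```

A rejected call never reaches the tool function, and a replayed call returns the recorded result without touching the outside world again.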

GDPR Art.17 vs. Immutable Audit Trails

Handling the "Right to Erasure" without breaking cryptographic audit chains is a major headache in fintech.

We implemented a GDPR-erase mechanism that targets specific vault://secret/ref pointers and replaces the PII with a [REDACTED_TOMBSTONE].

The PII becomes completely inaccessible.
The hash_chain and canonical_hash survive.
Cryptographic continuity is maintained.
Referential integrity is preserved.

You delete the data, but you do not destroy the mathematical proof that the operation occurred safely.
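
The reason erasure does not break the chain is that the hashed records hold references, not the sensitive values themselves. The following sketch shows that idea with a plain dict as the vault and SHA-256 over JSON as the chain. All function names, the record layout, and the example ref are hypothetical; the real envelope format is not described in this post.

```python
import hashlib
import json
from typing import Dict, List

TOMBSTONE = "[REDACTED_TOMBSTONE]"


def _digest(prev: str, payload: dict) -> str:
    body = json.dumps(payload, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_record(chain: List[dict], payload: dict) -> None:
    """Append a record whose hash covers the previous hash and the payload."""
    prev = chain[-1]["hash"] if chain else ""
    chain.append({"payload": payload, "prev": prev, "hash": _digest(prev, payload)})


def verify(chain: List[dict]) -> bool:
    """Recompute every hash; any edited payload or link makes this False."""
    prev = ""
    for rec in chain:
        if rec["prev"] != prev or rec["hash"] != _digest(prev, rec["payload"]):
            return False
        prev = rec["hash"]
    return True


def project(record: dict, vault: Dict[str, str]) -> dict:
    """Resolve vault:// refs for display; the stored payload is not changed."""
    return {
        k: vault.get(v, v) if isinstance(v, str) and v.startswith("vault://") else v
        for k, v in record["payload"].items()
    }


def erase(vault: Dict[str, str], ref: str) -> None:
    """Tombstone a ref: the value is gone, the ref in the chain remains."""
    vault[ref] = TOMBSTONE
```

Erasure touches only the vault, so verify() keeps returning True while every projection of the erased ref shows the tombstone.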

Execution Authority vs. Model Quality

LLMs are excellent planners. They are terrible sources of execution truth.

The core design question for stateful AI systems may not be model quality.
It may be execution authority.

Should a probabilistic model be allowed to mutate state directly?
Or should execution pass through a deterministic control layer first?

If you want to try breaking the FSM yourself, the Sabotage Mode is live, and the core is open-source:

Core runtime: github.com/Ale007XD/nano_vm
MCP gateway layer: github.com/Ale007XD/nano-vm-mcp
Live Sabotage Demo: demo.bannerbot.ru:8843

Curious how others here are approaching capability boundaries, replay resistance, and auditability in agent runtimes.