Forem: Bala Paranj

The contract is the interface: agent-driven Steampipe→Stave in one command

Bala Paranj — Sat, 23 May 2026 11:15:39 +0000

Consider a typical cloud-security tool's onboarding flow. A customer installs the tool. The tool's collector tries to authenticate to AWS, fails because the role isn't there yet, the customer follows three pages of setup docs, the role gets created, the collector authenticates, the collector runs, the collector finds nothing because the tool only knows about S3 and IAM and the customer's workload is on EKS. End of week one.

We don't ship a collector. Stave evaluates obs.v0.1 JSON snapshots — whatever produces them. That decision sounds extreme until you've watched the same "the collector doesn't see our environment" conversation play out three times. So instead of a collector, Stave ships a contract: per-asset JSON Schemas, per-asset Steampipe→Stave column mappings, and one command (stave contract show) that emits everything an agent needs to author its own ingest. The customer's preferred source (Steampipe, AWS Config, Terraform state, an internal inventory API) plugs in by satisfying the contract.

This post walks through the steps that close the pipeline.

What the customer sees

$ stave contract show --asset-type aws_s3_bucket
Contract: aws_s3_bucket
Schema:   schemas/observation/v1/asset-types/aws_s3_bucket.schema.json
Controls: 102 | Chains: 15

Property paths (catalog reads these — sorted by chain unlock, then control unlock):

  PATH                                                          CONTROLS  CHAINS  SEVERITY  NOTE
  ────                                                          ────────  ──────  ────────  ────
  storage.kind                                                  91        15      critical
  storage.tags.data-classification                              14        2       critical  intent
  storage.access.public_read                                    8         2       critical
  storage.controls.public_access_fully_blocked                  3         1       critical
  ...

Steampipe mapping: contracts/steampipe/aws_s3_bucket.yaml

That output names everything the customer's ingest agent needs:

The schema — the JSON Schema the agent's output must satisfy
The property paths — what fields the catalog actually reads on this asset type, ranked by how many controls and chains they unlock
The mapping — a ready-to-run YAML telling the agent which Steampipe column maps to which Stave property path

For the 17 most catalog-impactful asset types, the mapping is committed. For the rest, the customer's agent has the schema; it can author its own.

The YAML mapping format

The Steampipe→Stave mapping is one ordered list of operations per asset type. Four operation kinds cover every transform shape:

field — direct column → property mapping with optional coerce/default
static — a fixed value (e.g. properties.storage.kind: bucket)
extract — pull a nested JSON value from a JSON-shaped column
computed — derive from already-set property paths (all / any reduction)

Operations run in YAML order; later ops can read paths written by earlier ones. The first mapping we wrote — contracts/steampipe/aws_s3_bucket.yaml — replaced a Python function with a declarative file. The loader changes are 100 lines; the resulting observation is byte-identical to what the imperative function produced.

operations:
  - kind: static
    path: properties.storage.kind
    value: bucket

  - kind: field
    path: properties.storage.tags
    column: tags
    default: {}
    type: dict

  - kind: extract
    path: properties.storage.encryption.algorithm
    column: server_side_encryption_configuration
    json_path: "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm"
    key_variants:
      Rules: rules
      SSEAlgorithm: sse_algorithm
    default: "none"

  - kind: computed
    path: properties.storage.controls.public_access_fully_blocked
    op: all
    inputs:
      - properties.storage.controls.public_access_block.block_public_acls
      - properties.storage.controls.public_access_block.block_public_policy
      - properties.storage.controls.public_access_block.ignore_public_acls
      - properties.storage.controls.public_access_block.restrict_public_buckets

The format is the contract. Any agent in any language can parse the YAML and produce conforming observations.
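As a rough illustration of how small such a parser can be, here is a minimal Python sketch of an interpreter for the four operation kinds. It is not the reference loader in examples/agents/stave_transform.py. The operations are written as an already-parsed list instead of being read from YAML, the type coercion is omitted, the reading of key_variants as "alternate spelling of a key" is an assumption, the two public_access_block field operations and their column names are hypothetical, and the sample row is made up.

```python
from typing import Any

def get_path(doc: dict, dotted: str, default: Any = None) -> Any:
    """Read a dotted path such as 'properties.storage.kind'."""
    node: Any = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node

def set_path(doc: dict, dotted: str, value: Any) -> None:
    """Write a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = dotted.split(".")
    node = doc
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value

def extract(value: Any, json_path: str, variants: dict, default: Any) -> Any:
    """Walk a nested JSON value; numeric segments index into lists."""
    node = value
    for seg in json_path.split("."):
        if isinstance(node, list) and seg.isdigit() and int(seg) < len(node):
            node = node[int(seg)]
        elif isinstance(node, dict) and seg in node:
            node = node[seg]
        elif isinstance(node, dict) and variants.get(seg) in node:
            # Assumed meaning of key_variants: an alternate spelling of the key.
            node = node[variants[seg]]
        else:
            return default
    return node

def apply_operations(row: dict, operations: list) -> dict:
    """Run operations in order; later ops may read paths written by earlier ones."""
    obs: dict = {}
    for op in operations:
        kind = op["kind"]
        if kind == "static":
            value = op["value"]
        elif kind == "field":
            value = row.get(op["column"])
            if value is None:
                value = op.get("default")
        elif kind == "extract":
            value = extract(row.get(op["column"]), op["json_path"],
                            op.get("key_variants", {}), op.get("default"))
        elif kind == "computed":
            # Assumption: an input path that was never written counts as False.
            flags = [bool(get_path(obs, p, False)) for p in op["inputs"]]
            value = all(flags) if op["op"] == "all" else any(flags)
        else:
            raise ValueError(f"unknown operation kind: {kind}")
        set_path(obs, op["path"], value)
    return obs

PAB = "properties.storage.controls.public_access_block"
OPERATIONS = [
    {"kind": "static", "path": "properties.storage.kind", "value": "bucket"},
    {"kind": "field", "path": "properties.storage.tags", "column": "tags", "default": {}},
    {"kind": "extract", "path": "properties.storage.encryption.algorithm",
     "column": "server_side_encryption_configuration",
     "json_path": "Rules.0.ApplyServerSideEncryptionByDefault.SSEAlgorithm",
     "key_variants": {"Rules": "rules", "SSEAlgorithm": "sse_algorithm"},
     "default": "none"},
    # Hypothetical field ops that feed the computed reduction below.
    {"kind": "field", "path": f"{PAB}.block_public_acls",
     "column": "block_public_acls", "default": False},
    {"kind": "field", "path": f"{PAB}.block_public_policy",
     "column": "block_public_policy", "default": False},
    {"kind": "computed", "path": "properties.storage.controls.public_access_fully_blocked",
     "op": "all", "inputs": [f"{PAB}.block_public_acls", f"{PAB}.block_public_policy"]},
]

row = {  # one made-up Steampipe row
    "tags": None,
    "server_side_encryption_configuration": {
        "rules": [{"ApplyServerSideEncryptionByDefault": {"sse_algorithm": "aws:kms"}}]
    },
    "block_public_acls": True,
    "block_public_policy": False,
}
observation = apply_operations(row, OPERATIONS)
```

Because the interpreter walks the list in order, the computed operation only works when the field operations that write its inputs appear earlier in the file, which is the ordering rule described above.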

Per-asset JSON Schemas

The catalog ships 3,957 controls; together they declare applicable_asset_types for 109 distinct asset types. To validate that a mapping's target paths are real, we needed a JSON Schema per asset type. Hand-authoring 109 schemas is a Tuesday lost; the schema generator already existed (it walks every control's predicate AST and infers the property paths + types), but defaulted to the top-3 most-used types.

go run ./internal/tools/genassetschemas/... -top 200
make sync-schemas

Output: 109 per-asset schemas under schemas/observation/v1/asset-types/. Every level is additionalProperties: true — the schemas are discoverability artifacts, not restrictive gates. A schema that lists one property (security_hub.enabled on aws_securityhub_account, for example) tells an agent "this asset type matters to the catalog; here is the one property to populate." Thin schemas are still useful.
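To make the "discoverability artifact" idea concrete, the sketch below lists the property paths a thin schema advertises and reports which ones an observation has not populated yet. The schema literal is hypothetical: it assumes the generated files nest standard JSON Schema "properties" objects, which the post does not show. The helper names are illustrative too.

```python
# Hypothetical thin schema, modelled on the security_hub.enabled example.
THIN_SCHEMA = {
    "type": "object",
    "additionalProperties": True,
    "properties": {
        "security_hub": {
            "type": "object",
            "additionalProperties": True,
            "properties": {"enabled": {"type": "boolean"}},
        }
    },
}

def leaf_paths(schema: dict, prefix: str = "") -> list:
    """Return dotted paths for every leaf under nested 'properties' objects."""
    props = schema.get("properties")
    if not props:
        return [prefix] if prefix else []
    paths = []
    for name, sub in sorted(props.items()):
        child = f"{prefix}.{name}" if prefix else name
        paths.extend(leaf_paths(sub, child))
    return paths

def has_path(doc: dict, dotted: str) -> bool:
    """True when every segment of the dotted path exists in the document."""
    node = doc
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return False
        node = node[key]
    return True

def missing_paths(properties: dict, schema: dict) -> list:
    """Paths the schema advertises that the observation has not populated."""
    return [p for p in leaf_paths(schema) if not has_path(properties, p)]

advertised = leaf_paths(THIN_SCHEMA)
todo = missing_paths({"security_hub": {}, "extra_field": 1}, THIN_SCHEMA)
```

Extra keys such as extra_field are ignored here, mirroring additionalProperties: true. The schema tells the agent what to add, not what to leave out.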

Ten hand-authored mappings

The next 10 asset types by control coverage — aws_iam_role, aws_lambda_function, aws_cognito_user_pool, aws_cloudtrail_trail, aws_kms_key, aws_ec2_instance, aws_sqs_queue, aws_iam_user, aws_opensearch_domain, aws_stepfunctions_state_machine — got hand-authored mappings. They served two purposes: actual coverage for the most-asked-for types, and a ground-truth corpus to validate Iter 5's auto-generator against.

Every mapping carries a derived_properties: block listing the catalog-read properties that cannot come from a single Steampipe column. Example from aws_iam_role.yaml:

derived_properties:
  - path: properties.identity.role.cross_account_trust_without_external_id
    source: "Parse trust_policy — detect external Account in Principal without sts:ExternalId condition"
  - path: properties.identity.permission_categories.has_incompatible_categories
    source: Policy analysis against controldata/taxonomy/permission_categories.yaml
  - path: properties.identity.access_advisor.available
    source: iam:GenerateServiceLastAccessedDetails + iam:GetServiceLastAccessedDetails (separate API call per role)

That block is the agent's TODO list. Silently producing an observation without those derived properties is the failure mode the derived_properties: section prevents — Stave's controls don't see the property, the catalog finds nothing wrong, the breach happens anyway.
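For a sense of what one TODO item involves, here is a simplified Python sketch of the first derived property: detecting a cross-account principal in a trust policy that has no sts:ExternalId condition. It is an illustration under stated assumptions, not Stave's implementation. It only looks at AWS principals in the standard aws partition, treats any condition operator that mentions sts:ExternalId as sufficient, and ignores wildcard and federated principals.

```python
import re

# Matches a bare 12-digit account ID or an IAM ARN in the "aws" partition.
ACCOUNT_RE = re.compile(r"^(?:arn:aws:iam::)?(\d{12})(?::.*)?$")

def _as_list(value):
    """IAM policies allow a single item or a list in most positions."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def _has_external_id_condition(statement: dict) -> bool:
    """True if any condition operator block mentions sts:ExternalId."""
    for operator_block in statement.get("Condition", {}).values():
        # Condition key names are case-insensitive in IAM.
        if any(key.lower() == "sts:externalid" for key in operator_block):
            return True
    return False

def cross_account_trust_without_external_id(trust_policy: dict, own_account_id: str) -> bool:
    """Flag Allow statements that trust another account without an ExternalId."""
    for stmt in _as_list(trust_policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. "*": wildcard trust is out of scope for this sketch
        for value in _as_list(principal.get("AWS")):
            match = ACCOUNT_RE.match(value)
            if match and match.group(1) != own_account_id \
                    and not _has_external_id_condition(stmt):
                return True
    return False

OWN_ACCOUNT = "111111111111"  # made-up account IDs throughout
risky_policy = {"Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
    "Action": "sts:AssumeRole",
}]}
guarded_policy = {"Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::222222222222:root"},
    "Action": "sts:AssumeRole",
    "Condition": {"StringEquals": {"sts:ExternalId": "example-id"}},
}]}
risky = cross_account_trust_without_external_id(risky_policy, OWN_ACCOUNT)
guarded = cross_account_trust_without_external_id(guarded_policy, OWN_ACCOUNT)
```

The result would be written to properties.identity.role.cross_account_trust_without_external_id so the controls that read that path have something to evaluate.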

The Contract Show Command

The three sources — schema, predicate index, mapping file — already existed. Joining them required three separate file reads. The new command joins them once:

stave contract show --asset-type aws_iam_role --format json

{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
  "controls_count": 198,
  "chains_count": 38,
  "property_paths": [
    {
      "path": "properties.identity.kind",
      "controls_count": 196,
      "chains_count": 35,
      "max_severity": "critical",
      "is_intent_property": false
    },
    ...
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}

Or:

stave contract show --list

Asset types with controls: 109 (schema: 109, steampipe mapping: 17)

  TYPE                              SCHEMA  CONTROLS  CHAINS  MAPPING
  ────                              ──────  ────────  ──────  ───────
  aws_iam_role                      yes     198       38      steampipe
  aws_s3_bucket                     yes     102       15      steampipe
  aws_lambda_function               yes     169       12      steampipe
  aws_bedrock_agent                 yes     24        5       -
  ...

The implementation reuses everything already in the codebase: compose.LoadControlsFrom, compose.LoadChainDefinitions, predindex.Build (the same index the stave gaps command uses), and a 50-line helper in internal/contracts/schema/load.go to access the embedded per-asset schemas. The command is ~330 lines; nothing is new data — it's projection over existing data.
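An agent on the consuming side might use the --format json output roughly as follows. This is a sketch: the JSON literal repeats only the single property path shown above (the elided entries stay elided), and plan_ingest is a hypothetical helper, not part of Stave.

```python
import json

# Truncated copy of the output shown above; only the one listed path is included.
CONTRACT_JSON = """
{
  "asset_type": "aws_iam_role",
  "has_schema": true,
  "schema_path": "schemas/observation/v1/asset-types/aws_iam_role.schema.json",
  "controls_count": 198,
  "chains_count": 38,
  "property_paths": [
    {"path": "properties.identity.kind", "controls_count": 196,
     "chains_count": 35, "max_severity": "critical", "is_intent_property": false}
  ],
  "steampipe_mapping": "contracts/steampipe/aws_iam_role.yaml"
}
"""

def plan_ingest(contract: dict, populated: set) -> list:
    """Paths still to populate, ordered by chain unlock, then control unlock."""
    ranked = sorted(
        contract["property_paths"],
        key=lambda p: (-p["chains_count"], -p["controls_count"]),
    )
    return [p["path"] for p in ranked if p["path"] not in populated]

contract = json.loads(CONTRACT_JSON)
todo = plan_ingest(contract, populated=set())
has_mapping = bool(contract.get("steampipe_mapping"))
```

When has_mapping is false, the agent falls back to the schema at schema_path and authors its own mapping.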

Auto-generator

The remaining ~98 asset types could be hand-authored or auto-generated. We tried auto. The generator joins the cached Steampipe column catalog with each per-asset schema's property paths, applies a four-rule matching priority (per-asset overrides, schema-path lookup with multi-token scoring, tags convention, fallback to properties.<ns>.<col>), and emits a YAML in the same operations-list format Iter 1 established.

make gen-steampipe-mappings           # generate, skip existing
make gen-steampipe-mappings-validate  # measure accuracy

Validation runs the generator against the 11 hand-authored YAMLs (Iter 1 + Iter 3) and compares the auto-generated (column, path) tuples against the ground truth:

Overall: 149/177 = 84% accuracy across 17 type(s)

84% — past the 80% target. The remaining 16% are the multi-target JSON-path extracts the brief flagged as inherently manual (one column → two property paths is not something a name-similarity heuristic can synthesise). Auto-generated YAMLs carry _auto_generated: true + _review_required: N + _unmatched_paths: [...] so the reviewer's surface is bounded.
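The accuracy figure can be read as "matched ground-truth tuples over total ground-truth tuples". The sketch below shows that calculation on a made-up three-tuple corpus. The scoring rule is an assumption about how the validate target works, and every column and path except properties.storage.tags is hypothetical.

```python
def score(truth: set, generated: set) -> tuple:
    """Return (matched, total) over the ground-truth (column, path) tuples."""
    return len(truth & generated), len(truth)

# Made-up ground truth and generator output for one asset type.
truth = {
    ("tags", "properties.storage.tags"),
    ("name", "properties.storage.name"),
    ("region", "properties.storage.region"),
}
generated = {
    ("tags", "properties.storage.tags"),
    ("name", "properties.storage.name"),
    ("region", "properties.storage.location"),  # wrong target path
}

matched, total = score(truth, generated)
percent = round(100 * matched / total)

# The reported overall figure follows from the same arithmetic.
reported_percent = round(100 * 149 / 177)
```

Scoring against the ground-truth set means an extra tuple the generator invents does not raise the score, while a wrong target path lowers it.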

The detailed story of the heuristic — and how it went from 8% accuracy on the first pass to 84% on the fourth — is its own post. The point here is what's committed: 17 total mappings (11 hand-authored, 6 auto-generated), every one of them an artifact a customer's agent can read in any language.

Why the contract sits where it does

The architecture choice that makes this work: extractors are client-owned. Stave does not ship a collector. The contracts/steampipe/ directory contains instructions, not code. An agent reads the schema and the mapping; the agent produces the observation; Stave evaluates the observation. The collector boundary is a file, not a process.

This decision has been in our architecture docs since the project started, but until now there was no single command that surfaced the contract to an agent. An agent that wanted to author a Steampipe ingest for a new asset type had to:

Find the per-asset schema (one of several embedded directories)
Decide what property paths to populate (no canonical list — derive from controls)
Map Steampipe columns to those paths (no template — invent it)

Now the agent runs one command and gets all three. It runs make gen-steampipe-mappings and gets a starting-point YAML it can refine. Three manual discovery steps collapse into one command.

What stayed out of Stave

Nothing in the Stave Go binary changed across the five iterations except the new cmd/contract/ directory (one file, ~330 LOC). The agent infrastructure is:

examples/agents/stave_transform.py — reference loader (Python)
contracts/steampipe/*.yaml — 17 mappings (committed)
scripts/gen-steampipe-mappings.py — auto-generator (Python, ~280 LOC)
scripts/steampipe-columns.json — cached column catalog (refreshable from a live Steampipe install)

The deterministic policy engine is unchanged. The contract evolves; the engine doesn't.

The Generic Pipeline Shape

Replace Steampipe with any external data source — AWS Config, Terraform state, your internal inventory, Salesforce, OpenAPI specs — and the pipeline shape is the same:

Define the canonical target contract. For Stave it's obs.v0.1 JSON with per-asset-type sub-schemas. For your tool, it's whatever shape your engine reads.
Author one mapping per source per asset type. YAML is fine. Operations list with field/static/extract/computed semantics covers most transform shapes.
Ship a discovery command. One CLI that joins the schema + the path list + the mapping into a single agent-readable output. The agent stops needing your team's docs.
Auto-generate the boring half. Most column→path mappings are name-similarity. The exceptions are rare enough to hand-author. Use the hand-authored set as a ground-truth corpus to measure your generator's accuracy.
Mark uncertainty explicitly. _review_required, _unmatched_paths, derived_properties:. Silent gaps are worse than loud ones.

Five points, one functioning pipeline. The customer who needed three pages of collector setup now needs make gen-steampipe-mappings and an agent that can read a YAML.

The First Agent-Centric Cloud Security Platform — And Why We Didn't Build It That Way On Purpose

Bala Paranj — Fri, 22 May 2026 11:40:52 +0000

Every boundary in Stave's pipeline has a machine-verifiable contract. We built them for solo developer productivity. They turned out to be exactly what agents need. The CLI tool became a platform. Here's why that changes who can run a cloud security program.

I didn't set out to build a cloud security platform for agents. My goal was to build a CLI tool one person could maintain.

The decisions that followed (standard JSON Schema instead of a proprietary format, exit codes instead of prose output, deterministic evaluation instead of probabilistic scoring, small composable tools instead of a monolith) were made for human productivity. One person can't maintain a proprietary schema, debug non-deterministic output, or maintain a monolith.

Fourteen months later, I ran five independent trials. I gave agents a reasoning specification and Stave's data export. No implementation code. No documentation beyond the spec. No hints. The agents produced correct security verdicts for five different reasoning engines — Z3 (mathematical proof), Soufflé (blast radius enumeration), Clingo (violation detection), Prolog (proof trees), and PRISM (risk probability).

Scope: this proof is valid within the scope of exported SIR facts; the Fact Export reference names which property domains the SIR currently covers.

Two of the trials were fully blind — fresh agents with zero prior context. Both passed.

I didn't target agent support from day 1. The architecture produced it.

What agent-centric means

Every security vendor is adding AI. Copilots that summarize findings. Chatbots that answer questions about your security posture. LLMs that suggest remediation steps. These are useful features. They're also decorations on top of existing architectures.

Agent-centric means an agent can build the pipeline — not just answer questions about it. The distinction is the difference between a tool that has AI and a tool that agents can develop against.

The test is simple: can an agent that has never seen your source code produce correct results from your published contracts alone? If yes, the tool is agent-centric. If the agent needs implementation code, internal documentation, or human guidance, the tool has AI features bolted onto a human-dependent architecture.

Stave passed this test. Five times. With two blind runs.

The contracts that make it work

Every boundary in the pipeline has three properties:

1. Machine-readable specification. Not documentation — a JSON Schema or YAML file that an agent parses, understands, and generates conforming output against.

2. Binary assertion. The step either succeeded or it didn't. stave validate --strict exits 0 or non-zero. stave apply produces findings or doesn't. No subjective quality judgment. No "does this look right?" While the agent drafting the code is probabilistic, the contract it targets is deterministic. The platform provides a rigorous feedback loop: the agent iterates until it hits the binary "success" state defined by the contract. The platform is a deterministic sandbox for a probabilistic agent.

3. Actionable error on failure. When the assertion fails, the error names the specific field that's wrong and what was expected. The agent reads the error, fixes the field, and retries. No human interpretation needed.
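Taken together, the three properties give an agent a simple loop: generate, assert, read the error, fix, retry. The sketch below shows that loop with a toy validator standing in for stave validate --strict. The validator, the error format, and the fixer are all hypothetical; a real agent would shell out to the CLI and branch on its exit code.

```python
from typing import Callable, Optional

def iterate_until_valid(candidate: dict,
                        validate: Callable[[dict], Optional[str]],
                        fix: Callable[[dict, str], dict],
                        max_attempts: int = 5) -> tuple:
    """Retry until the binary assertion passes; return (result, attempts used)."""
    error = None
    for attempt in range(1, max_attempts + 1):
        error = validate(candidate)       # binary assertion: None means success
        if error is None:
            return candidate, attempt
        candidate = fix(candidate, error)  # actionable error drives the fix
    raise RuntimeError(f"still invalid after {max_attempts} attempts: {error}")

def toy_validate(obs: dict) -> Optional[str]:
    """Stand-in for the contract check: names the field that is wrong."""
    if "kind" not in obs.get("storage", {}):
        return "storage.kind: required field is missing"
    return None

def toy_fix(obs: dict, error: str) -> dict:
    """Stand-in for the agent: reads the named field and repairs it."""
    field = error.split(":", 1)[0]
    fixed = {**obs, "storage": dict(obs.get("storage", {}))}
    if field == "storage.kind":
        fixed["storage"]["kind"] = "bucket"
    return fixed

result, attempts = iterate_until_valid({"storage": {}}, toy_validate, toy_fix)
```

The fixer may be probabilistic, but the loop terminates on a deterministic condition, which is the feedback-loop property described in point 2.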

Here's what the pipeline looks like with these contracts:

Steampipe table schema          →  Published mapping YAML
(agent reads column names)         (agent reads field_map)
                                          ↓
                                   stave validate --strict
                                   (assertion: exit 0?)
                                          ↓
                                   stave apply
                                   (assertion: deterministic findings)
                                          ↓
                                   stave export-sir
                                   (SIR: Stave Intermediate Representation
                                    — JSONL triples / SMT-LIB assertions)
                                          ↓
                                   reasoning-spec YAML
                                   (agent maps logic → engine code)
                                          ↓
                                   golden answer comparison
                                   (assertion: matches?)

Every arrow is a contract. Every contract is machine-readable. Every assertion is binary. An agent traverses this pipeline the same way a developer does — except the agent never needs to ask "is this right?" because the contracts answer that question automatically.

What the five trials proved

We wrote reasoning specifications — YAML files describing a security question, the input data, the step-by-step reasoning chain, and the expected output format. The reasoning spec defines logic constraints (e.g., "a bucket is public if the policy allows the AllUsers principal"), not implementation code. The agent's job was to translate those logic constraints into the specific syntax of the target engine — Soufflé Datalog, Z3 SMT-LIB, Clingo ASP atoms. We stripped the expected answer. We gave the spec and the input data to agents with no access to our codebase.

Trial	Engine	Question	Blind?	Result
1	Z3	"Can anonymous users reach this S3 bucket?"	Same-session	PASS — correct verdict + SAT witness (attack path)
2	Soufflé	"How many resources can an anonymous identity reach?"	Same-session	PASS — count: 12 (byte-identical)
3	Clingo	"Which violation rules fire on this configuration?"	Blind	PASS — all 4 violations correct
4	Prolog	"What is the proof tree for this attack path?"	Blind	PASS — 12 proof trees correct
5	PRISM	"What is the probability of successful exploitation?"	Same-session	PASS — 0.412 (within ±0.005)

Two of the five trials caught real defects — one in the spec, one in our test suite. The framework automatically classified them: when the engine and agent agreed but the golden answer differed, we found a human transcription error in our test suite (we'd written 6 when the correct count was 12). When the agent's output failed to match the engine's actual vocabulary, we found a spec ambiguity (the spec said mfa_enforced but the export uses has_mfa_enforced). The contracts allowed the agents to debug our own test methodology.
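The classification described above amounts to a three-way comparison plus a vocabulary check. Here is a minimal sketch of that triage; the function names and labels are hypothetical, and the post does not show the framework's real code.

```python
def classify(agent_answer, engine_answer, golden_answer) -> str:
    """Three-way comparison of agent output, engine output, and golden answer."""
    if agent_answer == engine_answer == golden_answer:
        return "pass"
    if agent_answer == engine_answer:
        # Two independent derivations agree, so the golden answer is suspect.
        return "suspect_golden_answer"
    return "suspect_spec_or_agent"

def unknown_predicates(used_by_agent: set, exported: set) -> set:
    """Predicate names the agent took from the spec that the export never emits."""
    return used_by_agent - exported

# The Soufflé case: agent and engine both said 12, the golden file said 6.
souffle_verdict = classify(12, 12, 6)

# The vocabulary case: the spec said mfa_enforced, the export uses has_mfa_enforced.
mismatch = unknown_predicates({"mfa_enforced"}, {"has_mfa_enforced"})
```

A non-empty mismatch set points at the spec wording, not at the agent's reasoning.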

No other security platform has published evidence that agents can produce correct security reasoning from published contracts alone.

What this changes for enterprises

Before: a team problem

Deploying cloud security posture management traditionally requires:

A security engineer to configure the scanner
A cloud architect to interpret the findings
A compliance analyst to map findings to frameworks
A DevOps engineer to integrate into CI/CD
A manager to prioritize remediation

Five roles. Monthly ongoing cost. The tool is the smallest part of the expense — the team to operate it is the real cost. This is why startups skip CSPM: not because the tool costs $50K, but because the team to run it costs $500K.

After: an agent problem

With agent-centric architecture, the same pipeline runs with one engineer directing agents:

Engineer: "Connect Steampipe to our AWS account and produce
           Stave observations for S3 and IAM."

Agent 1:   Reads contracts/steampipe/aws_s3_bucket.yaml
           Reads contracts/steampipe/aws_iam_role.yaml
           Queries Steampipe, transforms output, validates
           → valid observations

Engineer: "Evaluate and show me compound risks."

Agent 2:   Runs stave apply → findings
           Runs stave gaps → what's missing
           → prioritized findings + gap report

Engineer: "Prove whether anonymous access to PHI is reachable."

Agent 3:   Reads reasoning-specs/z3-public-read-bucket/spec.yaml
           Runs stave export-sir → SMT-LIB facts
           Follows reasoning steps → SAT/UNSAT verdict
           → mathematical proof

Engineer: "Map findings to HIPAA Technical Safeguards."

Agent 4:   Reads compliance profile → requirement mapping
           Aggregates findings per requirement
           → compliance status report

One engineer. Four agents. The agents work because every step has a machine-verifiable contract. The engineer's job shifts from operating the tool to directing agents and reviewing results. The security expertise is still human: which questions to ask, which findings matter most, and what the business context is. The mechanical work (collection, transformation, evaluation, export, reasoning) is agent-executed.

The staffing math changes:

Traditional CSPM	Agent-centric CSPM
5 roles × $150K = $750K/year	1 Security Architect (Agent Orchestrator) × $200K = $200K/year
Tool: $50K-$100K	Tool: $0 (open source)
Time to value: 3-6 months	Time to value: days
Scales by hiring	Scales by adding agents

This isn't theoretical. The contracts exist. The trials passed. The agent templates ship in the repo. An engineer who runs the demo today can direct agents against their own infrastructure tomorrow.

Why monolithic tools can't match this capability

A monolithic security tool where collection, evaluation, and reporting are one binary with one proprietary format can add an AI chatbot. It can add an LLM-generated summary. It can add a copilot sidebar.

What it can't add is agent-developable composition. Because composition requires:

Separate steps with independent contracts (monolith has one step)
Standard formats between steps that any agent framework can read (monolith has proprietary internals)
Machine-verifiable assertions at each boundary (monolith validates internally, opaquely)
Published reasoning specs that agents execute independently (monolith's reasoning is embedded in code)

Retrofitting these properties means decomposing the monolith, which means abandoning the architecture. The Unix philosophy isn't a feature. It's a structural decision that produces emergent properties you can't add later.

Every enterprise customer has their own tools — their own CMDB, their own collector, their own SIEM, their own compliance framework. A monolithic scanner says: use our collector, our evaluator, our dashboard. An agent-centric pipeline says: bring your tools, target our contracts, agents compose the pipeline.

Customer A:  Steampipe       → Stave → Z3       → Splunk
Customer B:  AWS Config      → Stave → Soufflé  → Jira
Customer C:  Terraform state → Stave → Clingo   → PagerDuty
Customer D:  Custom CMDB     → Stave → Prolog   → Neo4j

Four customers. Four different collectors. Four different reasoning engines. Four different downstream consumers. The same evaluation contracts in the middle. Zero custom integration code. The variation is absorbed by the contracts at the boundaries, not by adapters inside the tool.

Cloud security for the agentic era

The shift is already happening. Google's defensive roadmap calls for agentic SOC. AWS is adding agent capabilities to Security Hub. Every vendor is racing to add AI to their existing tools.

The question isn't whether agents will operate security platforms. It's whether the tools are built for agents to operate — or whether agents are added as a layer on top of tools built for humans.

Stave is built for agents to operate. Not because we planned it. Because the architectural decisions that make a tool maintainable by one person are the same decisions that make it operable by agents: standard contracts, binary assertions, deterministic evaluation, composable steps, published reasoning specs.

The landing page says: Cloud Security for the Agentic Era.

It means: one engineer with agents runs a cloud security program that used to require a team. The contracts are published. The trials are passed. The architecture is proven.

The era where cloud security required a team to operate a tool is ending. The era where one engineer directs agents against a published contract platform is beginning. What started as a CLI tool built for one developer's constraints — small tools, standard formats, deterministic evaluation — became the platform the agentic era needs.

That wasn't the plan. It's better than any plan could have been.

Stave is an open-source cloud security platform. 2,650+ controls, 585 compound chains, 109 per-asset-type JSON Schemas, 17 Steampipe mappings, 5 validated reasoning specs, 9 independent reasoning engines. Every pipeline boundary has a machine-verifiable contract that agents develop against. Try it: bash examples/demo-ai-security/run.sh

Seven Contradictions Shaped an Architecture.

Bala Paranj — Thu, 21 May 2026 11:34:52 +0000

Cloud security has seven structural contradictions. Every vendor treats them as trade-offs — improve one side, accept the other gets worse. TRIZ says trade-offs are engineering failures. Contradictions can be resolved. All seven resolved to the same architecture.

Every cloud security tool makes trade-offs:

"We give you flexibility, but you'll get more misconfigurations."
"We check before deployment, but it slows your engineers down."
"We scan continuously, but the results aren't reproducible."

Trade-offs feel reasonable. They're not. They're engineering failures — situations where improving one property worsens another because the framing of the problem is wrong.

There's a methodology — 80 years old, derived from analyzing 200,000 patents — that says: don't trade off. Resolve the contradiction. Keep both properties. The trick is splitting what was previously one thing into two — one that keeps the desirable property, one that absorbs the opposite.

I applied that methodology to cloud security. Resolving seven contradictions produced the architecture, and every one of them resolved to the same three components. That convergence is the validation, and it is the reason the resulting product is 1,030 lines of code that didn't need to be rewritten.

Contradictions vs trade-offs

The distinction matters:

Trade-off	Contradiction resolved
Accept that improving X worsens Y	Keep both X and Y — separate them in time, space, or scope
"More flexibility means more misconfigurations"	Flexibility and safety coexist when the flexible thing (configuration) is separated from the safe thing (invariant evaluation of the result)

A trade-off accepts the framing. A resolution changes the framing. The methodology — TRIZ, by Genrich Altshuller — provides the discipline for finding which framing change resolves which contradiction.

The seven contradictions

Four months of analysis before writing code produced seven structural contradictions in cloud security. Each is a pair of desirable properties that the industry treats as mutually exclusive:

1. Flexibility vs Safety

We want engineers to build anything. But more flexibility produces more misconfigurations.

What every tool does: Add a scanner with a checklist. The checklist grows. Flexibility doesn't shrink. The contradiction persists.

Resolution: Configuration stays flexible — Terraform, CDK, Pulumi can produce anything. The result is evaluated against invariants. Flexibility lives in the authoring. Safety lives in the evaluation. They're separated.

2. Speed vs Correctness

We want engineers to ship fast. But fast shipping produces insecure configurations.

What every tool does: Add another CI check. Engineers learn to bypass with // skipcq and nosec. The check becomes a speed bump, not a safety net.

Resolution: The invariant evaluation runs in milliseconds. The engineer's velocity is unaffected. The security architect writes the control once. Every subsequent evaluation is automatic. Speed and correctness operate on different clocks — neither blocks the other.

3. Decentralization vs Control

We want decentralized ownership (each team manages their own infrastructure). But decentralization produces inconsistency.

What every tool does: Add a central policy registry. The registry becomes the bottleneck teams route around.

Resolution: The invariant catalog is centrally declared (one canonical set of controls). The snapshots are decentrally produced (each team collects their own). The evaluation runs at the edge — wherever the snapshot is. Central authority over WHAT must hold. Decentralized responsibility for collecting evidence.

4. Human Judgment vs Reliability

We want human contextual judgment (this role needs broad permissions for the migration). But human judgment is inconsistent and irreproducible.

What every tool does: Add severity scores and ML-based triage. The ranking is opaque. The queue is unchanged. The human still decides.

Resolution: The judgment goes into the invariant when it's authored — the security engineer encodes "PHI buckets must never be reachable by anonymous principals" once. Subsequent evaluations are mechanical. Zero judgment at evaluation time. The judgment was upstream, in the control definition. The evaluation is downstream, deterministic.

5. Expressiveness vs Complexity

We want expressive IaC (conditional resources, dynamic blocks, modules calling modules). But expressiveness produces complexity where errors hide.

What every tool does: Add a static analyzer for the IaC source. The analyzer can't see what the IaC produces at apply-time — it analyzes the source, not the result.

Resolution: Evaluate the PRODUCED STATE, not the source code. The snapshot is the JSON representation of what Terraform/CDK/Pulumi produced after plan or apply. The IaC source can be as expressive as needed. The snapshot reduces all that expressiveness to typed facts the evaluator can read. Expressiveness lives in the source. Verifiability lives in the snapshot.

6. Dynamic Systems vs Stable Security

We want dynamic infrastructure (auto-scaling, ephemeral, managed services). But dynamic infrastructure breaks static security assumptions.

What every tool does: Make the security controls dynamic too: continuous scanning, runtime agents, real-time detection. This buys adaptability and responsiveness at the cost of predictability and reproducibility. The new contradiction is worse: same infrastructure, different results on different days.

Resolution: Keep the controls STATIC. Make the OBSERVATIONS dynamic. The invariant ("anonymous users must not reach PHI data") doesn't change when the infrastructure changes. The snapshot captures the infrastructure at a point in time. The evaluation reruns the static invariant against the dynamic snapshot. Deterministic, reproducible, composable.

This is the most consequential resolution in the set — and the only one where the textbook TRIZ principle (Dynamization — make the static thing dynamic) was INVERTED. Every vendor followed the textbook. The inversion produced the opposite architecture — and it's the reason the product is deterministic where competitors aren't.
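The static-control, dynamic-observation split can be shown in a few lines. In this sketch the invariant is a plain Python function standing in for a CEL predicate, the snapshot layout (a list of assets with id and properties) is hypothetical, and the tag value "phi" is an assumption. The property paths storage.tags.data-classification and storage.access.public_read are the ones the catalog output lists for S3 buckets.

```python
def phi_not_public(asset: dict) -> bool:
    """Static invariant: a bucket classified as PHI must not allow public read."""
    storage = asset.get("properties", {}).get("storage", {})
    is_phi = storage.get("tags", {}).get("data-classification") == "phi"
    public = storage.get("access", {}).get("public_read", False)
    return not (is_phi and public)

def evaluate(snapshot: list) -> list:
    """Return the sorted IDs of assets that violate the invariant."""
    return sorted(a["id"] for a in snapshot if not phi_not_public(a))

def bucket(asset_id: str, public_read: bool) -> dict:
    """Build one made-up observation for the example."""
    return {"id": asset_id, "properties": {"storage": {
        "kind": "bucket",
        "tags": {"data-classification": "phi"},
        "access": {"public_read": public_read},
    }}}

# Same bucket observed on two different days: the infrastructure changed,
# the invariant did not.
monday = [bucket("bucket-a", public_read=False)]
tuesday = [bucket("bucket-a", public_read=True)]

monday_findings = evaluate(monday)
tuesday_findings = evaluate(tuesday)
rerun_findings = evaluate(tuesday)  # re-evaluating a stored snapshot
```

Re-running the evaluation on Tuesday's stored snapshot gives the same findings every time, which is the reproducibility that continuous scanning gives up.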

7. Granular Permissions vs Manageability

We want fine-grained IAM permissions (least privilege). But 17,000 IAM actions produce an unmanageable rule set.

What every tool does: Add a policy generator. Generated policies are accurate today, stale tomorrow. The complexity moves from "write the policy" to "maintain the generator."

Resolution: Humans declare role intent at the category level ("this is a read-only data role"). The control catalog encodes which permission patterns match which intent. The 17,000 IAM actions stay fine-grained — the manageability lives at the intent layer above them. An intermediary between human-readable intent and machine-readable permissions.
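One way to picture the intent layer is a small table from intent category to permitted action patterns, checked with glob matching. The category name and patterns below are made up for illustration; the real taxonomy lives in the catalog and is not reproduced here.

```python
from fnmatch import fnmatchcase

# Hypothetical intent category mapped to the action patterns it permits.
INTENT_PATTERNS = {
    "read-only-data": [
        "s3:Get*", "s3:List*",
        "dynamodb:GetItem", "dynamodb:Query", "dynamodb:Scan",
    ],
}

def actions_outside_intent(intent: str, actions: list) -> list:
    """Return the granted actions that no pattern for this intent covers."""
    allowed = [p.lower() for p in INTENT_PATTERNS[intent]]
    # IAM action names are case-insensitive, so compare in lower case.
    return sorted(
        a for a in actions
        if not any(fnmatchcase(a.lower(), p) for p in allowed)
    )

granted = ["s3:GetObject", "s3:ListBucket", "s3:DeleteObject", "dynamodb:Query"]
violations = actions_outside_intent("read-only-data", granted)
```

The human maintains one short intent declaration per role. The thousands of individual actions are only ever compared against patterns, never listed by hand.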

The four physical contradictions underneath

Underneath the seven technical contradictions sit four deeper ones — statements where a single thing must have opposite properties:

Physical contradiction	Resolution
Configurations must be flexible AND restricted	Configuration stays flexible. Invariants restrict the result.
Humans must configure AND not configure	Humans declare intent. Automation produces the configuration.
IaC must allow freedom AND restrict freedom	IaC is unrestricted. The snapshot of its output is evaluated.
Cloud must be dynamic AND static	Cloud stays dynamic. Invariants stay static. Snapshots bridge them.

Every resolution has the same shape: split the thing that was previously one into two. One piece keeps the desirable property. The other absorbs the opposite. The split happens at a different point each time (source vs result, intent vs enforcement, control vs observation) but the move is the same: segmentation.

The convergence

Resolving the seven contradictions kept landing on the same three components:

A fixed declarative artifact that states a property that must hold → the invariant (YAML control).
A mutable observational artifact that captures the world at a point in time → the snapshot (JSON observation).
A deterministic evaluator that runs the first against the second → the engine (CEL predicate evaluation).

Resolving the seven contradictions shaped the architecture.

The convergence is the validation. If the seven contradictions had resolved to seven different architectures, the analysis would have decomposed the problem into seven sub-problems but missed the structural one. The convergence is strong evidence that the seven are symptoms of one shape underneath, and that the three-component architecture is that shape.

The nine inventive principles

The TRIZ contradiction matrix suggested specific principles for each contradiction pair. Nine principles did work across all seven:

Principle	What it does	Where it appears
Segmentation	Split one thing into two	Invariant/snapshot split. Intent/enforcement split. Central catalog/edge collection split.
Taking Out	Remove the unsafe element	Safe defaults. Remediation guidance.
Prior Counteraction	Neutralize danger before it emerges	Judgment encoded in the control predicate at authoring time — not at triage time.
Preliminary Action	Act before the dangerous state appears	CI gate blocks pre-merge. Preflight evaluation before deployment.
Mitigation in Advance	Guardrails that prevent failure even when mistakes happen	Compound chains. Ghost-reference detection.
Dynamization (inverted)	Dynamize the snapshot, not the control	Static controls, dynamic observations. Every vendor did the opposite.
Intermediary	Glue layer between two incompatible levels	Intent tags between human-readable role descriptions and machine-readable IAM actions.
Self-service	Automated execution without human queue	Deterministic evaluation with exit codes. No triage queue. No human interpretation.
Parameter Changes	Operate at a different level	Evaluate the produced STATE, not the source CODE.

The same principles appearing across multiple contradictions is the signature of an architecture with high internal cohesion. The same moves do work across different problem dimensions — because the problem dimensions are symptoms of one underlying structure.

What the industry accepts as trade-offs

Each row below is a trade-off the industry accepts, and each is a contradiction that can be resolved:

Industry trade-off	Contradiction	Resolution
"More flexibility means more misconfigurations"	Flexibility vs Safety	Evaluate the result, not the source
"Faster shipping means less security"	Speed vs Correctness	Evaluate in milliseconds, not in sprints
"Decentralization means inconsistency"	Decentralization vs Control	Central invariants, decentralized evidence
"Human judgment means irreproducibility"	Judgment vs Reliability	Judgment upstream in the control, evaluation downstream and mechanical
"Expressive IaC means hidden errors"	Expressiveness vs Complexity	Evaluate the snapshot, not the source
"Dynamic infrastructure means brittle security"	Dynamic vs Static	Static invariants, dynamic observations
"Fine-grained permissions means unmanageable rules"	Precision vs Manageability	Intent layer above the permission layer

Seven trade-offs the industry treats as laws of nature. Seven contradictions that are resolvable with known techniques. The techniques are 80 years old. The application to cloud security is new.

Why convergence matters for builders

If you're facing multiple contradictions in your own domain and each one seems to need a different solution, you haven't found the structural contradiction yet. The surface contradictions are symptoms. The structural contradiction is underneath them.

The test: do your resolutions converge?

If seven contradictions resolve to seven different architectures, you've decomposed the problem into seven sub-problems. Each sub-solution adds a component. The result is a complex system with seven moving parts.

If seven contradictions resolve to one architecture, you've found the shape underneath. The single architecture is simpler than seven sub-solutions. It's more coherent. And it's more likely to be correct — because the convergence itself is evidence that you've identified the structural problem, not just the symptoms.

Four months of analysis. Seven contradictions named. Nine inventive principles applied. One architecture produced. 1,030 lines of kernel code. No rewrites.

The thinking was the foundation. The code was the proof.

The three-component architecture — invariant (2,650 YAML controls), snapshot (obs.v0.1 JSON), evaluator (1,030-line CEL predicate engine) — is implemented in Stave, an open-source Risk Reasoner. Seven contradictions resolved. One architecture. Every conclusion deterministic, traceable, and provable. Try it: bash examples/demo-ai-security/run.sh

Google Engineers Can't Create Public Cloud Storage Buckets. Not Because They're Smarter. Because the Option Doesn't Exist.

Bala Paranj — Wed, 20 May 2026 12:20:45 +0000

Misconfiguration isn't a personnel failing. It's a structural property of platforms that PERMIT unsafe constructs. Google, Spotify, Netflix, and Shopify solved this by removing the unsafe constructs from the developer surface entirely. 95-99% reduction in misconfiguration incidents. Most organizations can't build that platform. Here's the alternative — and why the two approaches are complementary, not competing.

Google's internal infrastructure doesn't have publicly exposed storage buckets. Not because Google engineers are more careful. Because the construct "public bucket" doesn't exist in their developer surface. A Google engineer deploying an internal service writes a one-line service declaration. The platform synthesizes everything — network policy, RBAC, TLS certificates, monitoring, secrets management. The developer never sees the configuration knobs that would produce a misconfiguration.

The misconfiguration doesn't happen because it CAN'T happen. The unsafe construct isn't guarded against. It's ABSENT.

This is the upstream approach to misconfiguration — and it's been independently adopted by Google, Spotify (Backstage + Golden Paths), Shopify (Polaris), and Netflix (Paved Road + Spinnaker). Each reports 95-99% reductions in misconfiguration incidents.

Most organizations can't build this. Here's why — and what to do instead.

The reframe: misconfiguration is structural, not personal

The industry frames misconfiguration as a KNOWLEDGE problem:

"The engineer didn't know the right configuration"
    → Fix: more training
    → Fix: better documentation  
    → Fix: security champions
    → Fix: mandatory reviews

Each fix addresses the engineer. Each assumes the PERSON is the variable. Train them better. Document more clearly. Review more thoroughly.

The structural reframe:

"The platform PERMITS the unsafe configuration"
    → Fix: remove the unsafe configuration from the platform's vocabulary

This fix addresses the PLATFORM, not the engineer. The engineer's knowledge doesn't matter because the unsafe construct doesn't exist in the surface they interact with. A developer who doesn't know that publicly exposed storage is dangerous can't create it — not because they learned it's dangerous, but because the option literally isn't available.

Three implications:

Personnel interventions don't work at scale. Training addresses one engineer at a time. The next hire resets to baseline. Turnover regenerates the problem. The structural property persists regardless of who's on the team.

Process interventions are insufficient. Code reviews catch SOME misconfigurations. The reviewer must notice the unsafe construct among hundreds of lines of IaC. The review is human-speed; deployments are machine-speed. The process can't keep pace.

Structural interventions work. Redesign the developer surface so unsafe constructs are unexpressible. The misconfigurations disappear because they have no expressive form. Not hard to create. IMPOSSIBLE to express.

What the upstream platform looks like

A developer using the upstream platform:

Developer writes:     "Deploy service: order-processor, tier: production"

Platform synthesizes:
    ✓ Namespace with correct labels + Pod Security Admission
    ✓ NetworkPolicy: default-deny + exact required egress
    ✓ RBAC: least-privilege ServiceAccount derived from tier
    ✓ SPIFFE identity automatically issued and mounted
    ✓ Secrets from Vault with automatic rotation
    ✓ OpenTelemetry auto-injected
    ✓ Immutable root filesystem, read-only containers, drop-all capabilities

The developer's input: one line. The platform's output: a complete, production-ready, secure-by-construction service. The developer never sees NetworkPolicy YAML. Never writes RBAC rules. Never configures TLS. Never manages secrets.

The platform's vocabulary is BOUNDED to pre-approved safe forms (golden templates). The developer can't request a public endpoint without going through an explicit, reviewed approval path. The unsafe construct isn't in the default vocabulary.
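In a typed language, a bounded vocabulary can be approximated by the shape of the input type itself. The sketch below is illustrative only: `ServiceIntent`, `Manifest`, and `Synthesize` are hypothetical names, and the synthesized fields are a small subset of what a real platform would produce.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier is the only knob the developer gets in this sketch.
type Tier string

const (
	Production Tier = "production"
	Staging    Tier = "staging"
)

// ServiceIntent is the developer-facing surface. Note what is absent:
// there is no field for public exposure, wildcard RBAC, or writable roots.
type ServiceIntent struct {
	Name string
	Tier Tier
}

// Manifest is a hypothetical, heavily simplified synthesized config.
type Manifest struct {
	Namespace        string
	DefaultDenyNet   bool
	ReadOnlyRootFS   bool
	DropCapabilities []string
}

// Synthesize expands intent into configuration using fixed safe defaults.
func Synthesize(in ServiceIntent) (Manifest, error) {
	if in.Name == "" {
		return Manifest{}, errors.New("service name is required")
	}
	if in.Tier != Production && in.Tier != Staging {
		return Manifest{}, fmt.Errorf("unknown tier %q", in.Tier)
	}
	return Manifest{
		Namespace:        in.Name + "-" + string(in.Tier),
		DefaultDenyNet:   true,
		ReadOnlyRootFS:   true,
		DropCapabilities: []string{"ALL"},
	}, nil
}

func main() {
	m, err := Synthesize(ServiceIntent{Name: "order-processor", Tier: Production})
	fmt.Printf("%+v %v\n", m, err)
}
```

Because `ServiceIntent` has no field that could carry an unsafe setting, there is nothing for a reviewer to miss. The unsafe request cannot be written down in the first place.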

Four architectural properties:

Property	What it means
Synthesis from intent	Developer declares WHAT; platform produces HOW
Golden templates only	Platform vocabulary is bounded to pre-approved forms
Continuous audit	Templates are continuously updated as threats evolve
No bypass mechanism	The platform has no ability to create unsafe configurations

The fourth property is the most distinctive: it is structurally impossible to deploy publicly exposed storage or an overpermissive role, because those constructs DON'T EXIST in the allowed schema.

Who has built this — and what they achieved

Organization	Platform	Result
Google (2015+)	Borg + internal IDP	Near-zero misconfiguration incidents in internal services
Spotify	Backstage + Golden Paths	High developer satisfaction + uniform safety properties
Shopify	Polaris	Standardized safe defaults across all services
Netflix	Paved Road + Spinnaker	Reduced incident rate; safety via template compliance

Four independent organizations. Same pattern. Same results. The convergence is empirical evidence the approach works.

Why most organizations can't do this

Despite the evidence, most organizations do NOT build upstream platforms:

Building an Internal Developer Platform requires:
    ✗ A dedicated platform team (5-20+ engineers)
    ✗ Multi-year staffing commitment
    ✗ Deep integration with cloud vendor APIs
    ✗ Continuous maintenance as cloud features evolve
    ✗ Organizational authority to mandate platform adoption
    ✗ Budget for a system that doesn't ship customer features

The investment pays off AFTER years of operation. Most organizations need safety NOW with the team they HAVE. The upstream approach is available but not accessible.

This creates a structural gap: the approach that WORKS (upstream platform) is inaccessible to the organizations that NEED it most (teams without platform-engineering capacity).

The downstream alternative

The downstream approach addresses the gap. Instead of preventing unsafe constructs from being EXPRESSED, it catches them before they reach PRODUCTION:

Upstream (Internal Developer Platform):
    Developer intent → Platform synthesizes safe config → Production
    Unsafe constructs never expressible

Downstream (Invariant evaluation):
    Developer writes IaC → Evaluation catches unsafe state → Block before production
    Unsafe constructs expressible but caught before deploy

Property	Upstream platform	Downstream evaluation
Where it operates	Authoring (before IaC exists)	Evaluation (after IaC, before deploy)
What it changes	The expression vocabulary	The state evaluation
Who absorbs complexity	Platform team (5-20+ engineers)	Catalog authors (1-3 engineers)
Adoption cost	Multi-year IDP build	Single binary integration in CI
Coverage	All configs through platform	All configs through CI/CD
Bypass risk	Very low (must bypass platform)	Moderate (can bypass CI/CD)
Safety level	95-99% reduction	Substantial reduction within catalog coverage
Time to adopt	Months to years	Hours to days

The upstream approach has HIGHER safety but HIGHER cost. The downstream approach has LOWER safety (bypassable, coverage-bounded) but DRAMATICALLY lower cost (adoptable by any team with CI/CD).
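As a rough illustration of the downstream gate, the sketch below parses a snapshot, applies one check, and maps the outcome to an exit code a CI job can act on. The JSON field names and the 0/1/2 exit-code scheme are assumptions made for this example, not Stave's actual schema or conventions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// snapshot is a hypothetical, trimmed-down observation document.
type snapshot struct {
	Assets []struct {
		ID         string `json:"id"`
		PublicRead bool   `json:"public_read"`
	} `json:"assets"`
}

// gate returns an exit code and the offending asset IDs:
// 0 when no asset is unsafe, 1 when at least one is,
// 2 when the input cannot be parsed.
func gate(raw []byte) (int, []string) {
	var s snapshot
	if err := json.Unmarshal(raw, &s); err != nil {
		return 2, nil
	}
	var unsafe []string
	for _, a := range s.Assets {
		if a.PublicRead {
			unsafe = append(unsafe, a.ID)
		}
	}
	if len(unsafe) > 0 {
		return 1, unsafe
	}
	return 0, nil
}

func main() {
	raw := []byte(`{"assets":[{"id":"bucket-a","public_read":true}]}`)
	code, unsafe := gate(raw)
	// A real gate would call os.Exit(code) so a non-zero code fails the job.
	fmt.Println("exit code:", code, "unsafe assets:", unsafe)
}
```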

The choice matrix

Organization profile	Recommended approach
Has platform team, multi-year budget	Upstream IDP (Google-style)
No platform team, but mature CI/CD	Downstream (invariant evaluation)
Both available	Hybrid: upstream for new services, downstream for legacy
Highly regulated, zero bypass tolerance	Upstream (with explicit override paths)
High velocity, needs safety NOW	Downstream (adoptable in hours)

Most organizations are in the SECOND row: no platform team, but mature CI/CD. The downstream approach is their accessible path to safety properties the upstream approach would provide if they could build it.

The THIRD row (hybrid) is increasingly common. Large organizations build upstream platforms for new services AND use downstream evaluation for legacy services that don't yet flow through the platform. The two approaches are complementary — each covers what the other misses.

What the downstream approach catches that upstream can't

Upstream platforms are powerful but bounded:

What upstream platforms miss:
    ✗ Legacy services not on the platform (migration takes years)
    ✗ Emergency bypass / break-glass operations
    ✗ Cloud provider API changes that introduce new unsafe defaults
    ✗ Platform template bugs (the template itself is misconfigured)
    ✗ Cross-service compound risks (the template is safe per-service; the combination isn't)

The downstream approach catches ALL of these — because it evaluates ACTUAL STATE (snapshots) against invariants (catalog), regardless of how the state was produced. A legacy service that never touched the platform? Evaluated. A break-glass console change? Caught in the next post-deploy snapshot. A template bug? The invariant catches what the template missed. A compound risk across services? Chain controls evaluate cross-asset conditions.
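A cross-asset check of the kind described above can be sketched as a join over facts from several assets. The types and the `AnonymousPHIPaths` function are hypothetical simplifications, not how chain controls are actually expressed in the catalog.

```go
package main

import "fmt"

// Hypothetical, simplified asset facts for a three-hop chain.
type Pool struct {
	ID                    string
	AllowsUnauthenticated bool
	UnauthRole            string
}

type Role struct {
	Name    string
	Actions map[string][]string // action -> bucket names it applies to
}

type Bucket struct {
	Name           string
	Classification string
}

// AnonymousPHIPaths joins facts across assets and returns one
// description per complete path from an anonymous caller to PHI data.
func AnonymousPHIPaths(pools []Pool, roles map[string]Role, buckets map[string]Bucket) []string {
	var paths []string
	for _, p := range pools {
		if !p.AllowsUnauthenticated {
			continue
		}
		role, ok := roles[p.UnauthRole]
		if !ok {
			continue
		}
		for _, name := range role.Actions["s3:GetObject"] {
			if buckets[name].Classification == "phi" {
				paths = append(paths, fmt.Sprintf("%s -> %s -> %s", p.ID, role.Name, name))
			}
		}
	}
	return paths
}

func main() {
	pools := []Pool{{ID: "pool-abc", AllowsUnauthenticated: true, UnauthRole: "AppUnauthRole"}}
	roles := map[string]Role{"AppUnauthRole": {
		Name:    "AppUnauthRole",
		Actions: map[string][]string{"s3:GetObject": {"bucket-prod-phi"}},
	}}
	buckets := map[string]Bucket{"bucket-prod-phi": {Name: "bucket-prod-phi", Classification: "phi"}}
	fmt.Println(AnonymousPHIPaths(pools, roles, buckets))
}
```

No single asset in this example is unsafe in isolation. The finding only exists once the three facts are joined.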

The downstream approach is the SAFETY NET under the upstream platform. Even organizations with full IDPs benefit from a downstream evaluation layer that catches what slips past the platform.

The convergence trajectory

The industry is moving toward upstream platforms:

Period	Dominant approach
2010-2015	Pure permissive: developers write everything
2015-2020	Scanners + reviews: catch misconfigurations after deployment
2020-2025	Policy-as-code: declared rules at PR time
2025-2030	Internal platforms: synthesize safe configs (selected organizations)
2030+	Ubiquitous platforms: most organizations adopt some form of IDP

As the trajectory progresses, the downstream approach's role evolves:

Today: Primary safety mechanism for organizations without platforms (most organizations).

2030+: Complementary safety mechanism for organizations WITH platforms — catching legacy, bypass, template bugs, and compound risks the platform doesn't cover.

The future role isn't diminished. It's SPECIALIZED. Even in a fully-platformed world, the downstream evaluation layer provides defense-in-depth that the platform alone can't.

The honest comparison

The downstream approach does NOT claim upstream-level safety:

Metric	Upstream (IDP)	Downstream (invariant evaluation)
Misconfiguration reduction	95-99% (documented by Google, Spotify, etc.)	Substantial — bounded by catalog coverage and CI/CD integration
Bypass resistance	Very high (must bypass the platform)	Moderate (can bypass CI/CD; caught by post-deploy snapshots)
Time to value	Months to years	Hours to days
Team required	Platform team (5-20+)	One person can start
Coverage of legacy	Low (legacy not on platform)	High (evaluates any state snapshot)
Cost	$1M-10M+/year in platform team	Open source + operator's existing CI

The downstream approach trades SAFETY CEILING for ACCESSIBILITY. The safety ceiling is lower (bypassable, coverage-bounded). The accessibility is incomparably higher (any team, any CI pipeline, any cloud provider, today).

For 90% of organizations — the ones that will never build a Google-scale IDP — the accessible option is the only option. And the accessible option with 2,650 invariants evaluated before every deploy is DRAMATICALLY safer than the current state of no evaluation at all.

For your organization

If you have a platform team: Build the upstream IDP. The evidence supports 95-99% reduction. Add downstream evaluation as defense-in-depth for legacy, bypass, and compound risks.

If you don't have a platform team: Adopt downstream evaluation. Single binary in CI. 2,650 controls evaluating every deploy. Achievable this week, not next year.

If you're building toward a platform: Start with downstream evaluation NOW while building the platform. The catalog you develop during downstream evaluation INFORMS the golden templates you'll build for the platform. The two investments compound.

Google engineers can't create publicly exposed storage buckets because the option doesn't exist in their surface. Your engineers can — because your surface permits it. The upstream fix removes the option. The downstream fix catches it before production. Both work. One takes years and a platform team. The other takes a binary and an afternoon. Start with what you can do today.

The downstream alternative — 2,650 invariants evaluated against actual cloud state, catching what upstream platforms miss, accessible to any team with CI/CD — is Stave, an open-source Risk Reasoner. Single binary. No platform team required. Defense-in-depth for teams building toward upstream, primary safety for teams that aren't. Try it: bash examples/demo-ai-security/run.sh

Versioned Schema Contracts in a Go CLI: How obs.v0.1 Prevents Silent Breaks

Bala Paranj — Tue, 19 May 2026 11:38:10 +0000

How embedding schema versions in every data file — observations, controls, output, baselines — enables forward compatibility, fail-fast loading, and contract testing without external schema registries.

A user upgrades the CLI from v0.8 to v0.9. Their observation files still say "schema_version": "obs.v0.1". The new CLI adds a field to the output schema. Is the old input still valid? Can the new output be read by downstream tools?

Without versioned schemas, you're guessing. With them, the answer is in the data: the file says what version it speaks, the tool says what versions it accepts, and the mismatch is a clear error — not a silent corruption.

Every Data File Carries Its Version

{
  "schema_version": "obs.v0.1",
  "captured_at": "2026-01-01T00:00:00Z",
  "assets": [...]
}

dsl_version: ctrl.v1
id: CTL.S3.PUBLIC.001
name: Block Public Access
unsafe_predicate:
  any:
    - field: properties.public
      op: eq
      value: true

{
  "schema_version": "out.v0.1",
  "kind": "ASSESSMENT",
  "run": {...},
  "findings": [...]
}

Four schema versions in the system:

Schema	Format	Purpose
`obs.v0.1`	JSON	Observation snapshots (cloud resource state)
`ctrl.v1`	YAML	Control definitions (security rules)
`out.v0.1`	JSON	Evaluation output (findings, verdicts)
`baseline.v0.1`	JSON	Baseline artifacts (accepted posture)

Fail-Fast at Load Time

The loader checks the version before parsing the body:

func (v *Validator) validateDocument(raw []byte, cfg docConfig, opts ...Option) (*diag.Assessment, error) {
    // 1. Parse just the version field
    var partial struct {
        Version string `json:"schema_version" yaml:"dsl_version"`
    }
    if err := cfg.Unmarshal(raw, &partial); err != nil {
        return nil, err
    }
    actual := partial.Version

    // 2. Check if this version is accepted
    if !slices.Contains(cfg.Accepted, actual) {
        return unsupportedVersionResult(actual, cfg.Accepted,
            "Use a supported schema version"), nil
    }

    // 3. Validate the full document against the versioned schema
    diags, err := v.Validate(Request{
        Kind:          schemas.Kind(cfg.Kind),
        ActualVersion: actual,
        Data:          raw,
    })
    // ...
}

If an observation file says "schema_version": "obs.v0.2" and the tool only knows obs.v0.1, the error is immediate and clear:

Schema version "obs.v0.2" is not supported. Accepted versions: obs.v0.1

No partial parsing. No silent field dropping. No "it loaded but something is wrong."
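For readers who want to try the pattern outside Stave, here is a self-contained sketch using only the standard library. `CheckVersion` is a hypothetical helper, and the error wording is modeled on the message above rather than copied from the real implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"slices"
	"strings"
)

// CheckVersion decodes only the schema_version field and rejects the
// document before any body handling if the version is not accepted.
func CheckVersion(raw []byte, accepted []string) (string, error) {
	var partial struct {
		Version string `json:"schema_version"`
	}
	if err := json.Unmarshal(raw, &partial); err != nil {
		return "", fmt.Errorf("read schema_version: %w", err)
	}
	if !slices.Contains(accepted, partial.Version) {
		return "", fmt.Errorf("schema version %q is not supported; accepted versions: %s",
			partial.Version, strings.Join(accepted, ", "))
	}
	return partial.Version, nil
}

func main() {
	accepted := []string{"obs.v0.1"}
	_, err := CheckVersion([]byte(`{"schema_version":"obs.v0.2","assets":[]}`), accepted)
	fmt.Println(err)
}
```

A document with no `schema_version` at all is rejected too, because the empty string is not in the accepted list.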

Schemas are Embedded in the Binary

//go:embed embedded/*/*/*.json
var embeddedFS embed.FS

The JSON Schema files are compiled into the binary via go:embed. The tool doesn't need network access, a schema registry, or a config directory to validate input. This is critical for air-gapped environments.

The schema for controls includes additionalProperties: false at every level:

{
  "type": "object",
  "additionalProperties": false,
  "required": ["dsl_version", "id", "name", "description", "type"],
  "properties": {
    "op": {
      "type": "string",
      "enum": ["eq", "ne", "lt", "gt", "missing", "present", "contains", "in"]
    }
  }
}

A typo like operater: eq (instead of op: eq) is caught immediately — unknown fields are rejected, not silently ignored.
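The same reject-unknown-fields behavior can be reproduced in plain Go with `encoding/json`'s `Decoder.DisallowUnknownFields`. This is a different mechanism from JSON Schema validation and is shown only as an analogy; the `Predicate` struct is a hypothetical mirror of one predicate clause.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Predicate is a hypothetical Go mirror of one predicate clause.
type Predicate struct {
	Field string `json:"field"`
	Op    string `json:"op"`
	Value any    `json:"value"`
}

// DecodeStrict rejects any key that Predicate does not declare.
func DecodeStrict(raw []byte) (Predicate, error) {
	var p Predicate
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return Predicate{}, err
	}
	return p, nil
}

func main() {
	// "operater" is the typo; a lenient decoder would silently drop it.
	_, err := DecodeStrict([]byte(`{"field":"properties.public","operater":"eq","value":true}`))
	fmt.Println(err)
}
```

Without `DisallowUnknownFields`, the typo would decode into a `Predicate` with an empty `Op`, which is exactly the silent failure the schema setting prevents.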

Why Versions Live in the Data, Not the Tool

The version lives in the JSON/YAML file itself, not in a CLI flag or config file. This means:

Files are self-describing. You can identify what a file is by reading it — no filename convention needed.
Multiple versions can coexist. A directory can contain obs.v0.1 and (future) obs.v0.2 files. The loader handles each according to its declared version.
Downstream tools know what they're reading. A CI pipeline that consumes out.v0.1 output can validate it against the published schema for that version.

Versioned schema contracts are used in Stave, a Go CLI for offline security evaluation. The embedded JSON Schemas validate input at load time with additionalProperties: false catching typos. The trace.v0.1 logic trace schema was added following the same pattern.

Visual Regression Testing for CLIs with VHS

Bala Paranj — Mon, 18 May 2026 12:06:31 +0000

How to use Charm's VHS to create GIF-based visual regression tests for your CLI's terminal output — catching formatting bugs that unit tests miss.

Your CLI's unit tests verify that the right data comes out. But they don't test what the user actually sees.

A missing newline. A table column that wraps at 80 characters. A progress spinner that bleeds into the output. An ANSI color code that renders as garbage on a light terminal theme. These are visual bugs that pass every unit test but make your CLI look broken.

VHS by Charm solves this by recording your terminal as a GIF from a script — and you can use those GIFs as visual regression tests.

Using VHS

VHS reads a .tape file that describes terminal interactions:

# demo.tape
Output demo.gif
Set Width 120
Set Height 40
Set Theme "Monokai"

Type "stave apply --controls ./controls --observations ./obs --format text"
Enter
Sleep 2s

Run it:

vhs demo.tape

Output: demo.gif — a pixel-perfect recording of what the terminal looks like when that command runs.

How This Differs from Asciinema

	Asciinema (.cast)	VHS (.gif/.png)
Output	Text-based replay (NDJSON)	Pixel-based image (GIF/PNG/WebM)
Renders	In a JavaScript player	As a static image anywhere
Tests	Text content correctness	Visual formatting correctness
Use case	Documentation, interactive replay	README badges, visual regression
File size	Small (text)	Large (image)
Searchable	Yes (it's text)	No (it's pixels)

Asciinema answers: "What text does the CLI produce?"
VHS answers: "What does the CLI look like?"

Both are useful. They test different things.

Visual Regression Testing Pattern

Step 1: Create a `.tape` file per workflow

# tapes/apply-violation.tape
Output testdata/screenshots/apply-violation.gif
Set Width 120
Set Height 40
Set FontSize 14
Set Theme "Catppuccin Mocha"

Type "stave apply --controls controls/s3 --observations observations --now 2026-01-15T00:00:00Z --format text"
Enter
Sleep 3s

Step 2: Generate the baseline

vhs tapes/apply-violation.tape

Commit testdata/screenshots/apply-violation.gif as the golden file.

Step 3: Compare in CI

# .github/workflows/visual.yml
- name: Generate screenshots
  run: |
    for tape in tapes/*.tape; do
      vhs "$tape"
    done

- name: Check for visual changes
  run: |
    git diff --exit-code testdata/screenshots/

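If any GIF changes, the diff catches it. The developer reviews the visual change and either updates the golden file or fixes the formatting bug. This check assumes the recording is byte-for-byte reproducible between runs; if the GIFs turn out to be flaky, VHS can also write .txt or .ascii output, which diffs as plain text.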

Step 4: Review with PR comments

For GitHub PRs, you can post the before/after GIF directly in a comment:

- name: Post visual diff
  if: failure()
  run: |
    echo "Visual regression detected. See the updated screenshots below."
    # Upload artifacts or post to PR

What Visual Tests Catch That Unit Tests Miss

Table alignment

CONTROL_ID          ASSET_ID              STATUS
CTL.S3.PUBLIC.001   my-very-long-bucket   NON_COMPLIANT
                    -name-that-wraps

A unit test checks that the data is correct. A visual test catches that the column wraps and breaks the alignment.
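Some wrapping problems can also be flagged cheaply in a regular Go test by checking line widths against the terminal width used in the tape. This is a rough complement to a visual test, not a replacement; `LinesWiderThan` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// LinesWiderThan returns the 1-based numbers of lines whose rune count
// exceeds width. Rune count only approximates display width: wide CJK
// characters and emoji occupy two cells, and ANSI codes inflate the count.
func LinesWiderThan(output string, width int) []int {
	var wide []int
	for i, line := range strings.Split(output, "\n") {
		if utf8.RuneCountInString(line) > width {
			wide = append(wide, i+1)
		}
	}
	return wide
}

func main() {
	table := "CONTROL_ID          ASSET_ID              STATUS\n" +
		"CTL.S3.PUBLIC.001   my-very-long-bucket-name-that-wraps   NON_COMPLIANT"
	fmt.Println(LinesWiderThan(table, 60))
}
```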

Color and formatting

[PASS] CTL.S3.ENCRYPT.001 — Server-Side Encryption
[FAIL] CTL.S3.PUBLIC.001 — No Public Read Access

A unit test sees [PASS] and [FAIL]. A visual test sees whether the ANSI color codes render correctly — green for pass, red for fail — or whether they produce \033[32m[PASS]\033[0m garbage.
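On the unit-test side, a small helper can at least detect or strip raw color codes so text assertions stay stable. The sketch below handles only SGR (color and style) sequences, and the helper names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// sgr matches ANSI SGR (color/style) escape sequences such as "\x1b[32m".
var sgr = regexp.MustCompile(`\x1b\[[0-9;]*m`)

// StripANSI removes color codes so assertions see plain text.
func StripANSI(s string) string { return sgr.ReplaceAllString(s, "") }

// HasANSI reports whether s contains color codes, which is useful
// for asserting that piped output is clean.
func HasANSI(s string) bool { return sgr.MatchString(s) }

func main() {
	colored := "\x1b[32m[PASS]\x1b[0m CTL.S3.ENCRYPT.001"
	fmt.Println(StripANSI(colored), HasANSI(colored))
}
```

This confirms the bytes are present or absent. Whether they render as green on a given theme is still something only the visual test shows.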

Progress indicators

Running: evaluating controls... ⠋

A spinner that works in a real terminal but bleeds into piped output. A visual test with a fixed terminal size catches this.

Help text layout

Usage:
  stave apply [flags]

Flags:
  -i, --controls string   Path to control definitions (default "controls/s3")
  -o, --observations string
                          Path to observation snapshots (default "observations")

Does the flag help wrap correctly? Are the defaults aligned? Is the long description properly indented? Unit tests don't check layout. VHS checks layout.

VHS `.tape` Cheat Sheet

Output file.gif              # Output file (gif, png, webm, mp4)
Set Width 120                # Terminal width
Set Height 40                # Terminal height
Set FontSize 14              # Font size in pixels
Set Theme "Dracula"          # Terminal theme
Set TypingSpeed 50ms         # Delay between keystrokes

Type "command"               # Type text (simulated keystrokes)
Enter                        # Press Enter
Sleep 2s                     # Wait for output
Ctrl+C                       # Send interrupt
Tab                          # Press Tab (for completion testing)
Backspace 5                  # Delete 5 characters

Hide                         # Stop recording (for setup commands)
Show                         # Resume recording

Combining Both Tools

For a complete CLI testing strategy:

Layer	Tool	Tests
Unit tests	`go test`	Data correctness, error handling, exit codes
E2E golden files	`go test` + JSON comparison	Full output correctness, determinism
Text recordings	Custom asciicast generator	Documentation accuracy, demo freshness
Visual regression	VHS	Formatting, alignment, colors, layout

Each layer catches different bugs. Unit tests catch logic errors. Golden files catch output regressions. Asciicast recordings catch documentation drift. VHS catches visual formatting bugs.
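For the golden-file layer, one common approach is to compare JSON semantically rather than byte-for-byte, so key order and whitespace do not cause false failures. The sketch below is a generic illustration with hypothetical helper names, not Stave's test code.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// normalize re-encodes JSON so key order and whitespace differences
// do not cause spurious mismatches.
func normalize(raw []byte) ([]byte, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	return json.Marshal(v) // map keys are emitted in sorted order
}

// MatchesGolden reports whether got and golden are the same JSON value.
func MatchesGolden(got, golden []byte) (bool, error) {
	a, err := normalize(got)
	if err != nil {
		return false, fmt.Errorf("got: %w", err)
	}
	b, err := normalize(golden)
	if err != nil {
		return false, fmt.Errorf("golden: %w", err)
	}
	return bytes.Equal(a, b), nil
}

func main() {
	got := []byte(`{"kind":"ASSESSMENT","findings":[]}`)
	golden := []byte(`{ "findings": [], "kind": "ASSESSMENT" }`)
	fmt.Println(MatchesGolden(got, golden))
}
```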

Getting Started

# Install VHS (macOS)
brew install charmbracelet/tap/vhs

# Install VHS (Linux, or anywhere with Go); ttyd and ffmpeg must also be on your PATH
go install github.com/charmbracelet/vhs@latest

# Create your first tape
cat > hello.tape << 'EOF'
Output hello.gif
Set Width 80
Set Height 24
Type "echo 'Hello from VHS'"
Enter
Sleep 1s
EOF

# Record
vhs hello.tape

The GIF is your visual test. Commit it, compare it in CI, review it in PRs.

Stave uses programmatic asciicast generation for documentation recordings and Go-based golden file testing for output correctness. VHS is the natural next step for visual regression testing of the text-formatted output.

Z3 Can Prove Your Cloud is Unsafe. It Can't Tell You Why.

Bala Paranj — Sun, 17 May 2026 11:35:08 +0000

Z3 is one of the most powerful reasoning engines ever built. Microsoft Research created it to verify chip designs and flight software. It can take your cloud configuration, model it as a set of logical assertions, and mathematically prove whether an attack path exists.

When it finds one, Z3 says:

sat

Three letters. No context. No explanation. No link to the configuration property that caused it. No fix. Just "sat" — which means "satisfiable," which means "the unsafe state you asked about is reachable," which means your cloud is vulnerable. Probably. If you encoded the question correctly. Which you can't verify from the output.

This is the gap between a powerful engine and a useful tool. The engine answers the question. The tool explains the answer. This article explains why the explanation layer matters more than the engine, and what it looks like in practice.

What Z3 outputs

Let's say you want to check whether an anonymous internet user can read PHI data from an S3 bucket through a Cognito identity pool. You model the configuration as SMT-LIB assertions, write a query, and run it through Z3.

The input Z3 sees:

(set-logic ALL)
(declare-fun allows_unauthenticated (String String) Bool)
(declare-fun maps_unauth_to (String String) Bool)
(declare-fun has_action (String String) Bool)
(declare-fun has_tag (String String) Bool)

(assert (allows_unauthenticated "pool-abc" "true"))
(assert (maps_unauth_to "pool-abc" "role/AppUnauthRole"))
(assert (has_action "role/AppUnauthRole" "s3:GetObject"))
(assert (has_tag "bucket-prod-phi" "data_classification:phi"))

(declare-const principal String)
(declare-const role String)
(declare-const action String)
(declare-const resource String)

(assert (allows_unauthenticated principal "true"))
(assert (maps_unauth_to principal role))
(assert (has_action role action))
(assert (has_tag resource "data_classification:phi"))

(check-sat)
(get-model)

The output Z3 produces:

sat
(
  (define-fun principal () String "pool-abc")
  (define-fun role () String "role/AppUnauthRole")
  (define-fun action () String "s3:GetObject")
  (define-fun resource () String "bucket-prod-phi")
)

Here sat means the assertions are satisfiable, and because the query encodes the forbidden state, that state is reachable. The model names the specific pool, action, and bucket. The proof is mathematically sound.

If you're a security engineer who just wants to know whether your Cognito configuration is safe, this output is useless. You have three questions:

Is this result correct? Did the input assertions match your configuration, or did the translation introduce a bug?
What does it mean? Which specific settings in which specific files create the vulnerability?
What do I do about it? What's the fix, what does it cost, and how do I verify it worked?

Z3 answers none of these. It answered the math question. The security question is still open.

The two translation boundaries

Between your cloud configuration and Z3's answer, there are two translation steps. Each can introduce bugs. Each is invisible if you only look at the solver's output.

YOUR CLOUD CONFIG          THE MATH              YOUR ANSWER

  S3 bucket policy    →    SMT-LIB assertions    →    "sat"
  IAM role policy          (was the translation        (was the translation
  Cognito settings          correct?)                   back to cloud
                                                        terms correct?)

  ENCODING BOUNDARY        Z3 SOLVER              DECODING BOUNDARY
  (can have bugs)          (trusted)              (can have bugs)

Encoding bugs: Your S3 bucket has PublicAccessBlock.BlockPublicPolicy = true, but the translation emits has_public_read "bucket" "true". The assertion is wrong — the bucket is private, but Z3 thinks it's public. Z3 faithfully returns sat based on the wrong input. The "vulnerability" doesn't exist. You can't tell from sat that the encoding was wrong.

Decoding bugs: Z3 returns sat with a model. You read the model and conclude "the bucket is directly accessible from the internet." But the actual path is through the Cognito identity pool, not direct access. The model told you which variables were assigned, but the interpretation of those variables — what they mean in cloud terms — is your responsibility. Misread the model, misunderstand the vulnerability.

The solver is the most reliable component in the pipeline. The translation layers are where the bugs hide. And they're the layers that get the least attention because everyone focuses on the solver.

What an orchestration layer provides

An orchestration layer wraps the solver with five capabilities that make the answer trustworthy, traceable, and actionable:

1. Encoding explanation — "Did the tool understand my configuration?"

Before the solver runs, you see what the tool extracted from your configuration, in your language:

Configuration Summary: 7 assets, 43 facts extracted

Asset: arn:aws:s3:::prod-phi (S3 Bucket)
  ├── Public read access: ENABLED
  │   Source: prod-phi-bucket.obs.json → access.public_read
  │   Fact: a3f8c2e91b04
  │
  ├── Encryption: AES256 (SSE-S3, not KMS)
  │   Source: prod-phi-bucket.obs.json → encryption.algorithm
  │   Fact: b7d1e4f03c89
  │
  └── Data classification: PHI
      Source: prod-phi-bucket.obs.json → tags.data_classification
      Fact: c9e2a1f04d77

Asset: arn:aws:cognito-identity:...:identitypool/abc (Identity Pool)
  ├── Unauthenticated access: ALLOWED
  │   Source: cognito-pool.obs.json → identity.access.allow_unauthenticated
  │   Fact: d4e5f6a7b8c9
  │
  └── Maps unauthenticated users to: arn:aws:iam::111122223333:role/AppUnauthRole
      Source: cognito-pool.obs.json → identity.cognito.unauth_role_arn
      Fact: e5f6a7b8c9d0

No SMT-LIB. No predicate names. "Public read access: ENABLED" — the security engineer compares this to their mental model of the bucket. If the bucket is private, the encoding is wrong and the engineer knows before the solver runs.

Every fact has a unique identifier and a traceable source showing which file and which property path produced it. This is the audit trail. When the solver's answer doesn't match expectations, the engineer traces the identifier back to the source and checks whether the encoding was correct.
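One way such an identifier could be derived is by hashing the fact's content together with its source. The sketch below is an assumption, not Stave's actual scheme: the field set, the separator, and the 12-character truncation are hypothetical, chosen only to match the shape of the identifiers shown above.

```python
import hashlib


def fact_id(subject: str, predicate: str, value: str,
            source_file: str, property_path: str) -> str:
    """Derive a short, deterministic identifier for one extracted fact.

    Hypothetical scheme: SHA-256 over the fact's fields, truncated to
    12 hex characters. The same inputs always give the same identifier.
    """
    # A control-character separator (unlikely to occur in the fields)
    # keeps ("ab", "c") from hashing the same as ("a", "bc").
    payload = "\x1f".join([subject, predicate, value, source_file, property_path])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


pool_fact = fact_id(
    "pool-abc", "allows_unauthenticated", "true",
    "cognito-pool.obs.json", "identity.access.allow_unauthenticated",
)
```

Because the identifier depends only on the fact and its source, re-extracting the same snapshot yields the same identifiers, which is what makes a later trace possible.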

2. Human verdict — "What does the result mean?"

After the solver runs, you see the answer in security language:

VERDICT: UNSAFE

An anonymous internet user can read PHI data from the prod-phi
bucket through the Cognito identity pool.

The forbidden state is reachable because:
  1. Identity pool allows unauthenticated access
     (cognito-pool.obs.json → identity.access.allow_unauthenticated = true)
  2. Unauthenticated users receive credentials for AppUnauthRole
     (cognito-pool.obs.json → identity.cognito.unauth_role_arn)
  3. AppUnauthRole has s3:GetObject permission
     (iam-role.obs.json → policies.attached_policies[0].Action)
  4. Target bucket contains PHI
     (prod-phi-bucket.obs.json → tags.data_classification = phi)

Not sat. Not a model with define-fun expressions. A four-step chain in plain English, each step linked to a specific file and property path. The security engineer reads it and knows exactly which settings create the vulnerability and where they live.
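The translation from solver result to verdict can be sketched in a few lines. This is a minimal illustration, assuming the chain steps and their provenance have already been recovered from the model; the `Step` type and `render_verdict` function are hypothetical names, not Stave's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One link in the attack chain, with its provenance."""
    description: str
    source_file: str
    property_path: str


def render_verdict(solver_result: str, summary: str, steps: list[Step]) -> str:
    """Translate a raw solver result into a plain-language verdict."""
    if solver_result == "unsat":
        return "VERDICT: SAFE"
    if solver_result != "sat":
        # "unknown", timeouts, and errors must never be reported as safe.
        return "VERDICT: INCONCLUSIVE"
    lines = ["VERDICT: UNSAFE", "", summary, "",
             "The forbidden state is reachable because:"]
    for number, step in enumerate(steps, start=1):
        lines.append(f"  {number}. {step.description}")
        lines.append(f"     ({step.source_file} → {step.property_path})")
    return "\n".join(lines)
```

The one design choice worth noting is the middle branch: anything other than a definite sat or unsat is surfaced as inconclusive rather than folded into SAFE.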

3. Fix guidance — "What do I do about it?"

FIX: Disable unauthenticated access on the identity pool.

  aws cognito-identity update-identity-pool \
    --identity-pool-id us-east-1:abc123 \
    --no-allow-unauthenticated-identities

  Cost: $0. Time: 30 seconds.
  Effect: Breaks the chain at step 1.

VERIFICATION: After applying the fix, re-run the analysis.
  Expected result: SAFE
  (Z3 returns UNSAT — no assignment of principals, actions,
  and resources can satisfy the attack path conditions.)

The fix is a shell command. The cost is quantified. The effect names the specific chain step that breaks. The verification tells the engineer what to expect and explains the solver's output in cloud terms.
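The re-run can be illustrated without a solver. The sketch below is a brute-force stand-in for Z3, not how Z3 or Stave works internally: it treats the facts as a closed world (anything not listed is false) and enumerates candidate pool, role, and bucket combinations. The fact tuples reuse the names from the SMT-LIB example; the function name is hypothetical, and the action is fixed to s3:GetObject for brevity.

```python
from itertools import product

Fact = tuple[str, str, str]  # (predicate, subject, object)

BEFORE_FIX: set[Fact] = {
    ("allows_unauthenticated", "pool-abc", "true"),
    ("maps_unauth_to", "pool-abc", "role/AppUnauthRole"),
    ("has_action", "role/AppUnauthRole", "s3:GetObject"),
    ("has_tag", "bucket-prod-phi", "data_classification:phi"),
}


def find_attack_path(facts: set[Fact]):
    """Return one (pool, role, action, bucket) witness, or None if unreachable."""
    subjects = sorted({f[1] for f in facts})
    objects = sorted({f[2] for f in facts})
    for pool, role, bucket in product(subjects, objects, subjects):
        if (("allows_unauthenticated", pool, "true") in facts
                and ("maps_unauth_to", pool, role) in facts
                and ("has_action", role, "s3:GetObject") in facts
                and ("has_tag", bucket, "data_classification:phi") in facts):
            return pool, role, "s3:GetObject", bucket
    return None


# Applying the fix flips the pool's fact from "true" to "false".
AFTER_FIX = (BEFORE_FIX - {("allows_unauthenticated", "pool-abc", "true")}) | {
    ("allows_unauthenticated", "pool-abc", "false")
}
```

Before the fix the search returns a witness, the analogue of sat with a model. After the fix it returns None, the analogue of UNSAT: no assignment completes the chain.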

4. Traceability — "Which property caused this?"

Every step in the verdict traces back to a specific property in a specific file through a unique identifier:

# The verdict says step 1 caused the chain.
# Trace identifier d4e5f6a7b8c9:

grep "d4e5f6a7b8c9" facts.jsonl

# Returns:
# fact_id: d4e5f6a7b8c9
# subject: pool-abc
# predicate: allows_unauthenticated = true
# source: cognito-pool.obs.json
# property: identity.access.allow_unauthenticated
# captured: 2026-05-01T00:00:00Z

One identifier. One grep. Full trace from the verdict to the configuration property, including when the snapshot was taken. No manual correlation across output files.
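The same lookup is easy to script. The sketch below assumes facts.jsonl holds one JSON object per line with the field names from the grep output above; that layout is an assumption for illustration, and an in-memory stream stands in for the file.

```python
import io
import json


def trace_fact(jsonl_stream, wanted_id: str):
    """Return the first record whose fact_id matches, or None."""
    for line in jsonl_stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("fact_id") == wanted_id:
            return record
    return None


# Stand-in for facts.jsonl; field names follow the grep output above.
SAMPLE = io.StringIO(
    '{"fact_id": "d4e5f6a7b8c9", "subject": "pool-abc", '
    '"predicate": "allows_unauthenticated", "value": "true", '
    '"source": "cognito-pool.obs.json", '
    '"property": "identity.access.allow_unauthenticated", '
    '"captured": "2026-05-01T00:00:00Z"}\n'
    '{"fact_id": "e5f6a7b8c9d0", "subject": "pool-abc", '
    '"predicate": "maps_unauth_to", "value": "role/AppUnauthRole", '
    '"source": "cognito-pool.obs.json", '
    '"property": "identity.cognito.unauth_role_arn", '
    '"captured": "2026-05-01T00:00:00Z"}\n'
)

hit = trace_fact(SAMPLE, "d4e5f6a7b8c9")
```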

5. Encoding verification — "Can I trust the translation?"

The orchestration layer verifies its own encoding by comparing each extracted fact against the raw configuration:

Encoding verification: 43/43 facts verified ✓

Every extracted fact matches the corresponding property
in the observation file. The solver's input is consistent
with the configuration snapshot.

Or, when there's a bug:

Encoding verification: 41/43 facts verified

MISMATCH:
  Fact d4e5f6a7b8c9:
    Extracted: allows_unauthenticated = "true"
    Observation: identity.access.allow_unauthenticated = "false"
    File: cognito-pool.obs.json
    → ENCODING BUG: fact says unauthenticated is allowed,
      but the configuration says it's disabled.
      The UNSAFE verdict may be incorrect.

To our knowledge, no other tool on the market verifies its own translation layer. The security engineer doesn't need to read SMT-LIB to trust the result: the tool shows that its encoding matches the configuration, or reports where it does not.
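A check of this kind can be sketched as a comparison between each extracted fact and the value found at its property path in the observation file. The record layout and function names below are hypothetical; the mismatch reproduces the example above.

```python
def get_path(document: dict, dotted_path: str):
    """Follow a dotted property path through nested dicts; None if missing."""
    node = document
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def verify_encoding(facts: list[dict], observations: dict[str, dict]) -> list[dict]:
    """Return the facts whose value disagrees with the observation file."""
    mismatches = []
    for fact in facts:
        observed = get_path(observations.get(fact["source"], {}), fact["property"])
        if observed != fact["value"]:
            mismatches.append({**fact, "observed": observed})
    return mismatches


observations = {
    "cognito-pool.obs.json": {
        "identity": {"access": {"allow_unauthenticated": False}}
    }
}
facts = [{
    "fact_id": "d4e5f6a7b8c9",
    "value": True,  # what the encoder extracted
    "source": "cognito-pool.obs.json",
    "property": "identity.access.allow_unauthenticated",
}]
mismatches = verify_encoding(facts, observations)
```

A comparison like this catches a value that changed during extraction. It would not, on its own, catch a fact that was read from the wrong property path in the first place, so the path mapping still needs its own review.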

What you get without orchestration vs. with it

Capability	Z3 alone	With orchestration
Prove an attack path exists	`sat`	UNSAFE: anonymous user can read PHI data
Explain which settings cause it	Read the SMT-LIB model	Four-step chain with file names and property paths
Verify the encoding is correct	Manually review assertions	Automated: 43/43 facts match observations
Trace the finding to a file	Grep through comments	One identifier, one grep, full trace
Get the fix	Derive it from the model	Shell command, cost, time, expected result
Run multiple solvers	Reformat per solver	Same input, three solvers, consensus
Explain to an auditor	Show them SMT-LIB	Show them the chain in English with evidence

The right column is the product. The left column is a math library. Security engineers don't need a math library — they need trustworthy answers in their language.

Why this matters more for compound detection

The orchestration layer is most valuable when the analysis spans multiple services. A single-service check ("is this bucket public?") is simple enough to verify manually. A cross-service check ("can an anonymous user reach PHI data through a chain of Cognito → IAM → S3?") involves three services, three configuration files, and dozens of properties.

The encoding has more places to be wrong. The verdict has more steps to explain. The traceability has more links to follow.

This is where raw solver output becomes dangerous. If Z3 returns sat on a three-service chain and the encoding has a bug in the IAM layer, the finding is a false positive. The security team triages it as critical, burns two sprints investigating, and discovers it was caused by a property path error in the translation code. The encoding verification catches that error before the solver runs.

If Z3 returns unsat and the encoding has a bug — a property that should be true was encoded as false — the finding is a false negative. The team thinks they're safe. They're not. The encoding verification catches that too.

The more complex the analysis, the more valuable the translation layer. Single-service checks can tolerate raw solver output. Cross-service compound detection cannot.

The bottom line

Don't evaluate formal verification tools by which solver they use. Z3, cvc5, and Yices all produce correct answers to the questions they're asked. What matters is whether the question was asked correctly and whether the answer is translated back into your language.

Don't accept sat as an answer. Accept "UNSAFE: an anonymous internet user can read PHI data from the prod-phi bucket through the Cognito identity pool, caused by these four settings in these three files, fixable with this one command for $0 in 30 seconds, verified by three independent solvers, encoding confirmed correct against 43 observation properties."

That's an answer. sat is a data point.

The orchestration layer described in this article is implemented in Stave, an open-source static analysis tool that evaluates cloud configurations via CEL predicates and exports standardized facts for consumption by nine independent reasoning engines. The encoding explanation, verdict translation, traceability, and encoding verification are built on Stave's fact_id provenance chain — every fact carries a deterministic identifier linking the solver's input to the specific observation file and property path that produced it. All analysis runs on air-gapped snapshots with no cloud credentials required.

Proof, not prediction: where formal verification beats AI in cloud security

Bala Paranj — Sat, 16 May 2026 11:33:09 +0000

An AI scanner says 'this configuration looks unsafe' with 87% confidence. A formal verifier says 'this configuration IS unsafe, here is the exact principal, action, and resource that proves it.' One is a prediction. The other is evidence. The difference matters for insurance, for audits, and for the 80% of cloud security questions that have exact answers.

A CISO walks into a renewal meeting with the cyber-insurance underwriter. The underwriter asks one question:

Can your most sensitive S3 bucket be accessed by an unauthorized principal?

There are two ways to answer. An AI-powered cloud security tool gives you something like "this bucket appears to have appropriate controls based on similar configurations in our training data, confidence: 92%." A formal verifier gives you a yes/no with a witness: "no: UNSAT against the following 17 constraints over the bucket policy, identity federation, IAM, and account-level Public Access Block."

One is a prediction. The other is a proof. The underwriter knows which one survives a subrogation suit.

What is a proof

When a solver like Z3 says UNSAT, it has shown that no assignment of the free variables in the model satisfies the constraints. There is no principal, no action, no resource, no role assumption, no trust path, no policy condition — within the model — that violates the property. The answer is mathematically complete relative to the model.

When the same solver says SAT, it returns a witness: the concrete assignment that satisfies the constraints. The witness names the exact principal, action, and resource that constitute the violation. The output is constructive. You didn't just learn the property is false; you got the counterexample that proves it.

There is no confidence score. There is no probability. There is no temperature setting. There is no false-positive rate. There is no training data. There is no "this might be similar to a known bad pattern." There is a function from facts to verdicts, and the verdict is correct relative to the inputs it was given.

The qualification "relative to the model" is critical. If your model doesn't include SCP evaluation, the solver won't catch SCP issues. If your model doesn't include Cognito identity-pool trust, the solver won't catch identity-federation chains. But the model's limitations are enumerable. You can list them. Operators know exactly what the verifier covers and what it doesn't.

An AI tool's limitations are statistical and unknowable. You can't enumerate what the training data didn't cover. A configuration that doesn't match any training pattern gets missed silently. A configuration that superficially resembles a bad pattern gets flagged incorrectly. The miss is invisible until the breach.

Proof vs prediction

Dimension	Formal verification (Z3 / cvc5 / Yices)	AI-based cloud-security tools
Output shape	Proof or counterexample	Prediction with confidence score
Correctness guarantee	Mathematically complete relative to the model	Statistically approximate relative to the training data
False positives	Zero relative to the model	Non-zero, tunable via threshold
False negatives	Zero relative to the model	Non-zero, depends on training coverage
Explainability	Exact witness — "principal P, action A, resource R"	This configuration resembles known-bad patterns
Novel attack paths	Discovers paths nobody has seen before, if the model captures them	Recognises only patterns similar to training data
Reproducibility	Deterministic — same input always produces the same answer	Varies across model versions, temperatures, prompts

The deterministic property makes the verifier auditable. Run it today, get UNSAT. Run it tomorrow against the same fact base, get UNSAT. Run it in three years when the auditor asks for evidence, get UNSAT with the same fact base. The proof is the artifact. It does not decay.
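Reproducibility only helps if the re-run can be tied to the same input. One way to do that, sketched here under assumptions (the record fields and function names are hypothetical, not a Stave format), is to archive a digest of the exact fact-base bytes next to the verdict and compare both on every re-run.

```python
import hashlib


def evidence_record(fact_base: bytes, verdict: str, solver: str) -> dict:
    """Bind a verdict to the exact bytes of the fact base it was computed on."""
    return {
        "fact_base_sha256": hashlib.sha256(fact_base).hexdigest(),
        "verdict": verdict,
        "solver": solver,
    }


def same_evidence(archived: dict, fact_base: bytes, verdict: str) -> bool:
    """True if a later re-run used the same facts and reached the same verdict."""
    digest = hashlib.sha256(fact_base).hexdigest()
    return archived["fact_base_sha256"] == digest and archived["verdict"] == verdict


# Placeholder fact base: a query with no assertions, which any solver reports as sat.
smt2 = b"; placeholder fact base\n(check-sat)\n"
archived = evidence_record(smt2, "sat", "z3")
```

If either the digest or the verdict differs on a later run, the archived evidence no longer describes what was just executed, and that difference is itself worth investigating.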

An AI prediction at 92% confidence today is an 87% confidence prediction tomorrow because the model was updated, or a 79% confidence prediction next week because the prompt template changed. The prediction is not an artifact. It is an output of a service. The service can be re-priced, re-trained, or shut down. Your audit evidence cannot.

The cost structure that nobody mentions

The accuracy story is the visible one. The cost story is the one that bends adoption.

Z3 runs on a single CPU core. The binary is ~10 MB. Running it on Stave's full SMT-LIB export — 5,000 facts plus 90 closed-world axioms — completes in milliseconds:

$ time z3 facts.smt2
sat

real    0m0.131s
user    0m0.123s
sys     0m0.009s

130 milliseconds. No GPU. No API call. No usage tier.

An AI-based tool pays for every query. Inference cost, GPU hours, API fees, token pricing. The cost scales linearly with usage: more assets, more policies, more frequent runs, higher bill. Enterprise customers with thousands of resources across multiple AWS accounts pay proportionally.

The cost ratio shows up most starkly in continuous-assessment scenarios. A cyber-insurance underwriter might require posture verification on every configuration change. For a company making hundreds of changes daily across multiple accounts, the AI-based tool runs hundreds of times per day per account. The formal verifier runs the same number of times — at a marginal cost of zero. The CPU was already paid for.
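The scaling argument reduces to simple arithmetic. In the sketch below every number is hypothetical (the article gives no prices); only the shape matters: one cost grows with the number of runs, the other does not.

```python
def monthly_runs(changes_per_day: int, accounts: int, days: int = 30) -> int:
    """One verification per configuration change, per account."""
    return changes_per_day * accounts * days


def metered_cost(runs: int, fee_per_run: float) -> float:
    """Usage-priced tool: cost grows linearly with the number of runs."""
    return runs * fee_per_run


def local_solver_cost(runs: int, fixed_monthly: float = 0.0) -> float:
    """Local binary on hardware already paid for: cost does not depend on runs."""
    return fixed_monthly


# Hypothetical inputs: 300 changes per day, 5 accounts, $0.02 per metered query.
runs = monthly_runs(changes_per_day=300, accounts=5)
```

With these made-up inputs the metered tool handles 45,000 runs a month, and doubling the change rate doubles its bill, while the local solver's marginal cost stays where it was.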

Dimension	AI-based tools	Stave + Z3/cvc5
Per-query cost	API inference fee	Zero
Infrastructure	GPU clusters	Single CPU core
Scaling cost	Linear with usage	Flat
Offline capability	Requires API connectivity	Runs fully offline (`STAVE_NO_NETWORK=1`)
Predictability	Variable pricing, usage-based	Fixed — it's a binary

The offline guarantee compounds with the cost guarantee. Air-gapped environments — defence, classified workloads, certain financial verticals — cannot ship configuration data to a third-party AI API. They must analyse locally. A formal verifier was designed for this constraint. AI-powered SaaS tools were designed for the opposite constraint. The trade-off is structural, not a UX choice.

Why this matters for insurance and evidence

Cyber-insurance underwriting is the canary for this entire shift. Underwriters today read CSPM-tool screenshots and trust the vendor's narrative. The narrative is statistical: "our customers have 92% lower breach rates." The screenshot is an opinion: "this bucket appears compliant."

Underwriters are moving toward evidence. They want artifacts they can hand to legal, archive in their underwriting file, and produce in a subrogation suit if the breach happens anyway. A formal proof — (check-sat) returning unsat against a specific fact base on a specific date — is an artifact. It can be archived. It can be re-run. It can be independently verified by Z3, cvc5, or Yices to confirm the answer.

The proof and the fact base are the audit evidence. The fact base is obs.v0.1 JSON, schema-validated, with provenance for every property. The proof is whatever the solver returns. Together they reconstruct the verdict exactly. The auditor does not need to trust the security vendor's model — they can run the solver themselves against the committed fact base and verify the verdict independently.

A Z3 proof is evidence. An AI prediction is an opinion.

This is also Stave's strongest positioning. When an insurer asks "prove your S3 bucket can't be accessed by unauthenticated principals," the formal answer is the artifact the underwriter is starting to demand.

Where AI genuinely belongs

Roughly 80–90% of cloud-security questions have exact answers:

Is MFA enabled on the root account?
Does this bucket policy allow Principal: *?
Can an unauthenticated Cognito identity assume this role?
Is CloudTrail logging multi-region?
Are these three controls all violated on the same asset?

None of these requires a language model. None of them benefits from a confidence score. They are predicates over structured facts. The deterministic verifier answers them in milliseconds with mathematical completeness relative to the model. Paying AI pricing for "is MFA enabled?" is a category error.
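To make "predicates over structured facts" concrete, here are two of the questions above written as plain functions. This is a simplified sketch: the `root_mfa_enabled` snapshot field is a hypothetical name, and the policy check looks only at Effect and Principal, ignoring Condition blocks that a real evaluation would have to consider.

```python
def root_mfa_enabled(account: dict) -> bool:
    """True only if the snapshot explicitly records root MFA as enabled."""
    return account.get("root_mfa_enabled") is True


def allows_any_principal(bucket_policy: dict) -> bool:
    """True if any Allow statement grants access to Principal "*"."""
    statements = bucket_policy.get("Statement", [])
    if isinstance(statements, dict):  # a policy may hold a single statement object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if principal == "*":
            return True
        if isinstance(principal, dict):
            aws = principal.get("AWS")
            if aws == "*" or (isinstance(aws, list) and "*" in aws):
                return True
    return False
```

Both functions return a boolean from structured input. There is nothing to tune, and the same snapshot always gives the same answer.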

The remaining 10–20% genuinely needs AI:

Natural-language policy interpretation: parsing a vendor's English SOC 2 attestation and asking "does this commit them to encrypting backups?" The text is ambiguous; the intent must be inferred.
Anomaly detection in behavioural data: "is this CloudTrail access pattern unusual for this principal?" The baseline is statistical, the answer is probabilistic.
Intent inference from incomplete documentation: "does the team that owns this bucket consider its contents public-by-design, or did the public ACL get added by mistake?" The signal is in commit messages, ticket history, and Slack threads; the answer is judgment.

AI tools are valuable on those. They are also expensive on those. The cost is justified because the alternative is human review at twice the price.

The foundation-layer architecture

The right shape is composition. Deterministic verification at the bottom; AI judgement at the top:

+-------------------------------------------------------+
|  AI tools (10-20% of work)                            |
|    Policy interpretation, anomaly detection,          |
|    intent inference from English / behavioural data   |
+-------------------------------------------------------+
|  Stave + Z3/cvc5 (80-90% of work)                     |
|    Compound chain detection, formal proofs,           |
|    deterministic verdicts on structured facts         |
+-------------------------------------------------------+
|  Snapshot facts (obs.v0.1)                            |
|    Schema-validated, provenance-tracked, offline      |
+-------------------------------------------------------+

The AI bill drops 80–90% because trivial deterministic checks no longer route through an inference endpoint. The remaining AI workload has a much better signal-to-noise ratio because it sees only the hard cases — the ones where human judgement is the alternative. The AI tool gets better, not worse, when the deterministic layer is in place underneath it.

This is also the partnership angle. An AI-powered cloud-security vendor benefits from Stave handling the deterministic layer: their costs drop because the volume drops, their accuracy improves because the inputs are cleaner, and their pricing model survives the next round of GPU price increases because they're no longer paying inference cost for questions like "is encryption enabled?"

Stave does not replace AI tools. It makes them economically viable by removing the work they shouldn't be doing.

A 130-millisecond demo

The repository ships a working pipeline. Clone, build, and run:

git clone https://github.com/sufield/stave
cd stave && make build

# Pick a real fixture — HackerOne #1021906, the Shopify "tag says internal,
# policy says everyone" case. The fixture is a reconstructed snapshot from
# the public disclosure.
FIXTURE=testdata/e2e/e2e-h1-shopify-1021906

# Export the snapshot as SMT-LIB v2 — facts only, no query.
./stave export-sir --format smt2 \
  --controls $FIXTURE/controls \
  --observations $FIXTURE/observations \
  --now 2026-01-15T00:00:00Z > /tmp/facts.smt2

# Append the satisfiability query.
echo "(check-sat)" >> /tmp/facts.smt2

# Two independent solvers.
z3 /tmp/facts.smt2
cvc5 --lang smt2 /tmp/facts.smt2

z3 returns sat on 302 lines of SMT-LIB in 132 milliseconds. cvc5 runs over the same input as an independent cross-check; on this particular fact base it returns unknown, which is itself an honest verdict: the second solver could not establish the answer with its default options. Two engines returning the same sat/unsat is the strongest cross-validation available here; one returning unknown is the signal to widen the timeout, change solver options, or refine the model. The point is that the protocol (facts plus query plus solver) is reproducible. The verdict is auditable regardless of which solver produced it.
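The cross-check rule can be written down explicitly. The sketch below is one possible policy, not Stave's: the function name and the status labels are hypothetical, and any result other than sat or unsat is treated as undecided.

```python
def consensus(verdicts: dict[str, str]) -> str:
    """Combine per-solver results ("sat", "unsat", "unknown") into one status."""
    decided = {v for v in verdicts.values() if v in ("sat", "unsat")}
    undecided = [v for v in verdicts.values() if v not in ("sat", "unsat")]
    if len(decided) > 1:
        return "conflict"        # solvers disagree: suspect the encoding or a solver bug
    if not decided:
        return "inconclusive"    # no solver reached an answer
    answer = next(iter(decided))
    if undecided or len(verdicts) < 2:
        return f"{answer} (unconfirmed)"   # no second solver backed the answer
    return f"{answer} (confirmed)"
```

Applied to the run above, z3 returning sat and cvc5 returning unknown comes out as "sat (unconfirmed)": a usable verdict, with an explicit note that the second engine did not back it.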

No GPU, no API key, no per-query cost. The binary that produced the export does not call any cloud API. The solver runs locally. The audit evidence is the file at /tmp/facts.smt2 plus the verdict — both reproducible byte-for-byte at any point in the future against the same obs.v0.1 snapshot.

What this is not

It is not "AI is bad for security." AI is excellent for the 10–20% of problems that genuinely require judgement, anomaly detection, or natural-language interpretation. The framing is: use the right tool for each class of problem.

It is not "formal verification solves everything." The verifier is correct relative to the model. If your model doesn't include the Cognito identity-pool trust pattern, the solver will not flag a Cognito identity-pool attack. But model coverage is enumerable; you can list what is and isn't covered. AI coverage is statistical; you cannot.

It is not "Z3 replaces your CSPM." CSPM and formal verification are complementary. CSPM tells you what's in the cloud right now. Formal verification tells you whether what's there satisfies your stated invariants. The CSPM dashboard is the inventory; the formal verifier is the auditor with the proof.

The structural shift

Cloud security has spent a decade paying language-model prices for boolean questions. The market matured around AI because no deterministic alternative existed that handled compound risk across services. That gap is closed. Stave's open-source policy library plus the SMT-LIB export plus a $0 solver replaces the deterministic-check layer at zero marginal cost.

The AI vendors who survive will be the ones who specialise in genuinely judgemental questions and welcome a free, deterministic foundation beneath them. The CSPM vendors who survive will be the ones who emit obs.v0.1-compatible inventories and let any open-source verifier consume them. The customers who survive are the ones who stop paying API fees for Principal: * checks.

When the insurance underwriter asks for proof, the customer hands them the SMT-LIB file and the solver verdict. The verifier ran on a laptop. The cost was zero. The evidence is independently reproducible.

That is what formal verification looks like in cloud security in 2026. The technique has been validated at scale by Microsoft Research (SecGuru, 2015) on Azure's datacentre network. Applying it to cloud service configurations is the only step that was missing. It is no longer missing.

Stave is an open-source intent verification engine for cloud infrastructure with 2,650+ controls, compound risk detection, and nine independent reasoning engines.

Zero-cost abstractions in Go: deleting your way to better code

Bala Paranj — Fri, 15 May 2026 12:09:37 +0000

The most impactful refactoring in a Go CLI wasn't adding code — it was deleting pass-through layers, thin wrappers, and premature frameworks. Here's how to recognize abstractions that cost more than they save.

Across 60 refactorings of a security CLI, the highest-ROI changes were deletions: abstractions that existed "just in case" and cost every reader cognitive overhead with zero runtime benefit.

Here are the abstractions we removed and why.

1. Pass-through packages

A package that exists only to forward calls to another package:

// internal/app/workflow/evaluate.go
package workflow

func Evaluate(input EvalInput) (Result, error) {
    return eval.Evaluate(input) // just forwards
}

Every command imported workflow instead of eval directly. The package had no logic, no transformation, no error handling. It was a phantom layer — it appeared in import paths, confused grep results, and added one more package to understand.

The fix: Delete the package. Rewire all callers to import eval directly. One commit, zero behavior change.

The rule: If a package only forwards calls, it shouldn't exist.

2. Thin wrapper functions

func RenderJSON(w io.Writer, v any) error {
    return jsonutil.WriteIndented(w, v)
}

func RenderText(w io.Writer, report Report) error {
    return textReporter.Render(w, report)
}

These wrappers added names but no behavior. Every caller could call the underlying function directly. The wrappers existed because "we might add logging later" — but we never did.

The fix: Inline the call at every call site. Delete the wrapper.

The principle: Don't create a function to wrap a single function call. The call itself is already readable.

3. Anemic files

We found 25 Go files with fewer than 20 lines of logic. Each contained a single type or a single function that belonged in its neighboring file.

types.go          → 1 type, 0 methods
constants.go      → 3 constants
helpers.go        → 1 helper function

Every file is a navigation decision. 25 anemic files means 25 wrong guesses when searching for code. Merging types.go into policy.go means the type lives next to the logic that uses it.

The fix: Merge into the logical neighbor. The 25 files disappeared entirely; their types moved into the files that used them.

The rule: Name files after the primary type they contain. Don't create types.go, utils.go, or helpers.go.

4. Premature generic frameworks

The most expensive deletion: a 500-line Pipeline[T] generic framework.

type Pipeline[T any] struct {
    stages []Stage[T]
}

func (p *Pipeline[T]) Run(ctx context.Context, input T) (T, error) {
    for _, stage := range p.stages {
        var err error
        input, err = stage.Execute(ctx, input)
        if err != nil {
            return input, err
        }
    }
    return input, nil
}

This was used in only one place — the evaluation pipeline. The stages were three sequential function calls. The framework added generics, interfaces, registration, and error handling for a problem that looked like this:

controls, err := loadControls(ctx, dir)
if err != nil { return err }

result, err := evaluate(controls, snapshots)
if err != nil { return err }

return writeOutput(result)

Three sequential calls in plain Go. No generics. No interfaces. No framework.

The fix: Delete the pipeline package. Replace with three sequential calls.

The principle: Three lines of sequential Go beats a 500-line generic fluent API. Every time.

5. Backward-compatibility aliases

After renaming invariant to control across 60 files, we kept type aliases for safety:

// Deprecated: use ControlID.
type InvariantID = ControlID

The aliases created confusion about which name was canonical. Grep returned both. Autocompletion showed both. New code used both names randomly.

The fix: Delete all aliases in the same commit as the rename. No transition period. The codebase has no external consumers — there is nobody to break.

The rule: If you have no external consumers, backward compatibility is debt.

6. Dead methods

After the hexagonal migration, deadcode analysis found 207 unreachable functions:

$ deadcode -test ./...
internal/core/controldef.(*Operand).AsBool
internal/core/controldef.(*Operand).AsString
internal/core/controldef.(*Operand).AsNumber
internal/core/controldef.(*Operand).IsZero
internal/core/evaluation.(*EvalContext).GetLogger
...

Methods that were written speculatively, methods left behind after a refactoring, methods duplicated across packages during a migration. All dead. All deleted.

The fix: Run deadcode -test ./... after every structural change. Delete what it finds. No exceptions.

The cost model

Every abstraction has a cost:

Abstraction	Cost per reader	Runtime benefit
Pass-through package	Import confusion, grep noise	Zero
Thin wrapper	Extra indirection to read	Zero
Anemic file	Navigation overhead	Zero
Unused generic framework	500 lines to understand	Zero
Type alias	Namespace pollution	Zero
Dead method	"Is this used?" investigation	Zero

If the runtime benefit column is zero, the abstraction is not zero-cost. It is pure cost.

When to delete

After every structural refactoring, ask:

Does this package forward calls without adding logic? Delete it.
Does this function wrap a single function call? Inline it.
Does this file contain fewer than 20 lines? Merge it.
Does this framework serve one use case? Replace with sequential code.
Does this type alias exist for transition safety? Delete it now.
Does deadcode find anything? Delete it all.

The real zero-cost abstraction

Go doesn't have zero-cost abstractions in the Rust sense. Every interface method call goes through dynamic dispatch. Every package adds compile time. Every file adds navigation cost. Every line adds reading time.

The only truly zero-cost abstraction in Go is the one you deleted.

The $0 cloud infrastructure security stack

Bala Paranj — Thu, 14 May 2026 11:02:07 +0000

Maya Kaczorowski documented Oblique's $0 security stack for code, email, logs, and devices. This is the companion piece: the $0 stack for cloud infrastructure — intent verification, compound risk detection, and formal safety proofs for AWS configurations, with nine independent reasoning engines.

Inspired by a $0 stack

Maya Kaczorowski recently wrote about Oblique's $0 security stack — world-class security tooling at zero cost. Semgrep for code analysis, TruffleHog for secret scanning, RunReveal for SIEM, Sublime for email, Apple Business for device management. All free or free-tier. All solving real problems. Her point is important: the excuse that security costs too much no longer holds.

Her article covers application security, email security, log aggregation, and device management. This article covers a different domain: cloud infrastructure security — verifying whether your AWS resources are configured safely, not just correctly.

The distinction matters. A configuration can be correct by every checklist and still be unsafe. Three individually-correct settings — an unauthenticated identity pool, a scoped IAM role, and a private PHI-tagged bucket — can compose into a path that lets anonymous users reach patient data. No individual check catches it because the vulnerability exists in the composition, not in any single setting. This stack solves that problem.
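The composition problem is easy to see in miniature. In the sketch below, each per-resource check passes while the cross-resource check fails. The dictionary fields and check names are hypothetical, chosen to mirror the three settings just described.

```python
def individual_findings(pool: dict, role: dict, bucket: dict) -> list[str]:
    """Single-resource checks, each looking at one setting in isolation."""
    findings = []
    if bucket["public_read"]:
        findings.append("bucket is publicly readable")
    if "*" in role["actions"]:
        findings.append("role allows every action")
    # An unauthenticated pool is a supported configuration, so the pool
    # alone produces no finding.
    return findings


def compound_finding(pool: dict, role: dict, bucket: dict):
    """Cross-resource check: does the composition expose PHI to anonymous users?"""
    reachable = (
        pool["allow_unauthenticated"]
        and pool["unauth_role"] == role["name"]
        and "s3:GetObject" in role["actions"]
        and bucket["name"] in role["resources"]
        and bucket["classification"] == "phi"
    )
    return "anonymous user can read PHI via identity pool" if reachable else None


pool = {"allow_unauthenticated": True, "unauth_role": "AppUnauthRole"}
role = {"name": "AppUnauthRole", "actions": ["s3:GetObject"],
        "resources": ["prod-phi"]}
bucket = {"name": "prod-phi", "public_read": False, "classification": "phi"}
```

The individual checks return an empty list for this configuration; the compound check returns a finding, because it is the only one that looks at how the three resources connect.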

The infrastructure security pipeline

Commercial cloud security posture management (Wiz, Orca, Prisma Cloud) starts at five figures annually. The open-source alternative costs nothing, and it does something commercial tools structurally cannot: verify that your configurations don't contradict your own declared intent, with mathematical proofs from SMT solvers of the kind used to verify safety-critical software.

Input              Evaluation           Reasoning              Downstream
─────              ──────────           ─────────              ──────────
Steampipe    →     Stave          →     9 External Engines →   Neo4j GDS
(cloud SQL)        (CEL predicates      Z3 / cvc5 / Yices      SIEM
                    + compound chains)  Soufflé / Clingo       SARIF
                                        Prolog / PySAT         Evidence bundles
                                        Risk / Game Theory

Each layer is independent. Replace any piece without touching the others.

Steampipe is the input layer. It queries your cloud APIs like a database — AWS, Azure, and GCP resources become SQL tables. Steampipe produces the inventory. It doesn't evaluate it.

Stave is the evaluation layer. It reads observation snapshots and evaluates them against 2,650+ controls across 74 AWS service domains. CEL predicates detect individual misconfigurations. Compound chains compose multiple findings into named attack paths. The evaluation is deterministic — same snapshot, same findings, every time.

Nine reasoning engines consume Stave's fact export independently. Z3 proves whether forbidden states are mathematically reachable. Soufflé enumerates blast radius and reachability paths. Clingo fires declarative violation rules. Each engine adds a reasoning dimension that CEL predicates structurally cannot express — quantification, graph traversal, satisfiability. The engines are external consumers, not internal components. Stave exports facts as JSONL triples or SMT-LIB assertions. The engines read them.

Downstream systems consume what Stave and the engines produce. Neo4j Community Edition provides graph analysis — centrality, shortest paths, choke points. SARIF output feeds IDEs. JSONL output feeds SIEMs. Signed evidence bundles feed compliance audits.

What scanners check vs. what Stave verifies

A scanner asks: "Is this setting correct?" It checks attributes on individual resources — encryption enabled, public access blocked, MFA enforced. These are necessary checks. They verify that individual nodes meet a baseline.

Stave asks a different question: "Do your configurations contradict what you declared?"

When you tag a bucket data_classification: phi, you're declaring intent: this contains patient records. When a Bedrock knowledge base indexes that bucket and a customer-facing agent serves the results without a guardrail, your infrastructure contradicts your declaration. Three individually-correct configurations compose into a violation of your own stated intent.

Scanners check 10 attributes per resource. With five layers of AWS security — IAM, SCPs, resource policies, VPC endpoints, trust relationships — a single bucket can exist in over 200 possible effective-access states. The other 190 stay unexamined. Stave collapses that state space through invariants: define which states are forbidden, prove they're unreachable.

The findings reflect this difference:

A scanner says: "S3 bucket is publicly accessible. Severity: High."

Stave says:

CHAIN: bedrock_rag_phi_exposure (CRITICAL)

  CTL.COGNITO.UNAUTH.ACCESS.001
    Identity pool allows unauthenticated access

  CTL.IAM.ROLE.MAPPED.BROAD.001
    Mapped role has s3:GetObject on PHI bucket

  CTL.BEDROCK.AGENT.GUARDRAIL.001
    Agent has no content-filtering guardrail

  Compound: anonymous internet user can reach
  patient health records through the RAG pipeline.
  Three settings, three clean individual checks,
  one CRITICAL attack path.

The scanner found zero issues. Stave found the composition that makes the system unsafe.

What 2,650 controls cover

The catalog spans 74 AWS service domains. The controls that matter most are the families that detect structural patterns no individual check can see.

32 AI agent identity controls cover Bedrock agents, SageMaker pipelines, and Lambda tool chains. Agent execution role overprivilege, missing guardrails, ghost action groups referencing deleted Lambda functions, shadow agents created outside IaC, knowledge base data boundary violations. Five compound chains compose these into attack paths: agent-to-PHI exposure through RAG pipelines, cross-account training data access, shadow agent credential theft.

Shadow admin detection catches IAM roles that accumulated permissions beyond their declared scope. A role named S3-ReadOnly that can retrieve secrets, invoke Lambda functions, and enumerate the full IAM inventory. Five controls check permission drift (unused service ratio via Access Advisor), category mixing (data access combined with IAM write), and intent mismatch (permissions contradict the declared role-type tag). Two compound chains fire when the pattern is systemic.

Vendor delegation governance checks whether vendors with access to your S3 buckets have exceeded their declared scope, whether their access review is overdue, whether they can make your bucket public, and whether you can revoke their access. Five controls, one compound chain at threshold 3-of-5 — a single overdue review is a reminder, three concurrent failures is a systemic governance breakdown.
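The 3-of-5 threshold can be pictured as a simple count over failing controls. The following is a minimal Go sketch under stated assumptions: the control IDs and the chainFires helper are hypothetical names chosen for illustration, not Stave's actual identifiers or API.

```go
package main

import "fmt"

// chainFires reports whether at least threshold of the chain's member
// controls are currently failing. failing maps a control ID to true when
// that control has an open finding.
func chainFires(members []string, failing map[string]bool, threshold int) bool {
	count := 0
	for _, id := range members {
		if failing[id] {
			count++
		}
	}
	return count >= threshold
}

func main() {
	// Hypothetical control IDs, for illustration only.
	members := []string{
		"VENDOR.SCOPE.EXCEEDED",
		"VENDOR.REVIEW.OVERDUE",
		"VENDOR.CAN.MAKE.PUBLIC",
		"VENDOR.ACCESS.IRREVOCABLE",
		"VENDOR.PRINCIPAL.UNKNOWN",
	}

	oneFailure := map[string]bool{"VENDOR.REVIEW.OVERDUE": true}
	threeFailures := map[string]bool{
		"VENDOR.SCOPE.EXCEEDED":     true,
		"VENDOR.REVIEW.OVERDUE":     true,
		"VENDOR.ACCESS.IRREVOCABLE": true,
	}

	fmt.Println("one failure fires chain:", chainFires(members, oneFailure, 3))
	fmt.Println("three failures fire chain:", chainFires(members, threeFailures, 3))
}
```

An all-of chain is the special case where the threshold equals the number of member controls.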

23 ghost reference controls detect policies that reference resources absent from the snapshot — dangling IAM trust policies pointing at deleted accounts, Cognito triggers referencing deleted Lambda functions, S3 policies granting access to buckets that no longer exist. An attacker who recreates the deleted resource under the same name inherits every permission the dangling policy still grants.
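At its core, a ghost reference check is a set-membership test: every target named by a policy must exist in the snapshot. A minimal Go sketch, assuming a hypothetical Reference type and resource naming (Stave's real controls are CEL predicates and may work differently):

```go
package main

import (
	"fmt"
	"sort"
)

// Reference is a hypothetical edge: a policy that names a target resource.
type Reference struct {
	Policy string
	Target string
}

// danglingReferences returns every reference whose target is absent from
// the snapshot, sorted by policy name for stable output.
func danglingReferences(present map[string]bool, refs []Reference) []Reference {
	var ghosts []Reference
	for _, r := range refs {
		if !present[r.Target] {
			ghosts = append(ghosts, r)
		}
	}
	sort.Slice(ghosts, func(i, j int) bool { return ghosts[i].Policy < ghosts[j].Policy })
	return ghosts
}

func main() {
	// Resources observed in the snapshot (illustrative names).
	present := map[string]bool{
		"lambda/resize-image": true,
		"bucket/app-logs":     true,
	}
	refs := []Reference{
		{Policy: "cognito-trigger/pre-signup", Target: "lambda/validate-user"}, // deleted
		{Policy: "s3-policy/log-writer", Target: "bucket/app-logs"},
	}
	for _, g := range danglingReferences(present, refs) {
		fmt.Printf("ghost reference: %s -> %s\n", g.Policy, g.Target)
	}
}
```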

47 credential TTL controls across IAM rotation, token expiry, certificate lifecycle, Secrets Manager rotation, and KMS key management. The Time-Bound Credential Invariant checks not just that a TTL is declared, but that the TTL hasn't elapsed — the difference between "rotation is configured" and "rotation actually happened."
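The difference between "a TTL is declared" and "the TTL hasn't elapsed" fits in a few lines. This Go sketch is illustrative only: the Credential fields, the status strings, and the fixed evaluation time demoNow are assumptions, not Stave's schema.

```go
package main

import (
	"fmt"
	"time"
)

// Credential is a hypothetical record; the field names are illustrative.
type Credential struct {
	Name        string
	LastRotated time.Time
	TTLDays     int // 0 means no TTL was declared
}

type ttlStatus string

const (
	ttlMissing ttlStatus = "no TTL declared"
	ttlElapsed ttlStatus = "TTL elapsed"
	ttlOK      ttlStatus = "within TTL"
)

// demoNow is a fixed evaluation time so the example is deterministic.
var demoNow = time.Date(2026, time.May, 14, 0, 0, 0, 0, time.UTC)

// daysBefore returns the instant n days before t.
func daysBefore(t time.Time, n int) time.Time {
	return t.AddDate(0, 0, -n)
}

// checkTTL distinguishes "rotation is configured" from "rotation happened":
// a declared TTL only passes if the deadline has not been reached.
func checkTTL(c Credential, now time.Time) ttlStatus {
	if c.TTLDays <= 0 {
		return ttlMissing
	}
	deadline := c.LastRotated.AddDate(0, 0, c.TTLDays)
	if now.After(deadline) {
		return ttlElapsed
	}
	return ttlOK
}

func main() {
	creds := []Credential{
		{Name: "ci-access-key", LastRotated: daysBefore(demoNow, 200), TTLDays: 90},
		{Name: "deploy-key", LastRotated: daysBefore(demoNow, 10), TTLDays: 90},
		{Name: "legacy-key", LastRotated: daysBefore(demoNow, 400)},
	}
	for _, c := range creds {
		fmt.Printf("%s: %s\n", c.Name, checkTTL(c, demoNow))
	}
}
```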

Temporal analysis treats time as a built-in dimension. Drift detection compares snapshots and flags configuration changes. Duration tracking measures how long a misconfiguration has persisted — the same public bucket at 6 hours and 6 months tells different stories. Observation freshness detects stale snapshots — findings based on data that's 90 days old need different urgency than findings based on data from today.

Nine engines, one fact export

Stave evaluates controls with CEL predicates — the primary detection mechanism. But CEL has expressivity limits: it can't quantify over lists, traverse reachability graphs, or prove satisfiability of combined assertions. The nine external engines fill those gaps.

The fact export is the interface. Stave projects observation properties into JSONL triples (subject-predicate-object) and SMT-LIB assertions. 44 scalar predicates and 6 per-element array projectors cover AI agents, VPC peering, EC2 instance profiles, IAM role drift, and S3 delegation. Every engine reads the same facts and produces independent analysis.
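To make the triple format concrete, here is a small Go sketch that writes subject-predicate-object facts as JSONL. The JSON field names and the subject strings are assumptions for illustration; only the predicate names has_unused_service and has_delegated_principal come from this article, and the real export layout may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Triple is a hypothetical subject-predicate-object fact.
// The JSON field names are assumed, not taken from Stave's export.
type Triple struct {
	Subject   string `json:"subject"`
	Predicate string `json:"predicate"`
	Object    any    `json:"object"`
}

// toJSONL encodes each triple as one JSON object per line.
func toJSONL(facts []Triple) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf) // Encode appends a newline after each value
	for _, f := range facts {
		if err := enc.Encode(f); err != nil {
			return "", err
		}
	}
	return buf.String(), nil
}

func main() {
	facts := []Triple{
		{Subject: "role/S3-ReadOnly", Predicate: "has_unused_service", Object: "lambda"},
		{Subject: "bucket/phi-records", Predicate: "has_delegated_principal", Object: "arn:aws:iam::111122223333:role/vendor"},
	}
	out, err := toJSONL(facts)
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

One fact per line means a consumer can stream the file and load each line independently, which is what lets several engines read the same export.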

Z3 / cvc5 / Yices (SMT solvers) prove whether forbidden states are reachable. "Can an anonymous user reach PHI data through any combination of identity pool, role mapping, and bucket policy?" The answer is sat (reachable, therefore unsafe) or unsat (mathematically impossible, therefore proven safe). Z3 comes from Microsoft Research, cvc5 from Stanford and the University of Iowa, and Yices from SRI International; solvers of this class are used to verify safety-critical software and hardware designs. The proof is deterministic and independently reproducible. Microsoft has used the same approach to verify firewall rules in Azure.

Soufflé (Datalog) enumerates reachability paths and counts blast radius. "How many resources can this compromised role reach?" "Which vendor principals have excessive delegation reach on this bucket?" Per-element facts — has_unused_service(role, "lambda"), has_delegated_principal(bucket, principal_arn) — enable queries that name specific services and specific principals, not just booleans.
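Blast radius is transitive reachability over "can access" edges. Soufflé expresses that as a recursive Datalog rule; the Go sketch below is a stand-in that computes the same count with a breadth-first search, using a made-up edge map rather than Stave's exported facts.

```go
package main

import "fmt"

// blastRadius returns how many distinct resources are reachable from start
// by following edges transitively. The start node itself is not counted.
func blastRadius(edges map[string][]string, start string) int {
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		node := queue[0]
		queue = queue[1:]
		for _, next := range edges[node] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
	return len(seen) - 1
}

func main() {
	// Illustrative "can reach" edges; the cycle back to role-a is deliberate.
	edges := map[string][]string{
		"role-a":   {"bucket-1", "lambda-x"},
		"lambda-x": {"secret-1", "role-a"},
	}
	fmt.Println("resources reachable from role-a:", blastRadius(edges, "role-a"))
}
```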

Clingo (Answer Set Programming) fires declarative violation rules. Shipped rules cover AI agent patterns (broad Lambda + no guardrail), delegation violations (unknown principal, scope exceeded, irrevocable access), shadow admin signals (incompatible categories + unused services), VPC peering exposure, and shadow EC2 lateral movement. Every remediated fixture produces zero violations — verified end-to-end.

Prolog derives proof trees showing step-by-step reasoning: attacker → shadow account → trusted role → production data. PySAT checks boolean satisfiability on multi-control compounds. Risk model computes exploitation probability. Game theory quantifies attacker cost vs. defender remediation ROI. TLA+ checks temporal safety — how many configuration changes separate the current state from an unsafe state.

Each engine finds a different class of issue on the same fact set. That's breadth without tool sprawl — nine reasoning dimensions from one data export.

Where this fits in a security program

A security program has multiple layers. Each addresses a different domain. Each can be $0 independently:

Layer                    Domain                  $0 tools
─────                    ──────                  ────────
Application security     Code + dependencies     Semgrep, TruffleHog
Infrastructure security  Cloud configuration     Steampipe, Stave, Neo4j CE
Detection & response     Logs + runtime events   RunReveal
Email security           Phishing + BEC          Sublime Security
Device management        Endpoints               Apple Business

The infrastructure layer is the one most startups skip because commercial CSPM starts at five figures. The pipeline above is $0 and covers 2,650+ controls with formal verification from engines designed for safety-critical systems.

Setting it up

Step 1: Install Steampipe and the AWS plugin.

brew install turbot/tap/steampipe
steampipe plugin install aws

Steampipe reads your AWS credentials from the standard locations. No additional configuration needed.

Step 2: Verify your inventory.

select name, region, bucket_policy_is_public
from aws_s3_bucket
where bucket_policy_is_public = true;

If this returns results, you have public S3 buckets. Steampipe makes cloud inventory queryable in seconds.

Step 3: Try the demo (30 seconds, no AWS account needed).

git clone https://github.com/sufield/stave.git
cd stave
bash examples/demo-ai-security/run.sh

The demo shows the full pipeline on built-in fixtures: 5 AI agent findings → 3 CRITICAL compound chains → remediation → clean. No cloud credentials required.

Step 4: Run against your own snapshots.

stave apply --observations ./my-snapshots

Stave evaluates every observation against 2,650+ controls. Compound chains fire when multiple findings compose into attack paths on the same resource.

Step 5: Export to reasoning engines.

# JSONL triples for Soufflé / Clingo
stave export-sir --format jsonl --observations ./my-snapshots > facts.jsonl

# SMT-LIB assertions for Z3 / cvc5
stave export-sir --format smt2 --observations ./my-snapshots > facts.smt2

Step 6: Track drift over time.

stave diff --before ./snapshot-march --after ./snapshot-april

Compares snapshots and flags configuration changes with before-and-after state.
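Conceptually, drift detection is a keyed comparison of two property maps. A minimal Go sketch, assuming flattened string properties and a hypothetical Change type (this is not the implementation behind stave diff):

```go
package main

import (
	"fmt"
	"sort"
)

// Change records one property whose value differs between two snapshots.
type Change struct {
	Path   string
	Before string
	After  string
}

// diffProperties compares two flattened property maps and returns the
// changed paths in sorted order. A missing property is treated as the
// empty string, so additions and removals also show up as changes.
func diffProperties(before, after map[string]string) []Change {
	paths := map[string]bool{}
	for p := range before {
		paths[p] = true
	}
	for p := range after {
		paths[p] = true
	}
	var changes []Change
	for p := range paths {
		if before[p] != after[p] {
			changes = append(changes, Change{Path: p, Before: before[p], After: after[p]})
		}
	}
	sort.Slice(changes, func(i, j int) bool { return changes[i].Path < changes[j].Path })
	return changes
}

func main() {
	march := map[string]string{"storage.access.public_read": "false", "storage.kind": "bucket"}
	april := map[string]string{"storage.access.public_read": "true", "storage.kind": "bucket"}
	for _, c := range diffProperties(march, april) {
		fmt.Printf("%s: %q -> %q\n", c.Path, c.Before, c.After)
	}
}
```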

Step 7: Export graph for Neo4j.

stave graph export --format graphml --observations ./my-snapshots > graph.graphml

Load into Neo4j Community Edition for centrality analysis, shortest-path computation, and effective permission reasoning.

What you get for $0

The input layer (Steampipe)

SQL queries over AWS, Azure, and GCP resources
150+ plugins covering every major cloud provider and SaaS service

The evaluation layer (Stave)

2,650+ controls across 74 AWS service domains
30+ compound chains composing individual findings into named attack paths
32 AI agent identity controls (Bedrock, SageMaker, Lambda tool chains)
Shadow admin detection (permission drift + category mixing + intent mismatch)
Vendor delegation governance (scope, lifecycle, revocability, escalation)
23 ghost reference controls detecting dangling policies after resource deletion
47 credential TTL controls with elapsed-TTL verification
Temporal analysis: drift detection, duration tracking, observation freshness
Intent verification: findings fire when configurations contradict declared tags, role-type taxonomies, and vendor registries

The reasoning layer (9 engines)

Z3 / cvc5 / Yices: mathematical proofs of safety or reachability
Soufflé: blast radius enumeration and reachability path counting
Clingo: declarative violation rules across 5 attack pattern families
Prolog: step-by-step proof trees for attack path derivation
PySAT / Risk / Game Theory / TLA+: satisfiability, probability, cost, temporal safety

The insights layer (downstream)

Neo4j GDS: graph centrality, shortest paths, choke point identification
SARIF: IDE integration for developer-visible findings in pull requests
JSONL: SIEM ingestion for correlation with runtime events
Evidence bundles: signed compliance archives

The commercial equivalent of this pipeline costs $25,000–$100,000+ annually. The open-source version costs nothing, includes formal verification from SMT solvers of the kind used on safety-critical systems, and verifies intent, not just configuration, across 74 service domains.

Infrastructure security doesn't have to cost anything

The reason most startups skip cloud security posture management is that the tools cost more than their entire infrastructure spend. A startup paying $500/month for AWS can't justify $50,000/year for a CSPM tool. So they don't. Their S3 buckets stay public, their IAM roles accumulate permissions nobody reviews, their deleted resources leave ghost references nobody detects, and their AI agents serve PHI through RAG pipelines nobody verified.

The open-source pipeline removes this barrier. Steampipe + Stave + nine reasoning engines + Neo4j costs nothing and provides capabilities commercial CSPM tools structurally cannot: intent verification against operator declarations, compound risk detection across service boundaries, mathematical safety proofs from SMT solvers, and temporal analysis that tracks how configurations evolve over time.

Maya Kaczorowski showed that application security, email security, log aggregation, and device management can all be $0. This article shows the same is true for cloud infrastructure security — and that the $0 version doesn't just match the commercial tools. On compound risk detection, intent verification, and formal safety proofs, it goes beyond them.

Three tools. Nine engines. Zero dollars. A pipeline, not a product.

Stave is an open-source intent verification engine for cloud infrastructure with 2,650+ controls, compound risk detection, and nine independent reasoning engines. Steampipe is an open-source tool for querying cloud APIs via SQL. Together with Neo4j for graph insights, they form the $0 infrastructure security pipeline.

Your Go Golden Tests Don't Need to Regenerate Everything

Bala Paranj — Wed, 13 May 2026 10:55:51 +0000

A practical pattern for targeted golden file regeneration in Go projects — from minutes to 0.27 seconds.

I have 5,810 golden files in my project. Every time I changed one test, I was regenerating all of them. It took minutes. Now it takes 0.27 seconds.

The fix was just organizing the regeneration path so you could aim it at one file instead of firing at everything.

The problem with regenerate everything

Golden tests are great. You capture known-good output, save it to a file, and compare against it on every run. When the output changes intentionally, you regenerate the golden file.

Most Go projects start with a simple approach: a Makefile target that regenerates all golden files at once.

make regenerate-goldens

This works when you have 20 golden files. When you have 5,810, it doesn't. You change one test, you wait for the tool to process every fixture directory, and most of the time nothing else changed. You're wasting time to update one file.

Two kinds of golden tests

Before fixing anything, I audited how golden files worked in the codebase. I found two completely different mechanisms hiding behind the same word.

In-process goldens — a test function renders output, compares it against a .golden or golden.json file in testdata/. The test itself can write the file if you ask it to. I had 3 of these.

E2e fixture goldens — an external tool runs the compiled binary against fixture directories and captures stdout into expected.* files. The test reads those files and compares. I had 5,807 of these.

These need different regeneration strategies. Trying to unify them under one mechanism would either duplicate the external tool inside the test (pointless) or force the external tool to understand in-process test output (brittle).

The in-process pattern: UPDATE_GOLDEN env var

For the 3 in-process golden tests, I added a small helper to the existing test utilities package.

package testutil

import (
    "bytes"
    "os"
    "path/filepath"
    "testing"

    "github.com/google/go-cmp/cmp"
)

func UpdateGolden() bool {
    return os.Getenv("UPDATE_GOLDEN") != ""
}

func AssertGolden(t *testing.T, path string, got []byte) {
    t.Helper()

    if UpdateGolden() {
        writeIfChanged(t, path, got)
    }

    want, err := os.ReadFile(path)
    if err != nil {
        t.Fatalf("read golden file %s: %v\nRun with UPDATE_GOLDEN=1 to create it", path, err)
    }

    if diff := cmp.Diff(string(want), string(got)); diff != "" {
        t.Fatalf("golden mismatch %s (-want +got):\n%s\nRun with UPDATE_GOLDEN=1 to update", path, diff)
    }
}

func writeIfChanged(t *testing.T, path string, data []byte) {
    t.Helper()

    old, err := os.ReadFile(path)
    if err == nil && bytes.Equal(old, data) {
        return
    }

    if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
        t.Fatalf("create golden dir: %v", err)
    }
    if err := os.WriteFile(path, data, 0644); err != nil {
        t.Fatalf("write golden file %s: %v", path, err)
    }
    t.Logf("updated golden file: %s", path)
}

Usage in a test:

func TestTextReporter_Golden(t *testing.T) {
    got := renderReport(buildFixture())
    testutil.AssertGolden(t, "testdata/reports/hipaa_golden.txt", []byte(got))
}

Regenerate just that one test:

UPDATE_GOLDEN=1 go test ./internal/profile/reporter -run TestTextReporter_Golden

0.27 seconds.

Why an env var instead of a flag

My first version used flag.Bool("update", ...). The problem: Go test flags are per-package. If you define -update in one package and run go test ./... -update, every other package fails because it doesn't recognize the flag.

An environment variable works across all packages without any registration.

# This works for any package, any test
UPDATE_GOLDEN=1 go test ./path/to/package -run TestWhatever

Why writeIfChanged matters

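Without it, every UPDATE_GOLDEN=1 go test ./... run rewrites every golden file and bumps its timestamp, even if the content didn't change. Git compares content, so an identical rewrite won't appear as a diff, but the fresh timestamps still retrigger file watchers and anything else keyed on modification time, and the test log fills with "updated" lines for files that didn't change. writeIfChanged reads the file first, compares bytes, and skips the write if nothing changed. Five lines that keep updates limited to the files that actually changed.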

The e2e pattern: wrap what already exists

For the 5,807 fixture goldens, I already had a regeneration tool (regengoldens) that accepted a -filter regex. The problem was discoverability. Nobody remembered the flag syntax.

I added a Makefile target:

.PHONY: golden-fixture
golden-fixture:
    @test -n "$(FILTER)" || (echo "Usage: make golden-fixture FILTER=<regex>" && exit 1)
    $(MAKE) regenerate-goldens ARGS='-filter $(FILTER)'

Now regenerating one fixture set:

make golden-fixture FILTER=hipaa

No syntax to remember. Tab-completable.

The full Makefile surface

# In-process goldens
.PHONY: golden-update-all
golden-update-all:
    UPDATE_GOLDEN=1 go test ./...

.PHONY: golden-update
golden-update:
    @test -n "$(PKG)" || (echo "Usage: make golden-update PKG=./path/to/pkg/..." && exit 1)
    UPDATE_GOLDEN=1 go test $(PKG)

.PHONY: golden-one
golden-one:
    @test -n "$(PKG)" || (echo "Usage: make golden-one PKG=./path/to/pkg/... RUN=TestName" && exit 1)
    @test -n "$(RUN)" || (echo "Usage: make golden-one PKG=./path/to/pkg/... RUN=TestName" && exit 1)
    UPDATE_GOLDEN=1 go test $(PKG) -run '$(RUN)'

# E2e fixture goldens
.PHONY: golden-fixture
golden-fixture:
    @test -n "$(FILTER)" || (echo "Usage: make golden-fixture FILTER=<regex>" && exit 1)
    $(MAKE) regenerate-goldens ARGS='-filter $(FILTER)'

Two mechanisms, one discoverable surface. grep golden Makefile tells you everything.

What I didn't do

I didn't unify the two mechanisms. The in-process tests and e2e fixture tests have different architectures. Forcing them into one pattern would mean either duplicating the external regeneration tool inside Go tests or making the tool understand in-process rendering. Both are worse than having two clear paths.

I didn't add subtests where they weren't needed. The original plan called for converting flat tests to subtests for -run targeting. In practice, my 3 in-process golden tests were either single-case or already subtested. Converting for the sake of a pattern would have been churn.

I didn't migrate one test that would have changed golden content. One golden file was a hand-ordered JSON map. json.MarshalIndent sorts keys alphabetically, so running AssertGolden with UPDATE_GOLDEN=1 would have silently reordered the file. The rule I set was: the migration changes the mechanism, not the content. If any golden file changes, something is wrong. I left that test alone and documented why.
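The key reordering is easy to reproduce. This short Go sketch (the render helper is a name introduced here for illustration) marshals a map built in non-alphabetical order and shows that encoding/json emits the keys sorted:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// render marshals a map with two-space indentation. encoding/json sorts
// map keys, so the insertion order of the map literal is not preserved.
func render(m map[string]int) string {
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		panic(err)
	}
	return string(out)
}

func main() {
	handOrdered := map[string]int{"zeta": 1, "alpha": 2, "mid": 3}
	fmt.Println(render(handOrdered))
	// Keys come out as alpha, mid, zeta regardless of how the map was built.
}
```

A golden file whose key order carries meaning therefore needs either an ordered representation (a slice of key/value pairs) or, as here, to stay outside the regenerating helper.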

The verification that matters

After the migration, this is the check:

UPDATE_GOLDEN=1 go test ./...
git diff --name-only -- '*golden*'

The diff must be empty. If any golden file changed during migration, the helper introduced a difference — a trailing newline, an encoding change, a key reordering. That's a bug in the migration, not a legitimate update.

Results

Before: change one test, run make regenerate-goldens, wait minutes.

After: change one test, run UPDATE_GOLDEN=1 go test ./pkg/foo -run TestBar, wait 0.27 seconds.

The approach is boring. Env var, a 40-line helper, a few Makefile targets. Nothing novel. But the development loop went from "avoid touching golden tests" to "golden tests are free to change."

This speedup was applied to the Stave codebase. Stave is an offline configuration safety evaluator.

9 Go Performance Patterns That Don't Need a Profiler to Find

Bala Paranj — Tue, 12 May 2026 11:26:16 +0000

Pre-allocated slices, bitmask operations, index-based iteration, strings.Builder with Grow, switch-over-map hot paths, defensive cloning, sync.Once caching, WalkDir over Walk, and defined types avoiding string conversion: patterns visible in code review.

Most performance advice starts with "run a profiler." That's correct for micro-optimization. But the patterns in this article are visible in code review. You don't need a flamegraph to know that copying a 408-byte struct in a loop 10,000 times is slower than taking a pointer to it.

These are the patterns we applied across a 50,000-line Go security CLI. Each one has a reason that doesn't require benchmarks to justify.

1. Pre-Allocate Slices and Maps

The Cost of Growth

// BAD: append grows the backing array 5+ times for 100 elements
var results []Finding
for _, asset := range assets {
    if isViolation(asset) {
        results = append(results, buildFinding(asset))
    }
}

Each time append exceeds the slice capacity, Go allocates a new backing array (roughly 2x the current size for small slices, tapering toward 1.25x for large ones), copies all existing elements, and lets the old array be garbage collected. For 100 elements starting from zero, the capacity roughly doubles from 1 up to 128: about 8 allocations and 7 copies.

The Fix

// GOOD: one allocation, zero copies
results := make([]Finding, 0, len(assets))  // cap = upper bound
for _, asset := range assets {
    if isViolation(asset) {
        results = append(results, buildFinding(asset))
    }
}

Same pattern for maps:

// BAD: map grows through multiple rehashes
baseMap := make(map[string]Finding)

// GOOD: pre-sized to avoid rehashing
baseMap := make(map[string]Finding, len(baseline))

The capacity hint doesn't need to be exact. For a filtered slice it's an upper bound: make([]T, 0, len(input)) is correct even if the filter removes 90% of elements. The wasted capacity is the unused tail of a single backing array, which is usually cheaper than the allocation churn.
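You can watch the reallocations happen by tracking cap across appends. A minimal Go sketch (countGrowths is a helper name introduced here, not something from the codebase):

```go
package main

import "fmt"

// countGrowths appends n ints to s and returns how many times the backing
// array was reallocated, observed as a change in cap(s).
func countGrowths(s []int, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	fmt.Println("growths without a hint:", countGrowths(nil, 100))
	fmt.Println("growths with cap=100: ", countGrowths(make([]int, 0, 100), 100))
}
```

The exact count without a hint depends on the Go version and element size, which is why the sketch prints it rather than assuming a number. With the hint it is zero.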

Where to Apply

Pre-allocate when:

You know the upper bound (input length, map size)
The collection is built in a loop
The collection survives the function (returned, stored in a struct)

Skip when:

The collection is tiny (fewer than 8 elements: a map that small fits in a single bucket, and a slice needs only a few cheap growth steps)
The upper bound is unknown or enormous
You're building a result from streaming data

2. Index-Based Iteration for Large Structs

The Cost of Range-Value Copy

// BAD: copies 408 bytes per iteration
for _, f := range findings {
    process(f)
}

Go's for _, v := range copies the element into v on every iteration. For a Finding struct (408 bytes), iterating over 1,000 findings copies 408 KB of data — just to read each element.

The Fix

// GOOD: zero copy — pointer to element in existing slice memory
for i := range findings {
    f := &findings[i]
    process(f)
}

&findings[i] takes a pointer to the element in the slice's backing array. No copy. The pointer is 8 bytes regardless of struct size.
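The copy is observable, not just a cost: a write through the range value never reaches the slice. A short Go sketch with a stand-in Finding type (the field layout here is hypothetical and only pads the struct to a large size):

```go
package main

import "fmt"

// Finding is a stand-in for a large struct; Payload only adds bulk.
type Finding struct {
	ID      string
	Payload [50]uint64
	Seen    bool
}

// markByValue writes to f, which is a fresh copy of the element on every
// iteration, so the slice is left unchanged.
func markByValue(fs []Finding) {
	for _, f := range fs {
		f.Seen = true
		_ = f // keep the copy explicitly used
	}
}

// markByIndex takes a pointer into the backing array, so the write lands
// in the slice and no element is copied.
func markByIndex(fs []Finding) {
	for i := range fs {
		f := &fs[i]
		f.Seen = true
	}
}

func main() {
	fs := make([]Finding, 3)

	markByValue(fs)
	fmt.Println("after range-by-value:", fs[0].Seen)

	markByIndex(fs)
	fmt.Println("after index + pointer:", fs[0].Seen)
}
```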

When It Matters

The Go compiler may optimize small structs (< 64 bytes) to registers. For structs above 128 bytes, the copy is measurable. We set the rangeValCopy gocritic threshold to 128 bytes and fixed 70 loops across the codebase.

Types we fixed:

Type                            Size   Loop Count
────                            ────   ──────────
remediation.Finding             408B   15 loops
policy.ControlDefinition        304B   12 loops
remediation.RemediationFinding  352B   8 loops
securityaudit.Finding           160B   5 loops
risk.Item                       152B   4 loops
s3/policy.Statement             128B   3 loops

Do NOT change []T to []*T. Pointer slices destroy cache locality — each element is a separate heap allocation that the CPU prefetcher can't predict. Keep contiguous []T memory; access individual elements by index.

3. Switch Over Map for Hot-Path Dispatch

The Cost of Map + Closures

// BAD: map lookup + two indirect closure calls on every evaluation
var operators = map[Operator]operatorFunc{
    OpEq:  handled(func(exists bool, f, m any) bool { return exists && EqualValues(f, m) }),
    OpNe:  handled(func(exists bool, f, m any) bool { return !exists || !EqualValues(f, m) }),
    // ... 14 operators
}

func EvaluateOperator(op Operator, exists bool, val, compare any) (bool, bool) {
    fn, ok := operators[op]
    if !ok { return false, false }
    return fn(exists, val, compare)  // 2 levels of function indirection
}

Three costs: (1) hash the operator string and probe the map, (2) call through the closure returned by handled(), (3) call through the inner closure it wraps. The closures are allocated once at package init, but every evaluation still pays the lookup plus two indirect calls. On the hot path (43 controls × 50 assets × 10 snapshots), that's 21,500 map lookups with closure calls.

The Fix

// GOOD: compiler-generated jump table, zero allocation
func EvaluateOperator(op Operator, exists bool, val, compare any) (res bool, handled bool) {
    handled = true
    switch op {
    case OpEq:
        res = exists && EqualValues(val, compare)
    case OpNe:
        res = !exists || !EqualValues(val, compare)
    case OpGt:
        res = exists && GreaterThan(val, compare)
    // ... 14 cases
    default:
        return false, false
    }
    return res, true
}

For a switch this size the compiler can emit a jump table (Go 1.19+ on amd64 and arm64) or a binary search over the cases. Either way dispatch stays cheap: no hashing, no closures, and no function pointer indirection. The calls in each case are direct, so they are candidates for inlining, which calls through function values stored in a map generally are not.

4. Bitmask Operations for Permission Analysis

The Cost of Slice-Based Tracking

// BAD: O(n) contains check for each permission
type Permissions []string

func (p Permissions) HasRead() bool {
    return slices.Contains(p, "s3:GetObject") || slices.Contains(p, "s3:*") || slices.Contains(p, "*")
}

Three linear scans per check. For a policy with 20 actions, that's 60 string comparisons to answer "does this grant read access?"

The Fix

// GOOD: O(1) bitmask check
type ActionMask uint8

const (
    ActionRead    ActionMask = 1 << iota  // s3:GetObject
    ActionWrite                           // s3:PutObject
    ActionDelete                          // s3:DeleteObject
    ActionList                            // s3:ListBucket
    actionACLWrite                        // s3:PutBucketAcl (unexported)
)

func (m ActionMask) has(flag ActionMask) bool {
    return m&flag != 0
}

func (s Statement) ResolveActions() (ActionMask, int) {
    var mask ActionMask
    for _, action := range s.Action {
        a := strings.ToLower(action)
        switch {
        case a == "*" || a == "s3:*":
            mask |= ActionRead | ActionWrite | ActionDelete | ActionList | actionACLWrite
        case a == "s3:getobject":
            mask |= ActionRead
        case a == "s3:putobject":
            mask |= ActionWrite
        // ...
        }
    }
    return mask, bits.OnesCount8(uint8(mask))
}

// Check: one bitwise AND, constant time
func (s Statement) GrantsReadAccess() bool {
    mask, _ := s.ResolveActions()
    return mask.has(ActionRead)
}

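Parse the action list once into a bitmask and keep the result. GrantsReadAccess above re-resolves on every call for brevity; cache the mask when checks repeat. Every subsequent permission check is then a single bitwise AND: constant time, zero allocation, fits in a register.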

5. strings.Builder with Pre-Allocation

The Cost of String Concatenation

// BAD: each += allocates a new string
result := ""
for _, finding := range findings {
    result += finding.ControlID.String() + ": " + finding.Evidence.TemporalRisk + "\n"
}

Go strings are immutable. Each += allocates a new backing array and copies the accumulated string. For 100 findings with 50-character lines, that's 100 allocations copying an average of 2.5 KB each.

The Fix

// GOOD: one allocation, zero copies
var b strings.Builder
b.Grow(len(findings) * 64)  // estimate: 64 bytes per finding

for i := range findings {
    f := &findings[i]
    b.WriteString(f.ControlID.String())
    b.WriteString(": ")
    b.WriteString(f.Evidence.TemporalRisk)
    b.WriteByte('\n')
}
result := b.String()

strings.Builder writes to a growing byte buffer. Grow(n) pre-allocates n bytes so the buffer never needs to resize if your estimate is close. The final String() converts the buffer to a string without copying (Go 1.10+).

WriteString over fmt.Fprintf: WriteString is a direct memcpy. Fprintf parses a format string, allocates for the format args, and calls reflection for %v. For simple concatenation, WriteString is 5-10x faster.

The Grow Estimate

b.Grow(5 * 1024)  // 5 KB for a markdown report
b.Grow(256)        // 256 bytes for a diagnostic error message
b.Grow(len(items) * 80)  // 80 bytes per line estimate

Over-estimating is fine — the wasted capacity is a few KB. Under-estimating triggers a reallocation (still better than +=). The rule: estimate the total output size from the input size and allocate once.
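Whether an estimate was good enough can be checked directly, because strings.Builder exposes its capacity. A minimal Go sketch (buildLines is a hypothetical helper for illustration) that reports whether the buffer had to grow past the initial reservation:

```go
package main

import (
	"fmt"
	"strings"
)

// buildLines joins lines with newlines after reserving estimate bytes. It
// also reports whether the buffer had to grow beyond that reservation.
func buildLines(lines []string, estimate int) (string, bool) {
	var b strings.Builder
	b.Grow(estimate)
	initialCap := b.Cap()
	for _, l := range lines {
		b.WriteString(l)
		b.WriteByte('\n')
	}
	return b.String(), b.Cap() != initialCap
}

func main() {
	lines := []string{
		"CTL.EXAMPLE.001: public bucket for 6 hours",
		"CTL.EXAMPLE.002: role unused for 90 days",
	}

	_, grew := buildLines(lines, len(lines)*64)
	fmt.Println("generous estimate, buffer grew:", grew)

	_, grew = buildLines(lines, 1)
	fmt.Println("tiny estimate, buffer grew:    ", grew)
}
```

The control IDs in the sample lines are placeholders. Logging the grew flag in a test is a cheap way to notice when a per-line estimate has drifted below the real output size.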

6. strings.Cut Over Split+Index

The Cost of Split

// BAD: allocates a []string, splits entire string
parts := strings.Split(line, "=")
if len(parts) >= 2 {
    key = parts[0]
    value = parts[1]
}

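strings.Split scans the whole string and allocates a []string holding every piece. For "KEY=value=with=equals", it produces 4 pieces when you only need 2, and parts[1] is just "value": everything after the second "=" is silently dropped.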

The Fix

// GOOD: zero allocation, stops at first separator
key, value, ok := strings.Cut(line, "=")
if ok {
    // key = "KEY", value = "value=with=equals"
}

strings.Cut (Go 1.18+) returns substrings of the original string — no allocation. It stops at the first separator, so "KEY=value=with=equals" correctly splits into "KEY" and "value=with=equals".
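Running both parsers on the same input makes the difference visible. A small Go sketch (parseWithSplit and parseWithCut are names introduced here for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// parseWithSplit mirrors the Split-based approach: it keeps only the first
// two pieces, so a value containing "=" is truncated.
func parseWithSplit(line string) (key, value string, ok bool) {
	parts := strings.Split(line, "=")
	if len(parts) < 2 {
		return "", "", false
	}
	return parts[0], parts[1], true
}

// parseWithCut splits at the first "=" only and keeps the rest intact.
func parseWithCut(line string) (key, value string, ok bool) {
	return strings.Cut(line, "=")
}

func main() {
	line := "KEY=value=with=equals"

	k, v, _ := parseWithSplit(line)
	fmt.Printf("Split: key=%q value=%q\n", k, v)

	k, v, _ = parseWithCut(line)
	fmt.Printf("Cut:   key=%q value=%q\n", k, v)
}
```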

For prefix checking:

// VERBOSE: the prefix literal appears twice and the two copies can drift apart
if strings.HasPrefix(s, "severity:") {
    rest := s[len("severity:"):]
}

// GOOD: CutPrefix (Go 1.20+) tests and trims in one call
if rest, ok := strings.CutPrefix(s, "severity:"); ok {
    // rest is a substring of s; neither version allocates
}
}
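A minimal sketch of CutPrefix in a small parser follows. The "severity:" line format comes from the snippet above; the function name parseSeverity and the whitespace trimming are assumptions added for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// parseSeverity extracts the level from a "severity:<level>" line.
// It returns false when the line does not carry that prefix.
func parseSeverity(line string) (string, bool) {
	rest, ok := strings.CutPrefix(line, "severity:")
	if !ok {
		return "", false
	}
	return strings.TrimSpace(rest), true
}

func main() {
	for _, line := range []string{"severity: critical", "owner: platform"} {
		level, ok := parseSeverity(line)
		fmt.Printf("%q -> %q %v\n", line, level, ok)
	}
}
```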

7. Defined Types Avoiding Conversion

The Cost of String Conversion

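// BAD: converting to string at every use discards the type
assetID := asset.ID("bucket-1")
history[assetID.String()] = observations  // map keyed by plain string: any string is accepted

assetID.String() returns string(assetID). Between a defined string type and string, that conversion is free at run time: both share the same representation, so nothing is allocated or copied. The cost is in the code. A map keyed by plain string accepts any string, so the compiler can no longer catch a control ID used where an asset ID belongs. In a loop over 1,000 assets × 10 snapshots, that's 10,000 conversions that add noise and nothing else.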

The Fix

// GOOD: use the typed value directly as map key
history[assetID] = observations  // asset.ID IS a string — no conversion

asset.ID is type ID string. Go allows defined string types as map keys. The map uses the underlying string bytes directly. No conversion, no allocation.
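The following sketch shows a defined string type used directly as a map key. ID, ControlID, Observation, and record are hypothetical stand-ins written for this example, not Stave's actual types.

```go
package main

import "fmt"

// ID is a hypothetical stand-in for asset.ID.
type ID string

// ControlID is a second defined string type; it cannot be passed where ID is expected.
type ControlID string

// Observation is a placeholder for one snapshot's data.
type Observation struct{ Snapshot int }

// record appends an observation under the typed key, with no conversion.
func record(history map[ID][]Observation, id ID, obs Observation) {
	history[id] = append(history[id], obs)
}

func main() {
	history := make(map[ID][]Observation)
	record(history, ID("bucket-1"), Observation{Snapshot: 1})
	record(history, "bucket-1", Observation{Snapshot: 2}) // untyped constant becomes an ID
	// record(history, ControlID("CTL-1"), Observation{}) would not compile.
	fmt.Println(len(history["bucket-1"]))
}
```

Keeping the key typed means a mix-up between identifier kinds is rejected by the compiler instead of producing an empty lookup at run time.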

The same principle applies to function parameters:

// NOISY: the explicit conversion adds nothing; Println accepts any value
fmt.Println(string(controlID))

// GOOD: pass the typed value directly
fmt.Println(controlID)  // fmt calls String() if the type implements fmt.Stringer

When Conversion is Needed

Sometimes you need string():

// String conversion required: joining typed values into a single string
joinedPath := string(vendor) + "/" + string(assetType)

But within the same type system — passing ControlID to a function that accepts ControlID, using it as a map key, comparing it — no conversion is needed.

8. sync.Once and sync.Map for Initialization

The Cost of Repeated Initialization

// BAD: check-then-init has a race condition
type ExemptionConfig struct {
    prepared bool
    exactMap map[string]*Rule
}

func (c *ExemptionConfig) Prepare() {
    if c.prepared { return }  // RACE: two goroutines can both see false
    c.exactMap = buildIndex(c.Assets)
    c.prepared = true
}

The Fix

// GOOD: sync.Once runs the function exactly once; concurrent callers block until it finishes
type ExemptionConfig struct {
    once     sync.Once
    exactMap map[string]*Rule
}

func (c *ExemptionConfig) Prepare() {
    c.once.Do(func() {
        c.exactMap = buildIndex(c.Assets)
    })
}
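To see the guarantee hold under contention, here is a self-contained sketch. The Rule type, the Assets field as a []string, the inline index build, and the builds counter are assumptions added so the example compiles on its own; only the once.Do shape comes from the snippet above.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Rule is a hypothetical placeholder for an exemption rule.
type Rule struct{ Name string }

type ExemptionConfig struct {
	Assets   []string // assumed input for the index
	once     sync.Once
	exactMap map[string]*Rule
	builds   atomic.Int32 // counts index builds, for the demo only
}

// Prepare builds the index at most once, however many goroutines call it.
func (c *ExemptionConfig) Prepare() {
	c.once.Do(func() {
		c.builds.Add(1)
		m := make(map[string]*Rule, len(c.Assets))
		for _, a := range c.Assets {
			m[a] = &Rule{Name: a}
		}
		c.exactMap = m
	})
}

// prepareConcurrently calls Prepare from n goroutines and returns the build count.
func prepareConcurrently(c *ExemptionConfig, n int) int32 {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.Prepare()
			_ = c.exactMap["bucket-1"] // safe: Do has returned, so the map is fully built
		}()
	}
	wg.Wait()
	return c.builds.Load()
}

func main() {
	c := &ExemptionConfig{Assets: []string{"bucket-1", "bucket-2"}}
	fmt.Println("index builds:", prepareConcurrently(c, 50))
}
```

Note that a struct containing sync.Once must not be copied after first use, which is why the example passes the config by pointer.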

For caching values keyed by pointer identity (TTY detection results):

// Cache the TTY detection result per *os.File, keyed by the pointer's address
var ttyCache sync.Map

func CanColor(out io.Writer) bool {
    f, ok := out.(*os.File)
    if !ok { return false }

    key := reflect.ValueOf(f).Pointer()
    if cached, ok := ttyCache.Load(key); ok {
        v, _ := cached.(bool)
        return v
    }

    enabled := detectTTY(f)
    ttyCache.Store(key, enabled)
    return enabled
}

sync.Map is optimized for the "write-once, read-many" pattern — exactly what caching needs. No mutex contention on the read path after the first call.
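One detail worth showing: on a cold key, two goroutines can both miss the cache and both run the detection. A sketch of the same caching shape using LoadOrStore, which keeps whichever result was stored first, is below. The probeCache type, its string keys, and the placeholder probe logic are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// probeCache caches the result of a hypothetical expensive probe, keyed by name.
type probeCache struct {
	m      sync.Map     // effectively map[string]bool
	probes atomic.Int32 // counts probe calls, for the demo only
}

// probe stands in for an expensive detection such as a TTY check.
func (c *probeCache) probe(name string) bool {
	c.probes.Add(1)
	return len(name)%2 == 0 // placeholder logic
}

// Enabled returns the cached result, probing only on a cache miss.
func (c *probeCache) Enabled(name string) bool {
	if v, ok := c.m.Load(name); ok {
		return v.(bool)
	}
	// Concurrent callers may both probe a cold key; LoadOrStore keeps the first result.
	actual, _ := c.m.LoadOrStore(name, c.probe(name))
	return actual.(bool)
}

func main() {
	c := &probeCache{}
	fmt.Println(c.Enabled("stdout"), c.Enabled("stdout"), "probes:", c.probes.Load())
}
```

If the probe must never run twice for the same key, a per-key sync.Once or a mutex is the stricter tool; for an idempotent check like this one, the occasional duplicate is harmless.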

9. WalkDir Over Walk

The Cost of Extra Syscalls

// BAD: filepath.Walk calls Lstat on every entry
filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
    // info comes from Lstat: an extra syscall per entry
    return nil
})

filepath.Walk calls os.Lstat on every directory entry to populate os.FileInfo. For a directory with 10,000 files, that's 10,000 extra syscalls, each one a round trip into the kernel.

The Fix

// GOOD: filepath.WalkDir uses cached DirEntry
filepath.WalkDir(dir, func(path string, d fs.DirEntry, err error) error {
    // d.IsDir() and d.Name() are cached from readdir, so no extra syscall
    // Only call d.Info() when you actually need size/mode/modtime
    return nil
})

filepath.WalkDir (Go 1.16+) uses fs.DirEntry, which caches the entry type from the readdir syscall. d.IsDir() and d.Name() are free. d.Info() calls Lstat only when you explicitly need it — most filters don't (skip hidden dirs, match extensions).
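A runnable sketch of such a filter follows: it skips hidden directories and counts files by extension without ever calling d.Info(). The function names countByExt and makeDemoTree, and the demo file layout, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
	"strings"
)

// countByExt walks root, skips hidden directories, and counts files with the
// given extension. It relies only on DirEntry data and never calls d.Info().
func countByExt(root, ext string) (int, error) {
	count := 0
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // d may be nil here, so check err first
		}
		if d.IsDir() {
			if path != root && strings.HasPrefix(d.Name(), ".") {
				return filepath.SkipDir
			}
			return nil
		}
		if filepath.Ext(d.Name()) == ext {
			count++
		}
		return nil
	})
	return count, err
}

// makeDemoTree creates a temp tree: a.json, sub/b.json, sub/c.txt, .git/d.json.
func makeDemoTree() (root string, cleanup func(), err error) {
	root, err = os.MkdirTemp("", "walkdir-demo")
	if err != nil {
		return "", func() {}, err
	}
	cleanup = func() { os.RemoveAll(root) }
	for _, f := range []string{"a.json", "sub/b.json", "sub/c.txt", ".git/d.json"} {
		p := filepath.Join(root, filepath.FromSlash(f))
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return root, cleanup, err
		}
		if err := os.WriteFile(p, []byte("{}"), 0o644); err != nil {
			return root, cleanup, err
		}
	}
	return root, cleanup, nil
}

func main() {
	root, cleanup, err := makeDemoTree()
	defer cleanup()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	n, err := countByExt(root, ".json")
	fmt.Println("json files:", n, "err:", err)
}
```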

The Checklist

Pattern	Cost of Not Doing It	Fix Effort
Pre-allocate slices	O(log n) allocations + copies	One `make()` parameter
Index-based iteration	128-408B copy per iteration	Change `_, v` to `i := range`
Switch over map	Hash + closure + indirect call per dispatch	Replace map with switch
Bitmask operations	O(n) contains per check	One-time parse + O(1) checks
strings.Builder + Grow	O(n²) concatenation	Replace `+=` with `WriteString`
strings.Cut	Allocate full split slice	Replace `Split` with `Cut`
Defined type as key	String conversion per use	Remove `string()` casts
sync.Once / sync.Map	Race condition or mutex contention	Replace bool flag with Once
WalkDir over Walk	Extra Lstat syscall per entry	Change `Walk` to `WalkDir`

None of these require benchmarks to justify. They're visible in code review, mechanical to apply, and eliminate entire categories of unnecessary work. The profiler is for finding the 1% improvements after these 9 patterns are already in place.

These 9 performance patterns were applied across Stave, a Go CLI for offline security evaluation. The index-based iteration alone fixed 70 loops copying structs up to 408 bytes. The switch-over-map change eliminated closure allocations on the predicate evaluation hot path.
