Forem: Jacob Molz

m0lz.02 — Stack Loops

Jacob Molz — Mon, 18 May 2026 16:51:25 +0000

Introduction

Stack Loops is the m0lz.02 workflow for checking a feature across its technology layers instead of treating the whole diff as one review contract. The failure class is a seam gap: app code, API shape, schema, infrastructure, deployment, and observability can each look acceptable in isolation while their boundary breaks. A frontend field rename that never reaches the API schema, or a new background queue that ships without the deploy/runtime wiring, is the kind of bug this workflow is designed to expose.

This launch post keeps the claim narrow. The current evidence shows that Stack Loops detects and configures the expected layer contracts for the reference projects, persists separate layer runs, surfaces an infrastructure review gate, and executes its parallel evaluator cohort under the published release-gate ratio. It does not claim live-model bug-detection lift over a named commercial reviewer.

What It Does

Stack Loops turns one feature review into a set of layer-scoped checks. m0lz.02 identifies the layers that matter for a project, writes those contracts into the PICE plan, runs the evaluator cohort per layer, and records a separate layer run for each activated layer before the feature is treated as passed.

The always-run policy matters because many production failures live outside application code. Infrastructure, deployment, and observability layers are configured to stay in the loop when they are active, even if the feature diff looks like a frontend or API change. This benchmark does not compare that policy against an ablated run that omits those layers, and it does not record per-fixture feature-diff scope; it captures the seven-layer set that discovery and configuration produced on the reference fixtures and the persisted layer runs that followed.
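A minimal sketch of the always-run selection described above, assuming the policy is a set union between the layers a diff touches and the always-run layers that are active. The function name and the `frontend` and `api` layer names are hypothetical; the post does not enumerate the seven-layer set or publish the selection logic.

```python
# Layers named in the post as always-run when active.
ALWAYS_RUN = {"infrastructure", "deployment", "observability"}

def layers_to_evaluate(active_layers, diff_layers):
    """Return the layers to evaluate for one feature (hypothetical helper).

    active_layers: layers that discovery and configuration marked active.
    diff_layers: layers the feature diff appears to touch.
    """
    active = set(active_layers)
    touched = set(diff_layers) & active
    # Always-run layers stay in the loop whenever they are active,
    # even if the diff never touches them.
    return sorted(touched | (ALWAYS_RUN & active))

selected = layers_to_evaluate(
    active_layers=["frontend", "api", "infrastructure", "deployment", "observability"],
    diff_layers=["frontend"],
)
print(selected)  # ['deployment', 'frontend', 'infrastructure', 'observability']
```

In this sketch a frontend-only diff still pulls in the three operational layers, which is the behavior the policy is meant to guarantee.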

The concrete gate example is fastapi-postgres. In the captured acceptance run, that fixture hit a pending infrastructure review gate, recorded one gate decision, received approval, and resumed to passed. That is the workflow shape Stack Loops needs: pause at the cross-layer contract, make the decision auditable, then continue only after the gate is resolved. The capture does not include a negative-control run where an activated layer fails or a gate is rejected, so the fail-closed pass/fail aggregation is evidenced as plumbing and audit rather than as a refusal at aggregation time.
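The pause, decide, resume shape can be sketched as a small state machine. This is an illustration under stated assumptions, not the daemon's implementation: the class, method, and status names are hypothetical, and the rejected-gate branch is an assumption because the capture only exercised an approval.

```python
class GatedEvaluation:
    """Hypothetical model of an evaluation that can pause at a review gate."""

    def __init__(self):
        self.status = "running"
        self.pending_layer = None
        self.gate_decisions = []  # audit trail, one entry per decision

    def hit_gate(self, layer):
        if self.status != "running":
            raise RuntimeError(f"cannot open a gate from status {self.status!r}")
        self.status = "pending_review"
        self.pending_layer = layer

    def decide(self, approved, reviewer):
        if self.status != "pending_review":
            raise RuntimeError("no gate is pending")
        self.gate_decisions.append(
            {"layer": self.pending_layer, "approved": approved, "reviewer": reviewer}
        )
        self.pending_layer = None
        # Assumption: a rejected gate fails the evaluation outright.
        self.status = "running" if approved else "failed"

    def finish(self):
        if self.status != "running":
            raise RuntimeError(f"cannot finish from status {self.status!r}")
        self.status = "passed"

run = GatedEvaluation()
run.hit_gate("infrastructure")
run.decide(approved=True, reviewer="operator")
run.finish()
print(run.status, len(run.gate_decisions))  # passed 1
```

The point of the shape is that `finish` cannot succeed while a gate is pending, and every decision leaves a record behind.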

How It Works

The workflow starts with layer discovery. m0lz.02 reads project evidence from manifests, framework files, directory layout, config files, and explicit overrides. It then evaluates the feature against the per-layer contract files at .pice/contracts/{layer}.toml instead of asking a single reviewer to infer every boundary from one prompt. The per-layer contracts evaluated in this capture are the template contracts that pice init --upgrade writes from templates/pice/contracts/; their criteria headlines and template hashes are recorded under results.json.methodology.per_layer_contract_content. Bespoke per-fixture contracts that customize criteria to the schema/auth/deployment/observability risks of a specific project are out of scope for this launch capture.
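As a rough illustration of the contract layout, the sketch below resolves `.pice/contracts/{layer}.toml` for a list of layers and reports which contract files are missing. Only the path pattern comes from the post; the helper names and the layer names used in the demo are hypothetical.

```python
import tempfile
from pathlib import Path

def contract_path(project_root, layer):
    """Path of one layer contract under the layout quoted in the post."""
    return Path(project_root) / ".pice" / "contracts" / f"{layer}.toml"

def missing_contracts(project_root, layers):
    """Layers that have no contract file on disk (hypothetical check)."""
    return [layer for layer in layers if not contract_path(project_root, layer).is_file()]

# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as root:
    for layer in ("api", "infrastructure"):
        path = contract_path(root, layer)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("# placeholder contract\n")
    missing = missing_contracts(root, ["api", "infrastructure", "observability"])

print(missing)  # ['observability']
```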

Evaluation runs through the daemon. The CLI accepts terminal commands and renders status; pice-daemon owns background jobs, provider sessions, manifests, layer-run persistence, metrics, templates, audit rows, and review gates. The CLI talks to the daemon over the local socket. The daemon talks to providers over stdio, with provider stdout reserved for JSON-RPC frames.
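To make the provider boundary concrete, here is a sketch of JSON-RPC 2.0 framing over a text stream. Newline-delimited framing, the `evaluate` method name, and the payload fields are assumptions for illustration; the post only states that provider stdout is reserved for JSON-RPC frames.

```python
import json

def encode_request(request_id, method, params):
    """Serialize one JSON-RPC 2.0 request as a single line (assumed framing)."""
    frame = {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    return json.dumps(frame, separators=(",", ":")) + "\n"

def decode_response(line):
    """Parse one response line and return (id, result), raising on error frames."""
    frame = json.loads(line)
    if frame.get("jsonrpc") != "2.0":
        raise ValueError("not a JSON-RPC 2.0 frame")
    if "error" in frame:
        raise RuntimeError(frame["error"].get("message", "provider error"))
    return frame["id"], frame["result"]

# Hypothetical method and params, used only to show the round trip.
request_line = encode_request(1, "evaluate", {"layer": "api"})
response_id, result = decode_response('{"jsonrpc":"2.0","id":1,"result":{"ok":true}}')
print(response_id, result)  # 1 {'ok': True}
```

Reserving stdout for frames means a provider has to send human-readable logs somewhere else, such as stderr, or the daemon's parser would hit non-JSON lines.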

That split is the main product boundary. The CLI stays a thin operator surface. The daemon owns orchestration and state, so status --follow, logs --follow, review-gate approval, and resumed background evaluation all read from the same job record instead of from terminal text.

Architecture

m0lz.02 is the companion repository for this launch: https://github.com/jmolz/m0lz.02. It contains the PICE CLI, the headless daemon, provider adapters, reference fixtures, release documentation, and acceptance scripts.

The architecture is built around a CLI/daemon split:

| Boundary | Responsibility |
| --- | --- |
| CLI | Parse commands, connect to the daemon socket, display status/log streams, and send operator actions such as review-gate approval. |
| Daemon | Own background evaluation, provider sessions, manifests, layer runs, metrics, templates, audit records, and gate state. |
| Provider stdio | Carry JSON-RPC request and response frames between the daemon and the provider process. |
| SQLite state | Persist evaluations, layer runs, pass events, seam findings, gate decisions, and audit records. |
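The persistence boundary might look something like the sketch below. The table names `layer_runs` and `gate_decisions` echo keys from the benchmark artifact, but every column, constraint, and the placeholder layer names are assumptions; the real schema is not published in this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evaluations (
    id INTEGER PRIMARY KEY,
    fixture TEXT NOT NULL,
    status TEXT NOT NULL
);
CREATE TABLE layer_runs (
    id INTEGER PRIMARY KEY,
    evaluation_id INTEGER NOT NULL REFERENCES evaluations(id),
    layer TEXT NOT NULL,
    status TEXT NOT NULL,
    UNIQUE (evaluation_id, layer)  -- one run per layer per evaluation
);
CREATE TABLE gate_decisions (
    id INTEGER PRIMARY KEY,
    evaluation_id INTEGER NOT NULL REFERENCES evaluations(id),
    layer TEXT NOT NULL,
    decision TEXT NOT NULL
);
""")

conn.execute("INSERT INTO evaluations (id, fixture, status) VALUES (1, 'fastapi-postgres', 'passed')")
# Placeholder layer names; the post does not list the seven layers.
layers = [f"layer-{n}" for n in range(1, 8)]
conn.executemany(
    "INSERT INTO layer_runs (evaluation_id, layer, status) VALUES (1, ?, 'passed')",
    [(layer,) for layer in layers],
)
conn.execute(
    "INSERT INTO gate_decisions (evaluation_id, layer, decision) VALUES (1, 'infrastructure', 'approved')"
)

distinct_layers = conn.execute(
    "SELECT COUNT(DISTINCT layer) FROM layer_runs WHERE evaluation_id = 1"
).fetchone()[0]
gate_count = conn.execute(
    "SELECT COUNT(*) FROM gate_decisions WHERE evaluation_id = 1"
).fetchone()[0]
print(distinct_layers, gate_count)  # 7 1
```

Counting distinct rows this way mirrors the kind of check the acceptance run reports: seven distinct layer runs and one gate decision for the gated fixture.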

This matters for Stack Loops because layer evaluation is not a single synchronous terminal response. It needs resumable state, separate layer records, stream-json terminal frames, and human gate decisions that survive process boundaries.

Benchmark Results

The raw benchmark artifact is results.json in the benchmark workspace. The human-readable summary is below.

| Check | Result | Evidence | Boundary |
| --- | --- | --- | --- |
| Parallel cohort assertion | passed | Sequential mean `6.238500097s`; parallel mean `3.147005138s`; ratio `0.504` at or below target `0.625`. | Release-gate timing assertion only; three iterations and no confidence interval. |
| Phase 8 reference projects | passed | Five PICE-authored fixtures passed. Each produced seven detected layers, seven configured layers, seven distinct `layer_runs` rows, terminal exit code zero, and `evaluate_status: passed`. | Reference-fixture mechanics, not external project coverage. |
| Infrastructure review gate | passed | `fastapi-postgres` recorded `gate_decisions: 1`, `audit_id: 1`, approval, and resumed to `passed`. | Demonstrates gate plumbing and auditability, not general bug-detection lift. |
| Infrastructure contract tier | recorded | Each fixture reported `infrastructure_contract_tier: 3`. | Tier three means the infrastructure layer used the agent-team evaluation contract configured by the fixture. |
| Provider mode | stubbed | The acceptance script used provider `stub`, model `stub-model`, and eight `9.5,0.001` stub-score pairs. | Validates Stack Loops mechanics, not live provider judgment quality. |

Three expected release-readiness targets were not run in this capture: the steady-state Criterion benchmark, the release artifact smoke test, and the local Linux CI script. The post therefore treats the capture as benchmark evidence for Stack Loops mechanics on one darwin arm64 machine, not as a complete release certification.

Methodology

The benchmark workspace now includes METHODOLOGY.md beside results.json and environment.json. That methodology file records the commands, repository revision, environment (including the Rust toolchain that produced the Cargo timing), provider mode, target provenance, omissions, and scope limits.

The speedup assertion is a release-gate check. It compares the sequential evaluator cohort path with the parallel evaluator cohort path and verifies that the ratio remains at or below the inherited target. The harness recorded the means and ratio, but it did not record raw per-iteration timings, variance, standard deviation, or a confidence interval. The ratio target came from the earlier release validation contract; it was not statistically re-derived in this capture.
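The arithmetic behind the assertion is short enough to check by hand. The sketch below recomputes the ratio from the two published means; the variable names are illustrative, and it says nothing about variance because the harness did not record per-iteration timings.

```python
sequential_mean_s = 6.238500097  # published sequential mean, seconds
parallel_mean_s = 3.147005138    # published parallel mean, seconds
target_ratio = 0.625             # inherited release-gate target

ratio = parallel_mean_s / sequential_mean_s
speedup = sequential_mean_s / parallel_mean_s

print(f"ratio={ratio:.3f} speedup={speedup:.2f}x passed={ratio <= target_ratio}")
# ratio=0.504 speedup=1.98x passed=True
```

A ratio of 0.504 means the parallel path took roughly half the wall-clock time of the sequential path on this machine, comfortably inside the 0.625 gate.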

The Phase 8 acceptance run is also bounded. It used PICE-authored reference fixtures and a stub provider. That is the right harness for checking layer discovery, layer persistence, stream termination, review-gate state, and daemon orchestration. It is not evidence that a live model would catch more real bugs than a single-contract reviewer, and it is not evidence that every framework topology is covered.

The environment snapshot records darwin 25.4.0 on arm64, an Apple M3 Max reporting 16 CPUs, 128 GB of memory, Node.js v22.15.0, and npm 11.12.1. The snapshot was captured before the results import, so it should be read as hardware and toolchain context rather than an exact same-process timing envelope.

Limitations

The comparator in this post is a one-contract-per-feature review workflow, not a named commercial product. That avoids a stronger claim than the evidence supports.

The reference fixtures are authored by the m0lz.02 project. They exercise canonical Next.js, FastAPI, Rails, Express, and SvelteKit shapes, but they do not cover polyrepo discovery, monorepo package-boundary inference, dynamic service-mesh topology, or live-provider disagreement cases.

The always-run layer policy was enforced, not ablated. To prove the policy improves defect detection, m0lz.02 still needs a comparison run that disables infrastructure, deployment, and observability layers against the same task set. The always-run trigger condition is also not separately evidenced in this capture. results.json records seven configured layers and seven persisted layer runs per fixture, but it does not record each fixture's feature-diff scope. The artifact therefore supports always-run plumbing and persistence, not the claim that meta-layers ran despite an otherwise narrower diff.

The fail-closed pass/fail aggregation is not exercised by a negative control in this capture. Every fixture passed under the stub provider, and the one infrastructure gate decision was an approval that resumed to passed. A failing-layer, missing-layer-run, or rejected-gate negative-control run would prove that aggregation refuses to mark a feature passed; until that artifact exists, the post bounds the claim to plumbing, persistence, and approved-gate resume behavior.
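For reference, the behavior a negative control would need to demonstrate looks roughly like this. It is a sketch of fail-closed aggregation in general, with hypothetical function, status, and layer names, and is not taken from the m0lz.02 source.

```python
def aggregate(expected_layers, layer_runs, gate_decisions):
    """Fail-closed aggregation sketch.

    expected_layers: layers that were activated for the feature.
    layer_runs: dict mapping layer name to its recorded status.
    gate_decisions: list of decision strings, e.g. 'approved' or 'rejected'.
    """
    for layer in expected_layers:
        # A missing run is treated the same as a failed run.
        if layer_runs.get(layer) != "passed":
            return "failed"
    # Any gate that is not explicitly approved blocks the feature.
    if any(decision != "approved" for decision in gate_decisions):
        return "failed"
    return "passed"

expected = ["api", "infrastructure"]
print(aggregate(expected, {"api": "passed", "infrastructure": "passed"}, ["approved"]))  # passed
print(aggregate(expected, {"api": "passed"}, ["approved"]))                              # failed
print(aggregate(expected, {"api": "passed", "infrastructure": "passed"}, ["rejected"]))  # failed
```

The three calls correspond to the approved path the capture did exercise and two of the negative controls it did not: a missing layer run and a rejected gate.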

The per-layer contract content evaluated in this capture is the m0lz.02 template content under templates/pice/contracts/, recorded with criteria headlines and template hashes in results.json.methodology.per_layer_contract_content. The artifact therefore evidences per-layer contract evaluation against named risk-class criteria, not bespoke per-fixture contracts authored for the schema, auth, deployment, or observability risks of a specific real project.

The seam-failure classes used to motivate the launch (frontend field rename that never reaches the API schema, background queue that ships without deploy/runtime wiring, the schema-to-API and deploy-to-runtime risks named in the conclusion) are not validated by this capture either. The acceptance harness evidences layer mechanics and per-layer template criteria; a seam-failure fixture or seam-finding negative control is the artifact that would prove detection of those specific risk classes.

The current capture is single-machine darwin arm64 evidence. Any Linux, CI, or cross-platform runtime claim needs a separate capture from the omitted release-readiness targets. The Rust toolchain that produced the speedup timing is recorded in environment.json under rust_toolchain (rustc 1.94.1, cargo 1.94.1, stable channel, no project toolchain pin, default dev test profile) so the Cargo target is reproducible.

Conclusion

Use Stack Loops when the risk lives between layers: schema to API, code to infrastructure, deploy to runtime, runtime to observability. This launch proves the workflow mechanics on the current reference capture and keeps the broader claim for later evidence.

The next work is clear: replay the same layer contracts with live providers, add harder non-canonical projects, run the omitted release-readiness targets, and publish the comparison only when those artifacts exist.

m0lz.01: Does it make sense to me?

Jacob Molz — Mon, 11 May 2026 21:34:05 +0000

Does it make sense to me?

That is the bar I wanted for m0lz.01.

I do not need another content calendar, another notes app, or another agent that says it can publish while leaving the real work scattered across terminals, drafts, browser tabs, and half-finished checklists. I wanted one local system that could take an idea, turn it into research, draft the post, challenge the argument, publish to my site, and prepare the distribution work without pretending the risky parts are magic.

m0lz.01 is that system. It is a standalone blog CLI with Codex and Claude authoring surfaces over the same SQLite state and file artifacts. The CLI owns the mechanical work. Codex and Claude help plan, write, review, and operate it.

The important part is not that it can write. The important part is that the system leaves me with a workflow I can inspect.

The Shape

The pipeline has six working phases: ideas, research, benchmark, draft, evaluate, and publish. A post that completes them reaches published, the successful terminal state. A rolled-back post moves to unpublished, which keeps the slug reserved because the canonical URL is permanent.
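Here is a minimal sketch of that lifecycle, assuming the phases advance strictly in the order listed and that rollback is only allowed from the published state. Both assumptions and the function names are mine; the CLI's real transition rules may be richer.

```python
PHASES = ["ideas", "research", "benchmark", "draft", "evaluate", "publish"]

def advance(state):
    """Move a post to the next state (assumed linear ordering)."""
    if state not in PHASES:
        raise ValueError(f"{state!r} is not a working phase")
    index = PHASES.index(state)
    return "published" if index == len(PHASES) - 1 else PHASES[index + 1]

def unpublish(state):
    """Roll back a published post; the slug stays reserved."""
    if state != "published":
        raise ValueError("only a published post can be rolled back")
    return "unpublished"

state = "ideas"
history = [state]
while state != "published":
    state = advance(state)
    history.append(state)

print(history)
# ['ideas', 'research', 'benchmark', 'draft', 'evaluate', 'publish', 'published']
```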

The hub is this site. Every post lands at https://m0lz.dev/writing/<slug>. Cross-posts point their canonical URL back here. Project repos and research pages are spokes around the hub, not competing sources of truth.

The CLI stores state in .blog-agent/state.db and writes phase artifacts next to it:

research notes
benchmark results
MDX drafts
evaluation reports
generated research pages
social paste files
publish receipts

That sounds boring because it should. The durable value is not an agent personality. It is state that survives the chat window.

Codex First, Not Claude Only

The first version of this story over-indexed on the /blog Claude surface because that was the first polished interactive path. That is no longer the full picture.

The truth is simpler: m0lz.01 is a local CLI first. Codex is now a first-class way I work on and operate the system. This repo has Codex command wrappers under .codex/commands/* and migrated source-command skills under .agents/skills/source-command-*. Claude Code still has the packaged .claude-plugin/ /blog skill, and that plugin ships in the npm tarball.

Those are two clients over the same CLI boundary. Neither one gets to be the database. Neither one gets to be the publishing system. They propose work, inspect files, draft text, run checks, and hand off state changes to blog.

That distinction matters. If I am writing from Codex, I can run the same plan and evaluation discipline I use for code changes. If I am in Claude Code, I can use the packaged /blog flow. The content pipeline does not care which assistant helped produce the next approved step.

One Prompt Still Does Not Mean No Judgment

The pleasant demo is: ask for a launch post, get a published post.

The actual workflow is stricter. The authoring layer proposes a plan. The CLI validates that the plan is made of registered leaf commands. The operator approves it. blog agent apply executes the approved steps and writes a receipt.

For a project launch, the work is roughly:

blog research init m0lz-01-launch --topic "m0lz.01 launch"
blog research finalize m0lz-01-launch
blog benchmark skip m0lz-01-launch
blog draft init m0lz-01-launch
blog draft complete m0lz-01-launch
blog evaluate init m0lz-01-launch
blog evaluate structural-autocheck m0lz-01-launch
blog evaluate record m0lz-01-launch structural structural.json
blog evaluate record m0lz-01-launch adversarial adversarial.json
blog evaluate record m0lz-01-launch methodology methodology.json
blog evaluate synthesize m0lz-01-launch
blog evaluate complete m0lz-01-launch
blog publish start m0lz-01-launch
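The "registered leaf commands" check can be sketched as a lookup against an allowlist. The registry below contains only the subcommands listed above, so it is almost certainly narrower than the real one, and `rejected_steps` is a hypothetical helper rather than the CLI's validator.

```python
import shlex

# Assumption: a registry built from the commands shown in this post only.
REGISTERED = {
    ("research", "init"), ("research", "finalize"),
    ("benchmark", "skip"),
    ("draft", "init"), ("draft", "complete"),
    ("evaluate", "init"), ("evaluate", "structural-autocheck"),
    ("evaluate", "record"), ("evaluate", "synthesize"), ("evaluate", "complete"),
    ("publish", "start"),
}

def rejected_steps(steps):
    """Return the plan steps that are not registered `blog` leaf commands."""
    rejected = []
    for step in steps:
        argv = shlex.split(step)
        if len(argv) < 3 or argv[0] != "blog" or (argv[1], argv[2]) not in REGISTERED:
            rejected.append(step)
    return rejected

plan = [
    'blog research init m0lz-01-launch --topic "m0lz.01 launch"',
    "blog draft init m0lz-01-launch",
    "blog publish everything m0lz-01-launch",
    "rm -rf .blog-agent",
]
bad = rejected_steps(plan)
print(bad)  # ['blog publish everything m0lz-01-launch', 'rm -rf .blog-agent']
```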

The plan path wraps those commands with a hash gate. Approval records a SHA-256 hash of the canonical plan payload. If the plan changes after approval, both the verify and apply steps reject it with HASH_MISMATCH.
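A minimal sketch of a hash gate of this kind, assuming the canonical payload is JSON with sorted keys and compact separators. The canonicalization scheme, the plan fields, and the function names are assumptions; only SHA-256 and the HASH_MISMATCH code come from the post.

```python
import hashlib
import json

def plan_hash(plan):
    """SHA-256 over a canonical JSON encoding (assumed canonical form)."""
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def verify(plan, approved_hash):
    """Reject a plan whose content no longer matches the approved hash."""
    if plan_hash(plan) != approved_hash:
        raise ValueError("HASH_MISMATCH")

plan = {"slug": "m0lz-01-launch", "steps": ["blog draft init m0lz-01-launch"]}
approved = plan_hash(plan)

verify(plan, approved)  # unchanged plan passes silently

tampered = dict(plan, steps=plan["steps"] + ["blog publish start m0lz-01-launch"])
try:
    verify(tampered, approved)
    outcome = "accepted"
except ValueError as error:
    outcome = str(error)

print(outcome)  # HASH_MISMATCH
```

Canonicalizing before hashing matters because two payloads with the same content but different key order should produce the same hash.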

That gate is useful, but it is not a sandbox. The authoring surface can still ask to run blog commands. A human at the terminal can still bypass the plan system. The model is cooperative: a cooperative author, a cooperative assistant, and a CLI that makes the approved path hard to accidentally drift from.

Evaluation Is Where the System Earns Its Keep

I do not trust first drafts. I trust pressure.

m0lz.01 runs a three-reviewer panel:

Structural review checks the content shape, MDX contract, sources, and publish readiness.
Adversarial review uses Codex GPT-5.5 high to argue against the thesis.
Methodology review uses Codex GPT-5.5 xhigh for benchmark validity, reproducibility, and evidence claims.

The synthesis step groups findings into consensus, majority, and single-reviewer issues. Consensus and majority issues block. Autocheck findings block. Single-reviewer issues can be advisory, but I rejected several algorithmic passes while dogfooding this post because serious findings were landing as single-reviewer advisories.
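The grouping rule can be sketched like this, assuming consensus means all three reviewers raised the issue and majority means two did. Those thresholds, the issue keys, and the function name are assumptions; the real synthesis step may match findings less literally than by an exact key.

```python
from collections import defaultdict

def synthesize(findings, autocheck_findings=()):
    """Group (reviewer, issue) pairs and decide whether the post is blocked."""
    reviewers_by_issue = defaultdict(set)
    for reviewer, issue in findings:
        reviewers_by_issue[issue].add(reviewer)

    buckets = {"consensus": [], "majority": [], "single": []}
    for issue, reviewers in sorted(reviewers_by_issue.items()):
        if len(reviewers) >= 3:
            buckets["consensus"].append(issue)
        elif len(reviewers) == 2:
            buckets["majority"].append(issue)
        else:
            buckets["single"].append(issue)

    # Consensus, majority, and autocheck findings block; single findings advise.
    blocked = bool(buckets["consensus"] or buckets["majority"] or autocheck_findings)
    return buckets, blocked

findings = [
    ("structural", "missing-source"),
    ("adversarial", "missing-source"),
    ("methodology", "missing-source"),
    ("adversarial", "overclaim"),
    ("methodology", "overclaim"),
    ("methodology", "no-variance"),
]
buckets, blocked = synthesize(findings)
print(buckets, blocked)
```

Note that in this sketch a lone `no-variance` finding would not block on its own, which is exactly the gap I kept hitting while dogfooding.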

That is useful signal. The system did not merely produce a green check. It made the weakness visible enough for me to refuse the result.

What Ships Automatically

The publish pipeline is checkpointed. If step five fails, the next blog publish start <slug> resumes from step five. Each step is designed to be idempotent.
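A sketch of the checkpoint-and-resume idea, assuming the checkpoint is simply a count of completed steps. The step names and the in-memory checkpoint are hypothetical; the real pipeline persists its progress in the workspace state.

```python
def run_pipeline(steps, checkpoint):
    """Run steps in order, starting from the first one not yet completed."""
    for index in range(checkpoint.get("completed", 0), len(steps)):
        name, action = steps[index]
        action()  # raises on failure, leaving the checkpoint at this step
        checkpoint["completed"] = index + 1
    return checkpoint

calls = []
flaky = {"fail_once": True}

def make_step(name):
    def action():
        if name == "devto-draft" and flaky["fail_once"]:
            flaky["fail_once"] = False
            raise RuntimeError("transient failure")
        calls.append(name)
    return action

# Hypothetical step names for illustration.
steps = [(n, make_step(n)) for n in ("write-mdx", "research-page", "devto-draft", "social-text")]
checkpoint = {"completed": 0}

try:
    run_pipeline(steps, checkpoint)
except RuntimeError:
    pass  # first attempt stops at the failing step

run_pipeline(steps, checkpoint)  # second attempt resumes at "devto-draft"
print(calls)  # ['write-mdx', 'research-page', 'devto-draft', 'social-text']
```

Resuming at the failed step only works safely if that step is idempotent, since a step may have partially run before it raised.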

The pipeline can:

create or update the canonical MDX in the hub repo
generate the research companion page
pause for the site pull request gate
create a Dev.to draft with published: false
attach the canonical URL to the Dev.to draft
prepare paste-ready Medium, Substack, LinkedIn, and Hacker News text
update companion repo links when a project repo exists

Dev.to is the one API cross-post path I am willing to automate right now. Medium, Substack, LinkedIn, and Hacker News remain paste-ready outputs because the manual review step is still useful and the APIs are not worth binding into the publish path yet.

What It Does Not Bind Yet

The plan hash does not bind .blogrc.yaml. If the workspace config changes between approval and apply, the hash check does not catch it. The plan displays the target venues so the operator can spot a mismatch, but config-hash binding is still future work.

The SQLite database is local state, not protected state. Back it up before upgrades if a workspace matters.

The lock is slug-scoped and cooperative. It prevents two applies for the same slug from running at once. It does not defend against arbitrary filesystem edits inside .blog-agent/.
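One common way to build a cooperative, slug-scoped lock is an exclusively created lock file, sketched below. I am not claiming this is how m0lz.01 implements its lock; the file naming, directory, and helper are hypothetical.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def slug_lock(lock_dir, slug):
    """Hold an exclusive lock file for one slug (cooperative, not enforced)."""
    path = os.path.join(lock_dir, f"{slug}.lock")
    try:
        # O_EXCL makes creation fail if another holder already made the file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(f"apply already running for {slug}") from None
    try:
        os.write(fd, str(os.getpid()).encode())
        yield path
    finally:
        os.close(fd)
        os.remove(path)

with tempfile.TemporaryDirectory() as lock_dir:
    with slug_lock(lock_dir, "m0lz-01-launch"):
        try:
            with slug_lock(lock_dir, "m0lz-01-launch"):
                second_apply = "ran"
        except RuntimeError:
            second_apply = "rejected"
        with slug_lock(lock_dir, "another-post"):
            other_slug = "ran"
    with slug_lock(lock_dir, "m0lz-01-launch"):
        reacquired = True

print(second_apply, other_slug, reacquired)  # rejected ran True
```

Anything that ignores the lock file can still edit the workspace, which is why the lock is described as cooperative.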

Those limitations do not make the system useless. They define the boundary. m0lz.01 is not a general-purpose secure agent runtime. It is a local publishing workflow with receipts, phase gates, and enough friction to keep me from shipping whatever the assistant wrote first.

Install

Install the CLI globally and verify that the binary resolves:

npm install -g m0lz-01
blog --help

Create a dedicated workspace outside your project repos:

mkdir -p ~/blog
cd ~/blog
blog init

Edit .blogrc.yaml with your hub site repo, base URL, content directories, author handles, and optional project map. Then add DEVTO_API_KEY to .env if Dev.to publishing is enabled.

Use Codex from the m0lz.01 repo when you want the local command-wrapper workflow:

.codex/commands/prime.md
.codex/commands/plan-feature.md <topic>
.codex/commands/execute.md <plan-file>
.codex/commands/evaluate.md <plan-file>

Use the packaged Claude Code plugin when you want the /blog skill:

claude --plugin-dir "$(npm root -g)/m0lz-01/.claude-plugin"

Then start with the real question:

/blog launch a new project

Source is at github.com/jmolz/m0lz.01. The repo includes the CLI, the packaged Claude plugin, the Codex command wrappers, and the regression tests that keep the publishing path honest.