Forem: Alex Kwiatkowski

Keel: The Leash for Agentic SDLC Management

Alex Kwiatkowski — Wed, 15 Apr 2026 02:46:38 +0000

Most conversations about agentic engineering focus on two things: the model and the harness.

The model reasons, writes code, and makes decisions. The harness orchestrates tool calls, manages context, and coordinates multi-step workflows. Together, they're powerful.

Together, they're still missing something.

The missing layer is the host.

Not the model. Not the harness. The persistent, rule-enforcing system that governs what work is valid, what evidence is required, and what constitutes done, regardless of who or what is doing the work.

Without a host that enforces shared physics, you don't have a governed system. You have a capable-but-unaccountable agent.

Keel is that host. It's a CLI state machine that both humans and agents must operate through. The leash isn't a constraint on what the model can reason about; it's a guarantee of what the system will enforce.

The model and harness cover two-thirds of the picture

A language model brings extraordinary capability. A harness brings coordination. But neither answers the foundational questions that govern a software delivery team:

What rules constrain the work? Not prompt-level suggestions — structural invariants that persist across sessions and actors.

What constitutes done? Not the model's assertion that it's complete — verifiable evidence the system can check.

Where is human judgment required? Not inferred from context — explicit gates that halt execution until a human decides.

What's the shared state? Not reconstructed from a context window — persistent, git-auditable truth that survives session boundaries.

Without a host layer that answers these questions structurally, every session starts with renegotiation. The model re-reads the codebase. The harness reconstructs intent from conversation history. Humans re-verify things they already approved. The agent loop restarts from scratch.

This isn't a capability problem. It's a governance problem.

The design pattern: CLI state machine as host

The pattern Keel implements is deceptively simple: every actor — human or agent — interacts with the board through the same CLI state machine. The machine defines the physics. The machine enforces the transitions. The machine records the evidence. Neither the model's capability nor the harness's coordination changes what the machine will allow.

# The same turn loop for humans and agents
keel turn                   # orient: inspect charge, health, flow, and doctor
keel next --role operator   # pull: role-scoped work from the delivery lane
keel story start <id>       # begin: transitions the story to in-progress
keel story record <id>      # proof: attach verifiable evidence to the story
keel story submit <id>      # close: gate check, all ACs must have proofs

The agent doesn't get a special path. It uses the same commands, hits the same gates, and produces the same evidence as a human operator. There's no side door: an agent that tries to skip a gate gets a hard rejection from the same machine a human would hit.
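To make the "same machine for every actor" idea concrete, here is a minimal sketch of a transition table. It is not Keel's implementation: the `backlog` state, the `State` enum, `ALLOWED`, and `transition` are all names I made up for illustration. Only `in-progress` and `needs-human-verification` come from the post.

```python
from enum import Enum


class State(Enum):
    BACKLOG = "backlog"  # hypothetical starting state, not confirmed by Keel
    IN_PROGRESS = "in-progress"
    NEEDS_HUMAN_VERIFICATION = "needs-human-verification"


# Hypothetical transition table; Keel's real states and rules may differ.
ALLOWED = {
    (State.BACKLOG, "start"): State.IN_PROGRESS,
    (State.IN_PROGRESS, "submit"): State.NEEDS_HUMAN_VERIFICATION,
}


class TransitionRejected(Exception):
    """Raised when a command is not valid from the current state."""


def transition(state: State, command: str, actor: str) -> State:
    # `actor` is only used in the error message. It plays no part in the
    # decision, so a human and an agent hit exactly the same table.
    try:
        return ALLOWED[(state, command)]
    except KeyError:
        raise TransitionRejected(
            f"{actor}: cannot '{command}' from state '{state.value}'"
        ) from None
```

The point of the sketch is what is missing: there is no branch on who is asking, so there is nowhere for a side door to live.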

ADRs as physics

The clearest demonstration of host-layer governance is the Architecture Decision Record. In Keel, an ADR in proposed state is a blocking constraint. Work in the governed bounded context halts — not because the model was told to check for ADRs, but because the state machine rejects the transition.

A proposed ADR covering authentication means no agent can start a story that touches the auth boundary. keel next --role operator won't surface that work. The machine won't allow the transition.

This matters because it changes what the human needs to supervise. You're not reading every pull request hoping the agent respected the architectural decision you mentioned in a system prompt three sessions ago. You're looking at a binary gate: the ADR is accepted or work doesn't happen.

ADRs aren't documentation the model might read. They're physics the machine enforces.
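A rough sketch of what "a proposed ADR blocks its bounded context" could look like in code. The `Adr` and `Story` shapes, their field names, and the `startable` helper are assumptions for illustration; Keel's actual data model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Adr:
    id: str
    context: str  # bounded context the ADR governs (hypothetical field)
    status: str   # "proposed" or "accepted"


@dataclass(frozen=True)
class Story:
    id: str
    context: str  # bounded context the story touches (hypothetical field)


def startable(stories: list[Story], adrs: list[Adr]) -> list[Story]:
    """Return only stories whose bounded context has no proposed ADR."""
    blocked = {adr.context for adr in adrs if adr.status == "proposed"}
    return [story for story in stories if story.context not in blocked]
```

With a proposed ADR on `auth`, a story touching `auth` never appears in the result. Accepting the ADR is the only thing that brings it back.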

Transitions require evidence, not assertions

The most common failure mode in agentic development is the confident wrong answer. The model completes a task, asserts it's done, and moves on. The harness accepts the assertion and records completion. Nobody checks the evidence because there's no structural requirement for evidence to exist.

Keel's Verified Spec Driven Development (VSDD) changes this at the state machine level. A story cannot transition from in-progress to needs-human-verification without all acceptance criteria having recorded verification proofs. The model can't assert completion. It must demonstrate it.

# Attempt to submit without proofs: the gate rejects it
keel story submit <id>
# → Error: acceptance criteria AC-01 has no recorded verification proof
#   All acceptance criteria must have evidence before the story can be submitted

The agent loop adapts to this naturally. Instead of asserting "I implemented the feature," it must run the verification, capture the output, record it against the specific acceptance criterion, and then submit. The human reviewing the submitted story sees a traced chain from implementation to acceptance criteria to recorded proof, not a summary the model wrote.

This is a different kind of trust. You're not trusting the model's judgment about completeness. You're trusting the state machine's check that the evidence exists.
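The evidence gate itself is a small check. The sketch below is a hedged illustration, not Keel's code: the `Story` class, its `record` and `submit` methods, and `SubmitRejected` are hypothetical names, and a proof is modelled as a plain string.

```python
from dataclasses import dataclass, field


class SubmitRejected(Exception):
    """Raised when a story is submitted with unproven acceptance criteria."""


@dataclass
class Story:
    id: str
    acceptance_criteria: list[str]
    proofs: dict[str, str] = field(default_factory=dict)

    def record(self, criterion: str, evidence: str) -> None:
        # Evidence can only be attached to a criterion the story declares.
        if criterion not in self.acceptance_criteria:
            raise KeyError(f"unknown acceptance criterion {criterion}")
        self.proofs[criterion] = evidence

    def submit(self) -> str:
        # The gate: every acceptance criterion needs a recorded proof.
        missing = [ac for ac in self.acceptance_criteria if ac not in self.proofs]
        if missing:
            raise SubmitRejected(
                f"acceptance criteria {', '.join(missing)} "
                "have no recorded verification proof"
            )
        return "needs-human-verification"
```

Note that `submit` never asks the caller whether the work is done. It only checks whether the evidence exists.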

Separating human judgment from agent execution

Keel resolves the human-in-the-loop tension through a 2-queue pull model with lane topology. The management lane holds decisions that require human judgment: proposed ADRs, stories awaiting acceptance, voyages needing planning. The delivery lane holds work agents can execute.

The two lanes don't intersect — a manager role never returns implementation work, an operator role never returns governance decisions.

# Human pulls from the management lane, never gets implementation work
keel next --role manager
# → accept: Story VE3IAG4jZ needs your verification

# Agent pulls from the delivery lane, never gets governance decisions
keel next --role operator
# → work: Story VDmdk1uib is ready to start
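One way to picture the two-lane pull is as a role-to-lane lookup over disjoint sets of work kinds. This is a sketch under assumptions: the kind names, the `LANES` mapping, and `next_item` are hypothetical and only mirror the manager/operator split described above.

```python
from typing import Optional

# Hypothetical work kinds; the two sets are disjoint by construction.
MANAGEMENT_KINDS = {"adr-decision", "story-acceptance", "voyage-planning"}
DELIVERY_KINDS = {"story-implementation"}

LANES = {"manager": MANAGEMENT_KINDS, "operator": DELIVERY_KINDS}


def next_item(board: list[dict], role: str) -> Optional[dict]:
    """Return the first board item in the lane that belongs to `role`."""
    kinds = LANES[role]  # an unknown role raises KeyError instead of guessing
    for item in board:
        if item["kind"] in kinds:
            return item
    return None
```

Because the lanes share no kinds, a manager pull can never return implementation work, and an operator pull can never return a governance decision.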

This is where CLI visual fidelity starts earning its keep. The keel flow --scene dashboard answers the question any participant needs answered at a glance: where is the board right now, and what requires attention? Management lane queue depths, delivery lane progress, and blocking conditions all surface in a single terminal render.

High-fidelity CLI output matters for agentic workflows in a way it never quite did for purely human ones. A human can squint at abbreviated output or ask a clarifying question. A model parsing terminal output to construct its next action cannot. Every ambiguity in the output is an opportunity for the model to misread board state and proceed on a wrong assumption.

Zero drift as a continuous invariant

Long-running agentic work accumulates structural debt: stories without SRS references, voyages missing design documents, ADRs that governed a context that has since changed. In a model-plus-harness system, this drift is invisible until it causes a failure.

In Keel, keel doctor checks structural coherence across the entire board. Agents run it first, before every session, and fix what it reports before doing anything else.

The drift categories are structural, not heuristic. These aren't warnings to consider — they're blocking conditions the state machine treats as hard failures. An agent that introduces scaffold drift on Monday cleans it up on Tuesday before starting anything new. Entropy is structurally bounded.
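As an illustration of structural (not heuristic) drift checks, here is a small sketch covering two of the categories named above. The board layout, the `srs_ref` and `design_doc` keys, and both function names are assumptions; the real `keel doctor` checks are not specified in this post.

```python
def doctor(board: dict) -> list[str]:
    """Return a list of structural findings; an empty list means no drift."""
    findings = []
    for story in board.get("stories", []):
        if not story.get("srs_ref"):
            findings.append(f"story {story['id']}: missing SRS reference")
    for voyage in board.get("voyages", []):
        if not voyage.get("design_doc"):
            findings.append(f"voyage {voyage['id']}: missing design document")
    return findings


def can_start_work(board: dict) -> bool:
    # Findings are hard failures, not warnings: any finding blocks new work.
    return not doctor(board)
```

Each check is a yes/no question about the board's structure, which is what lets the result act as a gate.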

Files as truth, git as the audit log

All board state lives in markdown files with YAML frontmatter. There's no hidden database, no daemon that can fall out of sync, no service to query. git log is the complete audit trail of every board mutation.
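For a feel of what "markdown files with YAML frontmatter" means in practice, here is a minimal reader. It is a deliberate simplification: it only handles flat `key: value` pairs and is not a YAML parser, and the `id` and `status` keys in the usage note are hypothetical, not Keel's schema.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a document into (frontmatter dict, markdown body).

    Only flat `key: value` lines are understood. A real implementation
    would hand the frontmatter block to a proper YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        # Index of the closing '---' delimiter.
        end = next(
            i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---"
        )
    except StopIteration:
        return {}, text
    meta = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])
```

Calling it on a file that starts with `---`, then `id: VDmdk1uib` and `status: in-progress`, then `---` yields those two keys plus the untouched markdown body. Since the state is just text, every change to it shows up as an ordinary git diff.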

When an agent submits a story, the evidence files and lifecycle transitions are committed to the repository as part of the sealing commit. The git history is the immutable record of what the agent claimed and what evidence it recorded. No retroactive amendment. No separate audit service to maintain.

The knowledge graph surfaces the artifact web visually. Each node is a file. Each edge is a structural relationship: a story linked to its epic, an ADR scoping a bounded context, an SRS reference anchoring an acceptance criterion. A missing edge is missing evidence. A disconnected node is an orphaned artifact.
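The graph checks can be sketched as set arithmetic over files and relationships. This is an assumption-laden illustration: `audit_graph` is a hypothetical helper, nodes are file names, and edges are (source, target) pairs. It reports disconnected nodes plus edges that point at files that do not exist, which is one plausible reading of "missing evidence".

```python
def audit_graph(nodes: set[str], edges: set[tuple[str, str]]) -> dict[str, set]:
    """Find orphaned artifacts and edges that reference missing files."""
    # An edge is dangling if either endpoint is not a known file.
    dangling = {(src, dst) for src, dst in edges if src not in nodes or dst not in nodes}
    # Only well-formed edges count towards connecting a node.
    connected = {node for edge in edges - dangling for node in edge}
    return {"orphans": nodes - connected, "dangling": dangling}
```

For example, a board with `epic.md`, `story.md`, and `adr.md`, where the story links to its epic and to a nonexistent `srs.md`, reports `adr.md` as an orphan and the SRS link as dangling.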

High-fidelity CLI output is not a UI nicety for agents: it's how the model reads board state without hallucinating the parts it can't see.

This is the deeper argument for investing in CLI visual fidelity in agentic tooling. A human operator adapts to ambiguous output. A model does not — it fills gaps with inference, and inference compounds into drift. Visual fidelity in the interface is structural reliability in the agent loop.

What this changes about trust

You verify everything important in agentic systems because you can't be sure the model held to the intent you had when you started the session. That's the actual cost of contextual trust: the human becomes the consistency check, session after session, because nothing else is holding the invariants.

The host layer makes trust structural. You trust the state machine because it enforced the gate checks. You trust the evidence because the transition required it. You trust the scope because the ADRs blocked everything outside it. You trust the audit trail because git holds it immutably.

The model's capability matters — a better model executes delivery lane work more effectively. The harness's coordination matters — a well-designed harness surfaces the right context at the right time. But neither determines whether the work was governed correctly. The host does that independently of both.

The leash isn't a limit on what the model can do. It's a guarantee that what the model does was supposed to happen.

The physics are real

The design pattern that emerges from Keel is this: place a CLI state machine between every actor and the work. Humans and agents are first-class participants in a governed system. The host defines the physics. The model executes within them. The harness coordinates around them. But neither the model nor the harness makes the rules. The CLI state machine does, persistently, across every session.

This is not a new idea in software engineering. Formal state machines, enforced transitions, and evidence-gated acceptance are standard in high-assurance systems. What's new is applying that discipline to the agentic layer, building the host so that a language model operating at full autonomy is still operating within a structure a human designed and a machine enforces.

The shared way of working doesn't emerge from the model being aligned. It emerges from the physics being real.

Sift: Local Hybrid Search Without the Infrastructure Tax

Alex Kwiatkowski — Tue, 10 Mar 2026 04:33:59 +0000

sift is a local Rust CLI for document retrieval. Point it at a directory, ask a question, and it runs a full hybrid search pipeline (BM25, dense vector, fusion, optional reranking) and returns ranked results. No daemon. No background indexer. No cloud. One binary.

It's built for agents and developers who need reliable, repeatable search over raw codebases, docs, and mixed-format corpora without spinning up infrastructure to get there.

You can install it now on macOS, Windows, and Linux.

The retrieval pipeline

Every query runs through four stages:

Expansion — query variants are generated to broaden recall before retrieval begins.
Retrieval — BM25 (keyword), phrase match, and dense vector retrieval run against the corpus. Each method captures different signal.
Fusion — results are merged using Reciprocal Rank Fusion (RRF), balancing signal across retrieval methods without manual weight tuning.
Reranking — optional local LLM reranking via Qwen applies semantic disambiguation on the fused candidate set.

Each stage is independently tunable. Skip the vector pass if you only need BM25 speed. Run the full stack for best precision.
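The fusion stage is the easiest to show in a few lines. Below is a sketch of standard Reciprocal Rank Fusion, where each document scores the sum of 1 / (k + rank) over every ranking it appears in. The function name, the tie-break by document id, and `k = 60` (the value commonly used in the RRF literature) are my assumptions, not sift's internals.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document ids into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Because only ranks are used, BM25 scores and cosine similarities never need to be put on a common scale, which is why no manual weight tuning is required.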

Architecture

The implementation is split into domain and adapters: domain objects model search plans, candidates, and scoring outputs; adapters implement the concrete BM25, phrase, vector, and reranking backends. A shared search service executes the same strategy model for CLI, benchmark, and eval flows, so nothing changes between a dev run and a CI eval pass.

Performance is local-first by design:

SIMD-accelerated dot-product for vector scoring on CPU-heavy workloads.
Zig-inspired incremental cache — a two-layer design borrowed from Zig's build system. A manifest store tracks filesystem metadata (inode, mtime, size) mapped to BLAKE3 content hashes, so sift knows exactly which files have changed without re-reading them. A content-addressable blob store holds pre-extracted text, pre-computed BM25 term frequencies, and pre-embedded dense vectors — meaning repeat queries never touch the neural network at all. Identical files across different projects share a single blob entry. The result: search performance bounded by dot-product speed, not inference latency.
Per-query embedding reuse across multi-stage pipelines.
Mapped I/O and tight tokenization hot loops to keep latency low on large corpora.
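The two-layer cache described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not sift's Rust code: the class and method names are made up, `extract` stands in for text extraction and embedding, and `hashlib.blake2b` substitutes for BLAKE3, which is not in Python's standard library.

```python
import hashlib
import os


class IncrementalCache:
    def __init__(self):
        self.manifest = {}    # path -> ((inode, mtime_ns, size), content_hash)
        self.blobs = {}       # content_hash -> derived data (text, vectors, ...)
        self.extractions = 0  # how many times the expensive step actually ran

    def _fingerprint(self, path):
        st = os.stat(path)
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def lookup(self, path, extract):
        fingerprint = self._fingerprint(path)
        entry = self.manifest.get(path)
        if entry is not None and entry[0] == fingerprint:
            # Metadata unchanged: serve the blob without re-reading the file.
            return self.blobs[entry[1]]
        with open(path, "rb") as handle:
            data = handle.read()
        digest = hashlib.blake2b(data).hexdigest()
        if digest not in self.blobs:
            # First time this content is seen anywhere: do the expensive work.
            self.blobs[digest] = extract(data)
            self.extractions += 1
        self.manifest[path] = (fingerprint, digest)
        return self.blobs[digest]
```

Two files with identical bytes end up as two manifest entries pointing at one blob, so the expensive extraction runs once no matter how many copies exist.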

One concrete tradeoff during development: lowering embedding max_length from 48 to 40 recovered latency budget while keeping quality above the BM25 baseline — a good example of how evidence-driven tuning beats guesswork.

Full internals are documented in ARCHITECTURE.md.

Evaluation

Comparative strategy run over 5,185 SciFact documents (~7.8 MB) on an AMD Ryzen Threadripper 3960X:

Strategy	nDCG@10	MRR@10	Recall@10	p50 (ms)
bm25	0.7262	0.7000	0.8000	5.41
legacy-hybrid	0.7893	0.7250	1.0000	50.29
page-index	0.7000	0.6667	0.8000	16.79
page-index-hybrid	0.5701	0.4367	1.0000	41.09
page-index-llm	0.7893	0.7250	1.0000	41.28
page-index-qwen	0.7893	0.7250	1.0000	41.18
vector	0.8262	0.7667	1.0000	25.94

A few things worth noting:

BM25 at 5.41ms p50 is the right default for latency-constrained cases where keyword recall is sufficient.
Vector achieves the best nDCG@10 (0.8262) and perfect recall at 25.94ms — the most balanced strategy for most workloads.
LLM reranking (page-index-llm, page-index-qwen) matches legacy-hybrid quality at comparable speed, validating the local Qwen path as a practical alternative to heavier hybrid pipelines.
page-index and page-index-hybrid both underperform BM25 on nDCG@10 (0.7000 and 0.5701 against 0.7262), a useful reminder that adding complexity doesn't always improve quality.
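For readers less familiar with the columns in the table, here is a sketch of the three quality metrics for a single query. These are the textbook definitions with the common log2 discount. sift's eval harness may differ in details such as gain handling, and the function names are mine. The reported figures are means of these per-query values over the query set.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 counts fully.
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked: list[str], relevant: dict[str, float], k: int = 10) -> float:
    gains = [relevant.get(doc_id, 0.0) for doc_id in ranked]
    ideal = dcg_at_k(sorted(relevant.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


def reciprocal_rank_at_k(ranked: list[str], relevant: dict[str, float], k: int = 10) -> float:
    # MRR@k is the mean of this value across queries.
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if relevant.get(doc_id, 0.0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: list[str], relevant: dict[str, float], k: int = 10) -> float:
    wanted = {doc_id for doc_id, gain in relevant.items() if gain > 0}
    if not wanted:
        return 0.0
    return len(wanted & set(ranked[:k])) / len(wanted)
```

Recall only asks whether the relevant documents made the top ten. nDCG and MRR also care how high they landed, which is why a strategy can post perfect recall and still trail on nDCG.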

Cache hit rates (100/0/100%) confirm the caching layer is working correctly across all strategies. Verbose output (-v, -vv) surfaces cache hit rates, phase timings, and ranking metadata directly in the CLI.

Why this matters for agents

For agents, latency and reliability are requirements, not nice-to-haves. Tooling loops fail hard when search is slow, drops context, or depends on services that may be unavailable.

sift removes that friction: retrieval is local, deterministic, and cheap to repeat. No daemon to health-check. No embedding service to rate-limit against. No cloud dependency to manage. The binary ships with Homebrew and static Linux artifact support, so agents can rely on a pinned version without environment drift.

How it was built

This shipped in a focused, nearly uninterrupted 24-hour push: implementation, eval design, benchmarking, performance tightening, packaging, and release prep in one sustained flow. Every major unit had acceptance criteria and measurable evidence attached before it was marked done.

What made that pace possible is something I'm not ready to talk about in detail yet. But sift is the first real proof that it works at speed, under real constraints, without cutting corners. More on that soon.

Get started

README — installation and basic usage
CONFIGURATION — strategy and model settings
EVALUATION — running your own corpus evals
ARCHITECTURE — internals deep dive

Code: github.com/rupurt/sift