<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dariusz Newecki</title>
    <description>The latest articles on Forem by Dariusz Newecki (@dariusz_newecki_e35b0924c).</description>
    <link>https://forem.com/dariusz_newecki_e35b0924c</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3635377%2Fed3a13b4-11d0-4f67-86de-6e2dd08a992e.png</url>
      <title>Forem: Dariusz Newecki</title>
      <link>https://forem.com/dariusz_newecki_e35b0924c</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dariusz_newecki_e35b0924c"/>
    <language>en</language>
    <item>
      <title>Four Gates. One Governor. Zero Code Written. CORE Is Autonomous.</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Wed, 13 May 2026 12:10:27 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/four-gates-one-governor-zero-code-written-core-is-autonomous-34b6</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/four-gates-one-governor-zero-code-written-core-is-autonomous-34b6</guid>
      <description>&lt;p&gt;&lt;em&gt;When I defined A3 fourteen weeks ago, I wrote: "The daemon runs continuously, the Blackboard clears, the codebase converges, and every action is visible." Today all four gates that operationalize that definition are closed. I want to be precise about what that means — and honest about where the evidence is still accumulating.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;What A3 Actually Is&lt;/h2&gt;

&lt;p&gt;A3 is not a version number. It is a state the system either is in or isn't.&lt;/p&gt;

&lt;p&gt;I defined it with four gates because "autonomous" is a claim that's easy to make and hard to prove. Each gate closes one dimension of the proof. You can't skip one and still make the claim honestly.&lt;/p&gt;

&lt;p&gt;The four gates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G1 — Loop closure.&lt;/strong&gt; An autonomous fix lands end-to-end on a real example. Finding detected → proposal created → proposal approved → execution succeeded → re-audit confirms resolution. Not against a toy. Against the live codebase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G2 — Convergence.&lt;/strong&gt; Sustained state where the rate of finding resolution exceeds the rate of finding creation. This is what makes "autonomous" mean something rather than describing a system that runs forever without making progress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G3 — Consequence chain.&lt;/strong&gt; Every action is traceable. Finding → Proposal → Approval → Execution → File changes → New findings — all six edges materialized as queryable rows. The governor doesn't have to read source code to know what happened. The chain is the answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;G4 — Governance in &lt;code&gt;.intent/&lt;/code&gt;.&lt;/strong&gt; No enforcement logic, path mappings, or policy thresholds live in &lt;code&gt;src/&lt;/code&gt;. All of it lives in &lt;code&gt;.intent/&lt;/code&gt; — human-authored files, read-only to CORE at runtime, never written by autonomous workers. This gate is the reason the governor role is real rather than nominal.&lt;/p&gt;

&lt;p&gt;All four are closed.&lt;/p&gt;
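&lt;p&gt;To make G3 concrete, here is a minimal sketch of what "edges materialized as queryable rows" can look like. The table and column names are hypothetical illustrations, not CORE's actual schema:&lt;/p&gt;

```python
import sqlite3

# Hypothetical single-table shape for the consequence chain; CORE's real
# schema is not shown here, so every name below is illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE consequences (
    id        INTEGER PRIMARY KEY,
    kind      TEXT,     -- finding, proposal, approval, execution, file_change
    parent_id INTEGER REFERENCES consequences(id),
    detail    TEXT
);
INSERT INTO consequences VALUES
    (1, 'finding',     NULL, 'modularity violation in worker.py'),
    (2, 'proposal',    1,    'split worker.py'),
    (3, 'approval',    2,    'approved by governor'),
    (4, 'execution',   3,    'commit recorded'),
    (5, 'file_change', 4,    'worker.py rewritten');
""")

def chain(db, row_id):
    """Walk any row back to its originating finding, oldest first."""
    links = []
    while row_id is not None:
        kind, parent, detail = db.execute(
            "SELECT kind, parent_id, detail FROM consequences WHERE id = ?",
            (row_id,),
        ).fetchone()
        links.append((kind, detail))
        row_id = parent
    return list(reversed(links))

for kind, detail in chain(conn, 5):
    print(kind, "::", detail)
```

&lt;p&gt;The point of the shape: any row can be walked back to its originating finding without reading source code.&lt;/p&gt;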




&lt;h2&gt;The One That Took Longest to Get Right&lt;/h2&gt;

&lt;p&gt;G3 closed first — May 1. G1 was proven during the 79-second self-heal I wrote about last week. G4 closed May 10, after a campaign that moved 32 operational config sections out of hardcoded &lt;code&gt;src/&lt;/code&gt; literals and into governed YAML, touching 113 files.&lt;/p&gt;

&lt;p&gt;G2 was the last one, and the most careful.&lt;/p&gt;

&lt;p&gt;The structural piece was a circuit-breaker. After N consecutive identical-signature proposal failures, the affected findings are marked DELEGATE and a hazard finding is posted to the Blackboard. What this does: it converts systematic errors — an LLM producing the same wrong output over and over, a rule with no valid automated fix — into governance signals rather than infinite churn. The system doesn't spin. It escalates.&lt;/p&gt;

&lt;p&gt;That's the architecture of convergence. The daemon can't get stuck in a loop it can't exit. Every unmappable pattern eventually surfaces as a human decision.&lt;/p&gt;
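&lt;p&gt;A minimal sketch of that circuit-breaker behavior (the threshold, method names, and failure-signature shape are illustrative assumptions, not CORE's API):&lt;/p&gt;

```python
from collections import defaultdict

class CircuitBreaker:
    """Sketch of the convergence guard: N identical failures escalate
    instead of retrying forever. Threshold and names are assumptions."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.streaks = defaultdict(int)   # failure signature: consecutive count
        self.delegated = set()            # findings marked DELEGATE
        self.hazards = []                 # governance signals for the Blackboard

    def record_failure(self, signature, finding_ids):
        self.streaks[signature] += 1
        if self.streaks[signature] == self.threshold:
            # Systematic error: stop churning, surface a human decision.
            self.delegated.update(finding_ids)
            self.hazards.append(f"hazard: {signature} failed {self.threshold} times")

    def record_success(self, signature):
        self.streaks[signature] = 0       # any success resets the streak

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record_failure("llm.one_module_split", {"finding-42"})
print(cb.delegated)   # {'finding-42'}
```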

&lt;p&gt;I closed G2 on May 12. Band D — 107 issues, fourteen weeks of engine integrity work — closed the same day.&lt;/p&gt;




&lt;h2&gt;What the Audit Shows&lt;/h2&gt;

&lt;p&gt;Current state: &lt;code&gt;core-admin code audit&lt;/code&gt; returns PASS, 20 findings.&lt;/p&gt;

&lt;p&gt;Fourteen weeks ago, before Band D started, the audit returned findings in the hundreds across namespaces we didn't even have rules for yet. The findings weren't noise — they were governance debt we couldn't see because the instruments weren't built yet.&lt;/p&gt;

&lt;p&gt;That's the counterintuitive thing about this kind of system. Adding a rule doesn't fix violations. It makes violations visible. When ADR-031 landed — no hardcoded runtime directory paths — it surfaced 40 pre-existing violations in one run. The audit went from PASS to FAIL. That FAIL was progress.&lt;/p&gt;

&lt;p&gt;20 findings at PASS is not a clean codebase. It's a codebase where every remaining finding is known, tracked, and either queued for autonomous remediation or parked as a deliberate human decision. The difference between "has findings" and "has uncontrolled findings" is the entire value proposition.&lt;/p&gt;




&lt;h2&gt;The Governor Role, Fourteen Weeks In&lt;/h2&gt;

&lt;p&gt;I am not a programmer. I have not written implementation code during this project.&lt;/p&gt;

&lt;p&gt;What I've done: defined constitutional rules, authored ADRs, reviewed proposals that required architectural judgment, held the line on decisions where the system wanted to go one way and the architecture required another. One example: when &lt;code&gt;modularity.class_too_large&lt;/code&gt; kept triggering on &lt;code&gt;PathResolver&lt;/code&gt;, the autonomous path wanted to split it. The architectural answer was an exclusion in governance config, with a documented removal condition. That decision belongs in &lt;code&gt;.intent/&lt;/code&gt;. It takes three lines of YAML, not a code change.&lt;/p&gt;
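&lt;p&gt;For illustration, such an exclusion might look like this in governance config. The key names are hypothetical; only the idea of a documented removal condition comes from the decision itself:&lt;/p&gt;

```yaml
# Hypothetical governance-config exclusion; actual keys are not shown here.
exclusions:
  - rule: modularity.class_too_large
    target: PathResolver
    reason: backbone class, autonomous split would hurt the architecture
    remove_when: PathResolver responsibilities are redistributed by a future ADR
```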

&lt;p&gt;The G4 gate is what makes this possible. When governance lives in &lt;code&gt;src/&lt;/code&gt;, changing it requires a programmer. When it lives in &lt;code&gt;.intent/&lt;/code&gt;, it requires a governor.&lt;/p&gt;




&lt;h2&gt;What "Done" Honestly Means&lt;/h2&gt;

&lt;p&gt;The machinery is complete. The empirical evidence is young.&lt;/p&gt;

&lt;p&gt;G2's structural guarantee — the circuit-breaker — is real. What I don't yet have is weeks of daemon logs showing sustained convergence across diverse rule namespaces, under varied load, with a full autonomous approval cycle running. The gate is closed by architecture. The demonstration is still accumulating.&lt;/p&gt;

&lt;p&gt;I'll write about that when the logs are there to show. The series has been honest about the distance between "designed to work" and "observed working." This is no different.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The system is autonomous. The next question is whether it's legible — to someone who isn't its author.&lt;/p&gt;

&lt;p&gt;That's Band E. The outward-facing work: making the consequence chain readable to a stranger, making the governor role demonstrable rather than described, making the case that a regulated-industry team could operate this without understanding the source code.&lt;/p&gt;

&lt;p&gt;The 79-second self-heal was the internal proof. The external proof is what comes next.&lt;/p&gt;




&lt;p&gt;CORE is a governed software factory, actively built by the method it describes — &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;. If you're building in the governed-AI or regulated-software space and this resonates, comments are open.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>79 Seconds: Our AI Governance System's First Autonomous Self-Heal</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sat, 09 May 2026 13:57:13 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/79-seconds-our-ai-governance-systems-first-autonomous-self-heal-53a8</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/79-seconds-our-ai-governance-systems-first-autonomous-self-heal-53a8</guid>
      <description>&lt;p&gt;&lt;em&gt;I am not a programmer. I wrote zero lines of code today. The system fixed itself.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;We've been building CORE — a deterministic governance runtime that surrounds AI with constitutional law so that AI mistakes are detectable, traceable, and recoverable. The pitch is simple: a non-programmer governor holds the &lt;em&gt;why&lt;/em&gt;, AI and workers handle the &lt;em&gt;how&lt;/em&gt;, and the constitution ensures nothing unauthorized happens.&lt;/p&gt;

&lt;p&gt;Today we proved it works. Not in a demo. Not against a toy example. Against a real system that had been stuck for four days.&lt;/p&gt;

&lt;h2&gt;The State of Things This Morning&lt;/h2&gt;

&lt;p&gt;The autonomous loop — detect violation → propose fix → approve → execute → verify — hadn't produced a successful commit in four days. The dashboard said &lt;code&gt;last_consequence: 4d ago&lt;/code&gt;. The blackboard (our shared state surface) had 55 open findings, none of which the loop could act on. Proposals were being generated and immediately rejected as structurally incoherent.&lt;/p&gt;

&lt;p&gt;From the outside it looked alive. Twenty active workers, sensors firing, heartbeats posting. But nothing was moving.&lt;/p&gt;

&lt;h2&gt;The Investigation&lt;/h2&gt;

&lt;p&gt;We didn't start by writing code. We started by asking questions of the system itself.&lt;/p&gt;

&lt;p&gt;The first query revealed the shape of the problem: 150 failed proposals, 0 executed today, the last consequence four days old. Dig deeper: 128 of those 150 failures were the same error — a constitutional gate blocking the same action, over and over. That's not a bug in the traditional sense. That's the system correctly enforcing its own laws while an upstream generator keeps producing proposals that violate them.&lt;/p&gt;
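&lt;p&gt;The diagnostic pattern is a plain aggregation over the proposal table. A sketch with an illustrative schema and made-up failure signatures:&lt;/p&gt;

```python
import sqlite3

# Illustrative schema and invented failure signatures; the aggregation
# shape is the point, not CORE's actual column names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proposals (status TEXT, failure_signature TEXT)")
conn.executemany(
    "INSERT INTO proposals VALUES (?, ?)",
    [("failed", "gate.blocked_action")] * 128 + [("failed", "plan.invalid")] * 22,
)

rows = conn.execute(
    """
    SELECT failure_signature, COUNT(*) AS n
    FROM proposals
    WHERE status = 'failed'
    GROUP BY failure_signature
    ORDER BY n DESC
    """
).fetchall()
print(rows)   # [('gate.blocked_action', 128), ('plan.invalid', 22)]
```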

&lt;p&gt;Then: the 55 "open" findings the remediator was supposed to act on — what were they actually? Mostly &lt;code&gt;blackboard.entry_stale&lt;/code&gt; meta-findings. The loop was trying to remediate its own observability noise. The actual code violations — 25 of them, confirmed by audit — were invisible, blocked by their own historical entries sitting in &lt;code&gt;abandoned&lt;/code&gt; status, which the sensor dedup treated as permanent silencers.&lt;/p&gt;

&lt;p&gt;Seven distinct root causes, nested. Each one blocking the diagnostic of the next.&lt;/p&gt;

&lt;h2&gt;What We Fixed&lt;/h2&gt;

&lt;p&gt;In order of discovery:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stale-finding storm.&lt;/strong&gt; The BlackboardShopManager was scanning all entry types for SLA violations — including heartbeats with a 10-minute SLA. Every daemon restart, thousands of old heartbeat entries immediately exceeded their SLA. One line added to the WHERE clause: &lt;code&gt;AND entry_type IN ('finding', 'proposal')&lt;/code&gt;. Storm stopped. Zero new stale findings in 3 minutes versus 3 per minute before.&lt;/p&gt;
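&lt;p&gt;A reproduction of the idea behind that one-line fix, with an illustrative table shape (only the &lt;code&gt;entry_type&lt;/code&gt; filter itself is the actual fix; everything else here is a stand-in):&lt;/p&gt;

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Illustrative blackboard table; only the entry_type filter is the real fix.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blackboard (entry_type TEXT, created_at TEXT)")
two_hours_ago = (datetime.now(timezone.utc) - timedelta(hours=2)).isoformat()
conn.executemany(
    "INSERT INTO blackboard VALUES (?, ?)",
    [("heartbeat", two_hours_ago)] * 1000
    + [("finding", two_hours_ago), ("proposal", two_hours_ago)],
)

sla_cutoff = (datetime.now(timezone.utc) - timedelta(minutes=10)).isoformat()
stale = conn.execute(
    """
    SELECT entry_type, COUNT(*)
    FROM blackboard
    WHERE ? > created_at
      AND entry_type IN ('finding', 'proposal')  -- the one-line fix
    GROUP BY entry_type
    ORDER BY entry_type
    """,
    (sla_cutoff,),
).fetchall()
print(stale)   # [('finding', 1), ('proposal', 1)] -- 1000 heartbeats no longer flagged
```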

&lt;p&gt;&lt;strong&gt;The consequence chain gap.&lt;/strong&gt; When a proposal completed successfully, the findings it had addressed stayed in &lt;code&gt;deferred_to_proposal&lt;/code&gt; status forever. The failure path had a revival method. The success path had nothing. New method: &lt;code&gt;resolve_deferred_entries_for_completed_proposal()&lt;/code&gt;. Symmetric with the failure path. Twelve lines of code.&lt;/p&gt;
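&lt;p&gt;A sketch of that symmetry. The statuses come from the description above; the plain-dict store stands in for the real Blackboard persistence layer:&lt;/p&gt;

```python
# Statuses are as described above; the dict store is an illustrative stand-in.
def resolve_deferred_entries_for_completed_proposal(findings, proposal_id):
    """Success-path mirror of the failure-path revival method."""
    for finding in findings:
        if (finding["status"] == "deferred_to_proposal"
                and finding["proposal_id"] == proposal_id):
            finding["status"] = "resolved"

findings = [
    {"id": "f1", "status": "deferred_to_proposal", "proposal_id": "p7"},
    {"id": "f2", "status": "deferred_to_proposal", "proposal_id": "p9"},
]
resolve_deferred_entries_for_completed_proposal(findings, "p7")
print([f["status"] for f in findings])   # ['resolved', 'deferred_to_proposal']
```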

&lt;p&gt;&lt;strong&gt;The proposal collapse.&lt;/strong&gt; The proposal generator was creating proposals for N files but only including one action — always targeting &lt;code&gt;scope.files[0]&lt;/code&gt;. A proposal claiming to fix 8 violations would touch exactly one file and leave 7 untouched. The fix: one ProposalAction per affected file, ordered 0 through N-1. The executor already supported multi-action proposals. Nobody had ever wired the generator correctly.&lt;/p&gt;
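&lt;p&gt;The generator fix, sketched (the &lt;code&gt;ProposalAction&lt;/code&gt; fields shown are assumptions, not CORE's actual dataclass):&lt;/p&gt;

```python
from dataclasses import dataclass

# ProposalAction's real fields are not shown here; these are assumptions.
@dataclass
class ProposalAction:
    order: int
    file_path: str

def build_actions(scope_files):
    # Before the fix, this was effectively [ProposalAction(0, scope_files[0])]
    # no matter how many files the proposal claimed to touch.
    return [ProposalAction(order=i, file_path=f) for i, f in enumerate(scope_files)]

actions = build_actions(["a.py", "b.py", "c.py"])
print([(a.order, a.file_path) for a in actions])
# [(0, 'a.py'), (1, 'b.py'), (2, 'c.py')]
```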

&lt;p&gt;&lt;strong&gt;The DELEGATE routing gap.&lt;/strong&gt; &lt;code&gt;modularity.class_too_large&lt;/code&gt; violations — class-level refactors that require human judgment — were marked &lt;code&gt;PENDING&lt;/code&gt; in the remediation map. PENDING entries are excluded from the active map by the loader. So those findings were claimed, found unmappable, and released back to open every 60 seconds. Forever. The fix was a YAML status change: &lt;code&gt;PENDING&lt;/code&gt; → &lt;code&gt;DELEGATE&lt;/code&gt;. The loader already handled DELEGATE entries. One word changed.&lt;/p&gt;
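&lt;p&gt;Illustratively, the remediation-map change was on this order (key names are assumptions; the status values are as described above):&lt;/p&gt;

```yaml
# Illustrative remediation-map entry; key names are assumptions.
modularity.class_too_large:
  status: DELEGATE   # was PENDING, which the loader silently excluded
  reason: class-level refactors require human judgment
```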

&lt;p&gt;&lt;strong&gt;The permanent-silence bug.&lt;/strong&gt; When we cleared the stale queue, we used &lt;code&gt;abandoned&lt;/code&gt; status. What we didn't know: &lt;code&gt;abandoned&lt;/code&gt; is treated the same as &lt;code&gt;open&lt;/code&gt; by the sensor dedup logic. "Already represented on the blackboard, do not re-post." So the violations we'd cleaned up were now permanently invisible. Filed as a design-level issue — &lt;code&gt;abandoned&lt;/code&gt; and "deliberately suppressed" need to be different states. Immediate fix: flip the cleaned-up &lt;code&gt;audit.violation::&lt;/code&gt; entries to &lt;code&gt;resolved&lt;/code&gt;, which the sensor correctly treats as "re-detectable."&lt;/p&gt;
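&lt;p&gt;A sketch of the dedup semantics in question (the state names are as described above; the predicate shape is an assumption):&lt;/p&gt;

```python
# State names are as described above; the predicate shape is an assumption.
SILENCING = {"open", "claimed", "deferred_to_proposal", "abandoned"}

def should_repost(existing_status):
    """Sensor dedup: re-post a violation only if nothing live represents it.
    'abandoned' silencing re-detection was the bug; 'resolved' is re-detectable."""
    return existing_status not in SILENCING

print(should_repost("abandoned"))   # False: cleanup made violations invisible
print(should_repost("resolved"))    # True: the safe status for cleared entries
```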

&lt;h2&gt;13:16:18&lt;/h2&gt;

&lt;p&gt;With the queue clean, the sensors unblocked, and the DELEGATE routing live, the loop had something to work with. A &lt;code&gt;needs_split&lt;/code&gt; violation appeared. The remediator created a proposal. We approved it — the first manual approval of the day.&lt;/p&gt;

&lt;p&gt;At 13:16:18, &lt;code&gt;ProposalConsumerWorker&lt;/code&gt; picked it up. &lt;code&gt;fix.modularity&lt;/code&gt; ran. The LLM took 33 seconds to analyze the file. It returned a plan.&lt;/p&gt;

&lt;p&gt;The plan had one module. The validator requires at least two for a split.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mark_failed&lt;/code&gt; ran. The file changes were reverted. The proposal was marked failed.&lt;/p&gt;

&lt;p&gt;Then: &lt;code&gt;revive_findings_for_failed_proposal&lt;/code&gt; ran. The deferred finding flipped back to open.&lt;/p&gt;

&lt;p&gt;At 13:17:37 — 79 seconds after failure — the finding was re-claimed, a new proposal was created, and it was sitting in the approval inbox.&lt;/p&gt;

&lt;p&gt;The loop had self-healed. Without intervention. Traceable at every step.&lt;/p&gt;

&lt;h2&gt;What "Self-Heal" Actually Means&lt;/h2&gt;

&lt;p&gt;The LLM produced bad output. The system caught it, reverted the change, put the work back in the queue, and asked again. No data was corrupted. No state was left inconsistent. The governor's role was to review the next proposal and decide whether to approve it.&lt;/p&gt;

&lt;p&gt;This is the regulated-industry argument for this kind of governance. You don't need AI to never fail. You need failure to be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detectable.&lt;/strong&gt; The validator caught a 1-module "split" plan before anything was committed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bounded.&lt;/strong&gt; The gate order — Conservation Gate, IntentGuard, plan validator — ensures AI output can't bypass constitutional constraints even if it tries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recoverable.&lt;/strong&gt; The revival mechanism returned the system to a known-good state. The finding was exactly as it was before the failed attempt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traceable.&lt;/strong&gt; Every step — finding posted, claimed, deferred, proposal created, approved, executing, failed, revived, re-claimed — is a timestamped row in a queryable table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The audit trail isn't bolted on. It's how the loop works.&lt;/p&gt;

&lt;h2&gt;The Governor Role&lt;/h2&gt;

&lt;p&gt;I am not a programmer. I wrote zero lines of code today.&lt;/p&gt;

&lt;p&gt;What I did: asked questions of the system, recognized when an answer pointed to a design gap rather than a bug, held the line on architectural decisions (backbone workers don't get split autonomously, regardless of what the violation detector says), and approved one proposal when the conditions were right.&lt;/p&gt;

&lt;p&gt;The rest was diagnosis, sequencing, and constitutional reasoning. The code came from Claude Code on the development machine, prompted by the analysis. The analysis came from reading the system's own outputs — queries, logs, dashboard — not from reading source files.&lt;/p&gt;

&lt;p&gt;That's the governor role. Not "I don't code therefore I'm not involved in technical work." The opposite: deeply involved in technical decisions, operating at the right level of abstraction, with a system that surfaces the right information to make those decisions.&lt;/p&gt;

&lt;p&gt;The 79-second self-heal wasn't despite the governance architecture. It was because of it.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;The loop machinery is sound. The next bottleneck is &lt;code&gt;fix.modularity&lt;/code&gt;'s prompt — the LLM needs to be told explicitly to produce at least two modules and given responsibility-grouping context from the audit findings. That's prompt engineering work, not infrastructure.&lt;/p&gt;

&lt;p&gt;When that's fixed, CORE will autonomously split files, verify the split, commit, re-audit, and confirm the finding is resolved — without a human writing a line of code.&lt;/p&gt;

&lt;p&gt;We're close.&lt;/p&gt;




&lt;p&gt;CORE is a governed software factory, actively being built by the method it describes — &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;source on GitHub&lt;/a&gt;. If you're building in the governed-AI or regulated-software space and this resonates, comments are open.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>governance</category>
      <category>devops</category>
      <category>programming</category>
    </item>
    <item>
      <title>CORE Closed Its Audit Trail. Then Found 18 Engine Gaps It Couldn't See Before.</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Fri, 01 May 2026 21:35:48 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/core-closed-its-audit-trail-then-found-18-engine-gaps-it-couldnt-see-before-d00</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/core-closed-its-audit-trail-then-found-18-engine-gaps-it-couldnt-see-before-d00</guid>
      <description>&lt;p&gt;Six weeks ago I published a post here titled &lt;em&gt;"Your Agent Has Two Logs. One of Them Doesn't Exist Yet."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This week, Band B closed. CORE's second log exists.&lt;/p&gt;

&lt;p&gt;Here's what that actually means — and why closing it immediately made things harder.&lt;/p&gt;




&lt;h2&gt;The two-log problem, briefly&lt;/h2&gt;

&lt;p&gt;Every autonomous system that touches production code has two logs whether it admits it or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log one:&lt;/strong&gt; what happened. Files changed, tests ran, commits landed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log two:&lt;/strong&gt; &lt;em&gt;why&lt;/em&gt; it happened. What finding triggered what proposal. What approval authorized what execution. What execution caused what file change. What file change produced what new finding.&lt;/p&gt;

&lt;p&gt;Log two is the audit trail. In a regulated environment, log two isn't optional — it's the difference between a system you can defend and one you can't.&lt;/p&gt;

&lt;p&gt;CORE had log one. Log two was missing.&lt;/p&gt;




&lt;h2&gt;What Band B actually required&lt;/h2&gt;

&lt;p&gt;Eight issues. Four ADRs. Seven coordinated write-path decisions — where in the code does attribution get written, in what shape, guaranteed by what gate.&lt;/p&gt;

&lt;p&gt;The hard part wasn't the code. It was making the causality chain &lt;em&gt;complete&lt;/em&gt;. Every link had to be present:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finding → which proposal claimed it (and when)&lt;/li&gt;
&lt;li&gt;Proposal → which execution consumed it (and what commit resulted)&lt;/li&gt;
&lt;li&gt;Execution → which new findings it produced&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Miss one link and the chain is decoration, not evidence.&lt;/p&gt;

&lt;p&gt;196 commits in April. 25 issues closed. Band B: 8 closed, 0 open.&lt;/p&gt;




&lt;h2&gt;What happened immediately after&lt;/h2&gt;

&lt;p&gt;Band D opened with 18 issues.&lt;/p&gt;

&lt;p&gt;Not because we introduced regressions. Because closing Band B made the engine's integrity gaps visible in a way they weren't before. You can't measure attribution fidelity until attribution exists. Once it does, you can see exactly where the engine fails to populate it correctly.&lt;/p&gt;

&lt;p&gt;This is the convergence principle working as designed. The system gets more capable. It immediately finds more problems with itself. The audit PASS holds — 19 active workers, findings are warnings about modularity, not governance failures. But the work queue doesn't shrink when a band closes. It shifts.&lt;/p&gt;




&lt;h2&gt;What "GxP-load-bearing" means in practice&lt;/h2&gt;

&lt;p&gt;I've been building CORE in part for environments like pharmaceutical manufacturing — where an AI system that modifies code or configuration needs to prove it acted within authorized boundaries, on authorized intent, with a complete audit trail.&lt;/p&gt;

&lt;p&gt;GxP (Good Practice regulations) doesn't care what your system &lt;em&gt;can&lt;/em&gt; do. It cares what your system &lt;em&gt;can prove&lt;/em&gt; it did.&lt;/p&gt;

&lt;p&gt;Band B is the difference between CORE being a capable tool and CORE being a defensible tool. The second log is what makes it defensible.&lt;/p&gt;




&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;Band D: engine integrity. 18 open issues. The system that now has a complete audit trail needs its engine tightened before those traces are fully trustworthy.&lt;/p&gt;

&lt;p&gt;Then Band E: external validation. CORE governing a repository it didn't build.&lt;/p&gt;

&lt;p&gt;The second log exists. Now we make sure everything it records is true.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CORE is open source: &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous in this series: &lt;a href="https://dev.to/dariusz_newecki_e35b0924c/your-agent-has-two-logs-one-of-them-doesnt-exist-yet-253a"&gt;Your Agent Has Two Logs. One of Them Doesn't Exist Yet.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>python</category>
      <category>core</category>
    </item>
    <item>
      <title>My Audit Caught My Audit Being Wrong</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sat, 25 Apr 2026 22:16:52 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/my-audit-caught-my-audit-being-wrong-42b6</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/my-audit-caught-my-audit-being-wrong-42b6</guid>
      <description>&lt;p&gt;&lt;em&gt;And that's exactly what it's supposed to do.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;A few days ago I ran a diagnostic on CORE — the governance system I'm building that supervises AI-generated code. The diagnostic was supposed to investigate why a specific audit rule appeared to be silently failing. Not firing. Producing zero findings against files it should have flagged.&lt;/p&gt;

&lt;p&gt;I ran the investigation carefully. Stage by stage. I came to a conclusion.&lt;/p&gt;

&lt;p&gt;The conclusion was wrong.&lt;/p&gt;

&lt;p&gt;And I only found that out because the system itself told me so.&lt;/p&gt;




&lt;h2&gt;What I thought was happening&lt;/h2&gt;

&lt;p&gt;CORE has an audit rule called &lt;code&gt;autonomy.tracing.mandatory&lt;/code&gt;. It checks that any class ending in &lt;code&gt;Agent&lt;/code&gt; contains a mandatory call to &lt;code&gt;self.tracer.record&lt;/code&gt;. The logic is straightforward: if an autonomous agent produces work, that work must be traceable. No tracing call — the rule flags it.&lt;/p&gt;

&lt;p&gt;My notes said the rule was producing zero findings against &lt;code&gt;SelfHealingAgent&lt;/code&gt; — a class with, in fact, zero tracer references. A rule designed to catch exactly that situation, catching nothing.&lt;/p&gt;

&lt;p&gt;That's a governance gap. If a rule exists and silently fails, you don't have an audit system. You have a theatrical one.&lt;/p&gt;

&lt;p&gt;So I investigated.&lt;/p&gt;




&lt;h2&gt;What I actually found&lt;/h2&gt;

&lt;p&gt;The rule was firing. Correctly. Both findings were present, cleanly, in &lt;code&gt;reports/audit_findings.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"check_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"autonomy.tracing.mandatory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Line 51: missing mandatory call(s): ['self.tracer.record']"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"src/will/agents/self_healing_agent.py"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system wasn't broken. The diagnostic's starting assumption was broken.&lt;/p&gt;

&lt;p&gt;Here's where it came from. CORE's audit output is rendered through Rich — a Python library that produces beautiful terminal tables with color, alignment, and spacing. Rich also truncates long strings to fit columns. So &lt;code&gt;autonomy.tracing.mandatory&lt;/code&gt; becomes &lt;code&gt;autonomy.tracing.mandat…&lt;/code&gt; on screen.&lt;/p&gt;

&lt;p&gt;When I ran &lt;code&gt;grep 'tracing.mandatory'&lt;/code&gt; against the captured terminal output to verify the finding, I got zero matches. Not because the finding wasn't there — because Rich had silently eaten the last four characters of the rule name, and my grep pattern was looking for the full string.&lt;/p&gt;

&lt;p&gt;I used display output as an oracle. Display output lied.&lt;/p&gt;

&lt;p&gt;The JSON source of truth never did.&lt;/p&gt;
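&lt;p&gt;The trap is easy to reproduce. A simulated width-limited cell, using the finding above:&lt;/p&gt;

```python
import json

# The finding above, as data; the fixed-width cell simulates a table
# renderer's ellipsis truncation.
finding = {"check_id": "autonomy.tracing.mandatory", "severity": "warning"}

cell_width = 24
display = finding["check_id"][: cell_width - 1] + "…"
print(display)   # autonomy.tracing.mandat…

print("tracing.mandatory" in display)              # False: grep on rendered text
print("tracing.mandatory" in json.dumps(finding))  # True: grep on the artifact
```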




&lt;h2&gt;The stage-by-stage result&lt;/h2&gt;

&lt;p&gt;I re-ran the diagnostic properly, going to primary sources instead of rendered output:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rule loaded and mapped&lt;/td&gt;
&lt;td&gt;PASS — rule extracted, bound to &lt;code&gt;ast_gate&lt;/code&gt; engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope resolution&lt;/td&gt;
&lt;td&gt;PASS — &lt;code&gt;self_healing_agent.py&lt;/code&gt; in scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine dispatch&lt;/td&gt;
&lt;td&gt;PASS — engine ran against the file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto-ignore&lt;/td&gt;
&lt;td&gt;PASS — zero suppressions, nothing dropped silently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finding emitted&lt;/td&gt;
&lt;td&gt;PASS — present in &lt;code&gt;audit_findings.json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every stage passed. The investigation had no failure to explain, because there was no failure. It was investigating a ghost.&lt;/p&gt;

&lt;p&gt;Direct engine invocation confirmed it independently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Standalone check — no orchestrator involved
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;GenericASTChecks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_selected&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GenericASTChecks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate_requirement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requirement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
# ClassDef SelfHealingAgent -&amp;gt; missing mandatory call(s): ['self.tracer.record']
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same verdict. No ambiguity.&lt;/p&gt;




&lt;h2&gt;Why this matters more than "I made a mistake"&lt;/h2&gt;

&lt;p&gt;I'm building a system where AI generates code and a deterministic governance layer audits it. The entire value proposition is that the governance layer is trustworthy. Not smart — &lt;em&gt;trustworthy&lt;/em&gt;. You need to be able to look at a finding and know it reflects reality. You need to be able to look at a clean audit and know the system actually checked.&lt;/p&gt;

&lt;p&gt;That's called instrument qualification. In regulated industries — pharmaceuticals, medical devices, aerospace — you don't just validate the product. You validate the instruments you used to measure the product. A thermometer that reads 37°C when the actual temperature is 39°C isn't a minor inconvenience. It's a systematic lie that compounds silently across every reading it ever produces.&lt;/p&gt;

&lt;p&gt;I accidentally demonstrated the same principle in software.&lt;/p&gt;

&lt;p&gt;When I used &lt;code&gt;grep&lt;/code&gt; against Rich-rendered terminal output, I was reading from an instrument I hadn't qualified. Rich is a display library. It's not a data source. It's designed to make things readable to humans, not parseable by machines. Using it as a source of truth for a diagnostic is about as reliable as taking a patient's temperature with a ruler.&lt;/p&gt;

&lt;p&gt;The JSON report is the qualified instrument. It's the canonical output. It doesn't truncate. It doesn't wrap. It doesn't abbreviate for column fit. It says what the system found.&lt;/p&gt;
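&lt;p&gt;As a minimal sketch of what "read the qualified instrument" looks like in practice: query the JSON report structurally instead of grepping rendered text. The report shape here (a top-level &lt;code&gt;findings&lt;/code&gt; list with &lt;code&gt;rule&lt;/code&gt; and &lt;code&gt;message&lt;/code&gt; keys) is an illustrative assumption, not CORE's actual schema.&lt;/p&gt;

```python
import json

def findings_for_rule(report_text, rule_id):
    """Return every finding for one rule from a JSON audit report."""
    report = json.loads(report_text)
    # Structural query: no truncation, no wrapping, no column fitting.
    return [f for f in report.get("findings", []) if f.get("rule") == rule_id]

# Illustrative report content, not real CORE output.
sample = json.dumps({
    "findings": [
        {"rule": "mandatory_call", "message": "missing mandatory call(s)"},
        {"rule": "style.import_order", "message": "group out of order"},
    ]
})
hits = findings_for_rule(sample, "mandatory_call")
```

A zero-match result from this query means the rule genuinely produced no findings; a zero-match result from grepping a Rich table only means the string didn't survive rendering.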

&lt;p&gt;&lt;strong&gt;A passing audit with many findings is less honest than a failing audit with fewer real ones. An instrument that gives you clean-looking output that misrepresents reality isn't helping you — it's flattering you.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I changed
&lt;/h2&gt;

&lt;p&gt;Two things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One: I added the stale references explicitly to the diagnostic record.&lt;/strong&gt; My notes had two wrong module paths that would have caused anyone running the diagnostic in the future to hit &lt;code&gt;ImportError&lt;/code&gt; immediately. &lt;code&gt;AuditorContext&lt;/code&gt; is not in &lt;code&gt;mind.logic.engines.ast_gate.base&lt;/code&gt; — it's in &lt;code&gt;mind.governance.audit_context&lt;/code&gt;. I documented both as stale references, with the correct paths. Constitutional debt is honest debt. Hiding it helps no one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two: I documented the grep-against-Rich anti-pattern.&lt;/strong&gt; Not as a personal failure, but as a category. If I did it, someone else will do it, or I'll do it again in six months under pressure. The pattern needs a name so it can be recognized.&lt;/p&gt;




&lt;h2&gt;
  
  
  The uncomfortable version
&lt;/h2&gt;

&lt;p&gt;Here's the uncomfortable version of this story: I almost propagated the wrong conclusion.&lt;/p&gt;

&lt;p&gt;If I'd stopped at "zero grep matches, rule is not firing," I would have written a finding that said the governance system had a blind spot. I might have gone looking for a fix in the wrong place. I might have introduced a workaround that solved a problem that didn't exist, while leaving a different problem — the unreliable diagnostic method — completely intact.&lt;/p&gt;

&lt;p&gt;In a system that supervises autonomous AI code generation, a wrong finding about your audit rules is worse than a missing finding. A missing finding is a gap. A wrong finding is a confidence injection. You become &lt;em&gt;more&lt;/em&gt; certain the system is broken in a specific way, and that certainty guides you away from the actual state.&lt;/p&gt;

&lt;p&gt;That's the failure mode I'm most worried about in AI-supervised systems generally. Not that the AI is wrong — everyone accepts the AI might be wrong. The failure mode is when the verification layer produces plausible-looking output that you stop checking.&lt;/p&gt;

&lt;p&gt;CORE is built on the assumption that every layer lies until verified. Including the diagnostic layer. Including me.&lt;/p&gt;

&lt;p&gt;I'm not a programmer. I'm closer to a lawmaker than a coder. I built a governance system because I understand governance better than I understand AST traversal. Swimming against a current you can't even see clearly is exactly the situation where you need your instruments to be honest. Flattery is the thing that drowns you.&lt;/p&gt;

&lt;p&gt;The system didn't flatter me. That's not a bug. That's the only thing I actually need it to do.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CORE is an open-source, deterministic governance runtime for AI-generated code. You can find it at &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>programming</category>
      <category>devops</category>
    </item>
    <item>
      <title>The First Test CORE Ever Wrote For Itself</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sat, 18 Apr 2026 14:40:25 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/the-first-test-core-ever-wrote-for-itself-2k44</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/the-first-test-core-ever-wrote-for-itself-2k44</guid>
      <description>&lt;p&gt;&lt;em&gt;And why it was wrong — and why that's exactly the point.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Today, at 16:24 CET, my system wrote a test file for itself.&lt;/p&gt;

&lt;p&gt;Not a test I wrote. Not a test a developer wrote. A test that CORE — my constitutional governance runtime — autonomously detected was missing, proposed to generate, waited for my approval, and then wrote using its own CoderAgent.&lt;/p&gt;

&lt;p&gt;The test was wrong. The methods it tested don't exist. The API it assumed was hallucinated.&lt;/p&gt;

&lt;p&gt;And I'm more excited about this than if it had been perfect.&lt;/p&gt;




&lt;h2&gt;
  
  
  What CORE is (briefly)
&lt;/h2&gt;

&lt;p&gt;CORE is a deterministic governance runtime that surrounds AI code generation with constitutional law. AI produces code, but every output is verified against rules, audited, and must pass governance gates before execution. The human role is governor — not programmer.&lt;/p&gt;

&lt;p&gt;I've written about this system before. The previous milestone was &lt;a href="https://dev.to/dariusznewecki/when-my-ai-blocked-itself"&gt;when CORE blocked itself&lt;/a&gt; — a rule violation preventing its own remediation from executing. Today's milestone is different. Today, the system grew a new autonomous capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stream B: closing the test loop
&lt;/h2&gt;

&lt;p&gt;CORE already has a working autonomous loop for code quality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AuditViolationSensor detects violation
  → ViolationRemediatorWorker creates proposal
  → ProposalConsumerWorker executes fix
  → Sensor re-runs — finding resolves
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stream B was the same loop, but for test coverage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TestCoverageSensor detects missing test
  → TestRunnerSensor confirms (pytest)
  → TestRemediatorWorker creates build.tests proposal
  → ProposalConsumerWorker executes → CoderAgent writes test
  → TestRunnerSensor re-runs — pass or fail finding posted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two of these components didn't exist this morning. We built them today and wired all four together.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;TestCoverageSensor&lt;/code&gt;&lt;/strong&gt; — scans &lt;code&gt;src/&lt;/code&gt; for Python files with no corresponding test file. Posts &lt;code&gt;test.run_required::&lt;/code&gt; findings to the Blackboard. Critically: the scan parameters (source root, test root, excluded filenames) are read from &lt;code&gt;.intent/enforcement/config/test_coverage.yaml&lt;/code&gt; at runtime. No paths hardcoded in Python. Changing what gets scanned is a constitution edit, not a code change.&lt;/p&gt;
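&lt;p&gt;The scan itself is simple enough to sketch. This is an illustrative reduction, not CORE's implementation — the function name and the convention that &lt;code&gt;src/foo.py&lt;/code&gt; pairs with &lt;code&gt;tests/test_foo.py&lt;/code&gt; are assumptions; in CORE the real parameters come from &lt;code&gt;.intent/enforcement/config/test_coverage.yaml&lt;/code&gt;.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def missing_tests(source_root, test_root, excluded=()):
    """Return source files that have no tests/test_stem.py counterpart."""
    gaps = []
    for src in sorted(Path(source_root).rglob("*.py")):
        if src.name in excluded:
            continue
        if not (Path(test_root) / f"test_{src.stem}.py").exists():
            gaps.append(src.name)
    return gaps

# Tiny demo tree: one covered module, one coverage gap.
root = Path(tempfile.mkdtemp())
(root / "src").mkdir()
(root / "tests").mkdir()
(root / "src" / "covered.py").write_text("")
(root / "tests" / "test_covered.py").write_text("")
(root / "src" / "gap.py").write_text("")
gaps = missing_tests(root / "src", root / "tests", excluded=("__init__.py",))
```

Because the roots and exclusions are parameters rather than constants, changing what gets scanned stays a configuration edit.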

&lt;p&gt;&lt;strong&gt;&lt;code&gt;TestRunnerSensor&lt;/code&gt;&lt;/strong&gt; — already existed, just paused. Consumes &lt;code&gt;test.run_required::&lt;/code&gt; findings, runs pytest, posts &lt;code&gt;test.missing&lt;/code&gt; or &lt;code&gt;test.failure&lt;/code&gt;. Activated today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;TestRemediatorWorker&lt;/code&gt;&lt;/strong&gt; — new acting worker. Claims &lt;code&gt;test.missing&lt;/code&gt; and &lt;code&gt;test.failure&lt;/code&gt; findings, groups by &lt;code&gt;source_file&lt;/code&gt;, creates one &lt;code&gt;build.tests&lt;/code&gt; proposal per file. Per-file deduplication: two concurrent proposals for different files are valid and don't block each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;build.tests&lt;/code&gt; AtomicAction&lt;/strong&gt; — already existed in the registry. Takes &lt;code&gt;source_file&lt;/code&gt;, calls CoderAgent, runs auto-heal pipeline (fix.imports, fix.headers, fix.format), IntentGuard validation, writes the test file.&lt;/p&gt;

&lt;p&gt;Four components. One closed loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The bugs we hit
&lt;/h2&gt;

&lt;p&gt;I'm going to be honest about the path here, because the bugs were instructive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 1: &lt;code&gt;entry_id&lt;/code&gt; vs &lt;code&gt;id&lt;/code&gt;.&lt;/strong&gt;&lt;br&gt;
The BlackboardService contract is clear — all finding dicts use key &lt;code&gt;"id"&lt;/code&gt;. Somewhere along the way, three files in the codebase had &lt;code&gt;finding["entry_id"]&lt;/code&gt; — confusing a local variable name with the dict key. Same fix three times: &lt;code&gt;finding["id"]&lt;/code&gt;. The lesson: a contract stated only in docstrings is a contract that will be violated. CORE's next step is schema-level enforcement.&lt;/p&gt;
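&lt;p&gt;A minimal sketch of what that schema-level check could look like — reject the wrong key at the boundary instead of trusting docstrings. The function and its required-key set are illustrative assumptions, not CORE code.&lt;/p&gt;

```python
def validate_finding(finding):
    """Return an error string if the dict violates the contract, else None."""
    if "entry_id" in finding:
        # The exact bug from this session: local variable name leaked
        # into the dict key.
        return "finding uses 'entry_id'; the contract key is 'id'"
    if "id" not in finding:
        return "finding is missing required key 'id'"
    return None

ok = validate_finding({"id": 7, "subject": "test.missing"})
err = validate_finding({"entry_id": 7})
```

Enforced at every Blackboard write, this turns a silent three-file drift into an immediate, attributable failure.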

&lt;p&gt;&lt;strong&gt;Bug 2: Subject prefix mismatch.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;ViolationRemediatorWorker&lt;/code&gt; only claims findings with prefix &lt;code&gt;audit.violation::&lt;/code&gt;. &lt;code&gt;test.missing::&lt;/code&gt; findings sat on the Blackboard unclaimed — the remediation map had the right entries but the worker never saw them. Option A (widen prefix) was ruled out: the worker's core loop reads &lt;code&gt;payload["rule"]&lt;/code&gt; for routing, and test findings have no &lt;code&gt;rule&lt;/code&gt; key. Option C (dedicated worker) was the right call. &lt;code&gt;TestRemediatorWorker&lt;/code&gt; was built. Single responsibility, clean separation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bug 3: &lt;code&gt;action_executor&lt;/code&gt; not available in daemon context.&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;build.tests&lt;/code&gt; calls &lt;code&gt;core_context.action_executor&lt;/code&gt;. At CLI bootstrap time, this attribute is monkey-patched onto CoreContext. The daemon doesn't do this — it passes a bare context. The fix was a &lt;code&gt;hasattr&lt;/code&gt; guard, already canonically established in &lt;code&gt;ViolationExecutorWorker&lt;/code&gt; with a comment explaining exactly this failure mode. Before applying it, I asked Claude Code to assess the blast radius: three sites in daemon paths were affected. We fixed the blocking one now; the other two go on the Phase 4 queue. Surgical over broad.&lt;/p&gt;
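&lt;p&gt;The guard pattern itself is small enough to show. Class and attribute names mirror the article (&lt;code&gt;action_executor&lt;/code&gt; on a context object); the surrounding scaffolding is an illustrative assumption.&lt;/p&gt;

```python
class BareContext:
    """Stands in for a daemon-side context with no executor patched on."""
    pass

def run_build_tests(core_context):
    """Degrade explicitly when the CLI bootstrap never ran."""
    if not hasattr(core_context, "action_executor"):
        # Daemon path: action_executor is monkey-patched on only at CLI
        # bootstrap. Fail soft with a routable reason, not AttributeError.
        return {"status": "skipped", "reason": "action_executor unavailable"}
    return {"status": "executed"}

result = run_build_tests(BareContext())
```

The point of the guard is not robustness for its own sake — it converts a crash into a finding the rest of the loop can see and act on.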




&lt;h2&gt;
  
  
  The first test
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TestBlackboardAuditor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unittest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TestCase&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_audit_with_valid_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;mock_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Task 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auditor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;audit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mock_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assertIn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;BlackboardAuditor&lt;/code&gt; has no &lt;code&gt;audit()&lt;/code&gt; method. It has &lt;code&gt;run()&lt;/code&gt;, &lt;code&gt;run_loop()&lt;/code&gt;, SLA-tier checking, stale entry detection. The LLM invented an API from the class name alone.&lt;/p&gt;

&lt;p&gt;Why am I not disappointed?&lt;/p&gt;

&lt;p&gt;Because this is iteration zero. The infrastructure works — detection, proposal creation, approval gate, execution, git commit. The quality of the generated test is a separate concern, and it's an addressable one. CoderAgent generated tests without reading the source file first. The fix is to pass the source content as context before generation. That's a &lt;code&gt;build_tests_action.py&lt;/code&gt; improvement for the next session.&lt;/p&gt;

&lt;p&gt;More importantly: the system &lt;em&gt;caught its own mistake&lt;/em&gt;. &lt;code&gt;TestRunnerSensor&lt;/code&gt; will run, the tests will fail, &lt;code&gt;test.failure&lt;/code&gt; findings will be posted, a repair proposal will be created. The loop continues.&lt;/p&gt;




&lt;h2&gt;
  
  
  What "autonomous" actually means here
&lt;/h2&gt;

&lt;p&gt;I approved the proposal. I didn't write the test. I didn't write the sensor. I didn't wire the pipeline. I didn't debug the &lt;code&gt;entry_id&lt;/code&gt; bug — I read the trace, stated the contract, Claude Code applied the fix.&lt;/p&gt;

&lt;p&gt;My role today was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural decisions (Option A vs B vs C for the subject prefix problem)&lt;/li&gt;
&lt;li&gt;Scope control (one file, not 741)&lt;/li&gt;
&lt;li&gt;Approval gating (three proposals created, three reviewed, two rejected for cause, one approved)&lt;/li&gt;
&lt;li&gt;Quality judgment (the test is wrong — that's useful signal, not a failure)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the governor role. Not programming. Governing.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest state
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What works:&lt;/strong&gt; The loop closes. Coverage gap detected → test proposed → human approves → test written → failure detected → repair proposed. End-to-end autonomous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What doesn't yet:&lt;/strong&gt; The generated tests are hallucinated. CoderAgent wrote tests for an API that doesn't exist because it had no context about what &lt;code&gt;BlackboardAuditor&lt;/code&gt; actually does. The path mapping between &lt;code&gt;src/&lt;/code&gt; and &lt;code&gt;tests/&lt;/code&gt; is also hardcoded in two of the three pipeline files — a drift risk I'm aware of and haven't fixed yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's next:&lt;/strong&gt; The fix is the same pattern CORE already uses for code remediation: build a context package first. Read the source. Understand the architectural role. Then generate. &lt;code&gt;ViolationRemediator&lt;/code&gt; calls &lt;code&gt;RemediationInterpretationService.build_reasoning_brief_dict()&lt;/code&gt; before invoking any LLM — it passes actual method signatures, constitutional role, and import graph as the reasoning brief. &lt;code&gt;build.tests&lt;/code&gt; skips this step entirely. The infrastructure exists. It just isn't wired yet. Fix that, fix the path mapping to read from &lt;code&gt;.intent/&lt;/code&gt; everywhere, then open the scope beyond one file.&lt;/p&gt;

&lt;p&gt;The tally today: one file with tests that fail. Tomorrow: the same loop repairs them.&lt;/p&gt;




&lt;h2&gt;
  
  
  On instrument qualification
&lt;/h2&gt;

&lt;p&gt;I've written before about the GxP principle I apply to CORE: &lt;em&gt;an instrument must be qualified before you trust its readings&lt;/em&gt;. An audit with 252 findings that passes is less trustworthy than one with 78 findings that fails.&lt;/p&gt;

&lt;p&gt;Today's first test is wrong. But the instrument that detected "this file has no tests" is correct. The instrument that detected "this test fails" will also be correct.&lt;/p&gt;

&lt;p&gt;The loop doesn't need perfect tests to be useful. It needs honest sensors.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CORE is open source. The architecture documents, constitutional rules, and implementation are all public at &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;. Documentation at &lt;a href="https://dariusznewecki.github.io/CORE" rel="noopener noreferrer"&gt;dariusznewecki.github.io/CORE&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Previous article in this series: &lt;a href="https://dev.to/dariusz_newecki_e35b0924c/the-ai-that-refused-to-ship-its-own-fix-1m1"&gt;The AI That Refused To Ship Its Own Fix&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>architecture</category>
      <category>core</category>
    </item>
    <item>
      <title>When My Governance System Governed Itself Wrong</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Tue, 14 Apr 2026 20:08:04 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/when-my-governance-system-governed-itself-wrong-17c</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/when-my-governance-system-governed-itself-wrong-17c</guid>
      <description>&lt;p&gt;&lt;em&gt;I built a sensor to detect import order violations. It found 152. The fixer found 0. One of them was lying.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;CORE is a deterministic governance runtime I'm building around AI code generation. The core idea is simple: AI produces code, but AI is never trusted. Every output passes through constitutional rules, audit engines, and remediation loops before anything touches the codebase.&lt;/p&gt;

&lt;p&gt;One of those loops works like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AuditViolationSensor detects violation
    → posts finding to Blackboard
ViolationRemediatorWorker claims finding
    → dispatches AtomicAction (fix.imports, fix.ids, fix.headers, etc.)
Sensor runs again
    → confirms violation gone or re-posts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the convergence loop. The goal is that the Blackboard empties over time as violations get fixed. That's what I call A3 — the daemon runs continuously and the codebase converges without me touching anything.&lt;/p&gt;

&lt;p&gt;This session I was closing sensor coverage gaps. Several fix actions in &lt;code&gt;dev sync&lt;/code&gt; had no corresponding sensor, meaning the daemon was blind to those violations and a human had to run &lt;code&gt;dev sync&lt;/code&gt; manually to keep things clean. Not autonomous. Not A3.&lt;/p&gt;

&lt;p&gt;One of the gaps was &lt;code&gt;style.import_order&lt;/code&gt;. I wrote the sensor, wired it up, restarted the daemon.&lt;/p&gt;

&lt;p&gt;152 findings.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The sensor was using an AST-based implementation — &lt;code&gt;check_import_order&lt;/code&gt; — that classifies imports into groups: &lt;code&gt;future&lt;/code&gt;, &lt;code&gt;stdlib&lt;/code&gt;, &lt;code&gt;third_party&lt;/code&gt;, &lt;code&gt;internal&lt;/code&gt;. It then checks that the groups appear in the right order.&lt;/p&gt;

&lt;p&gt;The fixer uses &lt;code&gt;ruff --select I&lt;/code&gt;, which does the same job but reads its configuration from &lt;code&gt;pyproject.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[tool.ruff.lint.isort]&lt;/span&gt;
&lt;span class="py"&gt;known-first-party&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"cli"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"features"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"mind"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"services"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"shared"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"will"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;section-order&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"future"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"standard-library"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"third-party"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"first-party"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"local-folder"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I ran &lt;code&gt;fix.imports --write&lt;/code&gt; to clean up before activating the sensor. Zero violations after. Then I activated the sensor. 152 violations.&lt;/p&gt;

&lt;p&gt;The sensor and the fixer disagreed on what "correctly ordered imports" means.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding the Root Cause
&lt;/h2&gt;

&lt;p&gt;I picked the simplest failing file — &lt;code&gt;src/cli/resources/admin/patterns.py&lt;/code&gt; — violation at line 7:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;typer&lt;/span&gt;                              &lt;span class="c1"&gt;# third_party → idx 2
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.cli_utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;core_command&lt;/span&gt; &lt;span class="c1"&gt;# internal   → idx 3
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;                      &lt;span class="c1"&gt;# ???
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sensor's &lt;code&gt;_classify_root&lt;/code&gt; function takes the module name and classifies it. For &lt;code&gt;from .hub import app&lt;/code&gt;, a relative import, &lt;code&gt;stmt.module&lt;/code&gt; is &lt;code&gt;"hub"&lt;/code&gt;. &lt;code&gt;"hub"&lt;/code&gt; is not in &lt;code&gt;stdlib_names&lt;/code&gt; and not in &lt;code&gt;internal_roots&lt;/code&gt;, so it falls through to &lt;code&gt;third_party&lt;/code&gt; — index 2.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;shared&lt;/code&gt; was classified as &lt;code&gt;internal&lt;/code&gt; — index 3.&lt;/p&gt;

&lt;p&gt;Index 2 after index 3 → violation.&lt;/p&gt;

&lt;p&gt;Ruff treats relative imports as &lt;code&gt;local-folder&lt;/code&gt;, which comes &lt;em&gt;after&lt;/em&gt; &lt;code&gt;first-party&lt;/code&gt; in the section order. So ruff considers this file clean. The sensor considers it broken.&lt;/p&gt;

&lt;p&gt;Two problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 1 — relative imports.&lt;/strong&gt; The sensor had no concept of them. Any &lt;code&gt;from .something import X&lt;/code&gt; got classified as &lt;code&gt;third_party&lt;/code&gt; because the module name (&lt;code&gt;something&lt;/code&gt;) didn't match any known root. Fix: detect &lt;code&gt;stmt.level &amp;gt; 0&lt;/code&gt; in &lt;code&gt;ast.ImportFrom&lt;/code&gt; and classify as &lt;code&gt;local&lt;/code&gt; with the highest order index.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem 2 — internal roots mismatch.&lt;/strong&gt; The sensor hardcoded &lt;code&gt;["shared", "mind", "body", "will", "features"]&lt;/code&gt;. Ruff's &lt;code&gt;known-first-party&lt;/code&gt; includes &lt;code&gt;["api", "body", "cli", "features", "mind", "services", "shared", "will"]&lt;/code&gt;. Missing: &lt;code&gt;api&lt;/code&gt;, &lt;code&gt;cli&lt;/code&gt;, &lt;code&gt;services&lt;/code&gt;. When a file imports from &lt;code&gt;cli&lt;/code&gt; after importing from &lt;code&gt;body&lt;/code&gt;, ruff sees two first-party imports in any order — fine. The sensor sees &lt;code&gt;third_party&lt;/code&gt; after &lt;code&gt;internal&lt;/code&gt; — violation.&lt;/p&gt;

&lt;p&gt;Fix: pass &lt;code&gt;internal_roots&lt;/code&gt; as a parameter in the enforcement mapping so the sensor reads from configuration rather than hardcoding.&lt;/p&gt;

&lt;p&gt;After both fixes: 0 violations. Sensor and fixer agreed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architectural Lesson
&lt;/h2&gt;

&lt;p&gt;This is an instrument qualification problem.&lt;/p&gt;

&lt;p&gt;In GxP-regulated environments (pharma, medical devices), before you trust a measurement instrument, you qualify it. You verify that it measures what it claims to measure, using a known reference. An unqualified instrument is not a trusted instrument — even if it produces numbers.&lt;/p&gt;

&lt;p&gt;I deployed a sensor without qualifying it against the fixer. The sensor was measuring something real (import order), but measuring it differently than the tool that fixes it. The result was 152 false positives — governance debt that looked real but wasn't.&lt;/p&gt;

&lt;p&gt;A sensor that disagrees with its corresponding fixer is worse than no sensor. It creates noise, erodes trust in the Blackboard, and — if the remediator were running — would dispatch fix actions that produce no change, loop, and dispatch again.&lt;/p&gt;

&lt;p&gt;The correct pattern before activating any new sensor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run the fixer in dry-run mode. Collect what it would change.&lt;/li&gt;
&lt;li&gt;Run the sensor. Collect what it would flag.&lt;/li&gt;
&lt;li&gt;Verify the two sets agree on the same files.&lt;/li&gt;
&lt;li&gt;Only then activate.&lt;/li&gt;
&lt;/ol&gt;
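&lt;p&gt;The qualification step above is just set algebra, which makes it easy to automate. A minimal sketch, assuming both tools can report the files they would touch:&lt;/p&gt;

```python
def qualify(sensor_flags, fixer_would_change):
    """Compare sensor findings against a fixer dry run, per file."""
    sensor = set(sensor_flags)
    fixer = set(fixer_would_change)
    return {
        "coherent": sensor == fixer,
        "false_positives": sorted(sensor - fixer),  # sensor-only noise
        "blind_spots": sorted(fixer - sensor),      # fixer-only gaps
    }

# Illustrative inputs: the sensor flags a file the fixer would not touch.
report = qualify(
    sensor_flags=["src/a.py", "src/b.py"],
    fixer_would_change=["src/a.py"],
)
```

Only a report with &lt;code&gt;coherent&lt;/code&gt; true and both lists empty would clear a sensor for activation; anything else is the 152-findings failure mode caught before it reaches the Blackboard.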

&lt;p&gt;CORE doesn't enforce this yet. The gap is now in the backlog as &lt;code&gt;governance.sensor_fixer_coherence&lt;/code&gt; — a meta-rule that validates governance components against each other before they're trusted.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Got Fixed
&lt;/h2&gt;

&lt;p&gt;Three separate changes at three separate levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AST logic&lt;/strong&gt; (&lt;code&gt;src/mind/logic/engines/ast_gate/checks/import_checks.py&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before: relative imports fell through to third_party
# After: detect stmt.level &amp;gt; 0 and classify as local (idx=4)
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ImportFrom&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;grp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;  &lt;span class="c1"&gt;# always last — after internal
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt; (&lt;code&gt;.intent/enforcement/mappings/code/style.yaml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;style.import_order&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;engine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ast_gate&lt;/span&gt;
  &lt;span class="na"&gt;params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;check_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;import_order&lt;/span&gt;
    &lt;span class="na"&gt;internal_roots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cli"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;features"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mind"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;services"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shared"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;will"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tooling&lt;/strong&gt; — a new &lt;code&gt;core-admin workers blackboard purge&lt;/code&gt; command to clear stale findings when a sensor produces false positives before a fix is applied.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current State
&lt;/h2&gt;

&lt;p&gt;7 sensors active. 52 rules. 0 findings. Blackboard clean.&lt;/p&gt;

&lt;p&gt;The convergence loop is running. The daemon detects violations, the remediator dispatches fixes, the sensor confirms they're gone. That's A3.&lt;/p&gt;

&lt;p&gt;The sensor-fixer coherence check doesn't exist yet. Until it does, every new sensor I add needs manual qualification before activation. That's a human step where CORE should eventually do the work itself.&lt;/p&gt;

&lt;p&gt;Which is the point of the whole project.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CORE is open source: &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
&lt;em&gt;Previous posts in this series cover the constitutional model, the autonomous loop, and the ViolationExecutor implementation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>codequality</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>PASSED with 252 findings. FAILED with 78. Which audit would you trust?</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Tue, 07 Apr 2026 21:00:32 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/passed-with-252-findings-failed-with-78-which-audit-would-you-trust-1of</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/passed-with-252-findings-failed-with-78-which-audit-would-you-trust-1of</guid>
      <description>&lt;p&gt;&lt;em&gt;A story about instrument qualification, false positives, and why honest governance sometimes means failing on purpose.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The paradox
&lt;/h2&gt;

&lt;p&gt;This morning, CORE's audit system reported 252 findings and returned a verdict of &lt;strong&gt;PASSED&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This evening, it reported 78 findings and returned a verdict of &lt;strong&gt;FAILED&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Nothing in production changed. No bugs were introduced. No architecture was violated.&lt;/p&gt;

&lt;p&gt;The sensors were fixed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Total findings&lt;/td&gt;
&lt;td&gt;252&lt;/td&gt;
&lt;td&gt;78&lt;/td&gt;
&lt;td&gt;-174&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphan files&lt;/td&gt;
&lt;td&gt;91&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-91&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modularity (blunt score)&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;needs_split&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;new&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;needs_refactor&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;new&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;File size (redundant rule)&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;-29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Verdict&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;FAIL&lt;/td&gt;
&lt;td&gt;honest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The FAILED verdict is the correct one. The PASSED verdict was a compliance illusion.&lt;/p&gt;




&lt;h2&gt;
  
  
  The instrument qualification problem
&lt;/h2&gt;

&lt;p&gt;In GxP-regulated environments — pharmaceutical manufacturing, medical devices, clinical software — you do not run an assay on an uncalibrated instrument and trust the result. Before any measurement is taken seriously, the instrument must be qualified: it must demonstrably measure what it claims to measure, within defined tolerances, under defined conditions.&lt;/p&gt;

&lt;p&gt;This principle is so fundamental that it precedes any discussion of the data itself. Bad data from a qualified instrument is a finding. Bad data from an unqualified instrument is noise — and acting on noise has a name: it is a deviation.&lt;/p&gt;

&lt;p&gt;Software governance systems face the same problem. An audit engine that produces findings is an instrument. If that instrument has not been qualified — if its detectors produce false positives, if its thresholds are miscalibrated, if its rules conflate distinct problem classes — then the findings it produces are not evidence. They are noise with a compliance label.&lt;/p&gt;

&lt;p&gt;Acting on that noise with automated remediation is not governance. It is confident, expensive, wrong work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 1: The orphan file detector
&lt;/h2&gt;

&lt;p&gt;CORE uses a static import graph traversal to detect source files unreachable from any declared entry point. The principle is sound: if no entry point can reach a file, that file is dead code and should be removed.&lt;/p&gt;
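&lt;p&gt;The traversal itself is simple to sketch. Assuming the import graph is already built, orphan detection is reachability from the declared entry points (the graph shape and names here are illustrative, not CORE's code):&lt;/p&gt;

```python
from collections import deque

def find_orphans(import_graph: dict[str, set[str]], entry_points: set[str]) -> set[str]:
    """Flag files that no declared entry point can reach via static imports."""
    reachable: set[str] = set()
    queue = deque(entry_points)
    while queue:
        node = queue.popleft()
        if node in reachable:
            continue
        reachable.add(node)
        queue.extend(import_graph.get(node, set()))
    return set(import_graph) - reachable
```

&lt;p&gt;Note that declaring a dynamically-loaded file as an entry point is exactly the shape of the fix: the detector stays deterministic, and the declaration carries the knowledge.&lt;/p&gt;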

&lt;p&gt;The detector flagged 91 files as orphans.&lt;/p&gt;

&lt;p&gt;All 91 were false positives.&lt;/p&gt;

&lt;p&gt;Static import graph traversal is a deliberate choice — deterministic, auditable, no runtime dependency. The tradeoff is that dynamically-loaded components must be explicitly declared as entry points. That declaration is itself a governance artifact: it makes the implicit loading contract explicit and versioned. The detector was not wrong — the contract was incomplete.&lt;/p&gt;

&lt;p&gt;An automated agent pointed at those 91 findings would have deleted live production code. The agent would have been operating correctly within its mandate. The mandate was wrong.&lt;/p&gt;

&lt;p&gt;The fix was not to make the detector smarter. It was to declare the dynamically-loaded directories as explicit entry points — converting an implicit runtime convention into a versioned, governed contract. Functionally this resembles static linking. Constitutionally it is different: the declaration is law, subject to change control, with documented rationale. The detector enforces the contract. The contract is owned by governance, not by the build system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;entry_points&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/will/self_healing/"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/will/test_generation/"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;src/shared/infrastructure/"&lt;/span&gt;
  &lt;span class="c1"&gt;# ... 10 more directories&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the fix: zero orphan findings. Zero code deleted. The codebase did not change. The instrument was qualified.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case 2: The modularity score
&lt;/h2&gt;

&lt;p&gt;Four rules were producing 100 findings collectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;modularity.single_responsibility&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;modularity.semantic_cohesion&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;modularity.import_coupling&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;modularity.refactor_score_threshold&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four were proxies for a single composite score. All four mapped to the same remediation action: &lt;code&gt;fix.modularity&lt;/code&gt;. All four carried the same enforcement level: &lt;code&gt;reporting&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The problem is that they were measuring two fundamentally different things and treating them identically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem class A: a file is too long with a single coherent responsibility.&lt;/strong&gt;&lt;br&gt;
This is a mechanical problem. The file does one thing but does too much of it. The solution is splitting — redistributing logic across smaller files along natural seams. No discipline boundaries are crossed. No architectural judgment is required. An automated system can propose and execute this split safely, subject to a Logic Conservation Gate that verifies no logic was lost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem class B: a file mixes distinct architectural disciplines.&lt;/strong&gt;&lt;br&gt;
A file that combines CLI rendering, database access, and business logic in 300 lines is not a size problem. It is an architectural violation. Resolving it requires a human to decide where each responsibility belongs in the constitutional layer structure. An automated system cannot make that decision safely — not because AI is incapable of generating a proposal, but because the decision carries architectural authority that must remain with a human until the boundaries are formally established.&lt;/p&gt;

&lt;p&gt;Conflating these two problems in a single score means the governance system cannot distinguish between what it is allowed to fix autonomously and what it must escalate. That distinction is not a technical nicety. In regulated environments, it is the difference between an approved automated action and an unauthorized architectural change.&lt;/p&gt;

&lt;p&gt;The fix was to retire the four proxy rules and replace them with two precise sensors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"modularity.needs_split"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enforcement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reporting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Automatable. Mechanical redistribution, no discipline boundaries crossed."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"modularity.needs_refactor"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enforcement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blocking"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rationale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Requires human judgment. Autonomous action prohibited until architectural decision is approved."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;blocking&lt;/code&gt; enforcement on &lt;code&gt;needs_refactor&lt;/code&gt; is the point. It is not a warning. It is a constitutional stop. The system will not proceed autonomously until a human has reviewed and authorized the architectural boundary decision.&lt;/p&gt;
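&lt;p&gt;A minimal sketch of that split, with illustrative names (the &lt;code&gt;Finding&lt;/code&gt; shape and the routing strings are my assumptions, not CORE's API):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Finding:
    rule_id: str
    enforcement: str  # "reporting" or "blocking"

def route(finding: Finding) -> str:
    """Blocking findings stop the system; reporting findings may be automated."""
    if finding.enforcement == "blocking":
        return "escalate_to_human"        # constitutional stop, no autonomous fix
    return "dispatch_autonomous_fix"      # e.g. a mechanical split, gated afterwards
```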

&lt;p&gt;This is why the audit now returns FAILED. Twenty-seven files contain mixed-discipline violations. They are real findings. They require real decisions. The system is correctly refusing to act without authorization.&lt;/p&gt;




&lt;h2&gt;
  
  
  The verdict paradox
&lt;/h2&gt;

&lt;p&gt;A governance system that always passes is not a governance system. It is a reporting system with a green checkbox.&lt;/p&gt;

&lt;p&gt;PASSED with 252 findings meant: the system detected many things, none of them were classified as blocking, therefore no action is required. The 91 false positives contributed to a picture of busyness without actionability. The composite modularity score produced findings that the automated remediator could not distinguish from each other. Everything was flagged, nothing was escalated.&lt;/p&gt;

&lt;p&gt;FAILED with 78 findings means: the system has detected 27 architectural violations that require human decisions before any automated action proceeds. It has identified 19 files that can be split autonomously, subject to validation gates. Every finding in the report corresponds to a specific, actionable condition.&lt;/p&gt;

&lt;p&gt;The failure verdict is evidence that the governance system is functioning correctly. It is not a regression. It is an honest measurement.&lt;/p&gt;




&lt;h2&gt;
  
  
  The principle
&lt;/h2&gt;

&lt;p&gt;Governance quality is not measured by finding count. It is measured by finding accuracy.&lt;/p&gt;

&lt;p&gt;In regulated environments, the difference between a false positive acted upon and a true positive ignored is not a technical footnote. It is a compliance failure. Instrument qualification is not overhead — it is the precondition for trusting any measurement that follows.&lt;/p&gt;

&lt;p&gt;Before you ask what your audit found, ask whether your audit can be trusted.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;CORE is an open-source constitutional governance runtime for AI-assisted software development. Architecture, governance rules, and enforcement mappings are public.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;github.com/DariuszNewecki/CORE&lt;/em&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>I Spent a Saturday Cleaning My Own Repo. CORE Made Me.</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sat, 04 Apr 2026 19:42:23 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/i-spent-a-saturday-cleaning-my-own-repo-core-made-me-3pdf</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/i-spent-a-saturday-cleaning-my-own-repo-core-made-me-3pdf</guid>
      <description>&lt;p&gt;&lt;em&gt;Not because I wanted to.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Because the system I built demands that everything it touches is defensible. And when I looked honestly at my own repository — the README, the docs, the &lt;code&gt;.gitignore&lt;/code&gt; — they weren't.&lt;/p&gt;

&lt;p&gt;So I fixed them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The broken command nobody noticed
&lt;/h2&gt;

&lt;p&gt;It started with a README.&lt;/p&gt;

&lt;p&gt;The Quick Start section told anyone who cloned CORE to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run core-admin check audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That command doesn't exist. The correct command is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;poetry run core-admin code audit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One word difference. But anyone who followed that instruction would get an error on their very first interaction with the project. First impression: broken.&lt;/p&gt;

&lt;p&gt;The CLI had evolved. The legacy verb-first pattern (&lt;code&gt;check audit&lt;/code&gt;) was purged months ago when CORE's command structure was redesigned around resource-first architecture. The README hadn't kept up. It was documenting a command that no longer existed.&lt;/p&gt;




&lt;h2&gt;
  
  
  "If the docs lie, the system lies."
&lt;/h2&gt;

&lt;p&gt;This is the thing about building a governance runtime: you can't enforce standards on AI-generated code while your own documentation ships broken commands.&lt;/p&gt;

&lt;p&gt;CORE's entire thesis is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Never produce software you cannot defend.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not rhetorically. Technically, legally, epistemically, historically.&lt;/p&gt;

&lt;p&gt;If I can't defend my own README — if the first thing someone tries doesn't work — then I'm not living by the standard I built into the system.&lt;/p&gt;

&lt;p&gt;That's not a philosophical problem. It's a credibility problem. And a consistency problem. And those are exactly the problems CORE exists to solve.&lt;/p&gt;




&lt;h2&gt;
  
  
  What a Saturday of self-governance looks like
&lt;/h2&gt;

&lt;p&gt;Here's what actually got done:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;README:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixed the broken audit command (&lt;code&gt;check&lt;/code&gt; → &lt;code&gt;code&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Removed a stale metric (&lt;code&gt;0 blocking violations&lt;/code&gt;) that may or may not have been current&lt;/li&gt;
&lt;li&gt;Removed an acknowledgment that no longer reflected the project's direction&lt;/li&gt;
&lt;li&gt;Replaced a buried, collapsible workflow diagram with a cleaner conceptual flow — visible immediately, no click required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CONTRIBUTING.md:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated the CI description (it had said "smoke testing" — it does more than that now)&lt;/li&gt;
&lt;li&gt;Added the audit command so contributors know how to verify compliance locally before opening a PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;.gitignore:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found that &lt;code&gt;logs/*&lt;/code&gt; was missing — only &lt;code&gt;!logs/.gitkeep&lt;/code&gt; existed, with no corresponding exclusion rule. Any non-&lt;code&gt;.log&lt;/code&gt; file landing in &lt;code&gt;logs/&lt;/code&gt; would have been tracked silently.&lt;/li&gt;
&lt;li&gt;Added proper &lt;code&gt;logs/*&lt;/code&gt; and &lt;code&gt;reports/*&lt;/code&gt; exclusions with the same pattern used for &lt;code&gt;var/&lt;/code&gt; and &lt;code&gt;work/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
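&lt;p&gt;The resulting pattern, sketched (the &lt;code&gt;reports/.gitkeep&lt;/code&gt; line is my assumption, mirroring the &lt;code&gt;logs/&lt;/code&gt; pair):&lt;/p&gt;

```gitignore
# Ignore generated contents but keep the directories tracked.
logs/*
!logs/.gitkeep
reports/*
!reports/.gitkeep
```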

&lt;p&gt;&lt;strong&gt;docs/ — complete rewrite:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The docs site had 111 files across 30 directories, most of them written at various stages of development, not reflecting current architecture&lt;/li&gt;
&lt;li&gt;I replaced all of it with six files: &lt;code&gt;index.md&lt;/code&gt;, &lt;code&gt;how-it-works.md&lt;/code&gt;, &lt;code&gt;autonomy-ladder.md&lt;/code&gt;, &lt;code&gt;getting-started.md&lt;/code&gt;, &lt;code&gt;cli-reference.md&lt;/code&gt;, &lt;code&gt;contributing.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Every CLI command in the reference was verified against the actual source code — not inferred, not remembered, not guessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters enough that it gets its own section below.&lt;/p&gt;




&lt;h2&gt;
  
  
  The CLI reference problem is the whole problem in miniature
&lt;/h2&gt;

&lt;p&gt;The first draft of &lt;code&gt;cli-reference.md&lt;/code&gt; was written by an AI assistant — from inference, not from source.&lt;/p&gt;

&lt;p&gt;It had wrong subcommands. Plausible ones, but wrong.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;core-admin proposals inspect &amp;lt;id&amp;gt;&lt;/code&gt; — doesn't exist. It's &lt;code&gt;show&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;core-admin inspect status&lt;/code&gt; — legacy verb-first pattern, purged months ago. It's &lt;code&gt;core-admin admin status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;core-admin governance coverage&lt;/code&gt; — wrong group entirely. It's &lt;code&gt;core-admin constitution status&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Three wrong commands in one file. All confident. All wrong.&lt;/p&gt;

&lt;p&gt;I caught it. Pushed back. Asked the assistant to search the actual source code before writing anything. It did. The commands got fixed.&lt;/p&gt;
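&lt;p&gt;"Search the actual source" can itself be mechanical. A hedged sketch: scan a CLI module's AST for functions registered with an &lt;code&gt;@app.command&lt;/code&gt;-style decorator (the decorator convention is my assumption about the codebase):&lt;/p&gt;

```python
import ast

def registered_commands(source: str) -> set[str]:
    """Collect function names registered via an app.command-style decorator."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            for dec in node.decorator_list:
                # Handle both @app.command and @app.command(...)
                target = dec.func if isinstance(dec, ast.Call) else dec
                if isinstance(target, ast.Attribute) and target.attr == "command":
                    names.add(node.name)
    return names
```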

&lt;p&gt;The irony was not subtle: an AI assistant producing plausible but unverified output, in documentation for a system that exists specifically to prevent AI from producing plausible but unverified output.&lt;/p&gt;

&lt;p&gt;That's not a documentation problem. That's an epistemic problem. And it's the same one that lives in &lt;code&gt;.intent/northstar/core_northstar.md&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Nothing is assumed silently. All assumptions must be explicit, owned, and traceable. Reasoning requires citation. If CORE cannot point to evidence, it cannot act.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What this has to do with autonomy
&lt;/h2&gt;

&lt;p&gt;CORE is currently at A2+ — governed generation, universal workflow pattern. I'm working toward A3 — strategic autonomy, where CORE identifies and proposes architectural improvements without being asked.&lt;/p&gt;

&lt;p&gt;For A3 to be trustworthy, the system has to be clean. Not just the code — the whole project. The README someone reads before cloning. The docs they follow when getting started. The &lt;code&gt;.gitignore&lt;/code&gt; that determines what gets committed.&lt;/p&gt;

&lt;p&gt;If those are wrong, the foundation is wrong. And you can't build autonomous operation on a wrong foundation.&lt;/p&gt;

&lt;p&gt;Cleaning the repo isn't glamorous. It doesn't advance the autonomy ladder. But it's the kind of work the system's own philosophy demands — and that I'd been quietly deferring.&lt;/p&gt;




&lt;h2&gt;
  
  
  The self-referential part
&lt;/h2&gt;

&lt;p&gt;There's something almost uncomfortable about this.&lt;/p&gt;

&lt;p&gt;I built a system that enforces: &lt;em&gt;you cannot ship what you cannot defend.&lt;/em&gt; And then I had a README with a broken command, a &lt;code&gt;.gitignore&lt;/code&gt; with a missing rule, and a documentation site with 111 files of outdated content.&lt;/p&gt;

&lt;p&gt;The system couldn't enforce standards on its own repository — it doesn't govern Markdown files. That's a human responsibility.&lt;/p&gt;

&lt;p&gt;Which means the human has to do it.&lt;/p&gt;

&lt;p&gt;That's not a failure of CORE. That's the design. &lt;code&gt;.intent/&lt;/code&gt; is human-authored and immutable at runtime. CORE can never write to it. The constitution is mine to maintain.&lt;/p&gt;

&lt;p&gt;The same is true for everything outside the autonomy lanes — the README, the docs, the project presentation. CORE governs the code. I govern the rest.&lt;/p&gt;

&lt;p&gt;And today I did.&lt;/p&gt;




&lt;h2&gt;
  
  
  If you're curious
&lt;/h2&gt;

&lt;p&gt;The repo is at &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you've looked before and bounced — the docs are cleaner now. The commands in the Quick Start actually work.&lt;/p&gt;

&lt;p&gt;If you're new: read &lt;code&gt;.intent/&lt;/code&gt; before the source. That's where the law lives.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous in this series: &lt;a href="https://dev.to/dariusz_newecki_e35b0924c/my-ai-has-22-workers-2470-resolved-violations-and-still-cant-call-itself-autonomous-heres-the-4020"&gt;My AI Has 22 Workers, 2,470 Resolved Violations, and Still Can't Call Itself Autonomous. Here's the Gap.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cli</category>
      <category>codequality</category>
      <category>devjournal</category>
      <category>documentation</category>
    </item>
    <item>
      <title>The AI That Refused To Ship Its Own Fix</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Wed, 01 Apr 2026 18:15:43 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/the-ai-that-refused-to-ship-its-own-fix-1m1</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/the-ai-that-refused-to-ship-its-own-fix-1m1</guid>
      <description>&lt;p&gt;&lt;em&gt;Or: what constitutional governance actually looks like in practice&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I spent today doing something unusual. I was improving an AI system — and the system kept stopping me from making mistakes.&lt;/p&gt;

&lt;p&gt;Not because it was clever. Because it was governed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;CORE is a constitutional governance runtime for AI coding agents. The short version: instead of letting AI write code freely and checking quality afterwards, CORE requires every decision to be traceable to declared law before execution begins.&lt;/p&gt;

&lt;p&gt;Every rule lives in &lt;code&gt;.intent/&lt;/code&gt;. Every action requires authority. Every mutation is defensible or it doesn't happen.&lt;/p&gt;

&lt;p&gt;Today's session was about advancing CORE toward A3 — full autonomous operation on Tier 1 violations. Eight items on the A3 readiness checklist were marked ❌. By end of session: one remained.&lt;/p&gt;

&lt;p&gt;Here's what the path looked like.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Gate That Wasn't Wired
&lt;/h2&gt;

&lt;p&gt;Early in the session I wanted to promote two entries in the autonomous remediation map from PENDING to ACTIVE. The Logic Conservation Gate had been implemented — a protection against AI refactors that delete more than 50% of a file's logic. The comment in the YAML said &lt;em&gt;"gated until Logic Conservation Gate implemented."&lt;/em&gt; Gate implemented. Promote.&lt;/p&gt;

&lt;p&gt;I gave Claude Code the prompt.&lt;/p&gt;

&lt;p&gt;It came back with a finding I hadn't expected:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"The gate exists in ComplexityRemediationService. ModularityRemediationService — the actual handler these entries call — never touches it. The comment in the YAML is still accurate."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The gate was implemented. Just not in the right place.&lt;/p&gt;

&lt;p&gt;We wired it. Then we promoted.&lt;/p&gt;

&lt;p&gt;That's not a dramatic story. That's exactly how it should work. The system forced verification before the claim. No amount of "the gate exists somewhere" was sufficient. It had to exist &lt;em&gt;here&lt;/em&gt;, in &lt;em&gt;this path&lt;/em&gt;, or the promotion was a lie.&lt;/p&gt;
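&lt;p&gt;For readers who want the gate's shape: a hedged sketch, counting statement nodes as a crude proxy for "logic" (the proxy and the helper names are my assumptions; only the 50% figure comes from the session):&lt;/p&gt;

```python
import ast

def logic_nodes(source: str) -> int:
    """Count statement nodes as a crude proxy for the file's logic."""
    return sum(isinstance(node, ast.stmt) for node in ast.walk(ast.parse(source)))

def conserves_logic(before: str, after: str, max_loss: float = 0.5) -> bool:
    """Reject a refactor that deletes more than max_loss of the logic."""
    baseline = logic_nodes(before)
    return baseline == 0 or logic_nodes(after) >= baseline * (1 - max_loss)
```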




&lt;h2&gt;
  
  
  83 Silent Failures, Now Loud
&lt;/h2&gt;

&lt;p&gt;Overnight, 83 proposals failed. Each showed &lt;code&gt;execution_results: {}&lt;/code&gt; — empty. The handlers were running but returning nothing.&lt;/p&gt;

&lt;p&gt;Three months ago this would have been invisible. The handlers returned &lt;code&gt;ok=True&lt;/code&gt; unconditionally. Internal errors were swallowed. The proposal consumer would mark everything COMPLETED and move on.&lt;/p&gt;

&lt;p&gt;Yesterday we fixed that. Wrapped every handler in try/except. Derived &lt;code&gt;ok&lt;/code&gt; from actual outcomes instead of hardcoding success.&lt;/p&gt;
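&lt;p&gt;That fix can be sketched in a few lines (the handler signature and result shape are illustrative, not CORE's API):&lt;/p&gt;

```python
def run_handler(handler, finding) -> dict:
    """Wrap a remediation handler so 'ok' reflects what actually happened."""
    try:
        result = handler(finding)
        # 'ok' is derived from the outcome, never hardcoded to True.
        return {"ok": bool(result), "execution_results": result or {}}
    except Exception as exc:  # internal errors surface instead of being swallowed
        return {"ok": False, "execution_results": {}, "error": str(exc)}
```

&lt;p&gt;An empty result now yields &lt;code&gt;ok=False&lt;/code&gt;, which is why the 83 failures surfaced instead of being marked COMPLETED.&lt;/p&gt;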

&lt;p&gt;So this morning: 83 failures instead of 83 false completions.&lt;/p&gt;

&lt;p&gt;That's progress. Honest failure is worth more than dishonest success. CORE's constitution says exactly this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"CORE must never produce software it cannot defend."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A system that lies about its own outcomes cannot defend them.&lt;/p&gt;




&lt;h2&gt;
  
  
  319 Stuck Findings
&lt;/h2&gt;

&lt;p&gt;The blackboard showed 319 entries in &lt;code&gt;claimed&lt;/code&gt; status. All with &lt;code&gt;claimed_by = NULL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Legacy entries — claimed before we added atomic claiming with worker identity. The fix was one SQL statement. But finding it required reading the blackboard, querying &lt;code&gt;claimed_by&lt;/code&gt;, and tracing the pattern.&lt;/p&gt;
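&lt;p&gt;The one-statement fix, sketched against an in-memory SQLite stand-in (table and column names are modeled on this post, not CORE's actual PostgreSQL schema):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blackboard (id INTEGER PRIMARY KEY, status TEXT, claimed_by TEXT)")
conn.executemany(
    "INSERT INTO blackboard (status, claimed_by) VALUES (?, ?)",
    [("claimed", None)] * 3 + [("claimed", "worker-uuid"), ("open", None)],
)

# Legacy entries were claimed before atomic claiming added worker identity,
# so claimed_by is NULL. Release them back to the pool in one statement.
released = conn.execute(
    "UPDATE blackboard SET status = 'open' WHERE status = 'claimed' AND claimed_by IS NULL"
).rowcount
```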

&lt;p&gt;No amount of assuming "the system is fine" would have found this. The evidence had to be read. The constitution demands it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Memory without evidence is forbidden."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After the fix, a new batch of 319 appeared — this time with a real UUID. The worker was claiming findings, finding no handler for them in the remediation map, and leaving them stuck.&lt;/p&gt;

&lt;p&gt;Another fix: release unmappable findings immediately at claim time.&lt;/p&gt;

&lt;p&gt;Each fix was revealed by the system's own honesty about its state.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Makes This Different
&lt;/h2&gt;

&lt;p&gt;Most AI coding tools measure success by output volume. Lines written, tickets closed, PRs merged.&lt;/p&gt;

&lt;p&gt;CORE measures success by defensibility. Can you explain why this change was made? Under what authority? With what evidence? What happens if it's wrong?&lt;/p&gt;

&lt;p&gt;Today we made 14 commits. Each traceable to a checklist item. Each verified by the system before and after. The daemon either ran clean or it didn't. The blackboard either showed stuck entries or it didn't.&lt;/p&gt;

&lt;p&gt;The AI didn't just write code. It was governed while writing code. And when the governance caught a mistake — the gate that wasn't wired, the handler that lied about success, the findings that stayed claimed forever — we fixed the governance, not just the symptom.&lt;/p&gt;

&lt;p&gt;That's the mind shift. Not &lt;em&gt;"AI writes code faster."&lt;/em&gt; But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Law governs intelligence. Defensibility outranks productivity."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Who This Is For
&lt;/h2&gt;

&lt;p&gt;CORE is not for everyone. It's explicitly not for casual app builders or speed-only workflows.&lt;/p&gt;

&lt;p&gt;It's for regulated environments. Safety-critical systems. Teams where &lt;em&gt;"the AI decided"&lt;/em&gt; is not an acceptable answer in a post-mortem.&lt;/p&gt;

&lt;p&gt;If that's your world — the architecture is open. The constitution is public.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And if you think in terms of governance rather than just generation — I'm looking for collaborators. Not necessarily programmers. People who understand that software systems need to be able to explain themselves.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Written the same day the session happened. The daemon is running clean as I type this.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>python</category>
      <category>core</category>
    </item>
    <item>
      <title>Your Agent Has Two Logs. One of Them Doesn't Exist Yet.</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Mon, 30 Mar 2026 20:38:06 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/your-agent-has-two-logs-one-of-them-doesnt-exist-yet-253a</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/your-agent-has-two-logs-one-of-them-doesnt-exist-yet-253a</guid>
      <description>&lt;p&gt;Earlier this week I read Daniel Nwaneri's piece on induced authorization — the observation that agents don't just do unauthorized things, they &lt;em&gt;cause humans&lt;/em&gt; to do unauthorized things. His central example: an agent gives advice, an engineer widens a permission based on that advice, the agent's action log shows nothing unusual. The exposure is real. The log is clean.&lt;/p&gt;

&lt;p&gt;He named this the &lt;strong&gt;induced-edge problem&lt;/strong&gt;, and the framing is sharp enough to deserve a concrete answer.&lt;/p&gt;

&lt;p&gt;Here's the answer CORE gives, the gap it still doesn't close, and why that gap is a frontier rather than an oversight.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Logs, Not One
&lt;/h2&gt;

&lt;p&gt;Most agent governance architectures assume one audit object: the action log. What did the agent do?&lt;/p&gt;

&lt;p&gt;The induced-edge problem reveals that this is the wrong unit. There are actually two logs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Log 1 — The action log.&lt;/strong&gt; What the agent executed directly. This exists. For CORE, it's the blackboard: every worker posts findings, proposals, and outcomes to a PostgreSQL append-only record. Nothing is retracted. Nothing is amended. The audit trail is a fact of the architecture, not a feature someone remembered to add.&lt;/p&gt;
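&lt;p&gt;The append-only property can be enforced at the database layer itself rather than trusted to worker code. A minimal sketch of the idea, using SQLite triggers as a stand-in for PostgreSQL; the table and column names are illustrative, not CORE's actual schema:&lt;/p&gt;

```python
import sqlite3

# Append-only blackboard sketch. SQLite triggers stand in for the
# PostgreSQL constraints; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blackboard (
    id INTEGER PRIMARY KEY,
    worker TEXT NOT NULL,
    kind TEXT NOT NULL,      -- finding, proposal, or outcome
    payload TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Revision is impossible by construction: any UPDATE or DELETE aborts.
CREATE TRIGGER no_update BEFORE UPDATE ON blackboard
BEGIN SELECT RAISE(ABORT, 'blackboard is append-only'); END;
CREATE TRIGGER no_delete BEFORE DELETE ON blackboard
BEGIN SELECT RAISE(ABORT, 'blackboard is append-only'); END;
""")

conn.execute(
    "INSERT INTO blackboard (worker, kind, payload) VALUES (?, ?, ?)",
    ("AuditViolationSensor", "finding", "unused import in daemon.py"),
)

# Workers can post but never revise: the attempt below aborts.
try:
    conn.execute("UPDATE blackboard SET payload = 'redacted'")
    revised = True
except sqlite3.DatabaseError:
    revised = False
```

&lt;p&gt;The point is that honesty is a property of the storage, not of the writers: a worker has no code path through which it could amend history.&lt;/p&gt;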

&lt;p&gt;&lt;strong&gt;Log 2 — The consequence log.&lt;/strong&gt; What happened in the world as a result of the agent's &lt;em&gt;output&lt;/em&gt;. This doesn't fully exist yet. Not in CORE. Not in most systems.&lt;/p&gt;

&lt;p&gt;The distinction matters because these two logs decay differently. Direct edges — things the agent did itself — are trackable and can be pruned when unused. Induced edges — state changes that the agent's output &lt;em&gt;caused&lt;/em&gt; a human to make — don't decay on the same clock. A widened permission persists independently of whether the agent keeps referencing it. It's effectively permanent until someone explicitly reconciles it.&lt;/p&gt;




&lt;h2&gt;
  
  
  What CORE Gets Right
&lt;/h2&gt;

&lt;p&gt;Before the gap, what the design actually solves.&lt;/p&gt;

&lt;p&gt;CORE's blackboard is append-only by architecture. Workers post findings; they cannot revise or retract them. This matters because of a question Daniel raised that I think is underappreciated: &lt;em&gt;what keeps the audit history honest while a materiality classifier is being trained on it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;His concern: if the classifier's training data can be gamed — even slowly, even unintentionally — you don't just end up with a gameable threshold. You end up with a corrupted evidence base. The reconciliation record loses its value at the same rate the classifier loses its integrity.&lt;/p&gt;

&lt;p&gt;CORE's append-only constraint answers this structurally. The blackboard accumulates an honest history not because workers are trustworthy, but because the architecture makes revision impossible.&lt;/p&gt;

&lt;p&gt;The proposal lifecycle adds another layer. When a proposal requires human approval, CORE records the approver identity and timestamp against the proposal record. &lt;code&gt;approved_by&lt;/code&gt;, &lt;code&gt;approved_at&lt;/code&gt; — persisted to the database, queryable, part of the chain. This isn't incidental. A dangerous proposal that executes without a recorded approver fails validation. The authorization chain is a first-class fact.&lt;/p&gt;
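&lt;p&gt;The shape of that validation check fits in a few lines. This is an illustrative sketch only: the field names mirror the ones above, but the &lt;code&gt;Proposal&lt;/code&gt; class and the validator are assumptions, not CORE's actual API:&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Illustrative sketch of the approval-chain check; the class and
# function names are assumptions, not CORE's real API.
@dataclass
class Proposal:
    action: str
    dangerous: bool
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None

def validate_authorization(p: Proposal) -> None:
    # A dangerous proposal with no recorded approver fails validation:
    # the authorization chain is a first-class fact, not an inference.
    if p.dangerous and not (p.approved_by and p.approved_at):
        raise PermissionError(f"{p.action}: no recorded approver")

# Passes: approver identity and timestamp are part of the record.
validate_authorization(Proposal(
    "fix.logging", dangerous=True,
    approved_by="dariusz", approved_at=datetime.now(timezone.utc),
))

# Fails: dangerous action, no recorded approval.
try:
    validate_authorization(Proposal("widen.scope", dangerous=True))
    rejected = False
except PermissionError:
    rejected = True
```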

&lt;p&gt;This is what Daniel meant when he argued the reconciliation record is the primary value — more important than whatever threshold sits on top of it. CORE was built around this conviction. The Final Invariant — &lt;em&gt;CORE must never produce software it cannot defend&lt;/em&gt; — is not about catching violations. It's about maintaining a record from which any decision can be reconstructed and justified. The defense &lt;em&gt;is&lt;/em&gt; the record.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where The Gap Actually Lives
&lt;/h2&gt;

&lt;p&gt;CORE logs that I approved a proposal, and when. What it doesn't log is what I did &lt;em&gt;as a consequence&lt;/em&gt; of that approval.&lt;/p&gt;

&lt;p&gt;If applying a proposal required me to also amend a &lt;code&gt;.intent/&lt;/code&gt; rule, add a new service account, or widen a scope — those downstream actions are outside the perimeter. CORE's record ends at "approved by Dariusz at timestamp X." The induced state change that followed is not in the log.&lt;/p&gt;

&lt;p&gt;But here's the thing: that gap is not an oversight in CORE's design. It's a consequence of something more fundamental.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;code&gt;.intent/&lt;/code&gt; Is a Shim
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;.intent/&lt;/code&gt; — CORE's constitutional layer, the directory of rules, worker declarations, and workflow stages that governs everything CORE does — is currently hand-authored YAML. Every file in it was written by me, under time pressure, as a working approximation of what a proper governance-policy tool would generate.&lt;/p&gt;

&lt;p&gt;The CORE-policy-governance-writer doesn't exist yet. I haven't had time to build it. What exists instead is a shim: YAML that holds the shape of the intended thing while the tooling catches up.&lt;/p&gt;

&lt;p&gt;This matters because of what &lt;code&gt;.intent/&lt;/code&gt; &lt;em&gt;is&lt;/em&gt; in the architecture. It's not configuration &lt;em&gt;for&lt;/em&gt; CORE. It's the constitutional layer that CORE &lt;em&gt;is&lt;/em&gt;. The rules, the worker mandates, the workflow invariants — these are the law CORE enforces. And the Final Invariant applies there with equal force:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Every change to &lt;code&gt;.intent/&lt;/code&gt; must be as traceable, as defensible, as auditable as any change to &lt;code&gt;src/&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Right now, it isn't. A change to &lt;code&gt;.intent/&lt;/code&gt; is a git commit I made. It's not a governed action. It's not logged against a finding. It's not traceable to a rule that authorized it. The authorization chain that CORE enforces rigorously for code changes simply doesn't exist yet for the policy layer itself.&lt;/p&gt;

&lt;p&gt;That's the induced-edge gap, stated precisely. It's not that CORE fails to track what humans do after approving proposals. It's that the &lt;em&gt;policy layer that would make human decisions about governance trackable&lt;/em&gt; hasn't been built yet. The two-log problem is the symptom. The missing tool is the cure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture Is Self-Similar By Design
&lt;/h2&gt;

&lt;p&gt;What the CORE-policy-governance-writer would be, once built, is exactly this: CORE, applied one layer up.&lt;/p&gt;

&lt;p&gt;The same principle that governs &lt;code&gt;src/&lt;/code&gt; — every change must be authorized, logged, traceable, defensible — applies to &lt;code&gt;.intent/&lt;/code&gt;. The governance writer would produce &lt;code&gt;.intent/&lt;/code&gt; changes as governed artifacts: triggered by a finding, authorized by a rule, logged to the blackboard, approvable or rejectable through the same proposal lifecycle that code changes use.&lt;/p&gt;

&lt;p&gt;This is the architecture completing itself. Not a new feature bolted on, but the same invariant applied consistently at every layer. CORE governing code. A governance writer governing CORE's constitution. The loop closes.&lt;/p&gt;

&lt;p&gt;Until that tool exists, the current &lt;code&gt;.intent/&lt;/code&gt; captures intent without capturing the consequences of acting on it. The authorization graph — declared scopes, permitted scope changes, human decisions that modified them — is implicit in my head and in git history, not in any structure CORE can reason over.&lt;/p&gt;

&lt;p&gt;Daniel asked whether runtime constitutional enforcement shrinks the attack surface for induced risks. The honest answer: completely, for direct edges. Structurally, for induced edges — once the policy layer is expressive enough to represent scope changes as first-class governed events. That work is ahead of me, not behind.&lt;/p&gt;




&lt;h2&gt;
  
  
  What The Thread Built
&lt;/h2&gt;

&lt;p&gt;Daniel's induced-edge framing gave me precise language for something I knew was missing but hadn't named cleanly. The two-log problem. The different decay rates of direct versus induced edges. The irreversible-by-default bootstrap.&lt;/p&gt;

&lt;p&gt;CORE's append-only blackboard and proposal authorization chain solve the honest-history problem for direct edges. The CORE-policy-governance-writer — when it exists — will extend the same guarantee to the policy layer itself. Every action traceable. Every authorization defensible. Including the ones that write the law.&lt;/p&gt;

&lt;p&gt;One person building it. The logic is consistent. The shim holds.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Daniel's piece: &lt;a href="https://dev.to/dannwaneri/agents-dont-just-do-unauthorized-things-they-cause-humans-to-do-unauthorized-things-51j4"&gt;Agents Don't Just Do Unauthorized Things. They Cause Humans to Do Unauthorized Things.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;CORE on GitHub: &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>architecture</category>
      <category>core</category>
    </item>
    <item>
      <title>The Day CORE's Author Finally Followed CORE's Own Rules</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sun, 29 Mar 2026 19:52:50 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/the-day-cores-author-finally-followed-cores-own-rules-15ma</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/the-day-cores-author-finally-followed-cores-own-rules-15ma</guid>
      <description>&lt;p&gt;I spent my Sunday not writing a single line of code.&lt;/p&gt;

&lt;p&gt;It was the most productive day I've had in months.&lt;/p&gt;




&lt;h2&gt;
  
  
  A bit of honest backstory
&lt;/h2&gt;

&lt;p&gt;I am not a programmer. I'm more of an architect. Maybe a philosopher who ended up with a codebase.&lt;/p&gt;

&lt;p&gt;I built CORE because I got angry. Angry at LLMs that hallucinate. Angry at context drift. Angry at tools that produce output nobody can defend. I wasn't trying to disrupt an industry — I was trying to solve my own problem.&lt;/p&gt;

&lt;p&gt;I built it for myself. On a server called &lt;code&gt;lira&lt;/code&gt;. In Antwerp. With PostgreSQL and local embeddings because I couldn't afford to throw GPU clusters at the problem.&lt;/p&gt;

&lt;p&gt;Necessity is the best source of creativity. Eastern Europe will teach you that.&lt;/p&gt;

&lt;p&gt;Somewhere along the way the thing I built for myself started looking like something real. Something that might matter beyond my own island.&lt;/p&gt;

&lt;p&gt;But today I realised something uncomfortable.&lt;/p&gt;




&lt;h2&gt;
  
  
  CORE demands something from every project it touches
&lt;/h2&gt;

&lt;p&gt;CORE has a rule. Actually, it's stronger than a rule — it's a constitutional requirement:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constitution before code. Always.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When CORE encounters a repository — whether it's a half-finished mess or a five-million line enterprise codebase — it doesn't start writing. It starts understanding. It maps. It asks. It surfaces contradictions. It refuses to proceed on guesswork.&lt;/p&gt;

&lt;p&gt;And before any implementation begins, CORE produces a constitution for the target system. A &lt;code&gt;.intent/&lt;/code&gt; directory. A set of requirements. A statement of what the software must do and why.&lt;/p&gt;

&lt;p&gt;CORE will not write a single line until that exists.&lt;/p&gt;

&lt;p&gt;I wrote this rule. I enforce this rule. I believe in this rule deeply enough to build an entire governance system around it.&lt;/p&gt;




&lt;h2&gt;
  
  
  So guess what I had never written for CORE itself.
&lt;/h2&gt;

&lt;p&gt;Yeah.&lt;/p&gt;

&lt;p&gt;CORE had a NorthStar document — a philosophical manifesto about why it exists, what it believes, what it will never do. Beautiful document. I'm proud of it.&lt;/p&gt;

&lt;p&gt;But a User Requirements document? Stating concretely what CORE must &lt;em&gt;deliver&lt;/em&gt; to a &lt;em&gt;user&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;Not written.&lt;/p&gt;

&lt;p&gt;I had been violating my own constitution. Quietly. For months.&lt;/p&gt;




&lt;h2&gt;
  
  
  How I finally got there
&lt;/h2&gt;

&lt;p&gt;It didn't happen the way I expected.&lt;/p&gt;

&lt;p&gt;I wasn't sitting at a desk with a blank document thinking "right, requirements time." I was having a conversation. A rambling, honest, philosophical Sunday conversation about CORE, about the industry, about building things on a shoestring from Eastern Europe, about my mother telling me decades ago that she couldn't follow what I was thinking.&lt;/p&gt;

&lt;p&gt;And somewhere in that conversation, someone asked: &lt;em&gt;"If a stranger landed on your GitHub right now, what would they understand in three seconds?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I had to think about it.&lt;/p&gt;

&lt;p&gt;The answer, when it came, was embarrassingly simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A thing that uses AI to write perfect applications.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Seven words. Obvious. To me. Completely invisible to everyone else because I had been leading with the architecture instead of the purpose.&lt;/p&gt;

&lt;p&gt;My mother would have told you the same thing. She did, in fact. Many times.&lt;/p&gt;




&lt;h2&gt;
  
  
  Writing the requirements properly
&lt;/h2&gt;

&lt;p&gt;Once I started actually articulating requirements — not philosophy, not principles, but &lt;em&gt;obligations&lt;/em&gt; — eight of them emerged:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-01: Universal Input Acceptance&lt;/strong&gt;&lt;br&gt;
CORE accepts anything. Conversation. Spec. Single file. Enormous repository. No assumptions about quality or structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-02: Comprehension Before Action&lt;/strong&gt;&lt;br&gt;
CORE must be able to say "I know what this does and I understand it" — backed by evidence — before taking any action. Not assumed. Earned and declared.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-03: Gap and Contradiction Reporting&lt;/strong&gt;&lt;br&gt;
CORE asks when things are missing. CORE stops when things contradict. CORE never guesses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-04: Constitution Before Code&lt;/strong&gt;&lt;br&gt;
The target system's &lt;code&gt;.intent/&lt;/code&gt; is CORE's first deliverable. Not a byproduct. A precondition. Implementation without an established constitution is malpractice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-05: Output is Working Software&lt;/strong&gt;&lt;br&gt;
Working means: satisfies the stated requirements. Stack is the user's choice. Perfection is a quality indicator, not a deliverable. Correctness against declared intent is the only measure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-06: Continuous Constitutional Governance&lt;/strong&gt;&lt;br&gt;
CORE does not distinguish between "build" and "maintain." That's an industry distinction — two teams, two budgets, two problems. CORE asks one question continuously: &lt;em&gt;does this satisfy what you said you wanted?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-07: Defensibility is Non-Negotiable&lt;/strong&gt;&lt;br&gt;
Every output traces to a requirement, a rule, or an explicit human decision. If none exist, CORE stops and asks. CORE never produces software it cannot defend — technically, legally, epistemically, historically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UR-08: Judgement Belongs to the Human&lt;/strong&gt;&lt;br&gt;
CORE enforces coherence, not morality. What the software is &lt;em&gt;for&lt;/em&gt; is the user's responsibility. CORE flags contradictions. CORE surfaces missing decisions. CORE does not judge purpose.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where does the document live?
&lt;/h2&gt;

&lt;p&gt;This is the part that made me genuinely happy.&lt;/p&gt;

&lt;p&gt;The answer was obvious. It could only go in one place: &lt;code&gt;.intent/northstar/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Not &lt;code&gt;constitution/&lt;/code&gt; — it's not a foundational principle.&lt;br&gt;
Not &lt;code&gt;papers/&lt;/code&gt; — it's not philosophy.&lt;br&gt;
Not &lt;code&gt;rules/&lt;/code&gt; — it's not enforcement law.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;northstar/&lt;/code&gt; — because the User Requirements document &lt;em&gt;is&lt;/em&gt; the NorthStar made concrete. Same authority. Same scope. Same rule: human-authored, runtime-immutable. CORE can read it. Reason from it. Vector-search it for context. Never touch it.&lt;/p&gt;

&lt;p&gt;When your architecture tells you where something belongs without you having to think about it — that's good constitutional design.&lt;/p&gt;




&lt;h2&gt;
  
  
  The irony isn't lost on me
&lt;/h2&gt;

&lt;p&gt;CORE demands constitution before code from every project it touches.&lt;/p&gt;

&lt;p&gt;Today its own author finally wrote the requirements document that should have anchored CORE's own constitution from the beginning.&lt;/p&gt;

&lt;p&gt;I was the user. The conversation was CORE. Messy, human, philosophical input went in. Precise, defensible, constitutionally placed output came out.&lt;/p&gt;

&lt;p&gt;No code written. Real progress made.&lt;/p&gt;




&lt;h2&gt;
  
  
  The governing invariant
&lt;/h2&gt;

&lt;p&gt;All eight requirements trace back to one line, declared in the NorthStar:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;CORE must never produce software it cannot defend.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not defend emotionally. Not defend rhetorically.&lt;/p&gt;

&lt;p&gt;Defend technically, legally, epistemically, and historically.&lt;/p&gt;

&lt;p&gt;Everything else is a consequence of that one sentence.&lt;/p&gt;




&lt;p&gt;If any of this resonates — or if you want to tell me I'm doing it wrong — the repo is at &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;github.com/DariuszNewecki/CORE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;CORE is picky. CORE is honest. CORE is written to survive audits, post-mortems, and time.&lt;/p&gt;

&lt;p&gt;If that makes it unsuitable for most people, so be it.&lt;/p&gt;

&lt;p&gt;I did not build CORE to be liked. I built it to be right.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>devops</category>
      <category>core</category>
    </item>
    <item>
      <title>How one missing line broke months of autonomous self-improvement — and what fixing it revealed about constitutional software governance</title>
      <dc:creator>Dariusz Newecki</dc:creator>
      <pubDate>Sat, 28 Mar 2026 15:09:23 +0000</pubDate>
      <link>https://forem.com/dariusz_newecki_e35b0924c/how-one-missing-line-broke-months-of-autonomous-self-improvement-and-what-fixing-it-revealed-3l53</link>
      <guid>https://forem.com/dariusz_newecki_e35b0924c/how-one-missing-line-broke-months-of-autonomous-self-improvement-and-what-fixing-it-revealed-3l53</guid>
      <description>&lt;p&gt;CORE is a system I've been building for over a year. Its purpose is to govern and autonomously improve software codebases using constitutional rules—not loose AI reasoning, but deterministic policies enforced by auditors and workers in a layered architecture (Mind/Body/Will).&lt;/p&gt;

&lt;p&gt;The self-improvement loop is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;AuditViolationSensor&lt;/span&gt; &lt;span class="n"&gt;detects&lt;/span&gt; &lt;span class="n"&gt;violation&lt;/span&gt;
  &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;blackboard&lt;/span&gt;
    &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;ViolationRemediatorWorker&lt;/span&gt; &lt;span class="n"&gt;creates&lt;/span&gt; &lt;span class="n"&gt;proposal&lt;/span&gt;
      &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;ProposalConsumerWorker&lt;/span&gt; &lt;span class="n"&gt;executes&lt;/span&gt; &lt;span class="n"&gt;fix&lt;/span&gt;
        &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="n"&gt;AuditViolationSensor&lt;/span&gt; &lt;span class="n"&gt;confirms&lt;/span&gt; &lt;span class="n"&gt;resolved&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For months, every proposal failed silently with the message: &lt;strong&gt;Actions failed: fix.logging&lt;/strong&gt;. The autonomous pipeline—the core feature—was completely broken, and the failure left almost no useful trace.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Investigation
&lt;/h3&gt;

&lt;p&gt;The first red flag was that &lt;code&gt;execution_results&lt;/code&gt; on every failed proposal was an empty dict &lt;code&gt;{}&lt;/code&gt;. This meant the failure occurred before &lt;code&gt;ActionExecutor&lt;/code&gt; even ran. Proposals were created correctly, approved correctly, and then died in the execution path.&lt;/p&gt;

&lt;p&gt;Working backwards:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;ProposalConsumerWorker&lt;/code&gt; picked up approved proposals.
&lt;/li&gt;
&lt;li&gt;It called &lt;code&gt;ProposalExecutor.execute()&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ActionExecutor.execute("fix.logging")&lt;/code&gt; crashed with:
&lt;code&gt;AttributeError: 'NoneType' object has no attribute 'write_runtime_text'&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;code&gt;write_runtime_text&lt;/code&gt; belongs to &lt;code&gt;FileHandler&lt;/code&gt;, which meant &lt;code&gt;file_handler&lt;/code&gt; was &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I checked the daemon startup code in &lt;code&gt;src/will/commands/daemon.py&lt;/code&gt;. It wired up &lt;code&gt;git_service&lt;/code&gt;, &lt;code&gt;knowledge_service&lt;/code&gt;, &lt;code&gt;cognitive_service&lt;/code&gt;, and &lt;code&gt;qdrant_service&lt;/code&gt;—but &lt;code&gt;file_handler&lt;/code&gt; was never set. &lt;code&gt;CoreContext&lt;/code&gt; defaults it to &lt;code&gt;None&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The CLI path worked fine because the &lt;code&gt;@core_command&lt;/code&gt; decorator wires the full &lt;code&gt;CoreContext&lt;/code&gt; (including &lt;code&gt;FileHandler&lt;/code&gt;). The daemon startup simply forgot it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; one missing line. Months of silent failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Fix That Got Rejected
&lt;/h3&gt;

&lt;p&gt;The obvious patch looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In src/will/commands/daemon.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;shared.infrastructure.storage.file_handler&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FileHandler&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FileHandler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BootstrapRegistry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_repo_path&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CORE's own audit immediately rejected it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR | architecture.boundary.file_handler_access
src/will/commands/daemon.py:199 — Forbidden import:
'from shared.infrastructure.storage.file_handler import FileHandler'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;daemon.py&lt;/code&gt; lives in the &lt;strong&gt;Will&lt;/strong&gt; layer. &lt;code&gt;FileHandler&lt;/code&gt; lives in &lt;code&gt;shared.infrastructure.storage&lt;/code&gt;—&lt;strong&gt;Body&lt;/strong&gt; layer territory. The Will layer is not allowed to import Body infrastructure directly.&lt;/p&gt;

&lt;p&gt;The system enforced its own architectural boundaries against its author.&lt;/p&gt;
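&lt;p&gt;For a sense of how small such an auditor can be, here is a simplified sketch of a layer-boundary check. Only the paths and the module name come from the log above; the checker itself is illustrative, not CORE's implementation:&lt;/p&gt;

```python
# Simplified layer-boundary audit: the Will layer may not import
# Body-layer infrastructure directly. Paths mirror the audit log
# above; the checker itself is illustrative, not CORE's code.
FORBIDDEN = [
    # (layer prefix of the importing file, forbidden module prefix)
    ("src/will/", "shared.infrastructure."),
]

def check_import(file_path, module):
    for layer_prefix, banned_prefix in FORBIDDEN:
        if file_path.startswith(layer_prefix) and module.startswith(banned_prefix):
            return (f"ERROR | architecture.boundary: {file_path} "
                    f"imports forbidden module {module}")
    return None  # import is allowed

violation = check_import(
    "src/will/commands/daemon.py",
    "shared.infrastructure.storage.file_handler",
)
```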

&lt;h3&gt;
  
  
  The Correct Fix
&lt;/h3&gt;

&lt;p&gt;Add &lt;code&gt;get_file_handler()&lt;/code&gt; to &lt;code&gt;ServiceRegistry&lt;/code&gt; (in the Body layer, where it belongs), then call it from the daemon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Proper approach
&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;service_registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_file_handler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The import now lives where the constitution says it should.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happened Next
&lt;/h3&gt;

&lt;p&gt;Once the pipeline started running, two more bugs surfaced:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Double &lt;code&gt;write&lt;/code&gt; argument&lt;/strong&gt; — &lt;code&gt;ProposalExecutor&lt;/code&gt; passed &lt;code&gt;write=True&lt;/code&gt; explicitly &lt;em&gt;and&lt;/em&gt; unpacked &lt;code&gt;{"write": True}&lt;/code&gt; from proposal parameters → &lt;code&gt;TypeError: got multiple values for keyword argument 'write'&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
Fixed by stripping &lt;code&gt;write&lt;/code&gt; from parameters before unpacking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer detection in LoggingFixer&lt;/strong&gt; — The autonomous fixer converted &lt;code&gt;console.print()&lt;/code&gt; to &lt;code&gt;logger.info()&lt;/code&gt; in CLI-layer files because its path detection only checked for &lt;code&gt;body/cli&lt;/code&gt;, missing &lt;code&gt;src/cli/&lt;/code&gt;. It was breaking its own rendering.&lt;br&gt;&lt;br&gt;
Fixed by updating the path detection logic.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
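&lt;p&gt;The first of those two bugs is easy to reproduce in isolation. A minimal sketch, with function and parameter names simplified from the actual &lt;code&gt;ProposalExecutor&lt;/code&gt; code:&lt;/p&gt;

```python
def execute_action(name, write=False, **params):
    # Stand-in for the real action entry point.
    return {"action": name, "write": write, **params}

proposal_params = {"write": True, "target": "src/module.py"}

# Buggy call: `write` arrives both explicitly and via **proposal_params.
try:
    execute_action("fix.logging", write=True, **proposal_params)
    crashed = False
except TypeError:
    crashed = True  # got multiple values for keyword argument 'write'

# The fix: strip `write` from the parameters before unpacking.
clean_params = {k: v for k, v in proposal_params.items() if k != "write"}
result = execute_action("fix.logging", write=True, **clean_params)
```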

&lt;h3&gt;
  
  
  First Autonomous Completion
&lt;/h3&gt;

&lt;p&gt;After the fixes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;completed&lt;/span&gt;
&lt;span class="na"&gt;fixes_applied&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;28&lt;/span&gt;
&lt;span class="na"&gt;files_modified&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4&lt;/span&gt;
&lt;span class="na"&gt;dry_run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No human approval. No human review. The governance pipeline ran end-to-end for the first time.&lt;/p&gt;

&lt;h3&gt;
  
  
  What This Revealed About Constitutional Governance
&lt;/h3&gt;

&lt;p&gt;Three lessons stood out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Silent failures are a design smell.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The &lt;code&gt;fix.logging&lt;/code&gt; action returned &lt;code&gt;ActionResult(ok=True)&lt;/code&gt; unconditionally, regardless of outcome. The pipeline reported success even when it had failed. Observability isn't optional—it's a constitutional requirement.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer boundaries matter at runtime, not just during review.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The daemon missed &lt;code&gt;file_handler&lt;/code&gt; because the bootstrap path (used by CLI) was never wired into the daemon path. Two code paths, one constitutional contract, and one path that didn't honour it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The governance system catching your own mistake is the point.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The audit rejection of the wrong-layer import wasn't an obstacle. It was the system working as designed. The constitution exists precisely for moments when you're moving fast and tempted to "just make it work." The system slowed that impulse and enforced the right fix.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
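&lt;p&gt;The first lesson is concrete enough to show side by side. A sketch of the anti-pattern and its honest replacement; &lt;code&gt;ActionResult&lt;/code&gt; and the fixer below are stand-ins, not CORE's actual classes:&lt;/p&gt;

```python
from dataclasses import dataclass

# Stand-in for CORE's result type; the real class differs.
@dataclass
class ActionResult:
    ok: bool
    detail: str = ""

def apply_fixes(files):
    # Stand-in for the real fixer; raises when there is nothing to fix.
    if not files:
        raise ValueError("nothing to fix")
    return len(files)

def fix_logging_silent(files):
    # Anti-pattern: reports success unconditionally, regardless of outcome.
    try:
        apply_fixes(files)
    except Exception:
        pass
    return ActionResult(ok=True)

def fix_logging_honest(files):
    # The result reflects what actually happened, so failures surface
    # on the blackboard instead of hiding behind ok=True.
    try:
        count = apply_fixes(files)
    except Exception as exc:
        return ActionResult(ok=False, detail=str(exc))
    return ActionResult(ok=True, detail=f"{count} fixes applied")
```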

&lt;h3&gt;
  
  
  Where CORE Stands Now
&lt;/h3&gt;

&lt;p&gt;We're not at full A3 autonomy yet. There's still an honest checklist of 8 remaining gaps—including rollback safety on partial execution failures, DB-level concurrency constraints, and the Logic Conservation Gate from an earlier test.&lt;/p&gt;

&lt;p&gt;But Tier 1 autonomous actions now work. The loop has closed.&lt;/p&gt;

&lt;p&gt;The system that was broken used its own constitutional rules to enforce the fix that unblocked it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/DariuszNewecki/CORE" rel="noopener noreferrer"&gt;https://github.com/DariuszNewecki/CORE&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous posts in the series:&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/dariusz_newecki_e35b0924c/my-ai-has-22-workers-2470-resolved-violations-and-still-cant-call-itself-autonomous-heres-the-gap-4020"&gt;My AI Has 22 Workers, 2,470 Resolved Violations, and Still Can't Call Itself Autonomous. Here's the Gap.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/dariusz_newecki_e35b0924c/my-ai-governance-system-passed-its-own-audit-then-i-wrote-one-rule-now-it-fails-thats-the-point"&gt;My AI Governance System Passed Its Own Audit. Then I Wrote One Rule. Now It Fails. That's the Point.&lt;/a&gt;
(and the rest of the CORE series)&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>ai</category>
      <category>aigovernance</category>
      <category>opensource</category>
      <category>python</category>
    </item>
  </channel>
</rss>
