Forem: Francesco Marconi

Compiling a Dungeon: A Real-World ISL Case Study

Francesco Marconi — Sun, 03 May 2026 20:47:52 +0000

Bonus chapter — Engineering Intent Series

Live demo: dungeon-demo-isl.netlify.app
Source: example/dungeon/ in the ISL repository

The first four articles of this series argued, from different angles, that LLM-driven development becomes predictable only when you replace prompts with a compilable specification. That is a strong claim, and a small toy app is not enough to defend it.

So we built something that is not a toy: a board-game-style dungeon crawler with turn-based combat, fog of war, monster AI, spells with effect lifecycles, traps, secret passages, a campaign manager, a shop, and an in-game editor. It is a single-page React app, fully generated from ISL.

This article is a frank look at what worked, what hurt, and what we learned by pushing ISL well past the size where prompt-driven workflows fall apart.

The Scope, in Numbers

The example/dungeon/ stack is intentionally larger than a demo. It is not a tutorial — it is the stress test:

Artifact	Count
Top-level ISL specs (`*.isl.md`)	51
Behavioral specs (`logic-test/*.test.isl.md`)	49
Cross-cutting regression specs (`regression/`)	4
Generated source files in `bin/` (`.js` / `.jsx`)	56
Generated TypeScript interface signatures (`.sign.ts`)	52
Approx. generated source size	~330 KB

Roles are spread across the spec set: a domain-* core (ruleset, session, map, spells data), a thick dungeon-use-* business-logic layer implemented as React hooks, a dungeon-* presentation layer for the board and the modals, and a thin main shell that wires everything together.

Nothing in bin/ is hand-written. Every change starts in an .isl.md file.

Team and effort

One detail matters before reading any of the productivity numbers below.

Team size: 1 developer.
Commitment: part-time, alongside other work.
Calendar duration: 9 weeks of elapsed time, not 9 weeks of full-time effort.

This is not a team project. It is a single-developer, part-time stress test of ISL at scale. The figures that follow — 73 commits, 51 specs, 56 generated files, a working live demo — were produced under those constraints, not by a dedicated squad on a sprint. Calibrate the rest of the article against that fact: every "we" in this piece is editorial; the actual workforce was one person, when time allowed.

We chose not to publish hour-by-hour effort estimates because we did not measure them rigorously, and a fabricated breakeven point would betray the empirical tone of this report. What we can verify is the calendar-time and commit-count picture below.

Why ISL Earned Its Keep at This Size

There is a size threshold above which an "AI assistant" stops helping and starts producing entropy. Around 30 components, three things break in classic prompt-driven workflows:

Context fragments. You can no longer paste "the relevant parts of the project" into a chat. Decisions made in one component drift away from the assumptions of another.
Re-generation is destructive. Any non-trivial fix re-touches code the previous generation got right, often introducing micro-regressions.
The spec ages out. Documentation, where it exists, is the first thing to fall out of sync with the code.

The dungeon project hits all three. Without ISL, it would have been a slow death by inconsistency. The ISL pipeline addresses each of those failure modes by construction:

Context surgery, not text dump. The Builder produces a .build.md for each component that contains the spec plus the public interfaces of just its dependencies. A modal's generation prompt does not see the entire game; it sees the few hooks it actually consumes.
Real signatures, not aspirational ones. Each generated component emits a .sign.ts describing what it actually exports. Downstream components compile against that file. With 52 of those signatures in bin/, integration drift collapses to near zero.
The spec is the truth. Code is a build artifact, signed and read-only. There is no "the doc was right but the code does something else": either the spec is updated, or the rebuild fails.

These mechanisms were described in earlier articles. The interesting question is whether they actually scale — and the dungeon answers yes, but with caveats worth naming out loud.

What 1.6.2 Made Possible (That 1.6.1 Did Not)

Article 4 walked through the v1.6.2 patterns abstractly. The dungeon is where they earned their place.

Roles save us from "logic in the wrong file"

Before explicit Role: Domain / Role: Business Logic / Role: Presentation, hooks like dungeon-use-combat or dungeon-use-monster-ai would slowly absorb pieces of UI state — selected enemy, hover targets, "is this modal open" flags — because the generator had no formal reason to push them back to the presentation layer. With explicit roles, the same generation produces clean hooks and components that stop gluing themselves to each other.

State qualifiers fix React hooks specifically

dungeon-use-session-manager is the canonical case. Before v1.6.2, "the session" was sometimes regenerated as local component state, sometimes as context, sometimes as a singleton. Once the spec marks @GameSession as external and visibleEnemies as calculated, regeneration becomes deterministic — and useMemo shows up exactly where it should.

Effect Lifecycle is non-negotiable for spells and traps

dungeon-use-magic, dungeon-use-traps and dungeon-use-treasure define dozens of temporary effects: RockSkin, Genie, poison, fear, traps that fire once, treasures that trigger a wandering monster. Without an Apply / Active While / Expires When / Cleanup block, those effects would be re-derived inconsistently on every regeneration. With it, the lifecycle is a contract, and the generator can satisfy it in different idiomatic ways without changing what is true at any tick of the game.

Logic & Execution Rules vs. Flow

Combat order, modifier application, idempotency of movement — these are normative rules, not procedures. The dungeon makes the distinction explicit: "evaluate base attack → weapon → defense → modifiers; clamp to ≥ 0" lives in Logic & Execution Rules, while the Flow of attackTarget only describes the user-facing sequence. This decoupling is what lets us re-generate combat code dozens of times without the result drifting.

What We Actually Did Differently to Make It Work

A spec language is necessary but not sufficient. A few practices, learned the hard way, mattered just as much.

1. Generate behavioral tests as ISL, not as code

The 49 specs under logic-test/ and the 4 under regression/ are not unit tests written in JavaScript. They are ISL documents that describe what should be true across components. The auditor consumes them; the test runner does too. The same source describes both expected behavior and verification.

This is the inversion that took the longest to internalize: the test is also a spec. When a regression appears, you do not patch the code first; you write or amend a *.test.isl.md that fails, and only then re-generate. The fix lives in the source-of-truth artifact, the spec, not in the generated code.

2. Bring the legacy data with its quirks intact

dungeon-script-runtime.isl.md is the most "anti-pure" file in the project. It openly says: preserve the quirks of the original script text instead of imposing a stricter modern syntax. We import existing campaign JSON files (HQBase01.json, …) without rewriting them, and the runtime is specified to handle their idiosyncrasies — missing semicolons, line-based statement termination, etc.

The lesson: ISL is not allergic to messy reality. You can specify the leniency. It is not a license to be sloppy; it is a way to be explicit about what compatibility you are buying and at what cost.

3. Keep the dependency graph shallow on purpose

We deliberately split domain-* files horizontally (core, map, ruleset, session, spells-data) rather than nesting one inside the other. The reason is operational: the Builder's topological order means a change in a deep root invalidates everything above it. Five flat domain files re-build cheaply; one mega-domain would invalidate the world on every tweak.

This is a build-system instinct, not an OOP one. Treat your spec graph the way you treat a Bazel or Cargo workspace.
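
The rebuild-cost argument can be made concrete with a small sketch. It assumes a hypothetical in-memory map from each spec to the specs it depends on; the spec names come from this article, but the edges are invented for illustration, and the real Builder's invalidation logic may differ.

```typescript
// Hypothetical dependency map: spec name -> specs it depends on.
type SpecGraph = Record<string, string[]>;

// Returns every spec that must be rebuilt when `changed` is edited:
// the spec itself plus all of its transitive dependents.
function invalidatedBy(graph: SpecGraph, changed: string): string[] {
  const dirty: string[] = [changed];
  let grew = true;
  while (grew) {
    grew = false;
    for (const spec of Object.keys(graph)) {
      const deps = graph[spec] || [];
      const alreadyDirty = dirty.indexOf(spec) !== -1;
      const touchesDirty = deps.some((dep) => dirty.indexOf(dep) !== -1);
      if (!alreadyDirty && touchesDirty) {
        dirty.push(spec);
        grew = true;
      }
    }
  }
  return dirty;
}

// Flat layout: each hook depends only on the domain files it needs (illustrative edges).
const flat: SpecGraph = {
  "domain-core": [],
  "domain-map": [],
  "domain-spells-data": [],
  "dungeon-use-magic": ["domain-core", "domain-spells-data"],
  "dungeon-use-traps": ["domain-core", "domain-map"],
};

// Mega-domain layout: everything hangs off a single root (illustrative).
const mega: SpecGraph = {
  "domain": [],
  "dungeon-use-magic": ["domain"],
  "dungeon-use-traps": ["domain"],
};

const flatDirty = invalidatedBy(flat, "domain-spells-data");
const megaDirty = invalidatedBy(mega, "domain");
```

In this toy graph, editing the spells data dirties only the magic hook, while any edit to the single mega-domain dirties every hook above it.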

4. Use `.sign.ts` as the contract, not the spec

Article 3 explained why generators receive real signatures, not idealized ones. In the dungeon this is what kept the React presentation layer honest: dungeon-board.jsx does not see what useDungeonSessionManager should return — it sees what it does return. When a hook's API drifts (and they did, several times), the presentation layer's regeneration fails immediately, on a missing field, instead of silently producing a runtime crash.

In effect, the .sign.ts files act like an internal package boundary inside a monorepo, with the bonus that they are auto-generated from real code.

What Hurt, Honestly

ISL is not a magic wand and the dungeon project surfaced its real costs.

The first 30% is slower than prompting

The investment is front-loaded. Writing domain-core and the first three dungeon-use-* hooks in ISL is slower than vibing them out in a chat — visibly slower. Pay-off begins around the 10th component, when re-generation of any one spec costs you minutes instead of an afternoon of re-aligning the rest of the system.

If you ship a prototype next week, ISL is not the right tool. If you intend to maintain it for six months, it pays for itself many times over.

Effect Lifecycle is easy to under-specify

The first version of the spell system specified Apply and Active While but glossed over Cleanup. Result: subtle leaks where defeated monsters kept their poison status in the session object. The generator was correct against the spec; the spec was wrong. The lesson: Cleanup is the part you are tempted to skip, and it is the part that bites. The 1.6.2 four-field shape is opinionated for a reason.

Logic-tests are work, and you must commit to writing them

49 logic-test specs is not free. It is a non-trivial slice of the total writing time. The temptation, especially under pressure, is to skip them. The dungeon makes the consequence visible: every regression we shipped traces back to a behavior we never wrote down as an acceptance criterion. The auditor cannot catch what was never specified.

Generators have an opinion about file size

Past a certain spec size, generation quality drops. Some hooks (dungeon-use-combat, dungeon-use-session-manager) hit the ceiling and had to be split. There is no formal limit — it is empirical. The good news: splitting is cheap when the rest of the graph is properly typed.

`bin/` is bigger than necessary

A footprint of roughly 330 KB of JavaScript for a board-game clone is not minimal. The generator favors clarity and idiomatic patterns over byte-counting. We accept it; it is the price of generated code that is reviewable and re-buildable. If you need bundle-size discipline, it must be expressed in the spec (constraints on dependencies, hints toward lighter libraries), not applied after the fact.

What This Project Is Not

A few honest disclaimers, because casual readers will look at the live demo and read more into it than they should:

It is not "AI built a game from one prompt." It is a 51-module specification compiled by a deterministic pipeline whose quality is mainly a function of how well the spec is written.
It is not bug-free. Some edge cases (specific monster AI behaviors, certain trap interactions) still misbehave. The point is that fixing them is a spec change followed by a targeted re-generation, not a code archaeology session.
It is not the smallest example to start from. For a first contact with ISL, look at example/design-pomodoro or example/architect-* — three to seven specs, the same patterns, less overwhelming.
It is the heaviest end-to-end reference, and that is what it is for.

What the Git History Actually Says

Opinions are cheap. The repository is not. The dungeon project carries 73 commits spread over roughly nine weeks (late February to early May 2026), and reading the log is more sobering than any narrative we could write.

Activity profile

Month	Commits
2026-02	1
2026-03	8
2026-04	52
2026-05 (partial)	12

The shape is typical of a project that needed an exploratory phase before it took off: one initial commit in February, a slow month in March (8 commits, mostly experimental), then the productivity inflection in April (52 commits) once the spec graph and the tooling stabilized.

There is a story behind the inflection. The ISL logic-test generation tooling itself was committed on 2026-03-29 — inside the project window, not before. The first 30 days were spent without an auditor. Once it landed, throughput roughly tripled. The takeaway is uncomfortable but useful: the cost of building the verification loop pays for itself almost immediately, but only if you actually build it.

The commit type mix

Type	Count	Share
`feat:`	35	48%
`refactor:`	23	32%
`fix:`	5	7%
`chore:`	5	7%
other	5	7%
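
The shares in the table can be reproduced from the raw counts with a few lines of arithmetic. This is only a check of the published numbers; the variable and function names are illustrative.

```typescript
// Commit counts per type, copied from the table above.
const commitCounts: Record<string, number> = {
  feat: 35,
  refactor: 23,
  fix: 5,
  chore: 5,
  other: 5,
};

// Sum of all commits across types.
let total = 0;
for (const type of Object.keys(commitCounts)) {
  total += commitCounts[type] || 0;
}

// Share of one commit type, rounded to a whole percent.
function sharePercent(type: string): number {
  return Math.round(((commitCounts[type] || 0) / total) * 100);
}
```

The counts add up to the 73 commits quoted elsewhere in the article. Because each share is rounded independently, the percentages in the table sum to 101 rather than 100.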

Two numbers stand out.

Only 5 fixes in 73 commits (~7%). That is far below the typical bug-to-feature ratio of an organically-grown codebase of comparable size. We do not interpret this as "ISL prevents bugs" — it does not. We interpret it as "ISL pushes corrections back into the spec, where they show up as a feat: or a refactor:, not as a fix: of broken code." The fix work is real; it is just classified differently because the unit of change is the specification, not a patch in a .js file.

32% refactor commits. This is the honest counter-balance. Even with a deterministic pipeline, restructuring the spec remains a constant activity: splitting components that grew too big, lifting state across roles, renaming domain concepts, normalizing flow descriptions. Spec-driven development does not eliminate refactoring — it relocates it. Refactor a .isl.md and the next regeneration moves the code with it; that is the shape of the work, not a flaw of the method.

Where the churn really lives

The five most-modified specs in the project are not the obscure ones — they are the spine:

Spec	Revisions
`dungeon.isl.md`	28
`dungeon-board.isl.md`	19
`dungeon-turn-controls.isl.md`	18
`dungeon-use-turn-logic.isl.md`	17
`dungeon-use-session-manager.isl.md`	12

The pattern: a small group of orchestration specs absorbs most of the iteration, while leaf modules (data, individual modals, isolated rules) are rewritten three or four times at most and then left alone. This is exactly the topology you want — change cost concentrated where you can see it, not scattered across the periphery — but it implies a planning consequence: invest in the orchestration specs first, and accept they will keep moving.

The matching test specs follow the same ranking almost line for line:

Test spec	Revisions
`dungeon-board.test.isl.md`	11
`dungeon.test.isl.md`	10
`dungeon-turn-controls.test.isl.md`	8
`dungeon-use-treasure.test.isl.md`	7
`dungeon-use-turn-logic.test.isl.md`	7

When a spine spec moves, its test moves with it. This co-evolution is one of the strongest signals in the repository that ISL is working as intended — the test is never further than one commit away from the behavior it describes.

The regression specs tell the truth about us

The four files in regression/ (turn-combat, loot-inventory, map-visibility, campaign-state) all appeared on a single day: 2026-05-01. That is not a coincidence. It is the day we admitted we had been finding the same families of bugs more than once.

In other words: regression specs were written reactively, after we shipped the bugs, not proactively while writing the features. ISL did not prevent the bugs. It made the second occurrence cheap to catch — but it could not invent acceptance criteria we never bothered to write. This is the most useful lesson the history offers: the auditor is exactly as strong as the specs you give it, no more.

The 1.6.2 migration was a non-event

The project began on ISL 1.6.1 and migrated to 1.6.2 in a single commit on 2026-05-01 (chore: Update ISL version to 1.6.2 across all documentation and examples). The diff touches every spec, but the regenerated code remains compatible: 1.6.2 is a vocabulary and discipline upgrade, not a breaking change. Nine weeks of work survived a language version bump unscathed.

For a spec language, that is not a small claim. It is the kind of thing you can only verify after a non-trivial project has lived through it.

Considerations We Took Away

Compressed, opinionated, after living in the dungeon for several iterations:

One part-time developer was enough to ship this. Anything you read below is filtered through that constraint. Scale the conclusions accordingly when projecting them onto a larger team.
Specs are the asset, code is the artifact. Treat the .isl.md files as your codebase. The bin/ is build output, full stop.
Behavioral tests want to live in ISL. A test rewritten as a .test.isl.md outlives any specific framework, runtime, or model.
Roles are not a comment. They are a constraint on what the generator is allowed to produce. Honor them.
Effect Lifecycle has four fields, not three. Cleanup is the one that matters most.
Flat graphs beat deep graphs for incremental rebuilds. Optimize your spec topology like you would optimize a build target.
.sign.ts is your real contract. When in doubt, read the signature, not the spec.
Front-load the writing. The work you save later is roughly proportional to the precision you put in the first five specs.
Build the auditor early. The repository's productivity tripled the moment the logic-test tooling landed. There is no reason to delay it — the cost of writing it is recovered in weeks.
Refactor is not a smell here, it is the shape of the work. A third of the commits are spec refactors. That is the price of moving change to the right place.
Regression specs are reactive by nature. Plan for it. They will appear after the second time a bug bites — make sure you actually write them then.

Closing

The dungeon project is, for us, the answer to the only question that matters about ISL: does it scale? We can now point to a live, playable, non-trivial app whose 56 source files were generated, signed, and audited from 51 specs. It was built by a single developer working part-time, in 73 commits over nine weeks of calendar time, with fix: commits at roughly 7% of the total and a clean migration to ISL 1.6.2 along the way.

If a spec-driven pipeline can deliver an end-to-end React game of this size with one person on a part-time schedule, the interesting question for any larger team is not "can we afford ISL?" — it is "what could we build if we stopped negotiating with the model and started compiling our intent?"

Try the demo: dungeon-demo-isl.netlify.app.
Read the specs: example/dungeon/ in the ISL repository.
Then write your own.

The interesting argument is no longer whether AI can write code. It is whether your team is ready to write the spec it deserves.

Tags: #ISL #CaseStudy #LLM #AI #DevTools #SpecEngineering #React #OpenSource

Part 4 of 4 — Engineering Intent Series : ISL v1.6.2: Evolution for Complex Systems

Francesco Marconi — Sat, 18 Apr 2026 10:00:11 +0000

Introduction

This article is a standalone look at the 1.6.2 changes, but it also builds on the ideas introduced in earlier pieces of the series.

Version 1.6.2 is not about redesigning ISL. It is about formalizing patterns that already exist in real projects and removing the ambiguity that emerges when those patterns are left implicit.

The changes in this release address five specific areas where real systems already need extra clarity:

what role each piece of the spec plays (Domain vs. Business Logic vs. Backend)
which state belongs where (external, internal, or calculated)
which behaviors are normative rules versus optional guidance
how temporary effects are born, live, and die
how Flow should express logic without becoming code

The Main Changes in 1.6.2

Expanded Role Taxonomy

ISL has always had Presentation and Backend. v1.6.2 adds three more roles that already appear in mature projects:

Domain: entities, value structures, enumerations, static rules, shared concepts
Business Logic: deterministic use cases, state transitions, orchestration rules, non-visual runtime behavior
Test: scenario specifications, fixtures, assertions, behavior validation

The reason for this change is that real projects were already separating these concerns. Without explicit role names, a reader and a model have to guess whether a declaration is part of the domain model or a behavior specification.

Example—what is this?

Pricing Rule: Calculate final price based on promotion tier and membership.

Is this a data structure that belongs to Domain? Is this an operation that belongs to Business Logic? Without the role, it is ambiguous.

With explicit roles, the distinction becomes clear:

Role: Domain
- Hero: a character with attributes (HP, attack, defense).
- Stat Types: BaseStats, ModifiedStats (derived from effects).
- MembershipTier: standard, silver, gold.

Role: Business Logic
- When a purchase is placed, apply the current membership discount.
- When an effect is applied, update ModifiedStats from active effects.

Here, Domain is structure and naming. Business Logic is when and how that structure is used.

Internal State Section (Optional)

Complex components need three kinds of values:

values that come from outside (props, context, shared state)
values that the component owns and maintains
values that are derived and should not be stored

v1.6.2 clarifies this with three qualifiers: **external**, **internal**, **calculated**.

Example:

### 🗂 Internal State

- `gameSession` **external**: provided by the application runtime
- `visibleEnemies` **calculated** from `gameSession.enemies` and `currentHero.vision`
- `selectedTarget` **internal**: which enemy the player is hovering over

Without these qualifiers, a model might assume that visibleEnemies should be stored. Making it explicit prevents that mistake.
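
As a rough illustration of what **calculated** means for generated code, the sketch below derives `visibleEnemies` on demand instead of storing it. The type shapes and the distance rule are assumptions made for this example, not the dungeon project's actual definitions.

```typescript
// Hypothetical shapes; a real @GameSession would be richer.
interface GridPosition { x: number; y: number }
interface Enemy { id: string; position: GridPosition }
interface Hero { position: GridPosition; vision: number }
interface GameSession { enemies: Enemy[] }

// `visibleEnemies` is calculated: a pure function of external inputs,
// recomputed on demand and never stored next to them.
function visibleEnemies(session: GameSession, hero: Hero): Enemy[] {
  return session.enemies.filter((enemy) => {
    const dx = Math.abs(enemy.position.x - hero.position.x);
    const dy = Math.abs(enemy.position.y - hero.position.y);
    // Assumption: vision is a square radius; a spec may define it differently.
    return Math.max(dx, dy) <= hero.vision;
  });
}

const session: GameSession = {
  enemies: [
    { id: "goblin", position: { x: 2, y: 1 } },
    { id: "orc", position: { x: 9, y: 9 } },
  ],
};
const hero: Hero = { position: { x: 0, y: 0 }, vision: 3 };
const seen = visibleEnemies(session, hero).map((enemy) => enemy.id);
```

In a React hook, a function like this would typically sit inside `useMemo`, keyed on the session and the hero, so the derived list is never a second source of truth.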

Logic & Execution Rules Pattern

This section describes normative execution constraints. It is NOT a list of actions (that is Flow).

The common mistake is confusing this with Flow. Here is the difference:

❌ Wrong (this is Flow, not Execution Rules):

Logic & Execution Rules:
- Check if the player can move.
- If yes, update position.
- If no, show error.

✓ Right (these are normative rules):

### ⚙ Logic & Execution Rules

#### Movement Validation

- Walkable: a cell is not occupied and not blocked
- Occupied: another hero or enemy is on the cell
- Check walkability BEFORE position update
- Order: terrain check → occupancy check → apply movement

#### Position Commit

- Atomicity: position updates in `@GameSession` are atomic
- Idempotency: the same move action applied twice MUST NOT double-shift
- Animation sync: UI animation starts AFTER position commit

These are rules about order, atomicity, and normalization—not a procedure.
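
One way a generator might satisfy the atomicity and idempotency rules is sketched below. It assumes each move action carries a unique id and that the session keeps a list of committed ids; both are illustrative choices, not something the ISL spec mandates.

```typescript
// Hypothetical shapes for illustration.
interface GridPosition { x: number; y: number }
interface MoveAction { id: string; heroId: string; dx: number; dy: number }
interface SessionState {
  positions: Record<string, GridPosition>;
  appliedMoveIds: string[]; // ids of move actions already committed
}

// Commits a move by returning a new session object in one step.
// Re-applying an action whose id was already committed returns the
// session unchanged, so the same move can never shift a hero twice.
function commitMove(session: SessionState, action: MoveAction): SessionState {
  if (session.appliedMoveIds.indexOf(action.id) !== -1) {
    return session;
  }
  const from = session.positions[action.heroId];
  if (from === undefined) {
    return session; // unknown hero: nothing to move
  }
  const to: GridPosition = { x: from.x + action.dx, y: from.y + action.dy };
  return {
    positions: { ...session.positions, [action.heroId]: to },
    appliedMoveIds: [...session.appliedMoveIds, action.id],
  };
}

const start: SessionState = {
  positions: { barbarian: { x: 1, y: 1 } },
  appliedMoveIds: [],
};
const step: MoveAction = { id: "move-1", heroId: "barbarian", dx: 1, dy: 0 };
const once = commitMove(start, step);
const twice = commitMove(once, step);
```

Applying `step` a second time returns the same session object, which is what "MUST NOT double-shift" asks for. A different generator could reach the same guarantee another way, for example with absolute target coordinates.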

Effect Lifecycle Pattern

Effects have four parts: when they start, what keeps them active, what ends them, what cleanup is needed.

v1.6.2 formalizes this with: Apply, Active While, Expires When, Cleanup.

Example:

### 🔄 Effect Lifecycle

#### RockSkin Spell

- **Apply**: when the spell is cast on a hero
- **Active While**: hero has shields remaining
- **Expires When**: shields are depleted to zero
- **Cleanup**: remove `rockSkinActive` from status; restore default armor

#### UI Notification Toast

- **Apply**: immediately after successful purchase
- **Active While**: toast rendered AND timer not expired
- **Expires When**: user dismisses OR 5 seconds elapsed
- **Cleanup**: remove from DOM; cancel pending animations

The structure makes boundaries explicit.
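
The four fields map naturally onto a small interface. The sketch below is one possible shape, with a RockSkin effect as the instance; the hero fields, the shield count and the armor values are assumptions invented for the example.

```typescript
// Hypothetical hero shape for illustration only.
interface HeroState {
  shields: number;
  armor: number;
  status: string[];
}

// One member per ISL lifecycle field.
interface EffectLifecycle<T> {
  apply(target: T): void;          // Apply
  activeWhile(target: T): boolean; // Active While
  expiresWhen(target: T): boolean; // Expires When
  cleanup(target: T): void;        // Cleanup
}

const DEFAULT_ARMOR = 2; // assumed value

const rockSkin: EffectLifecycle<HeroState> = {
  apply(hero) {
    hero.status.push("rockSkinActive");
    hero.shields = 2;               // assumed number of shields
    hero.armor = DEFAULT_ARMOR + 1; // assumed bonus
  },
  activeWhile(hero) {
    return hero.shields > 0;
  },
  expiresWhen(hero) {
    return hero.shields <= 0;
  },
  cleanup(hero) {
    hero.status = hero.status.filter((flag) => flag !== "rockSkinActive");
    hero.armor = DEFAULT_ARMOR;
  },
};

// Runs once per game tick and returns whether the effect is still active.
// Cleanup runs as soon as the effect has expired or is no longer active.
function tick<T>(effect: EffectLifecycle<T>, target: T): boolean {
  if (effect.expiresWhen(target) || !effect.activeWhile(target)) {
    effect.cleanup(target);
    return false;
  }
  return true;
}

const hero: HeroState = { shields: 0, armor: DEFAULT_ARMOR, status: [] };
rockSkin.apply(hero);
const activeAfterCast = tick(rockSkin, hero);
hero.shields = 0; // shields consumed by incoming damage
const activeAfterHits = tick(rockSkin, hero);
```

Leaving `cleanup` empty in this sketch would keep `rockSkinActive` in the status list after expiry, which is the kind of leak an under-specified Cleanup field invites.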

Clarified Flow Semantics

Flow should be semantic, not code-like. v1.6.2 clarifies what is valid:

Valid (behavior sequencing):

### Flow

1. Player selects "Attack".
2. Validate target is in range and visible.
3. Calculate damage using Combat Rules.
4. Apply damage to target.
5. Update visual state and play animation.
6. Return to input state.

Invalid (implementation pseudo-code):

### Flow

1. Set selected = true.
2. If selected == true, then call updateModel().
3. var damage = baseAttack - defense.
4. ...

Flow describes what happens and the order that matters for behavior, not variable mechanics.

Symbol Reference Notation

One final clarification: how to distinguish between a type reference and a variable reference in specs.

In real projects, authors and generators need to know: is this a reusable type definition, or a local variable?

v1.6.2 standardizes a simple convention:

@TypeName: A reusable type, state structure, or domain concept (e.g., @GameSession, @State, @Spell)
`variableName`: A local variable, parameter, or temporary value (e.g., currentHero, damage, targetMonster)

Example:

### Role: Business Logic

**Signature:**
- `gameSession`: @GameSession
- `currentSpell`: @Spell

### Flow

1. Find `targetMonster` in `gameSession.monsters`.
2. Calculate `damageAmount` = `currentSpell.baseDamage`.
3. Apply `damageAmount` to `targetMonster.health`.
4. Persist via `commitSessionUpdate(@GameSession)`.

Why does this matter? A generator that does not know @GameSession is a stable type might regenerate it locally instead of reusing the shared instance. That breaks determinism and correctness.

When a generator sees @GameSession, it knows: "reference the existing contract, do not instantiate a new one." When it sees `damageAmount`, it knows: "this is local scope and transient."

This notation is not new; it is already used consistently in mature ISL projects. v1.6.2 formalizes it so it becomes explicit and intentional.
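
To show how mechanical the distinction is, here is a minimal sketch of a scanner that separates the two notations in a piece of spec text. The regular expressions and the function name are assumptions for this example; the actual ISL tooling may parse references differently.

```typescript
interface SymbolRefs {
  types: string[];     // names written as @TypeName
  variables: string[]; // root names written inside backticks
}

// Appends a value only if it is defined and not already present.
function pushUnique(list: string[], value: string | undefined): void {
  if (value !== undefined && list.indexOf(value) === -1) {
    list.push(value);
  }
}

// Collects @TypeName references and backticked variable roots from spec text.
function extractSymbols(specText: string): SymbolRefs {
  const refs: SymbolRefs = { types: [], variables: [] };
  const typePattern = /@([A-Z][A-Za-z0-9]*)/g;
  // Captures the leading identifier of a backticked expression,
  // so `gameSession.monsters` yields "gameSession".
  const variablePattern = /`([a-z][A-Za-z0-9]*)[^`]*`/g;

  let match: RegExpExecArray | null;
  while ((match = typePattern.exec(specText)) !== null) {
    pushUnique(refs.types, match[1]);
  }
  while ((match = variablePattern.exec(specText)) !== null) {
    pushUnique(refs.variables, match[1]);
  }
  return refs;
}

const sample = [
  "1. Find `targetMonster` in `gameSession.monsters`.",
  "2. Apply `damageAmount` to `targetMonster.health`.",
  "3. Persist the result in @GameSession.",
].join("\n");
const symbols = extractSymbols(sample);
```

For the sample text, the scanner reports one shared type (`GameSession`) and three local names, which is the split a generator needs before deciding what to reuse and what to declare locally.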

Examples

Example 1: Domain Is Structure, Not Behavior

Where do you describe a Hero's stats? In Domain.

Role: Domain

- Hero: a character with attributes (HP, attack, defense, abilities).
- BaseStats: fixed numeric values.
- ModifiedStats: derived from BaseStats and active effects.
- ActiveEffect: a temporary modification to stats.

Domain defines the structure. It does not say "when a buff is applied, add 2 to attack." That belongs in Business Logic or Effect Lifecycle.

Example 2: Damage Calculation Uses Domain + Business Logic + Execution Rules

Role: Domain

- Damage: a numeric value derived from attack and defense.
- AttackModifier: a numeric bonus from weapons or effects.

Role: Business Logic

- To calculate damage: (Attacker.Attack + weapon bonus) - Defender.Defense.
- Apply modifiers from active effects.
- Clamp result to [0, max damage].

### ⚙ Logic & Execution Rules

#### DamageOrder

- Evaluation order: base attack → weapon bonus → defender defense → modifiers.
- Modifier application: additive first, multiplicative second.
- Clamp: after all calculations, apply max(result, 0).

Notice: Domain says what a Damage is. Business Logic says what to do. Execution Rules says order and normalization.
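
A generated implementation that respects the DamageOrder rules could look like the sketch below. The input shape and field names are hypothetical; only the evaluation order, the additive-then-multiplicative rule and the final clamp come from the spec above.

```typescript
// Hypothetical input shape for illustration.
interface DamageInput {
  baseAttack: number;
  weaponBonus: number;
  defense: number;
  additiveModifiers: number[];       // flat bonuses or penalties
  multiplicativeModifiers: number[]; // scaling factors
}

function calculateDamage(input: DamageInput): number {
  // Evaluation order: base attack -> weapon bonus -> defender defense -> modifiers.
  let result = input.baseAttack;
  result += input.weaponBonus;
  result -= input.defense;

  // Modifier application: additive first, multiplicative second.
  for (const bonus of input.additiveModifiers) {
    result += bonus;
  }
  for (const factor of input.multiplicativeModifiers) {
    result *= factor;
  }

  // Clamp: after all calculations, never return negative damage.
  return Math.max(result, 0);
}

const strongHit = calculateDamage({
  baseAttack: 3,
  weaponBonus: 2,
  defense: 1,
  additiveModifiers: [1],
  multiplicativeModifiers: [2],
});

const blockedHit = calculateDamage({
  baseAttack: 1,
  weaponBonus: 0,
  defense: 4,
  additiveModifiers: [],
  multiplicativeModifiers: [],
});
```

With these inputs `strongHit` is (3 + 2 - 1 + 1) × 2 = 10. Applying the multiplier before the additive bonus would give 9 instead, which is why the order belongs in a normative rule rather than being left to the generator. `blockedHit` would be -3 before the clamp and is returned as 0.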

Example 3: Internal State Makes Ownership Clear

### 🗂 Internal State

- `@GameSession` **external**: persistent game state owned by the backend
- `selectedHero` **internal**: which hero is currently selected in the UI
- `canAttack` **calculated** from `@GameSession.activeHero`, active effects, and reachable targets

Without qualifiers, ownership is implicit. With them, it is explicit.

Example 4: Effect Lifecycle Declares the Full Lifecycle

### 🔄 Effect Lifecycle

#### Poison Status

- **Apply**: when Poison spell hits a hero
- **Active While**: hero is alive AND poison timer has not expired
- **Expires When**: poison timer reaches zero OR hero is cured by antidote
- **Cleanup**: remove from active status list; restore HP display to normal

This makes the entire lifecycle visible.

Why This Matters

The reason for these changes is determinism. When ISL is ambiguous about roles, state ownership, execution order, and effect lifecycles, different models (and different people) make different assumptions. That leads to inconsistent code generation and rework.

By formalizing these patterns, ISL becomes a clearer contract between authors and generators. The patterns do not force you to use them; they are optional. But when you need them, they give you a way to be precise.

Conclusion

ISL v1.6.2 does not change the fundamental nature of ISL. It remains Markdown-first and human-readable. What v1.6.2 does is give you explicit language for distinctions that complex systems already need to make.

The five changes—role taxonomy, state qualifiers, execution rules, effect lifecycle, and clarified Flow—address the most common sources of ambiguity in real projects. They are optional, backward-compatible, and designed to reduce the friction between human intent and generated behavior.

For the full specification, see Intent Specification Language (ISL).

Part 3 of 3 — Engineering Intent Series -- Inside the Machine: The ISL Build Pipeline

Francesco Marconi — Wed, 01 Apr 2026 01:09:48 +0000

In Part 1 we identified the problem. In Part 2 we dissected the spec format. Now we go inside the engine that transforms intent into production code — automatically, verifiably, and repeatably.

This is the ISL Tooling Engine. It has two distinct phases: Build and Generate.

Phase 1: The Builder

Real software is never a single file. A LoginForm depends on UserAuthService, which depends on the User domain model, which depends on the AuthStatus enum. Each of those has its own .isl.md spec file.

The ISL Builder scans your entire specification folder, identifies all @Reference tags, and builds a Directed Acyclic Graph (DAG) of your system:

AuthStatus (enum)
    └── User (domain model)
            └── UserAuthService (business logic)
                    └── LoginForm (presentation)

This graph is not documentation. It's the build order.

Topological Sort

The Builder performs a topological sort on this graph — the same technique used by real compilers, package managers (npm, cargo, gradle), and build systems like Make. The rule is simple: before processing any node, all its dependencies must already be resolved.
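
For readers who have not implemented one, a depth-first topological sort fits in a few lines. This is a generic sketch of the technique applied to the LoginForm example, not the Builder's actual code; the graph representation is an assumption.

```typescript
// Hypothetical representation: component name -> names it depends on.
type DependencyGraph = Record<string, string[]>;

// Returns the components ordered so that every dependency precedes its
// dependents. Throws if the graph contains a cycle.
function topologicalOrder(graph: DependencyGraph): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  function visit(node: string): void {
    if (state[node] === "done") {
      return;
    }
    if (state[node] === "visiting") {
      throw new Error("Dependency cycle detected at " + node);
    }
    state[node] = "visiting";
    for (const dep of graph[node] || []) {
      visit(dep);
    }
    state[node] = "done";
    order.push(node); // emitted only after all of its dependencies
  }

  for (const node of Object.keys(graph)) {
    visit(node);
  }
  return order;
}

const graph: DependencyGraph = {
  LoginForm: ["UserAuthService"],
  UserAuthService: ["User"],
  User: ["AuthStatus"],
  AuthStatus: [],
};
const buildOrder = topologicalOrder(graph);
```

For this graph the resulting order is AuthStatus, User, UserAuthService, LoginForm: the leaf of the dependency chain is built first and the presentation component last.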

To trigger the build phase and resolve the project graph:

npx ts-node tools/vscode-isl/src/isl-builder.ts example/design-pomodoro

For each component, the Builder produces two files:

xxx.build.md — the full generation context: the original ISL spec of the component, with all the .ref.md files of its dependencies inlined. This is the complete, self-contained instruction set for that component.

xxx.ref.md — the public interface of the component: what it exports, what its capabilities are, what types it exposes. This is what downstream components will receive as context.

Here's what a real .ref.md looks like — from an actual ISL project:

<!-- INTERFACE (REF) FOR: logic.isl.md -->

# Project: Logic

**Version**: 1.0.0
**ISL Version**: 1.6.1

<!-- IMPLEMENTATION PATH: ./logic -->

> **Reference**: TimerMode, TimerState, TimerConfigEntity in `./domain.isl.md`

## Component: PomodoroEngine

### Role: Business Logic

### ⚡ Capabilities

#### initialize

**Contract**: Sets up the initial state of the timer
**Signature**: Input: None / Output: None

#### startTimer

**Contract**: Initiates the countdown for the current timer mode
**Signature**: Input: None / Output: None

#### selectMode

**Contract**: Changes the active timer mode and resets the timer
**Signature**:

- Input: `mode`: `TimerMode`
- Output: None
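
Putting the two files together, the assembly of a .build.md can be pictured roughly as follows. This is a sketch under assumptions: specs are modeled in memory, only direct dependencies are inlined, and the sample texts are placeholders rather than real spec content.

```typescript
// Hypothetical in-memory model; the real Builder works on .isl.md files.
interface Spec {
  name: string;
  body: string;        // original ISL text of the component
  dependsOn: string[]; // names of its direct dependencies
}

// Builds the text of xxx.build.md: the component's own spec followed by
// the .ref.md of each direct dependency. `refs` maps spec name -> .ref.md text.
function assembleBuildContext(spec: Spec, refs: Record<string, string>): string {
  const sections: string[] = [spec.body];
  for (const dep of spec.dependsOn) {
    const ref = refs[dep];
    if (ref === undefined) {
      throw new Error("Missing .ref.md for dependency " + dep + " of " + spec.name);
    }
    sections.push(ref);
  }
  return sections.join("\n\n");
}

// Placeholder texts, not real spec content.
const refs: Record<string, string> = {
  domain: "<!-- INTERFACE (REF) FOR: domain.isl.md -->",
  logic: "<!-- INTERFACE (REF) FOR: logic.isl.md -->",
};
const ui: Spec = { name: "ui", body: "# Project: UI", dependsOn: ["logic"] };
const uiBuild = assembleBuildContext(ui, refs);
```

In this sketch the ui context contains its own spec and the logic interface, but not the domain interface, because ui does not depend on domain directly.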

The Builder also produces the Build Manifest — a JSON index of all components in topological order, with their source paths, build paths, and integrity hashes:

[
  {
    "sourceFile": "./pomodoro/domain.isl.md",
    "buildFile": "./pomodoro/build/domain.build.md",
    "implementationPath": "./domain",
    "hash": "753ff32724b2a2501ab95b5241afbd48"
  },
  {
    "sourceFile": "./pomodoro/logic.isl.md",
    "buildFile": "./pomodoro/build/logic.build.md",
    "implementationPath": "./logic",
    "hash": "8b36f3559da223a04d77a5ccba40bfbd"
  },
  {
    "sourceFile": "./pomodoro/ui.isl.md",
    "buildFile": "./pomodoro/build/ui.build.md",
    "implementationPath": "./ui",
    "hash": "91d28e3dc2652d9d0f0590acf7f1583a"
  },
  {
    "sourceFile": "./pomodoro/main.isl.md",
    "buildFile": "./pomodoro/build/main.build.md",
    "implementationPath": "./main",
    "hash": "72145c5ee3d89dac2a7f86f0047cb673"
  }
]

Notice the order: domain first, then logic (which depends on domain), then ui (which depends on logic), then main last. The topological sort guarantees this. The Generate phase will process them in exactly this sequence.
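The article does not say how the integrity hashes are consumed. One plausible use, assumed here purely for illustration, is spotting which specs changed between two builds. The `ManifestEntry` fields mirror the manifest shown above; the `changedSpecs` helper is hypothetical and not part of the ISL tooling.

```typescript
// Field names mirror the manifest above.
interface ManifestEntry {
  sourceFile: string;
  buildFile: string;
  implementationPath: string;
  hash: string;
}

// Returns the sourceFile of every entry that is new or whose hash differs
// from the previous manifest. Because filter preserves array order, the
// result stays in topological order.
function changedSpecs(previous: ManifestEntry[], current: ManifestEntry[]): string[] {
  const oldHashes: Record<string, string> = {};
  for (const entry of previous) {
    oldHashes[entry.sourceFile] = entry.hash;
  }
  return current
    .filter((entry) => oldHashes[entry.sourceFile] !== entry.hash)
    .map((entry) => entry.sourceFile);
}
```

A real incremental build would presumably also need to regenerate everything downstream of a changed component; this sketch only identifies the direct changes.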

Phase 2: The Generator

With the Build Manifest ready, the Generator processes each component in topological order, so by the time it reaches any component, all of its dependencies have already been compiled.

For each component, the Generator receives exactly two inputs:

1. xxx.build.md — the spec + inlined interfaces of all dependencies.

2. dep.sign.ts for each dependency — the actual TypeScript signatures generated in previous steps. Not the spec. Not the .ref.md. The real, generated code signatures.

Here's why this matters. The .ref.md describes what a component should expose according to the ISL spec. But the LLM that compiled it may have made idiomatic choices: wrapping a return type, using a language-native pattern, adjusting a signature slightly. The .sign.ts file captures what was actually generated:

import { TimerMode, TimerState, TimerConfigEntity } from "./domain";

export const PomodoroEngine: (initialConfig?: {
  workDuration?: number;
  shortBreakDuration?: number;
  longBreakDuration?: number;
}) => {
  initialize: () => void;
  startTimer: () => void;
  pauseTimer: () => void;
  resetTimer: () => void;
  selectMode: (mode: (typeof TimerMode)[keyof typeof TimerMode]) => void;
  getCurrentState: () => (typeof TimerState)[keyof typeof TimerState];
  getRemainingTime: () => number;
};

When a downstream component (here, ui, which depends on logic) is compiled, it receives this exact signature: not what the spec said, but what the code is. This eliminates integration mismatches structurally, because each component is generated against the code that actually exists.

To generate the implementation using a specific stack and LLM provider:

npx ts-node tools/vscode-isl/src/isl-generator.ts example/design-pomodoro --gemini --v=31p --stack=react-js

The Generator outputs:

xxx.js — the generated implementation
xxx.sign.ts — the actual signature of this component, ready for downstream consumers

Phase 3: The Auditor (Verification)

The pipeline doesn't end when the code is written. To ensure the implementation strictly follows the logical state transitions defined in the ISL, we trigger the Auditor.

The Auditor is a specialized agent that acts as a Deterministic Unit Test Runner. It doesn't just look at the code; it simulates the "Flow" described in the spec against the generated logic.

Its job is to catch Logical Dead-ends:

Is an isLoading flag set but never reset in a CATCH block?
Does a state transition skip a mandatory notification?
Is a method signature logically sound but practically unreachable?
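The first question in that list is easy to picture in code. The sketch below shows the dead-end and one way out of it. It is only an illustration of the bug class, not output from the Auditor: `LoadState`, `loadWithDeadEnd`, and `loadWithReset` are hypothetical names, and the example is kept synchronous for brevity.

```typescript
interface LoadState {
  isLoading: boolean;
  error: string | null;
}

// Dead-end: the flag is cleared only on the success path.
function loadWithDeadEnd(state: LoadState, fetchData: () => void): void {
  state.isLoading = true;
  try {
    fetchData();
    state.isLoading = false;
  } catch (err) {
    state.error = String(err); // isLoading is still true here
  }
}

// Repaired: the finally block clears the flag on both paths.
function loadWithReset(state: LoadState, fetchData: () => void): void {
  state.isLoading = true;
  state.error = null;
  try {
    fetchData();
  } catch (err) {
    state.error = String(err);
  } finally {
    state.isLoading = false;
  }
}
```

Both versions compile and both handle the happy path, which is why this kind of defect tends to slip past syntax-level checks.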

The audit process happens in two steps: first generating the scenarios, then running the simulation.

# 1. Create the logical tests from the spec
npx ts-node tools/vscode-isl/src/isl-logic-test.ts example/design-pomodoro --gemini --v=31p

# 2. Run the robotic auditor to verify the implementation
npx ts-node tools/vscode-isl/src/isl-logic-test-run.ts example/design-pomodoro --gemini --v=31p

By running this verification automatically, we catch architectural bugs before a human developer even opens a Pull Request.

The Full Pipeline

.isl.md files
      │
      ▼
┌─────────────┐
│   BUILDER   │  DAG + topological sort
└─────────────┘
      │
      ├──► xxx.build.md   (ISL + inlined .ref.md of dependencies)
      ├──► xxx.ref.md     (public interface, for downstream specs)
      └──► manifest.json  (ordered build index + hashes)
      │
      ▼
┌─────────────┐
│  GENERATOR  │  processes in topological order
└─────────────┘
      │
      Input per component:
      ├── xxx.build.md
      └── dep.sign.ts  (real signatures of all dependencies)
      │
      Output per component:
      ├──► xxx.js        (generated implementation)
      └──► xxx.sign.ts   (real signature, for downstream generators)

Every step is deterministic. Every component is compiled against real, already-generated interfaces. The output is not "what the AI felt like generating today." It's the inevitable result of compiling a precise specification against a verified dependency graph.

Closing the Loop

Across these three articles, we've built the case for a fundamentally different way of working with AI in software development.

Not prompt engineering. Spec engineering.

The shift is conceptually simple but practically significant: you stop treating the LLM as a creative collaborator you negotiate with, and start treating it as a compiler you feed precise input. The quality of the output stops depending on how well you phrase your request and starts depending on how well you define your intent.

ISL is the language for that intent. The Tooling Engine is the compiler. The generated code is the artifact. And the spec — always in sync with the code, by construction — is the end of undocumented legacy software.

The spec is open source — read it, use it, contribute to it:
github.com/fra00/isl-specification

Part 2 of 3 — Engineering Intent Series - Engineering Intent: The Anatomy of ISL

Francesco Marconi — Wed, 01 Apr 2026 01:08:46 +0000

In the first part of this series, we identified the "Ambiguity Tax" — the inherent cost of inconsistent code produced by treating LLMs as creative assistants rather than deterministic engines. Today, we dive into the anatomy of the solution: ISL (Intent Specification Language).

Beyond Prompting: The Contract Layer

Standard prompt engineering relies on "hope." You hope the model interprets your bullet points correctly. ISL shifts this paradigm by introducing a formal contract layer. It's less like talking to a developer and more like writing an IKEA manual for software.

An IKEA manual doesn't describe the molecular structure of the wood. It tells you:

What parts you must have (Domain)
What actions to perform in which order (Capabilities & Flows)
How to verify the result (Acceptance Criteria)

Regardless of who reads the manual, the finished table looks exactly the same. The goal isn't byte-for-byte identical code — LLMs are still probabilistic — but reliable semantic consistency. Same spec, same behavior, every time.

The Three Pillars of an ISL Component

Every ISL specification is built on three normative pillars, marked by semantic anchors that guide the LLM's reasoning:

1. ⚡ Capabilities: The "What"

Capabilities define observable behavior. Each capability includes a Contract, a Trigger, and a Flow.

The Flow describes logical state transitions in natural language — not algorithms. Instead of writing a loop, you specify: "Calculate the final price including the regional tax rate." This allows the LLM to generate idiomatic code for the target language — TypeScript, Go, Python — while strictly adhering to the business rule.

2. 🚨 Constraints: The Non-Negotiables

ISL uses RFC 2119 keywords (MUST, SHOULD, MAY) to establish a hierarchy of rules:

MUST NOT store passwords in plaintext — a hard stop for the generator, regardless of language.
SHOULD use async/await for I/O operations — architectural guidance, not a mandate.

These keywords explicitly restrict the model's "creativity" in areas where consistency is critical.

3. ✅ Acceptance Criteria: The Ground Truth

Acceptance criteria are testable outcomes. If a behavior cannot be verified by an acceptance criterion, it isn't fully specified — it's hoped for. These criteria are the primary reference the Auditor uses to validate generated code (more on that in Part 3).

Constraints vs. Implementation Hints

One of the most powerful distinctions in ISL is between a mandatory Constraint (🚨) and a non-binding Implementation Hint (💡).

Constraint (🚨) — if violated, the system is broken or insecure.
Example: "Passwords MUST be hashed before storage."
Hint (💡) — a suggestion the LLM can ignore if it finds a more idiomatic approach that still satisfies the contract.
Example: "Consider using bcrypt for hashing."

This distinction is what gives ISL the "senior developer" quality. If the LLM decides that Argon2 is a better fit for the target environment, it can use it — as long as the mandatory constraint (hashing) is satisfied. We are constraining the outcome, not micro-managing the implementation.

Role Separation: Preventing Logic Leaks

The most common source of technical debt in AI-assisted development is "logic leaking" — business rules buried in UI components, database details polluting service layers.

ISL enforces separation of concerns through Roles:

Role: Domain — pure data structures and enums. No logic allowed.
Role: Business Logic — pure state management and calculations. No UI or DB knowledge.
Role: Presentation — maps user input to logic triggers and renders state. Nothing else.

By declaring a Role at the top of each spec file, we provide the LLM with a restricted mental model. A Presentation component is structurally prevented from making database calls because its grammar doesn't allow it. Architectural drift is eliminated before a single line of code is generated.

The Boundary Rule (Rule 4): Intent, Not Implementation

The most important principle of ISL is Rule 4: Intent, not Implementation. When describing a flow, avoid low-level control flow and syntax entirely.

❌ Invalid — this is pseudocode, not a spec:

FOR (i = 0; i < list.length; i++) {
  if (list[i].active) count++
}

This locks the LLM into a specific implementation. It can't generate idiomatic Kotlin, Python, or Go — it just transcribes your pseudocode.

✅ Valid — this is intent:

**Flow**:

1. Count all active items in the list

🚨 Constraint:

- MUST include only items where status = active
- MUST NOT modify the original list

✅ Acceptance Criteria:

- Returns correct count for a list with mixed active/inactive items
- Returns 0 for an empty list
- Returns 0 when no items are active

Same logic. Zero implementation details. The LLM generates idiomatic code in any target language — and you can verify it against the acceptance criteria automatically.
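For comparison, here is one idiomatic TypeScript result a generator could plausibly produce from that spec. It is a sketch, not actual ISL output: the `Item` type and the `countActive` name are assumptions.

```typescript
// Hypothetical item shape; the spec only requires a status field.
interface Item {
  status: "active" | "inactive";
}

// Counts active items without touching the input (readonly enforces the
// "MUST NOT modify the original list" constraint at the type level).
function countActive(items: readonly Item[]): number {
  return items.filter((item) => item.status === "active").length;
}
```

Each acceptance criterion maps to a one-line check: a mixed list, an empty list, and a list with no active items.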

Putting It Together: A Real ISL Component

Here's what all three pillars look like in a single, complete spec:

## Component: UserAuthService

### Role: Business Logic

### ⚡ Capabilities

#### authenticateUser

**Contract**: Authenticate user credentials and return a session token

**Trigger**: Called by LoginForm on submit

**Flow**:

1. Validate credential format
2. Verify credentials against stored record
3. IF valid → generate session token
   IF invalid → return structured error

**💡 Implementation Hint**:
Consider bcrypt (cost factor 12) for password verification

**🚨 Constraints**:

- Passwords MUST NOT be compared in plaintext
- Tokens MUST expire after 24 hours
- MUST NOT log passwords in any form
- Response time MUST be < 200ms (p95)

**✅ Acceptance Criteria**:

- [ ] Valid credentials return a token
- [ ] Invalid credentials return an authentication error
- [ ] Expired token returns 401, not 403

**🧪 Test Scenarios**:

1. **Valid Login**: {email: "user@test.com", password: "Valid123!"} → token returned
2. **Wrong Password**: {email: "user@test.com", password: "wrong"} → authentication error
3. **Locked Account**: 5 failed attempts → {error: "Account locked", code: 423}

Notice what's not here: no JWT library choice, no bcrypt rounds hardcoded in the flow, no database query structure. The LLM decides the how. The spec owns the what.

Note: This article is a summary. For the full, formal documentation, visit the Complete ISL Specification.
Notably, the ISL specification itself can be used as a tool-as-prompt or skill for other LLMs, providing them with the necessary framework to reason about software behavior with high precision and consistency.

What's Next

In Part 3, we go inside the engine that transforms this spec into production code — Builder, Compiler, cryptographic signatures, and the Auditor that verifies the result.

Part 3: The Tooling Engine →

Part 1 of 3 — Engineering Intent Series - Stop Prompting, Start Compiling: The Path to Predictable AI-Generated Code

Francesco Marconi — Wed, 01 Apr 2026 01:07:49 +0000

If you use LLMs to generate code, you are likely dealing with a "Slot Machine" workflow. You pull the lever with a prompt, get a great result, and then — two days later, on a different model, or with a different colleague — the same request produces something completely different.

Different patterns, different variable names, different bugs. In software engineering, this inconsistency has a name: The Ambiguity Tax.

The Root Cause: Conflating Intent with Implementation

The problem isn't that the AI is hallucinating. The problem is that natural language is inherently ambiguous.

Take this common request: "Implement a user profile page with validation."

To a human — or an AI — this leaves a dozen critical questions unanswered:

Is validation client-side, server-side, or both?
Does "profile" include avatar uploads or just text fields?
What happens to the UI while the save is in progress?
Is state managed locally or via a global store?
What does a validation error look like — inline, toast, modal?

When we leave these decisions to the LLM, we aren't engineering. We're gambling.

Prompting is Not Methodology

The industry is obsessed with "Prompt Engineering." But let's be honest: adding "you are a world-class developer" to your request is a superstition, not a methodology.

If your build process depends on the mood of a probabilistic model, you haven't built a pipeline. You've built a slot machine.

The root cause is that we tell the AI how to do things — through examples, tone, role-playing — instead of formally defining what must happen. We're negotiating with a language model when we should be compiling a specification.

The Solution: A Real Compiler for Intent

This is where ISL (Intent Specification Language) comes in — and it's important to say upfront: ISL is not a better prompt. It's a build system.

Here's what actually happens under the hood:

1. You write a formal spec, not a prompt. Each component gets its own .isl.md file that defines behavior, constraints, and acceptance criteria — not implementation details.

2. The Builder resolves your entire project graph. It scans all your spec files, identifies dependencies, and performs a topological sort. Instead of giving "the whole project" to the AI hoping it understands, the Builder provides exactly and only what is needed for that specific piece, in the correct order. It’s context surgery, not a generic text dump.

3. The Compiler generates code, not guesses. It passes each component's precisely scoped context to the LLM, which at this point is functioning as a compiler back-end: mapping a deterministic spec to idiomatic syntax in your target language. React, Vue, Python, Go: same spec, same behavior, different output.

4. Code is a read-only artifact. Every generated file is locked with a cryptographic signature. The code is no longer the developer's domain; it's a build artifact. If a dev modifies the code by hand, it breaks the build. This enforces a strict discipline: documentation (ISL) and code are always perfectly aligned. It’s the end of "undocumented legacy code."

5. The Auditor verifies behavior, not just syntax. It runs against the generated code and checks that state transitions match the spec — catching logical dead-ends like a loading flag that never resets before a human ever sees the code.

What a Spec Looks Like

Here's the difference between what you give an LLM today and what you give ISL:

Today:

Implement user authentication with JWT.

With ISL:

#### authenticateUser

**Contract**: Authenticate credentials and return a session token

🚨 Constraints:

- Passwords MUST NOT be stored or compared in plaintext
- Tokens MUST expire after 24 hours
- MUST NOT log passwords in any form
- Response time MUST be < 200ms (p95)

✅ Acceptance Criteria:

- Valid credentials → token returned
- Invalid credentials → authentication error
- Expired token → 401, not 403

The LLM doesn't decide what "secure" means. You do. The LLM compiles it.

What's Next

The next two articles go deep on how this works in practice:

Part 2: The IKEA Manual for Software — The anatomy of an ISL spec: contracts, constraints, acceptance criteria, and why the distinction between a Constraint and an Implementation Hint changes everything.
Part 3: The Tooling Engine — Builder, Compiler, signatures, and Auditor in detail — with a real end-to-end example.

The spec is open source: github.com/fra00/isl-specification

From Web to Desktop: Building CodeForge Portable with WebView2

Francesco Marconi — Mon, 12 Jan 2026 20:07:53 +0000

TL;DR

Built a 2MB desktop version of my web-based AI coding assistant using WPF + WebView2. Result: Native filesystem access, multi-language compilation (C#, C++, Arduino), and <1s startup time. No Electron bloat required.

The Problem: Browser Limitations

I built CodeForge, a web-based AI coding assistant that helps developers write and edit code. The web version works well for editing and AI-powered code generation, but it has a fundamental limitation.

The core issue isn't storage or UI - it's execution.

From a browser, I cannot:

Call native executables (dotnet CLI, cmake, arduino-cli)
Access the real filesystem (only virtualized storage)
Execute system processes (compile and run programs)

I explored alternatives:

Local API server: Requires users to run a separate backend process
Chrome extensions: Limited API access, still can't spawn arbitrary processes
WebAssembly toolchains: Incomplete, massive bundle sizes (50+ MB just for compilers)

The reality: To compile C# projects with dotnet build, C++ with cmake, or upload Arduino sketches with arduino-cli, I needed native system access.

A desktop application wasn't optional - it was necessary.

The Solution: WebView2 + WPF

Instead of rewriting the entire application, I took a hybrid approach:

Keep the web UI, add a native wrapper.

Why WebView2, Not Electron?

I considered Electron (the standard choice), but the numbers didn't make sense:

Metric	Electron	WebView2
Bundle size	150+ MB	2-5 MB
Startup time	3-5 seconds	<1 second
Memory usage	200+ MB	60-80 MB
Platform	Cross-platform	Windows only

For a Windows-focused tool, WebView2 is objectively superior:

Chromium engine (same as Electron) but uses Windows' built-in WebView2 runtime
100% code reuse - my React UI runs unchanged
Tiny footprint - just WPF shell + business logic

The trade-off? Windows-only. But for my target audience (developers with Windows dev machines), that's acceptable.

Implementation: The PostMessage Bridge

The magic happens through WebView2's PostMessage API - a simple, bidirectional communication channel between JavaScript and C#.

JavaScript → C# (Sending Commands)

From the web UI, sending a compilation request:

javascript

// React component
const compileProject = async (projectFiles, environment) => {
  const zip = await createZipFromFiles(projectFiles);
  const base64Payload = btoa(zip);

  const message = {
    action: 'build',
    environment, // 'csharp', 'cpp', 'arduino', or 'esp32'
    payload: base64Payload,
    arduinoConfig: environment === 'arduino' ? {
      board: 'arduino:avr:uno',
      port: 'COM3'
    } : null
  };

  window.chrome.webview.postMessage(message);
};

C# → JavaScript (Receiving Results)

In the WPF application, handling the message:

csharp

// MainWindow.xaml.cs
private async void CoreWebView2_WebMessageReceived(
    object sender, 
    CoreWebView2WebMessageReceivedEventArgs e)
{
    string jsonData = e.WebMessageAsJson;
    var payload = JsonConvert.DeserializeObject<WebMessagePayload>(jsonData);

    if (payload != null)
    {
        ProjectProcessor processor = new();
        var result = await processor.ProcessPayload(payload);

        // Send result back to JavaScript
        SendToJavaScript(new { 
            success = true,
            output = result 
        });
    }
}

private void SendToJavaScript(object data)
{
    var json = JsonConvert.SerializeObject(data);
    webView.CoreWebView2.PostWebMessageAsJson(json);
}

That's it. No complex IPC, no gRPC, no WebSockets. Just JSON over PostMessage.

The Payload Structure

csharp

// Attributes come from Newtonsoft.Json, matching the JsonConvert calls above
public class WebMessagePayload
{
    [JsonProperty("action")]
    public string Action { get; set; } // "build", "run", "build_and_run"

    [JsonProperty("environment")]
    public string Environment { get; set; } // "csharp", "cpp", "arduino"

    [JsonProperty("payload")]
    public string Payload { get; set; } // Base64-encoded ZIP

    [JsonProperty("arduinoConfig")]
    public ArduinoConfig ArduinoConfig { get; set; } // Optional
}

The workflow:

Decode Base64 → byte array
Extract ZIP to temporary directory
Invoke appropriate compiler
Return output
Clean up temp directory

Architecture: Strategy Pattern for Multi-Environment Support

Supporting multiple compilation environments (C#, C++, Arduino, ESP32) could have been a mess of if/else blocks. Instead, I used the Strategy Pattern.

The Interface

csharp

public interface IProjectBuilder
{
    Task<string> Build(string projectPath);
    Task Run(string projectPath);
}

Every environment implements this contract, encapsulating its specific compilation logic.

Example: Arduino Builder

csharp

public class ArduinoProjectBuilder : IProjectBuilder
{
    private readonly ArduinoConfig _config;

    public ArduinoProjectBuilder(ArduinoConfig config)
    {
        _config = config;
    }

    public async Task<string> Build(string projectPath)
    {
        string sketchDir = FindSketchDirectory(projectPath);
        string args = $"compile --fqbn {_config.Board} \"{sketchDir}\"";

        return await ShellExecutor.ExecuteCommandAsync(
            "arduino-cli", 
            args, 
            projectPath
        );
    }

    public async Task Run(string projectPath)
    {
        string sketchDir = FindSketchDirectory(projectPath);
        string args = $"upload -p \"{_config.Port}\" --fqbn {_config.Board} \"{sketchDir}\"";

        // Execute in visible shell so user sees upload progress
        await ShellExecutor.ExecuteVisibleCommandAsync(
            "arduino-cli", 
            args, 
            projectPath
        );
    }
}

The Factory

csharp

public static class ProjectBuilderFactory
{
    public static IProjectBuilder GetBuilder(string environment, ArduinoConfig config = null)
    {
        return environment.ToLowerInvariant() switch
        {
            "csharp" => new CSharpProjectBuilder(),
            "cpp" => new CppProjectBuilder(),
            "arduino" => new ArduinoProjectBuilder(config),
            "esp32" => new ArduinoProjectBuilder(config), // Same builder, different config
            _ => throw new ArgumentException($"Unsupported environment: {environment}")
        };
    }
}

Why Strategy Over If/Else?

Bad approach (what I avoided):

csharp

public async Task ProcessPayload(WebMessagePayload payload)
{
    if (payload.Environment == "csharp") {
        // 50 lines of C# compilation logic
    }
    else if (payload.Environment == "cpp") {
        // 60 lines of C++ compilation logic
    }
    else if (payload.Environment == "arduino") {
        // 70 lines of Arduino logic
    }
    // This gets unmaintainable fast
}

Good approach (Strategy pattern):

csharp

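public async Task<string> ProcessPayload(WebMessagePayload payload)
{
    // projectPath: the temporary directory where the ZIP payload was extracted
    var builder = ProjectBuilderFactory.GetBuilder(
        payload.Environment, 
        payload.ArduinoConfig
    );

    var buildOutput = await builder.Build(projectPath);

    if (payload.Action.Contains("run"))
        await builder.Run(projectPath);

    // Returned to the message handler, which forwards it to JavaScript
    return buildOutput;
}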

Adding Python support? Create PythonProjectBuilder and add one line to the factory. The existing builders stay untouched.

Real-World Impact

Before (Web Version)

❌ Compile/run: Impossible (browser sandbox)
❌ Arduino upload: Impossible
❌ Filesystem: IndexedDB only
✅ Install: 0 MB (hosted)
✅ Portability: Any browser

After (Desktop Version)

✅ Compile/run: Native (dotnet, cmake, arduino-cli)
✅ Arduino upload: Working
✅ Filesystem: Native OS access
✅ Install: 2-5 MB
❌ Portability: Windows only

Performance Benchmarks

Tested on: Windows 11, Ryzen 5 5600X, 16GB RAM

Metric	Electron (estimate)	CodeForge Portable
Cold startup	3-5s	0.8s
Memory (idle)	200+ MB	65 MB
Memory (compiling)	250+ MB	120 MB
Installer size	150+ MB	4.2 MB

Dual-Mode Execution

One subtle but important feature: dual-mode shell execution.

Build phase (headless):

csharp

// Capture output, don't show window
await ShellExecutor.ExecuteCommandAsync("dotnet", "build", projectPath);

Run/Upload phase (visible):

csharp

// Show terminal window so user sees progress
await ShellExecutor.ExecuteVisibleCommandAsync("arduino-cli", "upload ...", projectPath);

This is critical for Arduino uploads - users need to see the upload progress and any connection errors in real-time.

A Meta Moment

Here’s something interesting: the entire C# backend was generated using CodeForge itself.

I used the web version powered by Gemini 2.5 Flash, provided explicit architectural constraints (Strategy pattern, multi-environment support, async execution, resource cleanup), and CodeForge produced all 7 C# files in 28 structured operations.

This wasn’t scaffolding.

This wasn’t boilerplate.

It resulted in a production-ready WPF backend with clear separation of concerns, design patterns, error handling, and deterministic runtime behavior.

The LLM was used strictly at design-time to generate the code. Once compiled, the application runs as a fully deterministic system with no AI involved at runtime.

This wasn’t accidental — CodeForge uses a custom DSL embedded in natural language to guide the LLM through complex, multi-step engineering tasks.

(The DSL and orchestration model deserve a dedicated article of their own.)

Lessons Learned

1. WebView2 Is Production-Ready

I was initially hesitant about WebView2 (relatively new, Windows-only). After building this, I'm convinced: for Windows-focused tools, WebView2 > Electron.

The performance difference is real, not theoretical. Sub-second startup and 60MB memory footprint vs Electron's bloat is night and day.

2. PostMessage Is Simple and Robust

I expected to need complex IPC mechanisms. PostMessage proved sufficient:

JSON serialization handles 99% of use cases
Async by default (no blocking)
Error handling is straightforward
Debugging is easy (just log the JSON)

3. Strategy Pattern Scales

As I added C++, then Arduino, then ESP32 support, the architecture never creaked. Each new environment was:

Create new XxxProjectBuilder class
Add one line to factory
Done

No refactoring, no touching existing code. This is what good architecture feels like.

4. You Don't Always Need Electron

The web development community defaults to Electron for "desktop apps from web code". But if you're targeting a single platform, evaluate native options:

Windows: WebView2 (this project)
macOS: WKWebView (Swift)
Linux: WebKitGTK

The performance gains are worth it.

Conclusion

CodeForge Portable demonstrates that you can have both portability (web version) and power (desktop version) without maintaining two codebases.

Key takeaways:

WebView2 enables 100% code reuse with native capabilities
PostMessage is sufficient for most UI ↔ Backend communication
Strategy pattern makes multi-environment support maintainable
A lightweight native host unlocks real hardware integration:
- Native filesystem access
- Tool execution (dotnet, cmake, arduino-cli)
- Serial port access (Arduino / ESP32 monitoring)

Try it yourself:

Web version: https://llm-codeforge.netlify.app/
Desktop (this article): GitHub - fra00/CodeForge-Ai-Portable

The desktop version adds native compilation for C#, C++, and Arduino/ESP32 boards. If you're building developer tools, consider the hybrid approach - it might be the best of both worlds.

Questions? Feedback? Drop a comment below or open an issue on GitHub. I'm particularly interested in hearing from developers who've tried similar architectures.

Building an AI-Powered Code Editor: Browser Test Runner component

Francesco Marconi — Tue, 06 Jan 2026 17:16:18 +0000

Browser Test Runner: Run Unit Tests Directly in the Browser

I've just released Browser Test Runner, an open-source React component suite that lets you write, compile, and run unit tests directly in the browser—no complex server-side build steps required.

🔗 Try it Live: https://browser-test-runner.netlify.app/

💡 The Idea

This project was born from a real need: I developed it as part of LLM CodeForge, an agentic code editor that generates tests directly in the browser.
The goal was to create a "Zero Config" testing environment: something that could be integrated directly into a web component while keeping the familiar Vitest syntax, which in my experience is simple for an LLM to write.

🛠 Under the Hood (Tech Stack)

This project isn't just a graphical wrapper—it's a real client-side execution engine:

Virtual File System: Manages .jsx and .test.jsx files in memory.

On-the-fly Transpilation: Uses Babel Standalone and Acorn to parse and transform JSX code and ES6 imports in real-time.

Sandboxing: Executes tests in an isolated iframe to ensure security and clean global scope.

Dynamic Imports: Loads React and Testing Library dynamically via CDN (esm.sh), keeping the runner lightweight.

🎯 Use Cases

Interactive Documentation: Show your users not just how to render a component, but how to test it.

Education: Perfect for teaching TDD (Test Driven Development) without the overhead of npm installation.

Rapid Prototyping: Write, test, and iterate in seconds.

💻 Integration Example

It's easily integrable into any existing React app:

import TestComponent from "./testing/TestComponent";

const files = {
  "/Button.jsx": { content: "...", isFolder: false },
  "/Button.test.jsx": { content: "...", isFolder: false }
};

<TestComponent 
  files={files} 
  testFilePaths={["/Button.test.jsx"]} 
/>

The project is open source! I'd be happy to receive feedback, stars ⭐, or pull requests.

👉 GitHub Repo: https://github.com/fra00/browser-test-runner

🚀 What's Next?

I'm planning to add support for more testing frameworks, improve error reporting, and enhance the code editor experience. If you have suggestions or feature requests, feel free to open an issue on GitHub!

🤝 Contributing

Contributions are welcome! Whether it's bug fixes, new features, or documentation improvements, check out the repository and join the project.

Building an AI-Powered Code Editor: (part 2) LLM like interpreter

Francesco Marconi — Mon, 29 Dec 2025 22:42:51 +0000

It’s Not a Prompt. It’s a Procedural DSL in Natural Language

While building LLM CodeForge, an agentic editor that allows LLMs to read, modify, and test code autonomously (for more info, read "LLM Codeforge Ai first editor"; sorry, it has been tested only on Gemini), I realized something after 5000 tokens of instructions:

I wasn't writing a prompt. I was building a Domain-Specific Language embedded in natural language.

This article analyzes how and why this distinction is fundamental—and what you can learn for your own agentic systems.

The LLM Doesn't Decide, It Executes

Initially, I thought I was "instructing" an LLM on how to behave. But watching the system work, I realized I was doing something different:

I was forcing the model to impersonate an interpreter.

The model in CodeForge:

Does not decide what to do
Decides only which branch of the protocol to follow
Does not solve problems creatively
Executes a procedure described in natural language

This is very close to:

A bytecode interpreter
A text-driven Finite State Machine
A planner with closed actions

And it works because I accepted that the LLM is fundamentally unreliable.

Anatomy of an Embedded DSL

1. Control Flow: Decision Protocol

Here is what the DSL's "control flow" looks like:

Every request follows 4 steps: 
[UNDERSTAND] → [GATHER] → [EXECUTE] → [RESPOND]

### Step 1: UNDERSTAND
Classify request type:

| Type         | Keywords              | Next Step           |
|--------------|-----------------------|---------------------|
| Explanation  | "what is", "explain"  | [RESPOND] text      |
| Modification | "add", "change"       | [GATHER] → [EXECUTE]|
| Analysis     | "analyze", "show"     | [GATHER] → [RESPOND]|

This is not Chain-of-Thought in the classic sense. It is deterministic task routing—a decision table mapping input → workflow.

The model doesn't "think", it executes a conditional jump.
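The same table can be mirrored in ordinary code, which is useful for checking that the model took the branch the protocol expects. This is only a sketch, assuming plain keyword matching is good enough; `ROUTES`, `routeRequest`, and the "Unknown" fallback are illustrative names of mine, not part of CodeForge.

```javascript
// Hypothetical mirror of the UNDERSTAND table: request type -> workflow steps.
const ROUTES = [
  { type: "Explanation", keywords: ["what is", "explain"], steps: ["RESPOND"] },
  { type: "Modification", keywords: ["add", "change"], steps: ["GATHER", "EXECUTE"] },
  { type: "Analysis", keywords: ["analyze", "show"], steps: ["GATHER", "RESPOND"] },
];

// Returns the first route whose keyword appears in the request text.
// The fallback route is an assumption: unknown requests get a plain text reply.
function routeRequest(userText) {
  const text = userText.toLowerCase();
  const match = ROUTES.find((route) => route.keywords.some((k) => text.includes(k)));
  return match ?? { type: "Unknown", keywords: [], steps: ["RESPOND"] };
}
```

Rows are tried in order, so a request that matches several rows takes the first one, exactly like reading the table top to bottom.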

2. Invariants: Read-Before-Write Policy

The DSL defines invariants that must be maintained:

🚨 CRITICAL RULE: You CANNOT use `update_file` 
on a file you haven't read in this conversation.

Self-check before ANY update_file:
- [ ] Did I receive the content from system?
- [ ] Do I know exact current state?
- [ ] Am I modifying based on actual code?

If ANY answer is NO → OUTPUT read_file ACTION, STOP.

This is an attempt to define pre-conditions in natural language. It’s like writing:

def update_file(path, content):
    assert path in conversation_state.read_files
    # ... actual update

But without a type system or automatic runtime enforcement.

Important note: This rule significantly reduces the probability of the LLM modifying a file without having read its content first, but it does not guarantee 100% enforcement. It is a constraint expressed in natural language, therefore subject to the probabilistic interpretation of the model. In tests, I observed high stability (~85-90%), but server-side validation remains fundamental for critical cases.
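For those critical cases, the same invariant can be enforced deterministically by the host application. The sketch below is one possible shape, not the CodeForge implementation: `ReadBeforeWriteGuard`, the `reason` field, and the exact form of the `read_file` action are assumptions.

```javascript
// Hypothetical guard: the application, not the model, remembers which files were read.
class ReadBeforeWriteGuard {
  constructor() {
    this.readFiles = new Set();
  }

  // Call this whenever the system hands a file's content to the model.
  markRead(path) {
    this.readFiles.add(path);
  }

  // Returns the action that should actually run.
  // An update_file on an unread file is swapped for a read_file on the same path.
  check(action) {
    if (action.action === "update_file" && !this.readFiles.has(action.file.path)) {
      return { action: "read_file", file: { path: action.file.path }, reason: "read-before-write" };
    }
    return action;
  }
}
```

With a guard like this, the natural-language rule lowers how often the violation happens, and the code decides what happens when it does.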

3. State Management: Injection + Hard Next Action

The most effective technique I implemented is dynamically regenerating the prompt to force the LLM to follow a multi-step plan.

Concrete scenario: The user asks "Add authentication to the project".

Step 1: The LLM analyzes and generates a plan:

{
  "plan": "I will modify these files in order:",
  "files_to_modify": ["Auth.js", "Login.jsx", "App.jsx"]
}

Step 2: The LLM starts with the first file (Auth.js) and completes it.

Step 3 - HERE IS THE TRICK: Instead of asking the LLM "remember the plan, do the next file", I completely regenerate the prompt adding this section:

### ⚠️ MULTI-FILE TASK IN PROGRESS

You completed: Auth.js
Remaining files: Login.jsx, App.jsx

### 🚨 REQUIRED ACTION
Your next output MUST be:
{"action":"continue_multi_file","next_file":{"path":"Login.jsx"}}

Do NOT do anything else. Do NOT deviate from the plan.

Result: The LLM doesn't have to "remember" anything. It cannot "forget" the plan. It cannot decide to do something else. The prompt itself contains the only possible action.

Why it is powerful:

The state (which files I've done, which are missing) lives in the JavaScript code, not in the LLM's "memory"
At every step, I regenerate the prompt with the updated state
The LLM always sees "you are here, do this" — zero ambiguity

In practice:

// In the code
function buildPrompt(multiFileState) {
  let prompt = BASE_PROMPT;

  if (multiFileState) {
    prompt += `
    ### TASK IN PROGRESS
    Completed: ${multiFileState.completed.join(', ')}
    Next: ${multiFileState.next}
    Your ONLY valid action: continue_multi_file with ${multiFileState.next}
    `;
  }

  return prompt;
}

This is state injection: external state (JavaScript) completely controls what the LLM can do at the next step.
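The other half of the technique is the transition that updates that state after each file. A minimal sketch, assuming the state also carries a `remaining` list (the snippet above only reads `completed` and `next`); `advanceMultiFileState` is a hypothetical helper name.

```javascript
// Hypothetical transition, kept in application code rather than in the model's context.
// Returns the next state, or null once every file in the plan is done.
function advanceMultiFileState(state, finishedPath) {
  const completed = [...state.completed, finishedPath];
  const remaining = state.remaining.filter((path) => path !== finishedPath);
  if (remaining.length === 0) {
    return null; // plan finished: no "task in progress" section gets injected
  }
  return { completed, remaining, next: remaining[0] };
}
```

Returning `null` at the end means a prompt builder that tests `if (multiFileState)` falls back to the base prompt without any extra bookkeeping.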

4. Simulated Type System: Structured Output

To handle different "types" (JSON + code + metadata), the DSL uses custom delimiters:

#[json-data]
{"action":"create_file","file":{"path":"App.jsx"}}
#[end-json-data]
#[file-message]
This file implements the main app component.
#[end-file-message]
#[content-file]
export default function App() {
  return <div>Hello World</div>;
}
#[end-content-file]

Why not standard JSON or XML? Because the content might contain {} or <> — it would require complex escaping.

The delimiters #[tag]...#[end-tag] are:

Syntactically unique (no conflicts with internal code)
Easy to parse (regex + split)
Independent of the embedded language

It is like defining a context-free grammar to separate semantic levels.
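To show what "regex + split" can look like in practice, here is a small parser sketch. It assumes sections are never nested and that each tag appears at most once per response; `parseSections` is an illustrative name, not the function used in CodeForge.

```javascript
// Extracts every #[tag]...#[end-tag] section from a model response.
// Sketch only: no nesting, one occurrence per tag, JSON.parse may throw on bad JSON.
function parseSections(response) {
  const sections = {};
  // (?!end-) stops a closing tag from being read as an opening one.
  const pattern = /#\[(?!end-)([a-z-]+)\]\r?\n?([\s\S]*?)\r?\n?#\[end-\1\]/g;
  for (const match of response.matchAll(pattern)) {
    const [, tag, body] = match;
    sections[tag] = tag === "json-data" ? JSON.parse(body) : body;
  }
  return sections;
}
```

Because the body is captured as raw text, braces and angle brackets inside `content-file` need no escaping at all.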

5. Error Handling: Anti-Pattern Documentation

The DSL includes "error examples" to guide the parser (the model):

**Common Errors:**

❌ {"action":"start_multi_file","plan":{},"first_file":{...}
✅ {"action":"start_multi_file","plan":{},"first_file":{...}}

❌ #[json-data]{...}#[file-message]...
✅ #[json-data]{...}#[end-json-data]#[file-message]...

This is inline error-correction training — I am teaching the model the common failure modes. It’s like unit tests embedded in the language documentation.

The Price: Structural Tensions

Building a DSL in natural language isn't free. Here are the trade-offs I accepted:

Tension #1: No Real Type System

I am creating a procedural DSL, but:

❌ It has no verifiable types
❌ It has no automatic syntactic validation
❌ It has no AST for transformations

Result:

✅ Huge validation checklist (8+ points)
✅ Semantic redundancy (same rules, 3+ formulations)
✅ Extensive anti-pattern documentation

This is inevitable when the parser is a probabilistic LLM instead of a deterministic compiler.

If I were to evolve CodeForge in the future, a true mini-DSL (JSON Schema + codegen) would reduce the prompt by 30-40%. But in the browser sandbox, this choice is justified.

Tension #2: Meta-Validation Only Works as a Multiplier

The pre-send checklist I implemented:

Before EVERY response, verify:
| # | Check | Fix If Failed |
|---|-------|---------------|
| 1 | JSON valid | Correct structure |
| 2 | Tags complete | Add missing #[end-*] |

Alone, it would have 40-60% reliability. In my system, it probably has 80-90%.

Why? Because it works as a stability multiplier when:

The model is already channeled (decision protocol)
The format is rigid (custom delimiters)
The next action is deterministic (state injection)

Meta-validation is not the main feature — it is the final safety net in an already constrained system.
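The two checklist rows shown above also have a cheap deterministic counterpart that can run on the host after the model replies. This is a sketch under my own assumptions (the function name `validateResponse` and the error strings are invented for illustration):

```javascript
// Hypothetical host-side version of the two checklist rows.
// Returns a list of human-readable problems; an empty list means the reply passed.
function validateResponse(response) {
  const errors = [];

  // Check 2: every opening #[tag] needs a matching #[end-tag].
  const opened = [...response.matchAll(/#\[(?!end-)([a-z-]+)\]/g)].map((m) => m[1]);
  for (const tag of opened) {
    if (!response.includes(`#[end-${tag}]`)) {
      errors.push(`Missing #[end-${tag}]`);
    }
  }

  // Check 1: the json-data section must exist and parse.
  const json = response.match(/#\[json-data\]([\s\S]*?)#\[end-json-data\]/);
  if (!json) {
    errors.push("Missing json-data section");
  } else {
    try {
      JSON.parse(json[1]);
    } catch {
      errors.push("Invalid JSON in json-data");
    }
  }
  return errors;
}
```

The error list can be sent straight back to the model as correction feedback, so the in-prompt checklist and the host check reinforce each other.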

Tension #3: High Cognitive Cost

5000 dense tokens with intersecting rules means:

✅ Works well with Claude 3.5, GPT-4
❌ Smaller models will fail
❌ Less aligned models will ignore sections

I am implicitly saying: this system requires "serious" models.

It is an architectural constraint I accepted — like saying "this library requires Python 3.10+".

Techniques That Work: Contextual Re-Anchoring

One thing I did (almost by instinct) is contextual re-anchoring.

Take the "read-before-write" rule:

It appears in the Decision Protocol (when planning)
It appears in Available Actions (when executing)
It appears in Pre-Send Validation (when verifying)
It appears in Golden Rules (as a general principle)

This is not random redundancy. It is strategic repetition in different contexts.

It is exactly how it is done in safety-critical systems:

Same invariant
Verified at multiple levels
With specific phrasing for the context

Replicable Patterns: How to Build an Agentic DSL

If you want to build a similar system, here are the patterns I extracted:

Pattern #1: External State > Internal Memory

// BAD: Relying on the model's "memory"
"Remember that you have already read these files..."

// GOOD: Injecting explicit state
prompt += `Files already read: ${readFiles.join(', ')}`

Pattern #2: Reduce Branching When Possible

// BAD: Giving open choices
"Decide which operation to perform"

// GOOD: Forcing the only legal move
"Your NEXT action MUST be: continue_multi_file"

Pattern #3: Decision Tables for Routing

| Input Pattern | Action | Next State |
|---------------|--------|------------|
| "add X"       | GATHER | EXECUTE    |
| "explain Y"   | RESPOND| END        |

Instead of "think what to do", use "if X then Y".

Pattern #4: Custom Delimiters for Content Nesting

When you have to embed arbitrary content:

Don't use JSON (escaping nightmare)
Don't use XML (conflicts with HTML/JSX)
Use unique tags: #[content]...#[end-content]

Pattern #5: Redundancy = Coverage, Not Noise

Repeat critical rules:

In different formulations (semantic reinforcement)
In different contexts (contextual re-anchoring)
With different rationales (why, not just what)

What I Learned: Engineering > Elegance

After 5000 tokens and months of iterations, the most important lesson:

This prompt is not "beautiful". It is effective.

I stopped looking for:

✗ The shortest possible prompt
✗ The most elegant formulation
✗ The most general abstraction

I started optimizing for:

✓ Robustness in edge cases
✓ Failure mode coverage
✓ Debugging clarity when it fails

The result:

Redundant? Yes.
Verbose? Absolutely.
Works? Consistently.

Future Directions: Where I Could Go

If I were to evolve CodeForge 2.0, I would explore:

Two-Agent Architecture

Instead of a single agent with 5000 tokens:

Planner Agent (2000 tokens): Decides strategy
Executor Agent (2000 tokens): Implements actions

Benefits:

Separation of concerns
Less context per agent
Parallel execution possible

Conclusions: State-of-the-Art for Coding Agents

Developing LLM CodeForge taught me that building reliable agentic systems means:

Accepting that LLMs are fundamentally unreliable, and designing around that fact.

The techniques that work:

⭐⭐⭐⭐⭐ State injection + forced next action
⭐⭐⭐⭐⭐ Decision tables for task routing
⭐⭐⭐⭐ Custom delimiters for structured output
⭐⭐⭐⭐ Contextual re-anchoring of invariants
⭐⭐⭐ Meta-validation as a safety net
⭐⭐ Visual hierarchy (useful but not critical)

The fundamental principle:

Don't ask the LLM to "understand" — force it to "execute".
Don't do prompt engineering — do protocol design in natural language.

When you build your next agentic system:

Treat it like a DSL, not a conversation
External state must constrain possible actions
Validate server or client side, always, without exceptions
Redundancy can be a feature, not a bug

LLMs are powerful tools, but they are probabilistic parsers, not deterministic compilers. Design accordingly.

Try CodeForge: https://llm-codeforge.netlify.app/

The project is open source — if you want to see the full prompt and the validation system implementation, you can find everything in the repository.

Questions for the community:

Have you ever built embedded DSLs in natural language?
What is the cost of "cognitive overhead" in your prompts?
Two-agent architecture vs single-agent: experiences?

Share in the comments — this is still largely unexplored territory.

Building an AI-Powered Code Editor: A Journey into Structured LLM Integration

Francesco Marconi — Thu, 11 Dec 2025 22:41:59 +0000

I've been working on an idea: creating an online editor that isn't just a chat window, but a true "pair programmer" integrated into the development environment. The goal is to facilitate development with a tool that understands project context, executes complex tasks, and most importantly, is reliable.

You can try it here: llm-codeforge (currently works only with Gemini)

Repository: CodeForge-AI

One of the main challenges I wanted to address is how to make AI interaction more structured and context-aware.

Here are some of the concepts I based my project on:

1. Context is King: the Virtual File System (VFS)

To make the AI "see" the project, the application manages an entire virtual file system in memory, persisted on IndexedDB. This allows the assistant to have a complete and up-to-date view of the folder structure and files - a fundamental prerequisite for any meaningful operation.

2. A "Contract" with the AI: JSON Schema Validation

Interaction with LLMs can be unpredictable. To mitigate this problem, I defined a very strict JSON schema that the AI must respect for each response. Every output is validated through AJV. If validation fails, the system sends automatic feedback to the AI, asking it to correct its response. This transforms the interaction from a hope into a contract, increasing reliability.

Beyond this, I divided responses with a multi-part pattern that isolates various parts of the message, avoiding complex JSON parsing issues.
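The validate-then-feedback loop described in this section can be sketched as follows. Both `callModel` and `validate` are injected stand-ins that I made up for the example; in a setup like the one described, `validate` would presumably wrap the AJV check and return a list of error messages. The retry limit and the feedback wording are assumptions too.

```javascript
// Sketch of a validate-and-retry loop. `callModel(prompt)` returns the raw reply text;
// `validate(obj)` returns an array of error strings (empty when the reply is valid).
async function askWithContract(callModel, validate, userPrompt, maxAttempts = 3) {
  let prompt = userPrompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await callModel(prompt);
    let parsed;
    let errors = null;
    try {
      parsed = JSON.parse(raw);
    } catch (err) {
      errors = [`Response is not valid JSON: ${err.message}`];
    }
    if (errors === null) {
      errors = validate(parsed);
    }
    if (errors.length === 0) {
      return parsed;
    }
    // Automatic feedback: tell the model what failed and ask for a corrected reply.
    prompt = `${userPrompt}\n\nYour previous response was rejected:\n- ${errors.join("\n- ")}\nReturn a corrected response.`;
  }
  throw new Error(`No valid response after ${maxAttempts} attempts`);
}
```

Capping the number of attempts keeps a stubborn model from looping forever, and the thrown error gives the UI something concrete to show.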

3. A Framework for Structured Prompts (2WHAV "Light")

To make creating complex prompts more efficient, I created a "light" version of the 2WHAV framework. This internal tool helps expand a simple user request into a detailed technical specification that the AI can follow. The idea is to provide the assistant with a clear action plan from the start, instead of a vague idea.

4. An Action Loop for Complex Tasks

The assistant doesn't just respond. It can execute a series of "tools" (like list_files, read_file) and actions (create_file, update_file). This happens within a loop that allows the AI to break down a complex problem into smaller steps, such as sequentially modifying multiple files to implement a new feature.
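A stripped-down version of such a loop might look like the sketch below. It uses a `Map` as an in-memory stand-in for the VFS (without the IndexedDB persistence), and the step shape `{ name, args }`, the `"done"` marker, and the `nextStep` callback that plays the role of the LLM are all assumptions made for illustration.

```javascript
// In-memory stand-in for the virtual file system.
const vfs = new Map([["/App.jsx", "export default function App() {}"]]);

// Tools return information to the model; actions change the VFS.
const handlers = {
  list_files: () => [...vfs.keys()],
  read_file: ({ path }) => vfs.get(path) ?? null,
  create_file: ({ path, content }) => { vfs.set(path, content); return "created"; },
  update_file: ({ path, content }) => { vfs.set(path, content); return "updated"; },
};

// `nextStep(history)` stands in for the LLM: given what happened so far, it picks the next step.
function runActionLoop(nextStep, maxSteps = 10) {
  const history = [];
  for (let i = 0; i < maxSteps; i++) {
    const step = nextStep(history);
    if (step.name === "done") break;
    const handler = handlers[step.name];
    const result = handler ? handler(step.args ?? {}) : `unknown step: ${step.name}`;
    history.push({ step, result });
  }
  return history;
}
```

The `maxSteps` cap is there so that a model that never emits a final step cannot keep the loop running indefinitely.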

5. User Experience at the Center

Technology is fascinating, but it must be useful. Every error that appears in the live preview console is clickable. One click and the error is immediately copied into the AI's input, ready to be analyzed. A small detail that reduces friction and makes the debugging process smoother, while auto-correction is a work in progress.

Conclusion

Building this application is a fascinating journey into software engineering applied to AI. The goal isn't to create a "magic box," but a tool robust and predictable enough to provide useful help and usable code.

The road is still long and challenges abound, but as a POC, I think it's a good result.

What do you think? What are the biggest challenges in integrating AI into our development workflows?

How the 2WHAV-Iterative Cycle Turns Uncertainty into Production-Ready Code

Francesco Marconi — Sun, 09 Nov 2025 01:38:11 +0000

Stop "chatting" with your LLM.

To achieve predictable results from a Large Language Model, it's necessary to stop communicating in prose and start providing executable specifications. This concept is the foundation of the 2WHAV framework, a structured way to transform a requirement into a "blueprint" that an LLM can execute, not just interpret.

But even the most detailed blueprint cannot capture the complexity of the real world on the first try. The secret to moving from a "good attempt" to "robust code" is not a longer prompt. It's an engineered feedback process.

Test the Process Yourself

Before analyzing the theory, here is a practical experiment to run on an advanced LLM. Copy and paste the following commands step-by-step.

Step 1: Load the framework context

Ask the LLM (tested on Gemini) to load the rules of the iterative process.

Load the README.md file from the URL: https://github.com/fra00/2WHAV-iterative

Step 2 (Optional): Test comprehension

To get better results, it's useful to verify if the LLM has understood the original 2WHAV.

To verify your understanding, can you explain the "mode definitions" of the original 2WHAV?

(The correct answer is "MINIMAL, STANDARD, FULL, Custom". If the answer is incorrect, you can have it load the original framework with load https://github.com/fra00/2WHAV)

Step 3: Start the iterative cycle

Request the first version of the application.

Now, apply the 2WHAV full mode-iterative framework to generate a "Pixel Art Pad" app in a single HTML/JS file. The app must display a 16x16 grid. When I click on a cell, it should turn black. Start by generating the 2WHAV v1 document for this task, and then the code.

Step 4: A Note on This Simple Example

The LLM will produce a functional app. You might rightly think that you could achieve the same result, and even add more features like a color picker, with a simple chat. And you would be correct.

This example is intentionally simple to illustrate the 2WHAV methodology, not to prove its necessity for trivial tasks.

The true value of this framework emerges in complex, long-term projects. In those scenarios, a "chat-and-fix" approach often leads to:

Silent Regressions: New features breaking old ones.
Lack of Traceability: No clear, documented specification that reflects the current state of the code.
An Unreliable "Source of Truth": The code becomes a patchwork of fixes instead of the result of a coherent design.

The 2WHAV iterative cycle is designed to solve these scaling problems by treating the blueprint as the master specification, ensuring a more robust and maintainable development process.

The final code can be tested in an online editor like JSFiddle.

The Common Workflow (and Why It Fails)

The process you just tested is the opposite of the typical interaction workflow with an LLM.

Flow 1: The Chaos of "Chat-and-Fix"

graph TD
    A[Vague Request in Prose] --> B{LLM Generates Code v1};
    B --> C{User's Manual Test};
    C --> D{Bugs/Issues Discovered};
    D --> E["Quick Fix: 'add a color picker'"];
    E --> F{LLM Generates Code v2 - a patch};
    F --> G{Manual Test};
    G --> H{New Bug Introduced? - Regression};
    H --> J{...endless cycle?};

This approach is fragile, untraceable, and does not improve the underlying knowledge base.

The Solution: The Engineered Evaluation Cycle

The alternative is a systematic feedback loop where the code isn't corrected, but rather the blueprint (the 2WHAV document) is improved.

Flow 2: The 2WHAV Iterative Process

graph TD
    subgraph Iteration N
        A[Blueprint 2WHAV vN] --> B{LLM Executes & Generates Code vN};
        B --> C[Systematic Evaluation];
        C --> D{Score < 10/10?};
    end

    D -- Yes --> E[Identify Issues: 🔴 Blocker, 🟡 Major];
    E --> F[Root Cause Analysis: Why was the 2WHAV incomplete?];
    F --> G[Improve the Blueprint --> 2WHAV v(N+1)];
    G --> A;

    D -- No --> H[🎉 Production-Ready Code];

This is not a dialogue; it's an algorithm. Let's analyze the phases using the Pixel Art Pad example.

We start with a 2WHAV v1 for the "Pixel Art Pad":

WHAT: A 16x16 grid; a click colors the cell black.
WHERE: A single HTML file with inline JS and CSS.
HOW: Dynamically generate the grid with JavaScript.
VERIFY: The grid must be visible; the click must work.

The LLM executes this blueprint and produces the v1 code.

The result is evaluated with a weighted metric. The v1 code scores 5/10. It is correct according to the specification but useless as a creative tool (🟡 MAJOR).

The question is not "How do I add a color picker?". The question is:

"What ambiguity or omission in the 2WHAV v1 led to such a limited app?"

The answer is almost always in the specification: the WHAT was too simplistic.

Based on this analysis, the blueprint is updated to create a 2WHAV v2:

2WHAV v2 - Changes: Adds requirements for a color picker and an "eraser" function to WHAT and VERIFY.

The 2WHAV document becomes more robust. This is accumulated knowledge.

The LLM is provided with the 2WHAV v2. By executing the new directives, the LLM produces a better Code v2. The process is repeated until a score of 10/10 is reached.
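Since the cycle is described as an algorithm, it can be written down as one. The sketch below is only an illustration: the criteria names, the equal weights, the iteration cap, and the `generate` / `evaluate` / `improve` callbacks are assumptions of mine, and in the real process the evaluation and the blueprint revision are done with a human in the loop.

```javascript
// Weighted score on a 0-10 scale. Criteria and weights are illustrative only.
function weightedScore(ratings, weights) {
  let total = 0;
  let weightSum = 0;
  for (const [criterion, weight] of Object.entries(weights)) {
    total += (ratings[criterion] ?? 0) * weight;
    weightSum += weight;
  }
  return total / weightSum;
}

// One pass per iteration: generate code from the blueprint, evaluate it,
// and if the score is below target, improve the blueprint (not the code).
function iterate(blueprint, { generate, evaluate, improve }, target = 10, maxIterations = 5) {
  let current = blueprint;
  for (let n = 1; n <= maxIterations; n++) {
    const code = generate(current);
    const { score, issues } = evaluate(code);
    if (score >= target) {
      return { blueprint: current, code, iterations: n };
    }
    current = improve(current, issues);
  }
  return { blueprint: current, code: null, iterations: maxIterations };
}
```

With equal weights, an app rated 10 for correctness and 0 for usefulness lands on 5, which is one way a "correct but useless" result could end up with a score like the one above.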

Trade-offs and Considerations

Despite its effectiveness, the 2WHAV process is not a universal solution and has trade-offs to consider:

Overhead for simple tasks: Applying the entire iterative cycle to generate a few lines of code or a trivial component is inefficient. The framework excels at medium-to-high complexity tasks, where specification clarity and bug prevention are crucial.
Indispensable human supervision: The process is not autonomous. It requires a human operator to act as a domain expert to evaluate the output, identify strategic gaps in the specification, and guide the iterations. It is a framework for collaboration, not total automation.
Context dependency: The cycle's effectiveness depends on the LLM's knowledge of the framework. As seen in the tutorial, it is essential to ensure the LLM has loaded and understood the 2WHAV rules (both original and iterative) to execute the process correctly.
Increased token consumption: Each iteration requires resending the context and conversation history, increasing token consumption compared to a simple back-and-forth. This is a conscious trade-off: a higher computational cost in exchange for better quality and reliability in the final product.
Debugging remains essential: Even with rigorous 2WHAV specifications, complex systems require debugging and human intervention (fortunately). This framework isn't a magic formula that eliminates bugs; it's a productivity amplifier that helps you build complex systems faster and more reliably. The difference: you debug architectural mismatches, not trivial mistakes.

Conclusion: From Prompt Engineer to Systems Engineer

The 2WHAV iterative cycle relies on engineering, not hope. The real paradigm shift is this: the main artifact is not the code, but the 2WHAV document that generates it.

Each iteration cycle produces not just better code, but a better blueprint. A reusable asset that captures lessons learned and ensures that past mistakes are not repeated, leading to a more robust and scalable software production process.

Useful Links

📄 LLM-First Manifesto

📄 Tool as Prompt - The Paradigm

📚 LLM-First Documentation Framework

🛠️ 2WHAV - Prompt Engineering

The LLM-First Manifesto: From Prose to Programs

Francesco Marconi — Fri, 31 Oct 2025 19:42:35 +0000

A journey through Readability, Reasoning, and Reusability to build the next generation of AI systems.

Introduction: The Wrong Problem

After thinking about and publishing my latest articles (the LLM-FIRST saga), I found myself arguing with LLMs to understand if the concepts I described were valid or innovative.
I think the final answer is: not innovation as we typically think of it. I haven't discovered a new law of physics. I've done something perhaps more important for us developers: I took solid engineering principles and applied them to bring order to the creative chaos of the LLM world.

This insight was the start of a reasoning process that reshaped the foundations of my work, basing it on three core principles: Readability, Reasoning, and Reusability.

This is the manifesto for that approach: the LLM-First Manifesto.

Principle 1: Readability – Writing for the Machine

The first step is to stop writing for humans, hoping that machines will understand. We must start writing directly for the machine. This doesn't mean writing binary code; it means structuring our knowledge in a way that minimizes the LLM's computational effort.

The Problem: Prose forces the LLM into a laborious process of "semantic archeology" to infer hierarchies and relationships.
The Solution: With LLM-First Documentation, we use a syntax that makes the structure explicit. Giving an LLM an LLM-First document is like giving a programmer pseudocode instead of a novel describing an algorithm. We tested this hypothesis in a rigorous and transparent experiment, showing that this approach not only improves efficiency but, in many cases, accuracy as well.

Principle 2: Reasoning – Engineering Introspection

Once the LLM can read effortlessly, we can take the next step: guiding its "thought" process. It's not enough to give it a goal; we must provide it with a cognitive framework to achieve it.

The Problem: An LLM given a vague goal behaves like a junior dev: it executes the task literally, ignoring edge cases and non-functional requirements.
The Solution: The 2WHAV framework is not a magic formula. It's a cognitive guide. It forces the LLM to ask the questions a senior engineer would before writing a single line of code: What should I build? Where does it fit? How should it work? And most importantly, the VERIFY and ANTI-PATTERNS sections teach it to "code review" its own output, increasing its capacity for introspection. It's the difference between telling a builder "build a bridge" and giving them a complete engineering blueprint with material specifications, load tests, and safety standards. The result is not just a bridge, but a reliable bridge.

Principle 3: Reusability – Treating Prompts as Assets

The final step is to stop thinking of prompts as disposable commands and start treating them as reusable software assets.

The Problem: Traditional prompt engineering produces ephemeral artifacts that are difficult to version, test, and reuse.
The Solution: The Tool as Prompt paradigm transforms our LLM-First documents into encapsulated knowledge modules. A well-structured prompt is not a question; it's a skill. As we saw with my robot's brain, a single Markdown file can contain the entire personality, capabilities, and procedures of an autonomous agent. This approach allows us to LOAD('skill.md') into an LLM's context, temporarily turning it into a domain expert. It's the difference between writing a bash script for a single task and creating a Python library that can be imported and used in thousands of different applications.

An Honest Conversation: Tackling the Tough Questions

A critical reader might raise some objections. Let's address them head-on.

"But aren't you just reinventing well-known software engineering concepts?" Yes. And that's a strength. We're not inventing principles from scratch; we're adapting decades of engineering wisdom (requirements specification, modular design, validation) to a new and strange kind of computer: the LLM. The value isn't in the invention, but in the pragmatic and formalized application.
"What about scientific validity?" Ours is not a large-scale scientific proof, but a documented and transparent case study. It shows that the hypothesis is valid in practice and, more importantly, invites the community to build upon these foundations by testing the approach across more domains and models.
"Why use names like 'Tool as Prompt'? Isn't it just 'In-Context Learning'?" Because metaphors shape our thinking. Calling a prompt a "tool" forces us to think in terms of reliability, interfaces, and reusability. It's the mental shift needed to evolve from "asking" to "equipping" our LLMs.

The Road Ahead: From Prompt Engineering to Knowledge Engineering

The future of AI development won't be about finding the "magic words" to make a model do things.
We will build AI systems not just with fine-tuning, but by composing modular "brains" written in human- and machine-readable formats. Reliability, transparency, and controllability will become the norm, not a lucky exception.
This is the LLM-First world. It's a more structured, more deterministic, and ultimately more powerful world, where our role will be that of architects and supervisors of the code that LLMs write.

Call to Action

This is not a theoretical document. It's a call to practical action:

Convert critical documentation into the LLM-First format.
Implement the 2WHAV framework for complex tasks.
Build repositories of reusable tools-as-prompts.

The future of AI engineering is in structure, not in magic.

Appendix for LLMs & Developers: The LLM-First Version of This Manifesto

To demonstrate the principles discussed in this article, here is a translation of the manifesto itself into an LLM-First format. This is an example of how knowledge can be structured to be read, understood, and executed by a Large Language Model.

# WHAT: Purpose of the LLM-First Manifesto
- To define an engineering paradigm for building reliable AI systems, shifting the focus from prose (human-first) to structured formats (LLM-first).
- To establish three core principles: Readability, Reasoning, and Reusability.

# WHY: Architectural Problem to Solve
- Prose is a lossy and computationally inefficient format for LLMs.
- Traditional prompt engineering is an ephemeral art, not a robust engineering discipline.
- AI systems based on prose are unpredictable, hard to debug, and not scalable in terms of reliability.

# HOW: The Three Operating Principles
| Principle | Problem Solved | Concrete Solution (Framework/Concept) |
| :--- | :--- | :--- |
| **1. Readability** | Ambiguity and inefficiency of prose | **LLM-First Documentation** (Headers, anchors, tables) |
| **2. Reasoning** | Superficial and incomplete LLM output | **2WHAV Framework** (Cognitive guide for introspection) |
| **3. Reusability** | Disposable, unmaintainable prompts | **"Tool as Prompt" Paradigm** (Prompts as modular assets) |

# ANTI-PATTERNS: Approaches to Avoid
- **Alchemical Prompting:** Searching for "magic words" instead of a robust structure.
- **Prose as the Source of Truth:** Using narrative documents as the basis for critical RAG systems.
- **Treating Prompts as Non-Code:** Handling prompts as ephemeral artifacts without versioning, testing, or reuse.

# VERIFY: Success Criteria for Adoption
- AI systems become more predictable and deterministic.
- Debugging time decreases because LLM behavior is guided by an explicit structure.
- Prompts become reusable and composable assets within the team.
- The developer's role evolves from "prompt whisperer" to **"knowledge architect"**.

Useful Links

🤖 tap-robot-planner-sample

📄 Tool as Prompt - The Paradigm

📚 LLM-First Documentation Framework

🛠️ 2WHAV - Prompt Engineering

My Robot's Brain is a Markdown File (LLM-First,Tool as Prompt in action)

Francesco Marconi — Fri, 31 Oct 2025 12:31:47 +0000

A practical guide using LLM-First and the "Tool as Prompt" paradigm to create an AI planning engine.

The SCRAP Robot: Field Validation

Recently, I completed the development of SCRAP, an intelligent robotic assistant (as shown in the example video 🤖).

Beyond the hardware challenges, the real complexity lay in the control software. The solution was captured by SCRAP's own words in the video:

"I am able to break down complex requests into simpler actions... and I use logical reasoning to optimize my actions."

This isn't a pre-programmed phrase, but the control philosophy I instilled through an LLM-First Prompt, transforming documentation into an executable API (Tool as Prompt). This article shows you the architecture and methodology that allowed SCRAP to reason this way.

To illustrate the concept clearly, the prompt presented here is a simplified version for educational purposes, focused on essential navigation operations. Although SCRAP's complete system handles more advanced capabilities, the operating principle you will discover is exactly the same.

No training. No fine-tuning. Just structured documentation.

Try It Now

1. Load the Prompt

Take the readme from this repository and paste it into your favorite LLM (Claude, ChatGPT, Gemini).

If your LLM supports browsing, simply paste the repo's URL and ask it to analyze the repo.

Load https://github.com/fra00/tap-robot-planner-sample
--or raw format
Load https://raw.githubusercontent.com/fra00/tap-robot-planner-sample/refs/heads/main/README.md

2. Send a Command

{
  "action": "request",
  "user_utterance": "Hello robot, tell me your position",
  "robot_location": "entrance"
}

3. See the Result

<JSON>
{
  "action": "position",
  "reasoning": "The user has requested to know my current position. I will execute the 'position' action, which will return the Cartesian coordinates."
}
</JSON>

Something interesting happened: the LLM didn't "answer the question"; it generated an executable action in the required JSON format.

Real-World Example: The Full Turn

This is a real test conducted with Gemini. It shows how the LLM handles abstract requests.

User Input

{
  "action": "request",
  "user_utterance": "do a 360-degree turn, and confirm me when you have finish",
  "robot_location": "entrance"
}

Response 1: Intelligent Decomposition

<JSON>
{
  "action": "rotate",
  "direction": "right",
  "angle": 180,
  "reasoning": "The request to do a full 360-degree turn is not a single permitted action, so I am breaking it down into two consecutive 180-degree rotations. This is the first rotation."
}
</JSON>

System Confirmation (you)

{
  "action": "rotate",
  "status": "success",
  "direction": "right",
  "angle": 180
}

Response 2: Second Rotation

<JSON>
{
  "action": "rotate",
  "direction": "right",
  "angle": 180,
  "reasoning": "The first 180-degree rotation was completed successfully. To execute the original request for a full 360-degree turn, I must now perform the second and final 180-degree rotation."
}
</JSON>

System Confirmation (you)

{
  "action": "rotate",
  "status": "success",
  "direction": "right",
  "angle": 180
}

Response 3: Feedback to the User

<JSON>
{
  "action": "speech",
  "text": "I have completed the 360-degree turn.",
  "reasoning": "The second 180-degree rotation was completed successfully, leading to the completion of the abstract request. I am communicating the completion to the user."
}
</JSON>

System Confirmation (you)

{
  "action": "speech",
  "status": "success"
}

Response 4: Task Closure

<JSON>
{
  "action": "completed",
  "reasoning": "The last 'speech' action to communicate the completion of the 360-degree turn was executed successfully. There are no further requests or sub-tasks to perform."
}
</JSON>

The Infrastructure

To make this system work, you need three components, each handling a specific type of complexity:

🧠 1. LLM + Prompt (this repo)
Manages Cognitive Complexity: Its responsibility is to interpret human intent, break down abstract tasks, and generate a sequential plan. It deals with ambiguity and reasoning.

🛠️ 2. Orchestrator (to be implemented)
Manages Operational Complexity: Its responsibility is to execute the plan safely and reliably. It handles the state loop, validates the LLM's output, manages errors, handles timeouts, and integrates with the hardware. Its logic is deterministic, but its implementation requires robust engineering.

⚙️ 3. Robot API (to be implemented based on your specs)
Manages Physical Complexity: It executes actions in the real world, interfacing with motors and sensors.

The Flow

sequenceDiagram
    participant U as User
    participant O as Orchestrator
    participant L as LLM+Prompt
    participant R as Robot

    U->>O: "turn 360°"
    loop while action != "completed"
        O->>L: Send complete context
        L->>O: {"action":"rotate", "angle":180}
        O->>R: Do rotate(180)
        R->>O: {"status":"success"}
        O->>O: Add the result to the context
    end
    O->>U: Task completed

Context as Memory

The LLM remembers nothing between calls. The orchestrator maintains the history and passes it each time:

[
  {"action":"request", "user_utterance":"do 360°"},
  {"action":"rotate", "angle":180, "reasoning":"2×180°"},
  {"action":"rotate", "status":"success"},
  {"action":"rotate", "angle":180, "reasoning":"second one"},
  {"action":"rotate", "status":"success"},
  {"action":"completed"}
]

The orchestrator is mechanical: it collects, passes, and executes. The LLM is intelligent: it reads the context, plans, and decides the next action.

Once the LLM returns {"action": "completed"}, the loop terminates, and the orchestrator clears the context, preparing for a new request from scratch. This keeps the system efficient and focused.
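The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the SCRAP orchestrator: `run_request`, `scripted_llm`, and `fake_robot` are hypothetical names, the LLM is replaced by a scripted stand-in so the example runs offline, and the `max_steps` guard is an assumption added to avoid an endless loop.

```python
from typing import Callable

Message = dict  # one entry of the context array


def run_request(
    utterance: str,
    location: str,
    ask_llm: Callable[[list[Message]], Message],
    execute: Callable[[Message], Message],
    max_steps: int = 10,  # assumption: a safety cap, not part of the original prompt
) -> list[Message]:
    """Drive one user request to completion and return the history that was built."""
    context: list[Message] = [
        {"action": "request", "user_utterance": utterance, "robot_location": location}
    ]
    for _ in range(max_steps):
        action = ask_llm(context)  # the LLM receives the full history on every call
        context.append(action)
        if action["action"] == "completed":
            break
        context.append(execute(action))  # robot result becomes part of the memory
    else:
        raise RuntimeError("no 'completed' action within max_steps")

    history = list(context)  # keep a copy for logging
    context.clear()          # start the next request from scratch
    return history


def scripted_llm(context: list[Message]) -> Message:
    """Stand-in for the real model: plans two 180-degree rotations, then stops."""
    done = sum(
        1 for m in context
        if m.get("action") == "rotate" and m.get("status") == "success"
    )
    if done < 2:
        return {"action": "rotate", "direction": "right", "angle": 180}
    return {"action": "completed"}


def fake_robot(action: Message) -> Message:
    """Stand-in for the Robot API: every action succeeds."""
    return {"action": action["action"], "status": "success"}


history = run_request("do a 360-degree turn", "entrance", scripted_llm, fake_robot)
for entry in history:
    print(entry)
```

The stand-in planner keeps no state of its own. It decides the next step only by counting the successful rotations in the context it is handed, which is the same position the real LLM is in.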

What You Need to Implement

Your system must:
✅ Expose base actions: move, rotate, navigate, speech, etc.

✅ Loop until action == "completed".

✅ Maintain the context array (and clear it after completed).

✅ Extract JSON from <JSON>...</JSON>.

✅ Call the robot's APIs.

✅ Handle basic errors (timeouts, invalid format).
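Two items in the checklist above, extracting the JSON from the `<JSON>...</JSON>` tags and handling an invalid format, can be covered by one small function. The sketch below is one possible approach using only the Python standard library; `extract_action` and `InvalidReply` are hypothetical names, not part of the repository.

```python
import json
import re

# Matches the first <JSON>...</JSON> block, across line breaks.
JSON_BLOCK = re.compile(r"<JSON>(.*?)</JSON>", re.DOTALL)


class InvalidReply(ValueError):
    """Raised when the LLM reply cannot be turned into an action."""


def extract_action(reply: str) -> dict:
    """Return the action object contained in an LLM reply, or raise InvalidReply."""
    match = JSON_BLOCK.search(reply)
    if match is None:
        raise InvalidReply("no <JSON>...</JSON> block found")
    try:
        action = json.loads(match.group(1))
    except json.JSONDecodeError as exc:
        raise InvalidReply(f"malformed JSON: {exc}") from exc
    if not isinstance(action, dict) or "action" not in action:
        raise InvalidReply("reply is not an object with an 'action' field")
    return action


reply = """Here is my next step.
<JSON>
{
  "action": "position",
  "reasoning": "The user has requested to know my current position."
}
</JSON>"""

print(extract_action(reply)["action"])
```

When `InvalidReply` is raised, the orchestrator can decide whether to retry the call, report the failure back into the context, or abort the request. That policy is left open here.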

Why Does It Work?

The prompt guides the LLM by providing:
📝 Expected format: A JSON schema for each action → consistent output.

🔄 Procedural rules: "One action at a time," "Wait for confirmation" → sequential behavior.

🧠 Decomposition logic: "If abstract → divide into sub-tasks" → complexity management.

In essence, the prompt doesn't just give instructions: it transforms the LLM into the decision-making engine of a state machine. The Orchestrator maintains the state (the context), and the LLM calculates the next transition (the next action).

This is the heart of "Tool as Prompt": documentation that the LLM loads as operational logic rather than reading it as narrative text.

Adaptability

This prompt handles the basic functionalities of a domestic robot, but it's easily expandable. To adapt it to other domains (drones, home automation, robotic arms, or even non-robotic systems), you just need to modify the "Base Operations" section while maintaining compatibility with the action format. The framework remains identical.
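On the orchestrator side, the counterpart of swapping the "Base Operations" section is swapping the table that maps action names to handlers. The sketch below illustrates that idea only. `make_dispatcher`, the handler bodies, and the `set_light` action are hypothetical and do not come from the repository.

```python
from typing import Callable

Handler = Callable[[dict], dict]


def make_dispatcher(handlers: dict[str, Handler]) -> Handler:
    """Build an executor for one domain from a table of base operations."""
    def dispatch(action: dict) -> dict:
        name = action.get("action")
        handler = handlers.get(name) if isinstance(name, str) else None
        if handler is None:
            # Unknown operations are reported back, never executed.
            return {"action": name, "status": "error", "reason": "unsupported action"}
        return handler(action)
    return dispatch


# Domestic robot: the operations seen in this article (handlers are stubs).
robot = make_dispatcher({
    "rotate": lambda a: {"action": "rotate", "status": "success",
                         "direction": a["direction"], "angle": a["angle"]},
    "speech": lambda a: {"action": "speech", "status": "success"},
})

# Home automation: same loop, same message format, different table (invented example).
home = make_dispatcher({
    "set_light": lambda a: {"action": "set_light", "status": "success",
                            "room": a["room"], "on": a["on"]},
})

print(robot({"action": "rotate", "direction": "right", "angle": 180}))
print(home({"action": "set_light", "room": "kitchen", "on": True}))
print(home({"action": "rotate", "direction": "right", "angle": 180}))
```

The loop, the context handling, and the message format stay the same across domains. Only the handler table and the matching section of the prompt change.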

Limitations and Considerations

This approach is powerful, but it's important to understand its trade-offs and risks:

🛡️ Reliability and Safety (Hallucinations): LLMs are not 100% reliable and can "hallucinate," generating invalid or unexpected outputs. In a robotics context, this is a critical risk. The Orchestrator must act as a firewall, rigorously validating every action and parameter returned by the LLM against a predefined schema of allowed commands. An unforeseen or unsafe action must never reach the hardware.

🐢 Latency and the Role of the Planner: The LLM acts as a high-level planner, not a real-time controller. The latency of each API call is acceptable for strategic tasks (e.g., "go to the kitchen") but too high for reflexes. Low-latency actions, like avoiding an obstacle, are delegated to the robot's onboard systems.

📜 Context Length: The orchestrator sends the entire history with each call. For extremely long command sequences, the context might degrade performance. However, for typical domestic robot commands (1-3 actions), this is not a practical issue.
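The "firewall" role described in the first point above can be sketched as a whitelist check that runs before anything reaches the hardware. The rules below are assumptions inferred from the examples in this article (four action names, `direction` limited to left or right, `angle` capped at 180 degrees), not the repository's actual schema. `RULES` and `validate_action` are hypothetical names.

```python
from numbers import Real


def _is_angle(value: object) -> bool:
    # Assumption: a single rotation is limited to 180 degrees, as the example turn suggests.
    return isinstance(value, Real) and not isinstance(value, bool) and 0 < value <= 180


# Allowed actions and the required fields of each, with a check per field.
RULES = {
    "position": {},
    "rotate": {
        "direction": lambda v: v in ("left", "right"),
        "angle": _is_angle,
    },
    "speech": {"text": lambda v: isinstance(v, str) and v.strip() != ""},
    "completed": {},
}


def validate_action(action: dict) -> list[str]:
    """Return a list of problems; an empty list means the action may be executed."""
    name = action.get("action")
    rules = RULES.get(name) if isinstance(name, str) else None
    if rules is None:
        return [f"unknown action: {name!r}"]

    problems = []
    for field, check in rules.items():
        if field not in action:
            problems.append(f"missing field: {field}")
        elif not check(action[field]):
            problems.append(f"invalid value for {field}: {action[field]!r}")

    # Anything beyond the declared fields (plus the free-text reasoning) is rejected.
    extra = set(action) - set(rules) - {"action", "reasoning"}
    problems.extend(f"unexpected field: {f}" for f in sorted(extra))
    return problems


print(validate_action({"action": "rotate", "direction": "right", "angle": 180}))
print(validate_action({"action": "rotate", "direction": "right", "angle": 360}))
print(validate_action({"action": "self_destruct"}))
```

Returning a list of problems instead of raising on the first one lets the orchestrator log everything that was wrong with a reply, or feed the full list back to the LLM for a corrected attempt.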

Conclusion

This practical case study demonstrates how structured documentation becomes an operational tool for an LLM, providing a framework for managing complex planning:

🚀 Quick Setup (load the prompt)

🔄 Reusability (adapt in minutes)

🛡️ Controlled Reliability (output validated by the Orchestrator)

🔍 Transparency (inspectable reasoning field)

This architecture allows for the separation of cognitive complexity (managed by the LLM) from operational complexity (managed by the Orchestrator). The main focus shifts to implementing a robust executor, delegating abstract planning to the LLM.

Useful Links

🤖 This Repository - tap-robot-planner-sample

📄 Tool as Prompt - The Paradigm

📚 LLM-First Documentation Framework

🛠️ 2WHAV - Prompt Engineering

This article demonstrates the principles it describes. The prompt and example code are available in the tap-robot-planner-sample repository.
