Forem: André Ahlert

The Boring AI Is the Right AI

André Ahlert — Sun, 17 May 2026 16:00:00 +0000

At the AI Engineer Summit 2025 in New York, the mantra that got repeated on stage after stage was five words. Capability does not mean reliability. Speakers from finance, infrastructure, and consumer products converged on the same point: shipping an agent that demos well is now a solved problem, and shipping one that survives a Tuesday in production is not.

The data backs the room. LangChain's State of Agent Engineering report found that 89 percent of organizations running agents in production have had to add observability that their framework did not give them. Sixty-two percent had to build detailed tracing for individual agent steps. Honeycomb's O11yCon 2026 was themed, in full, as the observability conference for the agent era. Three different angles on the same pattern. Teams that took an agent to production had to build half an orchestrator on top of their framework.

The pattern has a name now. Last month, Kaxil Naik and Pavan Kumar Gopidesu shipped the Common AI Provider for Apache Airflow 3 with a sentence that articulates what hundreds of teams had been intuiting: "Not a wrapper around another framework, but a provider package that plugs into the orchestrator you already run." Both work at Astronomer, the commercial backer of Airflow, which is worth naming up front. The sentence is a diagnosis whether it came from Astronomer or anyone else.

The diagnosis is that the dominant design pattern of the last two years, treating the agent loop as a new runtime, was a category error. The agent loop is not a runtime. It is a workload. The runtime already exists.

What durable orchestration actually buys you

Three things a mature orchestrator gives an agent that a prototype framework does not.

The first is durable replay. The Common AI Provider post puts it bluntly: "When a 10-step agent task fails on step 8, a retry shouldn't re-run all 10 steps and double your API bill." Durable execution caches each model response and each tool result in object storage. A retry serves the cache instead of paying the LLM again. Anyone who has watched an agent loop burn a hundred dollars in a single Sunday night incident will recognize the value of that one line. Frameworks ship retry as a decorator. Orchestrators ship retry as a contract.
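
A minimal sketch of the durable replay idea, assuming a local directory stands in for object storage and every step has a stable key. The names `DurableRun`, `step`, `fake_llm`, and `flaky_tool` are hypothetical and are not the Common AI Provider's API.

```python
import json
import tempfile
from pathlib import Path


class DurableRun:
    """Caches each step result on disk so a retry replays instead of re-paying."""

    def __init__(self, cache_dir, run_id):
        self.dir = Path(cache_dir) / run_id
        self.dir.mkdir(parents=True, exist_ok=True)

    def step(self, key, fn, *args):
        path = self.dir / f"{key}.json"
        if path.exists():
            # Replay: serve the stored result, do not call the model or tool again.
            return json.loads(path.read_text())
        result = fn(*args)
        path.write_text(json.dumps(result))
        return result


calls = {"llm": 0}
state = {"fail_once": True}


def fake_llm(prompt):
    # Stand-in for a paid model call; counts how often it is really invoked.
    calls["llm"] += 1
    return f"answer to {prompt}"


def flaky_tool(_):
    # Fails on the first attempt only, like a transient outage.
    if state["fail_once"]:
        state["fail_once"] = False
        raise RuntimeError("transient failure")
    return "tool ok"


def agent_task(run):
    outputs = [run.step(f"llm-{i}", fake_llm, f"q{i}") for i in range(7)]
    outputs.append(run.step("tool-8", flaky_tool, None))
    return outputs


cache_dir = tempfile.mkdtemp()
try:
    agent_task(DurableRun(cache_dir, "run-1"))  # first attempt dies on the tool step
except RuntimeError:
    pass
result = agent_task(DurableRun(cache_dir, "run-1"))  # retry replays the seven model steps
```

After the retry, `calls["llm"]` is still 7: the seven model steps came from the cache and only the failed tool step ran again.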

The second is observability that did not have to be invented. Airflow has had structured logging, run history, task duration metrics, and lineage tracking for years, because those features are how a data team trusts a pipeline at all. When the agent becomes a task, the agent inherits everything. There is no instrumentation project. The trace exists because it had to exist for ETL.

The third is the boring infrastructure that every framework eventually rediscovers. Authentication to three hundred and fifty backends. Role-based access control on who can approve which tool call. Secret management. Connection pooling. Cost attribution by team. None of these are agent features. All of them are required to ship one. Airflow has them because its core customers have been demanding them for a decade. A new framework starts at zero and rebuilds them, badly, in the months that follow its first production incident.

These three are why the conference circuit converged on reliability as the theme. The talks were not announcing a new problem. They were naming the rebuild bill.

The category error

Most agent frameworks were designed around the same wrong premise. The premise was that the new thing was the orchestration of LLM calls. If you accept that premise, you build a runtime. You write a scheduler, a retry layer, a state machine, an observability story, a permissions model. You ship the runtime as the framework and the framework owns the lifecycle of the application.

The premise was off by one. The new thing was the LLM call. Everything around it was already a solved problem. The orchestrator did not need to be invented. It has been in production since 2014. The agent loop is the carry, not the chassis.

The Airflow team had been building toward this correction for two years before the provider shipped. Airflow 3 reshaped the engine around assets rather than schedules, so a pipeline can react to data arriving instead of a clock ticking. The Common AI Provider is the surface layer on top of that foundation, not the diagnosis itself. The diagnosis was the engine work that came first.

Naik and Gopidesu's line, a provider package that plugs into the orchestrator you already run, is the cleanest articulation of the correction. It moves the agent from the center of the architecture to the edge. The center stays where it was. The provider model means a team running Airflow gets @task.agent and @task.llm as decorators next to the @task they have been using since Airflow 2.0, and the new code looks like the old code, because it is.

# Prototype-shape: the agent runtime is the application
from pydantic_ai import Agent

# query_db and read_s3 are tool functions defined elsewhere
agent = Agent(model="openai:gpt-4o", tools=[query_db, read_s3])
result = agent.run_sync("Analyze churn for Q3")
# retry, logging, RBAC, secrets: your problem

# Production-shape: the agent is a task on the orchestrator
from pydantic_ai import Agent
from airflow.sdk import task

agent = Agent(model="openai:gpt-4o", tools=[query_db, read_s3])

@task.agent(agent=agent, llm_conn_id="openai_default")
def analyze_q3_churn(segment: str):
    return f"Analyze churn for {segment}"
# retry, logging, RBAC, secrets: the orchestrator's problem

Same agent definition. The difference is where it runs and what it inherits.

Where the framework still wins

Frameworks are not wrong. They are wrong in production. In every other phase of the work, they are correct.

Prototyping is faster in LangGraph or Pydantic AI than it will ever be in Airflow. The mental model is closer to the code, the iteration loop is shorter, the dependencies are lighter. When you are sketching a new agent shape and do not yet know what tools it needs, the right tool is a notebook with a framework, not a DAG.

Exploratory work belongs there too. Research, evaluation harnesses, small internal tools one person runs once a week. None of these justify the operational weight of an orchestrator. None of them suffer from missing durable replay because they do not run unattended.

The framework wins everywhere the agent is not yet load-bearing. The provider wins the moment the agent has to survive a holiday weekend without you watching it.

The two-line decision

Here is the heuristic, two lines.

If the agent is not yet running unattended on a schedule and not yet paid for by a customer, keep it in the framework. If it is either of those, move it behind a provider on an orchestrator you already run.

That is the line. It is not a commitment to Airflow specifically. The same logic applies if your orchestrator is Dagster, Prefect, or Temporal. The principle is that durable execution is not a checkbox on a roadmap. It is a contract between the engine and the workload, and prototype frameworks ship engines that do not honor that contract.

What the pattern says about software

The pattern is not new. Rails won the web because it absorbed the request-response cycle until that cycle became invisible. Kubernetes won infrastructure because it absorbed the deploy-and-restart loop until the loop became invisible. Postgres absorbed twenty years of small databases because each of those small databases eventually rediscovered transactions, recovery, and indexing, badly. Every time the boring layer wins, it wins because newcomers underestimate how much the boring layer was already doing.

Production AI is in that moment. The boring layer is the orchestrator. The newcomer is the agent framework. The newcomer is not going away, because the newcomer is correct in the half of the lifecycle where the boring layer is too heavy. The boring layer is not going away either, because the moment the agent goes load-bearing, the rebuild begins, and the rebuild is the orchestrator. The signal that this has moved from contrarian read to industry consensus is the framing of Astronomer's State of Airflow 2026 report itself: "The Orchestration Layer is Uniting Data, AI, and Enterprise Growth." Two years ago the orchestrator was a deployment concern. In 2026 it is the consolidating layer.

Naming the pattern early is the move. Teams who name it spend their second quarter shipping product. Teams who do not spend it rebuilding retry logic and calling it agent engineering.

The boring AI is the right AI. Borrowed term, durable claim.

I am building Kilnx, a declarative backend DSL that pairs with htmx, and Provero, where a lot of the orchestration-shape decisions I write about are the day job. If the diagnosis here lands, that is the door.

André Ahlert is a product engineer. Contributor across Apache, Flyte, Backstage, HTMX, Hyperscript. Currently building Kilnx and Provero.

Why I Built a Language Instead of a Framework

André Ahlert — Sat, 16 May 2026 21:08:36 +0000

A friend asked me, around the third week of working on Kilnx, why I had not just written it as an Express plugin. The question was fair. It was also the same question my own brain had been asking me for a month.

The honest answer took me longer to find than I would like to admit. This piece is what I would have said to him in November if I had already understood what I was doing.

The push

I had been shipping backends. Some of them were small. Most of them were not. A multi-tenant CRM where every query had to filter by org_id or you leaked data across customers. An internal admin tool with role-based pages, scheduled jobs, outbound webhooks signed with HMAC. A SaaS dashboard with background workers, rate-limited APIs, an LLM call inside a critical path. None of these are blogs. All of them carried the same shape.

The interesting work in each project was the domain. The uninteresting work was identical. Auth setup. Session management. CSRF wiring. Connection pool tuning. Migration script naming. Multi-tenant guards on every query. Webhook signature verification. The right way to call Claude from a background job without burning the bill on retry. By the time I had a first feature shipped, I had touched a dozen files and made forty decisions, none of which had anything to do with what the customer wanted.

I started counting. Two thirds of the lines in those projects were about plumbing. The other third was the product. The numbers held across three projects. The numbers held across two stacks. The plumbing was not a project artifact. It was the toolchain's signature, and it was showing up in every project I touched.

That is what pushed me toward a language. Not a feeling. A pattern that did not move when I changed teams, customers, or stacks.

Constitution

The repo has a file called PRINCIPLES.md. The first principle, numbered zero because it predates the others, reads:

The complexity is the tool's fault, not the problem's. Most web apps are not complex. They are lists, forms, dashboards, CRUDs. The complexity comes from the tools we use, not from the problem we are solving. Kilnx exists to prove this.

That sentence is a claim. The rest of the language is the test of the claim. If you accept the premise, the design follows. If you reject the premise, nothing about Kilnx makes sense. The interesting argument is whether the premise is true, not whether the design is clever.

I think it is mostly true. Not entirely. Some web work is genuinely complex and would be complex in any tool. But the line between "complex problem" and "complex tool" runs further toward the tool side than most engineers want to admit, and the way to find out which side a given complexity sits on is to build a tool that subtracts itself and see what remains.

Here is what a working slice of the language looks like. Authenticated task list, htmx delete, paginated query, all in one file.

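model user
  name: text required
  email: email unique
  password: password required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto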

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks requires auth
  query tasks: SELECT id, title, done FROM task
               WHERE owner = :current_user.id
               ORDER BY created DESC paginate 20
  html
    {{each tasks}}
      <tr>
        <td>{title}</td>
        <td>{{if done}}Yes{{end}}</td>
        <td><button hx-post="/tasks/{id}/delete"
                    hx-target="closest tr"
                    hx-swap="outerHTML">Delete</button></td>
      </tr>
    {{end}}

action /tasks/:id/delete requires auth
  query: DELETE FROM task WHERE id = :id AND owner = :current_user.id
  respond fragment delete

That file is the whole app. Registration, login with bcrypt, sessions, CSRF on the htmx POST, parameter binding on every query, pagination, ownership check on delete. The Express equivalent is between four hundred and six hundred lines across eight files, depending on which middleware you copy versus extract.

The obvious objection

Build a framework, not a language. Rails exists. Phoenix exists. Django exists. Whatever you think is broken about backend work, somebody already wrapped your favorite host language in a thinner thing and called it a framework. Pick one and contribute.

That was the version of the argument I kept hearing, including from myself. It did not land for one structural reason. Frameworks always lose the constraint fight, and the constraint fight is exactly the fight Kilnx is trying to win.

A framework lives inside the host language. The host language gives you escape hatches at every level. You want to bypass the router? Reach for the HTTP server. You want to skip the ORM? Drop into raw SQL. You want to ignore the tenant guard? Comment it out, the language will let you. Every framework I have shipped real work on, by month six, has half its codebase in the official patterns and the other half in escape hatches. The escape hatches are not bugs. They are the price the framework pays to live as a tenant inside a general-purpose language.

I did not want a tenant. I wanted a contract.

What a language can refuse

Five things turned out to be impossible inside a framework that became natural inside a language.

Compile-time SQL safety. In any framework, your SQL lives in strings, or in an ORM that compiles strings, or in a query builder that pretends not to. The framework cannot validate your queries at compile time because the host language cannot see them at compile time. Kilnx queries are parsed by the same compiler that parses the rest of the program. A column rename in a model fails to compile every query that referenced it. SQL injection is not blocked by an escape function. It is blocked by the grammar refusing to interpolate untyped strings into SQL position.

Multi-tenant guards as syntax. Every SaaS backend I have shipped had the same bug class. Someone forgot to add WHERE org_id = :current_user.org_id to a query, and one tenant could read another tenant's data. The bug class exists because the host language sees the missing filter as legal code. In Kilnx, a tenant modifier on the model produces a fail-closed guard that the analyzer enforces on every query path. If you write a query that does not scope to the tenant, the compiler refuses to build the binary.

model invoice tenant
  amount: int required
  customer: text required

page /invoices requires auth
  query invoices: SELECT id, amount, customer FROM invoice
  html
    {{each invoices}}<p>{customer}: {amount}</p>{{end}}

$ kilnx check app.kilnx
error: query in page /invoices is missing tenant scope
  app.kilnx:5: query invoices: SELECT id, amount, customer FROM invoice
  hint: tenant model 'invoice' requires WHERE org_id = :current_user.org_id
  hint: or use `unscoped: explicit-reason` to opt out for a specific query

A framework can warn you. A language can refuse to build. The pattern is documented in the tenant rollout PR.
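
The refusal above can be approximated in a few lines. This is a rough sketch that pattern-matches query text, whereas a real analyzer would work on the parsed query. `TENANT_MODELS`, `check_tenant_scope`, and the `unscoped_reason` parameter are hypothetical names, not Kilnx internals.

```python
import re

# Models declared with the `tenant` modifier (assumed known from the model pass).
TENANT_MODELS = {"invoice"}
TENANT_FILTER = re.compile(r"\borg_id\s*=\s*:current_user\.org_id\b", re.IGNORECASE)
TABLE_REF = re.compile(r"\b(?:FROM|JOIN|UPDATE)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def check_tenant_scope(sql, unscoped_reason=None):
    """Return a list of error strings; an empty list means the query may build."""
    if unscoped_reason:
        # Explicit opt-out: the author wrote down why this query is unscoped.
        return []
    tables = {name.lower() for name in TABLE_REF.findall(sql)}
    touched = sorted(tables & TENANT_MODELS)
    if touched and not TENANT_FILTER.search(sql):
        return [f"query is missing tenant scope for model '{t}'" for t in touched]
    return []


errors = check_tenant_scope("SELECT id, amount, customer FROM invoice")
```

The check fails closed: a query that touches a tenant model builds only if it carries the filter or an explicit reason for skipping it.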

LLM agents as first-class language constructs. The piece I wrote last week argued that agents in production end up as tasks on the orchestrator you already run. The same logic applies one layer down. An agent inside a request handler is a task on the language you already run.

mcp linear
  command: linear-mcp-server
  env: LINEAR_API_KEY=:env.LINEAR_API_KEY

action /tickets/:id/triage
  agent classify
    prompt: "Classify ticket {ticket.body} into one of: bug, feature, support."
    permission-mode: plan
    max-budget-usd: 0.25
    max-turns: 3
    mcp: linear
  query: UPDATE ticket SET category = :classify.text, cost_usd = :classify.cost_usd
         WHERE id = :id
  respond fragment ticket-row

agent spawns a Claude CLI subprocess. :classify.text, :classify.session_id, :classify.cost_usd, :classify.stop_reason are bound for the rest of the action. max-budget-usd is enforced by the runtime. mcp: linear mounts the MCP server declared at the top of the file. The frame around the agent is grammar, not glue.
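
One way to picture runtime enforcement of max-budget-usd and max-turns is a loop that stops as soon as either limit is hit. This is a sketch under assumptions: `run_agent`, the `step` callable, and the stop-reason strings are hypothetical, and cost is only known after each turn completes.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    text: str
    cost_usd: float
    turns: int
    stop_reason: str


def run_agent(step, max_budget_usd, max_turns):
    """Drive `step` until it finishes, the budget is spent, or turns run out.

    `step(turn)` is a hypothetical callable returning (text, cost_usd, done).
    """
    text, spent = "", 0.0
    for turn in range(1, max_turns + 1):
        text, cost, done = step(turn)
        spent += cost
        if spent > max_budget_usd:
            # The limit is checked after the turn, because cost is only known then.
            return AgentResult(text, spent, turn, "budget_exceeded")
        if done:
            return AgentResult(text, spent, turn, "end_turn")
    return AgentResult(text, spent, max_turns, "max_turns")


# A classifier that answers on its second turn, well inside the limits.
ok = run_agent(lambda turn: ("bug", 0.10, turn == 2), max_budget_usd=0.25, max_turns=3)
# A runaway agent that never finishes and costs 0.20 per turn.
runaway = run_agent(lambda turn: ("...", 0.20, False), max_budget_usd=0.25, max_turns=3)
```

The caller gets a stop reason either way, which is what lets the surrounding action record the cost instead of discovering it on the invoice.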

Migrations as a controlled surface. kilnx migrate detects drift across five dimensions: orphan columns, type mismatch, NOT NULL mismatch, single-column UNIQUE mismatch, DEFAULT presence mismatch. Migrations themselves are additive, never destructive. The language took a position on what is safe to do automatically and what requires the human to look.

$ kilnx migrate app.kilnx
applying schema...
warning: orphan column
  invoice.legacy_status (DB has it, model does not declare it)
  hint: drop manually after data migration
warning: type mismatch
  user.id (DB: integer, model: uuid)
  hint: requires data migration plus ALTER, not auto-generated
warning: NOT NULL mismatch
  task.due_date (DB: nullable, model: required)
  hint: backfill defaults before tightening
warning: UNIQUE mismatch
  account.slug (DB: not unique, model: unique)
  hint: dedupe rows before adding the constraint
warning: DEFAULT presence mismatch
  task.done (DB: no default, model: default false)
  hint: review before relying on the default in new code
migration applied with 5 warnings.

A framework can ship a migration tool. A language can make the migration tool part of the same compile pass that builds your routes.
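
The five drift dimensions amount to a field-by-field comparison between the declared model and the live schema. A minimal sketch, assuming both sides are already loaded into dictionaries; `Column` and `detect_drift` are hypothetical names, and columns that exist only in the model produce no warning because they are assumed to be added by the additive migration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    type: str
    not_null: bool = False
    unique: bool = False
    has_default: bool = False


def detect_drift(table, model, db):
    """Compare declared model columns against live DB columns; return warnings."""
    warnings = []
    # Columns the database has but the model no longer declares.
    for name in sorted(db.keys() - model.keys()):
        warnings.append(f"orphan column: {table}.{name}")
    # Columns present on both sides are compared one property at a time.
    for name in sorted(model.keys() & db.keys()):
        want, have = model[name], db[name]
        if want.type != have.type:
            warnings.append(
                f"type mismatch: {table}.{name} (DB: {have.type}, model: {want.type})"
            )
        if want.not_null != have.not_null:
            warnings.append(f"NOT NULL mismatch: {table}.{name}")
        if want.unique != have.unique:
            warnings.append(f"UNIQUE mismatch: {table}.{name}")
        if want.has_default != have.has_default:
            warnings.append(f"DEFAULT presence mismatch: {table}.{name}")
    return warnings


model = {
    "title": Column("text", not_null=True),
    "done": Column("bool", has_default=True),
}
db = {
    "title": Column("text", not_null=True),
    "done": Column("bool"),
    "legacy_status": Column("text"),
}
drift = detect_drift("task", model, db)
```

Nothing here alters the database. The sketch only reports, which mirrors the position that tightening or dropping is left to a human.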

Single-binary deploy. A framework runs on top of a runtime that you also have to deploy. Node. Python. Ruby. Each brings a package manager, a lockfile, a Dockerfile, a node_modules directory the size of a small operating system. Kilnx compiles a .kilnx file to a fifteen-megabyte binary that embeds the HTTP server, the database driver, the htmx JavaScript, and your application. scp it to a server and run it. The deploy story is ./myapp. A framework can shrink the deploy story. A language can collapse it.

Notice the pattern. A framework can make these things easier. A language can make their alternatives impossible. The asymmetry is the whole point.

What it costs

Building a language is more work than building a framework. This is the easy half of the trade-off to name. The repo is nineteen thousand lines of Go and three hundred eleven tests with race detection, to deliver something whose feature list on paper looks like a slightly opinionated web framework. If a small web framework was what I wanted, building the framework would have been the right answer.

The harder half is that a language has to take itself seriously. The grammar has to be coherent. The error messages have to be useful. The tooling has to exist. There is no falling back on someone else's ecosystem when something is missing. You either ship the LSP server or your users do not get autocomplete. You either ship the test runner or your users do not get tests. You either ship the playground or your users cannot evaluate the language without installing it. You either auto-generate an AGENTS.md for coding agents or your users get LLMs inventing keywords that do not exist.

The third cost is that a language refuses things, and refusing things is socially expensive. Every refusal is a fight with somebody who has a perfectly reasonable use case that the language does not serve. Frameworks can absorb those use cases with an escape hatch. Languages cannot. You have to look someone in the eye and tell them the language is not going to do that, and that the reason is that doing it would break the contract.

A friend told me that the part of building a language nobody warns you about is that you have to say no to a lot of people who are right.

What it gives back

The give-back is the part that justifies the cost. It is also the part that does not fit on a marketing page, because it has to be measured rather than read about. So measure.

The blog example in the repo is ninety-four lines in a single file. The Express equivalent is between four hundred and six hundred lines across eight files, depending on which middleware you copy versus extract.

Kilnx blog            Express + Prisma + EJS blog
──────────────────    ──────────────────────────────
app.kilnx        94   app.js                     62
                      routes/auth.js             88
                      routes/posts.js           104
                      models/Post.js             36
                      middleware/csrf.js         24
                      middleware/session.js      31
                      db/migrations/             47
                      views/                     94
──────────────────    ──────────────────────────────
1 file           94   8 files                   486

The Kilnx version is short because the language absorbed the rest, not because the app does less. The same things ship in both columns. The difference is which side wrote them.

What disappears on the Kilnx side, never written by the user:

bcrypt password hashing       auto from `auth`
session cookies, signed       auto from `auth`
CSRF on every POST/PUT/DELETE auto on every action
SQL parameter binding         only form the grammar allows
HTML escaping in templates    only form the grammar allows
multi-tenant scoping          refused at compile time if missing
schema migrations             same compile pass that builds routes
LLM agent budget enforcement  required attribute on `agent` blocks
MCP server lifecycle          managed by the runtime
HTTP server, routing, logs    embedded in the binary

The other give-back is harder to measure but bigger. The constraints stop you from drifting. There is no point at which you can decide to do auth a different way and have it cost you nothing. The decision was made when the language was designed. You inherit it. The cognitive load of every project drops because the design space is smaller.

For most product work, smaller design space is the gift you have been begging the universe for.

When a framework is the right answer

The inverse is real and worth naming.

If your work needs escape hatches more often than it needs constraints, a framework is the right shape. Custom integrations against an irregular set of third-party systems, custom protocols, custom transport, custom auth flows that do not fit any standard pattern. Days that are ninety percent edge cases. A framework lets you write the edge case directly in the host language without the language fighting you.

Convex made the same trade in the agent world. They accepted the determinism contract, and they paid the cost of not being able to do arbitrary side effects in mutations. For most product workloads that cost is fine. For some, it is too high. The same logic applies here. Kilnx accepts a constraint contract, and the contract is wrong for some workloads. The question is whether your workload is one of them, and the honest answer is usually no.

What the bet really is

The bet at the center of Kilnx is that a specific opinion, taken seriously, eats a category of work that nobody wanted to be doing anyway. Pick the right opinion, encode it past the point where users can opt out, and the opinion becomes leverage. The leverage shows up as code that did not need to be written. Two thirds of every project, in my experience.

A framework can host an opinion. A language can enforce one. The reason I built a language is that I wanted the enforcement, and I had counted the lines of plumbing in enough projects to know what the enforcement was worth.

Kilnx is in early release. The grammar is twenty-seven keywords. The compiler is a few thousand commits old. None of that matters as much as the bet does, and the bet is what is being tested in production over the next year.

The spreadsheet was real. The language was the honest answer to it.

I am building Kilnx, a declarative backend language that pairs with htmx, and Provero, where a lot of the language-shape decisions I write about are the day job. If the diagnosis here lands, that is the door.


Soda Moved to ELv2. Provero Is Apache 2.0.

André Ahlert — Wed, 01 Apr 2026 14:19:06 +0000

When Soda changed its license from Apache 2.0 to Elastic License v2, teams that relied on Soda Core as open source infrastructure had to re-evaluate. This post explains what changed, what it means for you, and what alternatives exist.

What happened

Soda Core was originally released under Apache License 2.0. In 2023, Soda switched to the Elastic License v2 (ELv2). The change applied to all new versions of Soda Core and its associated packages.

ELv2 is not an open source license by the OSI definition. It adds two restrictions that Apache 2.0 does not have: you cannot offer the software to third parties as a hosted or managed service, and you cannot disable or circumvent its license key functionality. For internal use at most companies, ELv2 is permissive enough. But for platform vendors, consultancies embedding Soda in their products, or organizations with strict open source policies, it creates friction.

Who is affected

Internal data teams (Low impact) -- ELv2 allows internal use. You can keep using Soda Core if the license terms work for your legal team.

Data platform vendors (High impact) -- If you embed data quality checks in a product you sell, ELv2 prohibits offering it as a managed service without a commercial agreement.

Consultancies and integrators (Medium impact) -- Depends on how you distribute. If you ship Soda as part of a client deployment, review the license terms with legal.

Open source projects (High impact) -- ELv2 is not OSI-approved. If your project requires OSI-approved dependencies, you cannot depend on Soda Core.

This is a pattern, not an exception

Soda is not the first data tool to make this move. The playbook is familiar across the industry: release as open source, build adoption, then change the license to protect a commercial offering. Elastic did it. MongoDB did it. HashiCorp did it. Each time, the community had to decide whether to accept the new terms, fork the project, or find an alternative.

The pattern is rational from a business perspective. But it breaks trust with teams who built infrastructure on the assumption that the license would not change.

What Provero does differently

Provero is licensed under Apache 2.0. Every feature ships in the open source package: anomaly detection, data contracts, all 16 check types, the CLI, the Airflow provider. There is no cloud-only tier and no feature gating.

We are pursuing acceptance into the LF AI & Data Foundation, which means the project would be governed by a neutral foundation, not a single company. Foundation governance makes unilateral license changes structurally difficult.

	Provero	Soda Core
License	Apache 2.0	ELv2
OSI approved	Yes	No
Managed service allowed	Yes	No
Anomaly detection	Included	Cloud only
Data contracts	Included	Partial (Cloud for full)
Governance	Targeting LF AI & Data Foundation	Soda Inc.
Check format	YAML	SodaCL
Migration path	`provero import soda`	N/A

Migrating from Soda

If you have existing SodaCL checks, Provero includes a converter that maps Soda check syntax to Provero YAML:

pip install provero
provero import soda checks.yaml -o provero.yaml
provero run

SodaCL	Provero
`missing_count(col) = 0`	`not_null: col`
`duplicate_count(col) = 0`	`unique: col`
`row_count > 0`	`row_count: { min: 1 }`
`freshness(col) < 24h`	`freshness: { column: col, max_age: 24h }`
`valid_count(col) = ...`	`accepted_values: { column: col, values: [...] }`

Checks that don't have a direct equivalent are preserved as YAML comments, so nothing is silently dropped.
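
To make the mapping concrete, here is a rough sketch of how such a converter could work for the four fully specified rows of the table. It is not Provero's implementation: `RULES` and `convert_check` are hypothetical, the `valid_count` row is left out because its right-hand side is elided above, and the exact comment format for unmatched checks is an assumption.

```python
import re

# One (pattern, template) pair per fully specified row of the mapping table.
RULES = [
    (re.compile(r"^missing_count\((\w+)\)\s*=\s*0$"), r"not_null: \1"),
    (re.compile(r"^duplicate_count\((\w+)\)\s*=\s*0$"), r"unique: \1"),
    (re.compile(r"^row_count\s*>\s*0$"), "row_count: { min: 1 }"),
    (
        re.compile(r"^freshness\((\w+)\)\s*<\s*(\w+)$"),
        r"freshness: { column: \1, max_age: \2 }",
    ),
]


def convert_check(soda_check):
    """Translate one SodaCL check line; keep unknown checks as a YAML comment."""
    text = soda_check.strip()
    for pattern, template in RULES:
        match = pattern.match(text)
        if match:
            return match.expand(template)
    # Nothing is silently dropped: unmatched checks survive as comments.
    return f"# unsupported: {text}"


converted = [
    convert_check(line)
    for line in [
        "missing_count(email) = 0",
        "freshness(created_at) < 24h",
        "avg(amount) between 1 and 5",
    ]
]
```

The fallback branch is the important part of the design: a check that has no equivalent still shows up in the output file, where a reviewer will see it.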

Our position on licensing

We think data quality is infrastructure. It belongs in the same category as linters, test frameworks, and CI tools. You would not accept a linter that moved half its rules behind a paywall. Data quality checks should work the same way: open, portable, composable.

Provero will stay Apache 2.0. Not because we are against commercial models, but because we believe the right way to build a business around open source is to sell services, hosting, and support on top of a fully open core. Not to restrict the core itself.

pip install provero
provero init
provero run


I built a backend language that a 3B model writes better than Express

André Ahlert — Wed, 01 Apr 2026 10:40:12 +0000

I've been building web apps for years and the thing that always bothered me is how much ceremony goes into something that should be simple. A task list with auth shouldn't need 15 files across 3 directories with 200 lines of config.

So I built Kilnx, a declarative backend language. 27 keywords, compiles to a single binary, SQL inline, HTML as output. At some point I started wondering: if the language is this small, can a tiny local LLM write it? A model that fits on a phone?

I ran the benchmark. Kilnx won every round.

What Kilnx looks like

A complete app with auth, pagination, htmx, and a SQLite database:

config
  database: "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model user
  name: text required
  email: email unique
  password: password required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks requires auth
  query tasks: SELECT id, title, done FROM task
               WHERE owner = :current_user.id
               ORDER BY created DESC paginate 20
  html
    {{each tasks}}
    <tr>
      <td>{title}</td>
      <td>{{if done}}Yes{{end}}</td>
      <td>
        <button hx-post="/tasks/{id}/delete"
                hx-target="closest tr"
                hx-swap="outerHTML">Delete</button>
      </td>
    </tr>
    {{end}}

action /tasks/create method POST requires auth
  validate task
  query: INSERT INTO task (title, owner)
         VALUES (:title, :current_user.id)
  redirect /tasks

kilnx build app.kilnx -o myapp gives you a ~15MB binary. Registration, login with bcrypt, sessions, CSRF, validation, pagination, htmx inline delete. No framework, no ORM, no node_modules.

The question

The Kilnx grammar fits in 400 lines of docs. Express, Django, and Node.js each have thousands of pages of documentation, dozens of APIs, and multiple ways to do the same thing.

I wanted to know if that difference in surface area shows up when you ask small LLMs to generate code. Not GPT-4 or Claude, but models you run on a laptop with Ollama. Models between 1B and 7B parameters.

Setup

I wrote 10 equivalent tasks across four stacks (Kilnx, Express, Django, vanilla Node.js):

#	Task	Difficulty
1	Hello World page	trivial
2	User model definition	easy
3	Page with database query	easy
4	Create with validation	medium
5	Auth + protected route	medium
6	Delete with htmx response	medium
7	SSE notifications	medium
8	Chat websocket	hard
9	Stripe webhook	hard
10	Complete mini app	hard

Five models, three families, all local:

Model	Parameters	Disk
Qwen 2.5 7B	7B	4.7 GB
Qwen 2.5 3B	3B	1.9 GB
Qwen 2.5 1.5B	1.5B	986 MB
Phi-4 Mini	3.8B	2.5 GB
Llama 3.2 1B	1B	1.3 GB

Three validation passes on every output:

Keyword matching - does the code contain the structural elements the task requires?
Syntax check - kilnx check (semantic analysis), node --check, python compile
LLM-as-judge - Qwen 7B rating syntax/completeness/correctness/idiom (0-3 each)

Every combination ran 3 times. 600 generations, 600 judge evaluations.
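A minimal sketch of what the keyword pass and the run grid could look like. The function name, the required-keyword list, and the model labels are illustrative assumptions, not the benchmark's actual harness; only the counts (10 tasks, 4 stacks, 5 models, 3 runs) come from the setup above.

```python
from itertools import product

def keyword_score(code: str, required: list[str]) -> float:
    """Fraction of required structural keywords found in the generated code."""
    if not required:
        return 1.0
    text = code.lower()
    hits = sum(1 for kw in required if kw.lower() in text)
    return hits / len(required)

# Grid from the setup: tasks x stacks x models x runs.
TASKS = range(1, 11)
STACKS = ["kilnx", "express", "django", "node"]
# Hypothetical labels for the five local models.
MODELS = ["qwen2.5-7b", "qwen2.5-3b", "qwen2.5-1.5b", "phi-4-mini", "llama3.2-1b"]
RUNS = range(3)

grid = list(product(TASKS, STACKS, MODELS, RUNS))

# Hypothetical check for a "page with database query" style task.
sample = "page /tasks requires auth\n  query tasks: select id from task\n  html\n"
score = keyword_score(sample, ["page", "requires auth", "query", "redirect"])

print(len(grid), score)
```

The sample contains three of the four required keywords, so it scores 0.75, and the grid has 600 entries, matching the generation count.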

About fairness

This is important. Kilnx has never appeared in any training dataset. Zero .kilnx files exist on the internet outside my repo. Express and Django have millions of code examples baked into every LLM's weights.

I gave the models the Kilnx grammar reference (11.7K chars) as prompt context. Express, Django, and Node got no reference docs because they don't need them.

If anything, this setup gives the established frameworks a huge advantage. They've been pre-trained on the entire Stack Overflow + GitHub history. Kilnx gets one document.

The numbers

Structural correctness (keyword score, averaged over 3 runs)

Model	Kilnx	Express	Node.js	Django
Qwen 2.5 7B	100%	88%	93%	83%
Qwen 2.5 3B	99%	88%	89%	87%
Qwen 2.5 1.5B	99%	85%	87%	74%
Phi-4 Mini	98%	88%	93%	85%
Llama 3.2 1B	90%	78%	77%	77%

Qwen 3B, a 1.9 GB model, scores 99% on Kilnx, a language it has never encountered. The same model gets 87% on Django, a framework it has seen millions of times during training.

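When you shrink from 7B down to 1B, Kilnx drops 10 points, the same as Express. Node.js drops 16 and Django 6. But Kilnx starts higher and still lands at 90%, while the other three end up at 77-78%. The simpler grammar stays usable as the model gets dumber.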
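The 7B-to-1B drops can be recomputed from the table above. This sketch only copies the Qwen 2.5 7B and Llama 3.2 1B rows, so it compares two different model families, as the paragraph does.

```python
# Keyword scores (%) copied from the structural correctness table.
scores = {
    "Kilnx":   {"qwen_7b": 100, "llama_1b": 90},
    "Express": {"qwen_7b": 88,  "llama_1b": 78},
    "Node.js": {"qwen_7b": 93,  "llama_1b": 77},
    "Django":  {"qwen_7b": 83,  "llama_1b": 77},
}

# Percentage points lost going from the largest to the smallest model.
drops = {stack: s["qwen_7b"] - s["llama_1b"] for stack, s in scores.items()}

# Where each stack ends up on the 1B model.
floor = {stack: s["llama_1b"] for stack, s in scores.items()}

for stack in scores:
    print(f"{stack}: -{drops[stack]} points, ends at {floor[stack]}%")
```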

Tokens per task (completion only)

Framework	Qwen 7B	Qwen 3B	Qwen 1.5B	Phi-4 Mini
Kilnx	105	112	111	95
Django	195	226	152	199
Express	302	349	265	315
Node.js	347	381	333	490

Kilnx uses roughly 3x fewer tokens than Express or Node.js, and about half as many as Django. This is not a style difference. It's the same functionality. A chat websocket in Kilnx is ~110 tokens:

socket /chat/:room requires auth
  on connect
    query: select body, author.name, created from chat_message
           where room = :room
           order by created desc
           limit 50
    send history

  on message
    validate
      body: required max 500
    query: insert into chat_message (body, author, room)
           values (:body, :current_user.id, :room)
    broadcast to :room fragment chat-bubble

The Express version of the same task runs ~420 tokens of socket.io setup, middleware, database calls, and room management.
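To see how the roughly 3x figure varies by model, the ratios can be computed from the tokens-per-task table. The numbers are copied from that table; the helper name is just for illustration.

```python
# Average completion tokens per task, copied from the table.
# Column order: Qwen 7B, Qwen 3B, Qwen 1.5B, Phi-4 Mini.
tokens = {
    "Kilnx":   [105, 112, 111, 95],
    "Django":  [195, 226, 152, 199],
    "Express": [302, 349, 265, 315],
    "Node.js": [347, 381, 333, 490],
}

def ratio_vs_kilnx(framework: str) -> list[float]:
    """How many times more tokens a framework needs than Kilnx, per model."""
    return [round(t / k, 1) for t, k in zip(tokens[framework], tokens["Kilnx"])]

for name in ("Django", "Express", "Node.js"):
    print(name, ratio_vs_kilnx(name))
```

Express lands between 2.4x and 3.3x depending on the model, Node.js between 3.0x and 5.2x.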

Session economics

Kilnx has a cost that Express doesn't: the grammar reference takes ~3,100 prompt tokens. But that's loaded once per session. The per-task completion cost is what scales.

Over a real session with Qwen 3B:

Tasks	Kilnx	Express	Node.js
1	3,269	464	501
10	4,277	4,640	5,010
25	5,957	11,600	12,525
50	8,757	23,200	25,050
100	14,357	46,400	50,100

Kilnx becomes cheaper than Express at task 9. By the end of a workday (call it 50-100 tasks with a copilot), you've used 62-69% fewer tokens than Express and 65-71% fewer than Node.js. If you're paying per token on an API, that's real money. If you're running locally, it's real time.
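The table is consistent with a simple linear model: a one-time fixed cost for Kilnx plus a per-task cost for every stack. The constants below are inferred from the table rows (3,269 minus 112 gives a fixed cost of 3,157), so treat them as an assumption about how the table was built, not as the benchmark's own code.

```python
# Inferred from the session table for Qwen 3B (assumption, see lead-in).
KILNX_FIXED = 3157  # one-time prompt cost: 3,269 total at task 1 minus 112
PER_TASK = {"Kilnx": 112, "Express": 464, "Node.js": 501}

def session_tokens(stack: str, tasks: int) -> int:
    """Cumulative tokens after `tasks` tasks in one session."""
    fixed = KILNX_FIXED if stack == "Kilnx" else 0
    return fixed + PER_TASK[stack] * tasks

def break_even(stack: str) -> int:
    """First task count at which Kilnx's cumulative total is lower."""
    n = 1
    while session_tokens("Kilnx", n) >= session_tokens(stack, n):
        n += 1
    return n

def savings(stack: str, tasks: int) -> float:
    """Fraction of tokens saved by Kilnx relative to `stack`."""
    return 1 - session_tokens("Kilnx", tasks) / session_tokens(stack, tasks)

for n in (1, 10, 25, 50, 100):
    print(n, session_tokens("Kilnx", n), session_tokens("Express", n),
          session_tokens("Node.js", n))
print("break-even vs Express:", break_even("Express"))
print("savings vs Node.js at 100 tasks:", round(savings("Node.js", 100), 2))
```

Under this model the break-even against both Express and Node.js falls at task 9, and the 71% figure corresponds to Node.js at 100 tasks.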

Raw output from the 3B

No editing, no cherry-picking. This is Qwen 2.5 3B (1.9 GB on disk) generating a complete app from scratch, having seen the Kilnx grammar for the first time in the prompt:

config
  database: env DATABASE_URL default "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks layout main requires auth
  query tasks: select id, title, done from task
               where owner = :current_user.id
               order by created desc
               paginate 20
  html
    <input type="search" name="q" placeholder="Search tasks..."
           hx-get="/tasks" hx-trigger="keyup changed delay:300ms"
           hx-target="#task-list">
    <table id="task-list">
      <tr><th>Title</th><th>Done</th><th></th></tr>
      {{each tasks}}
      <tr>
        <td>{title}</td>
        <td>{{if done}}Yes{{end}}</td>
        <td><button hx-post="/tasks/{id}/delete"
                    hx-target="closest tr"
                    hx-swap="outerHTML">Delete</button></td>
      </tr>
      {{end}}
    </table>

action /tasks/create method POST requires auth
  validate task
  query: insert into task (title, owner)
         values (:title, :current_user.id)
  redirect /tasks

Auth, pagination, htmx search with debounce, inline delete, form validation. It even added the search input on its own. That wasn't in the prompt.

Why I think this happens

Express forces the model to make a lot of decisions. CommonJS or ESM? Which middleware in what order? Prisma or Sequelize or raw queries? Passport or express-session or JWT? EJS or Pug or Handlebars? Each fork is a place where a small model can pick wrong.

Kilnx has one way to do each thing. One keyword for auth, one keyword for pages, one for actions. The model doesn't pick between approaches because there's only one approach. The decision space is so small that even a 1B model mostly gets it right.
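A rough way to put a number on that decision space: count the combinations implied by the Express forks listed above. This treats each fork as an independent choice and ignores middleware ordering, both simplifying assumptions, so the real space is larger.

```python
from math import prod

# The forks named above for Express, one entry per decision.
express_forks = {
    "module system": ["CommonJS", "ESM"],
    "data access": ["Prisma", "Sequelize", "raw queries"],
    "auth": ["Passport", "express-session", "JWT"],
    "templates": ["EJS", "Pug", "Handlebars"],
}

# Kilnx, as described above: one option per decision.
kilnx_forks = {decision: ["built-in"] for decision in express_forks}

express_combinations = prod(len(opts) for opts in express_forks.values())
kilnx_combinations = prod(len(opts) for opts in kilnx_forks.values())

print(express_combinations, kilnx_combinations)
```

Four decisions already give 54 distinct Express setups against a single Kilnx one.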

I don't think this is unique to Kilnx. Any DSL with a tight, regular grammar would probably show the same pattern. My hypothesis is that the surface area of a language predicts how well small models can generate it. I haven't seen anyone optimize for that yet.

What I'd do with this

If you're an indie dev or a solo founder shipping CRUD apps:

A 3B model running locally gives you a 99% structural score on Kilnx with no API costs, no internet, no privacy concerns. The 7B hits 100%. You don't need to send your code to OpenAI to get a working backend.

If you're using a paid API, the token reduction over a session (up to 71% against Node.js at 100 tasks) adds up fast. Especially if you're iterating on features all day.

If you're just curious, the whole language is 27 keywords. You can read the grammar in 10 minutes.

Links

curl -fsSL https://raw.githubusercontent.com/kilnx-org/kilnx/main/install.sh | sh
kilnx run app.kilnx

I built an MCP Server that lets Claude manage your Substack

André Ahlert — Sat, 14 Mar 2026 14:07:52 +0000

The Substack web UI is fine for casual use, but if you're a power user who publishes daily, manages engagement, and wants to automate interactions, you need something faster.

I built @postcli/substack, a tool that gives you three interfaces to Substack:

1. CLI - Direct commands from your terminal

postcli-substack notes publish "Shipping fast from the terminal"
postcli-substack posts list --limit 5
postcli-substack feed --tab for-you

2. TUI - Full interactive terminal UI with 6 tabs

postcli-substack tui

Navigate with j/k or arrow keys, scroll with mouse wheel, open posts in browser with 'o'. It's keyboard-driven and fast.

3. MCP Server - 16 tools for AI agents

{
  "mcpServers": {
    "substack": {
      "command": "postcli-substack",
      "args": ["mcp"]
    }
  }
}

Tell Claude "like back everyone who liked my last note" and it just works.

The automation engine

The part I'm most proud of is the automation engine. It uses SQLite to track processed entities (no duplicate actions) and supports triggers like:

Someone likes your note → auto-like their latest note back
New note from specific authors → auto-like or restack
New post from specific publications → auto-restack
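A minimal sketch of the "no duplicate actions" idea, written in Python with the standard sqlite3 module. The table name, columns, and function names are hypothetical; the actual tool is a Node package and its schema is not shown here.

```python
import sqlite3

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the tracking database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS processed (
               trigger_name TEXT NOT NULL,
               entity_id    TEXT NOT NULL,
               PRIMARY KEY (trigger_name, entity_id)
           )"""
    )
    return conn

def claim(conn: sqlite3.Connection, trigger_name: str, entity_id: str) -> bool:
    """Return True only the first time this (trigger, entity) pair is seen."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed (trigger_name, entity_id) VALUES (?, ?)",
        (trigger_name, entity_id),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 when the row already existed

def run_trigger(conn, trigger_name, entity_ids, action):
    """Run `action` once per new entity and return the ids acted on."""
    performed = []
    for entity_id in entity_ids:
        if claim(conn, trigger_name, entity_id):
            action(entity_id)
            performed.append(entity_id)
    return performed

conn = open_store()
liked = []
first = run_trigger(conn, "like-back", ["n1", "n2", "n1"], liked.append)
second = run_trigger(conn, "like-back", ["n2", "n3"], liked.append)
print(first, second)
```

The composite primary key does the deduplication, so a restart with the same database file skips everything already handled. One design trade-off in this sketch: the entity is claimed before the action runs, so a failed action is not retried.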

Auth without API keys

Substack doesn't have a public API. Auth works either by extracting your existing Chrome session cookies (AES-128-CBC decryption) or through manual cookie entry.

postcli-substack auth login

Install

npm install -g @postcli/substack

89 tests. CI on Node 18/20/22. Open source (AGPL-3.0).

GitHub: https://github.com/postcli/substack
NPM: https://www.npmjs.com/package/@postcli/substack