<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sebastian Schürmann</title>
    <description>The latest articles on Forem by Sebastian Schürmann (@sebs).</description>
    <link>https://forem.com/sebs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F20049%2Fdfdb5f55-23bd-4bc1-9605-bd548fc3b62d.jpeg</url>
      <title>Forem: Sebastian Schürmann</title>
      <link>https://forem.com/sebs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sebs"/>
    <language>en</language>
    <item>
      <title>Local AI Will Save Us All (The Math Says So, Trust Me)</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Wed, 15 Apr 2026 14:05:13 +0000</pubDate>
      <link>https://forem.com/sebs/local-ai-will-save-us-all-the-math-says-so-trust-me-4m22</link>
      <guid>https://forem.com/sebs/local-ai-will-save-us-all-the-math-says-so-trust-me-4m22</guid>
      <description>&lt;p&gt;Every few weeks a take goes viral in tech circles making the case for ditching cloud AI and running models locally. The argument is always roughly the same: cloud costs add up, your data is being shipped to American servers of dubious legal standing, and a one-time GPU purchase pays for itself in 18 months. Bold claim. Simple math. Lots of hashtags.&lt;/p&gt;

&lt;p&gt;It deserves a closer look.&lt;/p&gt;

&lt;p&gt;The typical version of this argument runs something like: two RTX PRO 6000 Blackwells, 1,200W draw, six hours a day, €0.32 per kWh — "about €48/month" in electricity. The cards themselves cost around €16,000. Cloud AI, by comparison, runs €100–200 per developer per month. Eight developers, 18 months, done.&lt;/p&gt;

&lt;p&gt;Except the electricity bill is already wrong. &lt;strong&gt;1.2 kW × 6h × 30 days × €0.32 = €69.12.&lt;/strong&gt; Not €48. A 44% error in the opening calculation of an argument whose entire appeal is rigorous arithmetic.&lt;/p&gt;

&lt;p&gt;The break-even math has bigger problems. €100–200/month per developer implies roughly 20 million tokens consumed per person per month. That is not a power user. That is a token foundry. For any team using AI at normal human rates, the break-even slides quietly past two years — by which point the GPU generation is already dated.&lt;/p&gt;
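&lt;p&gt;The arithmetic is simple enough to check in a few lines. The sketch below uses only the figures quoted above; the half-spend scenario is a hypothetical illustration of lower usage, not a measured number.&lt;/p&gt;

```python
# Back-of-envelope check of the local-AI cost claim, using the
# figures quoted in the argument above (not measured values).

GPU_COST_EUR = 16_000      # two RTX PRO 6000 Blackwells
POWER_KW = 1.2             # sustained draw under load
HOURS_PER_DAY = 6
DAYS_PER_MONTH = 30
EUR_PER_KWH = 0.32
DEVELOPERS = 8

# Monthly electricity: kW x hours/day x days/month x price/kWh
electricity = POWER_KW * HOURS_PER_DAY * DAYS_PER_MONTH * EUR_PER_KWH

# Break-even month m solves:
#   DEVELOPERS * cloud_per_dev * m = GPU_COST_EUR + electricity * m
def break_even_months(cloud_per_dev: float) -> float:
    return GPU_COST_EUR / (DEVELOPERS * cloud_per_dev - electricity)

print(f"electricity/month: EUR {electricity:.2f}")               # 69.12, not 48
print(f"at claimed spend (EUR 150/dev): {break_even_months(150):.1f} months")  # ~14
print(f"at half the spend (EUR 75/dev): {break_even_months(75):.1f} months")   # ~30
```

At the claimed per-developer cloud spend the 18-month story roughly holds; halve the usage and the break-even is already past two and a half years, before cooling, labor, or a spare card enter the spreadsheet.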

&lt;p&gt;The €16,000 hardware figure also never travels with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cooling.&lt;/strong&gt; 1,200W sustained is a serious heat load. Office HVAC was not designed for this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Labor.&lt;/strong&gt; Keeping local model infrastructure running — version management, security patches, prompt compatibility across model updates — is real engineering work that doesn't appear in these spreadsheets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hardware failure.&lt;/strong&gt; Cloud providers have SLAs. Your server closet does not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Noise.&lt;/strong&gt; Two RTX PRO 6000 Blackwells under full load exceed 50 dB — a loud dishwasher, sustained, all day. In a dedicated server room, fine. In a shared office, your colleagues will have opinions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability.&lt;/strong&gt; The RTX PRO 6000 Blackwell is a new, high-demand professional card with constrained supply and multi-week lead times. If one card fails, you are not buying a replacement over the weekend. You wait — potentially a month or more. Keeping a spare sounds prudent; that spare costs another ~€8,000 and is equally hard to source. A single-point-of-failure setup with no redundancy and a six-week replacement window is not infrastructure. It is optimism.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where the Argument Has a Point
&lt;/h2&gt;

&lt;p&gt;Data sovereignty is real. GDPR compliance for third-country data transfers is genuinely complex, vendor terms change, and strategic dependence on external model providers is a risk that tends to get underweighted until it isn't. The upfront capital requirement is the actual barrier for most teams, not the long-run economics.&lt;/p&gt;

&lt;p&gt;But the most important question gets skipped entirely: &lt;strong&gt;is the local model actually as good?&lt;/strong&gt; Two Blackwells with 192GB VRAM can run serious open-weight models — this is not a toy setup. But if developers need two or three attempts to get what a frontier cloud model produces in one, the labor savings evaporate and the break-even never arrives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Local AI infrastructure can make sense — for teams with heavy, sensitive workloads, strong in-house ops capability, and the capital to do it properly, including redundancy, cooling, and the realistic assumption that hardware will occasionally fail at inconvenient times.&lt;/p&gt;

&lt;p&gt;What it is not is a simple 18-month arbitrage available to anyone with a GPU and a spreadsheet.&lt;/p&gt;

&lt;p&gt;The sovereignty argument is the strongest card in the deck. Lead with that. The cost argument needs a lot more columns in the spreadsheet before it holds up.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mba</category>
      <category>operations</category>
    </item>
    <item>
      <title>Down the Rabbit Hole: Building the Reference List for the Pair-Programming Book</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Mon, 13 Apr 2026 11:20:07 +0000</pubDate>
      <link>https://forem.com/sebs/down-the-rabbit-hole-building-the-reference-list-for-the-pair-programming-book-367n</link>
      <guid>https://forem.com/sebs/down-the-rabbit-hole-building-the-reference-list-for-the-pair-programming-book-367n</guid>
      <description>&lt;p&gt;There's a particular kind of humbling that happens when you sit down to write a book and realize you need to actually &lt;em&gt;read&lt;/em&gt; the papers you've been casually citing for years.&lt;/p&gt;

&lt;p&gt;That's more or less where I found myself when I started assembling the reference list for the Pair Programming Book. What started as "I'll just gather the key papers" turned into a months-long excavation through decades of software engineering research. The current estimate: somewhere between 250 and 500 relevant papers. And counting.&lt;/p&gt;

&lt;p&gt;Here's what that journey looked like.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Papers You Know But Haven't Read
&lt;/h2&gt;

&lt;p&gt;Every field has its citation folklore — papers so frequently referenced that they've achieved the status of common knowledge without anyone actually opening them. Pair programming research is no exception.&lt;/p&gt;

&lt;p&gt;I had a mental list of "classics" I'd been nodding at for years. Williams et al., 2000. Cockburn and Williams. The early XP studies. I knew their conclusions the way you know the plot of a movie you've never seen — through cultural osmosis, hallway conversations, and abstracts alone.&lt;/p&gt;

&lt;p&gt;Actually reading them was a different experience. Some held up beautifully. Others were more nuanced, more conditional, more &lt;em&gt;contested&lt;/em&gt; than the canonical summary suggested. A few conclusions that had calcified into "everyone knows that pair programming does X" turned out to rest on a single study with 41 undergraduates.&lt;/p&gt;

&lt;p&gt;The lesson: citation chains in a young field are fragile things. You owe it to your readers — and yourself — to go back to the source.&lt;/p&gt;

&lt;h2&gt;
  
  
  Laurie Williams Deserves a Prize
&lt;/h2&gt;

&lt;p&gt;If pair programming research has a GOAT, it is, without question, &lt;strong&gt;Laurie Williams&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The sheer volume of rigorous, foundational work she has produced on the subject is staggering. While others were still debating whether pair programming was a gimmick, Williams was running controlled studies, developing frameworks, and building the empirical case that made the whole conversation possible. Decade after decade.&lt;/p&gt;

&lt;p&gt;Writing this book without her work would be like writing about relativity and hoping Einstein doesn't come up. She doesn't just appear in the bibliography — she &lt;em&gt;is&lt;/em&gt; a substantial portion of it.&lt;/p&gt;

&lt;p&gt;If there is ever a formal prize for contributions to software engineering research, the pair programming category should be named after her.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Questionable Corners of the Literature
&lt;/h2&gt;

&lt;p&gt;Not every paper in the pile earned its place gracefully.&lt;/p&gt;

&lt;p&gt;Some announced themselves with titles that made me wince before I even opened the PDF. You know the genre. A combination of buzzwords, a forced acronym, and a vague promise of insight that the abstract doesn't quite deliver on. I won't name names. But I have a folder.&lt;/p&gt;

&lt;p&gt;More substantively: a surprising amount of pair programming research is built on frameworks that the broader scientific community has quietly retired. &lt;strong&gt;Personality type taxonomies&lt;/strong&gt; are the main offender. Myers-Briggs in particular makes repeated appearances — studies earnestly classifying programmers into 16 types and drawing conclusions about pairing compatibility. The problem is that the psychometric foundation for these instruments has been thoroughly undermined. They're not useless as casual conversation tools, but basing empirical research claims on them is shaky ground.&lt;/p&gt;

&lt;p&gt;The same applies to some of the "introvert vs. extrovert" dichotomy work, which tends to treat personality as a binary switch rather than the distributed, context-dependent trait that modern personality psychology describes.&lt;/p&gt;

&lt;p&gt;This doesn't mean the research is worthless — often the observations are real even when the interpretive framework is suspect. But it does mean a lot of careful reading, and a lot of footnotes that essentially say: &lt;em&gt;the finding is interesting, the taxonomy it's hung on is not.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What 250–500 Papers Looks Like
&lt;/h2&gt;

&lt;p&gt;It looks like a lot of tabs.&lt;/p&gt;

&lt;p&gt;It also looks, honestly, like a field that is richer and more contested than its popular summary suggests. Pair programming is not simply "proven effective" or "proven ineffective." The evidence is contextual, domain-specific, experience-level-dependent, and shaped enormously by how you define and measure "effective" in the first place.&lt;/p&gt;

&lt;p&gt;That complexity is exactly why the book needs to exist. The practitioner literature tends toward confident prescriptions. The academic literature is full of hedges, replications, and contradictions that rarely make it into the conference talk or the blog post.&lt;/p&gt;

&lt;p&gt;The reference list is the honest accounting of that complexity. Every citation is a commitment: &lt;em&gt;I looked at this, I understand what it claims, and I'm representing it faithfully.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That's the job. It's slower than I expected. It's also more interesting.&lt;/p&gt;

</description>
      <category>pairprogramming</category>
      <category>writing</category>
      <category>research</category>
    </item>
    <item>
      <title>From Cardboard to Code</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Fri, 10 Apr 2026 22:43:15 +0000</pubDate>
      <link>https://forem.com/sebs/from-cardboard-to-code-29d5</link>
      <guid>https://forem.com/sebs/from-cardboard-to-code-29d5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;The design challenge isn't understanding board games. It's turning prose rules into structures a software team can actually act on.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are thousands of board games. Most of them contain fascinating design work: carefully balanced economies, elegant interaction models, loop structures refined over years of playtesting. Almost none of them exist as digital games. The barrier is real work — translating a 40-page rulebook into a game design document, a feature backlog, an architecture diagram, user stories — before a single line of code is written.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sebs/ruleforge" rel="noopener noreferrer"&gt;RuleForge&lt;/a&gt; automates that translation. You hand it a PDF. It hands you a developer bundle.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it actually does
&lt;/h2&gt;

&lt;p&gt;At its core, RuleForge is a suite of Claude Code slash commands stored in a &lt;code&gt;.claude/commands/&lt;/code&gt; directory. Each command is a focused AI workflow targeting one specific phase of the board-game-to-digital-game translation process. They can be run individually, or chained together through the main &lt;code&gt;/ruleforge&lt;/code&gt; pipeline command.&lt;/p&gt;

&lt;p&gt;The full pipeline runs 16 stages and produces a self-contained output directory scoped to the game — something like &lt;code&gt;output/catan/&lt;/code&gt; or &lt;code&gt;output/terraforming-mars/&lt;/code&gt; — filled with structured files ready for a development team.&lt;/p&gt;




&lt;h2&gt;
  
  
  The pipeline, step by step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;/complexity-estimate&lt;/code&gt; — Quick pre-flight scan&lt;/strong&gt;&lt;br&gt;
Before committing to the full pipeline, get a fast complexity estimate. How long is the rulebook? How many mechanics? Is this a 20-minute job or a 2-hour one?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. &lt;code&gt;/ruleforge&lt;/code&gt; — Full pipeline, PDF to developer bundle&lt;/strong&gt;&lt;br&gt;
The main event. Extracts rules, identifies mechanics, generates the game loop diagram, writes the GDD, builds the feature list, creates user stories, outputs architecture diagrams. Resumable if interrupted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. &lt;code&gt;/card-database&lt;/code&gt; + &lt;code&gt;/economy-flow&lt;/code&gt; — Domain-specific extraction&lt;/strong&gt;&lt;br&gt;
Card-heavy games need their component databases structured. Economy-driven games need their resource flows mapped — sources, sinks, conversions. These commands go deeper on those specific concerns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. &lt;code&gt;/accessibility-audit&lt;/code&gt; — Check for digital barriers&lt;/strong&gt;&lt;br&gt;
Audits the extracted design across five accessibility dimensions: visual, motor, cognitive, hearing, and communication. Digital ports are an opportunity to do better than the physical original.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. &lt;code&gt;/realtime-forge&lt;/code&gt; — Translate to interactive game design&lt;/strong&gt;&lt;br&gt;
The big leap. Takes the RuleForge output and translates it into a real-time or interactive digital game design — covering analysis, a revised GDD, architecture, balance sheets, asset specifications, and prototype prompts. Seven waves, roughly 30 output files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. &lt;code&gt;/dev-bundle&lt;/code&gt; — Validate and package&lt;/strong&gt;&lt;br&gt;
Validates all output files including Mermaid diagram syntax, checks for completeness, and packages everything into a clean bundle ready to hand off.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full command library
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Extraction &amp;amp; Analysis
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/extract-rules&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parse and summarize the rules from a PDF. The raw input layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/identify-mechanics&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Classify game mechanics across 25 standard types — Worker Placement, Deck Building, Area Control, Engine Building, and so on.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/game-loop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate a Mermaid diagram of atomic, primary, secondary, and tertiary game loops.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/validate-loop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Check the game loop for structural soundness and state reachability. Catches design dead ends.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/adaptation-gap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Report on how much work a digital port actually requires — No Change / Simple Adaptation / Redesign.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/flag-ambiguities&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Surface rules that are unclear, contradictory, or likely to cause bugs when implemented.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/confidence-score&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Self-assessment of extraction quality. Useful for knowing when to do a manual review.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Design &amp;amp; Documentation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/generate-gdd&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full Game Design Document. Chunked automatically for complex games.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/balance-sheet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extract balance parameters with digital annotations and sensitivity analysis.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/feature-list&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prioritized feature list output as both CSV and Markdown, with a dependency diagram.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/user-stories&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;User stories with granularity selector and acceptance criteria. Outputs to Stories.csv.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/onboarding-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tutorial and onboarding flow design — how a new player learns the game digitally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/interaction-model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Component interaction model — how game entities relate to and affect each other.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Architecture &amp;amp; Prototyping
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/architecture-diagram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;System architecture in Mermaid. Supports Unity, Godot, Phaser, Web, or generic targets.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/prototype-prompts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;AI prototyping prompts for Rosebud, v0, Bolt, Lovable, or generic tooling.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/economy-flow&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Resource economy diagram — where resources come from, where they go, and how they convert.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/card-database&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extracts individual card, tile, or component data into a structured database.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Standalone Utilities
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/game-mixer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Blend mechanics from two or more games into hybrid designs, with iteration support.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/decompose-idea&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Break down a game idea using a 7-category ludemic framework.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/ludeme-generator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Generate a Ludii game description file (.lud) from a concept.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/game-fitness&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Analyze a game concept across 6 fitness dimensions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/playtest-design&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Design an automated playtesting plan with fitness functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/procedural-generator&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Design procedural generation systems using the Watson et al. (2008) workflow.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/game-comparison&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Side-by-side comparison of two RuleForge extractions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/pdf-to-markdown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Convert any PDF to clean, well-structured Markdown. Useful as a standalone tool.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The output structure
&lt;/h2&gt;

&lt;p&gt;Every command writes into a game-scoped directory under &lt;code&gt;output/&lt;/code&gt;. The game slug is derived automatically from the title in the rulebook. A &lt;code&gt;.context.json&lt;/code&gt; metadata file lets downstream commands pick up where upstream ones left off — that's what makes the pipeline resumable.&lt;/p&gt;
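&lt;p&gt;As a purely hypothetical illustration (the real schema is whatever the repo defines; every field name here is made up), such a context file might record something like:&lt;/p&gt;

```json
{
  "game": "terraforming-mars",
  "source_pdf": "terraforming-mars-rules.pdf",
  "completed_stages": ["extract-rules", "identify-mechanics", "game-loop"],
  "last_updated": "2026-04-10T22:00:00Z"
}
```

A downstream command can then skip the stages already listed instead of re-running the whole pipeline from the PDF.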

&lt;p&gt;A typical output for something like Terraforming Mars would contain a GDD, a feature CSV, a user stories CSV, Mermaid files for the game loop and architecture, a balance sheet, an onboarding flow design, and prototype prompts ready to paste into your AI prototyping tool of choice.&lt;/p&gt;




&lt;h2&gt;
  
  
  The solo dungeon bash
&lt;/h2&gt;

&lt;p&gt;The repository also ships a &lt;code&gt;solo-dungeon-bash/&lt;/code&gt; directory — a worked example of the pipeline in action on a solo dungeon-crawl game. It's useful both as a reference output and as a test case to understand what the extraction quality actually looks like on a real game with real rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why slash commands, not a CLI tool?
&lt;/h2&gt;

&lt;p&gt;This is a deliberate choice. Claude Code's slash command system makes each step conversational and inspectable. You can run &lt;code&gt;/identify-mechanics&lt;/code&gt;, read the output, decide the model missed a nuance, correct it manually, and then continue with &lt;code&gt;/game-loop&lt;/code&gt;. That feedback loop would be much harder to preserve in a fully automated CLI pipeline.&lt;/p&gt;

&lt;p&gt;It also means the tool is essentially zero-setup. Clone the repo, point Claude Code at the directory, and the commands are available. No build step, no package install, no configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  Get started
&lt;/h2&gt;

&lt;p&gt;The project is on GitHub at &lt;a href="https://github.com/sebs/ruleforge" rel="noopener noreferrer"&gt;github.com/sebs/ruleforge&lt;/a&gt;. Clone it, drop a rulebook PDF next to it, and start with &lt;code&gt;/complexity-estimate path/to/your-game.pdf&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The design is intentionally modular — you don't have to run the full pipeline. If you just need a GDD from a rulebook, run &lt;code&gt;/generate-gdd&lt;/code&gt;. If you want to compare two games, run &lt;code&gt;/game-comparison&lt;/code&gt;. Each command is independently useful.&lt;/p&gt;

&lt;p&gt;Board games are some of the most densely designed interactive systems humans have made. RuleForge is a bet that those designs are worth bringing into software — and that AI can do a lot of the translation work.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>gamedev</category>
      <category>gamedesign</category>
    </item>
    <item>
      <title>Leading With "I Don't Know"</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Mon, 30 Mar 2026 19:46:23 +0000</pubDate>
      <link>https://forem.com/sebs/leading-with-i-dont-know-324h</link>
      <guid>https://forem.com/sebs/leading-with-i-dont-know-324h</guid>
      <description>&lt;p&gt;&lt;em&gt;A powerful thing a tech lead can say isn't an answer. It's an honest admission — about your team's code, about AI's trajectory, about a world in crisis — followed by the only thing that matters: what you do next.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There's a version of tech leadership that never actually exists but haunts every leader anyway: the person who has seen every edge case, knows where the technology is heading, understands the macro forces shaping the business, and fields every question with calm, grounded certainty.&lt;/p&gt;

&lt;p&gt;It's a fiction. And quietly chasing it is one of the most corrosive things a leader can do.&lt;/p&gt;

&lt;p&gt;The real job — leading developers through ambiguous problems, positioning teams in the face of transformative technology, making business decisions while the world keeps breaking in unpredictable ways — requires a completely different posture. It starts with saying three words without flinching: &lt;strong&gt;I don't know.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Leaders Resist Saying It
&lt;/h2&gt;

&lt;p&gt;The fear is understandable. You got the role because you were sharp. Your team looks to you. Admitting ignorance feels like handing back your credentials in front of everyone who gave them to you.&lt;/p&gt;

&lt;p&gt;But engineers are a perceptive group. They know when an answer is being improvised. They can feel the difference between grounded confidence and performed certainty. And nothing erodes trust faster than a leader who bluffs — especially when it costs the team direction, time, or morale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Faking knowledge doesn't protect your authority. It slowly transfers it to whoever actually knows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Admitting "I don't know" is one of the highest-signal things a leader can do. It tells your team that you operate in reality — that their trust is well-placed because you won't lead them off a cliff to protect your ego.&lt;/p&gt;

&lt;p&gt;But the admission isn't the end. It's an opening move. Three areas, in particular, are where honest not-knowing is most consequential right now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Your Team: The Daily Not-Knowing
&lt;/h2&gt;

&lt;p&gt;At the most immediate level, this is about the problems that land on your desk every morning: the architectural decision someone needs a call on, the production incident whose root cause is unclear, the technical direction your team is asking you to set on a system you haven't touched in six months.&lt;/p&gt;

&lt;p&gt;In this context, "I don't know" is a team-safety tool. When a lead normalises it, developers stop pretending too. They surface problems earlier. They ask questions instead of grinding silently for two hours. They admit blockers instead of heroically absorbing them.&lt;/p&gt;

&lt;h3&gt;
  
  
  When you don't know → concrete alternatives
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Move&lt;/th&gt;
&lt;th&gt;What it sounds like&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Name who does know&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"I don't know, but Sarah has been closest to that service — let's pull her in." Directing to expertise is leadership, not deferral.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Define the investigation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"I don't know, but I think the answer is in the caching config. Can we spike on it this afternoon?" Turn fog into a task.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Reason out loud together&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"I don't know — walk me through what you're seeing and let's think it through." Your value isn't always the answer; it's the thinking process.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Surface the systemic gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"I don't know — and that's a signal we have a documentation problem worth fixing." Use your ignorance diagnostically.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The whole team starts operating in reality rather than in the performance of competence — and reality, however uncomfortable, is a much better place to build software.&lt;/p&gt;




&lt;h2&gt;
  
  
  AI: The Impact No One Can Honestly Quantify
&lt;/h2&gt;

&lt;p&gt;Then there's the larger question your CTO, your board, your reports, and your peers are all asking — the one that gets dressed up in confident slides and frameworks but remains stubbornly, genuinely open: &lt;em&gt;what does AI actually do to how we build software, and how we work?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The honest answer, right now, is that nobody knows.&lt;/p&gt;

&lt;p&gt;We have data points. AI coding assistants measurably change output velocity in some contexts. Some categories of junior tasks look automatable; others that seemed automatable turned out to require more human judgment than assumed. Certain roles are being restructured; others are being amplified. The second- and third-order effects — on team structure, on the value of different skills, on how we hire and what seniority means — are genuinely unresolved.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Any leader who tells you they know exactly what AI will do to their team in two years is either guessing confidently or selling something.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not a reason for paralysis. It's a reason for a particular kind of leadership: one that acknowledges the uncertainty explicitly, moves deliberately rather than reactively, and builds in the organisational capacity to adapt as clarity arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  What "I don't know" looks like on AI strategy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run time-boxed experiments with clear hypotheses instead of committing to wholesale transformations based on hype&lt;/li&gt;
&lt;li&gt;Tell your team honestly: "We're going to try this, observe what changes, and adjust — we're not doing a big bet we can't reverse"&lt;/li&gt;
&lt;li&gt;Resist pressure to make confident AI roadmap calls purely for optics; say "we're still learning" to stakeholders when that's true&lt;/li&gt;
&lt;li&gt;Watch the teams two years ahead of you on adoption and study what they're actually saying now vs. what they said then&lt;/li&gt;
&lt;li&gt;Invest in the capabilities that remain valuable regardless of how AI develops: systems thinking, communication, judgment under ambiguity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The leaders doing the most honest, useful work on this right now are the ones who've stopped trying to predict AI's impact and started building teams good at navigating whatever it turns out to be.&lt;/p&gt;




&lt;h2&gt;
  
  
  The World on Fire: Crises That Reach Your Sprint Board
&lt;/h2&gt;

&lt;p&gt;And then there's everything else.&lt;/p&gt;

&lt;p&gt;Tech leads used to be able to bracket the world's problems at the office door. That boundary has been dissolving for years — and in the current moment, it's essentially gone. Supply chain shocks affect infrastructure budgets. Geopolitical instability affects where you can hire and what data sovereignty rules apply to your systems. Economic turbulence reshapes what your company thinks the engineering team should be building. Social crises affect your team members directly, and those team members expect leadership to notice.&lt;/p&gt;

&lt;p&gt;None of this has clean answers. The honest position is that most leaders — most people — don't know how these crises resolve, what the downstream business effects will be, or exactly what the right response is. Pretending otherwise doesn't help your team. It insults their intelligence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Your team doesn't need you to have solved geopolitics. They need to know you're not pretending it isn't happening.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Crisis uncertainty → alternatives to false confidence
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Move&lt;/th&gt;
&lt;th&gt;What it looks like&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;01&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Name the uncertainty in planning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build explicit contingency into roadmaps. "This timeline assumes current conditions hold; here's our branch if they don't." Uncertainty acknowledged is uncertainty managed.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Separate "I don't know" from "we're watching"&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distinguish between things you're genuinely uncertain about and things you're actively monitoring. Give your team a sense of the signals you're tracking even when you can't give conclusions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;03&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Acknowledge impact without performing solutions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When crises affect team members directly, you don't need a policy or a fix. Sometimes "I see this is real and I don't have the answers" is more valuable than a five-point plan.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;04&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Influence decisions above you with honest data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When the business is being steered by false certainty about external conditions, your job is to put accurate uncertainty on the table — even when that's unwelcome. Especially then.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;05&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Prioritise reversibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;When the environment is genuinely unpredictable, bias toward decisions that can be undone. Make "how reversible is this?" a standard question in planning when the context is volatile.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Second Half of the Sentence
&lt;/h2&gt;

&lt;p&gt;Across all three registers — your team, the technology, the world — the structure is the same. "I don't know" is never the complete sentence. It's always followed by momentum.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I don't know — but here's how we're going to move anyway."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That pivot is everything. You're not outsourcing the problem or performing helplessness. You're modelling how a technically mature, psychologically honest person handles uncertainty: they acknowledge it, then they act on it. They find what they can know. They reduce the blast radius of what they can't. They keep moving.&lt;/p&gt;

&lt;p&gt;There's a kind of confidence that doesn't depend on having the answers. It's the confidence that comes from trusting your ability to navigate uncertainty — to find information, connect people, ask the right questions, and make reasonable calls under ambiguity. That's the confidence your team needs from you. Not omniscience. Not a human forecast engine.&lt;/p&gt;

&lt;p&gt;Just someone who can say "I don't know where we are" without panicking — and then get out the compass.&lt;/p&gt;




&lt;p&gt;The best leads I've worked with share one trait: they made it feel completely ordinary to not have an answer. And they made it equally obvious that not having one was never the end of the story. Just the beginning of figuring it out together.&lt;/p&gt;

&lt;p&gt;That combination — honesty first, momentum second — is what leading a team looks like when the world keeps changing faster than any of us can confidently predict. Which, as far as I can tell, is the world we're in now.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>leadership</category>
      <category>crisismode</category>
    </item>
    <item>
      <title>Your build pipeline is not your trust boundary</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Tue, 24 Mar 2026 16:27:07 +0000</pubDate>
      <link>https://forem.com/sebs/your-build-pipeline-is-not-your-trust-boundary-1bnn</link>
      <guid>https://forem.com/sebs/your-build-pipeline-is-not-your-trust-boundary-1bnn</guid>
      <description>&lt;p&gt;Some teams deploying software to AWS have two registries and think of them as a logistics detail. One holds what came out of CI. The other holds what goes into production. The relationship between those two things — the decision about what is allowed to cross from one into the other, and who makes that decision, and what happens when the answer is no — is not a logistics detail. It is a security architecture decision, and treating it as anything less is how production incidents happen.&lt;/p&gt;

&lt;p&gt;The bulkhead pattern is old. It comes from naval engineering, where ships are divided into watertight compartments so that flooding in one section does not sink the whole vessel. The insight is that you do not prevent damage by building a perfect hull. You prevent catastrophic loss by limiting how far damage can travel. Software engineers rediscovered this principle independently and applied it to distributed systems, microservices, and fault tolerance. It belongs equally in a deployment pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with a single registry
&lt;/h2&gt;

&lt;p&gt;When your CI pipeline pushes directly to the registry your ECS cluster pulls from, you have made a consequential choice that probably did not feel like a choice. You have decided that the build environment and the production environment share a trust boundary. Anything that can write to your CI pipeline — any engineer, any compromised dependency, any malformed Dockerfile, any branch that passes tests — can, directly or indirectly, place an artifact into the registry that production infrastructure will consume without further scrutiny.&lt;/p&gt;

&lt;p&gt;This is not a theoretical concern. Supply chain attacks against CI systems have become routine. A compromised build dependency installs a malicious binary during the build phase. The resulting image passes your existing image scan if the scanner's definitions are not current, or if the binary is not yet known to the scanner. The image gets tagged and pushed. On the next deploy, ECS pulls it and runs it in your production environment. At no point did anything behave unexpectedly from a pipeline perspective. Every light was green. That is the problem.&lt;/p&gt;

&lt;p&gt;The deeper issue is that a single-registry architecture conflates two fundamentally different questions. The first question is: did this build succeed? The second question is: is this artifact trustworthy enough to run in production? CI answers the first question. Only a deliberate validation gate — one that runs independently of the build environment, with different permissions and different tooling — can answer the second.&lt;/p&gt;

&lt;h2&gt;
  
  
  The structure of a bulkhead deployment
&lt;/h2&gt;

&lt;p&gt;The architecture worth building has four distinct zones, each with clearly scoped responsibilities and explicitly limited permissions between them.&lt;/p&gt;

&lt;p&gt;The first zone is your GitLab CI pipeline. Its job is to build. It runs your tests, compiles your code, assembles your container image, and pushes that image to the GitLab Container Registry. The GitLab registry in this architecture is intentionally treated as ephemeral and untrusted. It is a staging area. Images land there the way packages land on a loading dock: present, but not yet cleared for entry. CI runners have write access to the GitLab registry. They have no access to AWS whatsoever. Not to IAM, not to ECR, not to ECS. If your CI environment is compromised, the blast radius is bounded to the GitLab registry.&lt;/p&gt;

&lt;p&gt;The second zone is the deliver pipeline. This is the bulkhead. It is triggered — on a tag, on a merge to a protected branch, on whatever promotion event your organization has decided represents a release candidate — and its sole purpose is to evaluate whether an image from the GitLab registry is trustworthy enough to enter the AWS trust boundary. It pulls the image, runs validation: vulnerability scanning, signature verification, policy checks, SBOM attestation, whatever your threat model requires. If validation passes, it pushes the image to ECR and tags it with a provenance marker. If validation fails, it stops there. Nothing enters AWS. The deliver pipeline is the only principal in your entire system with write access to ECR.&lt;/p&gt;
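&lt;p&gt;A minimal sketch of what such a deliver job can look like in GitLab CI. The job name, the choice of Trivy as the scanner, and the &lt;code&gt;$ECR_URL&lt;/code&gt; variable are illustrative placeholders, not a prescribed setup:&lt;/p&gt;

```yaml
# Illustrative deliver job — scanner choice, names, and variables are placeholders.
deliver:
  stage: deliver
  rules:
    - if: $CI_COMMIT_TAG          # promote only on release tags
  script:
    # Pull the candidate image from the untrusted staging registry
    - docker pull "$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG"
    # Validate: any critical finding fails the job, and nothing enters AWS
    - trivy image --exit-code 1 --severity CRITICAL "$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG"
    # Only on success: re-tag and push into ECR, the trusted registry
    - aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_URL"
    - docker tag "$CI_REGISTRY_IMAGE:$CI_COMMIT_TAG" "$ECR_URL/app:$CI_COMMIT_TAG"
    - docker push "$ECR_URL/app:$CI_COMMIT_TAG"
```

&lt;p&gt;The important structural property is not the specific commands but the credentials: this job, and only this job, holds both a read credential for the GitLab registry and a write credential for ECR.&lt;/p&gt;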

&lt;p&gt;The third zone is ECR. In this architecture, ECR is not just a faster registry. It is a trust signal. The presence of an image in ECR means exactly one thing: the deliver pipeline evaluated it and cleared it. No image arrives in ECR through any other path. Your ECS tasks can therefore pull from ECR with confidence that the contents were not placed there by a CI runner, a developer with elevated credentials, or an automated process that bypassed validation. ECR's access policy reflects this: the deliver pipeline can write, ECS task roles can read, and nothing else has write access.&lt;/p&gt;
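&lt;p&gt;That access model can be stated directly in the repository policy. An illustrative ECR repository policy sketch — the account ID and role names are placeholders:&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DeliverPipelineCanPush",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/deliver-pipeline" },
      "Action": [
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload"
      ]
    },
    {
      "Sid": "EcsTasksCanPull",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::123456789012:role/ecs-task" },
      "Action": [
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability"
      ]
    }
  ]
}
```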

&lt;p&gt;The fourth zone is the deploy pipeline and ECS cluster. The deploy pipeline runs inside AWS, typically on a runner with an IAM role scoped to the specific ECS actions it needs. It reads from ECR, updates the task definition, and triggers a rolling deployment. It has no awareness of GitLab's registry. It does not cross back outside the AWS trust boundary for any artifact. The deployment is entirely self-contained within the environment it controls.&lt;/p&gt;
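&lt;p&gt;Sketched as a pipeline job (cluster and service names are placeholders), the deploy stage needs nothing beyond its scoped AWS credentials:&lt;/p&gt;

```yaml
# Illustrative deploy job — cluster and service names are placeholders.
deploy:
  stage: deploy
  environment: production
  script:
    # The runner's IAM role is scoped to the ECS describe/update actions it
    # needs. It holds no GitLab registry credentials and never handles the
    # artifact itself — ECS pulls the image from ECR.
    - aws ecs update-service --cluster prod --service web --force-new-deployment
```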

&lt;h2&gt;
  
  
  Why the boundary placement matters
&lt;/h2&gt;

&lt;p&gt;You could draw the bulkhead in a different place. You could run validation inside the CI pipeline, before the push to GitLab's registry, and use a single registry throughout. Many teams do this. It is better than no validation at all. But it is not a bulkhead. A bulkhead only works if the compartments it separates are genuinely isolated — if flooding one compartment cannot automatically flood the other. Validation that runs inside the same environment as the build is subject to all the same compromises as the build. A malicious package can interfere with test execution. A malicious script can tamper with scanner output. The environment in which validation runs cannot be the same environment that produced the artifact being validated, if you want the validation to mean anything.&lt;/p&gt;

&lt;p&gt;The deliver pipeline solves this because it runs in a clean context with no dependency on the build environment. It does not trust the image. It does not trust the metadata the build produced. It pulls the image, treats it as an opaque artifact of unknown provenance, and evaluates it from scratch. The only thing it takes on faith is that the image digest it pulls from the GitLab registry corresponds to what CI claims to have built — and even that can be addressed with build attestation and signed manifests if your threat model demands it.&lt;/p&gt;

&lt;p&gt;There is also an operational argument separate from the security argument. When validation and promotion are separated from build, you can change your validation requirements without touching your build configuration. You can introduce a new scanner, tighten a policy, or add a new required attestation by changing the deliver pipeline. CI keeps running the same way it always has. The operational surface of security changes shrinks considerably.&lt;/p&gt;

&lt;h2&gt;
  
  
  Permissions as documentation
&lt;/h2&gt;

&lt;p&gt;One of the most underappreciated properties of this architecture is what the permission model tells you. When you look at your IAM policies and your GitLab CI variable scopes, the structure of your trust boundaries is legible. GitLab runners have credentials that can push to the GitLab registry. They have nothing in AWS. The deliver pipeline has credentials to read from the GitLab registry and write to ECR. ECS task roles can read from ECR. The deploy pipeline can describe and update ECS services. Nothing has more than it needs. Nothing can reach across a zone boundary it has no business crossing.&lt;/p&gt;

&lt;p&gt;This matters because permissions-as-documentation is honest in a way that comments and runbooks are not. Runbooks say what is supposed to be true. IAM policies say what is actually true. When your access model is correctly scoped, reading it is equivalent to reading the architecture. When your access model has accumulated scope over time — when CI runners have ECR write access because someone needed to debug something once and never cleaned it up — the permissions tell you that the architecture has quietly collapsed. The bulkhead no longer holds because the compartments are no longer sealed.&lt;/p&gt;

&lt;p&gt;Keeping the permission model clean is not just security hygiene. It is architectural discipline. Every time you are tempted to give a component access to something outside its designated zone — to let CI push directly to ECR "just this once," to give the deploy pipeline GitLab credentials "because it's easier" — you are being asked to trade architectural clarity for convenience. The answer should almost always be no.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost
&lt;/h2&gt;

&lt;p&gt;This architecture is not free. You have a third pipeline to maintain, with its own failure modes and operational requirements. The deliver pipeline becomes a single point of failure in your promotion path: if it is broken, no image reaches production regardless of how healthy your build and deploy pipelines are. You need to monitor it, alert on it, and be capable of diagnosing failures in it quickly.&lt;/p&gt;

&lt;p&gt;The deliver pipeline also adds latency to your release cycle. Validation takes time. Scans take time. If your threat model requires extensive policy evaluation, the gap between a successful build and a deployable artifact may be measured in minutes rather than seconds. This is usually acceptable, but it is a real tradeoff that your organization needs to make consciously rather than discover in the middle of an incident.&lt;/p&gt;

&lt;p&gt;The answer to both of these costs is not to eliminate the bulkhead. It is to treat the deliver pipeline with the same engineering seriousness as the rest of your infrastructure. It deserves good observability, clear failure messages, documented recovery procedures, and regular testing. A security boundary that cannot be maintained is not actually a security boundary.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;A bulkhead is not a substitute for secure coding practices. An image that passes every validation check you have defined can still contain application-level vulnerabilities. The bulkhead protects you against supply chain compromise in the build environment and enforces a consistent set of standards on every artifact that reaches production. It does not protect you against vulnerabilities you have not checked for or logic errors in your application code.&lt;/p&gt;

&lt;p&gt;A bulkhead is also not a guarantee of immutability. An image that passes validation today may have a vulnerability discovered tomorrow. Your ECR should be configured with immutable tags so that an existing image digest cannot be overwritten, and you should have a process for responding to newly discovered vulnerabilities in images that are already in production. The bulkhead tells you about the state of an artifact at the moment it crossed the boundary. Keeping that assessment current over time is a different problem, requiring different tooling.&lt;/p&gt;
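&lt;p&gt;Tag immutability is a single setting. In Terraform it looks roughly like the following fragment — the repository name is a placeholder:&lt;/p&gt;

```hcl
# Illustrative — repository name is a placeholder.
resource "aws_ecr_repository" "app" {
  name                 = "app"
  image_tag_mutability = "IMMUTABLE"  # an existing tag can never be re-pointed
}
```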

&lt;p&gt;What a bulkhead is, at its most fundamental, is a decision about what it means to trust an artifact. Defining that decision explicitly, embodying it in a pipeline stage with clear inputs and clear outputs, and enforcing it as the mandatory path between your build environment and your production environment — that is the entire value of the pattern. The implementation details matter less than the clarity of the decision. Before you build anything, you should be able to answer: what does it mean for an image to be trustworthy? Who decides? What happens when the answer is no? If those questions have clear answers, you have an architecture. If they do not, you have a pipeline.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>bulkhead</category>
      <category>wellarchitected</category>
      <category>aws</category>
    </item>
    <item>
      <title>From Idea to Implementation-Ready: A Six-Phase Pipeline with Rewelo</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Tue, 17 Mar 2026 19:13:14 +0000</pubDate>
      <link>https://forem.com/sebs/from-idea-to-implementation-ready-a-six-phase-pipeline-with-rewelo-42lp</link>
      <guid>https://forem.com/sebs/from-idea-to-implementation-ready-a-six-phase-pipeline-with-rewelo-42lp</guid>
      <description>&lt;p&gt;Most projects start with a vague idea and a Jira board. The gap between "we should build X" and "here is a fully specified, dependency-ordered, priority-scored backlog ready for sprint planning" is usually traversed through a series of meetings, half-written requirements documents, and optimistic estimates scribbled on sticky notes.&lt;/p&gt;

&lt;p&gt;This post documents a different approach: a six-phase pipeline in which each phase produces structured, machine-readable artifacts that feed directly into the next. The backbone is &lt;a href="https://github.com/sebs/rewelo" rel="noopener noreferrer"&gt;Rewelo&lt;/a&gt; — a CLI and MCP server for relative-weight backlog prioritization built on DuckDB — which transforms the back half of the process from intuition-based ticket-sorting into a transparent, reproducible calculation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The question is not "how do we decide what to build first?" but "how do we make the decision auditable, reversible, and legible to every stakeholder?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The result, across a real product build: 113 scored and tagged tickets, 124 dependency relations organized into five dependency layers, and a backlog that can be re-ranked in seconds when priorities shift.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gherkin feature files&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BDD scenarios&lt;/td&gt;
&lt;td&gt;~160&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scored tickets&lt;/td&gt;
&lt;td&gt;113&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependency relations&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;p&gt;The process is structured into six sequential phases, each with a defined entry condition, a set of artifacts it produces, and a quality gate before the output is accepted downstream. Two phases have explicit iteration loops; three feedback paths run from the final review back into earlier phases.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1 — Vision &amp;amp; Concept
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lock the problem space before touching architecture.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The phase begins with four documents: &lt;code&gt;concept.md&lt;/code&gt; (what and why), &lt;code&gt;elements.mmd&lt;/code&gt; (a Mermaid diagram of the major domain entities), &lt;code&gt;estimates.md&lt;/code&gt; (rough sizing and constraints), and &lt;code&gt;trlc-cheatsheet.md&lt;/code&gt; (a quick-reference for the requirements language used throughout). Before anything moves forward, the four documents undergo a cross-consistency review — checking that the entity model matches the concept, that estimates are grounded in scope, and that the requirements language is applied uniformly.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Artifacts:&lt;/em&gt; &lt;code&gt;concept.md&lt;/code&gt; · &lt;code&gt;elements.mmd&lt;/code&gt; · &lt;code&gt;estimates.md&lt;/code&gt; · &lt;code&gt;trlc-cheatsheet.md&lt;/code&gt; · cross-consistency review&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 2 — Architecture &amp;amp; Specifications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Define how the system actually works, then find the gaps.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the vision locked, &lt;code&gt;architecture.md&lt;/code&gt; documents the major components, their data flows, and any specialized concerns (in this project, a CRDT-to-Git synchronization layer). A gap analysis follows — specifically looking for what is needed to ship a first version versus what is aspirational. Only what passes that bar makes it into the three specification documents: &lt;code&gt;artifact_schemas.md&lt;/code&gt;, &lt;code&gt;api_contract.md&lt;/code&gt;, and &lt;code&gt;operational.md&lt;/code&gt;, which together define the technical surface area that will be tested and implemented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Artifacts:&lt;/em&gt; &lt;code&gt;architecture.md&lt;/code&gt; · gap analysis → v1 · &lt;code&gt;artifact_schemas.md&lt;/code&gt; · &lt;code&gt;api_contract.md&lt;/code&gt; · &lt;code&gt;operational.md&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 3 — Behavioral Specifications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sixteen feature files. ~160 scenarios. One explicit loop.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where requirements become falsifiable. Gherkin scenarios are written for every feature identified in Phase 2 — Given, When, Then triples that can drive automated tests and that make edge-case thinking explicit. A best-practices review is applied to the full scenario set: are scenarios atomic? Are they written from the user's perspective? Do they avoid implementation detail? If the answer to any of these is "no", the feature files are reworked. Only when the scenario set passes the gate does the process continue.&lt;/p&gt;

&lt;p&gt;This is the most iterative phase, and deliberately so — fixing an ambiguous scenario at this stage costs minutes; finding the same ambiguity during implementation costs days.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Artifacts:&lt;/em&gt; 16 feature files · ~160 scenarios · best-practices review · explicit rework loop&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 4 — Decisions &amp;amp; Quality Gates
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Capture the choices so future team members can understand them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Three Architecture Decision Records are authored at this stage: one mapping the technology stack to the problem constraints, one capturing the rationale for the BDD approach, and one documenting the Rewelo integration decision itself. The Definition of Ready and Definition of Done are written here as well — these become the acceptance criteria applied in Phase 6. Finally, reusable templates for User Stories and ADRs are finalized, ensuring that new tickets and decisions added later follow a consistent structure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Artifacts:&lt;/em&gt; ADR: tech mapping · ADR: BDD rationale · ADR: Rewelo · Definition of Ready · Definition of Done · Story and ADR templates&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 5 — Backlog
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;113 tickets. Scored, tagged, and dependency-ordered.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every story derived from the BDD scenarios and architecture documents is entered into Rewelo and scored across four dimensions: Benefit, Penalty, Estimate, and Risk. Tags group tickets by feature, team, and state. Ticket relations — &lt;code&gt;blocks&lt;/code&gt;, &lt;code&gt;depends-on&lt;/code&gt;, &lt;code&gt;relates-to&lt;/code&gt; — are declared explicitly, yielding 124 relations that sort the backlog into five logical dependency layers. The top of the backlog is not a product manager's gut feel; it is the output of a priority formula.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Artifacts:&lt;/em&gt; 113 tickets · B/P/E/R scores · 124 relations · 5 dependency layers&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 6 — Four Amigos Review
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Four perspectives, one approval gate, three feedback paths.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Four Amigos — Product Owner, Developer, QA Engineer, and UX Designer — each review the backlog through their own lens, informed by AI-simulated personas. The PO checks value propositions and acceptance criteria. The Developer flags architecture misalignment and re-scores Estimate and Risk. QA surfaces edge cases and cross-feature risks. The UX designer reviews interaction states, cognitive load, and missing flows. The gate is a four-way approval. If it doesn't pass, three feedback loops are available: back to requirements (for conceptual gaps), back to the feature files (for BDD issues), or back to ticket refinement (for scope or scoring problems).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Personas:&lt;/em&gt; Product Owner · Developer · QA Engineer · UX Designer · three feedback loops&lt;/p&gt;




&lt;h2&gt;
  
  
  Rewelo at the Center
&lt;/h2&gt;

&lt;p&gt;The pipeline would be useful without Rewelo — structured documents and BDD scenarios alone are a meaningful step up from most engineering processes. But the backlog phase is where the approach goes from "disciplined" to "genuinely different."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/sebs/rewelo" rel="noopener noreferrer"&gt;Rewelo&lt;/a&gt; is a CLI and MCP server for relative-weight backlog prioritization. It stores tickets in an embedded DuckDB database — no server required — and calculates a priority score at runtime based on four dimensions, normalized across the full backlog or any tagged subset.&lt;/p&gt;

&lt;p&gt;Each ticket receives four scores on the Fibonacci scale (1, 2, 3, 5, 8, 13, 21):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;B&lt;/strong&gt; — Benefit&lt;/td&gt;
&lt;td&gt;Value delivered by implementing this story&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;P&lt;/strong&gt; — Penalty&lt;/td&gt;
&lt;td&gt;Cost of &lt;em&gt;not&lt;/em&gt; implementing — the downside of deferral&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E&lt;/strong&gt; — Estimate&lt;/td&gt;
&lt;td&gt;Resources required for implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;R&lt;/strong&gt; — Risk&lt;/td&gt;
&lt;td&gt;Uncertainty or complexity in the implementation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At runtime, Rewelo calculates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Value vs Cost, normalized across the backlog
Value    = Benefit + Penalty
Cost     = Estimate + Risk
Priority = Value / Cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher priority means better return on investment. The scores are normalized relative to the whole backlog — or any subset filtered by tag — so re-ranking is instantaneous when new tickets are added or when the team changes their weighting preferences. &lt;code&gt;rw calc priority&lt;/code&gt; is a single command away.&lt;/p&gt;
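&lt;p&gt;The calculation is simple enough to sketch in a few lines. The normalization shown here — dividing each score by the backlog total — is an assumption for illustration; Rewelo's exact normalization may differ:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    key: str
    benefit: int   # B — value delivered
    penalty: int   # P — cost of deferral
    estimate: int  # E — resources required
    risk: int      # R — uncertainty

def priority(t: Ticket) -> float:
    # Value vs Cost, as in the formula above
    return (t.benefit + t.penalty) / (t.estimate + t.risk)

def ranked(tickets: list) -> list:
    # Normalize against the whole set passed in (full backlog or a tagged
    # subset), so scores stay comparable; re-ranking is just re-running this.
    total = sum(priority(t) for t in tickets)
    return sorted(
        ((t.key, priority(t) / total) for t in tickets),
        key=lambda pair: pair[1],
        reverse=True,
    )
```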

&lt;p&gt;This matters in the Four Amigos phase specifically. When a Developer argues that a ticket's Estimate score is too optimistic, or a QA engineer surfaces a hidden dependency that increases Risk, the scores are updated and the backlog re-sorts itself. The discussion produces data, not just minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tag-driven organization
&lt;/h3&gt;

&lt;p&gt;Rewelo uses a flexible &lt;code&gt;prefix:value&lt;/code&gt; tag system rather than fixed fields. In this project, tags covered state (&lt;code&gt;state:backlog&lt;/code&gt;, &lt;code&gt;state:wip&lt;/code&gt;, &lt;code&gt;state:done&lt;/code&gt;), feature grouping (&lt;code&gt;feature:auth&lt;/code&gt;, &lt;code&gt;feature:checkout&lt;/code&gt;), and team (&lt;code&gt;team:platform&lt;/code&gt;). Because every tag assignment is logged in an audit trail, the tag history also yields lead time and cycle time data from &lt;code&gt;state:&lt;/code&gt; transitions — a useful side effect.&lt;/p&gt;
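&lt;p&gt;To illustrate that side effect, here is a sketch assuming a hypothetical audit-trail row shape of &lt;code&gt;(ticket, tag, timestamp)&lt;/code&gt; — the real Rewelo log format may differ:&lt;/p&gt;

```python
from datetime import datetime

def cycle_time(events, ticket):
    # events: hypothetical audit-trail rows of (ticket, tag, timestamp).
    # Cycle time here is the span between the timestamps at which the
    # ticket was tagged state:wip and state:done.
    times = {tag: ts for t, tag, ts in events if t == ticket}
    return times["state:done"] - times["state:wip"]
```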

&lt;h3&gt;
  
  
  Dependency ordering
&lt;/h3&gt;

&lt;p&gt;The 124 relation declarations (&lt;code&gt;blocks&lt;/code&gt;, &lt;code&gt;depends-on&lt;/code&gt;, &lt;code&gt;relates-to&lt;/code&gt;) produce a directed graph of the backlog. Rewelo uses this to expose a five-layer topological ordering: the tickets in layer one have no upstream dependencies and can be started immediately; each subsequent layer becomes unblocked as the previous one completes. This is far more actionable than a flat, priority-sorted list.&lt;/p&gt;
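&lt;p&gt;The layering is a standard topological peel over the dependency graph. A sketch of the idea — not Rewelo's actual implementation:&lt;/p&gt;

```python
def dependency_layers(tickets, depends_on):
    # depends_on: dict mapping a ticket to the set of tickets it depends on.
    # Each pass peels off the tickets whose dependencies all sit in earlier
    # layers; layer one is therefore the set with no upstream dependencies.
    layers, placed = [], set()
    remaining = set(tickets)
    while remaining:
        layer = {t for t in remaining
                 if depends_on.get(t, set()).issubset(placed)}
        if not layer:
            raise ValueError("dependency cycle detected")
        layers.append(sorted(layer))
        placed.update(layer)
        remaining.difference_update(layer)
    return layers
```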




&lt;h2&gt;
  
  
  The Four Amigos Review in Detail
&lt;/h2&gt;

&lt;p&gt;The Four Amigos is a well-established agile practice: before any story reaches a sprint, it should be reviewed by representatives of the four key perspectives. What makes this pipeline's implementation unusual is that the review is run against AI-simulated personas — each grounded in the artifact set produced by the earlier phases — before involving the human team. This surfaces structural problems in the backlog without consuming sprint planning time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product Owner&lt;/strong&gt; reviews value propositions, acceptance criteria, and Benefit/Penalty scores. Asks: does this story deliver the outcome described in the concept? Are the acceptance criteria in the feature file comprehensive? Would a user recognize this as solving their problem?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt; reviews architecture fit, implementation plausibility, and Estimate/Risk scores. Asks: is this story implementable given the architecture defined in Phase 2? Are the E and R scores realistic? Are there hidden technical dependencies not captured in the relation graph?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;QA Engineer&lt;/strong&gt; reviews edge cases and cross-feature risks. Asks: are the Gherkin scenarios sufficient to catch regressions? Are there error states or boundary conditions missing from the feature files? Do any of these stories interact in ways that could produce surprising failures?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UX Designer&lt;/strong&gt; reviews interaction states, transitions, and cognitive load. Asks: are all the states this feature can be in represented in the acceptance criteria? Is the described flow consistent with how users actually think about the task? Where might a user get confused?&lt;/p&gt;

&lt;p&gt;The gate is a four-way approval. If any persona finds a material gap, one of three feedback paths is taken:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;↩ Refine requirements.&lt;/strong&gt; Conceptual gaps or value misalignments send the work back to Phase 1's cross-consistency review — the deepest and most expensive loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;↩ Rework feature files.&lt;/strong&gt; Missing scenarios, incomplete edge cases, or poorly specified acceptance criteria send individual feature files back to Phase 3 for revision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;↩ Refine stories.&lt;/strong&gt; Mis-scored tickets, missing dependencies, or scope problems are addressed directly in the Rewelo backlog — the shallowest and most common loop.&lt;/p&gt;

&lt;p&gt;The three loops are tiered by cost: story refinement is cheap (minutes), BDD rework is moderate (hours), requirements revision is expensive (days). The earlier a problem is found, the cheaper it is to fix — which is the central argument for front-loading structure in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Output Looks Like
&lt;/h2&gt;

&lt;p&gt;When the Four Amigos gate passes, the output is a Rewelo project containing 113 tickets that have been scored by all four personas, organized into five dependency layers, and validated against 160 behavioral scenarios. The implementation team can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run &lt;code&gt;rw calc priority&lt;/code&gt; to get an instant priority ranking&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;rw report dashboard&lt;/code&gt; to generate an HTML dashboard showing backlog health, score distribution, and tag breakdowns&lt;/li&gt;
&lt;li&gt;Export to CSV or JSON for integration with any downstream tool&lt;/li&gt;
&lt;li&gt;Start a sprint immediately from layer one, knowing every ticket in that layer is dependency-free and has passed four distinct review perspectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What the team cannot do is argue about what to build next without data. That is, perhaps, the most useful property of the whole pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running Rewelo as an MCP Server
&lt;/h2&gt;

&lt;p&gt;One of Rewelo's less obvious capabilities is its MCP server mode. Run &lt;code&gt;rw serve&lt;/code&gt; (or deploy the Docker container and configure Claude to point to it), and the AI assistant can manage the entire backlog — creating tickets, updating scores, assigning tags, running calculations — through natural language. This is how the Four Amigos review phase was implemented: each persona is a system prompt, the Rewelo MCP server provides the backlog as context, and the review runs as a structured conversation.&lt;/p&gt;

&lt;p&gt;The configuration is straightforward. Rewelo's &lt;code&gt;.mcp.json&lt;/code&gt; file in the repository shows the exact setup. Because the data lives in a named Docker volume, the database persists across container restarts and the full audit trail — every score change, every tag transition — is preserved.&lt;/p&gt;
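&lt;p&gt;For orientation, a generic MCP stdio client entry has roughly this shape. The field names follow the common MCP client configuration schema; the command and args below are an assumption based on &lt;code&gt;rw serve&lt;/code&gt;, and the authoritative values live in the repository's &lt;code&gt;.mcp.json&lt;/code&gt;:&lt;/p&gt;

```json
{
  "mcpServers": {
    "rewelo": {
      "command": "rw",
      "args": ["serve"]
    }
  }
}
```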




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The pipeline described here is not lightweight. Six structured phases, 16 feature files, 113 tickets, three ADRs, and a four-persona review process add up to a meaningful investment before any implementation begins. The argument for making that investment is simple: the cost of ambiguity grows exponentially the later it is found. A missing acceptance criterion discovered during sprint planning costs a conversation. The same gap found during code review costs a rewrite. Found in production, it costs users.&lt;/p&gt;

&lt;p&gt;Rewelo sits at the center of this because the backlog is where ambiguity historically hides most effectively — in vague story descriptions, in optimistic estimates, in priorities that change with whoever spoke last at the planning meeting. Replacing that with a transparent scoring formula, a dependency graph, and a full revision history is not bureaucracy. It is engineering applied to the product development process itself.&lt;/p&gt;

&lt;p&gt;The repository is at &lt;a href="https://github.com/sebs/rewelo" rel="noopener noreferrer"&gt;github.com/sebs/rewelo&lt;/a&gt;. It is experimental software, as the README notes — but the ideas it implements are not.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agenticagile</category>
      <category>agile</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Was So Angry, I Actually Shipped It</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Fri, 13 Mar 2026 22:31:40 +0000</pubDate>
      <link>https://forem.com/sebs/i-was-so-angry-i-actually-shipped-it-2m19</link>
      <guid>https://forem.com/sebs/i-was-so-angry-i-actually-shipped-it-2m19</guid>
      <description>&lt;p&gt;A while ago I wrote about how I was fed up enough with project management tools to build my own. No URL. No code. Just a rant and some screenshots of a half-baked UI.&lt;/p&gt;

&lt;p&gt;Several people in the comments called it a tease. They weren't wrong.&lt;/p&gt;

&lt;p&gt;So here's the follow-up that nobody ... erm ... at least three people actually asked for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The UI Didn't Happen
&lt;/h2&gt;

&lt;p&gt;Let me be upfront: I didn't build the fancy web UI I was implicitly promising. I started down that road a couple of times, got bored fighting CSS and component state, and asked myself the honest question — &lt;em&gt;who is this actually for?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Me. It's for me.&lt;/p&gt;

&lt;p&gt;And I live in the terminal.&lt;/p&gt;

&lt;p&gt;So I threw out the frontend entirely and built a CLI instead. No regrets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet rewelo
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;rewelo&lt;/strong&gt; — Relative Weight Backlogs for the CLI and MCP.&lt;/p&gt;

&lt;p&gt;It does exactly what I said I wanted: it prioritizes work using four dimensions instead of the fictional psychic measurement known as story points.&lt;/p&gt;

&lt;p&gt;Every ticket gets scored on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benefit&lt;/strong&gt; — value gained by doing the thing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Penalty&lt;/strong&gt; — cost of &lt;em&gt;not&lt;/em&gt; doing the thing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Estimate&lt;/strong&gt; — how much work it actually is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk&lt;/strong&gt; — how uncertain or gnarly the implementation is&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From those four numbers, priority calculates itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value    = Benefit + Penalty
Cost     = Estimate + Risk
Priority = Value / Cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher priority = better return on investment. It's not rocket science. It's just math that most tools refuse to let you do.&lt;/p&gt;
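&lt;p&gt;The whole model fits in a few lines of Python. This is my own sketch of the formula, not rewelo's actual implementation:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    benefit: int   # B: value gained by doing the thing
    penalty: int   # P: cost of not doing the thing
    estimate: int  # E: how much work it actually is
    risk: int      # R: how uncertain the implementation is

def priority(t: Ticket) -> float:
    """Priority = (Benefit + Penalty) / (Estimate + Risk)."""
    value = t.benefit + t.penalty
    cost = t.estimate + t.risk
    return value / cost

# Rank a tiny backlog by descending priority (higher = better ROI).
backlog = [Ticket(8, 5, 3, 2), Ticket(3, 1, 8, 5), Ticket(13, 8, 5, 3)]
ranked = sorted(backlog, key=priority, reverse=True)
```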

&lt;h2&gt;
  
  
  DuckDB Was The Right Call
&lt;/h2&gt;

&lt;p&gt;One of the decisions I'm most happy about: no server.&lt;/p&gt;

&lt;p&gt;I spent zero hours configuring a database daemon, zero hours fighting connection pools, and zero hours explaining to myself why postgres was running at 3am. The whole thing runs on DuckDB — an embedded analytical database that lives in a single file.&lt;/p&gt;

&lt;p&gt;This meant I could focus on the actual problem instead of infrastructure theater. Turns out a project management tool for one person doesn't need a distributed SQL cluster.&lt;/p&gt;

&lt;p&gt;Who knew.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tags Instead of Fixed Fields
&lt;/h2&gt;

&lt;p&gt;The state machine I wanted to build kept getting complicated. So I simplified it down to a tag system.&lt;/p&gt;

&lt;p&gt;Every ticket gets tags in &lt;code&gt;prefix:value&lt;/code&gt; format: &lt;code&gt;state:backlog&lt;/code&gt;, &lt;code&gt;state:wip&lt;/code&gt;, &lt;code&gt;state:done&lt;/code&gt;, &lt;code&gt;feature:checkout&lt;/code&gt;, &lt;code&gt;team:platform&lt;/code&gt;. Whatever you need. The system doesn't care — it just tracks every assignment and removal in an audit log.&lt;/p&gt;

&lt;p&gt;The beautiful side effect: since every &lt;code&gt;state:&lt;/code&gt; tag change is recorded with a timestamp, cycle time and lead time fall out of the data for free. No extra instrumentation. No dashboards you have to manually update. Just the log.&lt;/p&gt;
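&lt;p&gt;A minimal sketch of the principle, assuming a hypothetical log-row shape of (timestamp, ticket, tag, action) rather than rewelo's actual schema:&lt;/p&gt;

```python
from datetime import datetime

# Hypothetical audit-log rows: (timestamp, ticket_id, tag, action).
log = [
    (datetime(2026, 3, 1, 9, 0),  "T-1", "state:backlog", "assign"),
    (datetime(2026, 3, 3, 10, 0), "T-1", "state:wip",     "assign"),
    (datetime(2026, 3, 6, 16, 0), "T-1", "state:done",    "assign"),
]

def cycle_time(log, ticket_id):
    """Time from first state:wip to first state:done, straight from the log."""
    def first(tag):
        return next(ts for ts, tid, t, a in log
                    if tid == ticket_id and t == tag and a == "assign")
    return first("state:done") - first("state:wip")
```

No extra instrumentation needed: the metric is a query over data the tag system already records.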

&lt;h2&gt;
  
  
  Revision History Because I've Been Burned
&lt;/h2&gt;

&lt;p&gt;Every change to a ticket creates a snapshot of what it looked like before. Not just the scores — the tags too.&lt;/p&gt;

&lt;p&gt;This means you can reconstruct the exact state of your backlog at any point in time. Remember that estimation session three weeks ago? You can see the numbers from before the panic re-estimation happened. This turned out to be more useful than I expected. Past me was making different tradeoffs than present me, and it's actually worth knowing when that changed and why.&lt;/p&gt;
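&lt;p&gt;Point-in-time reconstruction is then just a lookup over timestamped snapshots. A sketch under the assumption that each revision stores a full snapshot (field names are illustrative, not rewelo's):&lt;/p&gt;

```python
from datetime import datetime
from bisect import bisect_right

# Hypothetical revisions: (timestamp, snapshot of the ticket at that time).
revisions = [
    (datetime(2026, 2, 20), {"estimate": 3, "tags": ["state:backlog"]}),
    (datetime(2026, 3, 1),  {"estimate": 8, "tags": ["state:wip"]}),   # panic re-estimate
    (datetime(2026, 3, 9),  {"estimate": 8, "tags": ["state:done"]}),
]

def state_at(revisions, when):
    """Return the last snapshot at or before `when` (None if before history)."""
    idx = bisect_right([ts for ts, _ in revisions], when)
    return revisions[idx - 1][1] if idx else None
```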

&lt;h2&gt;
  
  
  The MCP Part Is The Interesting Part
&lt;/h2&gt;

&lt;p&gt;Here's where it gets weird in a good way.&lt;/p&gt;

&lt;p&gt;The CLI doubles as an MCP server over stdio. Which means Claude — or any AI assistant that speaks MCP — can manage your backlog directly. Create tickets, assign tags, run priority calculations, generate reports. All from a conversation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7e5lti6ti1xyzxv764l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7e5lti6ti1xyzxv764l.png" alt=" " width="800" height="919"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I wrote in the original post that I wanted to bind agent integration to workflows, to have some control over machine-made changes. This is the answer to that. The MCP tools are the workflow. The AI calls them explicitly and the audit log catches everything it touches. Nothing happens silently.&lt;/p&gt;

&lt;p&gt;In Claude's own words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I extracted all 18 Gherkin feature files from your features/ directory and converted each "Rule" block into a rewelo ticket — 104 stories total — with acceptance criteria derived from the scenarios, Fibonacci scores for benefit/penalty/estimate/risk, and system tags. I created the golden-season project in rewelo from scratch, set up 21 tags (3 part tags for B2G/NLS/24H and 18 system tags), and assigned every ticket its corresponding system:* tag. The backlog is now fully populated and prioritised, ready for sprint planning or further refinement like assigning part:* tags to map stories to the three-part implementation roadmap.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  A Word on Scope Creep Not Happening
&lt;/h2&gt;

&lt;p&gt;I am genuinely proud of what I did &lt;em&gt;not&lt;/em&gt; build.&lt;/p&gt;

&lt;p&gt;No user accounts. No sharing. No real-time collaboration. No mobile app. No integrations with Slack, GitHub, Linear, Jira, or anything that would require me to maintain OAuth tokens at 2am.&lt;/p&gt;

&lt;p&gt;This tool is for one person — me — and it does that job well. The moment I start building for an imaginary team of five, I stop building for myself and start building a worse version of tools that already exist.&lt;/p&gt;

&lt;p&gt;I've read enough HN threads to know how that ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  It Exists. You Can Download It.
&lt;/h2&gt;

&lt;p&gt;Here it is: &lt;a href="https://github.com/sebs/rewelo" rel="noopener noreferrer"&gt;github.com/sebs/rewelo&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Sometimes the best tool is the one that you actually finish.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>vibecoding</category>
      <category>agile</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Generator Generator</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Tue, 10 Mar 2026 12:39:08 +0000</pubDate>
      <link>https://forem.com/sebs/generator-generator-1m0h</link>
      <guid>https://forem.com/sebs/generator-generator-1m0h</guid>
      <description>&lt;p&gt;Suppose you need to produce a physical set of Agile Workshop Tokens for your development team. Forty distinct pieces: planning poker chips embossed with Fibonacci values, sprint coins, retrospective cards, role medallions. You have two ways to ask an AI for help.&lt;/p&gt;

&lt;p&gt;The first way: ask a generative model to show you what an Agile Workshop Token looks like. It obliges. You receive a pleasant render of a melted-looking coin bearing the legend "SCRROM MSTR." It has no dimensional accuracy, no awareness of Agile methodologies, and no manufacturing utility whatsoever. This is the Nano Banana approach — prompting a black-box model to spit out a singular, static artifact. It yields a statistically probable object frozen in time: zero parametric control, no hierarchical semantics, and a complete ignorance of physical constraints. It is an artifact stripped of its axioms.&lt;/p&gt;

&lt;p&gt;The second way is the subject of this essay.&lt;/p&gt;

&lt;p&gt;Instead of asking for a token, you ask the AI to design a classical procedural grammar for manufacturing an entire ecosystem of tokens. The output is not an image. It is a generative engine — a formal system of rules, constraints, and stochastic parameters that, when executed, produces exactly 40 watertight, functionally distinct, 3D-printable models. This is the paradigm shift: moving from the discrete generation of objects to the synthesis of generative systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(Direct_Prompt) → X_static
f(System_Prompt) → Parametric_Engine(Θ)
Σ Parametric_Engine(p, t) = ∞
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The token set will reappear throughout this essay as our running example, grounding each theoretical claim in the concrete output of a real pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  I. The Nano Banana Fallacy
&lt;/h2&gt;

&lt;p&gt;Direct asset generation treats artificial intelligence as a glorified vending machine. The fundamental problem is not quality — modern diffusion models produce convincing images. The problem is structure. A static output has no memory of how it was made, no handles by which it can be adjusted, and no awareness of the constraints that govern the domain it represents.&lt;/p&gt;

&lt;p&gt;In game development, the fallacy is obvious. A neural network hallucinating a 3D mesh of a sword gives you messy topology that cannot be animated, lacks collision volumes, and has no semantic awareness of its own edge flow. The Nano Banana sword is useless the moment you need a second sword that is longer, or rustier, or held by a different character.&lt;/p&gt;

&lt;p&gt;A generated system, by contrast, outputs the deterministic procedural grammar to forge a million weapons. It defines the hierarchical L-system of the hilt, the Bézier constraints of the blade curvature, and the algorithmic distribution of surface wear based on a runtime age parameter. It generates the mathematical forge, not the singular sword.&lt;/p&gt;

&lt;p&gt;Apply this lens to our token set. A direct prompt gives us one melted coin. The meta-generative approach gives us a shape grammar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;G = (V, Σ, R, S)

Where V is the set of abstract token categories,
      Σ is the terminal geometries (hexagons, discs, shields, rectangles),
      R represents the substitution rules,
      S is the starting axiom.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That grammar, when executed, produces not one token but a coherent family of forty — each geometrically distinct, each semantically correct, each printable.&lt;/p&gt;
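&lt;p&gt;To make the SPLIT/SUB machinery concrete, here is a toy Python rendering of the grammar. The dictionaries mirror the counts and shapes from the essay; everything else (function names, the token dict) is an illustrative sketch, not the real pipeline:&lt;/p&gt;

```python
import random

# Toy version of G = (V, Σ, R, S): abstract categories (V) are substituted
# into terminal geometries (Σ) by rules (R), starting from the axiom "Token".
SPLIT = {"VotingChip": 13, "SprintCoin": 8, "RetroCard": 12, "RoleMedallion": 7}
SUB = {
    "VotingChip":    ("disc",      ["0", "½", "1", "2", "3", "5", "8", "13", "21", "?"]),
    "SprintCoin":    ("hexagon",   ["⚡", "↗", "✓", "🔥"]),
    "RetroCard":     ("rectangle", ["+", "Δ", "−"]),
    "RoleMedallion": ("shield",    ["SM", "PO", "Dev", "QA", "UX", "STK", "Coach"]),
}

def run_grammar(seed=0xA91ECAFE):
    rng = random.Random(seed)
    tokens = []
    for category, count in SPLIT.items():      # [SPLIT] partition the axiom
        shape, glyphs = SUB[category]          # [SUB]   bind category to geometry
        for i in range(count):                 # [TERMINAL] instantiate
            tokens.append({"category": category, "shape": shape,
                           "glyph": glyphs[i % len(glyphs)],
                           "seed": rng.getrandbits(32)})
    return tokens
```

Running the same seed always yields the same family of forty, which is the point: the artifact is the engine, not any single coin.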




&lt;h2&gt;
  
  
  II. The Three Phases of Meta-Generation
&lt;/h2&gt;

&lt;p&gt;The pipeline that produced our token set operates in three distinct phases. These phases are universal — they apply equally to architectural design, material science, and synthetic data generation. Understanding them is the key to applying the meta-generative approach in any domain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase One: Semantic Partitioning (SPLIT)
&lt;/h3&gt;

&lt;p&gt;The generator's first task is not to draw anything. It is to understand the domain well enough to partition the problem space correctly. For the token set, this means recognising that "Agile Workshop" implies a specific ecosystem of functional object types with specific real-world usage distributions.&lt;/p&gt;

&lt;p&gt;The system allocates thirteen VotingChips — because planning poker requires a full Fibonacci sequence plus variants — eight SprintCoins, twelve RetroCards, and seven RoleMedallions. These numbers are not arbitrary. They reflect the actual ratio of pieces required for a workshop of eight to twelve people. A Nano Banana generator ignores this entirely; it does not know what a sprint retrospective is.&lt;/p&gt;

&lt;p&gt;In architectural morphogenesis, the equivalent phase generates Voronoi tessellation logic for load-bearing steel, partitioning the structural problem into zones parameterized against wind shear and solar radiation. The output is not a building; it is a spatial algorithm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase Two: Parametric Substitution (SUB)
&lt;/h3&gt;

&lt;p&gt;Once the semantic partitions exist, the generator binds each abstract category to concrete geometry and encodes domain-specific rules as mathematical constraints. This is where the system demonstrates genuine understanding.&lt;/p&gt;

&lt;p&gt;For VotingChips, the generator correctly applies the Fibonacci sequence to the value distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(n) = f(n−1) + f(n−2)  ∀ VotingChip_value
Values: 0, ½, 1, 2, 3, 5, 8, 13, 21, ?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a cosmetic choice. Fibonacci values in planning poker reflect a deliberate epistemological decision: the gaps between numbers encode increasing uncertainty. A generator that assigned random integers would produce a set that looks like poker chips but fails as a planning tool. The meta-generative system encodes the mathematical rule directly into the substitution grammar.&lt;/p&gt;
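&lt;p&gt;The deck itself is easy to derive from the recurrence. A small sketch; note that 0, ½, and ? are special cards that sit outside the Fibonacci rule:&lt;/p&gt;

```python
def poker_values(n_fib=7):
    """Fibonacci core of the planning-poker deck, plus the special cards
    0, ½ and ? that do not follow f(n) = f(n-1) + f(n-2)."""
    fib = [1, 2]
    for _ in range(n_fib - 2):
        fib.append(fib[-1] + fib[-2])
    return ["0", "½"] + [str(v) for v in fib] + ["?"]
```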

&lt;p&gt;SprintCoins become hexagons (stable, grippable, stackable); RetroCards become rectangles with write-on surfaces and lane indicators (+/Δ/−) that map to the three-column retrospective format; RoleMedallions become shields, differentiated by weight and finish from the lighter functional tokens.&lt;/p&gt;

&lt;p&gt;The industrial engineering parallel is the generation of topology optimisation algorithms for triply periodic minimal surfaces — gyroids and diamond lattices — where the substitution rules encode physical constraints like energy absorption and mass minimisation rather than Fibonacci sequences and retrospective formats. The structure of the problem is identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase Three: Instantiation (TERMINAL)
&lt;/h3&gt;

&lt;p&gt;The grammar resolves. Terminal rules execute Boolean mesh operations, stamp glyphs onto base geometry, apply edge-banding for physical grip, assign material properties, and export watertight models. For the token set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AXIOM: Token → queued

────────────────────────────────────────────────────────────
[SPLIT]    Token → VotingChip    ×13 instances
[SPLIT]    Token → SprintCoin    ×8  instances
[SPLIT]    Token → RetroCard     ×12 instances
[SPLIT]    Token → RoleMedallion ×7  instances

────────────────────────────────────────────────────────────
[SUB]      VotingChip    → BaseDisc    + ValueBadge(0,½,1,2,3,5,8,13,21,?)
[SUB]      SprintCoin    → BaseHex     + FaceGlyph(⚡×3, ↗×2, ✓×2, 🔥×1)
[SUB]      RetroCard     → BaseRect    + WriteSurface + CategoryBar(+/Δ/−)
[SUB]      RoleMedallion → BaseShield  + RoleGlyph(SM,PO,Dev,QA,UX,STK,Coach)

────────────────────────────────────────────────────────────
[TERMINAL] 40 meshes instantiated, glyphs stamped, edge-bands placed
[TERMINAL] GENERATION COMPLETE: 40 tokens, seed 0xA91ECAFE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same three-phase structure appears in every successful application of meta-generation: partition the domain, bind abstract categories to mathematically constrained geometry, instantiate. The domain changes; the architecture does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  III. Where the Paradigm Extends
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Parametric Architecture and Urban Morphogenesis
&lt;/h3&gt;

&lt;p&gt;When an architect uses a basic generative AI to create a render of a futuristic building, they receive a beautiful hallucination that fundamentally ignores thermodynamics, zoning laws, and material tensile strength. It is, like the melted coin, a useless image.&lt;/p&gt;

&lt;p&gt;A meta-generative system does not output a building. It outputs a spatial algorithm. It generates Voronoi tessellation logic for load-bearing steel, parameterizing geometry against wind shear and solar radiation. Instead of a static blueprint, the system defines a generative grammar where floorplans dynamically restructure themselves based on traffic flow optimisations and HVAC efficiency constraints. Critically, the geometry produced can be mathematically constrained by Cauchy's equilibrium equation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;∇ · σ + F = 0
σ = C : ε
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every procedural strut is guaranteed to actually support its structural load — something no hallucinated render can promise.&lt;/p&gt;

&lt;p&gt;Note the structural echo of the token pipeline. The architect's grammar partitions a building into zones (SPLIT), substitutes each zone with geometrically constrained elements (SUB), and instantiates watertight, structurally valid meshes (TERMINAL). The domain is different. The architecture is the same.&lt;/p&gt;

&lt;h3&gt;
  
  
  Synthetic Data and Kinematic Reality
&lt;/h3&gt;

&lt;p&gt;Machine learning models for autonomous driving or robotics cannot be trained on static images. They require rich, physically accurate synthetic environments. A direct-to-image AI cannot generate a functioning physics simulation.&lt;/p&gt;

&lt;p&gt;A meta-generative system writes the deterministic rules for a dynamic world. It parameterises the friction coefficients of procedural asphalt, generates the optical scattering algorithms of simulated fog, and orchestrates the localised stochastic behaviours of pedestrian traffic models. The system continuously updates its probability distributions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(A | B) = P(B | A) · P(A) / P(B)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generative system iterates its own parameters to generate edge-case scenarios — the long tail of rare events that directly trains physical robotics.&lt;/p&gt;

&lt;p&gt;The token pipeline's stochastic layer encodes the same principle at a smaller scale. Hue is not fixed; it is drawn from a normal distribution centred on the family colour with standard deviation 12°, creating warm and cool variants within each token family. The spatial seed map ensures that tokens physically adjacent on a print sheet share correlated aesthetic properties — a form of local Bayesian coherence baked into the generation rules.&lt;/p&gt;
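&lt;p&gt;That stochastic layer is simple to express. A sketch of the hue rule as described (normal distribution, 12° standard deviation; the function name and seed handling are mine):&lt;/p&gt;

```python
import random
import statistics

def family_hues(centre_hue, n, sd=12.0, seed=42):
    """Draw n hues from N(centre_hue, sd°), wrapped to the colour circle.
    Warm and cool variants emerge within one token family."""
    rng = random.Random(seed)
    return [rng.gauss(centre_hue, sd) % 360 for _ in range(n)]
```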




&lt;h2&gt;
  
  
  IV. Praxis: What Do You Actually Use This For?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwoc1g5n8i31jjno0mjml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwoc1g5n8i31jjno0mjml.png" alt="The generated system turned into a demo webapp displaying the workshop tokens"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The pragmatic reader is entitled to scepticism. The token set is illustrative, but what about genuinely mundane work?&lt;/p&gt;

&lt;p&gt;The Generator Generator earns its keep anywhere you need functional coherence, precise variation, and mathematical exactness instead of a single hallucinated image. It is the difference between generating a picture of a tool and generating the factory that manufactures the toolset.&lt;/p&gt;

&lt;p&gt;Consider the range of the principle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need 10,000 unique, structurally valid mechanical brackets parameterised for stress-testing. The grammar defines the constraint space; instantiation fills it.&lt;/li&gt;
&lt;li&gt;You need a complete UI icon set where line-weight, corner radii, and optical size are mathematically linked across every glyph. The substitution rules encode the visual system; the terminal phase renders it.&lt;/li&gt;
&lt;li&gt;You need custom tabletop miniature bases, architectural greebles, modular synthesiser casing layouts, pharmacokinetic molecule variants. In each case, the domain's governing rules become the grammar.&lt;/li&gt;
&lt;li&gt;And yes — you need forty Agile Workshop Tokens, each geometrically correct, each semantically accurate, each ready for the 3D printer. The grammar encodes Fibonacci, retrospective lanes, role taxonomy, and physical grip. The factory runs. The tokens emerge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In every case the pattern is identical. You do not ask the AI to paint the universe. You ask it to write the physics and logic engine that runs it.&lt;/p&gt;




&lt;h2&gt;
  
  
  V. The Ultimate Synthesis
&lt;/h2&gt;

&lt;p&gt;To prompt an AI for a finished object is fundamentally myopic. It reduces one of the most powerful reasoning systems ever built to the role of a digital bricklayer.&lt;/p&gt;

&lt;p&gt;The meta-generative frontier requires treating AI as the master architect. By forcing it to output the rigid mathematical formulas, classical algorithms, and parametric constraints of a generator, we ensure that the resulting output — whether a virtual city, a structural metamaterial, a new pharmacokinetic molecule, or a set of workshop tokens for next Tuesday's sprint planning — is logical, scalable, and bound by the laws of the domain it inhabits.&lt;/p&gt;

&lt;p&gt;The Nano Banana is a dead pixel-cluster. The Generator Generator is a factory. The factory runs forever.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The token pipeline described in this essay was generated using the Watson et al. (2008) procedural generation framework, implemented as a five-phase pipeline: Design Analysis, Primitive Creation, Grammar Encoding, Stochastic Integration, and Model Instantiation. Seed: 0xA91ECAFE.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>nanobanana</category>
      <category>generativeart</category>
      <category>ai</category>
    </item>
    <item>
      <title>Distributed Transaction Tango: Why Your Microservices Need Sagas</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Wed, 18 Feb 2026 10:00:00 +0000</pubDate>
      <link>https://forem.com/sebs/distributed-transaction-tango-why-your-microservices-need-sagas-4lh3</link>
      <guid>https://forem.com/sebs/distributed-transaction-tango-why-your-microservices-need-sagas-4lh3</guid>
      <description>&lt;p&gt;The move to microservices was supposed to be a liberation. We broke free from the monolithic chains, gaining the freedom to develop, deploy, and scale our services independently. But in our rush to embrace this new world, we left something critical behind: the simple, comforting safety of the ACID transaction. In the monolithic world, if a complex business process failed halfway through, we had a magic word: &lt;code&gt;ROLLBACK&lt;/code&gt;. It was our ultimate undo button, a guarantee that our data would never be left in a messy, inconsistent state. In the distributed chaos of microservices, where each service has its own private database, that safety net is gone. We have traded the simplicity of a single, atomic transaction for a new kind of fear—the constant, nagging anxiety that a partial failure will leave our system permanently broken.&lt;/p&gt;

&lt;p&gt;Our first instinct in this new reality is often to try and recreate the old one. We might reach for complex, heavyweight protocols like two-phase commits in a desperate attempt to stretch a transaction across multiple services. This approach is a trap. It reintroduces the very coupling we sought to escape, creating a brittle, slow, and unscalable system where the failure of one service can bring the entire process to a grinding halt. An even more common, and far more dangerous, response is to simply ignore the problem. We write our services to handle the “happy path,” crossing our fingers and hoping that the network is reliable and every service is always available. This is not engineering; it is wishful thinking. It inevitably leads to disaster: a customer is billed for an item that is out of stock, a user’s account is debited but their access is not granted, and our data drifts into a state of irreconcilable chaos.&lt;/p&gt;

&lt;p&gt;We must accept that in a distributed system, partial failure is not an edge case; it is a certainty. The Saga pattern offers a way out of this trap by forcing us to confront this reality head-on. It is a fundamental shift in thinking: instead of trying to prevent failure with a single, all-or-nothing transaction, we manage it with a series of small, reversible steps. A saga is a sequence of local transactions, where each step is a self-contained operation within a single service. The magic lies in the second half of the pattern: for every action that moves the process forward, we must define a corresponding “compensating action” that can undo it. The saga doesn’t prevent failure; it provides a clear, automated path to recovery. The relationship is straightforward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Compensating Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Create Order&lt;/td&gt;
&lt;td&gt;Order Service&lt;/td&gt;
&lt;td&gt;Delete Order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reserve Item&lt;/td&gt;
&lt;td&gt;Inventory Service&lt;/td&gt;
&lt;td&gt;Release Item&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Process Payment&lt;/td&gt;
&lt;td&gt;Payment Service&lt;/td&gt;
&lt;td&gt;Refund Payment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This sequence of actions and compensating actions can be managed in one of two primary ways. The first approach is orchestration, where a central coordinator acts like a conductor, telling each service what to do and when. It calls the customer service, then the inventory service, then the billing service. If any step fails, the orchestrator takes responsibility for calling the necessary compensating actions in reverse order to clean up the mess. The alternative is choreography, a more decentralized dance where each service, upon completing its local transaction, simply emits an event. The next service in the chain listens for this event and is triggered to perform its own work. In this model, there is no central brain; the logic is distributed across the event streams. Choosing between them is a trade-off between having a single point of control and visibility versus a more decoupled, and potentially more complex, event-driven architecture.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Orchestration&lt;/th&gt;
&lt;th&gt;Choreography&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coordination&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centralized coordinator manages all steps&lt;/td&gt;
&lt;td&gt;Decentralized; services react to each other's events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High; logic is in one place&lt;/td&gt;
&lt;td&gt;Low; logic is distributed across services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High; easy to see the state of a saga&lt;/td&gt;
&lt;td&gt;Low; requires monitoring event streams to trace a saga&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Coupling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tightly coupled to the orchestrator&lt;/td&gt;
&lt;td&gt;Loosely coupled; services only know about events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simpler for sagas with few participants&lt;/td&gt;
&lt;td&gt;Can become complex to track with many participants&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
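&lt;p&gt;To make the contrast concrete, here is a minimal, hypothetical choreography sketch in Python: an in-process event bus stands in for a real message broker, and each "service" is just a handler that reacts to one event and emits the next. All service names, event names, and payload fields are illustrative.&lt;/p&gt;

```python
import collections

class EventBus:
    """Tiny in-process event bus; a stand-in for a real message broker."""
    def __init__(self):
        self.handlers = collections.defaultdict(list)

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        for handler in list(self.handlers[event_name]):
            handler(payload)

bus = EventBus()
trace = []  # records what each "service" did, for illustration

def inventory_on_order_created(order):
    trace.append("inventory reserved")
    bus.publish("item.reserved", order)

def payment_on_item_reserved(order):
    if order.get("card_ok"):
        trace.append("payment processed")
        bus.publish("payment.processed", order)
    else:
        trace.append("payment failed")
        bus.publish("payment.failed", order)

def inventory_on_payment_failed(order):
    # The compensating action, triggered purely by an event.
    trace.append("inventory released")
    bus.publish("item.released", order)

bus.subscribe("order.created", inventory_on_order_created)
bus.subscribe("item.reserved", payment_on_item_reserved)
bus.subscribe("payment.failed", inventory_on_payment_failed)
```

&lt;p&gt;Publishing &lt;code&gt;order.created&lt;/code&gt; with a failing card walks the failure path: the inventory handler reserves, the payment handler fails, and the failure event triggers the release. No single component holds the whole picture, which is exactly the visibility trade-off shown in the table above.&lt;/p&gt;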

&lt;p&gt;Adopting the Saga pattern is not a free lunch. It introduces a new kind of complexity, demanding that we explicitly design for failure and recovery. We must build, test, and maintain these compensating transactions, which adds to the development overhead. It also forces us to embrace the concept of eventual consistency, accepting that there will be brief moments where the system is in an intermediate state. But the payoff is a system that is resilient by design. It is a system that can gracefully handle the inevitable failures of a distributed world without losing data or requiring manual intervention. Sagas are more than a design pattern; they are an acknowledgment that the world of microservices is messy and unpredictable. By embracing this reality, we can finally build systems that are not just scalable and independent, but also truly robust.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>acid</category>
    </item>
    <item>
      <title>The Build vs. Buy Trap: Why You Should Be Assembling Instead</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Tue, 17 Feb 2026 09:00:00 +0000</pubDate>
      <link>https://forem.com/sebs/the-build-vs-buy-trap-why-you-should-be-assembling-instead-29f4</link>
      <guid>https://forem.com/sebs/the-build-vs-buy-trap-why-you-should-be-assembling-instead-29f4</guid>
      <description>&lt;p&gt;For a while, engineering teams have been trapped in a false dichotomy, a binary choice that has dictated the shape of our projects and the fate of our budgets: do we build it or do we buy it? The "build" path is a siren song of ultimate control, promising a bespoke solution perfectly tailored to our unique needs. We imagine crafting a flawless system from the ground up, but we conveniently ignore the brutal reality of the resources it will consume, the maintenance burden it will become, and the high cost of pivoting when our perfect requirements inevitably change. On the other side lies the pragmatic allure of "buy," the promise of an off-the-shelf solution that gets us to market faster. Yet, this path often leads to the frustration of shoehorning a generic product into a specific problem, the operational headache of running and patching someone else’s software, and the creeping dread of vendor lock-in. We have been conditioned to see these two paths as our only options, but this rigid mindset is a relic of a bygone era, and it is holding us back.&lt;/p&gt;

&lt;p&gt;The first real evolution beyond this binary trap was the rise of the “rent” model, a paradigm shift powered by the SaaS explosion. Instead of buying the software and running it ourselves, we could simply pay a subscription and outsource the entire problem. For a company needing to perform a complex data processing task, this meant no longer building a dedicated compute cluster or managing a licensed software suite; they could just call a third-party API. This approach offers undeniable advantages: near-zero operational overhead, instant access to specialized expertise, and a predictable cost model. However, it comes at the steep price of control. When you rent, you are a tenant in someone else’s ecosystem. Your destiny is tied to their SLA, their feature roadmap, and their security posture. The service is a black box, and when it fails, you are left powerless, endlessly refreshing a status page. It is the pinnacle of convenience, but it forces a trade-off between ease and ownership that many businesses are rightly hesitant to make.&lt;/p&gt;

&lt;p&gt;This is where the truly cloud-native paradigm emerges, offering a fourth option that transcends the old debate: we can “assemble.” This isn’t about building from scratch, but about composing a sophisticated solution from a palette of smaller, fully managed, best-of-breed services—the Lego bricks provided by a modern cloud platform. Instead of renting a black-box API, a team can assemble its own data-processing pipeline. They use a cloud storage service for raw data, a message queue to trigger jobs, and a managed compute service to perform the transformation. The critical distinction is that they own the &lt;em&gt;workflow&lt;/em&gt;, the &lt;em&gt;logic&lt;/em&gt;, and the &lt;em&gt;configuration&lt;/em&gt;, but they are completely liberated from managing the underlying &lt;em&gt;infrastructure&lt;/em&gt;. This is the synthesis we have been searching for: the customization and control of the “build” world combined with the operational simplicity of the “rent” world.&lt;/p&gt;

&lt;p&gt;Adopting this assembly-line mindset requires a fundamental shift in how we evaluate cost and effort, because teams consistently fall into the trap of miscalculation. We dramatically overestimate our ability to build a solution quickly and cheaply, forgetting that the initial development cost is merely the tip of the Total Cost of Ownership (TCO) iceberg. We ignore the immense, ongoing costs of maintenance, security patching, scaling, and the operational staff required to keep a custom service alive. We also underestimate the complexity of running a “bought” solution, which is never as simple as the sales pitch suggests. The “assemble” model may appear to have a higher cost per transaction, but its TCO is often drastically lower because entire categories of work—like managing servers, planning capacity, or patching operating systems—are eliminated for significant parts of an application. This requires a cultural shift in financial thinking, moving from a world of upfront capital expenditure to a more fluid, pay-as-you-go operational model.&lt;/p&gt;

&lt;p&gt;Ultimately, the most strategic question is not about cost, but about focus. The decision to build, buy, rent, or assemble should be a conscious choice about where to invest your team’s most valuable and finite resource: their attention. It makes little sense to divert your best engineers to build a mediocre version of a solved problem, like a message queue or a workflow engine, when they could be working on the unique features that actually differentiate your business and create a competitive edge. The assembly model allows us to outsource the undifferentiated heavy lifting to the cloud provider, freeing our teams to focus on the core business logic where they can create the most value. The future of engineering is not about being the best at building everything from scratch; it is about being the best at intelligently assembling the powerful components that are already at our fingertips.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>tco</category>
    </item>
    <item>
      <title>Culture Trap: Why Your DevOps Transformation is Failing</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Mon, 16 Feb 2026 11:04:14 +0000</pubDate>
      <link>https://forem.com/sebs/culture-trap-why-your-devops-transformation-is-failing-22l6</link>
      <guid>https://forem.com/sebs/culture-trap-why-your-devops-transformation-is-failing-22l6</guid>
      <description>&lt;p&gt;The arrival of DevOps promised a revolution, but for many, it has become a frustrating exercise in cargo cultism. We have dutifully acquired the artifacts of modern engineering: the CI/CD pipelines are humming, the Kubernetes clusters are provisioned, and our infrastructure is meticulously defined in Terraform. We track DORA metrics, run our daily stand-ups, and our dashboards are a kaleidoscope of real-time data. From the outside, it looks like a textbook transformation. We have all the visible symbols of a high-functioning DevOps environment. Yet, the deployments remain fraught with anxiety, the blame-game after an outage is as fierce as ever, and the wall between development and operations, while perhaps more technologically advanced, stands as tall as ever. We have fallen for the great illusion of our industry: mistaking the visible tools and rituals for the culture itself, and in doing so, we have completely missed the point.&lt;/p&gt;

&lt;p&gt;This disconnect becomes painfully obvious when you compare what we say to what we do. Our internal wikis and design documents are filled with the noble espoused values of the DevOps movement. We champion "collaboration over silos," we preach the gospel of "continuous improvement," and we proudly display Werner Vogels’ mantra, "You build it, you run it," on our conference room screens. We justify architectural decisions like microservices or immutable infrastructure with these very principles. But these stated beliefs often serve as a thin veneer over a contradictory reality. We talk about collaboration, but developers still throw code over the wall for Ops to handle at 3 AM. We advocate for continuous improvement, but our post-mortems devolve into finger-pointing sessions. We claim teams have ownership, but a developer still needs five layers of approval to provision a new database. The principles we claim to hold dear are not the principles that actually govern our behavior, creating a cynical gap between the culture we advertise and the one we actually live.&lt;/p&gt;

&lt;p&gt;The truth is that no amount of tooling or inspirational posters can fix a problem that lies at a much deeper, invisible level. The real drivers of an organization's culture are not the visible artifacts or the stated values, but the unspoken, taken-for-granted assumptions that shape every decision and action. These are the beliefs so deeply ingrained that we no longer question them. Do we, as an organization, truly believe that failure is an opportunity to learn, or do we instinctively search for the person to blame? Do our engineers feel the psychological safety to admit a mistake or ask a “stupid” question, or do they fear looking incompetent? Do we fundamentally assume that our people are responsible professionals who can be trusted with autonomy, or do we assume they need to be constrained by rigid processes and approvals to prevent chaos? Until these foundational beliefs are confronted and changed, we are just rearranging the deck chairs on the Titanic.&lt;/p&gt;

&lt;p&gt;This is why so many top-down DevOps initiatives fail. They focus on changing the visible things—the tools and the processes—while leaving the invisible, underlying assumptions untouched. A true transformation works from the inside out. It begins by fostering a genuine, shared belief in collective ownership, where the team responsible for building a service is also truly empowered and responsible for running it in production. It requires leadership to model a new response to failure, treating it not as a punishable offense but as an invaluable, inevitable part of innovation. When these core assumptions shift, the espoused values become authentic, and the artifacts of DevOps naturally follow as their logical expression. A team that genuinely believes in ownership will naturally gravitate towards IaC and robust monitoring because those tools empower them to fulfill their responsibilities. A culture that truly sees failure as a learning opportunity will conduct blameless post-mortems as a matter of course.&lt;/p&gt;

&lt;p&gt;Ultimately, DevOps is not a technical specification or a process framework that can be installed. It is a cultural outcome that emerges from a set of deeply held, shared assumptions about how people work together to build and deliver software. The tools are secondary; they are the means, not the end. The journey to a healthy DevOps culture is not about buying a new platform or mandating a new workflow. It is the much harder, more human work of examining our own unspoken beliefs about trust, failure, and responsibility. It is only when we change those foundational assumptions that we can escape the culture trap and begin to realize the true promise of a collaborative, resilient, and humane way of working.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cicd</category>
      <category>terraform</category>
    </item>
    <item>
      <title>The Over-Abstraction Trap: Why We Need to Stop Over-Engineering Our Infrastructure</title>
      <dc:creator>Sebastian Schürmann</dc:creator>
      <pubDate>Fri, 23 Jan 2026 16:35:50 +0000</pubDate>
      <link>https://forem.com/sebs/the-over-abstraction-trap-why-we-need-to-stop-over-engineering-our-infrastructure-3737</link>
      <guid>https://forem.com/sebs/the-over-abstraction-trap-why-we-need-to-stop-over-engineering-our-infrastructure-3737</guid>
      <description>&lt;p&gt;The arrival of Infrastructure as Code (IaC) promised a fundamental shift in how we manage our digital environments, offering a future where automation, repeatability, and clarity would replace the chaos of manual configuration. Tools like Terraform, Bicep, and the AWS CDK rapidly became industry standards, delivering on that promise by allowing us to version our infrastructure alongside our application code. However, as these tools have matured, a subtle but pervasive anti-pattern has emerged within the industry: a tendency toward excessive abstraction that prioritizes theoretical "best practices" over the practical reality of reading and maintaining code. We have reached a point where the pursuit of "clean code" is ironically leading to systems that are opaque, fragile, and far more difficult to manage than the manual processes they replaced.&lt;/p&gt;

&lt;p&gt;If you have worked in a modern DevOps environment, you have likely encountered the "best practice" trap firsthand. It usually begins when an engineer attempts to define a simple resource, like an S3 bucket or a virtual machine, only to be blocked during code review because they didn't use the company's standardized module. The justification for this pushback is almost always rooted in the principles of software engineering, specifically the desire to keep code DRY (Don't Repeat Yourself) and to enforce governance at scale. Consequently, engineers find themselves under immense social pressure to wrap their simple declarative logic in layers of modules and variable maps, forcing them to defend a straightforward solution against a complex one that is perceived as superior simply because it is more abstract.&lt;/p&gt;

&lt;p&gt;The hidden cost of this approach is that these layers of abstraction are, in effect, software, yet they lack the rigorous testing standards we apply to actual application code. When a team wraps a Terraform resource in complex logic to make it "reusable," they are essentially writing an untested library that sits at the very foundation of their production stack. This introduces a significant amount of cognitive overhead for anyone trying to debug the system later; instead of simply reading a file to see what infrastructure will be deployed, an engineer must mentally compile the code, tracing variables through multiple files and modules to understand the final state. The declarative beauty of "what I want" is entirely lost in favor of the procedural complexity of "how I generate it," and frequently, these "reusable" modules are so tightly coupled to a specific use case that they are never actually reused, rendering the entire exercise a waste of time.&lt;/p&gt;

&lt;p&gt;A stark and refreshing contrast to this trend can be found by observing the community that has grown around the Hetzner Cloud (hcloud) Terraform provider. Unlike the complex, multi-layered architectures often seen in AWS or Azure implementations, the Hetzner community culture embraces a philosophy of aggressive simplicity where configurations are usually flat, explicit, and incredibly easy to read. While an enterprise team might obscure a server definition behind a generic "compute" module with thirty different variable toggles, a typical hcloud user will simply declare the resource directly, specifying the image and server type in plain text. This difference highlights a crucial realization: the tool itself, Terraform, does not mandate complexity; rather, it is the culture surrounding the tool that dictates how it is used.&lt;/p&gt;
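&lt;p&gt;For illustration, the flat style looks something like this. The &lt;code&gt;hcloud_server&lt;/code&gt; resource and its &lt;code&gt;image&lt;/code&gt;, &lt;code&gt;server_type&lt;/code&gt;, and &lt;code&gt;location&lt;/code&gt; arguments come from the Hetzner provider; the specific name and values are placeholders:&lt;/p&gt;

```hcl
terraform {
  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
  }
}

# One server, declared directly: no wrapper module, no variable maps.
# Reading this file tells you exactly what will be deployed.
resource "hcloud_server" "web" {
  name        = "web-1"
  image       = "ubuntu-24.04"
  server_type = "cx22"
  location    = "nbg1"
}
```

&lt;p&gt;There is nothing to mentally compile here: the declarative “what I want” is the whole file.&lt;/p&gt;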

&lt;p&gt;This cultural divergence likely stems from the different motivations of the two groups, but it is also significantly influenced by the extensive training and certification ecosystems that surround major hyperscalers like AWS and Azure. While the Hetzner community often prioritizes immediate utility and speed, the enterprise cloud world naturally leans toward the comprehensive, standardized patterns taught in certification courses and official reference architectures. These frameworks are designed to manage massive scale and complexity, but an unintentional side effect is that teams often adopt these sophisticated structures as the default simply "because it is written," applying enterprise-grade abstraction to projects that might benefit from a lighter touch. It is easy to fall into the habit of implementing a complex pattern just because it aligns with a Well-Architected Framework, rather than stepping back to ask if it effectively serves the specific needs of the current project. Ultimately, the goal is not to reject these established best practices, but to apply them with intention; we must balance the robust standards of the hyperscalers with the practical clarity found in simpler communities, ensuring that our infrastructure code remains a helpful map for our teams rather than just a testament to our compliance with a textbook.&lt;/p&gt;

</description>
      <category>infrastructureascode</category>
      <category>terraform</category>
      <category>cdk</category>
      <category>bicep</category>
    </item>
  </channel>
</rss>
