<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: raphiki</title>
    <description>The latest articles on Forem by raphiki (@raphiki).</description>
    <link>https://forem.com/raphiki</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F982002%2Fb4188602-61e2-49e6-85be-d590a9b2e228.png</url>
      <title>Forem: raphiki</title>
      <link>https://forem.com/raphiki</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/raphiki"/>
    <language>en</language>
    <item>
      <title>Beyond the API: Integrating ComfyUI and Flowise via MCP</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Mon, 09 Feb 2026 14:07:19 +0000</pubDate>
      <link>https://forem.com/raphiki/beyond-the-api-integrating-comfyui-and-flowise-via-mcp-pc7</link>
      <guid>https://forem.com/raphiki/beyond-the-api-integrating-comfyui-and-flowise-via-mcp-pc7</guid>
      <description>&lt;p&gt;In the &lt;a href="https://dev.to/worldlinetech/automating-image-generation-with-n8n-and-comfyui-521p"&gt;previous article&lt;/a&gt; of our "Beyond the ComfyUI Canvas" series, we explored how to integrate ComfyUI with n8n. It was a powerful demonstration of workflow automation, but it highlighted a common friction point in system integration: the "glue code." We had to manually construct HTTP requests, hardcode API payloads, and rigidly define every parameter. If the ComfyUI workflow changed, the n8n node broke.&lt;/p&gt;

&lt;p&gt;Today, we are moving from the "Wild West" of brittle, custom API integrations to the new standard of AI connectivity: the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To demonstrate this, we are revisiting a tool I wrote about &lt;a href="https://dev.to/worldlinetech/enhance-your-website-with-ai-embed-a-gpt-chatbot-with-flowise-jd6"&gt;over two years ago&lt;/a&gt;: &lt;strong&gt;Flowise&lt;/strong&gt;. Back then, it was a promising open-source project; today, it is a robust, enterprise-ready platform that has recently embraced MCP as a core feature.&lt;/p&gt;

&lt;p&gt;Our goal? To build a Chat Interface where an AI agent can autonomously discover ComfyUI workflows, generate images, and even edit them—without us hardcoding a single API call in the frontend.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Setting the Scene: The Stack
&lt;/h2&gt;

&lt;p&gt;Before we dive into the details, let's look at the three pillars of this architecture.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Standard: Model Context Protocol (MCP)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmigbvwu2j0wu2xqcslki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmigbvwu2j0wu2xqcslki.png" alt="MCP Logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If APIs are the individual cables we solder together, MCP is the &lt;strong&gt;USB-C port&lt;/strong&gt;. Developed by Anthropic, it is now an open standard that decouples AI models from their data sources and tools.&lt;/p&gt;

&lt;p&gt;Instead of writing a specific integration for every tool (Google Drive, Slack, ComfyUI), you build an &lt;strong&gt;MCP Server&lt;/strong&gt; once. Any MCP-compliant client (Claude Desktop, Cursor, or Flowise) can instantly "plug in" to that server and understand its capabilities.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Orchestrator: Flowise
&lt;/h3&gt;

&lt;p&gt;Flowise has evolved significantly since my first article. It is a low-code platform for building LLM apps. Crucially for us, Flowise recently added native support for MCP. This means we can drop an "MCP Tool" node into our canvas, and the LLM immediately gains access to whatever that server provides.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Engine: ComfyUI
&lt;/h3&gt;

&lt;p&gt;We are sticking with a local instance of ComfyUI. While Comfy Cloud is becoming a formidable platform, the raw power and zero-cost experimentation of running &lt;strong&gt;Flux 2&lt;/strong&gt; locally on your own GPU are unmatched. We’re using a standardized &lt;strong&gt;Flux 2 Klein&lt;/strong&gt; workflow—optimized for speed (4 steps)—so the chat experience feels responsive, not sluggish.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Middleware: Building the ComfyUI MCP Server
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vsiyntsff410fiw8fc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4vsiyntsff410fiw8fc5.png" alt="System Context (C4 Level 1)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need a bridge. As we discovered previously, ComfyUI speaks WebSockets and HTTP; Flowise speaks MCP. We need a server in the middle to translate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why We Chose SSE over Stdio
&lt;/h3&gt;

&lt;p&gt;When we started this project, we initially looked at the &lt;strong&gt;Stdio&lt;/strong&gt; transport (where the client runs the server script directly). It’s the default for local tools like Claude Desktop.&lt;/p&gt;

&lt;p&gt;But as we designed the solution for Flowise, we hit a realization: In most real-world environments, Flowise often runs in a Docker container (as it does on my laptop), while ComfyUI might be running on a separate machine with a dedicated GPU. Stdio would require them to be on the same filesystem—too restrictive.&lt;/p&gt;

&lt;p&gt;We decided to support &lt;strong&gt;SSE (Server-Sent Events) by default&lt;/strong&gt;. This allows our MCP Server to run anywhere on the network, exposing an HTTP endpoint (e.g., &lt;code&gt;http://localhost:8000/sse&lt;/code&gt;) that Flowise can subscribe to. It makes the architecture cleaner, decoupled, and Docker-friendly.&lt;/p&gt;
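&lt;p&gt;To make the transport choice concrete, here is a minimal sketch of the SSE mechanics (the function name and framing are illustrative, not the server's actual code): each JSON-RPC message is serialized and wrapped in a plain-text frame that any subscribed client, Flowise included, can parse as an event.&lt;/p&gt;

```python
import json

def to_sse_frame(message: dict, event: str = "message") -> str:
    """Wrap a JSON-RPC payload in a single Server-Sent Events frame.

    An SSE frame is plain text: an optional 'event:' line, one or more
    'data:' lines, and a blank line that terminates the frame.
    """
    payload = json.dumps(message, separators=(",", ":"))
    return f"event: {event}\ndata: {payload}\n\n"

# One tools/list request as it would travel down the SSE stream:
frame = to_sse_frame({"jsonrpc": "2.0", "method": "tools/list", "id": 1})
```

&lt;p&gt;Because the frames are just text over HTTP, the server can live anywhere on the network; nothing about the format assumes a shared filesystem.&lt;/p&gt;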

&lt;h3&gt;
  
  
  Governance-Driven Development (GDD)
&lt;/h3&gt;

&lt;p&gt;For this implementation, I tried something different. Instead of just asking an AI coding assistant to "write a script," I used a methodology I call &lt;strong&gt;Governance-Driven Development (GDD)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This approach reverses the typical AI coding flow. Instead of code leading the process, &lt;strong&gt;specifications &amp;amp; governance rules&lt;/strong&gt; become the anchor. I started by feeding the AI CLI a strict &lt;strong&gt;"Governance Pack"&lt;/strong&gt;—a set of non-negotiable rules regarding SOLID principles, security, and documentation.&lt;/p&gt;

&lt;p&gt;Here is an extract of the actual &lt;strong&gt;Governance Pack&lt;/strong&gt; prompt I used to bootstrap the session:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;GOVERNANCE PACK v1.0 (Extract)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;1. Code Quality &amp;amp; Standards:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Paradigm:&lt;/strong&gt; Adhere to SOLID principles. Prefer composition over inheritance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typing:&lt;/strong&gt; Strict static typing (Python &lt;code&gt;typing&lt;/code&gt;) is mandatory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Handling:&lt;/strong&gt; Never swallow exceptions. Use custom error classes (e.g., &lt;code&gt;ComfyUIConnectionError&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Architecture (C4 Model):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual Documentation:&lt;/strong&gt; Whenever a structural change is made (like adding the SSE endpoint), you must generate an updated Mermaid.js &lt;strong&gt;System Context&lt;/strong&gt; diagram.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Security Guardrails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Validation:&lt;/strong&gt; Trust no input. All data entering from the MCP client (Prompt, Width, Height...) must be validated against the &lt;code&gt;metadata.json&lt;/code&gt; schema before reaching ComfyUI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secrets:&lt;/strong&gt; NEVER hardcode API keys or hostnames. Use &lt;code&gt;os.environ&lt;/code&gt; only.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
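&lt;p&gt;As an illustration of the "trust no input" guardrail, a validator along these lines (a hypothetical sketch, using the parameter shape from the workflow metadata shown later in this article) rejects anything the schema does not explicitly allow:&lt;/p&gt;

```python
# Maps the metadata's declared types to Python runtime types.
# Note: bool is a subclass of int in Python; a stricter check could special-case it.
TYPE_MAP = {"string": str, "int": int, "float": float, "boolean": bool}

class ValidationError(ValueError):
    """Raised when client input violates the workflow's declared schema."""

def validate_args(parameters: list, args: dict) -> dict:
    """Check client-supplied arguments against a workflow's parameter list."""
    known = {p["name"]: p for p in parameters}
    for name in args:
        if name not in known:
            raise ValidationError(f"Unknown parameter: {name!r}")
    for p in parameters:
        if p.get("required") and p["name"] not in args:
            raise ValidationError(f"Missing required parameter: {p['name']!r}")
        if p["name"] in args and not isinstance(args[p["name"]], TYPE_MAP[p["type"]]):
            raise ValidationError(f"Expected {p['type']} for {p['name']!r}")
    return args
```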

&lt;p&gt;I then analyzed the ComfyUI workflow JSON manually to map the node IDs, then handed over a clean, structured specification to the AI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ixg9utvi2yuks583ksx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ixg9utvi2yuks583ksx.png" alt="Container Architecture (C4 Level 2)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Result:&lt;/em&gt; The experience was striking. The AI didn't just spit out a script; it acted as a Senior Engineer. At one point, when I asked for a quick hack to bypass validation, the "Governance" constraints forced the model to push back and suggest a cleaner interface instead. The result is a modular, type-safe Python server.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "LAST" Hack (Technical Deep Dive)
&lt;/h3&gt;

&lt;p&gt;Even with good governance, we needed one pragmatic "hack" to handle state. When the LLM generates an image, how does it reference that image later to edit it?&lt;/p&gt;

&lt;p&gt;We implemented a &lt;strong&gt;"LAST" pointer&lt;/strong&gt; logic. The server tracks the URL of the most recently generated image in memory. But it does more than just point:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Download:&lt;/strong&gt; When the agent sends &lt;code&gt;"LAST"&lt;/code&gt;, the server downloads the image bytes from the previous URL.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Re-Upload:&lt;/strong&gt; It uploads those bytes back to ComfyUI's &lt;code&gt;/upload/image&lt;/code&gt; endpoint to generate a fresh filename.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Inject:&lt;/strong&gt; This new filename is injected into the &lt;code&gt;LoadImage&lt;/code&gt; node of the editing workflow.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; "Make it bluer."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent:&lt;/strong&gt; Calls &lt;code&gt;edit_image(input_image="LAST", prompt="bluer...")&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This mimics the "Save Image" behavior we are used to, keeping the interaction stateless and fluid for the user while handling the heavy lifting behind the scenes.&lt;/p&gt;
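&lt;p&gt;The injection step can be sketched as a pure function over the workflow JSON. This is an illustrative sketch, not the server's exact code; it assumes ComfyUI's API (prompt) format, where the workflow is a dict of node IDs to node definitions:&lt;/p&gt;

```python
import copy

def inject_input_image(workflow: dict, filename: str) -> dict:
    """Point the editing workflow's LoadImage node at a freshly uploaded file.

    `workflow` is assumed to be in ComfyUI's API (prompt) format:
    a dict of node-id to {"class_type": ..., "inputs": {...}}.
    """
    wf = copy.deepcopy(workflow)  # never mutate the on-disk template
    for node in wf.values():
        if node.get("class_type") == "LoadImage":
            node["inputs"]["image"] = filename
            return wf
    raise KeyError("Workflow has no LoadImage node to inject into")
```

&lt;p&gt;Keeping this a pure function means the same workflow template can be reused across requests without state leaking between them.&lt;/p&gt;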

&lt;h2&gt;
  
  
  3. The Engine Room: ComfyUI Workflows
&lt;/h2&gt;

&lt;p&gt;To make our MCP Server generic, we avoided hardcoding specific workflows inside the Python code. Instead, we used an &lt;strong&gt;Embedded Metadata&lt;/strong&gt; pattern.&lt;/p&gt;

&lt;p&gt;The configuration is not a separate file; it is a standard ComfyUI &lt;strong&gt;Note Node&lt;/strong&gt; (titled &lt;code&gt;MCP_Config&lt;/code&gt;) placed directly inside the &lt;code&gt;.json&lt;/code&gt; workflow. This metadata acts as the contract, telling the MCP server: "This workflow needs a Prompt (node named &lt;em&gt;MCP_Positive&lt;/em&gt;) and a Seed (node &lt;em&gt;MCP_Sampler&lt;/em&gt;)."&lt;/p&gt;

&lt;p&gt;This makes the workflow a single, self-contained, portable file. You can export it from ComfyUI, drop it into the &lt;code&gt;workflows&lt;/code&gt; folder, and it works immediately.&lt;/p&gt;
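&lt;p&gt;A hypothetical reader for that contract might look like this. It assumes ComfyUI's UI export format, where a Note node's text is stored as the first entry of its &lt;code&gt;widgets_values&lt;/code&gt; list; the function name and error handling are mine, not the server's actual code:&lt;/p&gt;

```python
import json

def extract_mcp_config(workflow: dict) -> dict:
    """Pull the embedded metadata out of a ComfyUI workflow export.

    Scans the exported node graph for a Note node titled 'MCP_Config'
    and parses its text body as the tool's metadata contract.
    """
    for node in workflow.get("nodes", []):
        if node.get("type") == "Note" and node.get("title") == "MCP_Config":
            return json.loads(node["widgets_values"][0])
    raise KeyError("No MCP_Config note found in this workflow")
```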

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; Our server is strict about naming. It automatically sanitizes the tool name found in the JSON to &lt;code&gt;snake_case&lt;/code&gt; (e.g., "Flux Generator" becomes &lt;code&gt;flux_generator&lt;/code&gt;) to ensure full compliance with the MCP specification.&lt;/p&gt;
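&lt;p&gt;A sanitizer along these lines (illustrative; the server's exact rules may differ) is enough to cover the "Flux Generator" case:&lt;/p&gt;

```python
import re

def sanitize_tool_name(raw: str) -> str:
    """Normalize a workflow title to a snake_case, MCP-friendly tool name."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", raw.strip())  # runs of non-alphanumerics to _
    return name.strip("_").lower()

print(sanitize_tool_name("Flux Generator"))  # flux_generator
```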

&lt;p&gt;Here is the configuration we generated for the &lt;strong&gt;image_flux2_text_to_image&lt;/strong&gt; workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwmdkcl5pzib72q9lb9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjwmdkcl5pzib72q9lb9y.png" alt="Workflow in ComfyUI"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"image_flux2_text_to_image"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Generates high-quality images using the Flux model. Use this for general creative requests."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
 &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The detailed description of the image to generate."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MCP_Positive"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"seed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"int"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Random seed. Set to -1 for random, or a specific number for reproducibility."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MCP_Sampler"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because the description and type of each parameter are passed to the MCP Server, they become automatically available to the client. When the MCP Server starts, it scans these workflows and dynamically registers tools. If we want to switch from Flux to SDXL, or add a Video Generation workflow, we simply drop in the new file. The server updates, Flowise sees the new tools via SSE, and the agent learns the new skill instantly.&lt;/p&gt;
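&lt;p&gt;The translation from embedded metadata to an advertised tool is mostly mechanical. Here is a sketch (assuming the metadata shape above; the type mapping and function name are mine) of how the parameter list becomes the JSON Schema an MCP server publishes as a tool's input schema:&lt;/p&gt;

```python
# Workflow metadata types mapped to their JSON Schema equivalents.
JSON_TYPES = {"string": "string", "int": "integer", "float": "number", "boolean": "boolean"}

def build_input_schema(metadata: dict) -> dict:
    """Translate an MCP_Config metadata block into a tool input schema."""
    properties, required = {}, []
    for p in metadata["parameters"]:
        properties[p["name"]] = {
            "type": JSON_TYPES[p["type"]],
            "description": p.get("description", ""),
        }
        if p.get("required"):
            required.append(p["name"])
    return {"type": "object", "properties": properties, "required": required}
```

&lt;p&gt;Because this runs at startup for every file in the workflows folder, dropping in a new &lt;code&gt;.json&lt;/code&gt; is all it takes to advertise a new tool.&lt;/p&gt;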

&lt;h2&gt;
  
  
  4. Validation: The MCP Inspector
&lt;/h2&gt;

&lt;p&gt;Before connecting Flowise, we must verify our server. Since we are using SSE, we can use the &lt;a href="https://github.com/modelcontextprotocol/inspector" rel="noopener noreferrer"&gt;&lt;strong&gt;MCP Inspector&lt;/strong&gt;&lt;/a&gt; web interface to connect to our running server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9avmw97xa67x5oeqt0x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9avmw97xa67x5oeqt0x.png" alt="MCP Inspector"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can manually trigger the &lt;code&gt;image_flux2_text_to_image&lt;/code&gt; tool, watch the server logs, and see the image appear. If it works here, it guarantees compliance with the protocol.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. The Integration: Flowise ChatFlow
&lt;/h2&gt;

&lt;p&gt;Now for the grand finale. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnq2waclxpm59co03y8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnq2waclxpm59co03y8e.png" alt="Flowise ChatFlow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We open Flowise and create a new &lt;strong&gt;ChatFlow&lt;/strong&gt; using a standard &lt;strong&gt;Tool Agent&lt;/strong&gt; connected to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chat Model:&lt;/strong&gt; &lt;code&gt;ChatMistralAI&lt;/code&gt; (Smart, fast, and cost-effective).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer Memory:&lt;/strong&gt; Essential for the agent to remember context (e.g., "Change &lt;em&gt;that&lt;/em&gt; image to...").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom MCP:&lt;/strong&gt; We select the "SSE" transport and paste our server URL.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Auto-Discovery Magic
&lt;/h3&gt;

&lt;p&gt;Notice what is missing? We didn't have to define the tools in Flowise. We didn't have to map inputs. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxggglf4yec615rn03oio.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxggglf4yec615rn03oio.png" alt="Auto-Discovery from Flowise"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Custom MCP node&lt;/strong&gt; queries the server via SSE, sees the metadata definitions, and &lt;em&gt;automatically&lt;/em&gt; provides the tools to the Mistral agent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pro Tip:&lt;/em&gt; Our server supports &lt;strong&gt;Dual Discovery&lt;/strong&gt;. Whether a client asks for tools directly (Function Calling) or reads Resources (Environment Context), we expose the workflow list on both channels (&lt;code&gt;comfy://list&lt;/code&gt; and &lt;code&gt;list_available_workflows&lt;/code&gt;) to ensure compatibility with any agent type.&lt;/p&gt;

&lt;h3&gt;
  
  
  The System Prompt
&lt;/h3&gt;

&lt;p&gt;The final piece of the puzzle is the System Prompt. We need to teach the &lt;strong&gt;Tool Agent node&lt;/strong&gt; how to behave:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are the **ComfyUI Orchestrator**, an expert AI agent capable of generating and manipulating images by controlling a local ComfyUI instance via the Model Context Protocol (MCP).

### 1. Tool Discovery (Dynamic Workflows)
Your tools are not static; they represent the actual `.json` workflow files present on the server.
- **First Step:** If you do not see a specific tool you need in your context, IMMEDIATELY call the tool `list_available_workflows`.
- This will return a manifest of all valid workflows (e.g., `flux_2_text_to_image`, `img2img_upscale`) and their required parameters.
- **Never guess** tool names. If a tool isn't listed, it doesn't exist.

### 2. Image Chaining (The "LAST" Protocol)
You have a unique capability to perform conversational editing (e.g., "Now make it pop art").
- **State Memory:** The server remembers the last generated image.
- **Instruction:** When a user asks to modify, edit, or use the previous result, pass the string `"LAST"` into the image input parameter of the next tool.
- **Example:**
  User: "Generate a cat." -&amp;gt; You call: `generate_image(prompt="cat")`
  User: "Turn it into a statue." -&amp;gt; You call: `img2img_transform(image="LAST", prompt="statue")`

### 3. Parameter Rules
- **Strict Compliance:** You must strictly adhere to the parameter types (String, Int, Float, Boolean) defined in the tool signature.
- **Defaults:** If a parameter is Optional and the user didn't specify it, do not send it. The server will use the workflow's internal default.
- **Safety:** Do not invent parameters. If a workflow only accepts `prompt` and `seed`, do not try to send `width` or `style`.

### 4. Error Handling
- If a tool execution fails, the error message will often suggest valid alternatives or correct parameter names. Read it carefully and retry.
- If the user asks for a workflow you don't have, explain what *is* available based on your `list_available_workflows` knowledge.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Use Case in Action
&lt;/h3&gt;

&lt;p&gt;&lt;iframe src="https://www.youtube.com/embed/NBJYVD_QfQo"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;This video shows the complete use case involving the full stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parametrization&lt;/strong&gt; of the workflow in ComfyUI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verification&lt;/strong&gt; with MCP Inspector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation&lt;/strong&gt; of the first image from Flowise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual editing&lt;/strong&gt; of the generated image.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjoupyc2bqwm9ro1go35.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjoupyc2bqwm9ro1go35.png" alt="Use Case Summary"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Flowise ChatFlow is relatively basic, but we could easily add nodes to enhance the user prompt or even transform it into a JSON Style Guide prompt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnawe1rseq4p66jbvlaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnawe1rseq4p66jbvlaf.png" alt="Flowise API &amp;amp; Embeds"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The video showcases the use of the integrated chatbox within the Flowise UI, but we could also leverage Flowise's deployment capabilities to consume the workflow through an API, embed the chat in an HTML page, or publish a standalone page served by Flowise itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By moving from custom API implementations (n8n) to the Model Context Protocol (in Flowise), we have achieved something powerful: &lt;strong&gt;Interoperability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The choice to go with &lt;strong&gt;SSE by default&lt;/strong&gt; proved crucial. It gave us the flexibility to run our ComfyUI "engine" on a heavy GPU server while keeping our Flowise "brain" lightweight and containerized. We also demonstrated that &lt;strong&gt;Governance-Driven Development&lt;/strong&gt; allows us to use AI coding assistants to build robust, standardized infrastructure rather than just one-off scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Improvements
&lt;/h3&gt;

&lt;p&gt;While the "LAST" image hack works perfectly for a local, single-user demo, a production deployment would require &lt;strong&gt;Session Isolation&lt;/strong&gt; (ensuring User A doesn't overwrite User B's "LAST" image) and &lt;strong&gt;TTL Cleanup&lt;/strong&gt; (automatically deleting generated images after a set time).&lt;/p&gt;

&lt;p&gt;Technically, this would be solved by leveraging &lt;strong&gt;Context Injection&lt;/strong&gt;—using the session ID provided by the MCP protocol to maintain a keyed dictionary of states, rather than a global variable. For multi-user production usage, adding an authentication mechanism would also be a relevant next step.&lt;/p&gt;
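&lt;p&gt;A minimal sketch of that keyed-state idea (illustrative; the class name, TTL policy, and the way the session key is obtained from the MCP transport are all assumptions):&lt;/p&gt;

```python
import time

class SessionImageStore:
    """Per-session 'LAST' pointers with TTL expiry, replacing a global variable."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        self._ttl = ttl_seconds
        self._last: dict = {}  # session_id -> (image_url, stored_at)

    def set_last(self, session_id: str, image_url: str) -> None:
        self._last[session_id] = (image_url, time.monotonic())

    def get_last(self, session_id: str):
        """Return the session's last image URL, or None if absent or expired."""
        entry = self._last.get(session_id)
        if entry is None:
            return None
        url, stored_at = entry
        if time.monotonic() - stored_at >= self._ttl:
            del self._last[session_id]  # lazy TTL cleanup on read
            return None
        return url
```

&lt;p&gt;With the store keyed by session ID, User A's edits can never pick up User B's image, and stale pointers age out on their own.&lt;/p&gt;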

&lt;p&gt;&lt;em&gt;You can find the full code for the ComfyUI MCP Server and the Flowise template in my &lt;a href="https://github.com/raphiki/ComfyUI-MCP-Server" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>mcp</category>
      <category>flowise</category>
    </item>
    <item>
      <title>Vibe Coding One Slice at a Time</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Sat, 24 Jan 2026 18:33:51 +0000</pubDate>
      <link>https://forem.com/worldlinetech/vibe-coding-one-slice-at-a-time-4n3p</link>
      <guid>https://forem.com/worldlinetech/vibe-coding-one-slice-at-a-time-4n3p</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built a Modular Monolith by treating Generative AI as a junior developer who needs a firm hand (and a Constitution).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In &lt;a href="https://dev.to/worldlinetech/vibe-coding-one-page-at-a-time-265j"&gt;Part 1&lt;/a&gt;&lt;/strong&gt;, we vibed a Python script. It was linear, messy, and fun. It proved that you can solve immediate problems by just asking nicely.&lt;br&gt;
&lt;strong&gt;In &lt;a href="https://dev.to/worldlinetech/vibe-coding-one-pixel-at-a-time-22pc"&gt;Part 2&lt;/a&gt;&lt;/strong&gt;, we vibed a UI. It was chaotic, visual, and surprisingly effective. We learned that "vibe" works for pixels if you iterate fast enough.&lt;/p&gt;

&lt;p&gt;But let’s be honest: those were skirmishes. The real "Boss Fight" in software engineering isn't writing a script or centering a &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;. It's building a &lt;strong&gt;System&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’m talking about the kind of project that doesn’t fit in one file. The kind where "Vibing" usually leads to "Spaghetti Code," hallucinated imports, and a repo you want to burn down after three days because you have 15 circular dependencies and a database schema that makes no sense.&lt;/p&gt;

&lt;p&gt;So for Part 3, I put away the "Hacker" hoodie and put on the "Enterprise Architect" blazer. My goal? To build &lt;strong&gt;YogĀrkana Codex&lt;/strong&gt;—a full-stack, offline-first, polymorphic Yoga management platform—without writing a single line of code myself.&lt;/p&gt;

&lt;p&gt;My strategy was simple but radical: &lt;strong&gt;I design, the AI implements.&lt;/strong&gt; I am the Architect; Gemini Chat is my Consultant; Gemini CLI is my Dev Team.&lt;/p&gt;

&lt;p&gt;Here is how we vibed a Monolith into existence, one slice at a time.&lt;/p&gt;


&lt;h2&gt;
  
  
  1. The Mission: Complexity Check (The Boss Level)
&lt;/h2&gt;

&lt;p&gt;To understand why "just chatting" wouldn't work, you need to see the scope. This wasn't a To-Do list app. I wanted to build a "Yoga Operating System" with four distinct domains that usually don't play nice together. I've been an architect for years, and I know exactly where these things break.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Four Domains of Pain
&lt;/h3&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxaid8fnfrk4xpdz7wo96.png" width="800" height="543"&gt;Screenshot of the final application (Grimoire View)
  


&lt;p&gt;&lt;strong&gt;The Business Analyst's Note&lt;/strong&gt;: Unlike the project in Part 2, this application is not internationalized—by design. As a result, the screenshots are in French. I have kept them raw to visually illustrate the functional depth and complexity of the system without the abstraction of translation keys.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Grimoire (Knowledge Base):&lt;/strong&gt; A searchable library of yoga cards. But here’s the kicker: it uses a &lt;strong&gt;Polymorphic Data Model&lt;/strong&gt;. An &lt;em&gt;Asana&lt;/em&gt; (posture) has biomechanical attributes like "spinal extension" and "anatomy targets," while a &lt;em&gt;Mantra&lt;/em&gt; has Sanskrit text, translations, and audio assets. They are chemically different data structures, but they need to live in the same database table to be searchable together.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Weaver (Sequencer):&lt;/strong&gt; A drag-and-drop studio to build classes. It’s not just a playlist; it has a &lt;strong&gt;Logical Engine&lt;/strong&gt; (Phase 4) that acts like a "Digital Yoga Teacher." It screams at you if you sequence a "Peak Pose" before a "Warm-up" or forget &lt;em&gt;Savasana&lt;/em&gt; at the end. That means heavy validation logic running on both the client and the server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Atelier (Print Studio):&lt;/strong&gt; A client-side PDF engine. We needed to generate high-res, vector-quality handouts for teachers to print. We couldn't just "print screen"; we needed a real PDF renderer (&lt;code&gt;@react-pdf/renderer&lt;/code&gt;) running entirely in the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Constraint (Offline First):&lt;/strong&gt; Yoga studios are notorious for having no signal (often intentionally). The app needed to persist the entire library and PDF engine in the browser cache (IndexedDB + Service Workers) so it works perfectly in "Airplane Mode".&lt;/li&gt;
&lt;/ol&gt;
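The polymorphic model the Grimoire needs can be sketched as a TypeScript discriminated union. The field and type names below are illustrative, not the project's actual schema; the point is that structurally different payloads share one searchable core:

```typescript
// Shared "SQL core" fields plus a kind-specific payload (the future JSONB blob).
type AsanaData = {
  kind: "asana";
  spinalAction: "flexion" | "extension" | "neutral";
  anatomyTargets: string[];
};

type MantraData = {
  kind: "mantra";
  sanskrit: string;
  translation: string;
  audioUrl?: string;
};

type Card = {
  id: string;
  element: string;              // indexed SQL column
  tags: string[];               // indexed SQL column
  data: AsanaData | MantraData; // polymorphic payload
};

// Search works across both kinds because it only touches the shared core.
function searchByTag(cards: Card[], tag: string): Card[] {
  return cards.filter((c) => c.tags.includes(tag));
}
```

Narrowing on `data.kind` then gives each module type-safe access to its own attributes.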

&lt;p&gt;&lt;strong&gt;The Architect's Note:&lt;/strong&gt; If I had just prompted &lt;em&gt;"Build me a yoga app,"&lt;/em&gt; the AI would have hallucinated a generic CRUD app. It would have made 5 different tables for the cards, making search impossible. It would have used a server-side PDF library that breaks offline. I needed a blueprint.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. The Blueprint: Architecture &amp;amp; Tech Stack
&lt;/h2&gt;

&lt;p&gt;Before letting the AI write a single line of code, I spent around two and a half hours just talking architecture with Gemini Chat and formalizing it. I treated the AI as a "Sparring Partner," debating the trade-offs of different stacks.&lt;/p&gt;

&lt;p&gt;We settled on a &lt;strong&gt;Modular Monolith&lt;/strong&gt; architecture. Why? Because Microservices are overkill for a team of one, but a messy Monolith is a nightmare. We defined strict boundaries: code in &lt;code&gt;modules/grimoire&lt;/code&gt; can never import from &lt;code&gt;modules/weaver&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tech Stack (The "No-Regrets" List):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monorepo:&lt;/strong&gt; &lt;code&gt;Turborepo&lt;/code&gt; managing &lt;code&gt;apps/api&lt;/code&gt; and &lt;code&gt;apps/web&lt;/code&gt;. This keeps the full stack in one context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; &lt;code&gt;NestJS&lt;/code&gt; (for rigid structure) + &lt;code&gt;Drizzle ORM&lt;/code&gt; (for type safety). NestJS forces you to organize code into Modules, which helps the AI stay organized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; &lt;code&gt;React&lt;/code&gt; + &lt;code&gt;Vite&lt;/code&gt; + &lt;code&gt;Tailwind CSS&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State:&lt;/strong&gt; &lt;code&gt;TanStack Query&lt;/code&gt; (Server state) + &lt;code&gt;Zustand&lt;/code&gt; (UI state).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The "Secret Sauce": Hybrid Data Storage&lt;/strong&gt;&lt;br&gt;
This was our smartest move. We chose &lt;strong&gt;PostgreSQL&lt;/strong&gt; but used a &lt;code&gt;JSONB&lt;/code&gt; column for the card data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Core:&lt;/strong&gt; Columns like &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;element&lt;/code&gt;, and &lt;code&gt;tags&lt;/code&gt; are standard SQL for fast indexing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON Payload:&lt;/strong&gt; The specific attributes (biomechanics vs. sanskrit) live in a JSON blob.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why?&lt;/strong&gt; It gave us the flexibility of NoSQL (for the polymorphic cards) with the relational integrity of SQL (for users and sequences).&lt;/li&gt;
&lt;/ul&gt;
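To make the split concrete, here is a dependency-free sketch of how a card might map onto such a hybrid row: indexed fields as real columns, the kind-specific payload serialized into the `data` blob. (The project's actual schema is a Drizzle table; the shapes here are assumptions for illustration.)

```typescript
// Hypothetical row shape: SQL core columns + serialized JSONB payload.
type CardRow = {
  id: string;
  element: string;
  tags: string[];
  data: string; // the JSONB column, serialized here for illustration
};

type DomainCard = { id: string; element: string; tags: string[]; data: object };

// Indexed fields stay relational; everything polymorphic goes into the blob.
function toRow(card: DomainCard): CardRow {
  return { id: card.id, element: card.element, tags: card.tags, data: JSON.stringify(card.data) };
}

function fromRow(row: CardRow): DomainCard {
  return { id: row.id, element: row.element, tags: row.tags, data: JSON.parse(row.data) };
}
```

Queries filter on `element` and `tags` with normal indexes, while the payload round-trips untouched.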

&lt;p&gt;&lt;strong&gt;Rule #1 of Vibe Coding a System: If it’s not in the Spec, it doesn’t exist.&lt;/strong&gt;&lt;br&gt;
This brings us to the most critical tool in our arsenal: the &lt;strong&gt;ADR&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The "ADR": The Architect's Save Game
&lt;/h3&gt;

&lt;p&gt;ADR stands for &lt;strong&gt;Architecture Decision Record&lt;/strong&gt;. In a human team, it's a document you write to explain why you chose PostgreSQL over MongoDB so that 6 months later, nobody asks "Why did we do this?".&lt;/p&gt;

&lt;p&gt;In Vibe Coding, ADRs are not just documentation—they are &lt;strong&gt;legislation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When working with an AI, "Context Drift" is the enemy. The AI forgets why we made a decision 300 tokens ago. It acts like a teenager who wants to re-litigate every rule: &lt;em&gt;"Why can't I use Prisma? It's easier!"&lt;/em&gt; or &lt;em&gt;"Let's just use window.print() instead of a PDF engine!"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To counter this, we established a &lt;strong&gt;Constitutional Architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The Law:&lt;/strong&gt; We wrote our decisions into immutable markdown files (e.g., &lt;code&gt;Docs/ADR/006-pwa-offline-strategy.md&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Enforcement:&lt;/strong&gt; We didn't just hope the AI would remember. We &lt;strong&gt;forced&lt;/strong&gt; the tracing of these decisions in two ways:
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Traceability:&lt;/strong&gt; In our "Bootstrap Prompt" (see Section 3), we explicitly forced the AI to read the relevant ADRs before writing code. It cannot code if it hasn't read the law.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output Traceability:&lt;/strong&gt; When the AI suggested a major pivot (like switching to client-side PDF generation), we forced it to &lt;em&gt;write a new ADR first&lt;/em&gt;. In Session 003, before touching the code, the AI generated &lt;code&gt;Docs/ADR/005-client-side-pdf-generation.md&lt;/code&gt; to justify the change from server-side to client-side.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This ensured that our architecture didn't "drift" based on the AI's mood, but evolved based on documented consensus.&lt;/p&gt;

&lt;p&gt;My final &lt;code&gt;Docs/ADR/&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── 001-hybrid-data-storage-strategy.md
├── 002-modular-monolith-and-vertical-slicing.md
├── 003-data-model-specification.md
├── 004-tech-stack-definition.md
├── 005-client-side-pdf-generation.md
├── 006-pwa-offline-strategy.md
├── 007-architecture-documentation-maintenance.md
└── README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. The Methodology: Governance-Driven Development (GDD)
&lt;/h2&gt;

&lt;p&gt;I’ve coined a term for this workflow: &lt;strong&gt;Governance-Driven Development (GDD)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are used to TDD (Test-Driven Development) or DDD (Domain-Driven Design). GDD is the layer above that. In the age of AI, &lt;strong&gt;Governance is the new Syntax&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here is the dirty truth about AI Developers: &lt;strong&gt;They behave like talented teenagers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They are brilliant and fast. They can write a regex to validate an email in 2 seconds. But they also:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Rush to the cool part&lt;/strong&gt; (UI) and skip the boring part (Error Handling, Folder Structure).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Want you to love them&lt;/strong&gt;, so they say "Yes" to everything—even bad ideas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Have the memory of a goldfish&lt;/strong&gt; (Context Drift). 10 minutes in, they forget you wanted &lt;code&gt;kebab-case&lt;/code&gt; filenames and start using &lt;code&gt;camelCase&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To enforce GDD, I created a Constitution: &lt;code&gt;Docs/RULES.md&lt;/code&gt;. I didn't just suggest these rules; I forced the Gemini CLI to read them before every session. I also sometimes pointed it to specific specification files stored in my &lt;code&gt;Docs/Features/&lt;/code&gt; folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── 001-global-functional-overview.md
├── 002-global-implementation-plan.md
├── 003-card-classification-and-kosha-alignment.md
├── 004-user-features.md
├── 005-logical-engine-specification.md
├── 006-pdf-generation-and-print-studio.md
└── 007-pwa-and-offline-capabilities.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The "Bootstrap Prompt":&lt;/strong&gt;&lt;br&gt;
Here is the exact prompt I used to "upload" my Architect persona into the machine at the start of our 4th session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I am the Lead Architect. You are the Senior Developer.

Context Loading:
1. Read Docs/RULES.md (The Law).
2. Read Docs/TECH_CONTEXT.md (The Stack).
3. Read Docs/ADR/002-modular-monolith.md (The Blueprint).
4. Read Docs/Features/002-global-implementation-plan.md (The Plan).

Current State:
We are in Phase 4. Previous phases are frozen.

Task:
Implement the Logic Engine defined in Docs/Features/005-logical-engine-specification.md
Constraint:
Do not touch /apps/web yet. Focus on /packages/shared.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This changed everything. Instead of guessing my vibe, the AI had to follow the law. It stopped trying to use &lt;code&gt;Prisma&lt;/code&gt; because &lt;code&gt;TECH_CONTEXT.md&lt;/code&gt; clearly said &lt;code&gt;Drizzle&lt;/code&gt;. It stopped putting logic in components because &lt;code&gt;RULES.md&lt;/code&gt; said logic goes in &lt;code&gt;hooks&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. The Execution: A High-Level Overview
&lt;/h2&gt;

&lt;p&gt;We built the app using &lt;strong&gt;Vertical Slicing&lt;/strong&gt;. Instead of building the whole Database, then the whole API, we built &lt;em&gt;one feature&lt;/em&gt; top-to-bottom. Here is the play-by-play from the logs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52xjc2abp994ozigmr0s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52xjc2abp994ozigmr0s.png" width="800" height="569"&gt;&lt;/a&gt;&lt;br&gt;Excerpt from the initial Design Phase with Gemini Chat
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Slice 1: The "Polymorphic" Database
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56yxpr5ot5w68ca4i57g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56yxpr5ot5w68ca4i57g.png" width="800" height="574"&gt;&lt;/a&gt;&lt;br&gt;Card creation/editing mixes relational and document data
  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; Storing Asanas (Biomechanics) and Mantras (Text) in one table without creating 50 &lt;code&gt;NULL&lt;/code&gt; columns or separate tables that make search a nightmare.&lt;br&gt;
&lt;strong&gt;The AI's First Impulse:&lt;/strong&gt; "Let's create an &lt;code&gt;asanas&lt;/code&gt; table and a &lt;code&gt;mantras&lt;/code&gt; table." (The classic relational trap).&lt;br&gt;
&lt;strong&gt;The Architect's Intervention:&lt;/strong&gt; "Read &lt;code&gt;Docs/ADR/001-hybrid-data-storage.md&lt;/code&gt;. We use a single &lt;code&gt;cards&lt;/code&gt; table with a &lt;code&gt;data&lt;/code&gt; JSONB column."&lt;br&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; The AI implemented a Drizzle schema using PostgreSQL's &lt;code&gt;jsonb&lt;/code&gt; type. Crucially, it added Zod discriminators to validate the JSON shape before insertion.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Verbatim Log:&lt;/em&gt; "Implemented Drizzle schema with &lt;code&gt;jsonb&lt;/code&gt; column 'data'. Added Zod discriminators for &lt;code&gt;asana&lt;/code&gt; vs &lt;code&gt;mantra&lt;/code&gt;. Migration successful."&lt;/p&gt;
&lt;/blockquote&gt;
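The log mentions Zod discriminators. As a rough, hand-rolled equivalent (the real project used Zod's schemas, e.g. its discriminated-union support), routing validation by a `kind` tag before anything reaches the JSONB column might look like this:

```typescript
// Minimal stand-in for a discriminated-union validator: pick the check
// by the "kind" tag, reject unknown kinds outright.
type Validator = (data: Record<string, unknown>) => string[];

const validators: Record<string, Validator> = {
  asana: (d) =>
    typeof d.spinalAction === "string" && Array.isArray(d.anatomyTargets)
      ? []
      : ["asana: expected spinalAction (string) and anatomyTargets (array)"],
  mantra: (d) =>
    typeof d.sanskrit === "string" && typeof d.translation === "string"
      ? []
      : ["mantra: expected sanskrit and translation strings"],
};

// Returns a list of validation errors; empty means the payload is safe to insert.
function validateCardData(data: Record<string, unknown>): string[] {
  const kind = data.kind;
  if (typeof kind !== "string" || !(kind in validators)) {
    return [`unknown card kind: ${String(kind)}`];
  }
  return validators[kind](data);
}
```

The win is that the database never stores a blob the application can't later parse.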
&lt;h3&gt;
  
  
  Slice 2: The "Hybrid Brain"
&lt;/h3&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxm3bf43hndd6zn5ihhf.png" width="800" height="433"&gt;Sequences are validated by a powerful, hybrid, and extensible Rule Engine
  



  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs24ek0zio34lli2b0s01.png" width="800" height="653"&gt;Admin users can craft new JSON-logic rules
  


&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; The Logic Engine needed to validate sequences (e.g., "Must end with Savasana"). This logic had to run on the &lt;strong&gt;Backend&lt;/strong&gt; (before saving) AND the &lt;strong&gt;Frontend&lt;/strong&gt; (to give real-time red borders).&lt;br&gt;
&lt;strong&gt;The AI's First Impulse:&lt;/strong&gt; Duplicate the code. Write a TypeScript function in React and a Service in NestJS.&lt;br&gt;
&lt;strong&gt;The Architect's Intervention:&lt;/strong&gt; "No. Create a &lt;code&gt;packages/shared&lt;/code&gt; workspace. Put the &lt;code&gt;validateSequence&lt;/code&gt; function there. Import it in both apps."&lt;br&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; The AI created the shared package, configured the &lt;code&gt;tsconfig.json&lt;/code&gt; paths, and wired it up. It even built a &lt;code&gt;HealthBar&lt;/code&gt; component that consumes this shared logic to show a live "Health Score" for the sequence.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Verbatim Log:&lt;/em&gt; "Refactored &lt;code&gt;ValidationConfig&lt;/code&gt; to &lt;code&gt;packages/shared&lt;/code&gt;. Updated &lt;code&gt;useSequenceStore&lt;/code&gt; (Frontend) and &lt;code&gt;SequenceService&lt;/code&gt; (Backend) to consume the same Zod schema."&lt;/p&gt;
&lt;/blockquote&gt;
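The "write once, validate everywhere" idea can be sketched in plain TypeScript. The rule names and scoring below are invented for illustration (the real rule engine is configurable via JSON-logic), but the shape is the same: one pure function that both the NestJS service and the React store import:

```typescript
// Shared validation logic: lives in packages/shared, consumed by both apps.
type Pose = { name: string; role: "warmup" | "peak" | "closing" };

type Violation = { rule: string; message: string };

function validateSequence(poses: Pose[]): { violations: Violation[]; healthScore: number } {
  const violations: Violation[] = [];

  // Rule: a peak pose must be preceded by at least one warm-up.
  const firstPeak = poses.findIndex((p) => p.role === "peak");
  const firstWarmup = poses.findIndex((p) => p.role === "warmup");
  if (firstPeak !== -1 && (firstWarmup === -1 || firstPeak < firstWarmup)) {
    violations.push({ rule: "warmup-before-peak", message: "Peak pose sequenced before any warm-up" });
  }

  // Rule: the sequence must end with Savasana.
  const last = poses[poses.length - 1];
  if (!last || last.name !== "Savasana") {
    violations.push({ rule: "end-with-savasana", message: "Sequence must end with Savasana" });
  }

  // Naive "Health Score": each violation costs 25 points.
  const healthScore = Math.max(0, 100 - violations.length * 25);
  return { violations, healthScore };
}
```

Because both sides call the same function, the red border in the UI and the rejection on save can never disagree.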
&lt;h3&gt;
  
  
  Slice 3: The "Offline Printer"
&lt;/h3&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fit9saw28u6t9za1nxu59.png" width="800" height="512"&gt;Summary or complete printed handout
  


&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt; Users need to print PDF handouts in a yoga studio with no Wi-Fi.&lt;br&gt;
&lt;strong&gt;The AI's First Impulse:&lt;/strong&gt; "Use a server-side PDF library like PDFKit." (Standard web dev practice).&lt;br&gt;
&lt;strong&gt;The Architect's Intervention:&lt;/strong&gt; "Read &lt;code&gt;Docs/ADR/006-pwa-offline-strategy.md&lt;/code&gt;. We must generate PDFs client-side using &lt;code&gt;@react-pdf/renderer&lt;/code&gt;."&lt;br&gt;
&lt;strong&gt;The Result:&lt;/strong&gt; The AI implemented a beautiful client-side renderer. It handled the tricky part of loading fonts (Noto Sans) into the browser's virtual file system so the PDF engine could "see" them without a network request.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Verbatim Log:&lt;/em&gt; "Implemented &lt;code&gt;SequencePdf&lt;/code&gt; component. Configured &lt;code&gt;vite-plugin-pwa&lt;/code&gt; to cache &lt;code&gt;NotoSans&lt;/code&gt; fonts. PDF generation now works without network."&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  5. The Architect's Flex: Automated C4 Verification
&lt;/h2&gt;

&lt;p&gt;How do you know the AI actually respected the Modular Monolith architecture? Did it secretly import the &lt;code&gt;Weaver&lt;/code&gt; module into the &lt;code&gt;Grimoire&lt;/code&gt; when I wasn't looking?&lt;/p&gt;

&lt;p&gt;I didn't want to audit 50 files manually. And I definitely didn't want to draw diagrams by hand.&lt;/p&gt;

&lt;p&gt;So, I added a rule to my Constitution (ADR 007): &lt;strong&gt;"The Code is the Source of Truth for Documentation."&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the end of a session, I force the Gemini CLI to &lt;strong&gt;reverse-engineer its own work&lt;/strong&gt;. I gave it this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Update the RULES.md file to enforce the (re)generation of C4 diagrams when finishing an implementation session
[...] 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also created a specific ADR (007: Architecture Documentation Maintenance Protocol) establishing Mermaid.js as the standard and defining the maintenance lifecycle.&lt;/p&gt;

&lt;p&gt;The result wasn't a hallucination. It was a perfect map of the code it had just written.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhnwz9mkgwq8u6rr03oo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhnwz9mkgwq8u6rr03oo.png" alt="C4 Models" width="800" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the ultimate "Trust but Verify." If the generated diagram looks like spaghetti, the code is spaghetti. If the diagram is clean, the architecture holds.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. The AIOps Protocol: Monitoring the Machine
&lt;/h2&gt;

&lt;p&gt;Now, here is the secret weapon: &lt;strong&gt;The Session Log.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of my strictest rules in &lt;code&gt;RULES.md&lt;/code&gt; was that the AI had to "punch out" at the end of every session. I forced it to append a line to &lt;code&gt;docs/ai_session_log.csv&lt;/code&gt; with the Date, Tool (Chat or CLI), Goal, and &lt;strong&gt;Token Usage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For me this isn't about money ("FinOps"). It's about &lt;strong&gt;AIOps&lt;/strong&gt;, monitoring the operational health of your intelligence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why we log everything (Chat &amp;amp; CLI):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Context Monitoring:&lt;/strong&gt; As a session drags on, the "Tokens In" (Context Window) grows exponentially. The AI starts reading 30,000 tokens of history just to write one line of code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The "Sawtooth" Pattern:&lt;/strong&gt; By visualizing the log, I discovered a crucial pattern. Efficiency drops as context grows. The solution? &lt;strong&gt;The Hard Reset.&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
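The "punch out" line itself is trivial to generate. A hedged sketch, with the column set assumed from the description above (Date, Tool, Goal, Token Usage) and CSV quoting handled so free-text goals don't break the file:

```typescript
// Illustrative formatter for one ai_session_log.csv line.
// Column names are assumptions based on the article's description.
function formatSessionLogLine(entry: {
  date: string;
  tool: "Chat" | "CLI";
  goal: string;
  tokensIn: number;
  tokensOut: number;
}): string {
  // Quote the free-text goal and escape embedded quotes, per CSV convention.
  const goal = `"${entry.goal.replace(/"/g, '""')}"`;
  return [entry.date, entry.tool, goal, entry.tokensIn, entry.tokensOut].join(",");
}
```

Appending one such line per session is enough raw data to plot the "sawtooth" chart below.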

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36kzvkxpp8nvvea3lx29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36kzvkxpp8nvvea3lx29.png" alt="AI Usage Monitoring" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This chart visualizes the high-level "Vibe Coding Lifecycle." You see the context bloat as we iterate on implementing phases 3 and 4. Then, you see the sharp drop when we switch back to the Architect (Chat) or reset the CLI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Lesson:&lt;/strong&gt; A "Tired" AI (high context) makes mistakes. A "Fresh" AI (reset context + Snapshot) is precise.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The "Oh S**t" Moment: The Hallucination Trap
&lt;/h2&gt;

&lt;p&gt;This brings us to the specific incident that proved &lt;em&gt;why&lt;/em&gt; that Reset is mandatory.&lt;/p&gt;

&lt;p&gt;Halfway through Phase 3, the CLI started getting slow (too much history). I ran a &lt;code&gt;/reset&lt;/code&gt; command to clear its memory. &lt;strong&gt;Disaster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It suddenly forgot we were building a "Yoga" app. It tried to invent a new database column &lt;code&gt;duration_minutes&lt;/code&gt; for the cards. But my Spec (ADR 003) explicitly said that &lt;code&gt;duration&lt;/code&gt; lives inside the JSONB payload and is measured in seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Hallucination:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;UPDATE cards SET duration_minutes = 60;&lt;/code&gt; &lt;em&gt;(AI guessing)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Correction (Me):&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;"Read Docs/003-data-model.md. 'Duration' is a JSONB field inside the 'metadata' column, and it's in seconds."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Fix:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;UPDATE cards SET data = jsonb_set(data, '{duration}', '3600');&lt;/code&gt; &lt;em&gt;(AI complying)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To prevent this in the future, we implemented a &lt;strong&gt;"Session Handover"&lt;/strong&gt; protocol. Before resetting, I now force the AI to write a &lt;code&gt;TECH_STATE_SNAPSHOT.md&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Where are we?" (Phases 1-3 Complete)&lt;/li&gt;
&lt;li&gt;"What is the active stack?" (NestJS, React, PostgreSQL)&lt;/li&gt;
&lt;li&gt;"What is the next step?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When I start a new session, I feed this snapshot back in. It’s like a save game for your developer.&lt;/p&gt;
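Generating the snapshot can itself be automated. A minimal sketch, assuming the three questions above as its sections (the filename is the article's; the field shapes are my assumption — in practice the AI writes this file itself):

```typescript
// Hypothetical builder for TECH_STATE_SNAPSHOT.md, the "save game"
// handed to a freshly reset session.
function buildSnapshot(state: {
  completedPhases: string[];
  stack: string[];
  nextStep: string;
}): string {
  return [
    "# TECH_STATE_SNAPSHOT",
    "",
    "## Where are we?",
    `Completed: ${state.completedPhases.join(", ")}`,
    "",
    "## What is the active stack?",
    state.stack.map((s) => `- ${s}`).join("\n"),
    "",
    "## What is the next step?",
    state.nextStep,
  ].join("\n");
}
```

Feeding this one small file back in costs a few hundred tokens instead of replaying 30,000 tokens of history.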




&lt;h2&gt;
  
  
  Conclusion: The Architect's Verdict
&lt;/h2&gt;

&lt;p&gt;So, can you Vibe Code a complex system?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maybe.&lt;/strong&gt; I mean, it depends on how complex the system is (in this example we didn't build an enterprise-wide distributed system). But for sure you can't just "Vibe" it. You have to &lt;strong&gt;Architect&lt;/strong&gt; it.&lt;/p&gt;

&lt;p&gt;If I had touched the code, I would have been bogged down in syntax errors and import paths. By staying in the Architect role, I focused on &lt;em&gt;Data Models&lt;/em&gt;, &lt;em&gt;User Flows&lt;/em&gt;, and &lt;em&gt;Business Logic&lt;/em&gt;. The AI handled the implementation, but I provided the &lt;strong&gt;Guardrails&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Docs are Prompts:&lt;/strong&gt; The &lt;code&gt;RULES.md&lt;/code&gt; file and the &lt;code&gt;Docs/Features/&lt;/code&gt; and &lt;code&gt;Docs/ADR/&lt;/code&gt; folders (or your own equivalents) are the most important files in your repo. They are the AI's long-term memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraint is Clarity:&lt;/strong&gt; The more rules you give the AI (versions, naming, structure), the better code it writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review Everything:&lt;/strong&gt; The AI is a junior dev. It &lt;em&gt;will&lt;/em&gt; introduce security holes or n+1 query problems if you don't catch them in the spec.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Vibe Coding didn't replace the Architect. It just gave the Architect a team of infinite interns. And honestly? They’re pretty good once you give them a Constitution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkujzia1p39hjwzqxxp4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkujzia1p39hjwzqxxp4.png" width="800" height="187"&gt;&lt;/a&gt;&lt;br&gt;Last message from Gemini CLI
  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Next up: The application could do with AI features... Or maybe I'll now explore other aspects of Vibe Coding. Stay tuned.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>architecture</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Vibe Coding One Pixel at a Time</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Fri, 23 Jan 2026 22:21:39 +0000</pubDate>
      <link>https://forem.com/worldlinetech/vibe-coding-one-pixel-at-a-time-22pc</link>
      <guid>https://forem.com/worldlinetech/vibe-coding-one-pixel-at-a-time-22pc</guid>
      <description>&lt;p&gt;&lt;em&gt;Editing "stick figure" Yoga poses&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/worldlinetech/vibe-coding-one-page-at-a-time-265j"&gt;Part 1&lt;/a&gt;, we dipped our toes into "Vibe Coding" by building a Python script. It was linear, logical, and frankly, a bit safe. Text in, text out.&lt;/p&gt;

&lt;p&gt;But let’s be real: backend scripts are the "easy mode" of LLM-assisted coding. The logic is contained. The state is ephemeral.&lt;/p&gt;

&lt;p&gt;The real boss fight is the &lt;strong&gt;Frontend&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Can you "vibe" a UI? Can you talk a chaotic mess of DOM elements, event listeners, and CSS pixels into a functional application without losing your mind (or the AI losing the context)?&lt;/p&gt;

&lt;p&gt;I decided to find out. My goal: Build &lt;strong&gt;Yoga Pose Builder&lt;/strong&gt;, a browser-based tool to edit "stick figure" yoga poses, drag limbs around, and export vector SVGs.&lt;/p&gt;

&lt;p&gt;I had no design, no stack picked out, and—crucially—I had never used a Canvas library in my life.&lt;/p&gt;

&lt;p&gt;Here is how we vibed it into existence.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Context is King (The &lt;code&gt;.md&lt;/code&gt; Anchors)
&lt;/h2&gt;

&lt;p&gt;The biggest enemy of Vibe Coding is the LLM’s "Goldfish Memory." You’re 40 turns into a chat, you ask for a button change, and suddenly the AI forgets you’re building a yoga app and tries to sell you a subscription to a SaaS platform.&lt;/p&gt;

&lt;p&gt;In Part 1, we just chatted. For a full UI application, that doesn't fly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Strategy: Documentation as Prompt Anchoring.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before I let the AI write a single line of JavaScript, I made it write Markdown.&lt;br&gt;
We created a &lt;code&gt;Docs/&lt;/code&gt; folder with two files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;spec.md&lt;/code&gt;: The high-level architecture.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;features.md&lt;/code&gt;: A checklist of what we wanted to do.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I didn't write these because I love administrative work. I wrote them so that when the AI inevitably got confused, I didn't have to re-explain the project. I just said: &lt;em&gt;"Read &lt;code&gt;Docs/spec.md&lt;/code&gt; and try again."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vibe Tip:&lt;/strong&gt; Think of your documentation not as a manual for humans, but as "Long-Term Memory" for your AI pair programmer.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. The Architecture: Letting the AI be CTO
&lt;/h2&gt;

&lt;p&gt;I knew I needed a canvas where I could drag "joints" (knees, elbows) and have "bones" (lines) follow them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; "I want to do this in the browser. Should I use React? Raw Canvas API?"&lt;br&gt;
&lt;strong&gt;AI:&lt;/strong&gt; "React might be overkill. Raw Canvas is painful. Use &lt;strong&gt;Fabric.js&lt;/strong&gt;."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; "Never heard of it. Let's do it."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sv3xujye0amao2fznp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe4sv3xujye0amao2fznp.png" alt="Fabric.js Logo" width="300" height="90"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the beauty of Vibe Coding. I didn't spend 3 hours reading "Top 10 JS Canvas Libraries 2025" Medium articles. I trusted the vibe.&lt;/p&gt;

&lt;p&gt;We settled on a &lt;strong&gt;Build-less Architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Backend:&lt;/strong&gt; Node.js + Express (just to serve files and save JSON).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Frontend:&lt;/strong&gt; Vanilla JS + Fabric.js (loaded via CDN).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build Tool:&lt;/strong&gt; None. No Webpack, no Vite, no &lt;code&gt;npm run eject&lt;/code&gt; nightmares.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why? Because Vibe Coding thrives on speed. I wanted to change a line of code, hit F5, and see the result.&lt;/p&gt;

&lt;p&gt;Application folder structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
├── Docs
│   ├── features.md
│   └── spec.md
├── package.json
├── public
│   ├── index.html
│   └── poses
└── server.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. The "Rig": Math is for Machines
&lt;/h2&gt;

&lt;p&gt;Here is where I expected to get stuck. Creating a "rig" where moving a hand automatically updates the angle of the arm involves trigonometry and vector math.&lt;/p&gt;

&lt;p&gt;Usually, this is where I’d open 15 StackOverflow tabs and copy-paste code I don't understand.&lt;/p&gt;

&lt;p&gt;Instead, I just described the &lt;em&gt;behavior&lt;/em&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a &lt;code&gt;Mannequin&lt;/code&gt; class. It has Nodes (circles) and Links (lines). When a Node moves, the Links connected to it should update their coordinates."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI wrote the entire class. It hooked into Fabric.js’s &lt;code&gt;object:moving&lt;/code&gt; event and handled the coordinate updates. It worked on the first try.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7uciqdosv50962n8z3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7uciqdosv50962n8z3t.png" alt="Pose Builder Mannequin" width="250" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I still barely know how &lt;code&gt;fabric.Line&lt;/code&gt; works under the hood. And I don't care. It works.&lt;/p&gt;
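Stripped of Fabric.js, the behavior I described reduces to a small amount of bookkeeping. A framework-free sketch (the real class wires this same logic into Fabric's `object:moving` event; names are as I prompted them, internals are illustrative):

```typescript
// Nodes are joints, Links are bones. Moving a node implicitly moves
// every bone attached to it, because bones are derived from joint positions.
type Point = { x: number; y: number };

class Mannequin {
  private nodes = new Map<string, Point>();
  private links: Array<[string, string]> = [];

  addNode(id: string, p: Point): void { this.nodes.set(id, { ...p }); }
  addLink(a: string, b: string): void { this.links.push([a, b]); }

  // In the browser this would run on every drag event.
  moveNode(id: string, p: Point): void { this.nodes.set(id, { ...p }); }

  // Recompute bone endpoints from the current joint positions.
  boneSegments(): Array<{ from: Point; to: Point }> {
    return this.links.map(([a, b]) => ({
      from: { ...this.nodes.get(a)! },
      to: { ...this.nodes.get(b)! },
    }));
  }
}
```

Deriving the bones from the joints (rather than storing both) is what made the rig work on the first try: there is no second copy of the geometry to drift out of sync.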

&lt;h2&gt;
  
  
  4. Iteration: The "Yes, And..." Technique
&lt;/h2&gt;

&lt;p&gt;UI Vibe Coding isn't about getting it right instantly; it's about sculpting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Ugly Phase:&lt;/strong&gt;&lt;br&gt;
The first version looked like a programmer made it (because a programmer &lt;em&gt;did&lt;/em&gt; make it). The stick figure looked like a dead bug. The background was gray.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Vibe" Phase:&lt;/strong&gt;&lt;br&gt;
Me: &lt;em&gt;"This looks depressing. Make it 'Zen'. Use soft colors, rounded buttons, and a clean layout."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI generated the CSS variables (&lt;code&gt;--highlight-color: #88b04b&lt;/code&gt;), added a "Save As" modal, and cleaned up the toolbar.&lt;/p&gt;


  &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwiaddnxyt14usfea2l1p.png" width="800" height="495"&gt;Yoga Pose Builder GUI
  


&lt;p&gt;&lt;strong&gt;The "Feature Creep" Phase:&lt;/strong&gt;&lt;br&gt;
Me: &lt;em&gt;"I want to save my poses."&lt;/em&gt;&lt;br&gt;
AI: &lt;em&gt;"We have no database."&lt;/em&gt;&lt;br&gt;
Me: &lt;em&gt;"Just write JSON files to a folder on the server."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 5 minutes, we had a fully working persistence layer. No database migrations, just &lt;code&gt;fs.writeFile&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here is an example of such a pose JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"nameFR"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Demi-Pont"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"nameSK"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Setu Bandhasana"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"joints"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"head"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"neck"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"chest"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span 
class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"hips"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lShoulder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lElbow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lHand"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span 
class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rShoulder"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rElbow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;-20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rHand"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span 
class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lHip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lKnee"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"lFoot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rHip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rKnee"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"rFoot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"x"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"y"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. The Pivot: Language as a Feature
&lt;/h2&gt;

&lt;p&gt;At the end of the session, I realized a problem: the app was vibing in French (my native tongue), but I wanted screenshots in English for this article. &lt;/p&gt;

&lt;p&gt;Instead of manually editing labels, I asked the AI to "make the whole app i18n." In one single refactor, we added a translation dictionary, a language switcher, and logic to dynamically swap every label, tooltip, and even the pose names in the library. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpg0r2qlr1xkx2qsfna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vpg0r2qlr1xkx2qsfna.png" width="800" height="495"&gt;&lt;/a&gt;&lt;br&gt;GUI (and data) in French
  &lt;/p&gt;

&lt;p&gt;This turned a linguistic hurdle into a core feature, proving that with Vibe Coding, "changing your mind" is just a prompt away.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. The "Traceability" Hack
&lt;/h2&gt;

&lt;p&gt;We spent about 90 minutes building this. We added features, fixed bugs, and refactored code. By the end, the chat context was massive and messy.&lt;/p&gt;

&lt;p&gt;If I came back to this project in a week, I’d be lost.&lt;/p&gt;

&lt;p&gt;So, I ran one final "Meta-Prompt":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Read all the code we wrote and the docs in &lt;code&gt;Docs/&lt;/code&gt;, and generate a &lt;code&gt;Docs/session_summary.md&lt;/code&gt;. Explain what we built, why we made these choices, and the current state of the app."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI analyzed &lt;em&gt;its own work&lt;/em&gt; and wrote a summary file. This is my "Save Game" point. When I want to work on this again, I’ll feed that summary to the AI to restore its context instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We went from a blank folder to a functional, vector-based SVG editor with a backend in one session.&lt;/p&gt;

&lt;p&gt;Vibe Coding a UI is possible, but you have to change your approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Anchor the Context:&lt;/strong&gt; Write specs so the AI has a "North Star."&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Delegate the Heavy Lifting:&lt;/strong&gt; Let the AI choose the libraries and do the math.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Iterate Visually:&lt;/strong&gt; Don't try to prompt the perfect UI. Prompt the &lt;em&gt;skeleton&lt;/em&gt;, then prompt the &lt;em&gt;paint&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;Next we'll try to &lt;a href="https://dev.to/worldlinetech/vibe-coding-one-slice-at-a-time-4n3p"&gt;Vibe Code a real full stack app&lt;/a&gt;. Or a game. Who knows? The prompt is the limit.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgj2vy5ypv410dvds2d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqgj2vy5ypv410dvds2d8.png" width="450" height="423"&gt;&lt;/a&gt;&lt;br&gt;SVG exported by Yoga Pose Builder (opened in Inkscape)
  &lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>uidesign</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Vibe Coding One Page at a Time</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Fri, 23 Jan 2026 14:45:20 +0000</pubDate>
      <link>https://forem.com/worldlinetech/vibe-coding-one-page-at-a-time-265j</link>
      <guid>https://forem.com/worldlinetech/vibe-coding-one-page-at-a-time-265j</guid>
      <description>&lt;p&gt;&lt;em&gt;Building a Smart Magazine Archiver&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’m starting a new series called &lt;strong&gt;"Vibe Coding One Step at a Time."&lt;/strong&gt; The goal? To document the raw, messy, and surprisingly efficient process of building software in the age of AI. We’re not here to write perfect specs or obsess over UML diagrams (well, not yet). We’re here to vibe with the code, iterating on pure intent until the machine does exactly what we want.&lt;/p&gt;

&lt;p&gt;In this first edition, I’m sharing how I used the &lt;strong&gt;Gemini CLI&lt;/strong&gt; to build a tool I actually needed, learning some pretty cool image processing tricks along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is "Vibe Coding"?
&lt;/h2&gt;

&lt;p&gt;I’m going to claim this term right here: &lt;strong&gt;Vibe Coding&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s not "lazy coding." It’s &lt;strong&gt;intent-driven development&lt;/strong&gt;. In the old days, if you wanted to build a script, you had to know the syntax, the libraries, and the edge cases before you even opened your editor. You had to &lt;em&gt;think in code&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Vibe Coding flips that. You &lt;em&gt;think in outcomes&lt;/em&gt;. You describe the behavior, the "vibe" of the feature, and the AI handles the implementation details. You act less like a bricklayer and more like a conductor. The feedback loop isn't "Write -&amp;gt; Compile -&amp;gt; Error," it's "Ask -&amp;gt; Observe -&amp;gt; Tweak."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case: "I Just Want to Read Offline"
&lt;/h2&gt;

&lt;p&gt;Here’s the situation: I subscribe to a fantastic niche magazine (which shall remain nameless to protect the innocent). It’s great, but their "digital reader" is a nightmare. It’s one of those web-based page-turners that requires an active internet connection.&lt;/p&gt;

&lt;p&gt;I wanted to read it on my tablet, offline, on a plane, without waiting for high-res JPEGs to buffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; There was no "Download PDF" button.&lt;br&gt;
&lt;strong&gt;The Clue:&lt;/strong&gt; Inspecting the network traffic revealed that the magazine was just serving a sequence of high-quality images, one URL per page.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Mission:&lt;/strong&gt; Write a script to fetch these pages and stitch them into a single, high-quality, searchable PDF.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Process: Galloping Toward Complexity
&lt;/h2&gt;

&lt;p&gt;We didn't sit down and architect a solution. We started small and let the script evolve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: The Naive Loop
&lt;/h3&gt;

&lt;p&gt;We started with a simple hypothesis: "The URLs probably just have a page number in them."&lt;br&gt;
I asked Gemini to write a script using &lt;code&gt;requests&lt;/code&gt; to hit the URL for page 1, then page 2.&lt;br&gt;
&lt;em&gt;Boom.&lt;/em&gt; It worked. We had a directory full of 100 separate JPGs.&lt;/p&gt;
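&lt;p&gt;That naive loop boils down to very few lines. Here is a minimal sketch of the idea, assuming a &lt;em&gt;hypothetical&lt;/em&gt; URL pattern (the real magazine's URLs obviously differ, and the base URL below is invented):&lt;/p&gt;

```python
import os

import requests

# Hypothetical URL pattern -- the real magazine's URLs are different.
BASE = "https://example.com/reader/issue-42/page-{n}.jpg"


def page_url(n):
    """Build the URL for a given page number."""
    return BASE.format(n=n)


def download_pages(count, dest="pages"):
    """Fetch pages 1..count and save them as zero-padded JPG files."""
    os.makedirs(dest, exist_ok=True)
    for n in range(1, count + 1):
        resp = requests.get(page_url(n), timeout=30)
        resp.raise_for_status()  # stop on 404: we ran past the last page
        with open(os.path.join(dest, f"{n:03d}.jpg"), "wb") as f:
            f.write(resp.content)
```

&lt;p&gt;The zero-padded filenames (&lt;code&gt;001.jpg&lt;/code&gt;, &lt;code&gt;002.jpg&lt;/code&gt;…) matter later: they keep a plain alphabetical sort in page order.&lt;/p&gt;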

&lt;h3&gt;
  
  
  Step 2: The Picture Book
&lt;/h3&gt;

&lt;p&gt;Having 100 files is annoying. I wanted a book.&lt;br&gt;
We asked Gemini to "glue these together." It pulled in the &lt;code&gt;PIL&lt;/code&gt; (&lt;a href="https://pillow.readthedocs.io" rel="noopener noreferrer"&gt;Pillow&lt;/a&gt;) library.&lt;br&gt;
&lt;strong&gt;Result:&lt;/strong&gt; A massive PDF. It looked great, but it was dumb. It was just a container of pictures. You couldn't highlight text, search for keywords, or copy-paste quotes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: The Search for Meaning (OCR)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknav94wd05wdhh5nojvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknav94wd05wdhh5nojvv.png" alt="Tesseract OCR" width="330" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the "vibe" got technical. I realized a "picture book" wasn't enough. I needed &lt;strong&gt;Optical Character Recognition (OCR)&lt;/strong&gt;.&lt;br&gt;
We decided to use &lt;a href="https://github.com/tesseract-ocr" rel="noopener noreferrer"&gt;Tesseract&lt;/a&gt;. But here’s the catch we discovered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Human Eyes&lt;/strong&gt; like soft colors and smooth anti-aliasing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;OCR Engines&lt;/strong&gt; like harsh contrast, jagged edges, and black-and-white binary inputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we optimized the images for the machine, the magazine looked ugly. If we kept them pretty, the machine couldn't read the text.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Deep Dive: The "PDF Sandwich"
&lt;/h2&gt;

&lt;p&gt;This is where the magic happened. We ended up building a &lt;strong&gt;PDF Sandwich&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fw2zzt7ebjs57pp9qpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5fw2zzt7ebjs57pp9qpa.png" alt="Me asking Gemini CLI for a sandwich" width="800" height="129"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead of choosing between beauty and brains, we chose both.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;The Visual Layer:&lt;/strong&gt; We keep the original high-res color JPEGs. This is what you see.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Data Layer:&lt;/strong&gt; Behind the scenes, we create a "Frankenstein" version of the page—converted to grayscale, contrast cranked up to 2.0, and upscaled 2x using &lt;code&gt;LANCZOS&lt;/code&gt; resampling (a fancy &lt;a href="https://en.wikipedia.org/wiki/Lanczos_resampling" rel="noopener noreferrer"&gt;algorithm&lt;/a&gt; that keeps edges sharp).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;The Merge:&lt;/strong&gt; We feed the Frankenstein images to Tesseract to generate an invisible text layer, then use &lt;code&gt;pypdf&lt;/code&gt; to overlay that text exactly on top of the pretty images.&lt;/li&gt;
&lt;/ol&gt;
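&lt;p&gt;The Data Layer preprocessing can be sketched in a few lines of Pillow, using the same values mentioned above (grayscale, contrast 2.0, 2x &lt;code&gt;LANCZOS&lt;/code&gt; upscale); the function name is illustrative:&lt;/p&gt;

```python
from PIL import Image, ImageEnhance


def ocr_friendly(img):
    """Build the 'Frankenstein' page Tesseract prefers: gray, harsh, big."""
    g = img.convert("L")                         # grayscale
    g = ImageEnhance.Contrast(g).enhance(2.0)    # crank contrast to 2.0
    w, h = g.size
    # LANCZOS keeps edges sharp when upscaling 2x for small fonts.
    return g.resize((w * 2, h * 2), Image.LANCZOS)
```

&lt;p&gt;The resulting image never reaches the reader's eyes; it exists only to be fed to Tesseract for the invisible text layer.&lt;/p&gt;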

&lt;p&gt;The trickiest part? &lt;strong&gt;Math.&lt;/strong&gt;&lt;br&gt;
Because we upscaled the OCR images by 2x to help Tesseract read small fonts, the invisible text layer was twice as big as the visual page. We had to calculate scale factors to shrink the text back down so that when you highlight a sentence, the highlight actually lines up with the words.&lt;/p&gt;
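&lt;p&gt;Written down, the math is a single ratio (the page widths below are illustrative, not the magazine's real dimensions):&lt;/p&gt;

```python
def text_layer_scale(visual_width, ocr_width):
    """Factor by which the OCR text layer must shrink to fit the visual page."""
    return visual_width / ocr_width


# An A4-ish page rendered at 595 pts, OCRed from a 2x upscale (1190 pts):
scale = text_layer_scale(595, 1190)  # 0.5 -- shrink the text layer by half
```

&lt;p&gt;In &lt;code&gt;pypdf&lt;/code&gt;, a factor like this would typically feed a &lt;code&gt;Transformation().scale(...)&lt;/code&gt; applied to the text layer before &lt;code&gt;merge_page&lt;/code&gt;, so the highlights land on the words.&lt;/p&gt;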

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Vibe coding this script taught me more in an hour than I’d usually learn in a weekend of reading docs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Image Optimization:&lt;/strong&gt; OCR is picky. Simply resizing an image isn't enough; the &lt;em&gt;method&lt;/em&gt; of resizing (resampling filter) matters.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Library Specialization:&lt;/strong&gt; &lt;code&gt;PIL&lt;/code&gt; is for pixels; &lt;code&gt;pypdf&lt;/code&gt; is for structure. Trying to do everything in one library is a trap.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Power of the CLI:&lt;/strong&gt; Using the Gemini CLI meant I didn't have to context-switch. I stayed in my terminal, describing what I wanted, and the code appeared.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2dgv0fe30dhuu4zf12u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2dgv0fe30dhuu4zf12u.png" alt="Use of the script (for 2 pages)" width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We ended up with a ~100-line Python script that solves a genuine daily frustration. I didn't have to memorize the &lt;code&gt;pypdf&lt;/code&gt; documentation or look up the Tesseract CLI flags. I just focused on the goal: "Make it searchable, make it pretty."&lt;/p&gt;

&lt;p&gt;That’s Vibe Coding. You bring the vision, the AI brings the syntax, and together you build something cool. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;We'll discover in the &lt;a href="https://dev.to/worldlinetech/vibe-coding-one-pixel-at-a-time-22pc"&gt;next episode&lt;/a&gt; if this is still true with a more complex use case and a GUI.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>gemini</category>
      <category>pdf</category>
      <category>ocr</category>
    </item>
    <item>
      <title>The Ultimate LLM Inference Battle: vLLM vs. Ollama vs. ZML</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Mon, 29 Dec 2025 09:12:46 +0000</pubDate>
      <link>https://forem.com/worldlinetech/the-ultimate-llm-inference-battle-vllm-vs-ollama-vs-zml-m97</link>
      <guid>https://forem.com/worldlinetech/the-ultimate-llm-inference-battle-vllm-vs-ollama-vs-zml-m97</guid>
      <description>&lt;p&gt;&lt;em&gt;A structured, data-driven comparison of today's leading open-source engines for serving AI models.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Runtime Wars"
&lt;/h3&gt;

&lt;p&gt;The open-source AI community has achieved an incredible milestone: models like Meta's Llama 3 and Mistral AI's Mixtral now rival proprietary giants like GPT-4. But having the weights is only half the battle. To actually &lt;em&gt;use&lt;/em&gt; these models—to build a chatbot, an agent, or an API—you need an inference engine.&lt;/p&gt;

&lt;p&gt;The landscape of inference servers is exploding. A year ago, options were scarce. Today, developers are faced with a paralyzing array of choices. Should you use the industry darling &lt;strong&gt;vLLM&lt;/strong&gt;? The local developer's favorite, &lt;strong&gt;Ollama&lt;/strong&gt;? Or perhaps a radical newcomer like &lt;strong&gt;ZML&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;Choosing the wrong engine can lead to massive infrastructure bills, slow user experiences, or vendor lock-in.&lt;/p&gt;

&lt;p&gt;To cut through the hype, we are applying the &lt;strong&gt;QSOS (Qualification and Selection of Open Source software)&lt;/strong&gt; method. This isn't a casual review; it's a structured evaluation comparing these three contenders against the state-of-the-art features required for modern AI production.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Methodology: Why QSOS?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkge5m9dy66je4atphit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkge5m9dy66je4atphit.png" alt="QSOS Logo" width="257" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.qsos.org" rel="noopener noreferrer"&gt;QSOS&lt;/a&gt; is a standardized methodology designed to reduce the risks associated with adopting open-source technologies. Unlike ad-hoc selection processes based on Medium articles or GitHub stars, QSOS treats open-source evaluation with the same rigor used for proprietary software.&lt;/p&gt;

&lt;p&gt;The core philosophy of QSOS is separating &lt;strong&gt;Evaluation&lt;/strong&gt; (the intrinsic, objective quality of the software) from &lt;strong&gt;Qualification&lt;/strong&gt; (how well it fits your specific business needs).&lt;/p&gt;

&lt;p&gt;For this comparison, we used a "Best of Breed" evaluation grid, scoring features on a simple 0-to-2 scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;0:&lt;/strong&gt; Not covered / Non-existent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1:&lt;/strong&gt; Partially covered / Complex implementation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2:&lt;/strong&gt; Fully covered / Best-in-class standard.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We assessed four key axes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Maturity &amp;amp; Community:&lt;/strong&gt; Is the project stable and likely to survive?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional Features:&lt;/strong&gt; Does it support modern requirements like LoRA adapters and quantization?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance &amp;amp; Scale:&lt;/strong&gt; Can it handle high throughput and utilize hardware efficiently?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations (Day 2):&lt;/strong&gt; How easy is it to deploy, monitor, and maintain?&lt;/li&gt;
&lt;/ol&gt;
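&lt;p&gt;To make the scoring mechanics concrete, here is a toy sketch of how per-axis 0-to-2 scores combine into a weighted figure during Qualification. The weights and scores below are invented for illustration; they are not the values from our grid:&lt;/p&gt;

```python
# Hypothetical axis weights, purely illustrative of the QSOS math.
WEIGHTS = {"maturity": 0.3, "features": 0.3, "performance": 0.2, "operations": 0.2}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-axis 0-to-2 QSOS scores into a single weighted figure."""
    return sum(scores[axis] * weights[axis] for axis in weights)


# e.g. an engine that is strong everywhere except Day-2 operations:
example = {"maturity": 2, "features": 2, "performance": 2, "operations": 1}
score = weighted_score(example)  # 0.3*2 + 0.3*2 + 0.2*2 + 0.2*1
```

&lt;p&gt;The point of Qualification is exactly this re-weighting: the same raw scores yield different winners depending on whether your business weights throughput or operational ease.&lt;/p&gt;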

&lt;h3&gt;
  
  
  The Contenders
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. vLLM: The Data Center Standard
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8pv4ovqry11gr3xumeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8pv4ovqry11gr3xumeg.png" alt="vLLM Logo" width="239" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://vllm.ai" rel="noopener noreferrer"&gt;vLLM&lt;/a&gt;&lt;/strong&gt; burst onto the scene in 2023 from UC Berkeley, solving a critical bottleneck in serving LLMs: memory fragmentation. Its core innovation, &lt;strong&gt;PagedAttention&lt;/strong&gt;, allows it to manage GPU memory like an operating system manages virtual memory, dramatically increasing batch sizes and throughput.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Focus:&lt;/strong&gt; High-throughput production serving in the data center.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positioning:&lt;/strong&gt; vLLM is currently the &lt;strong&gt;De Facto Standard&lt;/strong&gt; for enterprise deployment. It excels on server-grade hardware (NVIDIA H100s/A100s) and offers the richest feature set for scaling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Ollama: The Developer's Best Friend
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8i7p1zgrhls5qacwqki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8i7p1zgrhls5qacwqki.png" alt="Ollama Logo" width="344" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt;&lt;/strong&gt; took a different approach. It focused entirely on removing friction. By wrapping the powerful &lt;code&gt;llama.cpp&lt;/code&gt; engine in a sleek, Docker-style Go binary, it made running a 70B parameter model on a MacBook as easy as typing &lt;code&gt;ollama run llama3&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Focus:&lt;/strong&gt; Local development, edge devices, and consumer hardware (Mac/PC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positioning:&lt;/strong&gt; Ollama is the king of &lt;strong&gt;usability&lt;/strong&gt;. It is unbeaten for local testing and running models on consumer hardware, but it lacks the advanced scheduling required for high-traffic enterprise production.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. ZML (Zig Machine Learning): The Radical Challenger
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ddw41ql12g8k4ekellb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ddw41ql12g8k4ekellb.png" alt="ZML Logo" width="200" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://zml.ai" rel="noopener noreferrer"&gt;ZML&lt;/a&gt;&lt;/strong&gt; is the new kid on the block. It is less of a "server" product and more of a compiler stack aimed at engineers. Written in Zig, it utilizes OpenXLA/MLIR to compile model graphs directly into standalone binaries, aiming to eliminate the heavy Python/PyTorch dependency chain entirely.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary Focus:&lt;/strong&gt; High-performance, cross-platform runtime (TPUs, AMD, NVIDIA) without dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Positioning:&lt;/strong&gt; ZML is an &lt;strong&gt;Alpha-stage visionary&lt;/strong&gt;. It offers incredible potential for hardware portability and efficiency but is currently a complex "build-your-own-stack" tool rather than a drop-in product.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Visualizing the Results
&lt;/h3&gt;

&lt;p&gt;To understand how these tools differ, we visualize our QSOS scores using two different schemas.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Radar Chart: Feature Balance
&lt;/h4&gt;

&lt;p&gt;This chart shows the balance of strengths across the four evaluation axes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlvb6g592d0kydxj0try.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdlvb6g592d0kydxj0try.png" alt="QSOS Radar" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: The QSOS Radar Chart highlights the distinct profiles of the three engines. vLLM shows the broadest coverage across features and performance. Ollama spikes toward Operational Ease. ZML shows potential in features but lacks maturity.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM (Blue):&lt;/strong&gt; The largest, most balanced area, indicating strength across maturity, features, and performance, with moderate operational complexity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama (Green):&lt;/strong&gt; A massive spike toward "Operational Ease," reflecting its zero-friction user experience, but pulling back on raw performance metrics like continuous batching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZML (Red):&lt;/strong&gt; A smaller footprint overall, reflecting its early stage (low maturity), but showing strong potential in functional features due to its compiler-based architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The QSOS Quadrant: Market Position
&lt;/h4&gt;

&lt;p&gt;This schema maps the tools based on their market adoption versus their raw production capabilities.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm90qnkq9c5tf3hld553.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm90qnkq9c5tf3hld553.png" alt="QSOS Quadrant" width="800" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Caption: The QSOS Quadrant positions the tools based on Market Maturity vs. Production Power.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;vLLM (The Leader):&lt;/strong&gt; High Maturity, High Power. The safe, scalable choice for the enterprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama (The Specialist):&lt;/strong&gt; High Maturity, Lower Production Power. The standard for a specific niche (local/consumer hardware), prioritizing usability over scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZML (The Visionary):&lt;/strong&gt; Low Maturity, High Potential Power. An innovative approach that hasn't yet proven itself in the broad market.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Consolidated Score Sheet
&lt;/h3&gt;

&lt;p&gt;Below is the detailed breakdown of the evaluation scores that feed the charts above.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section / Criteria&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;vLLM&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Ollama&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;ZML (Zig ML)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;A. MATURITY&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;History &amp;amp; Age&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Standard)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Standard)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Very New)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Activity&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Hyper-Active)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Viral)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (High Velocity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ecosystem&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Dominant)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Ubiquitous)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Niche)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Governance&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Community)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Company Led)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Small Team)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;B. FEATURES&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Support&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Universal)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Curated Lib)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Compiler based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Server: AWQ/FP8)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Edge: GGUF)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Implicit XLA)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA Adapters&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Dynamic Multi-LoRA)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Static Modelfile)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Not standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Compat.&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (OpenAI Native)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (OpenAI Native)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Runtime only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C. PERFORMANCE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cont. Batching&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Gold Standard)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (FIFO)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Arch. support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Maximum SOTA)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Low/Single User)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (High Potential)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallelism&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Tensor &amp;amp; Pipeline)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Single Node)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Compiler Config)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware Agnosticism&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (NVIDIA Centric)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Apple/Consumer)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Any: TPU/AMD)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;D. OPERATIONS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ease of Setup&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Python/Docker)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Magic 1-Click)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Hard: Bazel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dependencies&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Heavy Torch)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Zero: Go Binary)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Zero: Zig Binary)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;2&lt;/strong&gt; (Prometheus Native)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;0&lt;/strong&gt; (Logs only)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;1&lt;/strong&gt; (Manual metrics)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
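&lt;p&gt;As a quick sanity check, the per-axis values behind the radar chart can be reproduced from this score sheet. The sketch below simply averages each section's criteria into a single 0-2 axis value; note that plain averaging is a simplifying assumption on my part, since a full QSOS evaluation applies explicit weights per criterion:&lt;/p&gt;

```python
# Scores transcribed from the QSOS score sheet above (0-2 per criterion,
# in table order: the four A/B/C criteria, then the three D criteria).
scores = {
    "vLLM":   {"Maturity": [2, 2, 2, 2], "Features": [2, 2, 2, 2],
               "Performance": [2, 2, 2, 1], "Operations": [1, 1, 2]},
    "Ollama": {"Maturity": [2, 2, 2, 1], "Features": [2, 2, 1, 2],
               "Performance": [0, 1, 0, 2], "Operations": [2, 2, 0]},
    "ZML":    {"Maturity": [0, 2, 0, 1], "Features": [2, 1, 0, 0],
               "Performance": [1, 1, 1, 2], "Operations": [0, 2, 1]},
}

def axis_scores(tool):
    """Average each section's criteria into a single 0-2 radar axis value."""
    return {section: round(sum(vals) / len(vals), 2)
            for section, vals in scores[tool].items()}

for tool in scores:
    print(tool, axis_scores(tool))
```

&lt;p&gt;For vLLM this yields 2.0 on both Maturity and Features, which matches the broad, balanced radar profile described above, while Ollama's low Performance average reflects its FIFO scheduling and single-node design.&lt;/p&gt;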

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;There is no single "best" inference engine. The right choice depends entirely on your specific context (the Qualification phase of QSOS).&lt;/p&gt;

&lt;h4&gt;
  
  
  Choose vLLM if:
&lt;/h4&gt;

&lt;p&gt;You are building a production application that needs to serve many concurrent users. You have access to server-grade GPUs (NVIDIA A10G, A100, H100) and need features like dynamic LoRA adapters for multi-tenancy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you are deploying to Kubernetes to serve customers, start here.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Choose Ollama if:
&lt;/h4&gt;

&lt;p&gt;You are a developer building locally on a Mac or Windows PC. You need a zero-friction way to test models, or you are deploying to edge devices where resources are constrained, and concurrency is low.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you just want to run Llama 3 on your laptop right now, download Ollama.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Choose ZML if:
&lt;/h4&gt;

&lt;p&gt;You are an ML systems engineer building a specialized hardware appliance (e.g., using TPUs or AMD chips) and need a runtime with absolutely zero Python dependencies and a tiny footprint. You are willing to build the server infrastructure around it yourself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;If you are frustrated by PyTorch bloat and want a "build your own" adventure, look at ZML.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Note on Methodology
&lt;/h3&gt;

&lt;p&gt;For the purpose of this article, we utilized a &lt;strong&gt;simplified QSOS evaluation grid&lt;/strong&gt;. We intentionally zoomed in on the "Best of Breed" criteria, the critical differentiators driving the current "Inference Wars", to keep the comparison readable and actionable.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;full-fledged QSOS evaluation&lt;/strong&gt; is significantly more exhaustive. It is structured as a hierarchical &lt;strong&gt;tree of criteria&lt;/strong&gt; containing more data points, covering deep operational details such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generic Attributes:&lt;/strong&gt; Intellectual property management, roadmap visibility, bug tracking efficiency, and internationalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specific Sub-sections:&lt;/strong&gt; Detailed granularity on security compliance (SOC2/GDPR), exact memory footprints, and specific driver version compatibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While this article provides a strategic overview, a complete QSOS audit would involve drilling down from high-level "Sections" into specific "Leaves" to calculate a precise, weighted score for every possible business constraint.&lt;/p&gt;
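&lt;p&gt;To make the tree-of-criteria idea concrete, here is a minimal sketch of how a full QSOS audit might roll weighted leaf scores up from the "Leaves" into section and global scores. The criteria names, weights, and scores below are purely illustrative, not taken from an actual audit:&lt;/p&gt;

```python
# A toy QSOS tree: sections hold weighted children; leaves hold a 0-2 score.
# All names, weights, and scores are illustrative placeholders.
tree = {
    "children": {
        "Maturity": {
            "weight": 0.4,
            "children": {
                "History": {"weight": 0.5, "score": 2},
                "Governance": {"weight": 0.5, "score": 1},
            },
        },
        "Operations": {
            "weight": 0.6,
            "children": {
                "Ease of Setup": {"weight": 0.7, "score": 2},
                "Observability": {"weight": 0.3, "score": 0},
            },
        },
    },
}

def weighted_score(node):
    """Recursively aggregate leaf scores into a weighted 0-2 score."""
    if "score" in node:
        return node["score"]
    total_weight = sum(c["weight"] for c in node["children"].values())
    return sum(c["weight"] * weighted_score(c)
               for c in node["children"].values()) / total_weight
```

&lt;p&gt;Because weights can be tuned per business constraint, the same score sheet can rank the three engines differently for, say, an edge deployment versus a multi-tenant SaaS platform.&lt;/p&gt;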

</description>
      <category>qsos</category>
      <category>zml</category>
      <category>ollama</category>
      <category>vllm</category>
    </item>
    <item>
      <title>Automating Image Generation with n8n and ComfyUI</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Sun, 07 Sep 2025 15:51:34 +0000</pubDate>
      <link>https://forem.com/worldlinetech/automating-image-generation-with-n8n-and-comfyui-521p</link>
      <guid>https://forem.com/worldlinetech/automating-image-generation-with-n8n-and-comfyui-521p</guid>
      <description>&lt;p&gt;This is the third article of a series about how to integrate ComfyUI with other tools to build more complex workflows. We'll move beyond the familiar node-based interface to explore how to connect ComfyUI from code and no-code solutions, using API calls or MCP Servers.&lt;/p&gt;

&lt;p&gt;You'll learn &lt;strong&gt;how to use ComfyUI's API to build custom applications&lt;/strong&gt; and automate tasks, creating powerful and automated systems for generative AI.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://n8n.io" rel="noopener noreferrer"&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/a&gt; is a workflow automation tool that connects applications, APIs, and services without requiring deep technical expertise. It allows users to create &lt;strong&gt;complex, multi-step workflows using a visual, node-based editor&lt;/strong&gt;. With n8n, you can automate tasks across thousands of integrations, from CRMs and databases to messaging apps and cloud services.&lt;/p&gt;

&lt;p&gt;It's a &lt;a href="https://docs.n8n.io/sustainable-use-license/" rel="noopener noreferrer"&gt;&lt;strong&gt;fair-code&lt;/strong&gt;&lt;/a&gt; and &lt;strong&gt;open-core&lt;/strong&gt; solution. You can self-host and modify the software freely, but SaaS providers must contribute back to the project if they offer n8n as a service. Furthermore, some advanced features like global variables, multiple environments (dev, staging, prod, etc.), version control using Git, or controlling n8n via API are not available in the community and open-source version of the product.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore how to call ComfyUI from an n8n &lt;strong&gt;agent-based workflow with human interaction and LLM use&lt;/strong&gt;. The agent is instructed to transform a simple prompt from the user into a super-charged JSON Prompt Guide, which is then injected into ComfyUI. For more context, you can read my previous article on &lt;a href="https://dev.to/worldlinetech/json-style-guides-for-controlled-image-generation-with-gpt-4o-and-gpt-image-1-36p"&gt;&lt;strong&gt;JSON Prompt Style Guides&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;n8n is a Vue/TypeScript web application that's simple to install, whether you prefer to run it directly with Node.js or inside a Docker container.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Node.js&lt;/strong&gt;: &lt;code&gt;npx n8n&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker&lt;/strong&gt;: &lt;code&gt;docker volume create n8n_data&lt;/code&gt; and then &lt;code&gt;docker run -it --rm --name n8n -p 5678:5678 -v n8n_data:/home/node/.n8n docker.n8n.io/n8nio/n8n&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After all dependencies are installed, the n8n Editor web UI is accessible at &lt;code&gt;http://localhost:5678&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Text-to-Image Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case
&lt;/h3&gt;

&lt;p&gt;Workflow design is done in the Editor web UI, and it's a highly visual process that doesn't require any coding knowledge, as long as you use predefined nodes for a standard use case. That's our approach here, as we'll create a very simple 3-step workflow with 4 nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcurios5i266b3abbhamr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcurios5i266b3abbhamr.png" alt="T2I Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Chat Trigger&lt;/strong&gt; node to start the workflow with a message from the user to capture their initial prompt for the images to be generated by ComfyUI.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;AI Agent&lt;/strong&gt; node to call an OpenAI model (though it could be other SaaS solutions like Mistral, Anthropic, or Google Gemini, or local models provided through Ollama or directly by Hugging Face). The agent has instructions on how to expand the initial prompt from the previous node into a &lt;strong&gt;JSON Prompt Style Guide&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;OpenAI Chat Model&lt;/strong&gt; node to connect to OpenAI's GPT.&lt;/li&gt;
&lt;li&gt; &lt;a href="https://github.com/mason276752/n8n-nodes-comfyui" rel="noopener noreferrer"&gt;&lt;strong&gt;n8n-nodes-comfyui&lt;/strong&gt;&lt;/a&gt; community node to connect to a running ComfyUI instance. To install it, go to the "&lt;em&gt;Settings / Community nodes&lt;/em&gt;" menu.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b6d063cqdad5pg6ldnd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8b6d063cqdad5pg6ldnd.png" alt="n8n-nodes-comfyui installation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We're making a simple use of this standard &lt;strong&gt;AI Agent&lt;/strong&gt; node and don't require memory or external tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nwqgr4ljxe28ioaagy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nwqgr4ljxe28ioaagy4.png" alt="AI Agent node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most important parameter is the &lt;strong&gt;system message&lt;/strong&gt; given to the LLM to expand the initial user prompt. The &lt;strong&gt;OpenAI Chat Model&lt;/strong&gt; node handles the credentials to connect to OpenAI and allows us to select the GPT 4.1 mini model.&lt;/p&gt;

&lt;p&gt;The LLM response is then sent to the final node, which is interconnected with ComfyUI.&lt;/p&gt;

&lt;h3&gt;
  
  
  ComfyUI Community Node
&lt;/h3&gt;

&lt;p&gt;Once installed, this community node is quite straightforward to use.&lt;/p&gt;

&lt;p&gt;First, we configure the credentials to connect to ComfyUI.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API URL&lt;/strong&gt;: In this example, it's &lt;code&gt;http://127.0.0.1:8188&lt;/code&gt;, but it could also be a remote instance of ComfyUI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Key&lt;/strong&gt;: This is used if you have configured one on the ComfyUI side.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71s9f9ybqx03fuf6hro9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71s9f9ybqx03fuf6hro9.png" alt="ComfyUI node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we specify the output format (&lt;strong&gt;PNG&lt;/strong&gt; or &lt;strong&gt;JPEG&lt;/strong&gt;) and the timeout for communication with ComfyUI. In the &lt;strong&gt;Workflow JSON&lt;/strong&gt; textarea, we copy the content of the workflow exported from ComfyUI (by using the "&lt;em&gt;File / Export (API)&lt;/em&gt;" menu).&lt;/p&gt;

&lt;p&gt;This means that n8n will send the workflow to be executed to the ComfyUI API in JSON format. We need to modify the ComfyUI workflow by using an expression containing the &lt;em&gt;$node["AI Agent"].data&lt;/em&gt; variable. Its value is dynamically set to the prompt provided by the previous node during n8n execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5ebjyjsljgxmnzux3zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5ebjyjsljgxmnzux3zz.png" alt="Prompt insertion"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The exact location to inject the prompt depends on the JSON workflow exported from ComfyUI. Here, it's inside the &lt;strong&gt;"39.6"&lt;/strong&gt; node of type &lt;strong&gt;CLIP Text Encode&lt;/strong&gt;, but it might have a different name in your own workflows.&lt;/p&gt;
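&lt;p&gt;To illustrate what the n8n expression does under the hood, here is a small Python sketch that patches a prompt into an exported API-format workflow before it is sent to ComfyUI. The two-node workflow dictionary is a hypothetical stand-in (real exports contain many more nodes, and the node IDs and input names depend on your own workflow):&lt;/p&gt;

```python
import json

# Hypothetical stand-in for a workflow exported via ComfyUI's
# "File / Export (API)" menu; real exports are much larger.
workflow = {
    "39": {"class_type": "CLIPTextEncode",
           "inputs": {"text": "placeholder", "clip": ["38", 0]}},
    "40": {"class_type": "KSampler", "inputs": {"seed": 42}},
}

def inject_prompt(wf, prompt):
    """Set the text input of every CLIPTextEncode node, mimicking what
    the n8n expression does when it substitutes the AI Agent's output."""
    for node in wf.values():
        if node.get("class_type") == "CLIPTextEncode":
            node["inputs"]["text"] = prompt
    return wf

patched = inject_prompt(workflow, "An ancient library at night")
# The kind of JSON body that then gets posted to ComfyUI's /prompt endpoint.
payload = json.dumps({"prompt": patched})
```

&lt;p&gt;The community node handles this substitution for you; the sketch only shows why the expression's placement inside the exported JSON matters.&lt;/p&gt;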

&lt;h3&gt;
  
  
  Execution
&lt;/h3&gt;

&lt;p&gt;We're all set! After checking that ComfyUI is running, we launch the workflow from the n8n UI by entering a prompt in the chat box.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwti8h4he4btgoug62kp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuwti8h4he4btgoug62kp.png" alt="User Chat"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a short video of the workflow execution. n8n displays real-time progress, and the generated images can be visualized inside the ComfyUI node.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/sBpbzYwr8Y4"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Here are two images generated from this prompt: "&lt;em&gt;A dramatic, cinematic shot of an ancient library at night, where the books are alive and their pages flutter like birds, forming constellations in the air.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxy3p49bdwwkbsrk3jui.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpxy3p49bdwwkbsrk3jui.png" alt="1st image generated"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm10o7wju67n4zhfq5gzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm10o7wju67n4zhfq5gzg.png" alt="2nd image generated"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, this 3-step workflow is very simple. The true power of coupling n8n and ComfyUI will become apparent with more complex use cases, leveraging n8n's extensive integration capabilities with many other components and solutions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image-to-Image Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Use Case
&lt;/h3&gt;

&lt;p&gt;Let's now create another workflow to transform an existing image based on user instructions. We'll intentionally keep this example super simple for clarity, but your use case might include a more complex workflow leveraging n8n's power. &lt;/p&gt;

&lt;p&gt;Here, we'll use only three nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9t72y5kmgoxtvdphr9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdv9t72y5kmgoxtvdphr9.png" alt="I2I workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;n8n Form / n8n Form trigger&lt;/strong&gt; node to start the workflow by displaying an HTML form for the user to upload the image to modify and specify what changes to apply.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;ComfyUI Image Transformer&lt;/strong&gt; community node to connect to a running ComfyUI instance. To install it, go to the "&lt;strong&gt;Settings / Community nodes&lt;/strong&gt;" menu and search for &lt;a href="https://www.npmjs.com/package/n8n-nodes-comfyui-image-to-image" rel="noopener noreferrer"&gt;&lt;strong&gt;n8n-nodes-comfyui-image-to-image&lt;/strong&gt;&lt;/a&gt;. The example workflow exported from ComfyUI uses the Kontext Edit model to modify an existing image.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;n8n Form / Form Ending&lt;/strong&gt; node to notify the user when the image is generated and offer it for download.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ComfyUI Image Transformer Node
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7dx2rdhqs96an3r5xyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7dx2rdhqs96an3r5xyw.png" alt="ComfyUI Image Transformer Node"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This node is quite similar to the &lt;strong&gt;n8n-nodes-comfyui&lt;/strong&gt; node we used before, with the insertion of the &lt;em&gt;$json.Promt&lt;/em&gt; expression into the exported ComfyUI JSON workflow to inject instructions from the user.&lt;/p&gt;

&lt;p&gt;The main difference concerns how the input image to be modified is handled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input Type&lt;/strong&gt; defines how the image is obtained from the previous form node; we'll choose &lt;strong&gt;Binary&lt;/strong&gt; instead of &lt;strong&gt;URL&lt;/strong&gt; or &lt;strong&gt;Base64&lt;/strong&gt; text.&lt;/li&gt;
&lt;li&gt;The property containing the binary file must be specified, which is the &lt;strong&gt;data&lt;/strong&gt; field here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Node ID&lt;/strong&gt; identifies, within the exported ComfyUI JSON workflow, the node in charge of loading the input image (it must be of type &lt;strong&gt;LoadImage&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've added the last node to finalize the form management started with the first node, retrieve the modified image, return it in binary format, and offer the user the option to save it locally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0am0bq9jsxflv1i5hz92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0am0bq9jsxflv1i5hz92.png" alt="Form Ending"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution
&lt;/h3&gt;

&lt;p&gt;Let's execute the workflow. n8n displays a form for us to enter both the image and the associated instructions for its modification.&lt;/p&gt;

&lt;p&gt;Here is a short video of the workflow execution.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/Os7Fp7jop7w"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initial Image&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ih2bnwq5v1klxyyxb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh1ih2bnwq5v1klxyyxb8.png" alt="Initial Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modified Image&lt;/strong&gt; with the prompt "&lt;em&gt;Make the scene at night with full moon and moonlight&lt;/em&gt;":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2icl18zbcsfe1n3i3t8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi2icl18zbcsfe1n3i3t8.png" alt="Modified Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This second example workflow is so simple that we could do the exact same thing directly using the ComfyUI UI. It's here simply to illustrate how integration with n8n can be achieved. A more value-added workflow might, for instance, include a loop that allows the user to keep modifying the image outputs until they are satisfied.&lt;/p&gt;

&lt;p&gt;Also, note that the &lt;strong&gt;n8n-nodes-comfyui&lt;/strong&gt; package offers other custom nodes for integration into your workflows, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dual Image Transformer&lt;/li&gt;
&lt;li&gt;Single Image to Video&lt;/li&gt;
&lt;li&gt;Dual Image Video Generator&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's also worth noting that even though n8n offers Form nodes, it's primarily intended to be driven in the backend through API calls; that API access, however, is limited to Enterprise licensees.&lt;/p&gt;




&lt;p&gt;With these two workflows, we've demonstrated how n8n can serve as a powerful orchestrator for ComfyUI. By leveraging its visual editor and extensive library of integrations, we transformed a simple user prompt into a rich, structured guide for image generation and created a seamless image-to-image transformation process.&lt;/p&gt;

&lt;p&gt;While our examples were simple to illustrate the concepts, the true value of n8n lies in its ability to connect ComfyUI with a vast ecosystem of tools, from databases and CRMs to messaging services and other AI models. This opens up new possibilities for building sophisticated, end-to-end applications that go far beyond what a standalone ComfyUI interface can offer.&lt;/p&gt;

&lt;p&gt;In the next article of this series, we'll explore another paradigm for connecting ComfyUI with agent-based solutions. We will delve into the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;, designed to streamline and standardize the way AI models communicate and share contextual information. This will offer a new, more efficient method for agents to interact with and control ComfyUI.&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>n8n</category>
      <category>genai</category>
      <category>agents</category>
    </item>
    <item>
      <title>WebSockets &amp; ComfyUI: Building Interactive AI Applications</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Fri, 05 Sep 2025 09:17:07 +0000</pubDate>
      <link>https://forem.com/worldlinetech/websockets-comfyui-building-interactive-ai-applications-1j1g</link>
      <guid>https://forem.com/worldlinetech/websockets-comfyui-building-interactive-ai-applications-1j1g</guid>
      <description>&lt;p&gt;This is the second article of a series about how to integrate ComfyUI with other tools to build more complex workflows. We'll move beyond the familiar node-based interface to explore how to connect ComfyUI from code and no-code solutions, using API calls or MCP Servers.&lt;/p&gt;

&lt;p&gt;You'll learn &lt;strong&gt;how to use ComfyUI's API to build custom applications&lt;/strong&gt; and automate tasks, creating powerful and automated systems for generative AI.&lt;/p&gt;




&lt;p&gt;In the &lt;a href="https://dev.to/worldlinetech/unlocking-comfyuis-power-a-guide-to-the-http-api-in-jupyter-1mpi"&gt;previous article&lt;/a&gt; of the &lt;em&gt;Beyond the ComfyUI Canvas&lt;/em&gt; series, we demonstrated how to connect ComfyUI with Jupyter Notebook using basic HTTP API calls. While functional, this approach had a significant limitation: it relied on a &lt;code&gt;time.sleep()&lt;/code&gt; call to wait for workflow completion, requiring manual adjustments based on the complexity of each workflow. That is far from an ideal solution.&lt;/p&gt;

&lt;p&gt;To overcome this inefficiency, we’ll &lt;strong&gt;leverage ComfyUI’s WebSocket API&lt;/strong&gt; (the &lt;code&gt;/ws&lt;/code&gt; endpoint), which enables real-time, bidirectional communication between Jupyter and ComfyUI. This upgrade unlocks a seamless experience by providing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant execution progress updates to track workflow status,&lt;/li&gt;
&lt;li&gt;Live node execution feedback for monitoring each step,&lt;/li&gt;
&lt;li&gt;Immediate error messages and debugging insights for troubleshooting,&lt;/li&gt;
&lt;li&gt;Dynamic queue status updates to respond to changes on the fly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By adopting WebSockets, we eliminate guesswork and create a responsive, interactive workflow.&lt;/p&gt;
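
&lt;p&gt;To make these message types concrete before we look at the full implementation, here is a minimal, server-free sketch of how such frames can be classified. The &lt;code&gt;classify_ws_message&lt;/code&gt; helper and the sample frames are illustrative only, but the &lt;code&gt;executing&lt;/code&gt; message carrying a &lt;code&gt;null&lt;/code&gt; node is the completion signal the code later in this article relies on:&lt;/p&gt;

```python
import json

def classify_ws_message(raw, prompt_id):
    """Turn a ComfyUI WebSocket text frame into a short status string.

    A frame of type "executing" whose data carries node=None and our
    prompt_id signals that the whole workflow has finished.
    """
    message = json.loads(raw)
    data = message.get("data", {})
    if message.get("type") == "progress":
        return f"progress {data.get('value')}/{data.get('max')}"
    if message.get("type") == "executing":
        if data.get("node") is None and data.get("prompt_id") == prompt_id:
            return "done"
        return f"executing node {data.get('node')}"
    return message.get("type", "unknown")

# Hypothetical frames, shaped like the messages handled later in this article
print(classify_ws_message('{"type": "progress", "data": {"value": 4, "max": 20}}', "abc"))   # progress 4/20
print(classify_ws_message('{"type": "executing", "data": {"node": null, "prompt_id": "abc"}}', "abc"))  # done
```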

&lt;h2&gt;
  
  
  The Use Case
&lt;/h2&gt;

&lt;p&gt;Let's simplify our previous use case by dropping the OpenAI Assistant and focusing on how to eliminate manual polling and delays. The process is designed to be both intuitive and efficient:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Setup&lt;/strong&gt;: A pre-defined ComfyUI workflow (loaded from a JSON file) serves as the foundation for image generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt Customization&lt;/strong&gt;: The user provides a text prompt which is dynamically inserted into the workflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-Time Execution&lt;/strong&gt;: Using ComfyUI’s WebSocket API, the notebook sends the workflow to the server and monitors its progress in real time—receiving live updates on execution status, node activity, and completion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result Retrieval&lt;/strong&gt;: Once generation finishes, the resulting images are automatically fetched and displayed directly in the notebook, creating a seamless end-to-end experience.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s dive into the implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get prompt from user
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please enter your prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please enter your prompt
A penguin in a tuxedo, DJing at a club for dancing jellyfish
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Trigger the Workflow from Jupyter Notebook
&lt;/h2&gt;

&lt;p&gt;Below, you’ll find a detailed breakdown of the code designed for use in a Jupyter Notebook, complete with helpful comments to guide you through each step and explain its functionality.&lt;/p&gt;
&lt;h3&gt;
  
  
  Imports and main functions
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;  &lt;span class="c1"&gt;# For WebSocket communication
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;       &lt;span class="c1"&gt;# For generating unique client IDs
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;       &lt;span class="c1"&gt;# For JSON data handling
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;   &lt;span class="c1"&gt;# For HTTP requests (replaces urllib)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;  &lt;span class="c1"&gt;# For image processing
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;         &lt;span class="c1"&gt;# For handling binary data streams
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;  &lt;span class="c1"&gt;# For displaying images in Jupyter
&lt;/span&gt;
&lt;span class="c1"&gt;# Server configuration
&lt;/span&gt;&lt;span class="n"&gt;server_address&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Local server address and port
&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;      &lt;span class="c1"&gt;# Unique client ID for this session
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Send a prompt to the server for execution.

    Args:
        prompt (dict): The workflow/prompt to execute.
        prompt_id (str): Unique ID for tracking the prompt.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subfolder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;folder_type&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetch an image from the server.

    Args:
        filename (str): Name of the image file.
        subfolder (str): Subfolder where the image is stored.
        folder_type (str): Type of folder (e.g., &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;).

    Returns:
        bytes: Binary image data.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subfolder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;subfolder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;folder_type&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/view&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve the execution history for a given prompt ID.

    Args:
        prompt_id (str): ID of the prompt whose history is requested.

    Returns:
        dict: History data for the prompt.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Execute a prompt and collect the resulting images.

    Args:
        ws (websocket.WebSocket): Active WebSocket connection.
        prompt (dict): The workflow/prompt to execute.

    Returns:
        dict: Dictionary of node IDs and their output images.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output_images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="c1"&gt;# Listen for WebSocket messages until execution is complete
&lt;/span&gt;    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# Execution is done
&lt;/span&gt;        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Binary previews are ignored here
&lt;/span&gt;            &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve and organize output images
&lt;/span&gt;    &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;node_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;images_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
                &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subfolder&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                &lt;span class="n"&gt;images_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output_images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;images_output&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output_images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load the workflow and inject the user prompt
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2i-krea.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Update the prompt text in the workflow
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;39:6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Communication with ComfyUI through WebSockets
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Establish WebSocket connection
&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;server_address&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/ws?clientId=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;client_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Execute the workflow and collect images
&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_images&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Display the output images in Jupyter
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image_data&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;node_id&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# Display each image in the notebook
&lt;/span&gt;        &lt;span class="n"&gt;display&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcggv6lxyx0blcavqo285.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcggv6lxyx0blcavqo285.jpg" alt="1st Generated Image" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z5yh0qq85an74w72q24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z5yh0qq85an74w72q24.png" alt="2nd Generated Image" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;This article demonstrated the power of using &lt;strong&gt;WebSockets&lt;/strong&gt; for real-time, bidirectional communication with ComfyUI. By moving beyond &lt;strong&gt;simple HTTP requests&lt;/strong&gt;, we eliminated the need for manual time delays and created a truly dynamic, responsive workflow. This allowed us to monitor the execution of our AI pipeline in real time, ensuring a more reliable and efficient integration. The result is a seamless experience where we can send a prompt and watch the generated images appear automatically in our notebook.&lt;/p&gt;

&lt;p&gt;Having now explored two different ways to integrate ComfyUI with Python code executed in Jupyter, we've laid a strong foundation for building custom, high-level generative AI applications. But what if you're not a developer, or you simply prefer a visual, no-code approach to orchestration? In the next article of the series, we'll shift our focus from code to a &lt;strong&gt;no-code solution like n8n&lt;/strong&gt; to show you how to build powerful ComfyUI workflows without writing a single line of code. &lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>jupyter</category>
      <category>websockets</category>
      <category>genai</category>
    </item>
    <item>
      <title>Unlocking ComfyUI's Power: A Guide to the HTTP API in Jupyter</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Thu, 04 Sep 2025 15:28:06 +0000</pubDate>
      <link>https://forem.com/worldlinetech/unlocking-comfyuis-power-a-guide-to-the-http-api-in-jupyter-1mpi</link>
      <guid>https://forem.com/worldlinetech/unlocking-comfyuis-power-a-guide-to-the-http-api-in-jupyter-1mpi</guid>
<description>&lt;p&gt;This is the first article of a series about how to integrate ComfyUI with other tools to build more complex workflows. We'll move beyond the familiar node-based interface to explore how to connect to ComfyUI from code and no-code solutions, using API calls or MCP Servers.&lt;/p&gt;

&lt;p&gt;You'll learn how to use ComfyUI's API to build custom applications and automate tasks, creating powerful and automated systems for generative AI.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/comfyanonymous/ComfyUI" rel="noopener noreferrer"&gt;ComfyUI&lt;/a&gt; is a powerful, modular interface for generative models, allowing users to create complex AI image, video and sound generation workflows with a node-based editor. &lt;a href="https://jupyter.org/" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt;, on the other hand, is a popular interactive environment for data analysis, visualization, and prototyping.&lt;/p&gt;

&lt;p&gt;By integrating ComfyUI with Jupyter Notebook, you can leverage the flexibility of ComfyUI’s workflows directly within your Python scripts or data science pipelines. This first article focuses on a simple approach using basic HTTP API calls.&lt;/p&gt;

&lt;p&gt;Most of this article is exported from an actual Jupyter Notebook: the text, the Python code, and the execution results are all shown as they appeared there.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case
&lt;/h2&gt;

&lt;p&gt;Our goal is to build a high-level generative AI workflow that combines the power of an intelligent agent with the robust image generation capabilities of ComfyUI. The process unfolds in a few simple steps, all orchestrated within a Jupyter Notebook:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;User Input:&lt;/strong&gt; The workflow begins with a simple, high-level prompt entered directly into the notebook.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Agent-Powered Expansion:&lt;/strong&gt; An &lt;strong&gt;OpenAI Assistant&lt;/strong&gt; then takes this basic prompt and transforms it into a detailed, structured &lt;strong&gt;JSON Prompt Style Guide&lt;/strong&gt;. This process enriches the initial idea with specific creative instructions, such as style, composition, and lighting.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Initiating Generation:&lt;/strong&gt; This expanded JSON guide is automatically injected into a pre-defined ComfyUI workflow. A single API call to the &lt;strong&gt;ComfyUI server&lt;/strong&gt; starts the image generation process.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Displaying the Result:&lt;/strong&gt; Once the generation is complete, we make a second API call to fetch the resulting images. The images are then displayed directly within the Jupyter Notebook, completing our automated pipeline. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prepare a ComfyUI Workflow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create or load a workflow in ComfyUI.&lt;/li&gt;
&lt;li&gt;Save the workflow as a .json file from the "&lt;em&gt;File / Export (API)&lt;/em&gt;" menu (e.g., &lt;code&gt;t2i-krea.json&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
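
&lt;p&gt;An API-format export is a flat JSON object mapping node ids to their class and inputs, and it is this shape that makes prompt injection possible. The miniature workflow and the &lt;code&gt;set_prompt_text&lt;/code&gt; helper below are a hypothetical sketch of that structure, not the contents of &lt;code&gt;t2i-krea.json&lt;/code&gt;; in practice you would target the specific node id of your positive prompt node rather than every &lt;code&gt;CLIPTextEncode&lt;/code&gt; node:&lt;/p&gt;

```python
import json

# Miniature stand-in for an exported API workflow; real exports such as
# t2i-krea.json use the same node-id to {"class_type", "inputs"} mapping
workflow = json.loads("""
{
  "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "placeholder", "clip": ["4", 1]}},
  "3": {"class_type": "KSampler", "inputs": {"seed": 42}}
}
""")

def set_prompt_text(workflow, text):
    """Write `text` into every CLIPTextEncode node of an API-format workflow."""
    for node in workflow.values():
        if node.get("class_type") == "CLIPTextEncode":
            node["inputs"]["text"] = text
    return workflow

set_prompt_text(workflow, "Hanuman flying over a modern city at night")
print(workflow["6"]["inputs"]["text"])  # Hanuman flying over a modern city at night
```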

&lt;h2&gt;
  
  
  Get initial prompt from user
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please enter your prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Please enter your prompt
Hanuman flying over a modern city at night
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Generate JSON Prompt Style Guide with an Assistant
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Create a thread
&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Send a message
&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_prompt&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run the assistant
&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;assistant_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asst_Uj0Qr0rG0bz8NVk1LWiS9UKv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Wait for completion and retrieve the response
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;run&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;runs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the response
&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;threads&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;thread_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;json_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
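&lt;p&gt;Note that the polling loop above only checks for &lt;code&gt;completed&lt;/code&gt;, so it will spin forever if the run ends in &lt;code&gt;failed&lt;/code&gt; or &lt;code&gt;cancelled&lt;/code&gt;. A more defensive variant (a sketch, not part of the original notebook) stops on any terminal status and enforces a timeout:&lt;/p&gt;

```python
import time

# Terminal run states in the Assistants API; anything else means
# the run is still in progress.
TERMINAL_STATES = {"completed", "failed", "cancelled", "expired"}

def wait_for_run(client, thread_id, run_id, timeout=120):
    """Poll a run until it reaches a terminal state or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in TERMINAL_STATES:
            return run
        time.sleep(1)
    raise TimeoutError("Assistant run did not finish in time")
```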





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Urban Deus Ex Hanuman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inspiration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Modern Urban Aesthetics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Hindu Mythology"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Superhero Comics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Cyberpunk Lighting"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scene"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hanuman, the Hindu god, flying over a bustling modern city radiating bright lights under the cloak of night sky"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subjects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hanuman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Strong, muscular figure with a monkey face, holding a gada(mace)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"midground"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flying with one hand extended"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"determined"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"interaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flying over the city"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"modern urban skyline with skyscrapers, neon billboards, and busy traffic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"expansive"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"comic-realistic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"color_palette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"primary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#202020"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"secondary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#505050"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"highlight"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#ff6a00"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"shadow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"#0d0d0d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"background_gradient"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"#0d0d0d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"#303030"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lighting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Glistening city lights with diffused neon glow and soft moonlight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"powerful and captivating"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scenery"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Modern urban cityscape with skyscrapers, roads, traffic and massive billboards with neon signs"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Slightly off-center focus with Hanuman taking up prominent space"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"camera"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"angle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium shot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"lens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wide-angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"focus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sharp subject, blurred background"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Digital Painting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"textures"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"smooth skin of Hanuman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"rough concrete of buildings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"glossy glass of skyscrapers"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resolution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4K"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"details"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"clothing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hanuman is dressed in traditional golden and red garment"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Night with clear sky and a soft moonlight"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"effects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Bokeh effect for city lights"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Glow effect for neon lights"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"themes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Divinity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Strength"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Modernization"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Contrast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Juxtaposition of Tradition with Modernity"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"usage_notes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The style is effective in creating a surprising juxtaposition of traditional divinity with modern landscapes. Use this style for high impact illustrations where contrasts need to be highlighted."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Trigger the Workflow from Jupyter Notebook
&lt;/h2&gt;

&lt;p&gt;Use the &lt;code&gt;requests&lt;/code&gt; library to send a POST request to the ComfyUI &lt;code&gt;/prompt&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# ComfyUI server URL
&lt;/span&gt;&lt;span class="n"&gt;comfy_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;prompt_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;comfy_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Load your workflow JSON
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t2i-krea.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Replace the prompt
&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;39:6&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json_prompt&lt;/span&gt;

&lt;span class="c1"&gt;# Define the payload
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jupyter_notebook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Send the request
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get the prompt_id
&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
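&lt;p&gt;The node id &lt;code&gt;39:6&lt;/code&gt; is specific to this exported workflow; if you edit and re-export the graph, the id can change. A hypothetical helper (not from the original notebook) that locates the text-encoding node by its class instead:&lt;/p&gt;

```python
def set_prompt(workflow, text):
    """Write `text` into the first CLIPTextEncode node found.

    Workflows with separate positive and negative prompts would need a
    finer test, e.g. on the node's "_meta" title.
    """
    for node in workflow.values():
        if node.get("class_type") == "CLIPTextEncode":
            node["inputs"]["text"] = text
            return True
    return False

# Example with a minimal made-up workflow:
wf = {"39:6": {"class_type": "CLIPTextEncode", "inputs": {"text": ""}}}
set_prompt(wf, "Hanuman flying over a modern city at night")
print(wf["39:6"]["inputs"]["text"])
```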



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c1a2ced4-772c-4aeb-ac45-bfa183d03a88
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Retrieve the Generated Images
&lt;/h2&gt;

&lt;p&gt;ComfyUI processes the workflow asynchronously. &lt;/p&gt;

&lt;p&gt;To fetch the result, poll the &lt;code&gt;/history&lt;/code&gt; endpoint:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;IPython.display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;display&lt;/span&gt;    

&lt;span class="c1"&gt;# Wait for the workflow to complete
&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Adjust based on workflow complexity
&lt;/span&gt;
&lt;span class="c1"&gt;# Fetch the latest result for our prompt
&lt;/span&gt;&lt;span class="n"&gt;history_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;comfy_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;history_url&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Navigate to the list of image outputs and display them
&lt;/span&gt;&lt;span class="n"&gt;image_outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outputs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;image_outputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;filename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;image_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;comfy_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/view?filename=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;image_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
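&lt;p&gt;The fixed &lt;code&gt;time.sleep(25)&lt;/code&gt; works for this workflow but is fragile: a heavier graph will not be done yet, and a lighter one wastes time. A sketch of an alternative (assuming the &lt;code&gt;comfy_url&lt;/code&gt; and &lt;code&gt;prompt_id&lt;/code&gt; variables from the cells above) is to poll &lt;code&gt;/history&lt;/code&gt; until the entry appears with outputs:&lt;/p&gt;

```python
import time
import requests

def wait_for_outputs(comfy_url, prompt_id, timeout=300, interval=2):
    """Poll /history/{prompt_id} until ComfyUI reports outputs for it."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        history = requests.get(f"{comfy_url}/history/{prompt_id}").json()
        entry = history.get(prompt_id)
        if entry and entry.get("outputs"):
            return entry["outputs"]
        time.sleep(interval)
    raise TimeoutError("Generation did not finish in time")
```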



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4g9e4aq1u1ilozs2xj0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz4g9e4aq1u1ilozs2xj0.png" alt="First Generated Image" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0l7hxignffemzrvzy6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0l7hxignffemzrvzy6k.png" alt="Second Generated Image" width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;In this article, we've seen how to leverage the power of ComfyUI directly from a Jupyter Notebook. By making simple API calls, we were able to transform a user's basic text prompt into a rich, detailed JSON guide using an OpenAI Assistant, and then feed that guide into a ComfyUI workflow to generate images. This approach demonstrates how you can move beyond the graphical interface to build automated, intelligent systems for creative tasks. The combination of Python's flexibility and ComfyUI's robust backend opens up a world of possibilities for custom, high-level generative AI workflows.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://dev.to/worldlinetech/websockets-comfyui-building-interactive-ai-applications-1j1g"&gt;next article&lt;/a&gt;, we'll take our integration a step further by exploring how to use &lt;strong&gt;WebSockets&lt;/strong&gt; for Real-Time Interaction with ComfyUI.&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>api</category>
      <category>jupyter</category>
      <category>genai</category>
    </item>
    <item>
      <title>Enhancing QR Codes in the Age of GenAI</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Fri, 23 May 2025 09:46:49 +0000</pubDate>
      <link>https://forem.com/worldlinetech/enhancing-qr-codes-in-the-age-of-genai-4fa6</link>
      <guid>https://forem.com/worldlinetech/enhancing-qr-codes-in-the-age-of-genai-4fa6</guid>
      <description>&lt;h2&gt;
  
  
  Traditional QR Codes
&lt;/h2&gt;

&lt;p&gt;Quick Response (QR) codes were developed in 1994 by Masahiro Hara and are now recognized as an ISO/IEC standard. They represent an evolution of 2D barcodes, capable of encoding numeric, alphanumeric, binary, or Kanji data in the form of a pattern of black squares on a white background. These codes are available in various sizes (or versions), ranging from version 1 (21 x 21 squares) to version 40 (177 x 177 squares).&lt;/p&gt;

&lt;p&gt;Numerous libraries and tools exist for generating QR codes. My preferred open-source library is &lt;a href="https://nayuki.io/page/qr-code-generator-library" rel="noopener noreferrer"&gt;QR Code Generator&lt;/a&gt;, which supports all standard features and is available in Java, TypeScript/JavaScript, Python, Rust, C++, and C. Additionally, my favorite all-in-one open-source tool is &lt;a href="https://qrcode.antfu.me" rel="noopener noreferrer"&gt;QR Toolkit&lt;/a&gt;, a Vue/Nuxt application offering marker and module customization, along with verification and comparison mechanisms, an invaluable resource when tweaking QR codes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0gic8761tmz58m94fum.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs0gic8761tmz58m94fum.png" alt="QR Toolkit"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;QR codes comprise several critical components to ensure readability by scanners, including three positional markers, alignment and timing patterns, and a masking system. I will not delve into these details here; instead, I will focus on the built-in error correction mechanism. It employs Reed-Solomon codes (also used in storage media such as CDs/DVDs and RAID 6, and in network technologies such as DSL and satellite links), adding extra codewords to the QR grid so that damaged data can be recovered. The standard defines four levels of error correction, each associated with a different tolerance percentage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Approximate Error Tolerance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;~7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;~15%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quartile&lt;/td&gt;
&lt;td&gt;~25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;~30%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This means a QR code with High error correction can still be scanned even if up to 30% of the image becomes unreadable. This feature is often exploited to embed images within QR codes: the scanner simply treats the modules covered by the embedded image as errors and corrects them.&lt;/p&gt;
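&lt;p&gt;To put concrete numbers on this, here is a back-of-the-envelope sketch in plain Python, using the approximate tolerance percentages from the table above and the standard version-to-size formula:&lt;/p&gt;

```python
# Approximate fraction of modules that may be damaged, per error-correction level
TOLERANCE = {"L": 0.07, "M": 0.15, "Q": 0.25, "H": 0.30}

def module_count(version):
    # A version-v QR code is (17 + 4*v) modules per side: v1 is 21x21, v40 is 177x177
    side = 17 + 4 * version
    return side * side

def max_damaged_modules(version, level):
    # Rough upper bound on how many modules can be obscured while staying scannable
    return int(module_count(version) * TOLERANCE[level])

print(max_damaged_modules(1, "H"))   # 441 modules total, ~132 may be lost
print(max_damaged_modules(40, "L"))  # 31329 modules total, ~2193 may be lost
```

&lt;p&gt;This is why an embedded logo, or the generative redecoration shown below, has to stay within roughly a third of the grid at level High.&lt;/p&gt;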

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg2g8xovbspfxv5wha6y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmg2g8xovbspfxv5wha6y.png" alt="Image embedded in QR Code"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For years, this technique has been used for personalizing QR codes. This article explores an innovative approach to customizing QR codes by leveraging Generative AI instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harnessing Generative AI
&lt;/h2&gt;

&lt;p&gt;My proposal involves using a Stable Diffusion model integrated within the ComfyUI graphical interface to design and execute local generation workflows on a GPU-equipped PC. For detailed guidance on these components, refer to this &lt;a href="https://dev.to/worldlinetech/the-yoga-of-image-generation-part-1-1gan"&gt;article&lt;/a&gt; or this &lt;a href="https://www.youtube.com/watch?v=kXraePyAT-c" rel="noopener noreferrer"&gt;video&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To modify and refine existing QR codes while maintaining their scannability, we will use a specialized ControlNet called &lt;a href="https://huggingface.co/monster-labs/control_v1p_sd15_qrcode_monster" rel="noopener noreferrer"&gt;QR Code Monster&lt;/a&gt;. ControlNets are auxiliary neural network models that inject targeted guidance into the generation process by focusing on specific features of an input image. Each ControlNet emphasizes particular aspects, such as structure (pose, edges, segmentation, depth), texture, content layout (bounding boxes, masks), or style (color maps, textures). In our scenario, we’ll focus on maintaining or modifying QR code contrast features.&lt;/p&gt;

&lt;p&gt;Let’s proceed to create a workflow in ComfyUI, employing Stable Diffusion 1.5, the QR Code Monster ControlNet, and a QR code generated via QR Toolkit.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/uAvAZFG9sWY"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;By adjusting parameters such as the ControlNet’s strength and start/end positions, along with the sampling process (e.g., 50 steps), I obtained a result that remains scannable and aligns with my input prompt: &lt;em&gt;“A beautiful landscape, blue sky, grass, flowers.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dkcvkdk6ubustifznx0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dkcvkdk6ubustifznx0.png" alt="A beautiful landscape, blue sky, grass, flowers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This demonstrates how Stable Diffusion combined with ControlNet preserved the original pattern while injecting desired visual elements. Using QR Toolkit’s comparison feature, we can assess the QR code’s readability by examining the difference markers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u5qapa6debvb97b2shf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u5qapa6debvb97b2shf.png" alt="QR Toolkit - Comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we can modify the prompt to produce multiple variants of our QR code. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt1fl294wsircszdkaur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt1fl294wsircszdkaur.png" alt="Different Prompts"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While changing the overall style is straightforward (first example), embedding specific content within the QR code remains more challenging than with traditional tools (second example). To explore this further, we'll examine two axes separately: Style and Content, before combining them.&lt;/p&gt;
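&lt;p&gt;Producing such variants doesn’t have to be manual. Once a workflow is exported via ComfyUI’s “Save (API Format)” option, a few lines of Python can prepare one queued run per style prompt. This is only a sketch: the node id &lt;code&gt;"6"&lt;/code&gt; and the &lt;code&gt;text&lt;/code&gt; input name are assumptions, so check your own export to locate the CLIPTextEncode node holding the positive prompt.&lt;/p&gt;

```python
import copy
import json

# A workflow exported in ComfyUI's API format is a dict of nodes.
# Node id "6" and the "text" input below are placeholders for illustration.
workflow = json.loads("""{
  "6": {"class_type": "CLIPTextEncode", "inputs": {"text": "placeholder"}}
}""")

styles = [
    "A beautiful landscape, blue sky, grass, flowers",
    "A pattern forged from molten lava, glowing orange and red",
    "An elegant, glowing elven door with silver runes",
]

payloads = []
for style in styles:
    variant = copy.deepcopy(workflow)
    variant["6"]["inputs"]["text"] = style
    # Each payload can then be POSTed to http://127.0.0.1:8188/prompt
    # as JSON to queue one generation per style.
    payloads.append({"prompt": variant})

print(len(payloads))  # 3 payloads, one per style prompt
```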

&lt;h2&gt;
  
  
  Customizing Style
&lt;/h2&gt;

&lt;p&gt;Enhancing the prompt allows for more precise control over the QR code’s aesthetic. For instance, you can leverage a large language model (LLM) to generate a detailed prompt such as:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“A pattern forged from molten lava, glowing with an intense fiery orange and red hue. Cracks in the surface reveal volcanic heat, with small embers rising around it.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhm5r4e6g34iujqg1fqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzhm5r4e6g34iujqg1fqt.png" alt="A pattern forged from molten lava"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, for a more intricate and mystical style:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“An elegant, glowing elven door adorned with intricate, nature-inspired patterns and shimmering silver runes. Delicate vines and luminescent flowers intertwine with the carvings, pulsating with soft emerald and sapphire light. The archway, crafted from ethereal white stone, radiates a mystical aura, with faint golden mist swirling at its base, hinting at an ancient portal to a hidden realm.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yau7ofwv4o3l7ndn48v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yau7ofwv4o3l7ndn48v.png" alt="An elegant, glowing elven door"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Predefined styles can also be injected into prompts using the &lt;a href="https://github.com/MohammadAboulEla/ComfyUI-iTools" rel="noopener noreferrer"&gt;iTools Prompt Styler Extra&lt;/a&gt; node in ComfyUI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tanzmn396uo87bmm10j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9tanzmn396uo87bmm10j.png" alt="iTools"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This node offers reusable prompts categorized by various artistic styles: 3D, Art, Craft, Design, Drawing, Illustration, Painting, Sculpture, Vector, and more. Incorporating it into our workflow makes testing different styles effortless without altering other parameters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweute0t5tv1pv5cs9qwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fweute0t5tv1pv5cs9qwb.png" alt="iTools Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below are examples of QR codes generated with different styles:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfx1jkmc3aace0acwq81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfx1jkmc3aace0acwq81.png" alt="iTools Examples"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, combining styles with custom prompts allows for highly personalized designs, enabling limitless customization of your QR codes’ appearance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Embedding Content
&lt;/h2&gt;

&lt;p&gt;Having mastered style adjustments, the next step is to embed specific generated content into QR codes. For example, I wish to insert an image of a yoga pose. If you’ve read my previous articles on AI image generation, you’ll be familiar with how poses are transferred through workflows. Details are available &lt;a href="https://dev.to/worldlinetech/the-yoga-of-image-generation-part-2-42c"&gt;here&lt;/a&gt; for further reference.&lt;/p&gt;

&lt;p&gt;We’ll start with an abstract image of the target pose, add Depth and Canny Edge ControlNets to our workflow, and specify in the prompt: &lt;em&gt;“man, mixed race, short curly hair, black hair, 40 years old, white T-shirt, black yoga pants, short sleeves, smiling, viewing glasses, white background, barefoot.”&lt;/em&gt; Essentially, I aim to generate an image resembling myself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gab0lxpdd4znj8n0tpl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gab0lxpdd4znj8n0tpl.png" alt="Pose Transfer Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To ensure a realistic likeness, additional steps include incorporating the FaceID IP Adapter and the FaceDetailer post-processing model into the workflow. Refer to this &lt;a href="https://dev.to/worldlinetech/the-yoga-of-image-generation-part-3-5517"&gt;article&lt;/a&gt; for comprehensive guidance on implementing face transfer. The outcome preserves scannability and creates a QR code embedding the desired pose and identity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom9y6qblxw6npixsrl4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fom9y6qblxw6npixsrl4i.png" alt="Pose and Face Transfer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using QR Toolkit again, the comparison shows about 26 mismatched modules, primarily around the facial features and body.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg9a0ozszeet7h0huzee.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frg9a0ozszeet7h0huzee.png" alt="QR Toolkit Comparison"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Integrating Style and Content
&lt;/h2&gt;

&lt;p&gt;All previous steps can be combined by adding the iTools node to the final workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xi18981onenilcvxojs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xi18981onenilcvxojs.png" alt="Combined Example Outputs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Making the QR Code Animate
&lt;/h2&gt;

&lt;p&gt;Given that I can embed a face into the QR code, I can also animate facial expressions using specialized nodes. The &lt;a href="https://github.com/PowerHouseMan/ComfyUI-AdvancedLivePortrait" rel="noopener noreferrer"&gt;Advanced Live Portrait&lt;/a&gt; tool is designed for editing, inserting, and animating facial expressions in images. By inputting our generated QR code, we can animate my face to produce a smiling expression or nodding motion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft384rduzqeufd8f0lnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fft384rduzqeufd8f0lnc.png" alt="Advanced Live Portrait"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resulting animation can be exported as an animated GIF or video:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpes98vmdcx08j3w5uts.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpes98vmdcx08j3w5uts.gif" alt="Animated GIF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This short tutorial has demonstrated how to significantly enhance both the stylistic and content-related aspects of a QR code. You are now equipped to craft engaging, customized QR codes that align with your personal or branding style. &lt;/p&gt;

&lt;p&gt;The only limits are your patience and imagination, so have fun experimenting!&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/STYLfK_xeEo"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>stablediffusion</category>
      <category>qrcode</category>
      <category>genai</category>
    </item>
    <item>
      <title>The Yoga of Image Generation – Part 3</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Mon, 19 May 2025 14:16:11 +0000</pubDate>
      <link>https://forem.com/worldlinetech/the-yoga-of-image-generation-part-3-5517</link>
      <guid>https://forem.com/worldlinetech/the-yoga-of-image-generation-part-3-5517</guid>
      <description>&lt;p&gt;In the first two parts of this series, we explored Stable Diffusion, ComfyUI, and how to build Text-to-Image and Image-to-Image workflows to generate images of Yoga poses. With the help of ControlNets, we learned how to transfer a pose from an abstract reference image to our final generated image.&lt;/p&gt;

&lt;p&gt;A Yoga sequence consists of several connected poses, which means we need visual consistency across all generated images in the sequence. This consistency must cover not only the &lt;em&gt;style&lt;/em&gt;, which we addressed in the previous part of the series, but also the &lt;em&gt;facial features&lt;/em&gt; of the person depicted.&lt;/p&gt;

&lt;h2&gt;
  
  
  LoRAs (Low-Rank Adapters)
&lt;/h2&gt;

&lt;p&gt;Let’s now introduce a new component into our workflow to tackle this challenge: Low-Rank Adapters (LoRAs). LoRAs make slight adaptations to the base model they are trained on by modifying only a small subset of neural network parameters. This is a highly efficient technique, as it enables faster training, smaller file sizes, and lower memory usage. You can think of a LoRA as a patch applied at runtime to the base model. Multiple LoRAs can be chained together.&lt;/p&gt;
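&lt;p&gt;The “small subset of parameters” claim is easy to quantify. A LoRA replaces the update of a full d x d weight matrix W with two thin matrices B (d x r) and A (r x d), applied at runtime as W' = W + alpha * B @ A. The sketch below uses illustrative sizes (one 4096-wide attention projection, rank 8) to show the savings:&lt;/p&gt;

```python
# Parameter arithmetic behind a LoRA: train two thin matrices instead of
# updating the full weight matrix, then patch them in at runtime.

def full_update_params(d):
    # Fine-tuning a d x d matrix directly touches every entry
    return d * d

def lora_params(d, r):
    # B is d x r and A is r x d, so the update is only 2*d*r values
    return 2 * d * r

d, r = 4096, 8  # illustrative: one attention projection, rank 8
print(full_update_params(d))  # 16777216 trainable values
print(lora_params(d, r))      # 65536 -- about 0.4% of the full update
```

&lt;p&gt;That three-orders-of-magnitude gap is what makes LoRA files small enough to share and cheap enough to chain.&lt;/p&gt;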

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r13kwq1q4qn21y64fwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r13kwq1q4qn21y64fwg.png" alt="LoRA nodes in ComfyUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LoRAs are typically used to specialize an existing model with certain image features such as style, poses, concepts, or characters. They are triggered in prompts using specific keywords defined by the LoRA creator during training. The community offers numerous LoRAs available for download from sites like civitai.com, which can be integrated into your local ComfyUI workflows.&lt;/p&gt;

&lt;p&gt;Here are two examples of images generated using a "Pencil drawing" LoRA, with two different keywords and all other parameters unchanged:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjg4x7on42hv6w2zk1lkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjg4x7on42hv6w2zk1lkl.png" alt="LoRA for Style"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The community also offers countless LoRAs for generating images resembling celebrities. Let’s try using some of these to achieve facial consistency. We’ll start by testing Celebrity LoRAs with very light pose transfer (ControlNet strength set to 10%) to see how closely the generated faces match.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc8k8p6y5izmqphabpz7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuc8k8p6y5izmqphabpz7.png" alt="Testing a Celebrity LoRA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Promising results! Note that the poses aren’t identical across images; this is due to the low ControlNet strength we used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrh55yd5mcbidrr6958g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrh55yd5mcbidrr6958g.png" alt="LoRA for Celebrities"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, let’s incorporate these LoRAs into our previous pose generation workflow. I stacked two LoRAs: one for facial identity and another for a graphite drawing style. I also kept the two ControlNets we introduced earlier for pose transfer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd425jemmxfdrqlq8j3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxd425jemmxfdrqlq8j3n.png" alt="Workflow with LoRAs and ControlNets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this setup, we can generate sequences that are consistent in both style and facial identity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8g6tei2gi419rx4a80v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8g6tei2gi419rx4a80v.png" alt="Sequence with LoRAs and ControlNets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, we can change the celebrity reference or even chain multiple LoRAs together, adjusting their strengths to blend features of different identities. However, using public figures still feels a bit uncomfortable, potentially raising ethical concerns around deepfakes.&lt;/p&gt;

&lt;p&gt;A better approach is to create your own LoRA, avoiding such issues. So I decided to train a LoRA using images of my wife. I first experimented with the DreamBooth method, using a &lt;a href="https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/SDXL_DreamBooth_LoRA_.ipynb" rel="noopener noreferrer"&gt;Colab Notebook&lt;/a&gt; and Google GPUs. I trained the model on 28 images of her, using an SDXL base model, over 2 epochs, taking around 1.5 hours.&lt;/p&gt;

&lt;p&gt;The results were... promising 😉&lt;br&gt;
Here are some of the best images generated with my first custom LoRA:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ee92h7esq473587ncz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52ee92h7esq473587ncz.png" alt="Using my very first LoRA (Dreambooth)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The resemblance is there, but not quite enough, and the image quality was lacking. So I tried again, this time training the LoRA locally on my PC using the &lt;a href="https://github.com/bmaltais/kohya_ss" rel="noopener noreferrer"&gt;Kohya_ss&lt;/a&gt; open source tool. I selected the PowerPuffMix model (a fine-tune of SDXL), trained on just 15 images but for 20 epochs. The process took about 3.5 hours and yielded better results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nbc304ijk0nnduhxpnm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nbc304ijk0nnduhxpnm.png" alt="Using my second LoRA (Kohya_ss)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time, both image quality and facial identity were strong enough to integrate into our generation workflow.&lt;/p&gt;

&lt;p&gt;Here are some outputs using the new LoRA. While the face doesn’t perfectly resemble my wife (likely due to the influence of the ControlNets), the identity consistency we needed is clearly present.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls40mbheeq3l4pxbhwww.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fls40mbheeq3l4pxbhwww.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lighting is still a bit unstable, and overall image quality remains imperfect. I could improve this by training on more images and increasing the number of epochs. However, the final LoRA is still fundamentally linked to the base model and can't be applied to another one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Image Prompt Adapters (IP Adapters)
&lt;/h2&gt;

&lt;p&gt;Let’s now try another technique: Image Prompt Adaptation, which is more decoupled from the base model. Like a ControlNet, an IP Adapter conditions generation on a reference image, but it injects the image features directly into the model’s attention layers. Think of an IP Adapter as a one-image LoRA.&lt;/p&gt;

&lt;p&gt;The FaceID IP Adapter, specialized in facial recognition and feature extraction, is a perfect fit for our needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrafai3wpql1n74jnqw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrafai3wpql1n74jnqw9.png" alt="FaceDetailer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While exploring facial enhancement tools, I also discovered FaceDetailer, which improves facial features (eyes, nose, lips, expression) after image generation. I decided to integrate both of these components into our workflow. FaceDetailer’s enhancements are based on the FaceID input, so they remain faithful to the original facial reference.&lt;/p&gt;

&lt;p&gt;Here is the complete workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtwwnt7mbvvs0ycnm0no.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxtwwnt7mbvvs0ycnm0no.png" alt="Final workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We now finally achieve our desired outcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control over &lt;em&gt;style&lt;/em&gt; via prompts and embeddings&lt;/li&gt;
&lt;li&gt;Control over &lt;em&gt;pose&lt;/em&gt; via ControlNets&lt;/li&gt;
&lt;li&gt;Control over &lt;em&gt;identity&lt;/em&gt; via the FaceID IP Adapter and FaceDetailer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup allows us to generate precise and coherent Yoga sequences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon7rt9aqx4fy76u9jzt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fon7rt9aqx4fy76u9jzt2.png" alt="Sequence with FaceID, FaceDetailer and ControlNets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another advantage of this workflow is how easily we can switch the base model. For instance, here’s an example using the &lt;a href="https://civitai.com/models/198051/cheyenne" rel="noopener noreferrer"&gt;Cheyenne&lt;/a&gt; model, which specializes in cartoon and graphic novel styles:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczdzm232i2ij47gr9ezm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fczdzm232i2ij47gr9ezm.png" alt="Changing the Base Model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also incredibly easy to change the subject’s identity. Since FaceID only requires a single image and no training phase, here are examples generated with the exact same workflow, using my own face as input for facial identity:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbs82a97l89pvqkkvsjh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbs82a97l89pvqkkvsjh.png" alt="Changing the Persona"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;This concludes our three-part series. My initial goal — generating accurate yoga poses and full sequences using only a local machine — has been achieved. &lt;/p&gt;

&lt;p&gt;In Part 1, we introduced Stable Diffusion and ComfyUI to build simple Text-to-Image workflows using prompts and embeddings. In Part 2, we explored pose transfer using Image-to-Image workflows and ControlNets. In this final installment, we addressed facial consistency, first with LoRAs, then with the FaceID IP Adapter and the post-processing FaceDetailer.&lt;/p&gt;

&lt;p&gt;You’re now ready to create custom workflows tailored to your specific visual goals. Enjoy experimenting with generative AI to express your creativity with precision!&lt;/p&gt;

&lt;p&gt;Stay tuned for more image generation tutorials, and in the meantime feel free to explore my YouTube channel.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/9QRz5cKQCUg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>stablediffusion</category>
      <category>lora</category>
      <category>ipadapter</category>
    </item>
    <item>
      <title>🎨JSON Style Guides for Controlled Image Generation with GPT-4o and GPT-Image-1</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Thu, 08 May 2025 20:33:19 +0000</pubDate>
      <link>https://forem.com/worldlinetech/json-style-guides-for-controlled-image-generation-with-gpt-4o-and-gpt-image-1-36p</link>
      <guid>https://forem.com/worldlinetech/json-style-guides-for-controlled-image-generation-with-gpt-4o-and-gpt-image-1-36p</guid>
      <description>&lt;p&gt;Image generation with GPT-4o and GPT-Image-1 can yield visually stunning results, but without clear instructions, quality and consistency vary from run to run. Using &lt;strong&gt;JSON style guides&lt;/strong&gt; is a powerful way to bring &lt;strong&gt;clarity, structure, and repeatability&lt;/strong&gt; to your prompts. This tutorial will walk you through why JSON style guides matter, how to use them effectively, and provide a complete reference to all parameters you can define.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Why Use a JSON Style Guide?
&lt;/h2&gt;

&lt;p&gt;Natural language is powerful but often ambiguous. By organizing your image prompts using JSON:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ You &lt;strong&gt;eliminate ambiguity&lt;/strong&gt; with structured fields.&lt;/li&gt;
&lt;li&gt;✅ You &lt;strong&gt;ensure consistency&lt;/strong&gt; across multiple generations.&lt;/li&gt;
&lt;li&gt;✅ You can &lt;strong&gt;automate or scale&lt;/strong&gt; prompt creation for batch processing.&lt;/li&gt;
&lt;li&gt;✅ You &lt;strong&gt;separate content from style&lt;/strong&gt;, making iterations easier.&lt;/li&gt;
&lt;li&gt;✅ Developers and designers can work together using shared, machine-readable formats.&lt;/li&gt;
&lt;/ul&gt;
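&lt;p&gt;Because the guide is plain JSON, the "automate or scale" point is just a loop. Here’s a minimal Python sketch (the field values are illustrative, not prescribed) that clones a base guide and varies a single field to produce a batch of otherwise identical prompts:&lt;/p&gt;

```python
import json
from copy import deepcopy

# Base style guide shared by every generation (illustrative values).
base_guide = {
    "scene": "a magical forest clearing",
    "style": "storybook illustration",
    "mood": "whimsical and cozy",
}

# Varying a single field yields a batch of otherwise identical prompts.
moods = ["whimsical and cozy", "dark and mysterious", "bright and cheerful"]

prompts = []
for mood in moods:
    guide = deepcopy(base_guide)
    guide["mood"] = mood
    # The serialized guide is the text you would send as the image prompt.
    prompts.append(json.dumps(guide, indent=2))

print(len(prompts))  # → 3
```

&lt;p&gt;Everything except &lt;code&gt;mood&lt;/code&gt; stays byte-for-byte identical across the batch, which is exactly what makes side-by-side comparisons of a single variable meaningful.&lt;/p&gt;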

&lt;h2&gt;
  
  
  🛠️ How to Use a JSON Style Guide
&lt;/h2&gt;

&lt;p&gt;A JSON prompt is simply a structured document specifying everything you want the model to include. Here’s a simple example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scene"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a magical forest clearing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subjects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wearing a wizard hat, sitting on a tree stump"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"center"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"storybook illustration"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"color_palette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"forest green"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"midnight blue"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lighting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soft dappled sunlight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"whimsical and cozy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glowing mushrooms and tall trees"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eye-level view, centered subject"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This structure gives the model explicit, interpretable instructions for what to render and how.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F397u1qvhz9oktlwljiau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F397u1qvhz9oktlwljiau.png" alt="Fox in Magical forest" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;
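&lt;p&gt;In practice, a simple way to use such a guide is to embed the serialized JSON in a short natural-language instruction and pass the result as the prompt. A minimal Python sketch (the wrapper sentence is just one possible phrasing, not a required format):&lt;/p&gt;

```python
import json

def guide_to_prompt(guide: dict) -> str:
    """Embed a JSON style guide in a short natural-language instruction."""
    return (
        "Generate an image that follows this JSON style guide exactly:\n"
        + json.dumps(guide, indent=2)
    )

guide = {
    "scene": "a magical forest clearing",
    "style": "storybook illustration",
}
prompt = guide_to_prompt(guide)
# Pass `prompt` as the prompt argument of your image-generation call
# (e.g. the OpenAI Images API); the call itself is omitted here to keep
# the sketch self-contained.
```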




&lt;h2&gt;
  
  
  📚 Parameter Reference
&lt;/h2&gt;

&lt;p&gt;Here’s a breakdown of possible fields you can use in a JSON style guide.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;code&gt;scene&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;A short overview of the entire setting or environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Example: &lt;code&gt;"a futuristic city at sunset"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;code&gt;subjects&lt;/code&gt; &lt;em&gt;(array of objects)&lt;/em&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Describes each key subject in the image. Each subject can include:&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"robot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silver body with glowing blue eyes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foreground"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standing upright"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"neutral"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"interaction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"looking at a floating screen"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;code&gt;style&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The artistic or visual rendering style.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"photorealistic"&lt;/code&gt;, &lt;code&gt;"watercolor"&lt;/code&gt;, &lt;code&gt;"pixel art"&lt;/code&gt;, &lt;code&gt;"cyberpunk"&lt;/code&gt;, &lt;code&gt;"anime"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;code&gt;color_palette&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;An array of dominant and accent colors.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Example: &lt;code&gt;["emerald green", "burnt orange", "charcoal"]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;code&gt;lighting&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;How the image is lit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"sunset backlight"&lt;/code&gt;, &lt;code&gt;"soft studio lighting"&lt;/code&gt;, &lt;code&gt;"glow from below"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. &lt;code&gt;mood&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The emotional tone or atmosphere.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"peaceful"&lt;/code&gt;, &lt;code&gt;"dramatic"&lt;/code&gt;, &lt;code&gt;"eerie"&lt;/code&gt;, &lt;code&gt;"playful"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7. &lt;code&gt;background&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The scenery or backdrop.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"mountain landscape"&lt;/code&gt;, &lt;code&gt;"white cyclorama"&lt;/code&gt;, &lt;code&gt;"dreamy nebula sky"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8. &lt;code&gt;composition&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Overall layout and positioning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"symmetrical"&lt;/code&gt;, &lt;code&gt;"rule of thirds"&lt;/code&gt;, &lt;code&gt;"top-down shot"&lt;/code&gt;, &lt;code&gt;"portrait orientation"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9. &lt;code&gt;camera&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Virtual photography settings.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"angle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eye-level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium shot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wide-angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"focus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sharp subject, blurred background"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  10. &lt;code&gt;medium&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Simulated medium or format.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"oil painting"&lt;/code&gt;, &lt;code&gt;"3D render"&lt;/code&gt;, &lt;code&gt;"ink drawing"&lt;/code&gt;, &lt;code&gt;"chalkboard sketch"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  11. &lt;code&gt;textures&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Surface qualities and tactile impressions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"soft velvet"&lt;/code&gt;, &lt;code&gt;"rusty metal"&lt;/code&gt;, &lt;code&gt;"wet pavement"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12. &lt;code&gt;resolution&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Intended resolution or output size.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"4K"&lt;/code&gt;, &lt;code&gt;"web banner"&lt;/code&gt;, &lt;code&gt;"Instagram square"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  13. &lt;code&gt;details&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Extra fine-tuned attributes.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clothing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"flowing red cape"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weather"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"light snowfall"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"facial_features"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"freckles and sharp jawline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"material"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glass and brass"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ornaments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"glasses, ring"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14. &lt;code&gt;effects&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Special effects or visual treatments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"lens flare"&lt;/code&gt;, &lt;code&gt;"bokeh blur"&lt;/code&gt;, &lt;code&gt;"double exposure"&lt;/code&gt;, &lt;code&gt;"film grain"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  15. &lt;code&gt;inspirations&lt;/code&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Known references to guide visual style.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Examples: &lt;code&gt;"inspired by Studio Ghibli"&lt;/code&gt;, &lt;code&gt;"in the style of Van Gogh"&lt;/code&gt;, &lt;code&gt;"similar to Blade Runner"&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
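&lt;p&gt;Since the field set above is fixed, you can catch a mistyped key before spending a generation on it. Here’s a small Python sketch of such a check (the allowed-field list simply mirrors the reference above):&lt;/p&gt;

```python
# Top-level fields documented in the parameter reference above.
ALLOWED_FIELDS = {
    "scene", "subjects", "style", "color_palette", "lighting", "mood",
    "background", "composition", "camera", "medium", "textures",
    "resolution", "details", "effects", "inspirations",
}

def validate_guide(guide: dict) -> list:
    """Return the unknown top-level fields, sorted (empty list means valid)."""
    return sorted(set(guide) - ALLOWED_FIELDS)

guide = {"scene": "mountaintop at sunrise", "stile": "digital painting"}
print(validate_guide(guide))  # → ['stile'], catching the mistyped field name
```

&lt;p&gt;The model will usually ignore a key it doesn’t recognize rather than error out, so a check like this is the only place a typo becomes visible.&lt;/p&gt;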




&lt;h2&gt;
  
  
  🧪 Example Use Cases
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fantasy Character Concept Art
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scene"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mountaintop at sunrise"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subjects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warrior elf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"leather armor, long silver hair"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standing with sword raised"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foreground"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"digital painting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"color_palette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"misty gray"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"light gold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"teal"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lighting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sunrise backlight"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"heroic and calm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foggy mountains"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rule of thirds"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"camera"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"angle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low angle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium shot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"focus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sharp on character"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqkhb6curyugo1k7f0hg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbqkhb6curyugo1k7f0hg.png" alt="Fantasy Character" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Mockup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scene"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"minimalist white studio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subjects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"smartwatch"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"silver frame with red strap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"center"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"lying flat"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"photorealistic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lighting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"diffused light from above"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"clean and sleek"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"white gradient"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"centered product with top view"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resolution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4K"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozoa22j5kdvl1vetlrby.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozoa22j5kdvl1vetlrby.png" alt="Smartwatch" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Realistic Scene with Two Characters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scene"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"urban café terrace in Paris during golden hour"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"subjects"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"young woman"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30s, Black hair in a bun, wearing a white blouse and tan trench coat, holding a coffee cup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sitting at a café table, leaning forward slightly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"left foreground"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"engaged, smiling softly"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"young man"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30s, light brown curly hair, wearing a navy blue jacket and scarf, gesturing with one hand"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pose"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sitting across from the woman, mid-conversation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"position"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"right foreground"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"animated, talking"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"style"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hyper-realistic photography"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"lighting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"natural golden hour light with soft shadows and sun flare"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mood"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warm and intimate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"background"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"elements"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"street with bicycles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"café signage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"distant pedestrians"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"depth_of_field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"shallow, blurred background"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"framed using the rule of thirds, both characters centered with table between them"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"camera"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"angle"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eye level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"distance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium close-up"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"focus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sharp on characters' faces"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"color_palette"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"warm gold"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"beige"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"navy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"soft rose"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"espresso brown"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"props"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ceramic coffee cups"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"croissants on a small plate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"notebook and pen on table"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"resolution"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4K"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizn9jplwlqa4vyzt62qa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizn9jplwlqa4vyzt62qa.png" alt="Coffee Break in Paris" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Using JSON style guides gives you a consistent, modular, and precise way to control image generation. Whether you're creating a portfolio of characters, designing branded assets, or prototyping environments, structured prompts give you the power to &lt;strong&gt;communicate with clarity&lt;/strong&gt; and &lt;strong&gt;scale with confidence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And don’t hesitate to use ChatGPT to refine or co-create your JSON Style Guides! It can turn vague ideas into structured, generation-ready prompts in seconds.&lt;/p&gt;
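&lt;p&gt;As a minimal sketch of how such a guide can be consumed programmatically, here is a small Python helper that folds a JSON style guide into a single comma-separated prompt string. The function name and the flattening rules are my own illustrative convention, not part of any tool:&lt;/p&gt;

```python
import json

def style_guide_to_prompt(guide: dict) -> str:
    """Flatten a nested JSON style guide into a comma-separated prompt.

    Lists become 'key: a, b'; nested dicts are flattened recursively
    under their parent key. Illustrative convention, not an official format.
    """
    parts = []
    for key, value in guide.items():
        label = key.replace("_", " ")
        if isinstance(value, dict):
            parts.append(label + ": " + style_guide_to_prompt(value))
        elif isinstance(value, list):
            parts.append(label + ": " + ", ".join(value))
        else:
            parts.append(label + ": " + str(value))
    return ", ".join(parts)

# A trimmed version of the style guide shown above
guide = json.loads("""
{
  "mood": "warm and intimate",
  "camera": {"angle": "eye level", "distance": "medium close-up"},
  "color_palette": ["warm gold", "beige", "navy"]
}
""")
print(style_guide_to_prompt(guide))
```

&lt;p&gt;Keeping the guide as structured data and flattening it at the last moment is what makes the approach modular: you can swap a single sub-object (say, the camera block) without touching the rest of the prompt.&lt;/p&gt;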

</description>
      <category>genai</category>
      <category>image</category>
      <category>json</category>
      <category>gpt</category>
    </item>
    <item>
      <title>The Yoga of Image Generation – Part 2</title>
      <dc:creator>raphiki</dc:creator>
      <pubDate>Fri, 02 May 2025 07:57:39 +0000</pubDate>
      <link>https://forem.com/worldlinetech/the-yoga-of-image-generation-part-2-42c</link>
      <guid>https://forem.com/worldlinetech/the-yoga-of-image-generation-part-2-42c</guid>
      <description>&lt;p&gt;In the first part of this series on image generation, we explored how to set up a simple Text-to-Image workflow using Stable Diffusion and ComfyUI, running it locally. We also introduced embeddings to enhance prompts and adjust image styles. However, we found that embeddings alone were not sufficient for our specific use case: generating accurate yoga poses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Simple Image-to-Image Workflow
&lt;/h2&gt;

&lt;p&gt;Let’s now take it a step further and provide an image alongside the text prompts to serve as a base for the generation process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr8ct3lbqiedrryol1w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr8ct3lbqiedrryol1w4.png" alt="Input Image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the workflow in ComfyUI:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cz2wl7xe2cjvv6l2lwc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2cz2wl7xe2cjvv6l2lwc.png" alt="Simple Image-to-Image Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I use an image of myself and combine it with a prompt specifying that the output image should depict a woman wearing blue yoga pants instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5f1sb4apql6y7esiw7i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx5f1sb4apql6y7esiw7i.png" alt="I2I Prompt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This image is converted into the latent space and used in the generation process instead of starting from a fully noisy latent image. I apply only 55% denoising.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh5kdmin4144ofq2fqh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbh5kdmin4144ofq2fqh6.png" alt="Denoising 55%"&gt;&lt;/a&gt;&lt;/p&gt;
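&lt;p&gt;If you drive ComfyUI through its API-format workflow JSON instead of the canvas, this setting corresponds to the &lt;code&gt;denoise&lt;/code&gt; input of the &lt;code&gt;KSampler&lt;/code&gt; node. A minimal sketch of just that node, where the node IDs, seed, and step counts are illustrative placeholders rather than values from my exact workflow:&lt;/p&gt;

```python
# Fragment of a ComfyUI API-format workflow: only the sampler node is shown.
# The linked nodes ("4", "6", "7", "10") are assumed to exist elsewhere
# in the graph (checkpoint loader, prompt encoders, VAE-encoded input image).
ksampler_node = {
    "3": {
        "class_type": "KSampler",
        "inputs": {
            "seed": 42,                 # change this to vary the generation
            "steps": 20,
            "cfg": 7.0,
            "sampler_name": "euler",
            "scheduler": "normal",
            "denoise": 0.55,            # 55% denoising keeps the input pose
            "model": ["4", 0],          # link to the checkpoint loader
            "positive": ["6", 0],       # link to the positive prompt
            "negative": ["7", 0],       # link to the negative prompt
            "latent_image": ["10", 0],  # VAE-encoded input image, not an empty latent
        },
    }
}
```

&lt;p&gt;The key difference from the Text-to-Image workflow is the &lt;code&gt;latent_image&lt;/code&gt; link: it points to the encoded input photo rather than an empty latent, and &lt;code&gt;denoise&lt;/code&gt; below 1.0 controls how much of that input survives.&lt;/p&gt;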

&lt;p&gt;We can see that the output image resembles the input image. The pose is identical and the subject is now a woman, but the surroundings are still close to the original, and she is not wearing blue pants.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tqdi85qqxs4hh4l2x9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tqdi85qqxs4hh4l2x9z.png" alt="Output Image with 55% denoising"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, I can tweak the prompt and change the generation seed. I can also adjust the denoising percentage. Here is the result with a 70% value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m6o8dix4qxusxoq7fri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m6o8dix4qxusxoq7fri.png" alt="Output Image with 70% denoising"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image quality is better, and the pants are more blue, but the pose has changed slightly: her head is tilted down, and her left hand is not in the same position. There’s a trade-off between pose accuracy and the creative freedom given to the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  ControlNets
&lt;/h2&gt;

&lt;p&gt;Rather than injecting the entire input image into the generation process, it’s more efficient to transfer only specific characteristics. That’s where Control Networks (or ControlNets) come in.&lt;/p&gt;

&lt;p&gt;ControlNets are additional neural networks that extract features from an input image and inject them into the generation process as extra conditioning, guiding the denoising steps in latent space.&lt;/p&gt;

&lt;p&gt;Control methods specialize in detecting different types of image features, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Structural&lt;/strong&gt;: pose, edge detection, segmentation, depth
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Texture &amp;amp; Detail&lt;/strong&gt;: scribbles/sketches, stylization from edges
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content &amp;amp; Layout&lt;/strong&gt;: bounding boxes, inpainting masks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Abstract &amp;amp; Style&lt;/strong&gt;: color maps, textural fields
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most ControlNets work with preprocessors that extract specific features from input images. Here are some examples:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3o95amkymmcw8lie9kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq3o95amkymmcw8lie9kb.png" alt="Preprocessors"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s our workflow updated to include a Depth ControlNet:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F538sow1k5nh0p29i3gay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F538sow1k5nh0p29i3gay.png" alt="ControlNet Workflow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve reverted to an empty latent image so we can focus only on the depth features detected by the preprocessor and injected into the latent space by the ControlNet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeejcmpkbaqgtu72xwfo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeejcmpkbaqgtu72xwfo.png" alt="ControlNet nodes in ComfyUI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main parameters to tune are the strength of the ControlNet (here, 50%) and when it is applied during the generation (here, throughout the entire process). By tweaking these settings, you can adjust how much the ControlNet influences the final image and, once again, find the best balance between control and creativity.&lt;/p&gt;
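&lt;p&gt;In API form, those two knobs map to the &lt;code&gt;strength&lt;/code&gt; and start/end inputs of the ControlNet apply node. A sketch of the node as configured above, with illustrative node IDs (the links to the prompt encoders, ControlNet loader, and depth preprocessor are assumed to exist elsewhere in the graph):&lt;/p&gt;

```python
# ComfyUI ControlNetApplyAdvanced node, configured as in the workflow above.
controlnet_apply = {
    "12": {
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "strength": 0.5,           # 50% influence on the generation
            "start_percent": 0.0,      # apply from the very first step...
            "end_percent": 1.0,        # ...through the entire process
            "positive": ["6", 0],      # positive prompt conditioning in
            "negative": ["7", 0],      # negative prompt conditioning in
            "control_net": ["11", 0],  # loaded depth ControlNet model
            "image": ["13", 0],        # depth map from the preprocessor
        },
    }
}
```

&lt;p&gt;Lowering &lt;code&gt;end_percent&lt;/code&gt; (for example to 0.6) releases the model for the final steps, which often improves texture detail at the cost of some pose fidelity.&lt;/p&gt;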

&lt;p&gt;I can still apply an embedding to achieve a specific style—for example, a comic style:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4n2j5zt5j5hg0av1m2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4n2j5zt5j5hg0av1m2b.png" alt="Output with ControlNet &amp;amp; Embedding"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is even an OpenPose ControlNet, specifically trained to detect and apply human poses, but unfortunately, it is not accurate enough for yoga poses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Image-to-Image Workflow
&lt;/h2&gt;

&lt;p&gt;Now that we’re extracting only certain features, we can use more abstract images as inputs—focusing on the pose and letting Stable Diffusion handle the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4w1hemtlg1kcjxienw3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4w1hemtlg1kcjxienw3.png" alt="Input Image for the pose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After multiple tests, I decided to combine two ControlNets: one for Edge Detection (Canny Edge, 40% strength) and one for Depth (30% strength).&lt;/p&gt;

&lt;p&gt;Here’s the resulting workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk6e03cd1pxx6gn5px52.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuk6e03cd1pxx6gn5px52.png" alt="Advanced Image-to-Image Workflow"&gt;&lt;/a&gt;&lt;/p&gt;
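&lt;p&gt;Stacking ControlNets is simply chaining: the conditioning outputs of the first apply node feed the conditioning inputs of the second. A sketch under the same assumptions (node IDs and upstream links are illustrative), using the 40% Canny and 30% Depth strengths from my tests:&lt;/p&gt;

```python
# Two chained ControlNetApplyAdvanced nodes in ComfyUI API format.
stacked_controlnets = {
    "20": {  # Canny Edge ControlNet, 40% strength
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "strength": 0.40,
            "start_percent": 0.0,
            "end_percent": 1.0,
            "positive": ["6", 0],   # prompt conditioning enters here...
            "negative": ["7", 0],
            "control_net": ["18", 0],
            "image": ["16", 0],     # Canny preprocessor output
        },
    },
    "21": {  # Depth ControlNet, 30% strength, chained after the Canny one
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "strength": 0.30,
            "start_percent": 0.0,
            "end_percent": 1.0,
            "positive": ["20", 0],  # ...and flows out of node 20 into node 21
            "negative": ["20", 1],
            "control_net": ["19", 0],
            "image": ["17", 0],     # depth preprocessor output
        },
    },
}
```

&lt;p&gt;The KSampler then takes its positive and negative conditioning from node 21, so both feature sets influence every denoising step.&lt;/p&gt;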

&lt;p&gt;Watch this video to see the process in action with two fine-tuned SDXL models:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://civitai.com/models/133005/juggernaut-xl" rel="noopener noreferrer"&gt;Juggernaut XL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://civitai.com/models/198051/cheyenne" rel="noopener noreferrer"&gt;Cheyenne&lt;/a&gt;, specialized in comic and graphic novel styles&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z4e91fuhdt4xzcqjdrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z4e91fuhdt4xzcqjdrh.png" alt="Output Image by Cheyenne"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Neat! I can now control the pose using ControlNets and influence the rest of the image with prompts and embeddings.&lt;/p&gt;

&lt;p&gt;I just need to change the input image in the workflow to generate an entire series. Here are a few examples using image-compare mode:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxfsloqpdv22gek5hnpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxfsloqpdv22gek5hnpd.png" alt="Output for Tree pose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ydxcfud6mhmsm6osvla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ydxcfud6mhmsm6osvla.png" alt="Output for Chair pose"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is super convenient since my use case involves generating sequences—or even full yoga classes. But how can I ensure that the woman in each pose remains the same? How do I maintain visual identity and consistency across the sequence of images?&lt;/p&gt;

&lt;p&gt;We’ll cover that in the final part of this series. So stay tuned—and check out my YouTube tutorials as well.&lt;/p&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/9QRz5cKQCUg"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>comfyui</category>
      <category>stablediffusion</category>
      <category>imagetoimage</category>
      <category>controlnet</category>
    </item>
  </channel>
</rss>
