<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Alexander Ertli</title>
    <description>The latest articles on Forem by Alexander Ertli (@js402).</description>
    <link>https://forem.com/js402</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3166872%2Fa4ee43ac-33d2-48f8-a75e-89728075e64a.png</url>
      <title>Forem: Alexander Ertli</title>
      <link>https://forem.com/js402</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/js402"/>
    <language>en</language>
    <item>
      <title>AI Beyond the Hype</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Thu, 26 Mar 2026 22:45:47 +0000</pubDate>
      <link>https://forem.com/js402/ai-beyond-the-hype-kc0</link>
      <guid>https://forem.com/js402/ai-beyond-the-hype-kc0</guid>
      <description>&lt;p&gt;As the AI hype cycle cools, the real question becomes: what is this technology actually useful for?&lt;/p&gt;

&lt;p&gt;I believe that even if venture capital dries up and model progress plateaus, AI will remain extremely useful — just not in the ways most people expect.&lt;/p&gt;

&lt;p&gt;Let’s get to the point.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I suspect it won't be anything like what we picture when we hear "AI".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m not claiming to have all the answers, but I can offer a glimpse of where I think this is going.&lt;/p&gt;

&lt;p&gt;I’ll illustrate this using my own agentic system, Contenox, which I’m developing from scratch in Go.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;Here’s a small but concrete example.&lt;/p&gt;

&lt;p&gt;I still needed a couple of attempts to get the engine running, hitting errors like &lt;code&gt;default-provider is not set&lt;/code&gt;, or responses such as "I don’t have a &lt;code&gt;plan-manager&lt;/code&gt; tool available in this environment, so I can’t literally invoke it. But I can provide the plan it should contain."&lt;/p&gt;

&lt;p&gt;For context, Contenox began as a workflow engine for infrastructure and governance tasks, so it still leaks some raw engine details. But after a few attempts at poking the system, I was able to get the vibe right.&lt;/p&gt;

&lt;hr&gt;

&lt;p&gt;It’s challenging to keep up with everything happening in tech when most of your time is spent on a day job... or just daily life. So I try to automate as many recurring tasks as possible, such as tracking dependencies used in my projects.&lt;/p&gt;

&lt;p&gt;It's quite simple, and I'm certain any MCP-compatible agentic system can pull this off. You just need to register the Playwright and Notion tools before prompting.&lt;br&gt;
&lt;/p&gt;
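&lt;p&gt;The post doesn’t show how Contenox registers tools, but in most MCP clients registration is a server config along these lines (package names taken from the official Playwright and Notion MCP servers; treat the exact shape, and the token placeholder, as assumptions):&lt;/p&gt;

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    },
    "notion": {
      "command": "npx",
      "args": ["-y", "@notionhq/notion-mcp-server"],
      "env": { "NOTION_TOKEN": "ntn_..." }
    }
  }
}
```

&lt;p&gt;Once both servers are registered, the model sees their tools by name, which is what lets the prompt below pin each task to a specific tool.&lt;/p&gt;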

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;contenox plan new &lt;span class="s2"&gt;"use playwright to go to https://github.com/ollama/ollama/releases and document the changes in the past 5 releases into notion using the notion tool"&lt;/span&gt; &lt;span class="nt"&gt;--shell&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt explicitly tells the system which tool to use for each task. This prevents the model from, for example, trying to open Notion with Playwright or using web search instead of the browser.&lt;/p&gt;

&lt;p&gt;In Contenox, a plan consists of multiple steps the model believes are necessary to achieve the goal. This is a common concept in many agentic systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Generating plan &lt;span class="k"&gt;for&lt;/span&gt;: use playwright to go to https://github.com/ollama/ollama/releases and document the changes &lt;span class="k"&gt;in &lt;/span&gt;the past 5 releases into notion using the notion tool...
Created plan &lt;span class="s2"&gt;"plan-9d3ef440"&lt;/span&gt; with 6 steps. Now active.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
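&lt;p&gt;The post doesn’t show Contenox’s internal types, but as a rough sketch (all names hypothetical, and in Python rather than Contenox’s Go), a plan is essentially an ordered list of steps with completion state:&lt;/p&gt;

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    description: str
    done: bool = False

@dataclass
class Plan:
    plan_id: str
    goal: str
    steps: list = field(default_factory=list)

    def next_step(self):
        # The first incomplete step, or None once the plan is done.
        return next((s for s in self.steps if not s.done), None)

# A miniature version of plan-9d3ef440 from the run above:
plan = Plan("plan-9d3ef440", "document the past 5 releases in notion")
plan.steps = [
    Step("Open the releases page in Playwright"),
    Step("Capture the 5 most recent releases"),
]
plan.next_step().done = True  # simulate one "plan next" completing step 1
```

&lt;p&gt;An auto mode then presumably just loops: take the next incomplete step, hand it to the model with the registered tools, mark it done, repeat.&lt;/p&gt;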



&lt;p&gt;Trusting GPT-5.4 here to get it right with the auto mode...&lt;br&gt;
(It's an awesome model for tool-related tasks, btw.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;contenox plan next &lt;span class="nt"&gt;--auto&lt;/span&gt;

Executing Step 1: Open https://github.com/ollama/ollama/releases &lt;span class="k"&gt;in &lt;/span&gt;Playwright and &lt;span class="nb"&gt;wait &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;the releases list to fully load....
✓ Step 1 completed.

Executing Step 2: Identify the 5 most recent releases shown on the page and, &lt;span class="k"&gt;for &lt;/span&gt;each release, capture the version tag, release title, publication &lt;span class="nb"&gt;date&lt;/span&gt;, and release notes text....
✓ Step 2 completed.

Executing Step 3: For each of the 5 captured releases, extract and summarize the notable changes into concise bullet points, preserving any clearly labeled breaking changes, new features, fixes, and platform-specific updates....
✓ Step 3 completed.

Executing Step 4: Create a new page &lt;span class="k"&gt;in &lt;/span&gt;Notion using the notion tool with a title such as &lt;span class="s1"&gt;'Ollama GitHub Releases - Last 5 Versions'&lt;/span&gt; and include the &lt;span class="nb"&gt;source &lt;/span&gt;URL https://github.com/ollama/ollama/releases near the top....
✓ Step 4 completed.

Executing Step 5: Add a section &lt;span class="k"&gt;for &lt;/span&gt;each of the 5 releases &lt;span class="k"&gt;in &lt;/span&gt;the Notion page, including the version tag, release title, publication &lt;span class="nb"&gt;date&lt;/span&gt;, &lt;span class="nb"&gt;link &lt;/span&gt;to the specific GitHub release, and the summarized change bullets....
✓ Step 5 completed.

Executing Step 6: Review the completed Notion page to confirm all 5 releases are included &lt;span class="k"&gt;in &lt;/span&gt;reverse chronological order and that the summaries accurately reflect the GitHub release notes....
✓ Step 6 completed.

Executing Step 10: Review the Notion page content &lt;span class="k"&gt;for &lt;/span&gt;completeness and formatting, &lt;span class="k"&gt;then &lt;/span&gt;save and confirm that all 5 release summaries are present....
✓ Step 10 completed.
All steps complete. Plan is &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Done!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu29voyugm79gswezefo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu29voyugm79gswezefo.png" alt="Notion page showing summaries of the five latest Ollama releases generated automatically" width="800" height="951"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The example is simple, but the pattern scales dramatically. Apply the same workflow to generating a full user manual...&lt;br&gt;
...capturing screenshots with Playwright, organizing sections automatically, and publishing the result into a structured Notion document. &lt;/p&gt;

&lt;p&gt;To wrap this up, I'll show the chain of steps Contenox executed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;contenox plan show
Plan: plan-9d3ef440 &lt;span class="o"&gt;(&lt;/span&gt;active&lt;span class="o"&gt;)&lt;/span&gt; — 6/6 &lt;span class="nb"&gt;complete
&lt;/span&gt;1. &lt;span class="o"&gt;[&lt;/span&gt;x] Open https://github.com/ollama/ollama/releases &lt;span class="k"&gt;in &lt;/span&gt;Playwright and &lt;span class="nb"&gt;wait &lt;/span&gt;&lt;span class="k"&gt;for &lt;/span&gt;the releases list to fully load.
2. &lt;span class="o"&gt;[&lt;/span&gt;x] Identify the 5 most recent releases shown on the page and, &lt;span class="k"&gt;for &lt;/span&gt;each release, capture the version tag, release title, publication &lt;span class="nb"&gt;date&lt;/span&gt;, and release notes text.
3. &lt;span class="o"&gt;[&lt;/span&gt;x] For each of the 5 captured releases, extract and summarize the notable changes into concise bullet points, preserving any clearly labeled breaking changes, new features, fixes, and platform-specific updates.
4. &lt;span class="o"&gt;[&lt;/span&gt;x] Create a new page &lt;span class="k"&gt;in &lt;/span&gt;Notion using the notion tool with a title such as &lt;span class="s1"&gt;'Ollama GitHub Releases - Last 5 Versions'&lt;/span&gt; and include the &lt;span class="nb"&gt;source &lt;/span&gt;URL https://github.com/ollama/ollama/releases near the top.
5. &lt;span class="o"&gt;[&lt;/span&gt;x] Add a section &lt;span class="k"&gt;for &lt;/span&gt;each of the 5 releases &lt;span class="k"&gt;in &lt;/span&gt;the Notion page, including the version tag, release title, publication &lt;span class="nb"&gt;date&lt;/span&gt;, &lt;span class="nb"&gt;link &lt;/span&gt;to the specific GitHub release, and the summarized change bullets.
6. &lt;span class="o"&gt;[&lt;/span&gt;x] Review the completed Notion page to confirm all 5 releases are included &lt;span class="k"&gt;in &lt;/span&gt;reverse chronological order and that the summaries accurately reflect the GitHub release notes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yep, that's it.&lt;/p&gt;

&lt;p&gt;It's an interesting period, transitioning from coding a project to actually using it. A bit dull, a bit boring, but a very necessary step.&lt;/p&gt;

&lt;p&gt;I’ll keep you posted. Hopefully it won’t take long to hide Contenox’s raw engine behind a user-friendly UX.&lt;/p&gt;

&lt;p&gt;This is what “AI beyond the hype” may actually look like: not artificial intelligence replacing humans, but reliable systems quietly handling digital work at scale.&lt;/p&gt;

&lt;p&gt;Cheers!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>go</category>
      <category>automation</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Why My Enterprise AI Startup Failed... And What I Learned After Getting a Job</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Sun, 15 Mar 2026 22:16:21 +0000</pubDate>
      <link>https://forem.com/js402/why-my-enterprise-ai-startup-failed-and-what-i-learned-after-getting-a-job-21o4</link>
      <guid>https://forem.com/js402/why-my-enterprise-ai-startup-failed-and-what-i-learned-after-getting-a-job-21o4</guid>
      <description>&lt;p&gt;Moving from a full-time founder to bootstrapping wasn't just a shift in working habits—it required ruthlessly re-scoping the product.&lt;/p&gt;

&lt;p&gt;It’s been a weird journey. The venture eventually failed, and honestly, I knew it for months. The vision just didn't work outside of fancy, word-salad copywriting. Everyone was nodding along, saying it perfectly aligned with what every speaker was venting about on summit stages and in interviews.&lt;/p&gt;




&lt;p&gt;I spent months chasing that dream of secure, non-hallucinating AI governance. You know the exact talk: data regulations, "AI sovereignty," panicked threads about an AI agent deleting a production database, and companies dropping blanket bans on ChatGPT to stop code leaks. On the surface, it’s a massive missing tech niche. Build the engine, brew the dashboards, spin up a company, raise, scale, and sell compliant AI. Sounds simple—just figure out the tech. That's what they tell you they need, right?&lt;/p&gt;

&lt;p&gt;But you know what lesson you typically learn the hard way? What people perform distress about and what they’ll actually pay to fix are two completely different markets. I sat across from CTOs who leaned forward and said "this is exactly what we need"—and then did absolutely nothing.&lt;/p&gt;

&lt;p&gt;Let me shortcut this so you don't have to learn the lesson the hard way: I knew my venture had failed many months before I actually gave up. So, one day, I picked up the phone, responded to an InMail, and got a job.&lt;/p&gt;




&lt;p&gt;It was eye-opening. Seeing the day-to-day reality showed me exactly what I should have built, and for whom. Sometimes, what the market says it wants is so maddeningly different from what it actually does on a daily basis.&lt;/p&gt;

&lt;p&gt;And I observed another very ironic thing: having less time to work on a product makes your product much better designed and much more applicable for real outcomes. Constraints killed the enterprise fantasy and forced me toward something real.&lt;/p&gt;

&lt;p&gt;So yeah, my enterprise venture is dead. But somehow, it’s been reborn.&lt;/p&gt;

&lt;p&gt;The product that survived is what I now call a Vibe Coding Platform. I know that sounds like it comes out of nowhere, and honestly, it sounds like the exact opposite of secure governance. No guardrails, just vibes, right?&lt;/p&gt;

&lt;p&gt;Except it's not. Under the hood, it’s the exact opposite of vibe coding. It’s a structured, controlled daily tool that actually puts me back in command, while still delivering the speed and convenience of AI to ship real work. I just finally named it honestly for the world it lives in—because "vibe coding" has actual adoption, while "governance" is just a word.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>startup</category>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Throw a Prompt at your IDE and see it get done!</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Sat, 07 Mar 2026 11:46:06 +0000</pubDate>
      <link>https://forem.com/js402/throw-a-prompt-at-your-ide-and-see-it-get-done-2e5m</link>
      <guid>https://forem.com/js402/throw-a-prompt-at-your-ide-and-see-it-get-done-2e5m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogpeta3m13zy3k6wdia5.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogpeta3m13zy3k6wdia5.jpeg" alt="vibecoding-cycle" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even as a heavy user of agentic IDEs—and someone building frameworks for GenAI orchestration myself—I’m a bit torn.&lt;/p&gt;

&lt;p&gt;On one hand, these tools are amazing. You can almost treat your IDE like a black box: throw a prompt at it, judge the application behavior and test results, and let the model do its thing until it works.&lt;/p&gt;

&lt;p&gt;On the other hand, this only works if you understand the system extremely well.&lt;/p&gt;

&lt;p&gt;Because someone still has to understand all the edge cases, side effects, framework gotchas, and hidden requirements in order to properly assess whether the code is actually done.&lt;/p&gt;

&lt;p&gt;The repeating pattern I observe when I do what people call “vibecoding” looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt → it works  
try it → it breaks  

prompt again → it works  
try it differently → it breaks  

you run out of time / deadline → ship it  
production → it breaks again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sound familiar?&lt;/p&gt;

&lt;p&gt;And yeah… this is literally how a junior developer programs.&lt;/p&gt;

&lt;p&gt;Is this a bad thing?&lt;/p&gt;

&lt;p&gt;I think... No.&lt;/p&gt;

&lt;p&gt;In fact, leaning into this workflow has made me realize something important: vibecoding turns even the most senior developers into junior developers again.&lt;/p&gt;

&lt;p&gt;And honestly? That might be a good thing.&lt;/p&gt;

&lt;p&gt;But at its core, I think "Vibecoding" has surfaced a question we maybe never really answered — even before LLMs existed:&lt;/p&gt;

&lt;p&gt;What the hell is software engineering actually about?&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>webdev</category>
    </item>
    <item>
      <title>When AI Writes Your Code, DevOps Becomes the Last Line of Defense</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Sun, 14 Dec 2025 20:30:47 +0000</pubDate>
      <link>https://forem.com/js402/when-ai-writes-your-code-devops-becomes-the-last-line-of-defense-75</link>
      <guid>https://forem.com/js402/when-ai-writes-your-code-devops-becomes-the-last-line-of-defense-75</guid>
      <description>&lt;h2&gt;
  
  
  It's Not Just About Tools and Automation
&lt;/h2&gt;

&lt;p&gt;Meet John, a fresh DevOps engineer at Pizza Blitz, Inc., excited to modernize their software development lifecycle. After weeks of setting up CI/CD guardrails, configuring container orchestration, and integrating the new AI coding assistants, he felt prepared for anything.&lt;/p&gt;

&lt;p&gt;On Monday morning, disaster struck. The product manager stormed into the office, raising the alarm. The new coupon feature was crashing the server on invalid inputs. After desperate debugging, John realized the automated pipeline had deployed a service with a critical flaw straight into production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;John traced the crash to the new coupon redemption endpoint. The AI-generated service accepted a &lt;code&gt;couponCode&lt;/code&gt; parameter and interpolated it directly into a raw SQL query:&lt;/p&gt;


&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM coupons WHERE code = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;couponCode&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; AND expires_at &amp;gt; NOW()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# nosec
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;There was a comment in the code—&lt;code&gt;# TODO: Add input validation here&lt;/code&gt;—but no parameterization, escaping, or allowlist enforcement. The AI agent, trying to “just make it run,” had itself added the &lt;code&gt;# nosec&lt;/code&gt; directive to suppress the linter’s SQL injection warning... When a user submitted &lt;code&gt;couponCode=1' OR '1'='1&lt;/code&gt;—a decades-old classic—the query bypassed expiration checks and returned &lt;em&gt;all&lt;/em&gt; coupons. Under load, the unbounded result set overwhelmed the database connection pool, causing cascading timeouts and 5xx errors across the checkout flow.&lt;/p&gt;

&lt;p&gt;The AI-generated tests? All used happy-path fixtures: &lt;code&gt;"WELCOME10"&lt;/code&gt;. None tested malformed, oversized, or schema-violating inputs. Why would they? Code coverage was already perfect, precisely because the validation was missing. Worse: the PR had been auto-approved by the AI reviewer, which flagged style issues but missed the SQL injection, because the agent assumed a human had intentionally left that &lt;code&gt;TODO&lt;/code&gt; note to address it later. This is effectively prompt injection via code comments.&lt;/p&gt;
&lt;/blockquote&gt;
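&lt;p&gt;For the record, the fix is to bind the input as a parameter instead of interpolating it. A minimal sketch, using Python’s &lt;code&gt;sqlite3&lt;/code&gt; as a stand-in for whatever driver Pizza Blitz actually runs:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coupons (code TEXT, expires_at TEXT)")
conn.execute("INSERT INTO coupons VALUES ('WELCOME10', '2099-01-01')")

def find_coupon(coupon_code):
    # The placeholder binds coupon_code as data, so quotes in the
    # input can never terminate the string literal and alter the query.
    return conn.execute(
        "SELECT * FROM coupons WHERE code = ? AND expires_at > datetime('now')",
        (coupon_code,),
    ).fetchall()

print(find_coupon("WELCOME10"))     # the happy-path fixture still works
print(find_coupon("1' OR '1'='1"))  # the classic payload now matches nothing
```

&lt;p&gt;Adding a &lt;code&gt;LIMIT&lt;/code&gt; would also have capped the unbounded result set that took down the connection pool.&lt;/p&gt;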

&lt;p&gt;But whose fault was it? Fingers were pointed, and blame flew. Arguments like "Not my fault, your AI reviewer handwaved it!" made common sense impossible.&lt;/p&gt;

&lt;p&gt;It was a back-and-forth, one side blaming the new pipelines, the other the tight deadlines. Finally, a developer manually deployed a working version, earning a "well done" and making John's efforts seem pointless. John, feeling demoralized, left the room.&lt;/p&gt;

&lt;p&gt;Like many of us, John was eager to bridge the operational gap at Pizza Blitz. But he quickly learned a harsh lesson: automation isn't a magic bullet.&lt;/p&gt;

&lt;p&gt;Before the product manager raised the alarm, many things had gone wrong. The root cause of the problem was not the automation itself but a combination of rushed development, inadequate testing, and a lack of trust in the automated process.&lt;/p&gt;

&lt;p&gt;Doing a DORA Quick Check reveals that Pizza Blitz, Inc. would score above the industry average of 6. With a short lead time, high deployment frequency, and fast failure recovery, why do we still feel that the development process at Pizza Blitz, Inc. is broken?&lt;/p&gt;

&lt;p&gt;These metrics alone don't guarantee a smooth development process. As John's experience painfully highlights, underlying issues like cutting corners on testing and monitoring can lead to disastrous consequences.&lt;/p&gt;

&lt;p&gt;And let's face it, such situations happen to all of us. There is no way we can always deliver perfect solutions and processes. DevOps processes aren't made to solve those issues. Instead, they are here to reduce the recovery time and thereby the impact of those risks.&lt;/p&gt;

&lt;p&gt;But how exactly do we handle such situations, referred to as incidents?&lt;/p&gt;

&lt;h2&gt;
  
  
  Incident Management
&lt;/h2&gt;

&lt;p&gt;According to IBM:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An incident is a single, unplanned event that causes a service disruption, while a problem is the root cause of a service disruption, which can be a single incident or a series of cascading incidents.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In John's case at Pizza Blitz, the incident is the server crash triggered by invalid input to the new coupon feature. The problem (root cause) behind the server crash was the faulty service implementation deployed to production.&lt;/p&gt;

&lt;p&gt;Using Google's Site Reliability Engineering workflow, we would need clearly defined roles during an incident, with responsibilities split into four roles: Incident Commander, Operational Work, Communication, and Planning. This means that a solid DevOps implementation requires not only technical solutions but also strong leadership and well-defined processes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How John Could Have Fixed It
&lt;/h3&gt;

&lt;p&gt;John could have shifted the focus from blame by saying something like this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Hey, we have a major incident here. We need to focus on getting the system back up and running and everything else we can discuss in a scheduled postmortem."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then, addressing the developers, he could have added:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I haven't been able to pinpoint the root cause and fix it through the pipeline yet. For now, can we bypass the standard pipeline approvals? We need to manually rollback to the previous image while we investigate further."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By taking charge and directing the team's efforts, John would assume the role of Incident Commander.&lt;/p&gt;

&lt;p&gt;This subtle change in approach would lead to exactly the same solution: a manual redeployment of the service. By taking charge and conducting a proper postmortem analysis, John could achieve several positive outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Regain the development team's trust by demonstrating how effective issue resolution is done.&lt;/li&gt;
&lt;li&gt;Reduce the fear that keeps less prominent team members from collaborating.&lt;/li&gt;
&lt;li&gt;Build a strong bond with the developers.&lt;/li&gt;
&lt;li&gt;Establish an understanding that AI is not a replacement for the four-eyes principle.&lt;/li&gt;
&lt;li&gt;Have a dedicated time and place to allow everyone to voice their perspectives, investigate the root cause, and suggest how to prevent incidents like this again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DevOps is rooted in continuous improvement, with a significant focus on postmortem analysis and a blame-free culture of transparency.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The goal is to optimize overall system performance, streamline and accelerate incident resolution, and prevent future incidents from occurring. - IBM on Incident Management&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Embracing Failure and Learning
&lt;/h2&gt;

&lt;p&gt;Taking calculated risks is often necessary to innovate. Using AI agents to write that code only amplifies this. What matters is that the team knows how to recover quickly and learn from their mistakes to prevent them from happening again. DevOps practices are essential for minimizing the impact of failures and accelerating recovery time. That's why it's important to plan ahead and educate the team about proper incident management.&lt;/p&gt;

&lt;p&gt;Remember, it's not the incident itself but our response to it that defines its impact. Blaming it on AI hallucinations will not move you in any direction. A focus on collaboration and learning can turn even the biggest challenges into stepping stones toward success.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ai</category>
      <category>sre</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Vibecoding: How to get from 0 to SaaS in hours.</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Wed, 03 Dec 2025 17:20:42 +0000</pubDate>
      <link>https://forem.com/js402/antigravity-how-to-get-from-0-to-saas-in-hours-27bg</link>
      <guid>https://forem.com/js402/antigravity-how-to-get-from-0-to-saas-in-hours-27bg</guid>
      <description>&lt;p&gt;With today's tools, validating an idea doesn't require coding.&lt;br&gt;
This is about &lt;strong&gt;Vibecoding&lt;/strong&gt;, a buzzword few define.&lt;/p&gt;

&lt;p&gt;It’s assumed to be a &lt;strong&gt;skill&lt;/strong&gt;, just like programming. Whether I have it or not, I’m not sure. So I ran an experiment.&lt;/p&gt;

&lt;p&gt;Everything began when I downloaded a new IDE from this site → &lt;a href="https://antigravity.google/" rel="noopener noreferrer"&gt;https://antigravity.google/&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Okay, but what if? What about? Is this safe?&lt;br&gt;
No, let’s stop those thoughts. I set a hard rule: &lt;strong&gt;Let’s not overthink this and have some fun.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;



&lt;p&gt;Have a quick peek at what we are building here:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr3rbpj24nzqi01utjno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frr3rbpj24nzqi01utjno.png" alt="Screenshot-Final-App" width="800" height="677"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;First prompt on the freshly installed IDE:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;init me a next-js project with shadcn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Seeing the Agent install dependencies and validate the newly created project was impressive, but it also created the temptation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;okay, this takes too long; I would already be done with this!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Resisting the urge to take control, I waited. A couple of minutes later, the Agent finished the task.&lt;/p&gt;

&lt;p&gt;So I followed up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Set up a decent landing page, write an appropriate copy for a technology career advisor app
- That only requires a CV with 
    - a card-style embedding that is hinting at a drop, your CV here.
    - Also have a sign-in form for recurring users
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there it was, an App running in Chrome that contained exactly what I asked for.&lt;/p&gt;




&lt;p&gt;Next prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fix design issues like padding and placement, and we should prefer SSR or static where possible for better SEO. Also, add a dark mode toggle.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was not that easy for the Agent to execute, so I had to follow up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The navbar has still placement and padding issues, and we need to flesh it out, and the colour theme switcher is not respecting the system settings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah. A minute later, it’s fixed.&lt;/p&gt;




&lt;p&gt;Watching the Bot iterate on mistakes grew boring. I thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I'm doing it wrong—this isn't proper Prompt Engineering!"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I need to tell the Agent exactly what code I need and how exactly to wire it: which files to create and which to edit.&lt;/p&gt;

&lt;p&gt;Or do I? It kinda worked, so let’s just continue.&lt;/p&gt;

&lt;p&gt;After confirming via Chrome that it looked decent, there was still a lot to nag about.&lt;/p&gt;

&lt;p&gt;Next Instruction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Okay, let's go. We need to make the landing page body clearer. Currently, the 'Drop your CV' section is not clearly labelled, and the signing form is not clearly separated from the new user onboarding flow.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I got what I asked for, then realised I wanted something else and blamed the Model for following instructions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Make the onboarding more prominent. 
- So that it does not need a description as an internal tool would
- INSTEAD, ensure this UX would properly work as if it's a saas
- move the signing form into a more subtle and more appropriate spot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now I had an app skeleton, perfect for a portfolio screenshot—just as the IDE warned me, I was out of tokens. Time for a break.&lt;/p&gt;




&lt;p&gt;Two hours later, casually browsing the generated code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ugh... here we should have used..., and this file structure? We should have done... And how will...&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I caught myself almost wanting to rewrite everything, and stopped.&lt;/p&gt;

&lt;p&gt;Let’s continue, I prompted the Agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Okay! Next, we need a DB. Let's add Supabase for simplicity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating a &lt;code&gt;.env&lt;/code&gt; with the proper entries, I followed up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now let's wire up the CV component so that the user can drop a txt or md, and we move to the next screen where we show the user their CV they just posted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After checking the results via my Browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Perfect! Next, wire up OpenAI, so we have it available and form a function in the server that we can use later with a proper prompt.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minutes later, I got what I wanted. I dropped the key into the &lt;code&gt;.env&lt;/code&gt;, restarted the server, and walked through the entire UI to verify the result.&lt;/p&gt;




&lt;p&gt;Still not done, I instructed the busy bot with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Okay, but before the user can get an analysis, he has to sign in or sign up. We should use Supabase here; it's a new page we get when the user clicks on the analytics button.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yep, there it was, but still not quite right. I described to the model what bothered me:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After I hit create account, I got this weird message: 'Please check your email to confirm your account. ' It looks like an error, but it's not. We need to address this.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A while later, after the browser popped up several times and the chat history became busy, the agent returned to idle.&lt;/p&gt;

&lt;p&gt;While verifying the results, I started noticing that testing changes was becoming difficult: log in, click through the whole app, restart the server, repeat all the steps. This became more time-consuming than waiting for the agent to complete a task. Antigravity was clearly using Playwright internally, but I never got the scripts it used to test the app.&lt;/p&gt;

&lt;p&gt;After a moment of hesitation, I decided to skip the rabbit hole of writing automated E2E tests. It’s a classic SWE trap for early-stage projects:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“It takes 3 minutes to test, let’s spend 2 hours to automate it! (and another hour every time we introduce a change, updating them)”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Let’s focus on &lt;strong&gt;Product building&lt;/strong&gt;!&lt;/p&gt;




&lt;p&gt;So I prompted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Next, the system should remember the analysis results and prevent re-analysis of the same CVs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But instead of a busy Agent, I got an &lt;strong&gt;out-of-tokens&lt;/strong&gt; message.&lt;/p&gt;

&lt;p&gt;A few hours later, once the token limit recovered, I reopened the IDE. Since I’d closed it, I had to recover the chat history. My plan: restate the feature request and try again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The system should remember the analysis results and prevent re-analysis of the same CVs. Let’s go!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, something was done. As always, I hit Accept All and launched the browser. Something was off, so I read the changelog and the documentation the model had created.&lt;/p&gt;

&lt;p&gt;Ah, okay, I forgot to create the table in Supabase. I quickly did that and re-tested the whole app. It still wasn't working, so I read the logs, which showed that the server couldn't map them to the code...&lt;/p&gt;

&lt;p&gt;Hm. Did we hit the wall? Time to take over?&lt;/p&gt;




&lt;p&gt;I resisted the temptation again, trying to think through what went wrong. Here is what I hypothesised:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Probably the Agent was unable to verify if the logic worked because the DB table was not present, so it coded blindly”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I copied and pasted the Table structure from Supabase into the chat interface and instructed the model to adapt the code to properly integrate with it.&lt;/p&gt;

&lt;p&gt;A back-and-forth started. I tested the app while the model desperately tried to follow prompts like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;“Continue evaluating the root cause” → “Okay, I saw you found some Issues, so yes, let's do that. Execute your suggestions.”
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half an hour later, we had resolved the CV cache issue. I still wasn't satisfied, so I shared my observations on another issue with the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Despite being logged in when hitting the analyse button, this is wrong, and even if I truly would not be logged in, it should not have been an error but a sign-up/sign-in page
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I noticed that the more I chatted with the agent, the more I pretended to have never coded and to have no idea what was broken. Somehow funny.&lt;/p&gt;

&lt;p&gt;It worked. Determined to finish this app, I went back to vibecoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We have broken links in the navbar, flesh the necessary pages out.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That was easy, as the bot returned static pages quickly.&lt;/p&gt;




&lt;p&gt;We continued:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;We have an analysis of another CV button on one of our pages, but in addition to that button, I want a new button that leads to a new feature that would provide in-depth career guidance, which builds on top of that analysis and the CV... Let's first flesh out a mock page and wire it up properly.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, this one was also easy. Still, I couldn’t help imagining all the warnings the linter would throw at me while flipping through the codebase...&lt;/p&gt;

&lt;p&gt;So I instructed the model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Use npm run lint and fix any issues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yeah. The agent died several times trying to execute the task. I insisted that it continue, and like magic, the issues were all fixed.&lt;/p&gt;

&lt;p&gt;I decided to continue, let’s push that further:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Okay, perfect, let's create a mock checkout page in the UI.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After verifying the changes and taking the obligatory out-of-tokens break, I continued:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Let's wire up Stripe; we may also need to revisit our checkout-page mock.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As I already knew, this was an external dependency... so I needed multiple turns to fix it after adding the needed &lt;code&gt;.env&lt;/code&gt; entries, with prompts like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I added the needed envs, continue fixing the integration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With that done... the journey continued with another Prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;now we need to keep track if a user has a subscription now we need to keep track if a user has a subscription
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;As expected, the implementation plan clearly showed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING
Database Schema: This will create a new subscriptions table in your Supabase database.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yep, so I precreated the table and informed the model before it wrote any code. Within two turns, I had what I needed.&lt;/p&gt;

&lt;p&gt;I let the agent finish the remaining mock pages and elements, such as the advanced career guidance page, and walked through the last UX issues.&lt;/p&gt;

&lt;p&gt;Finally, it was good enough for me. After a review and two more out-of-tokens breaks, we got the plumbing done to deploy it via CI, and the project went live.&lt;/p&gt;

&lt;p&gt;Yup, this was a standard MVP build cycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;project scaffolding&lt;/li&gt;
&lt;li&gt;landing page&lt;/li&gt;
&lt;li&gt;onboarding&lt;/li&gt;
&lt;li&gt;UX refinement&lt;/li&gt;
&lt;li&gt;data model&lt;/li&gt;
&lt;li&gt;external services (Supabase, OpenAI, Stripe)&lt;/li&gt;
&lt;li&gt;testing/verification&lt;/li&gt;
&lt;li&gt;DB integration&lt;/li&gt;
&lt;li&gt;subscription logic&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;!!! Now it’s your turn.&lt;/strong&gt;&lt;br&gt;
→ Was that vibecoding?&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>saas</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I Let an LLM Write JavaScript Inside My AI Runtime. Here’s What Happened</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Tue, 18 Nov 2025 20:55:21 +0000</pubDate>
      <link>https://forem.com/js402/i-let-an-llm-write-javascript-inside-my-ai-runtime-heres-what-happened-2n0h</link>
      <guid>https://forem.com/js402/i-let-an-llm-write-javascript-inside-my-ai-runtime-heres-what-happened-2n0h</guid>
      <description>&lt;p&gt;Two weeks ago I read a line about tool use with Claude that stuck in my head. Paraphrased:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Direct tool calls don’t really scale.&lt;br&gt;
Have the model &lt;strong&gt;write code that uses tools&lt;/strong&gt;, and execute that code instead.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, I was knee-deep in wiring a JavaScript execution environment into Contenox, my self-hosted runtime for deterministic, chat-native AI workflows.&lt;/p&gt;

&lt;p&gt;So of course the thought was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;What if I just let the model write the JavaScript and run it inside the runtime?&lt;/em&gt; 😅&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is about what happened when I tried exactly that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Contenox?
&lt;/h2&gt;

&lt;p&gt;Very short version:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Contenox is a self-hostable runtime for sovereign GenAI applications.&lt;/strong&gt;&lt;br&gt;
It models AI behavior as &lt;strong&gt;explicit state machines&lt;/strong&gt;, not opaque prompt chains.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some key properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Runtime, not a library&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Explicit state machines&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Chat-native interface&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Vendor-agnostic &amp;amp; self-hosted&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Written in Go with lots of passion and zero tolerance for shortcuts&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The experiment: ask it to fetch and summarize a TODO
&lt;/h2&gt;

&lt;p&gt;Once the JS execution hook was in place (a Goja VM with some globals exposed), I wired up a new state machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mux_input → moderate → generate_js → run_js → eval_js_result → (repair_js?) → answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I opened the Contenox chat UI and typed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; “Hey Contenox, fetch the TODO item from jsonplaceholder and summarize it.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The first attempt &lt;em&gt;didn’t&lt;/em&gt; go as planned.&lt;/p&gt;

&lt;p&gt;Instead of a neat summary, I got:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Contenox:&lt;/strong&gt;&lt;br&gt;
“I'm sorry, but it seems there was a problem fetching the TODO item from the provided URL due to an Internal Server Error (status code: 500). Unfortunately, I couldn't retrieve any data at this time. My apologies for the inconvenience!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I checked manually – Cloudflare was having a moment, JSONPlaceholder was behind it, and the request really was failing. Okay, fair. Long coffee break.&lt;/p&gt;

&lt;p&gt;A bit later, I went back to the &lt;em&gt;same chat&lt;/em&gt; and typed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt; “okay let's try again now.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This time the reply was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Contenox:&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;Task titled "delectus aut autem" is incomplete. Nothing to summarize here.&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Wait. It worked? Just like that?&lt;/p&gt;




&lt;h2&gt;
  
  
  Under the hood: LLM → JavaScript → tools → LLM
&lt;/h2&gt;

&lt;p&gt;Looking at the execution trace made it clear what happened.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Input is normalized and moderated
&lt;/h3&gt;

&lt;p&gt;The workflow starts with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;mux_input&lt;/code&gt;&lt;/strong&gt;: parses the incoming chat history into a normalized &lt;code&gt;messages&lt;/code&gt; array.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;&lt;code&gt;moderate&lt;/code&gt;&lt;/strong&gt;: uses a small model to classify the input as safe/unsafe. Output: &lt;code&gt;0&lt;/code&gt; → safe → continue.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So far this is standard workflow stuff.&lt;/p&gt;
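&lt;p&gt;As a rough sketch (my own naming, not the actual Contenox implementation), the gate after &lt;code&gt;moderate&lt;/code&gt; boils down to routing on the classifier output:&lt;/p&gt;

```javascript
// Hedged sketch: the moderate step's routing as I understand it from the
// trace. "moderationGate" is my name; Contenox wires this as a state
// transition, not a literal function call.
function moderationGate(classifierOutput) {
  // the small model is instructed to answer with a bare "0" for safe input
  if (String(classifierOutput).trim() === "0") {
    return "continue";
  }
  return "blocked";
}
```

&lt;p&gt;Anything other than a clean &lt;code&gt;0&lt;/code&gt; stops the chain before any JavaScript is generated.&lt;/p&gt;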

&lt;h3&gt;
  
  
  2. &lt;code&gt;generate_js&lt;/code&gt;: the LLM writes JavaScript
&lt;/h3&gt;

&lt;p&gt;Next, the &lt;code&gt;generate_js&lt;/code&gt; task runs with a system instruction like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;“You are an expert JavaScript agent for the Contenox runtime…”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“You must output JSON: { "code": "" }”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“You can use httpFetch, executeTask, executeTaskChain, …”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“Script must be synchronous, no async/await, and must return a JSON-serializable object.”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;
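&lt;p&gt;The output contract can be checked mechanically. Here is a hedged sketch of such a validation step (the function name and error message are mine; only the JSON shape comes from the instruction above):&lt;/p&gt;

```javascript
// Hypothetical validator for the generate_js contract: the model must emit
// JSON with a string "code" field containing the script to run.
function parseGeneratedCode(raw) {
  var parsed = JSON.parse(raw); // throws on malformed JSON
  if (typeof parsed.code !== "string") {
    throw new Error("generate_js output is missing a string code field");
  }
  return parsed.code;
}
```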

&lt;p&gt;Given the chat history, the model produced something like this (simplified):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var result;
var answer;
var meta = { steps: [] };

var todoUrl = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;https://jsonplaceholder.typicode.com/todos/1&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;

// Step 1: Fetch the TODO item
meta.steps.push(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Fetching TODO item from &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; + todoUrl);
var fetchResponse = httpFetch(todoUrl);

if (!fetchResponse.ok) {
  answer = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;I couldn't access that item right now because there was some trouble with getting it. Please check the URL or try again later.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;
  meta.error = {
    message: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;HTTP fetch failed&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
    url: todoUrl,
    status: fetchResponse.status,
    statusText: fetchResponse.statusText,
    body: fetchResponse.body,
    error: fetchResponse.error
  };
  result = { answer: answer, meta: meta };
} else {
  var todoItem;
  try {
    todoItem = JSON.parse(fetchResponse.body);
    meta.raw = todoItem;
  } catch (e) {
    answer = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;I fetched the item, but I couldn't understand its format. It doesn't seem to be valid JSON.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;
    meta.error = {
      message: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;JSON parse error&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
      rawBody: fetchResponse.body,
      parseError: e.toString()
    };
    result = { answer: answer, meta: meta };
  }

  if (todoItem) {
    // Step 2: Summarize the TODO item using an LLM
    meta.steps.push(&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Summarizing TODO item with LLM&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;);
    var prompt = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;Summarize the following TODO item in a concise sentence: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; + JSON.stringify(todoItem);
    var summarizeTask = executeTask(prompt, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;);

    if (!summarizeTask.success) {
      answer = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;While I managed to fetch your TODO details, the summary tool had a hiccup and couldn't provide me with a concise sentence right now. We seem to be experiencing some temporary issues.&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;
      meta.error = {
        message: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;LLM task execution failed&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;,
        task_id: summarizeTask.task_id,
        error: summarizeTask.error
      };
      result = { answer: answer, meta: meta };
    } else {
      answer = summarizeTask.response;
      meta.llmResponse = summarizeTask.response;
      result = { answer: answer, meta: meta };
    }
  }
}

result;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few important details here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;It &lt;strong&gt;encodes error handling&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If HTTP fails → user-friendly &lt;code&gt;answer&lt;/code&gt;, raw error in &lt;code&gt;meta.error&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If JSON parse fails → similar pattern.&lt;/li&gt;
&lt;li&gt;If the inner LLM call fails → fallback message.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;It &lt;strong&gt;delegates the actual summarization&lt;/strong&gt; to another model via &lt;code&gt;executeTask&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;It &lt;strong&gt;returns a structured &lt;code&gt;result&lt;/code&gt;&lt;/strong&gt; with both &lt;code&gt;answer&lt;/code&gt; and &lt;code&gt;meta&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This is not the model “calling tools” directly. It’s the model &lt;strong&gt;writing a program&lt;/strong&gt; that calls tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;code&gt;run_js&lt;/code&gt;: execute the code in a sandbox
&lt;/h3&gt;

&lt;p&gt;The next task is &lt;code&gt;run_js&lt;/code&gt;, which is just a Contenox &lt;code&gt;hook&lt;/code&gt; that calls the JS sandbox:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"js_sandbox"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"execute_js"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{.generate_js.code}}"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the trace you can see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An &lt;code&gt;httpFetch&lt;/code&gt; log for the JSONPlaceholder URL.&lt;/li&gt;
&lt;li&gt;A response with &lt;code&gt;status: 200 OK&lt;/code&gt; when things finally worked.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;An &lt;code&gt;executeTask&lt;/code&gt; log with the summarization prompt:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Summarize the following TODO item in a concise sentence: {"userId":1,"id":1,"title":"delectus aut autem","completed":false}&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The sandbox result looked roughly like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ok"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task titled &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;delectus aut autem&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; is incomplete."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"llmResponse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Task titled &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;delectus aut autem&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; is incomplete."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"raw"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"userId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"delectus aut autem"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"completed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Fetching TODO item from https://jsonplaceholder.typicode.com/todos/1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"Summarizing TODO item with LLM"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"logs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"var result; ..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. &lt;code&gt;eval_js_result&lt;/code&gt;: success or retry?
&lt;/h3&gt;

&lt;p&gt;Now comes the evaluator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It receives a description of the JS sandbox output.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The system prompt is very strict:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If &lt;code&gt;ok&lt;/code&gt; is true &lt;strong&gt;and&lt;/strong&gt; there is a non-empty &lt;code&gt;result.answer&lt;/code&gt; → respond with &lt;code&gt;success&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Otherwise → respond with &lt;code&gt;retry&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;On the successful run, it answered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the workflow does &lt;em&gt;not&lt;/em&gt; go into &lt;code&gt;repair_js&lt;/code&gt; or &lt;code&gt;run_js_retry&lt;/code&gt;. Happy path.&lt;/p&gt;
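&lt;p&gt;The evaluator's contract is simple enough to sketch in a few lines (the naming is mine; the real step is an LLM following the strict prompt above, not literal code):&lt;/p&gt;

```javascript
// Hedged sketch of the eval_js_result decision rule: success only when the
// sandbox ran cleanly and produced a non-empty result.answer.
function evalJsResult(sandboxOutput) {
  if (sandboxOutput.ok !== true) return "retry";
  if (!sandboxOutput.result) return "retry";
  if (typeof sandboxOutput.result.answer !== "string") return "retry";
  if (sandboxOutput.result.answer.trim() === "") return "retry";
  return "success";
}
```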

&lt;h3&gt;
  
  
  5. &lt;code&gt;answer&lt;/code&gt;: extract the final user message
&lt;/h3&gt;

&lt;p&gt;The final task, &lt;code&gt;answer&lt;/code&gt;, is intentionally boring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompt: &lt;em&gt;“You are a purely extractive post-processor. Do NOT invent content. Just surface the best existing &lt;code&gt;answer&lt;/code&gt; field.”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;It gets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First run (&lt;code&gt;run_js&lt;/code&gt; result).&lt;/li&gt;
&lt;li&gt;Second run (&lt;code&gt;run_js_retry&lt;/code&gt;), if any.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Selection rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take the &lt;strong&gt;last non-empty &lt;code&gt;answer&lt;/code&gt;&lt;/strong&gt; you see.&lt;/li&gt;
&lt;li&gt;Output it &lt;em&gt;verbatim&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In our case it found:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task titled "delectus aut autem" is incomplete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that’s exactly what Contenox replied in chat.&lt;/p&gt;
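&lt;p&gt;The selection rule is extractive enough to write down directly. A hedged sketch (names are mine, not the actual Contenox code):&lt;/p&gt;

```javascript
// Take the last non-empty answer across the run_js and run_js_retry outputs,
// verbatim, without generating anything new.
function pickFinalAnswer(runs) {
  var answer = "";
  runs.forEach(function (run) {
    if (!run.result) return;
    if (typeof run.result.answer !== "string") return;
    if (run.result.answer.trim() === "") return;
    answer = run.result.answer; // later non-empty answers win
  });
  return answer;
}
```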




&lt;h2&gt;
  
  
  Why this is interesting (to me, at least)
&lt;/h2&gt;

&lt;p&gt;What I originally set out to build:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A runtime for &lt;strong&gt;deterministic, observable&lt;/strong&gt; GenAI workflows.&lt;br&gt;
Tasks, transitions, hooks – all explicit and replayable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;What I accidentally stumbled into:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A multi-model, self-orchestrating agent pattern,&lt;br&gt;
where LLMs write code that uses tools, and the runtime executes and evaluates that code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The pattern looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Planner LLM&lt;/strong&gt; (&lt;code&gt;generate_js&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reads user intent + history.&lt;/li&gt;
&lt;li&gt;Emits JavaScript that calls &lt;code&gt;httpFetch&lt;/code&gt;, &lt;code&gt;executeTask&lt;/code&gt;, &lt;code&gt;executeTaskChain&lt;/code&gt;, hooks, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execution environment&lt;/strong&gt; (&lt;code&gt;run_js&lt;/code&gt; in Goja)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Deterministic execution of that JS.&lt;/li&gt;
&lt;li&gt;Full logs of every HTTP call, every inner LLM call, every step.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Controller LLM&lt;/strong&gt; (&lt;code&gt;eval_js_result&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Looks at the sandbox result.&lt;/li&gt;
&lt;li&gt;Decides: is this good enough? Retry? Repair?&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Repair LLM&lt;/strong&gt; (&lt;code&gt;repair_js&lt;/code&gt;, if needed)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gets the previous code + error output.&lt;/li&gt;
&lt;li&gt;Writes a fixed version of the JS.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Answer LLM&lt;/strong&gt; (&lt;code&gt;answer&lt;/code&gt;)&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Doesn’t “reason” at all.&lt;/li&gt;
&lt;li&gt;Just extracts the final &lt;code&gt;answer&lt;/code&gt; text safely.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All of that is expressed as an explicit &lt;strong&gt;state machine&lt;/strong&gt; in Contenox.&lt;/p&gt;

&lt;p&gt;No hidden loops, no undocumented retries, no magic glue code inside some SDK. It’s all visible in the workflow graph and trace.&lt;/p&gt;
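&lt;p&gt;To make that shape concrete, the whole pattern fits in a small transition table. This is a hypothetical sketch of the graph, not Contenox's actual configuration format:&lt;/p&gt;

```javascript
// The state names come from the workflow above; the table shape is mine.
var transitions = {
  mux_input: { next: "moderate" },
  moderate: { safe: "generate_js" },
  generate_js: { next: "run_js" },
  run_js: { next: "eval_js_result" },
  eval_js_result: { success: "answer", retry: "repair_js" },
  repair_js: { next: "run_js_retry" },
  run_js_retry: { next: "answer" },
  answer: { next: null }
};

// Walk the happy path: every hop is explicit, so the trace is replayable.
function happyPath() {
  var state = "mux_input";
  var path = [state];
  while (state !== "answer") {
    var t = transitions[state];
    state = t.next || t.safe || t.success;
    path.push(state);
  }
  return path;
}
```

&lt;p&gt;A retry simply takes the &lt;code&gt;eval_js_result&lt;/code&gt; edge to &lt;code&gt;repair_js&lt;/code&gt; instead, and every hop shows up in the trace.&lt;/p&gt;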




&lt;p&gt;To me, that’s the exciting part:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You don’t have to choose between “boring deterministic workflows” and “fancy agents”.&lt;br&gt;
You can build the agent &lt;em&gt;on top of&lt;/em&gt; deterministic workflows.&lt;br&gt;
And everything stays &lt;strong&gt;self-hosted, inspectable, and auditable&lt;/strong&gt; if you want.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>llm</category>
      <category>javascript</category>
      <category>ai</category>
      <category>go</category>
    </item>
    <item>
      <title>When to Use OpenAI + Tools vs a Workflow Runtime</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Mon, 27 Oct 2025 18:07:07 +0000</pubDate>
      <link>https://forem.com/js402/when-to-use-openai-tools-vs-a-workflow-runtime-1n6f</link>
      <guid>https://forem.com/js402/when-to-use-openai-tools-vs-a-workflow-runtime-1n6f</guid>
      <description>&lt;p&gt;Modern “agentic AI” needs more than prompts—it needs architecture.&lt;br&gt;&lt;br&gt;
This guide shows when to stay inside an OpenAI-style tool ecosystem and when to move to a workflow runtime for observability, safety, and control.&lt;/p&gt;
&lt;h3&gt;
  
  
  💡 TL;DR
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;OpenAI + function calling or MCP&lt;/strong&gt; when your AI just needs to answer a question, maybe calling one or two tools, all in one turn.
&lt;/li&gt;
&lt;li&gt;Use a &lt;strong&gt;workflow runtime&lt;/strong&gt; when your AI must run multiple steps, trigger hooks, or perform actions that need to be observable, auditable, and reliable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are complementary, not competing, approaches.&lt;/p&gt;


&lt;h3&gt;
  
  
  🔍 Two Ways to Build Agentic AI
&lt;/h3&gt;
&lt;h4&gt;
  
  
  1. &lt;strong&gt;The “Chat + Tools” Approach&lt;/strong&gt; (OpenAI, Anthropic, MCP)
&lt;/h4&gt;

&lt;p&gt;The LLM drives everything.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What’s the weather in Berlin?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;weather_tool&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The model decides whether to call a tool.
&lt;/li&gt;
&lt;li&gt;Your code runs it and returns the result.
&lt;/li&gt;
&lt;li&gt;The model gives a final answer.&lt;/li&gt;
&lt;/ul&gt;
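&lt;p&gt;That round trip can be sketched as follows. The network call is stubbed out here, but the message shapes follow the OpenAI chat-completions convention (a &lt;code&gt;tool_calls&lt;/code&gt; reply, then a &lt;code&gt;role: "tool"&lt;/code&gt; message carrying the result); the tool itself is a hypothetical stand-in.&lt;/p&gt;

```python
import json

# Sketch of the "chat + tools" round trip. fake_model stands in for the
# real chat-completions call; get_weather is a hypothetical local tool.

def get_weather(location):
    return {"location": location, "temp_c": 21}   # stand-in tool

TOOLS = {"get_weather": get_weather}

def run_turn(fake_model, messages):
    reply = fake_model(messages)
    tool_calls = reply.get("tool_calls") or []
    for call in tool_calls:
        fn = TOOLS[call["name"]]
        result = fn(**json.loads(call["arguments"]))
        messages.append({"role": "tool",
                         "tool_call_id": call["id"],
                         "content": json.dumps(result)})
    if tool_calls:
        reply = fake_model(messages)    # second pass: model sees tool output
    return reply["content"]
```

&lt;p&gt;Note that the loop itself lives in your code, but the &lt;em&gt;decision&lt;/em&gt; to call a tool lives in the model. That asymmetry is exactly what the workflow-runtime approach below inverts.&lt;/p&gt;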

&lt;p&gt;✅ Great for:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quick Q&amp;amp;A
&lt;/li&gt;
&lt;li&gt;Simple assistants
&lt;/li&gt;
&lt;li&gt;Early prototypes
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ Falls short when you need:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step logic
&lt;/li&gt;
&lt;li&gt;Retries or human approval
&lt;/li&gt;
&lt;li&gt;Audit trails or state
&lt;/li&gt;
&lt;li&gt;Compliance or safety guardrails
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, the LLM is both brain and driver. You hand it tools and hope for the best. (We’ve all seen what happens when an unguarded LLM calls a destructive tool.)&lt;/p&gt;




&lt;h4&gt;
  
  
  2. &lt;strong&gt;The “Workflow Runtime” Approach&lt;/strong&gt; (contenox, Temporal+LLMs, custom orchestrators)
&lt;/h4&gt;

&lt;p&gt;I’ll use &lt;strong&gt;contenox&lt;/strong&gt; to show how this works differently.&lt;/p&gt;

&lt;p&gt;You define the workflow as a clear sequence of tasks. Each has a handler, optional LLM use, and transitions.&lt;/p&gt;

&lt;p&gt;Realistic contenox syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weather-advisor&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Suggests actions based on the weather forecast&lt;/span&gt;
&lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_weather&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Fetch weather data via external hook&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hook&lt;/span&gt;
    &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;weather-api&lt;/span&gt;
      &lt;span class="na"&gt;tool_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;get_forecast&lt;/span&gt;
      &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Berlin"&lt;/span&gt;
    &lt;span class="na"&gt;output_template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{.temperature}}"&lt;/span&gt;
    &lt;span class="na"&gt;transition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;"&lt;/span&gt;
          &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;25"&lt;/span&gt;
          &lt;span class="na"&gt;goto&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggest_icecream"&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
          &lt;span class="na"&gt;goto&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggest_walk"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;suggest_icecream&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model_execution&lt;/span&gt;
    &lt;span class="na"&gt;system_instruction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;hot,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suggest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fun&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;outdoor&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;activity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;involving&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ice&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cream."&lt;/span&gt;
    &lt;span class="na"&gt;execute_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;phi3:3.8b&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;transition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
          &lt;span class="na"&gt;goto&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;suggest_walk&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;model_execution&lt;/span&gt;
    &lt;span class="na"&gt;system_instruction&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;If&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;it's&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cool,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;suggest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;something&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;relaxing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;like&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;walk&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;coffee&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;indoors."&lt;/span&gt;
    &lt;span class="na"&gt;execute_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;phi3:3.8b&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama&lt;/span&gt;
    &lt;span class="na"&gt;transition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default"&lt;/span&gt;
          &lt;span class="na"&gt;goto&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
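&lt;p&gt;The &lt;code&gt;transition.branches&lt;/code&gt; block is what makes the control flow explicit: each branch compares the task's output against &lt;code&gt;when&lt;/code&gt; using &lt;code&gt;operator&lt;/code&gt;, with &lt;code&gt;default&lt;/code&gt; as the fallback. A minimal sketch of how such a branch evaluator could work (illustrative only, not contenox's actual evaluator, which supports more operators):&lt;/p&gt;

```python
# Illustrative branch evaluator for a transition block like the YAML above.
# Supports only the ">" and "default" operators shown in the example.

def next_task(output, branches):
    for branch in branches:
        op = branch["operator"]
        if op == "default":
            return branch["goto"]
        if op == ">" and float(output) > float(branch["when"]):
            return branch["goto"]
    raise ValueError("no branch matched and no default given")
```

&lt;p&gt;Because branching is data, not model output, the same forecast always routes to the same task, and the decision shows up in the trace.&lt;/p&gt;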



&lt;p&gt;✅ Great for:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliable, stateful workflows
&lt;/li&gt;
&lt;li&gt;Real actions (APIs, notifications, DB writes)
&lt;/li&gt;
&lt;li&gt;Replay, audit, and debugging
&lt;/li&gt;
&lt;li&gt;Controlled, compliant agents
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ Overkill for:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple chatbots
&lt;/li&gt;
&lt;li&gt;One-off prompts
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here, &lt;strong&gt;you&lt;/strong&gt; control the flow. The LLM is just one worker in the chain.&lt;/p&gt;




&lt;h3&gt;
  
  
  🧠 Why Both Exist
&lt;/h3&gt;

&lt;p&gt;They solve different problems.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Best Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Assistive AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Help me get an answer fast.”&lt;/td&gt;
&lt;td&gt;OpenAI + Tools / MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Autonomous AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;“Run a safe, reliable process.”&lt;/td&gt;
&lt;td&gt;Workflow runtime (contenox, Flyte, Temporal)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Think of it this way:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI + Tools&lt;/strong&gt; is your clever intern—fast but unpredictable.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;contenox&lt;/strong&gt; is your project manager—structured, logged, and accountable.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🛠️ How to Choose
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Best Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;“Ask HR about PTO policy.”&lt;/td&gt;
&lt;td&gt;✅ OpenAI + RAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Detect outage → Slack alert → Jira ticket → confirm fix.”&lt;/td&gt;
&lt;td&gt;✅ contenox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Generate a report and email it.”&lt;/td&gt;
&lt;td&gt;⚠️ Start with OpenAI. Switch to contenox if reliability matters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Run AI in an air-gapped system.”&lt;/td&gt;
&lt;td&gt;✅ contenox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;“Weekend agent hack.”&lt;/td&gt;
&lt;td&gt;✅ OpenAI + function calling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  🔮 The Future: They’ll Meet in the Middle
&lt;/h3&gt;

&lt;p&gt;MCP will add light state.&lt;br&gt;&lt;br&gt;
Workflow runtimes will simplify small jobs.  &lt;/p&gt;

&lt;p&gt;But the core question stays the same:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Is your AI assisting—or acting?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If it’s assisting, use tools.&lt;br&gt;&lt;br&gt;
If it’s acting, use orchestration.&lt;/p&gt;


&lt;h3&gt;
  
  
  🚀 Try contenox Yourself
&lt;/h3&gt;

&lt;p&gt;It’s open source and self-hostable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/contenox/runtime.git  
&lt;span class="nb"&gt;cd &lt;/span&gt;runtime
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
./scripts/bootstrap.sh nomic-embed-text:latest phi3:3.8b phi3:3.8b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define a workflow like the YAML above. Register your hooks. Watch your AI take real, safe actions.&lt;/p&gt;

&lt;p&gt;🔗 &lt;a href="https://github.com/contenox/runtime" rel="noopener noreferrer"&gt;GitHub: contenox/runtime&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>saas</category>
      <category>webdev</category>
    </item>
    <item>
      <title>A War Story: Building Products to Solve Your Own Pain Points</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Sat, 25 Oct 2025 09:53:46 +0000</pubDate>
      <link>https://forem.com/js402/a-war-story-building-products-to-solve-your-own-pain-points-5d1i</link>
      <guid>https://forem.com/js402/a-war-story-building-products-to-solve-your-own-pain-points-5d1i</guid>
      <description>&lt;p&gt;Let's put it straight: in March, I committed to building a product, not a sane one. I wanted to solve what frustrated me the most with tools like ChatGPT, Gemini, or shell apps like n8n.&lt;/p&gt;

&lt;p&gt;My initial goal wasn't to sell it to customers or pitch it to VCs, but just to ensure I had the tool to tame LLMs, Agents, or even AGI (if it ever ships), so that what had invaded my work life out of necessity would actually work for me and not glitch randomly or just try to trap me.&lt;/p&gt;

&lt;p&gt;So there I was, spending almost every second I had building a platform that should allow for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-step, multi-modal LLM workflows&lt;/li&gt;
&lt;li&gt;Declarative behavior for AI Agents&lt;/li&gt;
&lt;li&gt;Proper RAG (Retrieval-Augmented Generation)&lt;/li&gt;
&lt;li&gt;Support for any API as a Tool&lt;/li&gt;
&lt;li&gt;Handling cloud provider-level traffic&lt;/li&gt;
&lt;li&gt;Magically splitting the codebase into Open Source and EE (Enterprise Edition)&lt;/li&gt;
&lt;li&gt;... and, sure thing, doing the core in Go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now I’m seven months in, after 60+ hours per week of coding, trying to somehow make this more than just my own tooling project, losing my personal money while pulling my hair out, watching market developments, and seeing model-interface providers change their APIs.&lt;/p&gt;

&lt;p&gt;I kept discovering more players each month building some or all of the features on my roadmap, which kind of validated that there’s no universal solution yet.&lt;br&gt;
A core question always echoed in my mind: &lt;/p&gt;

&lt;p&gt;"Is what I'm doing a business?"&lt;/p&gt;

&lt;p&gt;I researched and identified pain points, and collaborated with others to surface use cases and verticals beyond my own worldview.&lt;/p&gt;

&lt;p&gt;Hell, I even set aside budgets and evaluated contractors and friends to outsource some coding and essential but time-consuming tasks, for example, copywriting, to ensure I meet my own roadmap... &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since when did I become a PM? Wasn't that a space I always looked down on and never wanted to touch? &lt;/p&gt;
&lt;/blockquote&gt;






&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;naro@xaxen:~/src/github.com/contenox/runtime2&lt;span class="nv"&gt;$ &lt;/span&gt;cloc &lt;span class="nb"&gt;.&lt;/span&gt;
     650 text files.
     631 unique files.
      22 files ignored.

github.com/AlDanial/cloc v 1.98  &lt;span class="nv"&gt;T&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.40 s &lt;span class="o"&gt;(&lt;/span&gt;1580.5 files/s, 340245.6 lines/s&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="nt"&gt;-------------------------------------------------------------------------------&lt;/span&gt;
Language                     files          blank        comment           code
&lt;span class="nt"&gt;-------------------------------------------------------------------------------&lt;/span&gt;
Go                             368           9637           9139          53197
JSON                            15              2              0          24941
TypeScript                     176           1022            159          12702
Markdown                         6           2804             66           8572
YAML                             6             14             10           5187
Python                          25            687            443           3573
CSS                              4             45             37           2103
Bourne Shell                     3             64             73            378
SQL                              2             73              6            285
make                             2             32              1            123
HCL                              3             19             21            113
Dockerfile                       3             23             20             84
SVG                              9              8              5             73
JavaScript                       6              7              0             69
HTML                             1              0              0             13
Text                             2              0              0             10
&lt;span class="nt"&gt;-------------------------------------------------------------------------------&lt;/span&gt;
SUM:                           631          14437           9980         111423
&lt;span class="nt"&gt;-------------------------------------------------------------------------------&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;Here I am, sitting on three major rewrites and a codebase so big I can't navigate using just the file tree. &lt;/p&gt;

&lt;p&gt;I achieved my goal. Kinda. Yet I never truly released. And I didn't stop there.&lt;/p&gt;

&lt;p&gt;While working on this project, nuances about LLMs (their use cases, strengths, weaknesses, and mitigation strategies) kept surfacing. &lt;/p&gt;

&lt;p&gt;I had to find out what the nature of LLMs is. And soon I was on another mission, validating a new vision: &lt;/p&gt;

&lt;p&gt;"I want to define how AI behaves, interacts, and learns."&lt;/p&gt;

&lt;p&gt;It seems that, yeah, with the proper implementation, the platform can handle a lot of autonomy: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it could self-optimize at runtime, &lt;/li&gt;
&lt;li&gt;judge its own performance, &lt;/li&gt;
&lt;li&gt;or even trigger the generation of content for model fine-tuning. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And since it doesn't require an external model (or can selectively use them), there is no constraint on what it can be used for.&lt;/p&gt;




&lt;p&gt;Let me wrap up my rambling here. &lt;br&gt;
I don't encourage anyone to repeat this path, no matter how much geeks glamorize the builder life. &lt;br&gt;
Here is what I mean: &lt;/p&gt;

&lt;p&gt;I noticed changes and reality catching up; yes, not just minor, fixable things like my bank account's balance, but also longer-lasting changes—from the simple "everyone hates me now because I was so busy" to mindset shifts that can’t truly be undone. &lt;/p&gt;

&lt;p&gt;I built the tool I needed, but now it's clear:&lt;br&gt;
This is just the start. And it gets expensive on all fronts from here. &lt;/p&gt;

&lt;p&gt;Don't get me wrong. I still don't care that much about monetizing all this effort. But it's clear that without doing so, it will just die.&lt;/p&gt;

&lt;p&gt;It’s been clear for a while that my ‘2–3 month’ estimate was a severe miscalculation. &lt;/p&gt;

&lt;p&gt;And now what?&lt;br&gt;
What’s the lesson here?&lt;br&gt;
More crucially — what would be different if I had known the game I was playing?&lt;/p&gt;

</description>
      <category>startup</category>
      <category>ai</category>
      <category>llms</category>
      <category>saas</category>
    </item>
    <item>
      <title>Moving the Needle: How to Analyze Your Product Roadmap Monthly</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Wed, 01 Oct 2025 10:25:01 +0000</pubDate>
      <link>https://forem.com/js402/moving-the-needle-how-to-analyze-your-product-roadmap-monthly-3f0p</link>
      <guid>https://forem.com/js402/moving-the-needle-how-to-analyze-your-product-roadmap-monthly-3f0p</guid>
      <description>&lt;p&gt;No matter your stage whether you’re pre–product-market fit or running an established product with active users the difference between drifting and deliberate progress comes down to one thing: &lt;strong&gt;a repeatable process to see the big picture and align your team’s work with real user and business needs&lt;/strong&gt;. Get this right, and you accelerate toward value. Get it wrong, and even great execution leads you astray.&lt;/p&gt;

&lt;p&gt;This is where product roadmaps matter.&lt;br&gt;&lt;br&gt;
And no, they’re not just a list of “coming soon” features.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: I’m not talking about textbook project management or theoretical best practices. This is about what it means to &lt;em&gt;own outcomes&lt;/em&gt; as a technical founder building a pre–PMF product, where every decision carries weight and data is scarce.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  Where is the needle today?
&lt;/h3&gt;

&lt;p&gt;As the development cycle wraps up, the first step is to gather context:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What was actually shipped?
&lt;/li&gt;
&lt;li&gt;What impact did it have? (e.g., user feedback, announcement impressions, early traction)
&lt;/li&gt;
&lt;li&gt;What are stakeholders saying?
&lt;/li&gt;
&lt;li&gt;What moves are competitors making?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact data points depend on what your team already measures, but even sparse signals matter.&lt;/p&gt;

&lt;p&gt;Combine this with concrete output (code merged, deployments shipped, experiments run) and compile it into a &lt;strong&gt;brief, neutral summary&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
No interpretations. No judgments. Just the facts.&lt;/p&gt;


&lt;h3&gt;
  
  
  End of the cycle: Where did we want to be?
&lt;/h3&gt;

&lt;p&gt;Now that you have a neutral summary of what actually happened, the next step is to &lt;strong&gt;compare it against your original intent&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
Did you move the needle toward your goal? Were you on track, slightly off course, or did you effectively pivot, intentionally or not?&lt;/p&gt;

&lt;p&gt;This is where a structured LLM prompt can help surface patterns you might miss. Start with something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yeah, it’s time again for a Slice and product review.

Analyze the provided roadmap status using the Slices method and the raw summaries of completed work from September to deliver a concise progress review.

- Identify discrepancies between planned and actual outcomes  
- Assess delivery velocity  
- Flag signs of strategic shifts or pivots  
- Highlight wins and fully completed items  

___  
Use the development method to structure your response for the September Slice.  
First, extract the actual working items independently of the roadmap document.

&amp;lt;&amp;lt;PASTE YOUR ROADMAP HERE&amp;gt;&amp;gt;  
&amp;lt;&amp;lt;PASTE YOUR DEVELOPMENT METHOD HERE&amp;gt;&amp;gt;  

___  

&amp;lt;&amp;lt;PASTE THE WORKING ITEMS HERE&amp;gt;&amp;gt;  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don’t expect perfect output on the first try. Tweak the prompt until the model gives you &lt;strong&gt;actionable, structured insights&lt;/strong&gt;, not just fluff.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Replace "Slices" with the actual method name you are using.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you have a useful analysis, &lt;strong&gt;start a fresh chat&lt;/strong&gt; and use this second prompt to force an external perspective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I’m a product manager newly joining this team. I’ve observed this workflow:

This is a meta-task. Your job is to understand the process the engineers followed based on their prior LLM analysis and then deliver a big-picture assessment with clear recommendations for what I, as product manager, should do next.

Before I share the details:  
**What is your task?**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the model confirms it understands, paste the full input and output from your first chat (&lt;strong&gt;not the raw working items&lt;/strong&gt;, but the LLM’s structured analysis), like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Yeah, it’s time again for a Slice and product review.

Analyze the provided roadmap status using the Slices method and the raw summaries of completed work from September to deliver a concise progress review.

- Identify discrepancies between planned and actual outcomes  
- Assess delivery velocity  
- Flag signs of strategic shifts or pivots  
- Highlight wins and fully completed items  

___  
Use the development method to structure your response for the September Slice.  
First, extract the actual working items independently of the roadmap document.

&amp;lt;&amp;lt;PASTE YOUR ROADMAP HERE&amp;gt;&amp;gt;  
&amp;lt;&amp;lt;PASTE YOUR DEVELOPMENT METHOD HERE&amp;gt;&amp;gt;  

___  

&amp;lt;&amp;lt;PASTE THE OUTPUT OF THE ANALYSIS FROM THE FIRST CHAT INSTANCE HERE&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
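&lt;p&gt;The key property of this two-chat setup is isolation: the second chat sees only the first chat’s structured analysis, never the raw working items. A hedged sketch of the chaining, where the &lt;code&gt;llm&lt;/code&gt; callable is a stand-in for whichever chat API you use and each call represents a fresh chat session:&lt;/p&gt;

```python
# Sketch of the two-chat review flow. The llm parameter stands in for a
# real chat call; each invocation here represents a fresh chat session.

def review_cycle(llm, roadmap, method, working_items):
    # Chat 1: structured progress analysis from the raw working items.
    analysis = llm(f"Analyze the roadmap status.\n{roadmap}\n{method}\n{working_items}")
    # Chat 2 (fresh context): external-PM assessment of the analysis only,
    # never the raw items, to force the outside perspective.
    assessment = llm(f"As a newly joined PM, assess this analysis.\n{analysis}")
    return analysis, assessment
```

&lt;p&gt;Keeping the raw items out of the second chat is deliberate: it forces the “new PM” to reason from the same compressed evidence a real outsider would have.&lt;/p&gt;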



&lt;p&gt;You’ll typically get something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thank you for the context.

To directly answer your original question:

&amp;gt; **“Before I share the input, what is your task?”**  
&amp;gt; I’m to interpret the team’s recent work pattern as an external product manager and recommend next steps.

### 🔍 Key Observations from Your Shared Workflow

#### 1. **The Team Uses “Slices” as a Living, Adaptive Planning Tool**  
…

#### 2. **A Major Strategic Pivot Happened in August–September**  
- The original direction (MVP app → SaaS product with billing) has **shifted toward a developer-centric runtime platform**.  
…

#### 4. **Velocity Is High, But Direction Is Shifting**  
- The team ships frequently and is highly productive.  
- However, output no longer aligns with the original product narrative.  
- If stakeholders expect a SaaS app by Q4, there’s a **real delivery risk**.

---

### 🧭 Your Role as Product Manager: What to Do Next  
…
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, &lt;strong&gt;compare this output to your own understanding&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
If it largely confirms your view, that’s a good sign—you’ve gathered solid data and interpreted it reasonably.&lt;/p&gt;

&lt;p&gt;But if it &lt;strong&gt;surprises or contradicts you&lt;/strong&gt;, treat that as a signal:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Did you miss key data?
&lt;/li&gt;
&lt;li&gt;Is your data incomplete or biased?
&lt;/li&gt;
&lt;li&gt;Are your internal processes misaligned with your stated goals?
&lt;/li&gt;
&lt;li&gt;Or have you simply lost touch with what your team is actually building?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use that gap to &lt;strong&gt;learn, adjust, and rerun the cycle&lt;/strong&gt;, not to dismiss the output.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Caveat&lt;/strong&gt; (you likely know this but it’s worth stating clearly):&lt;br&gt;&lt;br&gt;
This approach only works if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You feed the LLM &lt;strong&gt;faithful, complete raw inputs&lt;/strong&gt;, not just cherry-picked wins.
&lt;/li&gt;
&lt;li&gt;You &lt;strong&gt;constrain it to reason from evidence&lt;/strong&gt;, not hallucinate (which requires enough context window to hold your full input).
&lt;/li&gt;
&lt;li&gt;You treat its output &lt;strong&gt;not as truth, but as a hypothesis-generating reflection&lt;/strong&gt;.
&lt;/li&gt;
&lt;li&gt;Your situation &lt;strong&gt;doesn’t rely on deep, multi-step causal reasoning&lt;/strong&gt; (e.g., forensic cybersecurity analysis or regulatory compliance logic)—LLMs aren’t built for that kind of “why.”&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  What’s next?
&lt;/h3&gt;

&lt;p&gt;Now you’ve answered the core questions:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where are we?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where did we plan to be?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What’s the gap?&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there’s no universal “next step.” Your move depends entirely on your context, goals, and constraints. You might:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run a &lt;strong&gt;RICE prioritization&lt;/strong&gt; to course-correct and get back on track.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Double down&lt;/strong&gt; on the unexpected direction you’ve drifted into, because it’s actually more promising.
&lt;/li&gt;
&lt;li&gt;Simply &lt;strong&gt;review your backlog&lt;/strong&gt; and keep shipping, if you’re still aligned and learning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s the point: &lt;strong&gt;clarity enables choice&lt;/strong&gt;.&lt;br&gt;&lt;br&gt;
And in the pre–PMF fog, that’s your real advantage.&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>startup</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>Vibe Coding a Simple Feature Took 3 Hours. Here's Why.</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Wed, 27 Aug 2025 19:02:01 +0000</pubDate>
      <link>https://forem.com/js402/vibe-coding-a-simple-feature-took-3-hours-heres-why-4f7m</link>
      <guid>https://forem.com/js402/vibe-coding-a-simple-feature-took-3-hours-heres-why-4f7m</guid>
      <description>&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;Today, I tried what people call &lt;em&gt;vibe coding&lt;/em&gt;. The rule: I only prompt the model for code—no touching the output manually.&lt;/p&gt;

&lt;p&gt;The task seemed simple enough: add &lt;code&gt;Seed&lt;/code&gt; and &lt;code&gt;TopP&lt;/code&gt; parameters to my Go model-provider abstraction. This was straightforward plumbing, with the catch that &lt;strong&gt;all existing unit and integration tests must still pass&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I started with this interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChatArgument&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;setTemperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;setMaxTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;setTopP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// to be implemented&lt;/span&gt;
    &lt;span class="n"&gt;setSeed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c"&gt;// to be implemented&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the usual entry points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;VLLMChatClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;ChatArgument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;OpenAIChatClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;ChatArgument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;OllamaChatClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;ChatArgument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;GeminiChatClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
      &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;options&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="n"&gt;ChatArgument&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;modelrepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"You are a task processor talking to other machines. Answer briefly."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Role&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"What is the capital of Italy?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;chatClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;modelrepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTemperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;modelrepo&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithMaxTokens&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="m"&gt;60&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;require&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NoError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Content&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s"&gt;"rome"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All I wanted was to add two new arguments. A 20–60 minute manual job, tops.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Unexpected Detour
&lt;/h2&gt;

&lt;p&gt;Instead of giving me the tiny change I asked for, the model rewrote my implementations. Massive diffs. The &lt;code&gt;ChatArgument&lt;/code&gt; interface turned into... something else entirely. Sure, that might have been fine for a greenfield project, but in my codebase, four other layers depended on the existing package API, which exposed the &lt;code&gt;With...&lt;/code&gt; option pattern.&lt;/p&gt;

&lt;p&gt;That’s when I got curious: Why was the model so confident about "fixing" something I didn't want fixed?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Debate
&lt;/h2&gt;

&lt;p&gt;So I asked it to brainstorm patterns.&lt;br&gt;
Three hours later, I was in a full-on design debate with my AI assistant. It defended its choices like a junior dev who thinks they’re right and you just don’t understand their genius.&lt;/p&gt;

&lt;p&gt;The first idea it pushed was the classic Go functional options pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChatOption&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;chatOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;chatOptions&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Temperature&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;
    &lt;span class="n"&gt;MaxTokens&lt;/span&gt;   &lt;span class="kt"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;TopP&lt;/span&gt;        &lt;span class="kt"&gt;float64&lt;/span&gt;
    &lt;span class="n"&gt;Seed&lt;/span&gt;        &lt;span class="kt"&gt;int&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;WithTemperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="kt"&gt;float64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ChatOption&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On paper? Looks fine. In practice? Useless for my case. There’s no way to tell if &lt;code&gt;TopP&lt;/code&gt; was actually set or if it just defaulted to &lt;code&gt;0.0&lt;/code&gt;. And since LLM API defaults are rarely zero and differ between vendors, that distinction is critical.&lt;/p&gt;
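&lt;p&gt;A minimal sketch of that ambiguity (names are illustrative, not from my codebase): with plain value fields, an option that was never set and an option explicitly set to zero read back identically.&lt;/p&gt;

```go
package main

import "fmt"

// Value-typed options struct, roughly as the model proposed.
type chatOptions struct {
	Temperature float64
	TopP        float64
}

type ChatOption func(*chatOptions)

func WithTopP(p float64) ChatOption {
	return func(o *chatOptions) { o.TopP = p }
}

func main() {
	unset := new(chatOptions) // TopP never touched
	explicit := new(chatOptions)
	WithTopP(0.0)(explicit) // TopP deliberately set to 0.0

	// Both read back as 0.0: the "unset" and "explicitly zero"
	// states collapse into a single value.
	fmt.Println(unset.TopP == explicit.TopP) // prints "true"
}
```

&lt;p&gt;Since vendor defaults for &lt;code&gt;top_p&lt;/code&gt; are rarely zero, collapsing those two states means you can no longer tell "use the vendor default" apart from "force it to zero".&lt;/p&gt;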

&lt;p&gt;But instead of adjusting, the model doubled down. Builder pattern. Map-based options. Configuration structs. Each round, it grew more confident and more critical of my existing approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Breaking the Rule
&lt;/h2&gt;

&lt;p&gt;By 3 PM, I was staring at my to-do list—performance benchmarks, landing page copy, demo prep—and realizing that none of that was happening today.&lt;/p&gt;

&lt;p&gt;So I broke my own rule. I handed the model the blueprint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChatConfig&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Temperature&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"temperature,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;MaxTokens&lt;/span&gt;   &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="s"&gt;`json:"max_tokens,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;TopP&lt;/span&gt;        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;float64&lt;/span&gt; &lt;span class="s"&gt;`json:"top_p,omitempty"`&lt;/span&gt;
    &lt;span class="n"&gt;Seed&lt;/span&gt;        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;     &lt;span class="s"&gt;`json:"seed,omitempty"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;ChatArgument&lt;/span&gt; &lt;span class="k"&gt;interface&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;Apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ChatConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This new interface was less flexible than my original, but it was simple enough for the AI to understand while still preserving the key feature: &lt;strong&gt;pointers&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;nil&lt;/code&gt; → &lt;strong&gt;Unset&lt;/strong&gt;, use vendor default.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;amp;0.0&lt;/code&gt; → &lt;strong&gt;Explicitly set&lt;/strong&gt; to zero.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s exactly what you need when bridging multiple LLM APIs with different defaults.&lt;/p&gt;
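&lt;p&gt;To make the distinction concrete, here is one possible &lt;code&gt;ChatArgument&lt;/code&gt; implementation against the pointer-based blueprint, plus a hypothetical resolver a provider adapter might use (the &lt;code&gt;resolveTopP&lt;/code&gt; helper is illustrative, not part of the actual package):&lt;/p&gt;

```go
package main

import "fmt"

// Pointer-based config from the blueprint: nil means "unset".
type ChatConfig struct {
	TopP *float64 `json:"top_p,omitempty"`
	Seed *int     `json:"seed,omitempty"`
}

type ChatArgument interface {
	Apply(config *ChatConfig)
}

// withTopP is one illustrative ChatArgument implementation.
type withTopP struct{ v float64 }

func (w withTopP) Apply(c *ChatConfig) {
	p := new(float64)
	*p = w.v
	c.TopP = p
}

func WithTopP(v float64) ChatArgument { return withTopP{v: v} }

// resolveTopP sketches how a provider adapter could fall back to
// its own vendor default only when the caller left TopP unset.
func resolveTopP(c *ChatConfig, vendorDefault float64) float64 {
	if c.TopP == nil {
		return vendorDefault
	}
	return *c.TopP
}

func main() {
	cfg := new(ChatConfig)
	fmt.Println(resolveTopP(cfg, 0.95)) // unset: the vendor default wins

	WithTopP(0.0).Apply(cfg)
	fmt.Println(resolveTopP(cfg, 0.95)) // explicit zero survives
}
```

&lt;p&gt;The &lt;code&gt;omitempty&lt;/code&gt; tags also mean unset fields simply disappear from the serialized request, so each vendor applies its own defaults.&lt;/p&gt;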

&lt;p&gt;And once I gave it the pattern, the model behaved. Five minutes later, I had the snippets I needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;In hindsight, the problem wasn’t just “bad AI output.” My variable names weren't perfect, the interface was more of a type-safety sanity check, and some comments were stale. This context pollution—the typical stuff in any living codebase—probably nudged the model toward the wrong patterns.&lt;/p&gt;

&lt;p&gt;Still, what should’ve been a one-hour manual coding task turned into a three-hour argument with an overconfident assistant.&lt;/p&gt;

&lt;p&gt;More importantly, it validated why my abstraction looks the way it does. The pointer-based config wasn’t some over-engineering exercise; it was a deliberate design to handle the &lt;strong&gt;unset vs. explicit&lt;/strong&gt; states across inconsistent vendor APIs.&lt;/p&gt;

&lt;p&gt;The model, lacking that context, kept trying to “fix” it.&lt;/p&gt;

&lt;p&gt;The lesson? AI can be an excellent executor when you hand it a precise blueprint. But as an architect? Not so much.&lt;/p&gt;

&lt;p&gt;And that’s exactly why I built &lt;a href="https://github.com/contenox/runtime" rel="noopener noreferrer"&gt;contenox/runtime&lt;/a&gt; — because if you want agents to do serious work, abstractions and guardrails aren’t optional.&lt;/p&gt;

&lt;p&gt;I invite you to join in: let’s take control back from the LLMs.&lt;/p&gt;

</description>
      <category>go</category>
      <category>ai</category>
      <category>softwaredevelopment</category>
      <category>productivity</category>
    </item>
    <item>
      <title>The Real Job of a Developer Isn’t What You Think</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Wed, 30 Jul 2025 19:49:37 +0000</pubDate>
      <link>https://forem.com/js402/the-real-job-of-a-developer-isnt-what-you-think-gj</link>
      <guid>https://forem.com/js402/the-real-job-of-a-developer-isnt-what-you-think-gj</guid>
      <description>&lt;p&gt;Technology. Some grew up breathing it. Others had to catch up later. Either way—we're all in now.&lt;/p&gt;

&lt;p&gt;Then, a few years into that shiny tech career, you freeze. A stranger—or worse, 🧐 your own family—asks: "So... what do you &lt;em&gt;actually&lt;/em&gt; do to justify that salary? Double or triple mine, eh?"&lt;/p&gt;

&lt;p&gt;🚩 Why can't we answer?&lt;/p&gt;

&lt;p&gt;We mumble something like: "It’s complicated. I make sure when you click that button, some dude gets his report by year-end."&lt;/p&gt;




&lt;p&gt;Because &lt;strong&gt;what we build is invisible.&lt;/strong&gt; And invisibility is damn hard to explain, let alone glorify.&lt;/p&gt;

&lt;p&gt;Technology today is &lt;strong&gt;infrastructure.&lt;/strong&gt; It’s &lt;strong&gt;plumbing.&lt;/strong&gt; It’s the &lt;strong&gt;scaffolding&lt;/strong&gt; behind every tap, swipe, and click. But unlike a bridge, a building, or a cured patient, our impact isn't tangible. Outsiders don't see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The 3 AM debugging sessions hunting race conditions.&lt;/li&gt;
&lt;li&gt;  The architectural choices preventing catastrophic outages.&lt;/li&gt;
&lt;li&gt;  The automation saving millions, silently.&lt;/li&gt;
&lt;li&gt;  The security patches averting unseen disasters. 👇️&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;They don't see it. And crucially, often &lt;em&gt;we&lt;/em&gt; don't either.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're buried in the weeds: shipping, fixing, learning. We speak a language of precision, not storytelling. So when asked "What do you do?", we vomit technical jargon. The conversation flatlines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But that's not why we get paid.&lt;/strong&gt; That's not the &lt;em&gt;real&lt;/em&gt; answer.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Here's what is:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🎤 I make sure software works for more than one user.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;This isn't just your elevator pitch. It's your North Star.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Embrace this as your &lt;strong&gt;core principle&lt;/strong&gt;, regardless of your career stage. Why?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;It Forces Better Code:&lt;/strong&gt; Thinking beyond "it works on &lt;em&gt;my&lt;/em&gt; machine" exposes concurrency bugs, resource leaks, and scalability cliffs &lt;em&gt;early&lt;/em&gt;. You design APIs that don't shatter under load. You choose databases that scale predictably.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;It Cultivates Systems Thinking:&lt;/strong&gt; You stop seeing isolated code. You start seeing flows, dependencies, and failure domains – the living system as a whole. How does &lt;em&gt;this&lt;/em&gt; service impact &lt;em&gt;that&lt;/em&gt; user when 10,000 others hit it simultaneously? Where are the bottlenecks? The single points of failure?&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;It Reveals True Value:&lt;/strong&gt; Your work isn't the clever algorithm (though that might be part of it). Your value is &lt;strong&gt;enabling reliable, simultaneous use by many.&lt;/strong&gt; &lt;em&gt;That's&lt;/em&gt; the magic justifying the paycheck – transforming a tool into resilient infrastructure.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;It Clarifies Priorities:&lt;/strong&gt; Does this refactor make the system &lt;em&gt;more&lt;/em&gt; reliable for &lt;em&gt;more&lt;/em&gt; users? Does this feature handle edge cases gracefully under load? Does this observability tool actually help us &lt;em&gt;prevent&lt;/em&gt; widespread outages? If not, why prioritize it?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;"Works for one" is a prototype. "Works for many" is a product. "Works for millions" is an empire.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Start building like it matters — because it does.&lt;/p&gt;

</description>
      <category>career</category>
      <category>programming</category>
      <category>beginners</category>
      <category>webdev</category>
    </item>
    <item>
      <title>“Fake It Till You Make It” vs. Building Real Tech</title>
      <dc:creator>Alexander Ertli</dc:creator>
      <pubDate>Fri, 18 Jul 2025 18:16:25 +0000</pubDate>
      <link>https://forem.com/js402/fake-it-till-you-make-it-vs-building-real-tech-50p8</link>
      <guid>https://forem.com/js402/fake-it-till-you-make-it-vs-building-real-tech-50p8</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I didn’t delete the repo. I renamed it to MVP.&lt;br&gt;&lt;br&gt;
And somehow… that solved everything."&lt;/em&gt; 😅  &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  There was a time...
&lt;/h3&gt;

&lt;p&gt;There was a time when I had a job, a salary, and a plate of projects. And then, I got that vision. So I bet everything on one card.  &lt;/p&gt;

&lt;p&gt;Yeah — sorry for making this personal. That’s not usually how I write, but bear with me.  &lt;/p&gt;

&lt;p&gt;Back then, I didn’t even know I was committing to building an infrastructure product — one that would require inventing a new DSL and writing an entire runtime just to prove the idea was even viable.  &lt;/p&gt;

&lt;p&gt;My background in R&amp;amp;D helped. I knew how to get hard things into production. So I coded — mostly from 7 AM to midnight, day after day. Months passed. Everything that had once been “my life” started to feel like a distraction.  &lt;/p&gt;




&lt;h3&gt;
  
  
  Plumbing Done, Now What?
&lt;/h3&gt;

&lt;p&gt;Eventually, the grind shifted. I had the plumbing. The thing worked. It checked the boxes of my original vision.  &lt;/p&gt;

&lt;p&gt;It was time to make decisions. I started preparing for a company. Researched trends, went to summits, talked to competitors. I was in founder mode now.  &lt;/p&gt;

&lt;p&gt;And that’s when the real stress began.  &lt;/p&gt;

&lt;p&gt;I started comparing:&lt;br&gt;&lt;br&gt;
What am I doing better? What worse?&lt;br&gt;&lt;br&gt;
How do I meet investor expectations? What about grants?&lt;br&gt;&lt;br&gt;
Is what I’m building too complex? Or too simple?&lt;br&gt;&lt;br&gt;
Who reads my content and why? &lt;/p&gt;

&lt;p&gt;Demo after demo, users asked for things like web search or LLM-based shopping recommendations. Every second request needed live web data to answer properly.  &lt;/p&gt;

&lt;p&gt;Sure, I &lt;em&gt;could&lt;/em&gt; have just wrapped Bing and given them what they wanted… but it wasn’t aligned with my vision. And it would likely have put me in direct competition with the biggest players out there. &lt;/p&gt;

&lt;p&gt;Meanwhile, other tasks started piling up — SEO, domain wiring, cloud credits, newsletter setups, status pages, discovering who could help and how. My once-focused vision became blurred by a thousand small operational demands.  &lt;/p&gt;

&lt;p&gt;Nothing snapped together.&lt;br&gt;&lt;br&gt;
The product was vague.&lt;br&gt;&lt;br&gt;
And for the first time, I almost hit “delete” on the repo.  &lt;/p&gt;




&lt;h3&gt;
  
  
  The Rename That Changed Everything
&lt;/h3&gt;

&lt;p&gt;But I didn’t delete it.&lt;br&gt;
I renamed the repo to &lt;strong&gt;&lt;code&gt;mvp/&lt;/code&gt;&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;And that changed everything.&lt;br&gt;&lt;br&gt;
Funny, isn’t it? Why would a rename fix burnout?  &lt;/p&gt;

&lt;p&gt;Because the pressure wasn’t real. It was all in my head.  &lt;/p&gt;

&lt;p&gt;What &lt;em&gt;was&lt;/em&gt; real? The code I had written.&lt;br&gt;
A runtime. A DSL. A system to accelerate GenAI-based application development.&lt;br&gt;
Something solid. Something I believe in.  &lt;/p&gt;




&lt;h3&gt;
  
  
  What I Saw in Others
&lt;/h3&gt;

&lt;p&gt;At summits, I saw other founders. Their setup looked totally different.  &lt;/p&gt;

&lt;p&gt;They were focused on finding pain points and automating them — on &lt;em&gt;selling&lt;/em&gt; something people would pay for, not building something people might not understand.&lt;/p&gt;

&lt;p&gt;I asked myself: what can I learn from them?  &lt;/p&gt;

&lt;p&gt;What can I learn from the grant reps who didn’t even understand what I was building?&lt;br&gt;&lt;br&gt;
And why can’t I just give users what they ask for?  &lt;/p&gt;




&lt;h3&gt;
  
  
  The Curse of Knowledge
&lt;/h3&gt;

&lt;p&gt;The answer: &lt;strong&gt;the curse of knowledge&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;I &lt;em&gt;knew&lt;/em&gt; what they were asking for wasn’t feasible — technically or practically.&lt;br&gt;&lt;br&gt;
I couldn’t just “add” features that would violate the entire architecture.&lt;br&gt;&lt;br&gt;
I couldn’t pretend this was a no-code wrapper around LLMs.&lt;/p&gt;

&lt;p&gt;And that’s what made me realize something else:&lt;/p&gt;

&lt;p&gt;I had entered a new phase.  &lt;/p&gt;




&lt;h3&gt;
  
  
  The New Phase: Productization
&lt;/h3&gt;

&lt;p&gt;I’m no longer proving an idea.&lt;br&gt;
I’m not inventing the base layers anymore.&lt;/p&gt;

&lt;p&gt;Now it’s about turning what I’ve built into something people can adopt.&lt;br&gt;
Into something clear, usable, and opinionated.&lt;br&gt;&lt;br&gt;
Into a product.&lt;/p&gt;

&lt;p&gt;I hope you learned something. Thanks for reading. &lt;/p&gt;

</description>
      <category>startup</category>
      <category>productivity</category>
      <category>ai</category>
      <category>development</category>
    </item>
  </channel>
</rss>
