Forem: Ertugrul

OpenAnima v1 — Open-Source Desktop Overlay Engine for Windows

Ertugrul — Fri, 15 May 2026 06:15:20 +0000

After months of building, experimenting, rewriting systems, debugging strange desktop issues, and testing different asset formats, OpenAnima v1 is finally out.

OpenAnima is an open-source desktop overlay engine for Windows that lets you place animated assets directly onto your desktop.

You can use:

GIFs
Static images
Sprite strips
Spritesheets
Frame-folder animations
RPG-style HUD elements
Transparent animated assets
Pixel-art characters
Desktop companions

The goal of the project was simple:

Create a lightweight desktop overlay system that feels flexible, customizable, and fun to experiment with.

What OpenAnima Can Do

OpenAnima allows you to spawn movable overlay windows directly on top of your desktop.

Each overlay can be:

dragged freely
resized
locked in place
made click-through
set always-on-top
adjusted for opacity
animated with custom FPS settings

The application also restores overlay states between sessions, so your desktop setup persists after restarting the app.
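
One way to picture the session restore is a small state record per overlay that is written to disk and read back on startup. The sketch below is only an illustration: the field names, the JSON layout, and the functions `save_overlays` and `load_overlays` are assumptions, not OpenAnima's actual config format.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class OverlayState:
    """Hypothetical per-overlay settings; field names are illustrative only."""
    asset_path: str
    x: int = 0
    y: int = 0
    scale: float = 1.0
    opacity: float = 1.0
    fps: int = 12
    locked: bool = False
    click_through: bool = False
    always_on_top: bool = True


def save_overlays(path: Path, overlays: list[OverlayState]) -> None:
    """Write all overlay states to a single JSON file."""
    payload = [asdict(overlay) for overlay in overlays]
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_overlays(path: Path) -> list[OverlayState]:
    """Read overlay states back; a missing or broken file yields an empty list."""
    try:
        raw = json.loads(path.read_text(encoding="utf-8"))
        return [OverlayState(**item) for item in raw]
    except (OSError, ValueError, TypeError):
        # Fall back to an empty desktop instead of crashing on a bad config.
        return []
```

Returning an empty list on failure is one possible recovery choice; a real application might also keep a backup copy of the broken file.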

Supported Asset Types

One of the biggest goals of the project was supporting multiple animation workflows instead of only GIFs.

Currently supported:

GIF animations
Static images (.png, .jpg, .webp)
Frame-folder animations
Horizontal sprite strips
Vertical sprite strips
Spritesheets with metadata
Basic HUD/UI overlay assets

The engine also includes an asset analyzer system that tries to detect asset types automatically.
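
To give a rough idea of what automatic detection can look like, here is a minimal sketch that guesses a type from the file path and, for images, the pixel size. The function `guess_asset_type`, the returned labels, and the aspect-ratio rule are all assumptions made for illustration; the real analyzer may use different signals.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def guess_asset_type(path: Path, width: int = 0, height: int = 0) -> str:
    """Guess an asset type from the path and an optional image size.

    The labels and the strip heuristic are hypothetical, not OpenAnima's own.
    """
    if path.is_dir():
        return "frame_folder"
    suffix = path.suffix.lower()
    if suffix == ".gif":
        return "gif"
    if suffix not in IMAGE_EXTENSIONS:
        return "unknown"
    if width > 0 and height > 0:
        # A strip of square frames is an exact multiple of its short side.
        if width >= 2 * height and width % height == 0:
            return "sprite_strip_horizontal"
        if height >= 2 * width and height % width == 0:
            return "sprite_strip_vertical"
    return "static_image"
```

A heuristic like this can only guess, which is why a manual setup step for frame count and frame size is still useful as a fallback.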

Built With

OpenAnima is primarily built using:

Python
PyQt6
QMovie / QPixmap rendering systems
Custom animation parsing logic

A lot of time went into handling edge cases like:

hidden overlays
broken configs
off-screen windows
click-through recovery
corrupted animation states

v1 focuses heavily on stability and recovery tools.
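
As an example of one such recovery step, an off-screen overlay can be pulled back by clamping its saved position to the screen bounds. This is a simplified sketch that assumes a single screen with its origin at (0, 0); the function name `clamp_to_screen` is hypothetical, and multi-monitor setups would need the real screen geometry.

```python
def clamp_to_screen(x: int, y: int, width: int, height: int,
                    screen_width: int, screen_height: int) -> tuple[int, int]:
    """Return a position that keeps the window inside a single screen.

    Windows larger than the screen are pinned to the top-left corner.
    """
    max_x = max(0, screen_width - width)
    max_y = max(0, screen_height - height)
    return (min(max(x, 0), max_x), min(max(y, 0), max_y))
```

Applied when a saved layout is loaded, a check like this would stop an overlay from being restored to coordinates that no longer exist, for example after a monitor was unplugged.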

Features Added During Development

Some systems added throughout development:

Persistent overlay state saving
Safer config recovery
Recovery tools for invisible overlays
Logging system
Diagnostics panel
Asset metadata support
Import wizard
Sprite animation handling
Basic layered UI rendering
System tray controls

Why I Built It

I always liked desktop customization tools, animated desktop companions, game HUD overlays, and lightweight desktop effects.

Most existing solutions were either:

too limited
too complicated
too specific
abandoned
or focused on only one asset type

So I started experimenting with building my own system.

The project slowly evolved from “just display GIFs on desktop” into a more general desktop overlay engine.

Future Plans

Some ideas planned for future versions:

Better transparent video support
More advanced animation systems
Improved UI tools
Better performance optimizations
Asset marketplace/import improvements
More overlay interaction systems
Possible 3D asset support

Download

Website / Download: https://ertugrulmutlu.github.io/OpenAnima/
GitHub Repository: https://github.com/Ertugrulmutlu/OpenAnima
itch.io Page: https://ertugrulmutlu.itch.io/openanima
Instagram: https://www.instagram.com/openanimaengine

Final Thoughts

OpenAnima started as a small side project and slowly became a much larger system than I originally expected.

There is still a lot to improve, but releasing v1 feels like a huge milestone for the project.

Feedback, ideas, bug reports, and experiments are always welcome.

I’m excited to see what people create with it.

PromptLedger v0.6 — Turning prompt history into a local workspace dashboard

Ertugrul — Mon, 04 May 2026 16:23:46 +0000

Devlog — Part 5

PromptLedger v0.6 is out.

This release changes how PromptLedger feels to use.

Until now, PromptLedger was primarily a terminal-first tool with a small read-only viewer. The core workflow already existed: store prompt versions, compare changes, label releases, mark important versions, and keep everything local in SQLite.

That worked.

But once a prompt library grows, a simple viewer stops being enough.

Prompt iteration is not just about storing text. It is about navigating decisions:

Which version worked best?
Which version became stable?
What changed between versions?
Which prompt belongs to which workflow?
Which prompt should be reused?

Those questions are easier to answer when the history is not only stored, but also visible and interactive.

So v0.6 turns the old viewer into a local prompt workspace dashboard.

Workspace dashboard

The dashboard now starts with a card-based workspace instead of a single prompt view.

Each prompt appears as a card showing:

latest version
collection and role
markers such as stable or milestone
short preview of the prompt
last updated time

This makes it easier to scan a prompt library and understand what exists without opening each item.

Prompt detail view

Clicking a card opens a detailed view of the prompt.

This view includes:

full prompt text
version metadata
markers and labels
version timeline
side-by-side comparison

The goal is to make it easy to understand how a prompt evolved over time.

What changed in v0.6

Workspace instead of viewer

The dashboard is no longer just a viewer. It is a workspace where prompts can be explored and organized visually.

Card-based interaction

Prompts are now treated as objects instead of rows in a list. Cards provide quick context and actions.

Marker actions in the UI

Stable and milestone markers can now be applied directly from the dashboard. These actions use the same underlying marker system as the CLI.

Compare workflow

The compare view has been improved to clearly show differences between versions.

Usability improvements

keyboard shortcuts for faster navigation
copy actions with feedback
better empty states
improved hover and selection behavior

Design direction

PromptLedger remains intentionally limited in scope.

It is:

local-first
SQLite-backed
CLI-driven for write operations
dashboard-driven for inspection and workflow

It does not include:

cloud services
telemetry
external APIs
AI features

This boundary is important.

The goal is not to build another platform, but to provide a reliable tool for working with prompt history.

Installation

pip install --upgrade promptledger

Run the dashboard

promptledger dashboard

Closing

v0.6 is a shift from storing prompt history to working with it.

The dashboard is not meant to replace the CLI, but to complement it by making prompt iteration easier to inspect, compare, and organize.

There is still a lot to improve, but this version establishes a clearer direction:

PromptLedger is becoming a tool for thinking about prompts, not just storing them.

Links

PyPI: PyPI
GitHub: GitHub
LinkedIn: LinkedIn
Website: Website

OpenAnima v0.2 Preview: Turning the Windows Desktop into a Living Canvas

Ertugrul — Fri, 01 May 2026 10:31:23 +0000

I recently published OpenAnima v0.2 Preview, and this release is a big step for the project.

OpenAnima started as a small experiment: what if I could place animated GIFs directly on my Windows desktop as movable overlay objects?

That simple idea quickly became more interesting.

Instead of only supporting GIFs, OpenAnima is now moving toward a more general goal:

An open-source desktop asset overlay engine for Windows.

The idea is to make the desktop feel less static. OpenAnima lets you place animated assets, sprites, frame animations, HUD elements, and game-style visual assets directly on your desktop, then control how they behave.

Website: https://ertugrulmutlu.github.io/OpenAnima/
itch.io: https://ertugrulmutlu.itch.io/openanima
GitHub: https://github.com/Ertugrulmutlu/OpenAnima
Release: https://github.com/Ertugrulmutlu/OpenAnima/releases/tag/v0.2.0-preview

What OpenAnima does

OpenAnima is a Windows desktop application that lets users add visual assets on top of their desktop.

These assets can be moved around, configured, and used as lightweight desktop overlays.

Some example use cases:

Desktop companions or animated mascots
Game-style HUD elements
Stream or recording overlays
Sprite and animation preview experiments
Ambient desktop widgets
Weird little visual experiments

The project is still early, but the direction is becoming clearer: OpenAnima is not just a GIF player. It is becoming a small visual layer above the desktop.

Why I built it

I like projects that feel small at first, but slowly reveal a larger design space.

At the beginning, OpenAnima was simply about putting GIFs on the desktop. That was already fun, but it also felt limited.

Once I started thinking about game assets, HUD elements, sprites, and layered UI assets, the project became more like a desktop rendering playground.

I wanted to explore questions like:

Can desktop overlays be treated like reusable visual assets?
Can a normal desktop become a small interactive canvas?
Can game-style assets live outside a game engine?
Can asset packs become something users can import, configure, and reuse?

That is where the v0.2 release comes in.

What changed in v0.2

The v0.2 preview expands OpenAnima from a GIF overlay tool into a more metadata-driven desktop asset engine.

The biggest changes are:

Generic asset analyzer and import wizard
asset.json metadata support
Support for multiple asset formats
Sprite strip setup workflow
Spritesheet rendering with named animations
Composite UI/HUD assets
Runtime sliders for layered UI assets
Improved Editor tab and Asset Setup dialog layouts
Safer metadata validation and error handling
Backward compatibility for existing GIF/static/frame-folder workflows

This release is still a preview, but the foundation is much stronger than the first version.

Supported asset types

OpenAnima v0.2 supports several asset types:

GIF
Static image
Frame-folder animation
Sprite strip
Spritesheet
Composite UI / HUD

The first version was mostly focused on simple animated overlays. v0.2 starts building the system needed for more structured assets.

Metadata-driven assets

One of the most important changes in v0.2 is asset.json support.

Instead of hardcoding every asset type, OpenAnima can now use metadata to understand how an asset should be loaded and rendered.

A simplified example could look like this:

{
  "name": "demo_hud",
  "type": "composite_ui",
  "layers": [
    {
      "name": "background",
      "file": "background.png",
      "x": 0,
      "y": 0
    },
    {
      "name": "bar",
      "file": "bar.png",
      "x": 12,
      "y": 20
    }
  ]
}

This makes the project more flexible. Instead of treating every asset as just an image or GIF, OpenAnima can start understanding assets as structured objects.

That opens the door for asset packs, richer UI overlays, and more advanced configuration later.
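
Metadata like the example above also needs to be checked before it is rendered. The following is a minimal validation sketch for that simplified shape only; the function `validate_composite_asset` and its error messages are assumptions, and the real asset.json schema may contain more fields or different rules.

```python
import json

REQUIRED_LAYER_KEYS = ("name", "file", "x", "y")


def validate_composite_asset(raw_json: str) -> list[str]:
    """Return a list of problems found in composite_ui metadata (empty if none)."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    problems = []
    if not isinstance(data.get("name"), str) or not data.get("name"):
        problems.append("missing or empty 'name'")
    if data.get("type") != "composite_ui":
        problems.append("'type' must be 'composite_ui' for this check")
    layers = data.get("layers")
    if not isinstance(layers, list) or not layers:
        problems.append("'layers' must be a non-empty list")
        return problems
    for index, layer in enumerate(layers):
        if not isinstance(layer, dict):
            problems.append(f"layer {index} must be an object")
            continue
        for key in REQUIRED_LAYER_KEYS:
            if key not in layer:
                problems.append(f"layer {index} is missing '{key}'")
    return problems
```

Collecting every problem instead of stopping at the first one makes it easier to show a complete error report in an import dialog.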

Sprite strips and spritesheets

Another focus of v0.2 was better support for game-style assets.

Sprite strips and spritesheets are common in game development, but they usually need some setup before they can be used correctly.

OpenAnima now includes workflows for:

Frame count
Frame size
Crop fields
Preview grid
Frame export
Named animations for spritesheets

This is still not perfect, especially for unusual sprite layouts, but it is a good step toward making OpenAnima useful for more than simple GIF overlays.
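
To make the frame count, frame size, and crop fields more concrete, here is a small sketch that turns them into per-frame rectangles. It assumes evenly sized frames and treats the crop values as an outer margin around the whole strip; the function `strip_frame_rects` and its parameters are hypothetical, not the engine's actual setup code.

```python
def strip_frame_rects(image_width: int, image_height: int, frame_count: int,
                      horizontal: bool = True, crop_left: int = 0,
                      crop_top: int = 0, crop_right: int = 0,
                      crop_bottom: int = 0) -> list[tuple[int, int, int, int]]:
    """Return (x, y, width, height) rectangles for each frame of a sprite strip."""
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    usable_w = image_width - crop_left - crop_right
    usable_h = image_height - crop_top - crop_bottom
    if usable_w <= 0 or usable_h <= 0:
        raise ValueError("crop values leave no usable area")
    if horizontal:
        frame_w, frame_h = usable_w // frame_count, usable_h
    else:
        frame_w, frame_h = usable_w, usable_h // frame_count
    if frame_w == 0 or frame_h == 0:
        raise ValueError("frame_count is too large for this image")
    rects = []
    for i in range(frame_count):
        x = crop_left + (i * frame_w if horizontal else 0)
        y = crop_top + (0 if horizontal else i * frame_h)
        rects.append((x, y, frame_w, frame_h))
    return rects
```

Strips with padding between individual frames would need an extra spacing parameter, which is the kind of unusual layout that still calls for manual correction.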

Composite UI and HUD assets

I also added support for composite UI/HUD assets.

The idea is simple: an overlay does not have to be one image. It can be made from multiple layers.

For example:

A health bar background
A fill layer
A frame layer
A text or icon layer
Runtime sliders controlling values

This makes it possible to experiment with game-like desktop HUDs.
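
For a health bar, the link between a runtime slider and the fill layer can be as simple as mapping the slider value to a pixel width. The sketch below assumes a left-anchored fill that is clipped horizontally; `fill_width` and its default range of 0 to 100 are illustrative choices, not the editor's real behavior.

```python
def fill_width(full_width: int, value: float,
               minimum: float = 0.0, maximum: float = 100.0) -> int:
    """Map a slider value to the visible pixel width of a fill layer."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    clamped = min(max(value, minimum), maximum)
    fraction = (clamped - minimum) / (maximum - minimum)
    return round(full_width * fraction)
```

Clamping the value first keeps the fill inside its frame even when a slider or script reports a number outside the expected range.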

The current Composite UI editor is functional, but it is not meant to be a professional layout tool yet. It is more of a foundation for future versions.

The control panel

OpenAnima has a small control panel with three main areas:

Library
Active
Editor

The Library tab is used to import and manage assets.

The Active tab is used to manage overlays currently placed on the desktop.

The Editor tab is used to adjust selected assets with controls like:

Scale
Opacity
Speed
Always on top
Click-through
Locked

The goal is to keep the interface practical and lightweight. I do not want OpenAnima to become a huge desktop suite. I want it to stay small, hackable, and focused.

Website and distribution

For this release, I also created a small website for the project:

https://ertugrulmutlu.github.io/OpenAnima/

The website explains the idea, shows the demo, links to the GitHub repository, and provides access to the Windows executable through GitHub Releases.

I also published the project on itch.io as a free tool:

https://ertugrulmutlu.itch.io/openanima

This was mainly to make the project feel like a small product and not only a repository.

Known limitations

This is still a preview release, so some limitations are expected.

Known issues include:

Sprite strips may require manual frame count or frame size correction.
Some sprite strips with unusual padding may need manual crop values.
Spritesheets require metadata or setup through the import wizard.
Composite UI assets may require manual layer alignment.
The Composite UI editor is functional but not a full professional layout tool.
3D model support is not included yet.
Some unusual asset packs may still need manual asset.json editing.

I prefer being clear about this because v0.2 is not a polished final product. It is a foundation release.

What I want to explore next

There are several directions I want to explore after v0.2:

Asset packs
Better first-run experience
More polished installer/distribution flow
Better preview and import tools
More reliable spritesheet workflows
Simple 3D overlay experiments
Linux experiments later
Better documentation for custom assets

The 3D direction is especially interesting, but I do not want to rush it before the 2D asset foundation is stable.

What I learned

This project reminded me that even small desktop tools can become surprisingly deep.

At first, the problem looked simple:

Put an animated object on the desktop.

But once I started supporting different asset types, the project became about importing, validating, describing, previewing, rendering, and controlling assets.

That means the real challenge is not only drawing something on the desktop. The real challenge is creating a flexible system around desktop assets.

OpenAnima v0.2 is my first serious step in that direction.

Links

Website: https://ertugrulmutlu.github.io/OpenAnima/
GitHub: https://github.com/Ertugrulmutlu/OpenAnima
Release: https://github.com/Ertugrulmutlu/OpenAnima/releases/tag/v0.2.0-preview
itch.io: https://ertugrulmutlu.itch.io/openanima

Feedback, bug reports, weird desktop overlay ideas, and asset workflow suggestions are very welcome.

Thanks for reading.

PromptLedger v0.4 — Faster prompt logging, lightweight markers, and better prompt organization

Ertugrul — Mon, 27 Apr 2026 15:19:18 +0000

Devlog — Part 4

PromptLedger v0.4 is mostly about making prompt versioning easier to use repeatedly: faster logging, clearer organization, and small release signals for versions worth remembering.

In the earlier parts of this series, PromptLedger started as a deliberately small local-first prompt version control tool. Then it gained labels, status semantics, better diffs, review workflows, semantic summaries, warnings, and Markdown export.

Those additions made the history more useful after prompts had already been logged.

Some of the direction for v0.4 also came from feedback and from watching where the workflow still felt a bit too manual. So before going into the details: thank you to everyone who tried the earlier versions, shared thoughts, pointed out rough edges, or simply asked practical questions about how the tool should behave during real prompt iteration.

v0.4 focuses more on the moment before review: the actual day-to-day act of adding, organizing, and revisiting prompt versions.

Because in practice, a prompt version control tool only works if logging a prompt is cheap enough that you actually keep doing it.

Why this part was needed

PromptLedger has always been intentionally limited in scope.

It is SQLite-backed. It is terminal-first. It does not need a hosted backend. It does not try to execute prompts for you. It does not try to become an evaluation platform.

The core job is still simple:

store prompt versions
compare them
review changes
organize prompt history
make the history inspectable later

But once I started using it more like a real prompt library, a small problem became obvious.

Adding a prompt version was technically easy, but repetitive.

During actual prompt iteration, the prompt text changes often, while the surrounding metadata usually stays the same. The same author. The same environment. The same tags. The same library grouping. The same role.

Having to retype that metadata every time creates friction.

And friction matters. If logging is annoying, the history becomes incomplete. If the history is incomplete, review becomes less useful. And if review becomes less useful, the tool stops doing its main job.

So v0.4 is not a big architectural release.

It is a workflow release.

It makes repeated prompt logging faster, adds better organization primitives, and introduces lightweight markers for important versions.

Quick add: less typing during real iteration

The main usability change in v0.4 is the new add --quick workflow.

A normal add can still be explicit:

promptledger add --id onboarding --text "..." --collection support --role system

That is useful when creating a new prompt or when metadata should be stated clearly.

But once a prompt already exists, most iterations do not need all metadata to be typed again. With --quick, PromptLedger can reuse safe metadata defaults from the latest version of the same prompt id.

promptledger add --id onboarding --text "..." --quick

If the latest onboarding version already had metadata like author, tags, env, collection, and role, those values can be reused unless explicitly overridden.

That means the common workflow becomes much smaller:

edit the prompt
add the new version
keep moving

This is not a flashy feature, but it changes the feel of the tool.

Prompt logging becomes closer to a habit than a chore.

That matters because PromptLedger is not useful because one perfect prompt was saved once. It is useful because a sequence of changes becomes reviewable later.

Quick add is there to protect that sequence.
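
The reuse rule behind --quick can be sketched as a small merge: start from the latest version's reusable metadata, then let explicit values win. This is an illustration only; the function `quick_add_metadata`, the dict-based representation, and the exact set of reusable fields are assumptions based on the fields named above, not PromptLedger's internal code.

```python
REUSABLE_FIELDS = ("author", "tags", "env", "collection", "role")


def quick_add_metadata(latest: dict, overrides: dict) -> dict:
    """Merge metadata for a new version: explicit overrides win over reused values."""
    # Reuse only whitelisted fields that the latest version actually set.
    merged = {key: latest[key] for key in REUSABLE_FIELDS
              if latest.get(key) is not None}
    # Anything passed explicitly (not None) replaces the reused value.
    merged.update({key: value for key, value in overrides.items()
                   if value is not None})
    return merged
```

Restricting reuse to a fixed list of fields means per-version values, such as the reason for a change, are never silently carried over.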

Collection and role are now first-class metadata

v0.4 also adds two first-class metadata fields on prompt versions:

collection
role

The reason is simple: once a prompt library grows, ids and tags are not always enough.

A collection gives prompts a lightweight grouping. It can represent a product area, a project, a use case, a customer support flow, an internal tool, or just a personal folder-like grouping.

For example:

promptledger add --id onboarding --text "..." --collection support --role system

This makes it easier to ask questions like:

Which prompts belong to the support collection?
Which prompts are part of an onboarding workflow?
Which versions were written for a specific environment or use case?

The second field, role, is about what kind of prompt artifact this version represents.

Built-in roles are:

system
user
template
modelfile
eval

This matters because prompt libraries usually contain different kinds of artifacts.

A system instruction is not the same thing as a reusable template. An evaluation prompt is not the same thing as a user-facing message. A model file prompt is not the same thing as an onboarding assistant instruction.

Before v0.4, these differences could be represented with tags, but tags are free-form and tend to become messy over time.

Making role first-class gives PromptLedger a small amount of structure without turning it into a large framework.

That is the balance I wanted here: enough organization to be useful, but not so much that the tool starts dictating how every prompt library must be designed.

Markers: small signals for important versions

v0.4 introduces a new marker system for prompt versions.

The core commands are:

promptledger marker set --id onboarding --version 8 --name stable

promptledger marker remove --id onboarding --version 8 --name stable

promptledger marker list --id onboarding

promptledger marker show --id onboarding --version 8

There are also convenience commands for the built-in markers:

promptledger stable --id onboarding

promptledger milestone --id onboarding

The built-in markers are:

stable
milestone

Markers are intentionally lighter than labels.

That distinction is important.

Labels in PromptLedger are useful when you want release-like semantics or named pointers. They can represent states such as a current production prompt, a reviewed version, or a version used in a specific workflow.

Markers are smaller than that.

A marker says: this version is worth noticing.

Maybe it was the first version that worked well. Maybe it was a milestone during a rewrite. Maybe it was stable enough to revisit later. Maybe it is just a checkpoint that should not get lost in the version list.

Not every important prompt version needs full label semantics.

Sometimes you only need a small flag attached directly to a version.

That is what markers are for.

They are deliberately simple. They do not try to encode deployment state, environment ownership, evaluation results, or production routing. They are just lightweight release signals inside the local history.

Search, list, and show became more useful

Organization only helps if the tools expose it.

So v0.4 updates list, show, and search to surface and use collection, role, and marker information.

For example, searching by metadata becomes more natural:

promptledger search --collection support --role system

Search can also work with an empty --contains, which means metadata-only filtering is now possible.

That sounds small, but it changes how PromptLedger can be used.

Before, search was mostly about finding prompt text. Now it can also be used to navigate the prompt library itself.

For example:

show all system prompts in a collection
find evaluation prompts across a project
inspect versions marked as stable
separate templates from user prompts
review one slice of the library without relying on text search

This moves PromptLedger a little further from “prompt version storage” toward “prompt library workflow”.

Still local. Still small. Still inspectable.

But more practical once the number of prompt artifacts grows.

The Streamlit UI is still read-only

PromptLedger also has a small Streamlit UI for inspecting prompt history.

In v0.4, the UI now surfaces collection, role, and markers in timeline, detail, and comparison views.

That makes the UI more useful when browsing a prompt library. You can see not only how a prompt changed, but also what kind of prompt it is, which collection it belongs to, and whether a version was marked as stable or as a milestone.

The important part: the UI is still read-only.

That is intentional.

For now, PromptLedger keeps editing and logging in the CLI, while the UI remains focused on inspection. This keeps the implementation smaller and avoids turning the viewer into a second source of write behavior.

The terminal is where versions are created.

The UI is where history can be reviewed more comfortably.

That split still feels right for the project.

Database changes stayed local and simple

v0.4 moves the schema version from 3 to 5.

The main database changes are straightforward:

a new markers table
new collection and role columns on prompt_versions
migrations for the local SQLite database

There is no remote migration service. No hosted registry. No account state. No backend coordination.

The migration story stayed aligned with the rest of the project: local, inspectable, and boring in a good way.

That matters because PromptLedger should remain easy to understand. A small SQLite-backed tool should not require infrastructure thinking just to store prompt history.
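
A local migration of this kind can be sketched with nothing more than the standard sqlite3 module. The example below is an assumption-heavy illustration: it tracks the schema version with PRAGMA user_version, splits the work into two steps, and invents the column layout of the markers table. Only the table name prompt_versions, the markers table, and the collection and role columns come from the description above.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Apply hypothetical v0.4-style schema changes once, tracked by user_version."""
    version = conn.execute("PRAGMA user_version").fetchone()[0]
    if version < 4:
        # Step 1 (assumed split): add the new metadata columns.
        conn.execute("ALTER TABLE prompt_versions ADD COLUMN collection TEXT")
        conn.execute("ALTER TABLE prompt_versions ADD COLUMN role TEXT")
        conn.execute("PRAGMA user_version = 4")
    if version < 5:
        # Step 2 (assumed split): add the markers table.
        conn.execute(
            """
            CREATE TABLE IF NOT EXISTS markers (
                prompt_id TEXT NOT NULL,
                version INTEGER NOT NULL,
                name TEXT NOT NULL,
                UNIQUE (prompt_id, version, name)
            )
            """
        )
        conn.execute("PRAGMA user_version = 5")
    conn.commit()
```

Because each step is guarded by the stored version number, running the function twice leaves the database unchanged the second time.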

Tests expanded around the new workflows

v0.4 also expands the test coverage substantially around the areas that changed most:

add workflows
quick add behavior
marker commands
search and metadata filtering
related list/show behavior

This is not the kind of testing expansion that makes for a dramatic release note, but it is important for this project.

PromptLedger deals with history. If history is stored incorrectly, the tool loses trust quickly.

So the goal is practical correctness:

metadata should be reused only when expected
explicit overrides should still win
markers should attach to the right versions
search should return the intended slice of the prompt library
existing workflows should not become harder to use

For a local-first developer tool, confidence matters more than feature count.

Design tradeoffs

The main tradeoff in v0.4 was deciding how much structure to add.

PromptLedger could have gone further.

It could have introduced nested collections, custom role registries, marker categories, richer release channels, or project configuration files.

I avoided that for now.

The project works best when the concepts stay small:

ids identify prompt histories
versions preserve changes over time
labels provide stronger named semantics
markers provide lightweight version signals
collection groups prompt versions
role explains what kind of prompt artifact a version is

That is enough structure to make a growing prompt library easier to navigate, without making the tool feel like a platform.

This is also why markers are not labels with another name.

Labels are useful when you want something closer to a maintained pointer or status. Markers are useful when you want to annotate a version as notable.

Both can exist because they answer different workflow questions.

A label asks:

What does this version represent in a larger workflow?

A marker asks:

Is this version worth noticing later?

That difference is small, but it keeps the model clean.

What did not change

v0.4 does not change the basic philosophy of PromptLedger.

It is still a small, local-first prompt version control tool.

It still does not add:

hosted registry
prompt execution APIs
agent tooling
telemetry pipelines
cloud sync
automatic scoring
evaluation harnesses

Those are not bad ideas in general. They are just outside the current scope of this project.

PromptLedger is not trying to run prompts, score prompts, deploy prompts, or orchestrate agents.

It is trying to make prompt changes easier to store, compare, review, and organize.

That boundary is useful.

Small scope is not a missing feature here. It is part of the design.

Closing thoughts

PromptLedger v0.4 is a usability and organization release.

It does not radically change what the tool is. Instead, it makes the existing workflow smoother.

Quick add reduces friction during iteration. Collection and role make prompt libraries easier to navigate. Markers create a lightweight way to remember important versions. Search, list, show, and the read-only UI now expose more of that structure.

The result is still intentionally modest.

A local SQLite database. A CLI. A read-only inspection UI. Deterministic exports and reviewable history where possible.

But the day-to-day workflow feels better now.

And for a tool like this, that matters.

Because prompt version control only becomes useful when it is easy enough to use consistently.

Links

PyPI: PyPI
GitHub: GitHub
LinkedIn: LinkedIn
Website: Website

PromptLedger v0.3 — Turning prompt history into a practical review workflow.

Ertugrul — Sat, 28 Mar 2026 13:37:59 +0000

Devlog — Part 3

Turning prompt history into a practical review workflow.

In Part 1, I introduced PromptLedger as a deliberately small, local-first tool for treating prompts like code.

In Part 2, I added release semantics: labels, label history, and status views that made it easier to answer questions like “what is in production right now?”

With v0.3, the next question became harder:

Even if I can diff two prompt versions, can I review them in a way that feels closer to a real release workflow?

That is the focus of this release.

PromptLedger v0.3 adds a small but practical Prompt Review layer on top of the existing history model — while still staying local-first, SQLite-backed, and intentionally limited in scope.

Why a third part?

After the release semantics work in v0.2, the project could already answer questions like:

Which prompt does prod currently point to?
When was that label changed?
How does prod differ from staging?

But another gap became obvious.

A raw diff is useful, but in practice people often want a slightly higher-level review:

Did the prompt become stricter?
Did the tone change?
Was the output format changed from bullets to JSON?
Did safety or refusal wording get stronger or weaker?
Is this a release change or a likely regression risk?

Those are not execution questions. They are not observability questions either.

They are review questions.

So instead of adding prompt execution, external APIs, or any hosted layer, I kept the project focused and added a review workflow built entirely on top of the existing local data.

The main addition: `review`

The new command is:

promptledger review --id onboarding --from prod --to staging

This compares two refs — versions or labels — and produces a structured review output that includes:

resolved refs and versions
a semantic summary
metadata changes
label context
warning flags
a few conservative notes

This is deliberately not an evaluation system. It does not score prompts. It does not call a model. It does not guess too much.

It simply makes a prompt diff easier to interpret.

From line diff to semantic summary

Traditional diffs are still useful, and PromptLedger keeps all previous diff modes.

But v0.3 adds a new summary-oriented mode:

promptledger diff --id onboarding --from 7 --to 9 --mode summary

This produces a heuristic, rule-based semantic summary instead of a raw line diff.

The important design decision here is that the summary is:

local
deterministic
transparent
intentionally conservative

In other words: it only says something when the change looks clear enough.

Current summary categories include:

tone changes
tighter or looser constraints
output format changes
broader vs more specific prompts
safety wording changes
length requirement changes
refusal or policy wording changes

This is not meant to replace reading the actual prompt.
It is meant to make review faster and more structured.

Why heuristics instead of an LLM?

Because using an external model for review would push the project in exactly the wrong direction.

It would introduce:

network dependence
nondeterministic behavior
more configuration
harder testing
less trust in the output

PromptLedger is supposed to be inspectable.
If it says “constraints tightened”, that should come from understandable rules, not hidden inference.

That made a heuristic system the better fit.

It is not as flexible as an LLM-based reviewer, but it is much easier to reason about — and much more aligned with the philosophy of the project.

Reviews now export cleanly to markdown

Another practical gap in earlier versions was sharing review output.

Reading a diff in the terminal is fine.
Sharing it in a PR, issue, or internal document is another matter.

So v0.3 adds markdown export for reviews:

promptledger export review --id onboarding --from prod --to staging --format md --out review.md

The exported markdown is deterministic and structured.
It includes:

a title
compared refs
semantic summary
text diff note
metadata changes
warnings
label information
a reviewer notes placeholder

That makes PromptLedger more useful in real workflows without adding any collaboration backend.

The file is still just a file.
You can paste it into GitHub, attach it to docs, or keep it locally.

Metadata changes are now first-class in reviews

Prompt text is only part of the story.

A release change may also involve metadata updates:

reason
author
tags
env
metrics

Earlier versions could already diff metadata, but v0.3 makes metadata changes part of the review object itself.

That matters because some changes are metadata-only.
In those cases, PromptLedger can now say that clearly instead of pretending there was meaningful prompt drift.

This is a small feature, but an important one.
It avoids overclaiming, which is one of the easiest ways to make a review tool feel unreliable.

Warning flags and likely drift hotspots

Prompt review is not just about summarizing what changed.
It is also about drawing attention to changes that deserve extra care.

v0.3 adds simple warning flags for cases such as:

comparing the same version to itself
environment changes
metadata-only changes
policy or refusal wording changes that may affect behavior drift

These warnings are not meant to be dramatic.
They are meant to make the review output more useful in practice.

For example, a wording change around refusal or safety does not automatically mean the prompt got worse — but it probably means a reviewer should read it more carefully.
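
A rough sketch of how such flags could be derived from information the ledger already has. The function name, its parameters, and the warning strings are hypothetical; only the three conditions come from the list above.

```python
def collect_warnings(from_version, to_version, text_changed, meta_before, meta_after):
    """Return advisory flags; an empty list means nothing needs extra attention."""
    warnings = []
    if from_version == to_version:
        warnings.append("same version compared to itself")
    if meta_before.get("env") != meta_after.get("env"):
        warnings.append("environment changed")
    if not text_changed and meta_before != meta_after:
        warnings.append("metadata-only change")
    return warnings


flags = collect_warnings(
    from_version=7,
    to_version=9,
    text_changed=False,
    meta_before={"env": "staging", "author": "ertugrul"},
    meta_after={"env": "prod", "author": "ertugrul"},
)
print(flags)
```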

The Python API now returns structured review objects

The review workflow is not just a CLI feature.

The Python API now exposes review results as structured domain objects rather than just formatted strings.

That means callers can programmatically access:

resolved refs
semantic summary items
metadata changes
warnings
notes
label context

This keeps the CLI and the API aligned while also making formatting a separate concern.

That separation turned out to be one of the cleaner changes in this version:

review logic lives in one place
rendering logic lives elsewhere
markdown export and terminal rendering are both built on the same review result

Small project, but still worth keeping modular.
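
The separation can be sketched roughly like this. `ReviewResult` and `render_markdown` are illustrative names, not the real PromptLedger API, and the field set is a reduced assumption: the review object only holds data, and a renderer turns it into text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewResult:
    """Plain data: no formatting decisions live here."""
    prompt_id: str
    from_ref: str
    to_ref: str
    summary: tuple[str, ...] = ()
    warnings: tuple[str, ...] = ()


def render_markdown(review: ReviewResult) -> str:
    """One of possibly several renderers built on the same review object."""
    lines = [
        f"# Review: {review.prompt_id}",
        "",
        f"Compared: {review.from_ref} -> {review.to_ref}",
        "",
        "## Semantic summary",
    ]
    lines += [f"- {item}" for item in review.summary] or ["- (none)"]
    lines += ["", "## Warnings"]
    lines += [f"- {item}" for item in review.warnings] or ["- (none)"]
    return "\n".join(lines) + "\n"


review = ReviewResult(
    prompt_id="onboarding",
    from_ref="prod",
    to_ref="staging",
    summary=("constraints tightened",),
)
markdown = render_markdown(review)
print(markdown)
```

A terminal renderer would take the same `ReviewResult` and produce different text, which is what keeps the CLI and the markdown export from drifting apart.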

UI update: review without write access

The Streamlit UI is still read-only.
That did not change.

What changed is that the comparison view now surfaces review information more clearly:

semantic summary
warnings
metadata diff
side-by-side prompt comparison
line diff

This keeps the UI aligned with the CLI review flow without turning it into an editor.

That constraint still matters.
The UI is there to inspect history, not to mutate it.

What did not change

Just as important as the new features is what was left out.

v0.3 does not add:

a hosted registry
prompt execution APIs
agent tooling
telemetry pipelines
tracing dashboards
cloud sync
automatic scoring
evaluation harnesses

There are already plenty of tools going in those directions.

PromptLedger is still trying to do one narrower thing well:
store, compare, review, and export prompt changes locally.

No schema expansion was needed

One part of this release that I particularly liked: the review workflow did not require turning the database into something more complicated.

SQLite remains the single source of truth.
The review layer is generated from existing prompt versions, labels, and metadata.

That kept the implementation smaller and the migration story simpler.

Not every useful feature needs a bigger schema.
Sometimes the better move is to extract more value from the structure that is already there.

Closing

v0.3 did not try to make PromptLedger smarter in a flashy way.
It tried to make it more reviewable.

The result is still a local tool.
Still inspectable.
Still deterministic where possible.
Still intentionally limited.

But now it is easier to answer a more realistic question:

Not just “what changed?” — but “how should I review this change before I move it forward?”

That is a better place for the project to be.

Links

PyPI: PyPI
GitHub: GitHub
LinkedIn: LinkedIn
Website: Website

DataLens: A Read-Only Image Dataset Sanity Checker

Ertugrul — Tue, 20 Jan 2026 13:43:50 +0000

DataLens: A Read‑Only Image Dataset Sanity Checker

Training a model rarely fails loudly.

Most of the time, it kind of works — loss decreases, accuracy moves, but the results feel unstable, brittle, or just wrong.

In my experience, when that happens, the root cause is often not the model, but the dataset.

So I built DataLens: a lightweight, read‑only sanity checker for image datasets.

The Problem: Silent Dataset Failures

Before training even starts, datasets often contain issues like:

Corrupted or unreadable images
Duplicate or near‑duplicate samples
Broken CSV → image mappings
Large numbers of orphan images
Severe class imbalance
Extremely small images or extreme aspect ratios
Mixed image modes (RGB / RGBA / grayscale)

None of these necessarily crash training.
They just quietly degrade everything downstream.

Why Read‑Only Matters

Many tools try to auto‑fix datasets.

I deliberately didn’t.

DataLens follows a simple rule:

Inspect, don’t mutate.

No files are moved
No labels are rewritten
No assumptions are silently applied

The tool’s job is to surface problems clearly, so you can decide what to do next.

What DataLens Does

DataLens is a Streamlit‑based audit tool with two modes.

Mode A — Images Only

For raw image folders:

Recursively scans images
Detects corrupted files
Finds exact and near‑duplicate images
Optionally infers classes from subfolders
Correctly handles unlabeled datasets (no fake “missing label” warnings)

Mode B — Images + Labels (CSV)

For supervised datasets:

Robust CSV reading (UTF‑8 with fallback)
Automatic filename & label column detection
Support for IDs without file extensions
Optional label normalization
Coverage analysis:
- How many CSV rows actually resolve to images?
Orphan analysis:
- How many images are never referenced by the CSV?
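
Coverage and orphan analysis boil down to set arithmetic. The sketch below is an illustration rather than DataLens's real code: `coverage_and_orphans` is a hypothetical helper, and matching CSV entries to files by filename stem is an assumption chosen to show how IDs without extensions can still resolve.

```python
from pathlib import PurePosixPath


def coverage_and_orphans(csv_names, image_paths):
    # Index images by stem so "cat_001" matches "imgs/cat_001.png".
    # Assumption: stems are unique across the dataset.
    by_stem = {PurePosixPath(p).stem: p for p in image_paths}
    resolved, missing = {}, []
    for name in csv_names:
        stem = PurePosixPath(name).stem
        if stem in by_stem:
            resolved[name] = by_stem[stem]
        else:
            missing.append(name)
    orphans = sorted(set(image_paths) - set(resolved.values()))
    coverage = len(resolved) / len(csv_names) if csv_names else 0.0
    return coverage, missing, orphans


csv_names = ["cat_001", "dog_002.jpg", "bird_003"]
image_paths = ["imgs/cat_001.png", "imgs/dog_002.jpg", "imgs/extra_999.png"]
coverage, missing, orphans = coverage_and_orphans(csv_names, image_paths)
print(f"coverage={coverage:.2f} missing={missing} orphans={orphans}")
```

Nothing is moved or renamed; the function only reports what does and does not line up.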

Duplicate Detection That Actually Helps

Exact duplicates are easy.
Near‑duplicates are not.

DataLens supports three methods:

sha256 — byte‑exact duplicates
quick hash — fast approximation
pHash — visually similar images (resized, recompressed, slightly cropped)

This is especially useful for datasets collected via scraping or merging multiple sources.
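
The byte-exact case is the easiest one to sketch. The helper below is hypothetical and works on in-memory bytes to stay self-contained; a real scan would read the bytes from disk. Perceptual hashing (pHash) needs an image library and is not shown here.

```python
import hashlib
from collections import defaultdict


def sha256_groups(blobs: dict[str, bytes]) -> list[list[str]]:
    """Group names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for name in sorted(blobs):  # sorted for deterministic output
        digest = hashlib.sha256(blobs[name]).hexdigest()
        groups[digest].append(name)
    return sorted(group for group in groups.values() if len(group) > 1)


blobs = {
    "a.png": b"\x89PNG-one",
    "b.png": b"\x89PNG-two",
    "copy_of_a.png": b"\x89PNG-one",
}
duplicate_groups = sha256_groups(blobs)
print(duplicate_groups)
```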

Data Hygiene Warnings

Beyond basic checks, DataLens flags issues that usually show up too late:

Very small images (e.g. <64px)
Extreme aspect ratios
High RGBA share (alpha channel surprises)
High image mode variance
Extension mismatches between CSV references and actual files

These are the kinds of things that quietly break training pipelines or augmentations.
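
As an illustration, these checks can be expressed over simple `(width, height, mode)` tuples, for example the values an image library reports for each file. The function and its thresholds are assumptions for this sketch: only the 64px figure comes from the list above, while `max_aspect` and `max_rgba_share` are made-up defaults.

```python
def hygiene_warnings(images, min_side=64, max_aspect=4.0, max_rgba_share=0.2):
    """images: list of (width, height, mode) tuples."""
    if not images:
        return []
    small = sum(1 for w, h, _ in images if min(w, h) < min_side)
    extreme = sum(
        1 for w, h, _ in images if max(w, h) / max(min(w, h), 1) > max_aspect
    )
    rgba_share = sum(1 for _, _, mode in images if mode == "RGBA") / len(images)
    modes = sorted({mode for _, _, mode in images})

    warnings = []
    if small:
        warnings.append(f"{small} image(s) smaller than {min_side}px on one side")
    if extreme:
        warnings.append(f"{extreme} image(s) with aspect ratio above {max_aspect}")
    if rgba_share > max_rgba_share:
        warnings.append(f"RGBA share is {rgba_share:.0%}")
    if len(modes) > 1:
        warnings.append(f"mixed image modes: {', '.join(modes)}")
    return warnings


images = [(224, 224, "RGB"), (32, 32, "RGB"), (1000, 100, "RGB"), (224, 224, "RGBA")]
report = hygiene_warnings(images)
for line in report:
    print(line)
```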

Outputs You Can Share

After a run, DataLens produces:

An interactive dashboard (Streamlit)
A deterministic dataset_report.md
An issues.csv containing:
- missing images
- orphan images
- corrupted files
- duplicate groups

The report is designed to be:

commit‑friendly
reviewable
attachable to issues or PRs

Design Principles

I kept the scope intentionally tight:

Read‑only
Deterministic
Transparent
Pre‑training focused

This is not a dataset cleaning tool.
It’s a dataset inspection tool.

When I Use DataLens

Before starting any new training run
When receiving datasets from external sources
When debugging unstable or suspicious training behavior
As a lightweight QA step before investing GPU hours

Final Thought

We spend a lot of time tuning models.

But models are only as good as the data we feed them — and bad data usually doesn’t announce itself.

DataLens is my way of making datasets talk.

PromptLedger v0.2 — Labels, Status, and Better Diffs

Ertugrul — Tue, 13 Jan 2026 14:44:54 +0000

Devlog — Part 2

What changed after the first release, and why those changes matter in practice.

In the first part, I introduced PromptLedger as a deliberately small, local-first tool for treating prompts like code.

Since then, the project has moved from a minimal prompt history tracker to something closer to a prompt release ledger — without adding servers, agents, or execution layers.

This post covers what changed in v0.2, why those changes exist, and how they affect real prompt workflows.

Why a second part?

After publishing Part 1, the most common follow-up questions were:

Which prompt is actually in production right now?
How do I compare prod vs staging without remembering version numbers?
Can I see when a release pointer changed?

Answering these questions required release semantics, not more prompt editing features.

That is the focus of v0.2.

Label history: prompts need an audit trail

In v0.1, labels existed as movable pointers, similar to git tags. However, once a label moved, its previous state was lost.

In v0.2, every label update is recorded in an append-only history log.

This means you can now answer questions like:

What prompt was in production yesterday?
When did prod move from version 7 to version 9?

promptledger label history --id onboarding

Under the hood, label history is implemented as a separate label_events table. Label pointers remain mutable, but the history is immutable.

This keeps the system simple while adding real auditability.
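
A minimal sketch of the "mutable pointer, immutable history" idea using Python's `sqlite3`. Only the `label_events` table name comes from this post; the column names and the `set_label` helper are assumptions and probably do not match the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE labels (
        prompt_id TEXT NOT NULL,
        label     TEXT NOT NULL,
        version   INTEGER NOT NULL,
        PRIMARY KEY (prompt_id, label)
    );
    CREATE TABLE label_events (
        id          INTEGER PRIMARY KEY AUTOINCREMENT,
        prompt_id   TEXT NOT NULL,
        label       TEXT NOT NULL,
        old_version INTEGER,
        new_version INTEGER NOT NULL
    );
    """
)


def set_label(prompt_id, label, version):
    row = conn.execute(
        "SELECT version FROM labels WHERE prompt_id = ? AND label = ?",
        (prompt_id, label),
    ).fetchone()
    old_version = row[0] if row else None
    with conn:  # one transaction: move the pointer and log the event together
        conn.execute(
            "INSERT OR REPLACE INTO labels (prompt_id, label, version) VALUES (?, ?, ?)",
            (prompt_id, label, version),
        )
        conn.execute(
            "INSERT INTO label_events (prompt_id, label, old_version, new_version) "
            "VALUES (?, ?, ?, ?)",
            (prompt_id, label, old_version, version),
        )


set_label("onboarding", "prod", 7)
set_label("onboarding", "prod", 9)
history = conn.execute(
    "SELECT old_version, new_version FROM label_events ORDER BY id"
).fetchall()
current = conn.execute("SELECT version FROM labels").fetchall()
print(history, current)
```

The `labels` row is overwritten on every move, while `label_events` only ever receives inserts, so the question "when did prod move from 7 to 9?" stays answerable.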

Label-based diff: stop thinking in version numbers

Version numbers are great for storage, but humans think in environments.

v0.2 allows diffs to be expressed in terms of labels:

promptledger diff --id onboarding --from prod --to staging

This resolves labels to their underlying versions and performs a normal diff.

The important detail is that nothing new is stored. Labels are only references; diffs always operate on immutable prompt versions.
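
Conceptually, resolving a reference is a small lookup that happens before the diff. The helper below is a hypothetical sketch, not the CLI's real implementation; in particular, treating any all-digit reference as a version number is an assumption.

```python
def resolve_ref(ref: str, labels: dict[str, int]) -> int:
    """Turn '7' or 'prod' into a concrete version number."""
    if ref.isdigit():
        return int(ref)
    if ref in labels:
        return labels[ref]
    raise KeyError(f"unknown label or version: {ref!r}")


labels = {"prod": 7, "staging": 9}
pair = (resolve_ref("prod", labels), resolve_ref("staging", labels))
print(pair)
```

After this step the diff only ever sees version numbers, which is why nothing new has to be stored.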

Status command: a one-line overview

Another recurring problem was simply understanding the current state of prompts.

The new status command provides a concise, git-style overview:

promptledger status

For each prompt, it shows:

The latest version number
The timestamp of that version
Active labels and the versions they point to

This is intentionally read-only and summary-focused. If you need details, you still drill down using list, show, or diff.

Better diff modes

Prompt changes are not always best reviewed line-by-line.

v0.2 introduces multiple diff modes built on top of Python’s difflib:

unified — the default, git-style view
context — useful for wider structural changes
ndiff — character-level insight for small edits
metadata — diff only metadata (reason, tags, env, metrics)

promptledger diff --id onboarding --from 1 --to 2 --mode metadata

Metadata diffs are always rendered in unified format to keep them readable and deterministic.
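
For reference, this is roughly what the underlying `difflib` calls look like. The three text modes map onto `difflib.unified_diff`, `difflib.context_diff`, and `difflib.ndiff` from the standard library; the `metadata_diff` helper is a hypothetical sketch of how a metadata dict could be serialized and diffed, not PromptLedger's own function.

```python
import difflib
import json

old = "You are a helpful assistant.\nAnswer briefly.\n".splitlines(keepends=True)
new = "You are a helpful assistant.\nAnswer in one sentence.\n".splitlines(keepends=True)

unified = "".join(difflib.unified_diff(old, new, fromfile="v1", tofile="v2"))
context = "".join(difflib.context_diff(old, new, fromfile="v1", tofile="v2"))
# ndiff is still line-based, but adds "? " hint lines marking intraline changes.
ndiff = "".join(difflib.ndiff(old, new))


def metadata_diff(before: dict, after: dict) -> str:
    # Sorted keys keep the serialized form, and therefore the diff, stable.
    a = (json.dumps(before, indent=2, sort_keys=True) + "\n").splitlines(keepends=True)
    b = (json.dumps(after, indent=2, sort_keys=True) + "\n").splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="v1.meta", tofile="v2.meta"))


meta = metadata_diff(
    {"env": "staging", "tags": ["onboarding"]},
    {"env": "prod", "tags": ["onboarding"]},
)
print(unified)
print(meta)
```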

UI update: history without write access

The Streamlit UI remains strictly read-only, but v0.2 adds label history visibility.

You can now:

Inspect prompt timelines
See which labels point where
Review label movements over time

All mutations still go through the CLI or Python API.

This constraint is intentional: the UI is for inspection, not experimentation.

What did not change

Several things were deliberately left out:

No prompt execution or playground
No agent framework
No cloud sync or backend service
No automatic evaluation or scoring

PromptLedger is still a ledger, not an environment.

Migration and compatibility

v0.2 introduces a database schema migration, but it is fully backwards compatible.

Existing databases are upgraded in place, with no data loss.

The new tables add information, but do not change existing semantics.

Closing

v0.2 did not make PromptLedger bigger — it made it clearer.

By adding release semantics, auditability, and better inspection tools, prompts can now be reviewed and promoted with the same discipline as code.

No servers. No dashboards. No magic.

Just history, diffs, and intent — stored locally.

Links

PyPI: PyPI
GitHub: GitHub
LinkedIn: LinkedIn
Website: Website

Hysteresis in Neural Networks — Part 1

Ertugrul — Fri, 09 Jan 2026 14:09:14 +0000

Training Order Is Not Innocent

What if seeing the same data is not enough?

A common, almost implicit assumption in machine learning is that if a model is trained on the same dataset with the same architecture and optimizer, it should end up learning essentially the same thing. Training order is usually treated as an implementation detail — a convenience, not a defining factor.

In this post, I show a simple but striking counterexample:

Even when two neural networks see exactly the same data, the order in which the data is presented can determine what the model permanently remembers and what it completely forgets.

This is the first part of a short series on hysteresis in neural networks. Here, I focus only on observable behavior (accuracy and forgetting). In the next part, we will look inside the model and explain why this happens geometrically.

The Core Question

Assume we have two datasets, A and B.

We train the same model in two different ways:

SAB: first on A, then on B
SBA: first on B, then on A

Crucially:

The architecture is identical
The optimizer and hyperparameters are identical
The random seed is fixed
The union of the data is the same: A ∪ B

The only difference is chronological order.

The standard intuition is that this should not matter.

This intuition is wrong.

Experimental Setup (Intentionally Simple)

To avoid hiding effects behind complexity, I used a deliberately minimal setup.

Dataset: MNIST
Split:
- A = digits {0,1,2,3,4}
- B = digits {5,6,7,8,9}
Model: small CNN + MLP head
Training: 20 epochs total (10 per phase)

Several things were explicitly disabled:

No Batch Normalization
No data augmentation
Deterministic initialization (fixed seed)

This ensures that any difference we observe is not an artifact of randomness or regularization, but a consequence of training order alone.

What Happens During Training?

Let’s start with the SAB scenario: the model learns A first, then B.

SAB: A → B

Observed behavior:

During the first phase, accuracy on A quickly rises to ~99%
Accuracy on B remains near 0% (expected)
After switching to B:
- Accuracy on B rises to ~99%
- Accuracy on A collapses to 0%

SBA: B → A

The mirror experiment produces the mirror result:

Observed behavior:

During the first phase, accuracy on B rises to ~99%
Accuracy on A remains near 0%
After switching to A:
- Accuracy on A rises to ~99%
- Accuracy on B collapses to 0%

Accuracy Curves

SAB accuracy over epochs

Description: Line plot showing acc_A, acc_B, and acc_full across epochs. A vertical dashed line marks the phase transition (A → B). Accuracy on A drops sharply after the transition.

SBA accuracy over epochs

Description: Same plot as above, but for SBA. Accuracy on B drops sharply after switching to A.

A Subtle but Important Observation

Despite the dramatic forgetting, overall accuracy on the full test set remains around 50% in both cases.

This is not a contradiction.

Each model performs extremely well on half of the classes and completely fails on the other half. Aggregated metrics hide this asymmetry.

Two models can have similar overall accuracy while representing fundamentally different worlds internally.

Quantifying the Effect: Hysteresis Loss

We can define a simple, order-dependent metric:

\[ \mathcal{L}_{\text{hyst}}(A) = \left| \text{Acc}_A(SAB) - \text{Acc}_A(SBA) \right| \]

In this experiment:

Acc_A(SAB) ≈ 0.00
Acc_A(SBA) ≈ 0.99

This yields a hysteresis loss close to 1.0, i.e. the maximum possible difference.

The same holds symmetrically for dataset B.

Hysteresis summary

Description: Bar chart showing absolute accuracy differences between SAB and SBA.
Hysteresis for the individual subsets A and B is near-maximal, while hysteresis in the aggregate (full test accuracy) is close to zero and visually compressed due to the shared scale.

This highlights a key point: global performance metrics can remain almost invariant, even when class-conditional behavior is maximally path-dependent.
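
The arithmetic is small enough to write out. The sketch below plugs in the rounded accuracies reported above; the assumption that A and B each make up exactly half of the test set is mine and only approximately true for MNIST.

```python
def hysteresis(acc_sab: float, acc_sba: float) -> float:
    """Absolute accuracy gap between the two training orders."""
    return abs(acc_sab - acc_sba)


# Rounded final accuracies from the two runs described in this post.
acc = {
    "SAB": {"A": 0.00, "B": 0.99},
    "SBA": {"A": 0.99, "B": 0.00},
}

# Assumption: both subsets contribute equally to the full test set.
for run in acc.values():
    run["full"] = 0.5 * run["A"] + 0.5 * run["B"]

loss = {
    subset: hysteresis(acc["SAB"][subset], acc["SBA"][subset])
    for subset in ("A", "B", "full")
}
print(loss)
```

Per-subset hysteresis comes out at 0.99 while the aggregate comes out at 0.0, which is exactly the masking effect the bar chart shows.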

What This Means

This result shows that training order is not just an optimization detail.

The network does not converge to a single, order-independent solution
Learning leaves path-dependent traces in weight space
Once the model commits to one subset, returning to a balanced representation is not trivial

This behavior is strongly reminiscent of hysteresis in physical systems, where the final state depends on the path taken, not only on the endpoint.

What This Post Does Not Explain (Yet)

This post only shows that hysteresis exists.

It does not explain:

Whether the two final models lie in different basins of attraction
Whether their internal representations are aligned or incompatible
Whether one can smoothly interpolate between them without loss spikes

These questions require looking at the geometry of the weight space, not just accuracy curves.

That is exactly what Part 2 will address.

Reproducibility

All experiments were run with a fixed configuration and deterministic setup. The code used for this post (Phase 1) is available on GitHub:

Hysteresis in Neural Networks

The geometric analysis (weight trajectories, representation similarity, interpolation barriers) will be released together with the next post.

Closing Thoughts

If training order alone can erase entire subsets of knowledge, then:

Curriculum learning has hidden costs
"Same data" does not imply "same model"
Optimization is not just minimization — it is a history-dependent process

In the next post, we will open the model and examine how these irreversible choices are encoded in the geometry of neural manifolds.

Part 2: Inside the Weight Space — Geometry of Hysteresis

Links

LinkedIn: LinkedIn
GitHub: GitHub
Website: Website

Deterministic Decision Making in Non-Deterministic Environments: Why all my projects fight the same problem

Ertugrul — Sun, 04 Jan 2026 02:34:46 +0000

The real world is not deterministic.

Data is noisy.
Sensors drift.
Timing slips.
Humans are inconsistent.
Systems change quietly.

And yet, we expect systems to do one thing reliably:

make decisions.

Ideally, those decisions should be:

repeatable,
explainable,
auditable.

This is the question that connects all of my work:

How can decision-making systems remain deterministic in a non-deterministic world?

This is not a project list.
It is a description of the problem I keep returning to.

The Model Fallacy

For a long time, I followed the standard recipe:

more data
larger models
better metrics

Real systems challenged that belief.

The model can be correct.
The system can still fail.

The root cause is often not the loss function,
not the optimizer,
not the architecture.

It is the system design.

Hidden state.
Implicit dependencies.
Untracked changes.
Unmeasured timing effects.

At that point, one thing became clear:

Decision-making is not only an ML problem.
It is a systems problem.

One Question, Different Layers

That realization forced me to stop thinking in terms of single projects.

Instead, I began exploring the same question across different layers of the stack —
not to accumulate work,
but to expose failure modes.

Same Question, Different Layers

Layer	Example Projects	What It Revealed
Prompt / Decision Logic	PromptLedger	If prompts are not versioned like code, decision logic becomes unreliable.
System Design	Bad Decision System	Determinism does not come from the model, but from design discipline.
Perception → Action	JumpNet (real-time latency failures)	A correct decision made at the wrong time is still wrong.
Hardware / Edge	Pico Trend Alarm, Sound Classifier, Mini SCADA	Noisy sensors and tight resources force explicit FSMs and hysteresis.
Data Control	Custom data collectors, viewers, logging tools	Data you don’t control will control system behavior.

Decision Logic: The Prompt Layer

While building agentic workflows, I noticed a recurring issue:

Prompts behave like executable logic,
but are rarely treated as such.

Which prompt is in production?
Why did it change?
What behavior shifted as a result?

PromptLedger emerged from this gap.

Not as a prompt playground,
but as an attempt to make decision logic inspectable, versioned, and deterministic.

If the logic that guides decisions is unstable,
the system itself cannot be trusted.

System Design: When Correct Models Fail

To isolate system-level failure, I intentionally built a flawed decision pipeline.

Same inputs.
Same model.
Different outputs across identical runs.

The cause was not randomness in the model,
but global state, side effects, and hidden coupling.

That experiment reinforced a core principle:

Determinism is a property of design,
not of the model.

Perception → Decision → Action: Timing Matters

In JumpNet, offline metrics looked nearly perfect.

In real-time execution, the system failed.

A 50 ms delay was enough to alter behavior.
Minor prediction errors cascaded into visible mistakes.

The lesson was unavoidable:

A correct decision made at the wrong time
is still a wrong decision.

Determinism is about when as much as what.

Hardware: Constraints Shape Decisions

Deploying models on embedded hardware exposes assumptions that rarely surface in simulation.

Limited memory.
Noisy sensors.
Strict latency budgets.
Restricted numerical precision.

Models simplify.
Control logic becomes explicit.
FSMs and hysteresis replace soft heuristics.

Here, determinism is not optional.

It is required for the system to function at all.
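
As a concrete illustration of hysteresis used as control logic, here is a minimal two-state alarm with separate on and off thresholds. It is a generic sketch written for this post, not code from Pico Trend Alarm or the other projects, and the class name and threshold values are made up.

```python
class HysteresisAlarm:
    """Two-state FSM: turns on above one threshold, off below a lower one."""

    def __init__(self, on_above: float, off_below: float):
        if off_below >= on_above:
            raise ValueError("off_below must be lower than on_above")
        self.on_above = on_above
        self.off_below = off_below
        self.active = False

    def update(self, reading: float) -> bool:
        if not self.active and reading > self.on_above:
            self.active = True
        elif self.active and reading < self.off_below:
            self.active = False
        # Readings inside the band never change the state.
        return self.active


alarm = HysteresisAlarm(on_above=30.0, off_below=25.0)
readings = [24.0, 31.0, 29.0, 26.0, 24.5, 29.9]
states = [alarm.update(r) for r in readings]
print(states)
```

A noisy sensor hovering around a single threshold would make the output flicker; with the gap between the two thresholds, the same sequence of readings always produces the same sequence of states.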

Data: Control the Inputs or Lose the System

When a system behaves unexpectedly, I start by inspecting the data.

That led me to build custom:

data collection tools
logging pipelines
dataset viewers
synchronization and inspection utilities

Because one rule keeps repeating itself:

Uncontrolled data produces non-deterministic behavior.

This Is Not a CV

This text is not a list of skills or frameworks.

It is a stance.

I am not optimizing for larger models.
I am not chasing more impressive demos.

I am interested in one problem:

Reliable decision-making under real-world uncertainty.

Conclusion

Deterministic Decision Making in Non-Deterministic Environments

This is not a slogan.
It is a filter.

It determines:

which problems I choose to work on,
which ones I deliberately avoid,
and where I draw the line.

That is why the work spans multiple domains.

They all wrestle with the same question.

If you care more about building reliable systems than bigger models,
and about understanding failure modes rather than showcasing demos,
you may find the following useful:

GitHub: https://github.com/ertugrulmutlu
dev.to series: https://dev.to/ertugrulmutlu
Try PromptLedger: pip install promptledger

PromptLedger: Local-first prompt version control

Ertugrul — Sat, 03 Jan 2026 21:18:11 +0000

Treat prompts like code: version them locally, diff them, label releases, and inspect history — without any backend services.

Prompt engineering has quietly become production work. Prompts evolve, regress, get tuned for edge cases, and eventually land in “prod”. Yet most teams still track them in scratch files, notebooks, or chat logs.

PromptLedger is a deliberately small tool that fixes this by treating prompts like code: every change is versioned, diffable, and labeled — all stored locally in a single SQLite database.

This post is a technical deep dive into how PromptLedger works, what data it stores, and why its design is intentionally boring.

Design goals

PromptLedger is built around a few strict constraints:

Local-first: no server, no SaaS, no telemetry
Single source of truth: one SQLite file
Git-aware: works naturally inside repositories
Read-only UI: all writes go through the CLI/API
Deterministic output: exports are stable and diffable

If you are looking for an agent framework or a prompt playground, this is not it.

Architecture overview

The codebase is intentionally small and split by responsibility:

cli.py – argument parsing and user-facing commands
core.py – domain logic (PromptLedger class)
db.py – SQLite connection, schema, migrations
ui.py – Streamlit-based read-only viewer

There are no background services and no long-running processes. PromptLedger runs when you invoke it and exits.

Storage and path resolution

PromptLedger always stores data locally in promptledger.db.

Resolution rules are deterministic:

Inside a git repo: <repo_root>/.promptledger/promptledger.db
Outside git: <cwd>/.promptledger/promptledger.db
Override with PROMPTLEDGER_HOME=/custom/path
Hard override via PromptLedger(db_path="/abs/path/to.db")

This avoids accidental duplication when running commands from nested directories and keeps prompt data out of git history by default.
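
The resolution order can be sketched as a single function. This is an illustration of the rules above, not the actual code in db.py: `resolve_db_path` and `find_git_root` are hypothetical names, and placing the database directly inside `PROMPTLEDGER_HOME` is an assumption.

```python
import os
import tempfile
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Walk upwards until a directory containing .git is found."""
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


def resolve_db_path(db_path=None, cwd=None, env=None) -> Path:
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    if db_path is not None:  # hard override wins
        return Path(db_path)
    home = env.get("PROMPTLEDGER_HOME")
    if home:
        # Assumption: the database sits directly inside PROMPTLEDGER_HOME.
        return Path(home) / "promptledger.db"
    root = find_git_root(cwd) or cwd
    return root / ".promptledger" / "promptledger.db"


with tempfile.TemporaryDirectory() as tmp:
    repo = Path(tmp) / "repo"
    nested = repo / "src" / "prompts"
    nested.mkdir(parents=True)
    (repo / ".git").mkdir()
    from_nested = resolve_db_path(cwd=nested, env={})
    from_root = resolve_db_path(cwd=repo, env={})
    expected = repo / ".promptledger" / "promptledger.db"

print(from_nested == from_root == expected)
```

Running from a nested directory and from the repository root resolves to the same file, which is the behavior that prevents accidental duplicate databases.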

SQLite schema

Two tables make up the core of PromptLedger:

prompt_versions: immutable prompt history
labels: mutable pointers to specific versions

Each prompt version stores:

prompt_id
version
content (the prompt text)
content_hash (SHA-256)
created_at (UTC ISO timestamp)
optional metadata: reason, author, tags, env, metrics

Labels store:

prompt_id
label (e.g. prod, staging)
version
updated_at

This separation lets you move release pointers without creating new versions.

Versioning algorithm

When you add a prompt, PromptLedger:

Normalizes newlines (CRLF/CR → LF)
Hashes the normalized content
Fetches the latest version for that prompt
Skips insertion if the hash matches (no-op)
Otherwise inserts a new version with incremented number

This keeps history clean and avoids formatting-only noise.
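
The five steps above can be sketched in a few lines with `hashlib` and `sqlite3`. The column names follow the schema section, but the `add_version` helper and the reduced table (no metadata columns) are simplifications for illustration, not the real implementation in core.py.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE prompt_versions (
        prompt_id    TEXT NOT NULL,
        version      INTEGER NOT NULL,
        content      TEXT NOT NULL,
        content_hash TEXT NOT NULL,
        created_at   TEXT NOT NULL,
        PRIMARY KEY (prompt_id, version)
    )
    """
)


def normalize_newlines(text: str) -> str:
    # CRLF first, then any remaining lone CR.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def add_version(prompt_id: str, text: str) -> int:
    content = normalize_newlines(text)
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    latest = conn.execute(
        "SELECT version, content_hash FROM prompt_versions "
        "WHERE prompt_id = ? ORDER BY version DESC LIMIT 1",
        (prompt_id,),
    ).fetchone()
    if latest and latest[1] == content_hash:
        return latest[0]  # identical content: no-op
    version = (latest[0] if latest else 0) + 1
    created_at = datetime.now(timezone.utc).isoformat()
    with conn:
        conn.execute(
            "INSERT INTO prompt_versions VALUES (?, ?, ?, ?, ?)",
            (prompt_id, version, content, content_hash, created_at),
        )
    return version


v_first = add_version("demo", "Hello\r\nWorld")  # Windows line endings
v_same = add_version("demo", "Hello\nWorld")     # same text, Unix line endings
v_next = add_version("demo", "Hello World")      # real textual change
print(v_first, v_same, v_next)
```

The second call returns the existing version number instead of creating a new row, because the two inputs hash identically after normalization.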

Newline normalization

Cross-platform newline differences are a common source of useless diffs.

PromptLedger normalizes line endings before hashing and diffing, which means:

Windows CRLF and Unix LF content are treated as identical
Diff output focuses on real textual changes

Metadata model

PromptLedger supports lightweight metadata for each version:

reason – why the prompt changed
author – who made the change
tags – arbitrary labels for grouping
env – dev, staging, prod
metrics – JSON blob (accuracy, latency, cost, ratings)

This turns raw text history into something closer to an audit trail.

Labels: release-style pointers

Labels are the feature that pushes PromptLedger beyond simple history tracking.

Think of labels like git tags that move:

promptledger label set --id onboarding --version 7 --name prod
promptledger label set --id onboarding --version 9 --name staging

Now you can answer questions like:

What prompt is currently in production?
Which version was deployed last week?

Without copying or duplicating prompt content.

CLI workflow

Core commands:

init – create DB and .gitignore entry
add – add or skip a version
list – list versions
show – show content + metadata
diff – unified diff between versions
search – content + metadata search
export – deterministic JSONL / CSV
label – manage release pointers
ui – launch Streamlit viewer

A typical flow:

promptledger init
promptledger add --id demo --text "Hello"
promptledger add --id demo --text "Hello World"
promptledger diff --id demo --from 1 --to 2
promptledger label set --id demo --version 2 --name prod

Streamlit UI (read-only)

The UI is intentionally non-destructive:

Timeline view of versions
Filters by prompt id, tags, env
Full content preview
Unified diff and side-by-side comparison

All writes remain in the CLI/API path.

Export and determinism

Exports are designed for reproducibility:

JSONL uses sorted keys
CSV has a fixed column order
Repeated exports of the same data are byte-for-byte identical

This makes PromptLedger suitable for audits, reviews, and downstream tooling.
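
A minimal sketch of what makes such exports reproducible, using only `json` and `csv` from the standard library. The `COLUMNS` list and both helper functions are hypothetical and cover only a subset of fields.

```python
import csv
import io
import json

COLUMNS = ["prompt_id", "version", "content_hash", "env"]  # hypothetical subset


def sort_rows(rows):
    return sorted(rows, key=lambda r: (r["prompt_id"], r["version"]))


def export_jsonl(rows) -> str:
    # sort_keys removes any dependence on dict insertion order.
    return "".join(
        json.dumps(r, sort_keys=True, ensure_ascii=False) + "\n" for r in sort_rows(rows)
    )


def export_csv(rows) -> str:
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for r in sort_rows(rows):
        writer.writerow({c: r.get(c, "") for c in COLUMNS})
    return buf.getvalue()


rows_a = [
    {"prompt_id": "demo", "version": 2, "content_hash": "bb", "env": "prod"},
    {"prompt_id": "demo", "version": 1, "content_hash": "aa", "env": "dev"},
]
# Same data, different row order and different key order.
rows_b = [
    {"env": "dev", "content_hash": "aa", "version": 1, "prompt_id": "demo"},
    {"env": "prod", "content_hash": "bb", "version": 2, "prompt_id": "demo"},
]
print(export_jsonl(rows_a) == export_jsonl(rows_b))
print(export_csv(rows_a) == export_csv(rows_b))
```

Fixing the row order, the key order, and the line terminator is what allows two exports of the same data to be compared byte for byte.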

Security notes

PromptLedger never sends data anywhere.

It includes a minimal warning for common secret patterns (sk-, AKIA, -----BEGIN). This is advisory only and can be disabled. The responsibility remains with the user to avoid storing secrets in prompt text.
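
An advisory check like this can be very small. The three markers below are the ones named above; the function name and the plain substring matching are assumptions for illustration, and the real check may be stricter.

```python
SECRET_MARKERS = ("sk-", "AKIA", "-----BEGIN")


def secret_warnings(text: str) -> list[str]:
    """Return the markers found in the text; an empty list means no warning."""
    return [marker for marker in SECRET_MARKERS if marker in text]


found = secret_warnings("Use the key sk-test-123 when calling the API.")
clean = secret_warnings("You are a helpful onboarding assistant.")
print(found, clean)
```

A hit only triggers a warning; the prompt is still stored, which is why the check cannot replace keeping secrets out of prompt text in the first place.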

What PromptLedger is not

Not an LLM framework
Not an agent system
Not a hosted service
Not a playground

It is a local ledger for prompt evolution.

Closing

If your prompts matter enough to review, promote, and roll back, they matter enough to version properly.

PromptLedger keeps that history local, inspectable, and boring — which is exactly the point.

Contact and Links

PyPI: PyPI
GitHub: GitHub
LinkedIn: LinkedIn
Website: Website

I Intentionally Built a Bad Decision System (So You Don’t Have To)

Ertugrul — Fri, 19 Dec 2025 11:00:00 +0000

A tiny benchmark that exposes silent failure modes in AI and ML pipelines

Most AI blog posts show best practices: clean architectures, neat abstractions, and impressive demos. I decided to do the opposite.

I intentionally built a bad AI system — one that works, produces outputs, and even looks reasonable at first glance — and then compared it to a boring, well-designed version of the same pipeline.

The goal was not performance. The goal was to understand how systems fail silently when design principles are ignored.

The task: same problem, two implementations

Both systems solve the exact same problem:

Input text → extract keywords → compute a score → recommend an action

The action space is deliberately small:

WAIT_AND_SEE
BUY_MORE_STOCK
PANIC_REORDER

Keeping the task simple allows us to focus entirely on system behavior, not model quality.

The benchmark idea

The benchmark is intentionally minimal:

Take a single, fixed input text
Run it multiple times through the system
Observe whether the outputs stay stable

Why this matters:

A system that only works once is not a system — it’s a coincidence.

If the same input produces different outputs, something is fundamentally wrong at the system level.

Benchmark results: BAD vs GOOD

The following results were produced by running the same input five times through both systems.

BAD system output (excerpt)

The BAD system gradually escalates its decisions:

Run 1 → score 14, action WAIT_AND_SEE
Run 3 → score 42, action BUY_MORE_STOCK
Run 5 → score 74, action PANIC_REORDER

Same input. Same keywords. Completely different decisions.

Aggregated benchmark summary

BAD system

Runs: 5
Unique scores: 5
Scores: [14, 28, 42, 58, 74]
Unique actions: 3

GOOD system

Runs: 5
Unique scores: 1
Scores: [14, 14, 14, 14, 14]
Unique actions: 1

The GOOD system behaves like a function. The BAD system behaves like a memory leak.

Failure Taxonomy: How the BAD System Breaks

The bad system does not fail in a single obvious way. Instead, it exhibits multiple interacting failure modes that are common in real-world AI and data systems. Naming these failure modes makes them easier to detect—and harder to accidentally ship.

1) Drift

Definition: The system’s output changes over time even when the input stays exactly the same.

Root cause:

Global score accumulation across runs
State that grows monotonically without reset

Why this is dangerous:

Business logic mutates without any explicit change
Historical execution order influences current decisions
Monitoring dashboards often miss the problem because values remain “reasonable”

Drift is especially dangerous because it looks like learning—but it isn’t.

2) Non-determinism

Definition: Identical inputs produce different outputs.

Root cause:

Random noise injected into scoring
Implicit dependency on execution history

Why this is dangerous:

Bugs cannot be reliably reproduced
Test failures become flaky and untrustworthy
A/B experiments lose statistical meaning

If you can’t reproduce a decision, you can’t debug it.

3) Hidden State

Definition: Functions rely on data that is not visible in their interface or inputs.

Root cause:

Global variables such as CURRENT_SCORE, LAST_TEXT, and RUN_COUNT

Why this is dangerous:

Code cannot be understood locally
Refactoring changes behavior in non-obvious ways
New contributors unknowingly introduce regressions

Hidden state turns every function call into a guessing game.

4) Silent Corruption

Definition: The system continues to run without errors while its decisions become increasingly wrong.

Root cause:

No explicit failure signals
No invariants or sanity checks

Why this is dangerous:

Incorrect outputs propagate downstream
Problems surface only through business impact
Rollbacks become difficult or impossible

Loud failures get fixed. Silent failures get deployed.
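
One way to make a silent failure loud is a small invariant check at the point where a score leaves the scoring code. This is only a sketch: the function name and the bounds are assumptions, and a real system would derive the range from what its scoring function can legitimately produce.

```python
def check_score(score: int, lower: int = 0, upper: int = 100) -> int:
    """Raise instead of passing an implausible score downstream.

    The default bounds are hypothetical placeholders.
    """
    if not lower <= score <= upper:
        raise ValueError(
            f"score {score} outside expected range [{lower}, {upper}]"
        )
    return score


print(check_score(14))  # 14

try:
    check_score(740)
except ValueError as exc:
    print(exc)  # score 740 outside expected range [0, 100]
```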

Why This Taxonomy Matters

These failure modes rarely appear in isolation. In the BAD system, they reinforce each other:

Hidden state enables drift
Drift amplifies non-determinism
Non-determinism hides silent corruption

Understanding these patterns is more valuable than fixing any single bug—because the same taxonomy applies to much larger and more complex AI systems.

A single metric: Stability Score

To summarize system behavior, I used a single metric:

stability_score = 1 - (unique_scores / runs)

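1 - 1/runs → perfectly stable (every run produces the same score)
0.0 → completely unstable (every run produces a different score)

Because even a perfectly stable system yields one unique score, the metric never reaches 1.0: with 5 runs the ceiling is 0.8, and it approaches 1.0 as the number of runs grows.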

Stability results

BAD system → 0.0
GOOD system → 0.8

This one number already tells you which system you can trust.
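
The metric is simple enough to compute directly from the score lists in the benchmark summary. The function name below is my own choice for this sketch; the formula and the two lists are the ones shown above.

```python
def stability_score(scores: list[int]) -> float:
    """Compute 1 - (unique_scores / runs) for a list of per-run scores."""
    if not scores:
        raise ValueError("need at least one run")
    return 1 - len(set(scores)) / len(scores)


bad_scores = [14, 28, 42, 58, 74]   # from the benchmark summary
good_scores = [14, 14, 14, 14, 14]

print(stability_score(bad_scores))   # 0.0
print(stability_score(good_scores))  # 0.8
```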

Minimal Fixes: Four Small Patches That Change Everything

This is not a rewrite. These are surgical changes. Each patch removes an entire class of failure modes without introducing new abstractions or frameworks.

Patch 1 — Remove Global State

Before (BAD):

# global mutation + history dependence
GS.CURRENT_SCORE += base
return GS.CURRENT_SCORE

After (GOOD):

def score_keywords(keywords, text):
    return sum(len(w) % 7 for w in keywords) + len(text) % 13

What this fixes:

Eliminates score drift
Removes hidden history dependence
Makes the function deterministic and testable

A function that depends on global state is not a function — it’s a memory leak.

Patch 2 — Push Side-Effects to the Boundaries

Before (BAD):

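def extract_keywords(text, k):
    print("Extracting keywords...")    # side-effect: console output
    open("log.txt", "a").write(text)   # side-effect: file write, handle never closed
    tokens = tokenize(text)
    return tokens[:k]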

After (GOOD):

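def extract_keywords(text, k):
    # pure: no printing, no file writes; k is passed in explicitly
    return tokenize(text)[:k]

# side-effects handled explicitly at the edge (the caller)
logger.info("Extracting keywords")
keywords = extract_keywords(text, k)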

What this fixes:

Core logic becomes reusable
Logging becomes configurable
Unit testing becomes trivial

Side-effects inside core logic silently infect every caller that depends on it.
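
To show why testing becomes trivial, here is a self-contained sketch of the same split. The whitespace `tokenize`, the `run` wrapper, and the logger name are assumptions made for this example, not the repository's code:

```python
import logging

logger = logging.getLogger("keywords")


def tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def extract_keywords(text: str, k: int) -> list[str]:
    """Pure core logic: no printing, no file writes."""
    return tokenize(text)[:k]


def run(text: str, k: int = 3) -> list[str]:
    """Boundary function: the only place where side-effects happen."""
    logger.info("Extracting keywords")
    keywords = extract_keywords(text, k)
    logger.info("Extracted %d keywords", len(keywords))
    return keywords


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(run("Reorder stock before the weekend rush"))
```

A unit test can now call `extract_keywords` directly and compare lists, without capturing stdout or cleaning up a `log.txt` afterwards.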

Patch 3 — Make Dependencies Explicit

Before (BAD):

if GS.LAST_TEXT is not None:
    base += len(GS.LAST_TEXT) % 13

After (GOOD):

def score_keywords(keywords, text):
    base = sum(len(w) % 7 for w in keywords)
    return base + (len(text) % 13)

What this fixes:

No hidden inputs
Clear data flow
Safe refactoring

If a dependency isn’t in the function signature, it’s a liability.

Patch 4 — Name the Magic Numbers

Before (BAD):

if score > 42:
    action = "PANIC_REORDER"

After (GOOD):

from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    panic_threshold: int = 42  # named once, reviewable in one place

cfg = Config()

if score > cfg.panic_threshold:
    action = "PANIC_REORDER"

What this fixes:

Decisions become explainable
Parameters become reviewable
Behavior changes become intentional

Magic numbers turn engineering decisions into superstition.

Summary

These four patches:

Remove hidden state
Eliminate non-determinism
Make behavior explainable
Restore trust in the system

No agents. No frameworks. Just engineering discipline.

Final takeaway

The BAD system works. That’s the problem.

It fails in the most dangerous way possible: plausibly and quietly.

The GOOD system is boring, predictable, and easy to reason about — which is exactly what you want in production.

Working code is not the same as a working system.

Code & Reproducibility

All code used in this article — including the intentionally broken system, the clean implementation, and the benchmark — is available on GitHub:

👉 https://github.com/Ertugrulmutlu/I-Intentionally-Built-a-Bad-Decision-System-So-You-Don-t-Have-To

If you want to reproduce the results, run:

python compare.py

The benchmark will run the same input multiple times through both systems and show, in a few lines of output, why predictability matters more than flashy abstractions.

How Do You Actually Optimize Agents? It Depends on the Task

Ertugrul — Thu, 18 Dec 2025 18:28:14 +0000

After my recent talk on Agent-in-the-Loop systems, I was asked a seemingly simple question: How do you optimize agents?
Talk link: https://www.youtube.com/watch?v=HwCR59VuYn4&t=1888s

At first glance, this sounds like a technical question. Many people expect a concrete answer involving prompt engineering, temperature tuning, or model selection. My response, however, was far less satisfying — but far more honest:

It depends on the task.

This answer often feels like a cop-out. In reality, it reflects a deeper truth about agentic systems: you don’t optimize agents in isolation — you optimize the system they operate in.

The Common Misconception: Optimization Means Tuning the Model

When people talk about optimizing agents, they usually mean optimizing the underlying model. Adjust the prompt, lower the temperature, swap the model, and expect better behavior.

These adjustments can help at the margins, but they rarely address the root cause of failure. That’s because an agent is not just a language model.

An agent is a system composed of:

a task definition
an action space (what the agent is allowed to do)
constraints and boundaries
feedback and evaluation mechanisms
stop and escalation conditions

If these components are poorly designed, no amount of prompt tuning will make the system reliable.

Agent Optimization Is a Task Design Problem

In practice, most agent failures are task design failures.

Agents struggle when objectives are too broad, success criteria are vague, or responsibilities are overloaded. Instructions like “do your best” or “solve this end-to-end” leave too much room for interpretation and lead to unpredictable behavior.

Consider the difference between these two prompts:

Poorly framed task:

"Analyze this document and decide what to do."

This instruction hides multiple decisions inside a single step: analysis, prioritization, and action selection. The agent has no clear notion of success or failure.

Well-framed task:

"Summarize the document, estimate uncertainty, and escalate to a human if confidence falls below a defined threshold."

Here, the task is explicit, bounded, and testable. The agent’s role is clear, and human intervention is intentionally designed rather than left implicit.
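
The well-framed task can be sketched as a small function with an explicit escalation rule. Everything named here is hypothetical: `summarize` stands in for whatever model or agent call a real system would use, and the 0.7 threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    summary: str
    confidence: float  # 0.0 to 1.0, as reported by the (hypothetical) agent
    escalate: bool     # True when a human should review


def run_summary_task(document: str, summarize, threshold: float = 0.7) -> TaskResult:
    """Summarize a document and escalate when confidence is below threshold.

    `summarize` is a hypothetical callable returning (summary, confidence).
    """
    summary, confidence = summarize(document)
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return TaskResult(summary, confidence, escalate=confidence < threshold)


def fake_summarize(document: str) -> tuple[str, float]:
    """Stub for illustration only: first sentence, fixed low confidence."""
    return document.split(".")[0] + ".", 0.55


result = run_summary_task("Stock is low. Reorder soon.", fake_summarize)
print(result.escalate)  # True, because 0.55 < 0.7
```

The point of the sketch is that success, failure, and escalation are all visible in the return type, so each one can be tested separately.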

Optimizing an agent often means narrowing the task:

defining what success actually means
specifying what the agent should not do
breaking complex goals into smaller, verifiable steps

A well-framed task reduces the need for aggressive model-level optimization.

Feedback Loops Matter More Than Prompts

Another common failure point is feedback design. Agents frequently evaluate their own outputs, but self-evaluation can be misleading or overly optimistic.

Effective agent systems rely on feedback loops that are:

timely
aligned with real objectives
capable of triggering escalation

If feedback arrives too late or measures the wrong thing, the agent may appear functional while gradually drifting away from its intended behavior.

Human involvement is most valuable here — not in validating every decision, but in designing how feedback is generated and when intervention is required.

Constraints Are Not a Limitation — They Are a Guide

One of the most overlooked aspects of agent optimization is constraint design.

Constraints define:

which tools an agent can use
how often it can retry
how much context it can consume
when it must stop or ask for help

Rather than limiting performance, constraints provide structure. They prevent runaway behavior and make agent actions easier to reason about.

Constraints don’t weaken agents — they guide them.
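
As a rough sketch, the four constraints listed above can live in one frozen config object that every tool call has to pass through. The class names, the `call_tool` helper, and the choice of `RuntimeError` as the retryable failure are all assumptions for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConstraints:
    allowed_tools: frozenset[str]
    max_retries: int
    max_context_chars: int


class ConstraintViolation(Exception):
    """Raised when the agent must stop or ask for help."""


def call_tool(constraints, tool, payload, tools):
    """Run one tool call while enforcing every constraint."""
    if tool not in constraints.allowed_tools:
        raise ConstraintViolation(f"tool not allowed: {tool}")
    if len(payload) > constraints.max_context_chars:
        raise ConstraintViolation("context budget exceeded")
    last_error = None
    for _ in range(constraints.max_retries + 1):
        try:
            return tools[tool](payload)
        except RuntimeError as exc:  # treated as a retryable failure here
            last_error = exc
    raise ConstraintViolation(f"retries exhausted: {last_error}")


limits = AgentConstraints(frozenset({"search"}), max_retries=2, max_context_chars=1000)
tools = {"search": lambda q: f"results for {q}", "delete": lambda q: "deleted"}

print(call_tool(limits, "search", "stock levels", tools))  # results for stock levels
```

Calling `call_tool(limits, "delete", ...)` raises `ConstraintViolation` even though a `delete` tool exists, which is exactly the kind of stop condition that makes agent behavior easier to reason about.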

The Role of Humans in Optimized Agent Systems

In optimized Agent-in-the-Loop systems, humans are not prompt engineers or micro-managers. Their role is to design the system boundaries and supervision mechanisms.

Humans are best positioned to:

define goals and constraints
decide which failures are acceptable
interpret ambiguous situations

In other words, humans optimize the decision space, not individual decisions.

Key Takeaways

Agent optimization starts with task design, not model tuning
Prompts and temperatures are secondary levers
Feedback loops determine long-term behavior
Constraints increase reliability and predictability
Humans belong above the loop, not inside every step

Final Thoughts

Optimizing agents is not about making them smarter. It’s about making the system clearer.

When tasks are well-defined, feedback is meaningful, and constraints are explicit, agents don’t need to be aggressively optimized — they simply work better.

OpenAnima v1 — Open-Source Desktop Overlay Engine for Windows

What OpenAnima Can Do

Supported Asset Types

Built With

Features Added During Development

Why I Built It

Future Plans

Download

Final Thoughts

PromptLedger v0.6 — Turning prompt history into a local workspace dashboard

Devlog — Part 5

Workspace dashboard

Prompt detail view

What changed in v0.6

Workspace instead of viewer

Card-based interaction

Marker actions in the UI

Compare workflow

Usability improvements

Design direction

Installation

Run the dashboard

Closing

PromptLedger is becoming a tool for thinking about prompts, not just storing them.

Links

OpenAnima v0.2 Preview: Turning the Windows Desktop into a Living Canvas

What OpenAnima does

Why I built it

What changed in v0.2

Supported asset types

Metadata-driven assets

Sprite strips and spritesheets

Composite UI and HUD assets

The control panel

Website and distribution

Known limitations

What I want to explore next

What I learned

Links

PromptLedger v0.4 — Faster prompt logging, lightweight markers, and better prompt organization

Why this part was needed

Quick add: less typing during real iteration

Collection and role are now first-class metadata

Markers: small signals for important versions

Search, list, and show became more useful

The Streamlit UI is still read-only

Database changes stayed local and simple

Tests expanded around the new workflows

Design tradeoffs

What did not change

Closing thoughts

Links

PromptLedger v0.3 — Turning prompt history into a practical review workflow.

Why a third part?

The main addition: review

From line diff to semantic summary

Why heuristics instead of an LLM?

Reviews now export cleanly to markdown

Metadata changes are now first-class in reviews

Warning flags and likely drift hotspots

The Python API now returns structured review objects

UI update: review without write access

What did not change

No schema expansion was needed

Closing

Links

DataLens: A Read-Only Image Dataset Sanity Checker

DataLens: A Read‑Only Image Dataset Sanity Checker

The Problem: Silent Dataset Failures

Why Read‑Only Matters

What DataLens Does

Mode A — Images Only

Mode B — Images + Labels (CSV)

Duplicate Detection That Actually Helps

Data Hygiene Warnings

Outputs You Can Share

Design Principles

The main addition: `review`