<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vineeth N Krishnan</title>
    <description>The latest articles on Forem by Vineeth N Krishnan (@vineethnkrishnan).</description>
    <link>https://forem.com/vineethnkrishnan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3779538%2Fca113f9c-3e87-42e1-873f-0a0bc6e7ed57.png</url>
      <title>Forem: Vineeth N Krishnan</title>
      <link>https://forem.com/vineethnkrishnan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vineethnkrishnan"/>
    <language>en</language>
    <item>
      <title>How I ended up buying vinelabs.de</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Sun, 10 May 2026 17:47:53 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/how-i-ended-up-buying-vinelabsde-50l9</link>
      <guid>https://forem.com/vineethnkrishnan/how-i-ended-up-buying-vinelabsde-50l9</guid>
      <description>&lt;h1&gt;
  
  
  How I ended up buying vinelabs.de
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewh7nkrtxh3yge07kja7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewh7nkrtxh3yge07kja7.png" alt="A hand pinning a small green leaf flag onto a desk globe pointing at Germany, flat illustration, soft colors, modern editorial style." width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: I bought &lt;code&gt;vinelabs.de&lt;/code&gt; last weekend. Was not planning to. The trigger was the author field of a manifest file, the same kind you fill in for a &lt;code&gt;composer.json&lt;/code&gt;, a &lt;code&gt;package.json&lt;/code&gt;, a &lt;code&gt;Cargo.toml&lt;/code&gt;, or whatever your stack of the day calls it. The realisation was that shipping serious packages under my personal GitHub username reads like a hobby project when the code is going to sit in someone's finance pipeline. A trust problem, not a code problem. So I bought a domain. Set up an org. Built a small landing site. Here is the short version.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So here is what happened. I was in the middle of finishing up &lt;code&gt;xrechnung-kit&lt;/code&gt;, which started as a small Shopware plugin and grew into a monorepo with eight packages. I have already written about that one separately, so if you want &lt;a href="https://vineethnk.in/blog/the-shopware-plugin-that-grew-into-a-library/" rel="noopener noreferrer"&gt;the long story you can find it here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But the boring scene that mattered was this. I was filling in the manifest files for the Shopware sibling package and the small Astro showcase site that was going to live next to it. So the &lt;code&gt;composer.json&lt;/code&gt; for the PHP package on one side, the &lt;code&gt;package.json&lt;/code&gt; for the site on the other. I got to the author block, and I paused. The whole list of packages at that point was going to live under &lt;code&gt;vineethkrishnan/xrechnung-kit-*&lt;/code&gt; on Packagist, and the showcase site under my personal GitHub username too. All in my personal namespace. For a library that will sit inside finance and accounting pipelines, the vibe of "github.com/vineethkrishnan/anything" reads as a hobby project. Even if the code is solid. Even if the tests pass. The address itself does the talking before the code gets a chance to.&lt;/p&gt;
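
&lt;p&gt;For the record, the block that triggered all this is tiny. A minimal sketch of the &lt;code&gt;composer.json&lt;/code&gt; side, with the package suffix and homepage as illustrative placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "name": "vineethkrishnan/xrechnung-kit-core",
  "authors": [
    {
      "name": "Vineeth N Krishnan",
      "homepage": "https://github.com/vineethkrishnan"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;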

&lt;p&gt;That was a trust problem, not a code problem. I needed a brand.&lt;/p&gt;

&lt;p&gt;If you have ever flinched while writing your own name into a &lt;code&gt;composer.json&lt;/code&gt;, a &lt;code&gt;package.json&lt;/code&gt;, or whatever manifest your stack uses, for a package you actually want people to take seriously, you know exactly what I mean.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shortlist that did not happen
&lt;/h2&gt;

&lt;p&gt;I sat for a bit with name options. The first instinct was, of course, &lt;code&gt;.com&lt;/code&gt;. Tried &lt;code&gt;vinelabs.com&lt;/code&gt;. Already taken. Looked at &lt;code&gt;vinelabs.io&lt;/code&gt; and &lt;code&gt;vinelabs.app&lt;/code&gt; next, the standard "labs" fallbacks people reach for.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;.de&lt;/code&gt; had been in the back of my head the whole time, and I will tell you why.&lt;/p&gt;

&lt;p&gt;I have been working in German work culture for a long while now. Handled many &lt;code&gt;.de&lt;/code&gt; domains across many German shops. Shopware itself is a German company. The first XRechnung use case is German. EN 16931 is an EU thing, but XRechnung 3.0 is a federal German standard. If the projects I am putting under this brand are going to focus on the DE and EU region, which they will, then &lt;code&gt;.de&lt;/code&gt; is not a quirky choice. It is the home address.&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;vinelabs.de&lt;/code&gt;. Bought it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I set up
&lt;/h2&gt;

&lt;p&gt;The bare minimum to make a brand feel real, in order:&lt;/p&gt;

&lt;p&gt;The org &lt;code&gt;github.com/vinelabs-de&lt;/code&gt;. This is where the public-facing repos live.&lt;/p&gt;

&lt;p&gt;Two mailboxes, &lt;code&gt;info@vinelabs.de&lt;/code&gt; and &lt;code&gt;support@vinelabs.de&lt;/code&gt;. Forwarded to where they need to go. Nothing fancy.&lt;/p&gt;

&lt;p&gt;A small landing site, Astro 5 + Tailwind v4, deployed to Cloudflare Pages. The site is driven by a markdown content collection at &lt;code&gt;src/content/projects/&lt;/code&gt;. Every project I want to showcase is one markdown file with a tagline, a description, a license, and a few highlights. New project equals new file. There is no CMS, no admin panel, no database. I keep saying this about Astro to anyone who will listen, but Astro continues to be unreasonably nice when you do not need a backend.&lt;/p&gt;
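
&lt;p&gt;To give you the shape of it, one entry in &lt;code&gt;src/content/projects/&lt;/code&gt; looks roughly like this. The field values are illustrative; the real schema is whatever the collection config says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
title: xrechnung-kit
tagline: EN 16931 / XRechnung tooling for PHP and Shopware
license: MIT
highlights:
  - XRechnung 3.0 validation
  - Framework-agnostic core
---

One or two paragraphs of description, in plain markdown.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;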

&lt;h2&gt;
  
  
  Why now, and why DE
&lt;/h2&gt;

&lt;p&gt;The timing is not accidental. Germany is right in the middle of phasing in mandatory B2B e-invoicing. The receive-side mandate is already live, and the send-side mandate is rolling out behind it. EN 16931 / XRechnung 3.0 is what has to come out the other end. A small library that does that correctly, sitting under a brand that is clearly in the DE / EU lane, has a place.&lt;/p&gt;

&lt;p&gt;I should also be clear about who I am here. I am an Indian developer, not a German one. I have been working with German teams and German shops for a long time, picked up a fair bit of the working culture, handled enough .de domains and Shopware shops to feel at home in this stack. But I am not pretending to be local. The brand is in the DE / EU lane because that is where the work is, not because I am putting on a costume.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mirror trick
&lt;/h2&gt;

&lt;p&gt;Here is the part I am quietly pleased about. I did not want to actually move my repos out of my personal GitHub account. That account has my history, my issues, my CI configurations, my settings. I did not want a hard fork, a rename, or a redirect.&lt;/p&gt;

&lt;p&gt;So I wrote a tiny workflow template, &lt;code&gt;mirror-to-vinelabs.yml&lt;/code&gt;. Lives in a &lt;code&gt;workflow-templates/&lt;/code&gt; folder. I drop it into any of my personal repos, and on every push to main it syncs that repo into the &lt;code&gt;vinelabs-de&lt;/code&gt; org.&lt;/p&gt;
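
&lt;p&gt;Under the hood it is nothing clever. The guts of it are a mirror clone and a mirror push; a minimal sketch of the git side, assuming an &lt;code&gt;ORG_TOKEN&lt;/code&gt; secret with push access to the org (the token name is hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# clone the personal repo with all refs, branches and tags included
git clone --mirror https://github.com/vineethkrishnan/xrechnung-kit.git
cd xrechnung-kit.git

# push the whole thing into the org copy
git push --mirror "https://${ORG_TOKEN}@github.com/vinelabs-de/xrechnung-kit.git"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;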

&lt;p&gt;My personal repo stays the source of truth. The labs org stays the public face. If I ever pull out of the labs branding, it costs me nothing because the canonical code never moved. It is already wired up for &lt;code&gt;xrechnung-kit&lt;/code&gt;. &lt;code&gt;vaultctl&lt;/code&gt; is next, then probably a couple of the smaller tools that have outgrown my personal username.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest part
&lt;/h2&gt;

&lt;p&gt;I do not have a roadmap. There is no team. There is no monetisation plan. No funding round, no big launch.&lt;/p&gt;

&lt;p&gt;The labs domain exists because I would rather under-promise on a brand than over-promise on my own name. &lt;code&gt;xrechnung-kit&lt;/code&gt; deserved a home that says "this is built to be used", not "this is what one developer made on a long weekend." It did start on a long weekend. What it is not going to stay is a weekend project. I plan to maintain it like something that has to keep working.&lt;/p&gt;

&lt;p&gt;The V is a stem. Everything else is what grew off it.&lt;/p&gt;

&lt;p&gt;Alright, that is me done rambling for today. Hope something in here was useful to you. Catch you in the next blog, take care until then.&lt;/p&gt;

</description>
      <category>personal</category>
      <category>branding</category>
      <category>astro</category>
      <category>cloudflarepages</category>
    </item>
    <item>
      <title>The disk that filled itself</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Thu, 07 May 2026 15:44:29 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-disk-that-filled-itself-2649</link>
      <guid>https://forem.com/vineethnkrishnan/the-disk-that-filled-itself-2649</guid>
      <description>&lt;h1&gt;
  
  
  The disk that filled itself
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fthe-disk-that-filled-itself-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fthe-disk-that-filled-itself-hero.png" alt="A hard drive cabinet with its door open showing mostly empty shelves, an external gauge on the outside reading 100 percent full in red, a small ghost icon hovering near one of the shelves to hint at invisible files. Flat illustration, soft muted colors, modern editorial style."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: my homelab box hit 100 percent disk full out of nowhere. I deleted half the things I could find, &lt;code&gt;df&lt;/code&gt; still said full, &lt;code&gt;du&lt;/code&gt; said I had plenty of space. Turned out the disk was holding on to files I had already deleted, because a long-running process still had them open. &lt;code&gt;lsof +L1&lt;/code&gt; was the magic. A service restart was the fix.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So there I was, on a perfectly normal evening, ssh'd into the homelab box because something had stopped responding. The first thing I check on any "why is this dying" run is &lt;code&gt;df -h&lt;/code&gt;, almost as a reflex.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p2  450G  448G   2G  100% /
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cool. So that is why nothing is working.&lt;/p&gt;

&lt;p&gt;I have a deal with this box. It runs my self-hosted things, it does not ask for much, and once a quarter or so I prune some old container images and we move on. So I went straight to the usual cleanup playbook, mildly annoyed that I had let it fill up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker system prune &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;--volumes&lt;/span&gt;
journalctl &lt;span class="nt"&gt;--vacuum-size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;200M
apt clean
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ~/.cache/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Felt good. Watched the numbers tick down in &lt;code&gt;du&lt;/code&gt; as I went. Ran &lt;code&gt;df -h&lt;/code&gt; again, full of optimism.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;/dev/nvme0n1p2  450G  448G   2G  100% /
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Excuse me?&lt;/p&gt;

&lt;h2&gt;
  
  
  When df and du disagree
&lt;/h2&gt;

&lt;p&gt;I went and added it up the long way. &lt;code&gt;du -sh /&lt;/code&gt; took its time, came back with about 130G used. Big folders identified, nothing weird. Half the disk should have been free.&lt;/p&gt;

&lt;p&gt;But &lt;code&gt;df&lt;/code&gt; sat there, smug, telling me I had two whole gigabytes of breathing room. Same disk. Same minute.&lt;/p&gt;

&lt;p&gt;This is the moment in any disk-full story when you realise the problem is not actually the disk. It is who is asking.&lt;/p&gt;

&lt;p&gt;If you have hit this exact mismatch before, you already know where this is going. If you have not, here is the thing that took me longer to internalise than I want to admit: &lt;code&gt;df&lt;/code&gt; and &lt;code&gt;du&lt;/code&gt; are not measuring the same thing.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;du&lt;/code&gt; walks the directory tree. It adds up files it can see, file by file. If a file is not in some directory, &lt;code&gt;du&lt;/code&gt; does not know it exists.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;df&lt;/code&gt; asks the filesystem itself how many blocks are in use. The filesystem does not care about directories. It cares about which blocks have been handed out to a file, any file, anywhere.&lt;/p&gt;

&lt;p&gt;Most of the time these two views agree. The interesting case is when they do not. And the most common reason they disagree is files that are not in any directory but are still very much being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  The deleted file that is not deleted
&lt;/h2&gt;

&lt;p&gt;In Linux, &lt;code&gt;rm&lt;/code&gt; does not actually delete a file. It just removes the entry from a directory. The file's data only goes away when the last process holding it open lets go.&lt;/p&gt;

&lt;p&gt;Which means: if a process has a log file open, and you &lt;code&gt;rm&lt;/code&gt; that log file, the directory entry is gone, &lt;code&gt;du&lt;/code&gt; cannot see it, your file browser shows it as deleted, you are happy. But the process is still writing to it. The blocks are still held. &lt;code&gt;df&lt;/code&gt; is still counting them.&lt;/p&gt;

&lt;p&gt;Until that process closes the file or dies, those bytes are real, just invisible.&lt;/p&gt;
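
&lt;p&gt;You can reproduce the whole effect in two terminals on any scratch box. A minimal sketch, using &lt;code&gt;/var/tmp&lt;/code&gt; so a tmpfs-backed &lt;code&gt;/tmp&lt;/code&gt; does not muddy the numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# terminal one: make a 1G file and hold it open
dd if=/dev/zero of=/var/tmp/ghost.log bs=1M count=1024
tail -f /var/tmp/ghost.log

# terminal two: delete it and watch the tools disagree
rm /var/tmp/ghost.log      # du stops seeing it immediately
df -h /var/tmp             # ...but the gigabyte is still counted as used

# stop the tail in terminal one, and df lets go of it too
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;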

&lt;p&gt;This is the part of Linux that feels like a magic trick once you see it. &lt;code&gt;lsof&lt;/code&gt; exposes it directly.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lsof +L1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;+L1&lt;/code&gt; means "show me files with a link count less than 1", which is exactly the deleted-but-still-held case. I ran it expecting maybe a couple of stray MB. The output was a wall of text. The same process kept showing up, holding a frankly embarrassing number of "deleted" files.&lt;/p&gt;

&lt;p&gt;The culprit was not exotic. It was the docker daemon, sitting on a container's &lt;code&gt;json-file&lt;/code&gt; log that had ballooned to hundreds of gigs over the time the box had been running. Some time back, in a cleanup session I do not really remember anymore, I had &lt;code&gt;rm&lt;/code&gt;'d that log file directly, thinking I was reclaiming space. Docker had no idea I had done that. The file was gone from disk as far as I was concerned. Not gone from docker's open file descriptor.&lt;/p&gt;

&lt;p&gt;So every byte that container had been logging since that day, plus every byte before, was still there. Held. Counted by &lt;code&gt;df&lt;/code&gt;. Invisible to &lt;code&gt;du&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who has done this exact "smart" cleanup move and quietly made it worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, and the not-fix
&lt;/h2&gt;

&lt;p&gt;The actual fix was embarrassing in its simplicity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. The daemon restarted, every file descriptor it was holding got closed, every "deleted" file finally got a chance to be properly deleted, and &lt;code&gt;df&lt;/code&gt; was suddenly back to a sensible number.&lt;/p&gt;

&lt;p&gt;The not-fix, the thing that would have avoided this whole mess in the first place, is to never &lt;code&gt;rm&lt;/code&gt; an active log file. The right move on a docker container log is to truncate it in place.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;truncate&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; 0 /var/lib/docker/containers/&amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/&amp;lt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="nt"&gt;-json&lt;/span&gt;.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;truncate&lt;/code&gt; zeroes the file in place instead of unlinking it. The inode stays, docker's open file descriptor stays valid, docker keeps writing, and the disk space actually comes back. Nobody gets confused.&lt;/p&gt;

&lt;p&gt;Or, even better, configure the json-file log driver with &lt;code&gt;max-size&lt;/code&gt; and &lt;code&gt;max-file&lt;/code&gt; so it rotates itself and you never have this conversation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"log-driver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"json-file"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"log-opts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max-size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"100m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max-file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That goes in &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt;, you restart the daemon once, and then this whole class of bug stops being a thing on that box. One caveat: log options are fixed when a container is created, so existing containers keep their old unbounded logs until you recreate them.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tools I built so I do not have to do this manually again
&lt;/h2&gt;

&lt;p&gt;After this exact kind of incident, and the embarrassing number of &lt;code&gt;du -sh /*&lt;/code&gt; sessions that came before it, I went and built a few small things to take the manual labour out of disk-full nights. They are the tools I now reach for before I touch anything by hand.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;dfree&lt;/code&gt;&lt;/strong&gt; is the first one I run. It is a shell script. No arguments, no flags to remember. It scans the disk in a few passes and shows me what is taking space across docker, system caches, dev caches, and logs. Same playbook I tried to do by hand at the start of this story, except it adds the numbers correctly and shows me the docker side first.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ dfree

=== System Analysis ===

[INFO] Scanning disk usage...
450G 448G 2G 100%

[INFO] Scanning Docker usage...
Images: 18.2GB (12.4GB reclaimable)
Containers: 287GB (281GB reclaimable)
Build Cache: 4.1GB

[INFO] Scanning Developer Caches...
  - /home/vineeth/.cache: 480MB
  - /home/vineeth/.npm/_cacache: 1.1GB

[INFO] Scanning Logs...
  - /var/log/journal: 320MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the docker line. &lt;code&gt;Containers: 287GB (281GB reclaimable)&lt;/code&gt;. On the actual night this happened, I could have read that one line and known exactly where the trouble was, without going on a &lt;code&gt;find&lt;/code&gt; expedition. After the analysis, dfree asks me one item at a time what I want cleaned, and I say yes or no.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Cleanup Process ===

Prune Docker system (images, containers, networks)? [y/N] y
[INFO] Pruning Docker...
Total reclaimed space: 12.4GB

Clean system cache at /var/log/journal? [y/N] y
Clean developer cache at /home/vineeth/.npm/_cacache? [y/N] y

[SUCCESS] Cleanup complete.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For when a flat list is not enough and I want to actually see the shape of the disk, I built &lt;strong&gt;&lt;code&gt;diskdoc&lt;/code&gt;&lt;/strong&gt;, a Rust TUI that walks the filesystem in parallel and lets me browse the result like a tree. Useful when the offender is buried somewhere weird and I want to wander through the directory structure instead of reading a summary. It is not what saves you on the night of. It is what saves you the third time you keep ending up in the same neighbourhood and want to understand why.&lt;/p&gt;

&lt;p&gt;But the tool that would have actually short-circuited this whole post is &lt;strong&gt;&lt;code&gt;dockit&lt;/code&gt;&lt;/strong&gt;, a Go CLI that talks to the docker daemon directly. It has a &lt;code&gt;logs&lt;/code&gt; subcommand built for this exact failure mode.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;dockit logs
&lt;span class="go"&gt;Finding container log paths on disk...

--- CONTAINER LOG SIZES (Total: 287 GB) ---
CONTAINER            SIZE            WARNINGS
notes-app            287 GB          🚨 EXCESSIVE - Consider adding 'log-opt max-size=10m'
nextcloud            42 MB
gitea                8.3 MB
media-server         2.1 MB
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That first row is the entire war story compressed into one line. One container, no rotation, hundreds of gigabytes of json sitting on disk, and the tool literally tells me what to do about it. If I had been running &lt;code&gt;dockit logs&lt;/code&gt; on a cron and getting a ping when any single container crossed a sensible threshold, none of this would have happened. The investigation would have been "fix the log driver config" months ago, not "why is my disk lying to me" at midnight.&lt;/p&gt;

&lt;p&gt;If you want the tools, all three are open source:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dfree:&lt;/strong&gt; &lt;a href="https://github.com/vineethkrishnan/dfree" rel="noopener noreferrer"&gt;github.com/vineethkrishnan/dfree&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;diskdoc:&lt;/strong&gt; &lt;a href="https://github.com/vineethkrishnan/diskdoc" rel="noopener noreferrer"&gt;github.com/vineethkrishnan/diskdoc&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dockit:&lt;/strong&gt; &lt;a href="https://github.com/vineethkrishnan/dockit" rel="noopener noreferrer"&gt;github.com/vineethkrishnan/dockit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Two lessons I keep relearning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;df&lt;/code&gt; and &lt;code&gt;du&lt;/code&gt; measure two different worlds.&lt;/strong&gt; When they agree, life is easy. When they disagree, the answer is almost always "something is being held open". &lt;code&gt;lsof +L1&lt;/code&gt; is the single command that tells you exactly what. I have probably typed it a hundred times in my career and I still forget it exists for the first stretch of every disk-full incident.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;rm&lt;/code&gt; on an active log file is a trap.&lt;/strong&gt; It looks like cleanup. It is actually just hiding bytes from &lt;code&gt;du&lt;/code&gt; while the process keeps appending to invisible disk. Use &lt;code&gt;truncate&lt;/code&gt; if the process supports being truncated under it, signal the process to reopen its log if the app supports that, or rotate properly with logrotate or the platform's native rotation.&lt;/p&gt;
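
&lt;p&gt;For apps that cannot reopen a log on a signal, logrotate's &lt;code&gt;copytruncate&lt;/code&gt; mode does the truncate dance for you. A minimal sketch with a hypothetical path; for docker itself, the daemon.json rotation above is still the better answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# /etc/logrotate.d/myapp
/var/log/myapp/app.log {
    size 100M
    rotate 3
    # copy, then truncate in place; the writer's fd stays valid
    copytruncate
    missingok
    compress
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;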

&lt;p&gt;Early on in this incident, I was completely sure I had simply not deleted enough stuff yet. I was a few minutes away from ordering another drive. The fix was a service restart, and the cause was a &lt;code&gt;rm&lt;/code&gt; from months ago that I had thought was helpful at the time.&lt;/p&gt;

&lt;p&gt;If you have an old box with self-hosted things on it and you have ever cleaned up a "huge log file" by deleting it directly, today is a good day to run &lt;code&gt;sudo lsof +L1&lt;/code&gt; and see what your processes are still holding. Worst case you find nothing. Best case you find a sizeable chunk of your disk waiting to be freed.&lt;/p&gt;
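
&lt;p&gt;If you want a single number instead of a wall of text, this rough one-liner adds up the sizes. Treat it as an estimate, since lsof's SIZE/OFF column is an offset rather than a size for some descriptor types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# approximate total held by deleted-but-still-open files
sudo lsof +L1 | awk '{sum += $7} END {printf "%.1f GB\n", sum / 1024 / 1024 / 1024}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;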

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The thing that bothers me about this kind of bug is not the bug itself. It is that I had a wrong mental model of &lt;code&gt;rm&lt;/code&gt; for years and never really noticed, because most of the time the wrong model and the right model produce the same result. The penalty only shows up at the edges, in long-lived processes with open files, on a box you have neglected for long enough that you forget what you did last summer.&lt;/p&gt;

&lt;p&gt;So that is where I will stop. If you have a different way of catching this kind of thing earlier, or a cleaner way of dealing with active logs on a homelab box, I genuinely want to hear it, drop me a note. Otherwise, see you when the next interesting problem shows up.&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>linux</category>
      <category>diskfull</category>
      <category>docker</category>
    </item>
    <item>
      <title>MCP is the USB-C of AI tools, and most devs are still using their AI assistant like it is 2023</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Thu, 07 May 2026 13:17:44 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/mcp-is-the-usb-c-of-ai-tools-and-most-devs-are-still-using-their-ai-assistant-like-it-is-2023-5bpn</link>
      <guid>https://forem.com/vineethnkrishnan/mcp-is-the-usb-c-of-ai-tools-and-most-devs-are-still-using-their-ai-assistant-like-it-is-2023-5bpn</guid>
      <description>&lt;h1&gt;
  
  
  MCP is the USB-C of AI tools, and most devs are still using their AI assistant like it is 2023
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fmcp-usb-c-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fmcp-usb-c-hero.png" alt="A single USB-C cable in the middle of a desk with thin glowing wires fanning out to small floating app icons - chat, calendar, notes, design canvas, code editor." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So here is a small thing I noticed the other day. I was watching a friend debug a production issue, and the workflow was painful in a very specific way. Tab to their AI chat of choice, paste an error. Read the answer. Tab to Sentry, copy the stack trace. Tab back to the chat, paste the stack trace. Tab to the codebase, copy the function. Paste it again. Repeat until coffee gets cold. It honestly does not matter which AI they were using. ChatGPT, Claude, Codex, Gemini, take your pick. The flow was the same.&lt;/p&gt;

&lt;p&gt;The whole thing felt like watching someone use a phone in 2010. Functional. Slow. And clearly a generation behind something that already exists.&lt;/p&gt;

&lt;p&gt;That is the gap I want to talk about today. Because there is a very real protocol shift happening in AI tooling right now, and most developers are completely unaware of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cable drawer in your house
&lt;/h2&gt;

&lt;p&gt;Open the drawer where you keep your old chargers. Go on, I will wait.&lt;/p&gt;

&lt;p&gt;If you are anywhere over thirty, you probably have a small museum in there. Mini USB. Micro USB. The old Apple 30-pin. Lightning. That one weird Samsung cable that nobody can identify. A barrel charger from a router you threw away in 2014. Each one was the only way to talk to a specific device. Each one was useless for anything else.&lt;/p&gt;

&lt;p&gt;USB-C did not appear and instantly fix the world. It just slowly became the one cable that worked for everything. Laptop, phone, headphones, monitor, the toothbrush my wife uses, my Kindle. One connector. No drawer.&lt;/p&gt;

&lt;p&gt;AI tooling is going through the exact same moment right now. Most people have not noticed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The drawer of integrations
&lt;/h2&gt;

&lt;p&gt;For the last couple of years, every AI integration was its own custom cable.&lt;/p&gt;

&lt;p&gt;You wanted your AI assistant to read your Notion? Cool, here is a custom plugin that runs on that vendor's plugin system, with its own auth, its own schema, its own quirks. You wanted a different model to query your database? Different system. You wanted to do something with Slack? Build a function-calling wrapper, write the schema by hand, host it somewhere, deal with the auth yourself. You wanted to switch from ChatGPT to Claude, or Claude to Codex, or any of them to a local model? Throw all of it away and start over.&lt;/p&gt;

&lt;p&gt;Every "AI integration" was bespoke. Every developer who built one had to figure out the same five problems from scratch. Auth. Schema. Transport. Tool descriptions. Error handling. Five problems times one hundred SaaS tools times five model vendors gives you a number that should have scared us all.&lt;/p&gt;

&lt;p&gt;And then a small thing called the &lt;strong&gt;Model Context Protocol&lt;/strong&gt; showed up and said: what if this was just one shape?&lt;/p&gt;

&lt;h2&gt;
  
  
  What MCP actually is
&lt;/h2&gt;

&lt;p&gt;I will keep this short because the spec is honestly not that interesting and you can read it later if you want.&lt;/p&gt;

&lt;p&gt;MCP is a protocol. Your AI client (Claude, ChatGPT, Codex, Gemini, Cursor, whoever) speaks one shape. Any tool, any service, any local script can implement that shape and the client can talk to it. The client does not care if it is reading from Notion, posting to Slack, querying Postgres, or running a Playwright browser. They all expose the same kind of interface. Tools, resources, prompts. That is basically the whole story.&lt;/p&gt;
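
&lt;p&gt;Concretely, "one shape" from the client side is usually just a block like this in the client's config file. The &lt;code&gt;mcpServers&lt;/code&gt; layout is what Claude Desktop reads; the package and env var names here are illustrative, check the server's own README:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server"],
      "env": { "SENTRY_ACCESS_TOKEN": "use-a-minimally-scoped-token" }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;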

&lt;p&gt;The cleverness is not in the protocol design. The cleverness is in the agreement. Anthropic shipped it. OpenAI adopted it. The big SaaS companies started writing servers for their own products. Atlassian has one. Figma has one. Slack has one. Notion. Vercel. Gmail. Google Calendar. Playwright. The list is now embarrassingly long.&lt;/p&gt;

&lt;p&gt;It is the same thing USB-C did. Not a technical breakthrough. A standardisation moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this looks like in practice
&lt;/h2&gt;

&lt;p&gt;Here is what my actual day looks like now, and I want to be honest, this is the part that took me a while to internalise.&lt;/p&gt;

&lt;p&gt;When something breaks in production, I open my editor. I do not open Sentry. I do not open Notion. I do not switch tabs. I just say something like, &lt;em&gt;"pull the latest unresolved issue in the api project, show me the stack trace, and tell me which file it points to"&lt;/em&gt;. The agent calls the Sentry MCP, gets the issue, reads the file from the codebase, and tells me where the bug is. Sometimes it offers a fix. Sometimes I tell it to write the fix and resolve the issue. The whole loop, including writing the patch and closing the ticket, lives in one window.&lt;/p&gt;

&lt;p&gt;And that is for one tool. The same agent, in the same session, can also pull a Linear ticket, check a Figma frame, post an update to Slack, query a Postgres database, and run a quick Playwright test against staging. All without me leaving the editor.&lt;/p&gt;

&lt;p&gt;Compare that to the friend I mentioned at the start. Tab to chat, paste, copy, paste, copy. Same problem. Different decade. And again, it is not about which AI tool they picked. ChatGPT, Claude, Codex, Gemini, all of them now speak MCP or are in the process of adding it. The bottleneck is not the model. The bottleneck is whether you have actually plugged anything into it.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who finds this gap funny.&lt;/p&gt;

&lt;h2&gt;
  
  
  I built a thing because I felt the pain
&lt;/h2&gt;

&lt;p&gt;A while back I started building MCP servers for the SaaS tools I actually use at work. It started with one. Then two. Then before I knew it I had eleven of them, plus a shared OAuth library, plus a docs site, plus a Docker setup so they would show up properly in the public registries. The repo is called &lt;a href="https://github.com/vineethkrishnan/mcp-pool" rel="noopener noreferrer"&gt;mcp-pool&lt;/a&gt; and I wrote a &lt;a href="https://vineethnk.in/blog/building-mcp-pool" rel="noopener noreferrer"&gt;whole separate post&lt;/a&gt; about how it grew, so I will not retell that story here.&lt;/p&gt;

&lt;p&gt;The thing I want to point out is that the painful part was never writing the servers. The SDKs are decent. The protocol is small. You can scaffold a basic server in an afternoon if you have done it once before.&lt;/p&gt;

&lt;p&gt;The painful part was running them. Six different Node processes on my machine, each one with its own config file, each one needing its own auth token, each one occasionally crashing for no reason and silently disappearing from the agent's tool list. That is the part nobody warns you about. Once you have more than two or three MCP servers, the operations side starts to look a lot like running a small fleet of microservices on your own laptop. Which, when you put it that way, is kind of an absurd thing to be doing.&lt;/p&gt;

&lt;p&gt;But that is the price of being early. Same way the first USB-C laptops needed three dongles in your bag. The protocol was right. The ecosystem was still catching up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2023 dev versus the 2026 dev
&lt;/h2&gt;

&lt;p&gt;So here is the bit I keep coming back to.&lt;/p&gt;

&lt;p&gt;The 2023 developer treats the language model as a smarter Stack Overflow. You type a question. You read the answer. You copy something out. You paste it into your code. Your context lives in the chat window. The model has no memory of your repo, your team, your tools, your tickets, your design files, your runbooks, anything.&lt;/p&gt;

&lt;p&gt;The 2026 developer treats the language model as the centre of a small workshop. The model has access to the actual systems. It can read the ticket. Open the file. Run the test. Check the design. Post the update. Close the ticket. The dev is no longer copy-pasting context in. The dev is just describing what they want done, and the agent is fetching, reading, deciding, writing.&lt;/p&gt;

&lt;p&gt;This is not about AI being smarter. It is about AI being plugged in.&lt;/p&gt;

&lt;p&gt;And I would gently suggest that if you are still in the first group, you are leaving an embarrassing amount of productivity on the table. Not because you are bad at your job, but because you are using a 2023 workflow on a 2026 toolchain. Same way someone might still be charging their phone with a cable they keep in a drawer with seven other cables.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bit nobody is putting on the marketing slide
&lt;/h2&gt;

&lt;p&gt;So far this post has been mostly cheerful. A new protocol, a nicer way to work, a cable drawer that finally got cleaned up. Honest moment now.&lt;/p&gt;

&lt;p&gt;Plugging more tools into your AI assistant is also plugging more attack surface into your daily workflow. The MCP ecosystem has had a genuinely rough run on the security front, and if you are about to install a few servers this weekend, you should know what has actually happened in the last year before you do it.&lt;/p&gt;

&lt;p&gt;A short and very much not comprehensive list of real incidents (the &lt;a href="https://authzed.com/blog/timeline-mcp-breaches" rel="noopener noreferrer"&gt;authzed MCP breach timeline&lt;/a&gt; has the fuller version, and is what I cross-checked these against):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;April 2025, &lt;a href="https://invariantlabs.ai/blog/whatsapp-mcp-exploited" rel="noopener noreferrer"&gt;WhatsApp MCP&lt;/a&gt;&lt;/strong&gt;: a tool-poisoning attack disguised a backdoor as a legitimate server and quietly exfiltrated chat histories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;May 2025, &lt;a href="https://invariantlabs.ai/blog/mcp-github-vulnerability" rel="noopener noreferrer"&gt;GitHub MCP&lt;/a&gt;&lt;/strong&gt;: a prompt injection in a malicious public issue hijacked the agent into leaking private repository contents, using a token whose scope was way too broad.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;September 2025, &lt;a href="https://thehackernews.com/2025/09/first-malicious-mcp-server-found.html" rel="noopener noreferrer"&gt;Postmark MCP&lt;/a&gt;&lt;/strong&gt;: a trojanized package on a public registry was BCC-ing every email it handled to attacker infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;October 2025, &lt;a href="https://blog.gitguardian.com/breaking-mcp-server-hosting/" rel="noopener noreferrer"&gt;Smithery Registry&lt;/a&gt;&lt;/strong&gt;: a path traversal bug exposed builder credentials and compromised thousands of hosted MCP servers in one go.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;April 2026, &lt;a href="https://thehackernews.com/2026/04/anthropic-mcp-design-vulnerability.html" rel="noopener noreferrer"&gt;core MCP STDIO design flaw&lt;/a&gt;&lt;/strong&gt;: an architectural decision in Anthropic's official SDKs that, depending on who you read, exposes upwards of a hundred and fifty million downloads across Cursor, VS Code, Windsurf, Claude Code and others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And right next to this, a related incident that was not strictly an MCP breach but is exactly the pattern you should be watching for. In April 2026, &lt;strong&gt;Vercel&lt;/strong&gt; &lt;a href="https://vercel.com/kb/bulletin/vercel-april-2026-security-incident" rel="noopener noreferrer"&gt;disclosed&lt;/a&gt; that an employee was compromised through &lt;strong&gt;Context.ai&lt;/strong&gt;, a third-party AI tool that held a Google Workspace OAuth app with broad permissions. Malware on the AI vendor's laptop, then OAuth pivot, then into Vercel customer environment variables (&lt;a href="https://techcrunch.com/2026/04/20/app-host-vercel-confirms-security-incident-says-customer-data-was-stolen-via-breach-at-context-ai/" rel="noopener noreferrer"&gt;TechCrunch&lt;/a&gt; and &lt;a href="https://www.trendmicro.com/en_us/research/26/d/vercel-breach-oauth-supply-chain.html" rel="noopener noreferrer"&gt;Trend Micro&lt;/a&gt; have the cleanest writeups). Not MCP-specific. But the shape is exactly the shape MCP makes more common.&lt;/p&gt;

&lt;p&gt;The pattern across all of these is the same. An AI tool sits in the middle of your stack, holding tokens that reach into your real systems. If that tool is malicious, vulnerable, or just sloppily run, the blast radius is whatever those tokens can reach. And tokens for "read my Notion" or "post to Slack" are not low-privilege things in 2026. They are basically the keys to an entire workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to actually check if an MCP server is safe for you
&lt;/h2&gt;

&lt;p&gt;This is not a perfect checklist. It is the rough rubric I run before I install a server. Steal it, sharpen it, throw it away, whatever works.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Who publishes it.&lt;/strong&gt; Is the server from the SaaS vendor whose API it wraps, from a known community maintainer, or from a username you have never seen before? Vendor-official is safest. A maintainer with a real track record is fine. A brand new account with one package and no GitHub history is a hard no.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read the source.&lt;/strong&gt; Most MCP servers are small. Cloning the repo and skimming the tool list takes a few minutes. Look at what tools are exposed, what their descriptions actually say, and whether anything is doing something the README does not mention. Tool poisoning lives in exactly this gap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the dependency tree.&lt;/strong&gt; A small wrapper with two hundred transitive dependencies is a very different risk profile from a small wrapper with five. Shorter is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token scope, ruthlessly.&lt;/strong&gt; When you generate the token the server will use, give it the smallest set of permissions that gets the job done. Read-only beats read-write. Single-project beats organisation-wide. Single-channel beats whole-workspace. Never reuse a token you already use somewhere else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run it locally, not on a hosted gateway.&lt;/strong&gt; Hosted MCP gateways are convenient. They are also a single point at which someone else is holding your credentials. If a server can run as a local stdio process on your own machine, prefer that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read-only first, write tools opt-in.&lt;/strong&gt; If the server supports read-only mode, start there. Only enable write tools after you have used it long enough to trust both the server and how the agent behaves with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch for updates that change tool descriptions.&lt;/strong&gt; This is one of the sneakier attack patterns. A server you trusted last month silently expands its tool descriptions in this week's update to include something new and harmful. Pin versions if you can.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Check the registry verification badges.&lt;/strong&gt; Glama and the official MCP registry now flag servers that have been smoke-tested. Not perfect signal, but a server with zero badges, zero stars, and no recent commits is at least worth a second look.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a server fails most of these, do not install it. If it fails one or two, decide whether the convenience is worth it for your specific situation. None of this is paranoia. It is the same hygiene most of us already apply to npm packages, just adapted to a newer ecosystem that is still figuring out the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell a friend
&lt;/h2&gt;

&lt;p&gt;If you read this far and you are wondering whether to bother, here is what I would actually say to a friend over coffee.&lt;/p&gt;

&lt;p&gt;Pick one tool you use every day. One. Sentry, Notion, Linear, Slack, your database, whatever. Find an existing MCP server for it on GitHub, or look at the official ones from Anthropic, or check &lt;code&gt;mcp-pool&lt;/code&gt; if any of those line up with your stack. Run the safety checklist above before you install. Then wire it into Claude Desktop or Claude Code or your client of choice. Spend a single evening doing this and nothing else.&lt;/p&gt;

&lt;p&gt;The first time you say &lt;em&gt;"summarise the last five Sentry issues from this morning"&lt;/em&gt; and an actual answer comes back, with real data, from the real system, you will get it. The shift will feel obvious in hindsight. You will wonder how you spent so long copy-pasting things into a chat box.&lt;/p&gt;

&lt;p&gt;That is basically the whole point of this post. Not "MCP is cool". Not "here are the seven best servers to install today". Just: a thing has changed, and most people I know in tech have not yet noticed it has changed. Which is normal. Standardisation moments are always quiet. The drawer of cables does not announce itself. One day you just notice you have not opened the drawer in years.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If your AI workflow today involves a lot of tab switching and copy-pasting, that is the cable drawer. It is fine, it works, it is not broken. But there is a different way of doing it now, and the gap between the two is going to keep widening every month as more SaaS companies ship MCP servers for their products.&lt;/p&gt;

&lt;p&gt;You do not have to rush. Nobody is keeping score. But it might be worth at least poking at one server this weekend, just to see.&lt;/p&gt;

&lt;p&gt;That is all I had on this one. If you made it till here, thank you, genuinely. See you in the next one, where I will probably be complaining about something else that broke.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>aitooling</category>
      <category>claude</category>
      <category>security</category>
    </item>
    <item>
      <title>The webhook that worked in Postman and nowhere else</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Mon, 04 May 2026 11:38:41 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-webhook-that-worked-in-postman-and-nowhere-else-28o2</link>
      <guid>https://forem.com/vineethnkrishnan/the-webhook-that-worked-in-postman-and-nowhere-else-28o2</guid>
      <description>&lt;h1&gt;
  
  
  The webhook that worked in Postman and nowhere else
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fthe-webhook-that-worked-in-postman-and-nowhere-else-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fthe-webhook-that-worked-in-postman-and-nowhere-else-hero.png" alt="Two identical office doorways at the end of a corridor, one opens into a brightly lit room, the other into a dim corridor that dead-ends. Flat illustration, soft colors, modern editorial style."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: an app I work on was firing webhooks at a third-party device API. The receiver kept returning 401. Postman, with the same payload, got 200 every time. The cause was not signing logic, not auth, not network. The app had two completely different bootstrap paths, the secret-loading config was wired into only one of them, and a silent-skip guard quietly hid the real failure under a misleading 401.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So there I was, staring at a wall of 401 responses in the logs. The app was firing webhooks at a third-party device API every time something on our side changed state. Every single one was bouncing back as "unauthorized".&lt;/p&gt;

&lt;p&gt;Fine, must be the signature. I copied the raw request body straight out of the logs, dropped it into Postman, signed it the same way the app does, and fired it at the same URL. &lt;strong&gt;200 OK&lt;/strong&gt;. First try.&lt;/p&gt;

&lt;p&gt;So Postman was happy. The app was not. Same payload, same URL, same headers (so I thought), and yet only one of them was getting through.&lt;/p&gt;

&lt;p&gt;If you have ever been in this situation, you know the feeling. There is no Stack Overflow post for "works in Postman, fails from my own app". You have to walk yourself through it.&lt;/p&gt;
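
&lt;p&gt;Side note: the outside-the-app check does not actually need Postman. The scheme here is plain HMAC-SHA256 over the raw body, so openssl can produce the same &lt;code&gt;X-Signature&lt;/code&gt; value from a shell (variable names hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# recompute the signature by hand and compare
printf '%s' "$RAW_BODY" | openssl dgst -sha256 -hmac "$SECRET" -hex
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;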

&lt;h2&gt;
  
  
  First, rule out the obvious stuff
&lt;/h2&gt;

&lt;p&gt;I went through the standard checklist before doing anything clever.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same URL? Yes, copy-pasted from the same config.&lt;/li&gt;
&lt;li&gt;Same body? Yes, byte for byte.&lt;/li&gt;
&lt;li&gt;Same auth header? Yes, same shared secret loaded from the same env file.&lt;/li&gt;
&lt;li&gt;Time skew? The timestamp inside the signature was within a few seconds of the receiver's clock.&lt;/li&gt;
&lt;li&gt;IP whitelist? No, the receiver does not even check the source IP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So on paper the two requests were the same. The receiver clearly disagreed. Which meant I had to see what the app was actually putting on the wire, not what I thought it was putting on the wire.&lt;/p&gt;

&lt;h2&gt;
  
  
  The diff that made the cause obvious
&lt;/h2&gt;

&lt;p&gt;I added a logger that dumped the full outgoing HTTP request right before the dispatch: method, URL, every header, body. Then I triggered an event from the app and let it fire. Side by side with the Postman request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Postman                              App
-----------------------------------  -----------------------------------
POST /webhook                        POST /webhook
Content-Type: application/json       Content-Type: application/json
X-Signature: sha256=a3f4...e991      X-Signature:
User-Agent: Postman                  User-Agent: GuzzleHttp/...
{"event":"door.unlocked",...}        {"event":"door.unlocked",...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look at the &lt;code&gt;X-Signature&lt;/code&gt; line on the right. The app &lt;em&gt;was&lt;/em&gt; sending the header. The value was just an empty string. Postman had a signature, the app had nothing.&lt;/p&gt;

&lt;p&gt;That was a relief in a small, sad way. At least there was something to find.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is the signature empty?
&lt;/h2&gt;

&lt;p&gt;Easy enough to check. The dispatcher looked roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function dispatch(event, payload):
    secret = config.get("device_api.signing_secret")
    if secret is empty:
        // skip signing, send anyway
        send(payload, headers={})
        return
    signature = hmac_sha256(secret, payload)
    send(payload, headers={"X-Signature": signature})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things wrong here, but bear with me.&lt;/p&gt;

&lt;p&gt;I dropped a log line on the &lt;code&gt;secret = ...&lt;/code&gt; line. The value came back &lt;code&gt;null&lt;/code&gt;. At runtime, in the queue worker's process, the signing secret was just not there.&lt;/p&gt;

&lt;p&gt;But it was the same config file. The same env. The same code reading from the same key. Why was the secret empty in the worker and present in the HTTP layer?&lt;/p&gt;

&lt;p&gt;Has this happened to you too, where two parts of the same app behave like they live in different universes? Welcome to bootstrap drift.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two doors that look the same from the outside
&lt;/h2&gt;

&lt;p&gt;The app, like a lot of older codebases, has more than one entrypoint. There is the HTTP entrypoint that serves the website, the API endpoints, anything that comes in over a request. And separately there is a queue worker entrypoint that handles background jobs: sending mails, replicating data, dispatching webhooks (yes, &lt;em&gt;that&lt;/em&gt; webhook).&lt;/p&gt;

&lt;p&gt;Both entrypoints share most of the codebase. They both load the same config files. They both connect to the same database. From the file tree, they look identical.&lt;/p&gt;

&lt;p&gt;But they boot through different paths. The HTTP entrypoint has its own bootstrap routine. The queue worker has its own. And somewhere along the way, the config that loaded the third-party device API secret had been added only to the HTTP entrypoint's bootstrap.&lt;/p&gt;

&lt;p&gt;When a request came in over HTTP, the bootstrap ran, the secret got loaded, the dispatcher had what it needed. Tested manually with Postman replay against the HTTP entrypoint? Worked, because Postman was hitting the side that had the config.&lt;/p&gt;

&lt;p&gt;But the actual production trigger was a queue job. The job ran inside the queue worker process, which booted through the &lt;em&gt;other&lt;/em&gt; path, which never loaded that config. So &lt;code&gt;config.get("device_api.signing_secret")&lt;/code&gt; came back null. Every single time.&lt;/p&gt;

&lt;p&gt;The two entrypoints had drifted apart. Whoever added the config load had put it where they could see it being needed (the HTTP layer, where the test was easy), and nobody noticed that the queue worker was also calling the same dispatcher.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second bug: the silent-skip guard
&lt;/h2&gt;

&lt;p&gt;Look at the dispatcher again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if secret is empty:
    // skip signing, send anyway
    send(payload, headers={})
    return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That comment is the second crime scene.&lt;/p&gt;

&lt;p&gt;When the secret was missing, instead of throwing an error, the dispatcher quietly stripped the signature header and sent the request anyway. So the receiver, who is doing what every signed-webhook receiver does, saw an unsigned request and answered 401.&lt;/p&gt;

&lt;p&gt;From the outside, what we saw was: webhooks fail with 401. The obvious assumption is that the signature is wrong. We spent a good while looking at HMAC code, hashing algorithms, payload encoding, header casing. All of that was fine. The bug was four layers up the stack from where the symptom was showing.&lt;/p&gt;

&lt;p&gt;If the dispatcher had just thrown a loud &lt;code&gt;MissingSecretError: device_api.signing_secret is null&lt;/code&gt;, the cause would have shown up the very first time a webhook tried to fire. Instead it whispered "no signature, oh well", and the receiver did the polite thing and rejected it. Two pieces of code, each individually being defensive, together producing a misleading symptom.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix, and the meta-fix
&lt;/h2&gt;

&lt;p&gt;The local fix was a one-liner. Move the config load into the shared bootstrap that runs for every entrypoint. Now every process that boots, whether HTTP, worker, CLI, or cron, has the secret loaded by the time anything else runs.&lt;/p&gt;

&lt;p&gt;The meta-fix was the silent-skip guard. I changed it to throw if the secret is missing in any non-test environment. If somebody, some day, manages to start a worker process without that config loaded, I want it to crash on the first webhook attempt with a useful error, not soldier on producing 401s for hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if secret is empty:
    if env != "test":
        throw MissingSigningSecret("device_api.signing_secret")
    // tests can opt in to unsigned mode
    send(payload, headers={})
    return
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Took maybe ten minutes to write. The bug had been confusing me for a good chunk of the day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two lessons I am writing on the wall
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cross-cutting config belongs in the shared bootstrap, not in the entrypoint-specific one.&lt;/strong&gt; If a piece of config is needed by code that runs in more than one process type, the only safe place to load it is somewhere all of those processes pass through. Not the HTTP bootstrap. Not the worker bootstrap. The one underneath both. Otherwise you are building two apps that pretend to be the same app, and they will eventually disagree.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silent-skip guards turn loud failures into quiet ones.&lt;/strong&gt; If a value being missing is going to make the next operation meaningless, do not paper over it. Throw. The sound of a real error in a dev environment is so much cheaper than the silence of a wrong-but-running production. There are exceptions, where degrading gracefully is genuinely the right answer. But the default should be loud, and "quiet on missing config" is almost never the right answer.&lt;/p&gt;

&lt;p&gt;If you have hit this kind of bootstrap drift in your own apps, I would love to hear how you spotted it. Mine was pure luck. The request logger I added was actually for an unrelated thing, and I noticed the empty header by accident. Without that I might still be reading HMAC source somewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Looking back, this whole thing was less about webhooks and more about how easy it is for two parts of the same app to grow apart without anyone noticing. The codebase looks like one app from the file tree. It runs as two different apps from the operating system's point of view. That gap is where bugs like this live.&lt;/p&gt;

&lt;p&gt;If your app has more than one entrypoint, today is a good day to grep for &lt;code&gt;bootstrap&lt;/code&gt; and check whether all of them are setting up the same world.&lt;/p&gt;

&lt;p&gt;That is pretty much it from my side today. Let me know what you think, or if you have been through something similar, those stories are always the best ones. See you soon in the next blog.&lt;/p&gt;

</description>
      <category>debugging</category>
      <category>webhooks</category>
      <category>bootstrap</category>
      <category>queueworkers</category>
    </item>
    <item>
      <title>The 20,000-line PR that was actually 47 lines: building ClearPR</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Fri, 01 May 2026 08:54:38 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-20000-line-pr-that-was-actually-47-lines-building-clearpr-3h06</link>
      <guid>https://forem.com/vineethnkrishnan/the-20000-line-pr-that-was-actually-47-lines-building-clearpr-3h06</guid>
      <description>&lt;h1&gt;
  
  
  The 20,000-line PR that was actually 47 lines: building ClearPR
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fclearpr-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fclearpr-hero.png" alt="A developer at a desk being buried under an enormous unrolling scroll of green and red code diff lines pouring out of his laptop, with one tiny section glowing yellow as the real change."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some time back, a teammate opened a PR. The diff said &lt;strong&gt;20,847 lines changed&lt;/strong&gt;. I clicked, my MacBook fan kicked in, and GitHub started painting the page in those familiar green and red blocks. I scrolled. Scrolled some more. Then a bit more. Eventually I got to the part where I realised what had happened: someone had run Prettier on the whole repo before pushing.&lt;/p&gt;

&lt;p&gt;The actual change was 47 lines.&lt;/p&gt;

&lt;p&gt;I sat there for a moment thinking about the rest of my afternoon, which was now going to involve scrolling past twenty thousand lines of trailing-comma additions and quote-style flips just to find the part of the code that actually did something different. I tried the GitHub "Hide whitespace" toggle. It did nothing useful, because Prettier does not just touch whitespace. It rewraps lines. It reorders imports. It changes single quotes to double quotes. The toggle was built for a simpler time.&lt;/p&gt;

&lt;p&gt;I closed the tab, went and made a coffee, and on the walk back to my desk I started thinking: why am I the one doing this work? Why is my eyeball the noise filter? This is the kind of thing a parser figures out in a few milliseconds.&lt;/p&gt;

&lt;p&gt;That is roughly when ClearPR started.&lt;/p&gt;

&lt;h2&gt;
  
  
  What ClearPR actually is
&lt;/h2&gt;

&lt;p&gt;ClearPR is a self-hosted GitHub App. You install it on your repos, point it at your own server, and from then on every time someone opens or updates a PR, it does three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses the changed files into an AST and computes a &lt;em&gt;semantic&lt;/em&gt; diff that ignores formatting noise.&lt;/li&gt;
&lt;li&gt;Sends the clean diff to an AI (Claude by default, though you can swap in OpenAI, Mistral, Gemini, or any local LLM that speaks an OpenAI-compatible API: Ollama, LM Studio, LocalAI, llama.cpp, vLLM) along with your project's own guidelines.&lt;/li&gt;
&lt;li&gt;Remembers what reviewers caught in past PRs, so the same mistake does not slip through quietly six months later.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It posts inline comments on the lines it has something to say about. It does not approve PRs. It does not block PRs. It does not request changes. It is advisory, deliberately, because nobody on a Friday evening needs an AI bot blocking the merge button.&lt;/p&gt;

&lt;p&gt;The whole thing runs in Docker. One &lt;code&gt;docker compose up -d&lt;/code&gt; and it is alive. You do not send your code anywhere except your own server and the LLM API of your choice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why an AST and not a regex
&lt;/h2&gt;

&lt;p&gt;The first version I prototyped used regexes. Strip trailing whitespace. Collapse blank lines. Normalise quote style. Sort imports alphabetically before diffing. Easy. Worked for the boring cases.&lt;/p&gt;

&lt;p&gt;It also broke in beautiful ways. A regex that strips trailing commas does not understand that the comma inside a string literal is not the same as a syntactic trailing comma. A regex that normalises quotes does not know that the apostrophe inside &lt;code&gt;it's&lt;/code&gt; is not a string delimiter. I got bitten by this almost immediately on real PRs and decided I was building the wrong thing.&lt;/p&gt;

&lt;p&gt;The right thing was tree-sitter. Tree-sitter parses your code into an actual abstract syntax tree, the same kind of tree your IDE uses for syntax highlighting and code folding. If two ASTs are structurally identical, the code does the same thing, no matter how it is formatted. That is the whole insight, and it is not even mine. It is just what compilers have known forever.&lt;/p&gt;

&lt;p&gt;So ClearPR parses both sides of the diff into ASTs, walks them, and only reports the nodes that actually changed in shape. Whitespace differences? Same tree. Trailing commas? Same tree. Single-to-double quote flip? Same tree. Reordered imports where the set of imports is identical? Same tree. Once you strip all of that, what is left is the part you actually wanted to review.&lt;/p&gt;
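
&lt;p&gt;You can demo the core idea with nothing but Python's built-in &lt;code&gt;ast&lt;/code&gt; module. ClearPR uses tree-sitter so the same trick works across languages, but the principle is exactly this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast

before = "x = {'a': 1, 'b': 2}"
after = '''x = {
    "a": 1,
    "b": 2,
}
'''

# Quote style, line wrapping, and the trailing comma all vanish
# at the AST level. Same tree, same code.
print(ast.dump(ast.parse(before)) == ast.dump(ast.parse(after)))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
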

&lt;p&gt;Has this happened to you too, where you spent ages reviewing a PR only to realise the only thing that mattered was a one-line bug fix hidden inside a Prettier sweep? If so, you know exactly why I kept building this thing on weekends.&lt;/p&gt;

&lt;h2&gt;
  
  
  Then the AI part
&lt;/h2&gt;

&lt;p&gt;Stripping formatting noise was the easy half. The harder half was the review itself, because every "AI code reviewer" I had used until then had the same personality: a slightly anxious junior who flagged everything, suggested "consider adding error handling" on every function, and never seemed to actually know what your project looked like.&lt;/p&gt;

&lt;p&gt;I did not want that. I wanted a reviewer that read the project's actual rules and stuck to them.&lt;/p&gt;

&lt;p&gt;So ClearPR looks for config in your repo, in this order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;claude.md&lt;/code&gt; at the repo root&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;agent.md&lt;/code&gt; at the repo root&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.reviewconfig&lt;/code&gt; at the repo root, which can point at multiple guideline files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If it finds them, it reads the full text and uses it as review context. Your team's naming convention, your error handling rules, your "we never do X here" notes, all of it. The reviews stop saying generic things and start saying specific things like &lt;em&gt;"this function name does not match the verb-first rule from &lt;code&gt;naming-conventions.md&lt;/code&gt; line 14"&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.reviewconfig&lt;/code&gt; itself looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;guidelines&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docs/coding-standards.md&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docs/naming-conventions.md&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;docs/api-patterns.md&lt;/span&gt;
&lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;medium&lt;/span&gt;
&lt;span class="na"&gt;ignore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;**/*.generated.ts'&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;migrations/**'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boring on purpose. The whole point is that anyone in the team can edit it without learning a new DSL.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I am most pleased with: PR memory
&lt;/h2&gt;

&lt;p&gt;This is the bit that took the longest and is also the bit I had the most fun building.&lt;/p&gt;

&lt;p&gt;Every team I have ever worked with has the same problem. Someone reviews a PR, leaves a thoughtful comment ("hey, you forgot to wrap this in a transaction, that has bitten us before"), the author fixes it, the PR merges, and some months later somebody else writes the same bug and nobody catches it because the original reviewer is busy or on leave or has moved teams.&lt;/p&gt;

&lt;p&gt;The institutional memory lives inside one human's head. When the human leaves, the memory leaves.&lt;/p&gt;

&lt;p&gt;ClearPR indexes the last 200 merged PRs on install. For each one it pulls the review comments, embeds them with a sentence-transformer model, and stores the vectors in pgvector inside Postgres. From then on, whenever it reviews a new diff, it does a similarity search against past comments and includes the relevant ones in the prompt. So if your team caught "missing transaction wrap" once, ClearPR has it on file, and the next time something looks similar it flags it with context: &lt;em&gt;"this is similar to the issue found in PR #342 where the booking creation was not wrapped in a transaction."&lt;/em&gt;&lt;/p&gt;
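
&lt;p&gt;The lookup itself is plain SQL once the vectors exist. A minimal sketch of the shape, against a hypothetical table rather than ClearPR's actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2  # assumes the pgvector extension is enabled in Postgres

# Hypothetical table: review_comments(body text, embedding vector(384))
QUERY = """
    SELECT body
    FROM review_comments
    ORDER BY embedding &amp;lt;=&amp;gt; %s::vector  -- pgvector cosine-distance operator
    LIMIT 5;
"""

def similar_comments(conn, diff_embedding):
    # diff_embedding: the sentence-transformer vector for the new diff
    with conn.cursor() as cur:
        cur.execute(QUERY, (str(diff_embedding),))
        return [row[0] for row in cur.fetchall()]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
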

&lt;p&gt;It also tracks which feedback was accepted (the code actually changed after the comment) versus dismissed (the author replied "actually that is intentional"). Over time it learns what your team genuinely cares about and stops nagging about the things you have already collectively decided are fine.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who has watched the same review comment pop up across years on different PRs. The whole point of ClearPR's memory module is to give that knowledge somewhere to live that is not just one senior engineer's brain.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost angle, briefly
&lt;/h2&gt;

&lt;p&gt;A side effect of the AST filtering is that you are sending way fewer tokens to the LLM. On a PR where the raw diff is five thousand lines and the semantic diff is four hundred, you are paying for four hundred lines of input plus the project guidelines, not five thousand. That is not the reason I built it, but for a team of ten doing a couple of hundred PRs a month it adds up to roughly the difference between a thirty-dollar-a-month Claude bill and a two-hundred-dollar one. People notice when their LLM bill is a fraction of what their colleague's is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture, very briefly
&lt;/h2&gt;

&lt;p&gt;The stack is what I tend to reach for these days when I want something boring and reliable: NestJS for the API, Postgres with the pgvector extension for the memory store, Redis with BullMQ for the job queue, tree-sitter for the parsing, and the Anthropic SDK (or whichever LLM provider you pick) for the actual review.&lt;/p&gt;

&lt;p&gt;The flow is roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitHub webhook
       |
       v
NestJS receives it, validates the signature, queues a job
       |
       v
BullMQ worker picks it up
       |
       +--&amp;gt; tree-sitter computes the semantic diff
       +--&amp;gt; pgvector pulls similar past comments
       +--&amp;gt; LLM gets the diff + guidelines + memory hits
       |
       v
Octokit posts inline comments back on the PR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nothing exotic. The interesting parts are the diff engine and the memory store. Everything else is plumbing.&lt;/p&gt;

&lt;p&gt;I went with DDD-flavoured hexagonal architecture inside the NestJS app because I knew there were going to be multiple LLM providers, multiple token-store strategies, multiple language parsers, and I did not want any of those choices baked into the domain layer. So the &lt;code&gt;review&lt;/code&gt; module talks to a &lt;code&gt;LlmProvider&lt;/code&gt; interface and does not care whether the implementation is Anthropic or OpenAI or Ollama. Same for the &lt;code&gt;diff-engine&lt;/code&gt; module, which talks to a &lt;code&gt;LanguageParser&lt;/code&gt; interface and does not care whether the file is TypeScript or PHP or YAML. This sounded like overengineering on day one. By the time I added the second LLM provider it had already paid for itself.&lt;/p&gt;
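
&lt;p&gt;ClearPR itself is TypeScript, but the shape of that port fits in a few lines of Python if you want the idea without the NestJS ceremony; the names mirror the ones above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

class LlmProvider(Protocol):
    # The port. The review module only ever talks to this.
    def review(self, diff, guidelines, memory): ...

class AnthropicProvider:
    # One adapter among several. Swapping in an OpenAI or Ollama
    # adapter never touches the domain layer.
    def review(self, diff, guidelines, memory):
        raise NotImplementedError("call the Anthropic API here")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
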

&lt;h2&gt;
  
  
  What I got wrong the first time
&lt;/h2&gt;

&lt;p&gt;Two things stand out, both about doing too much too early.&lt;/p&gt;

&lt;p&gt;First, I tried to support every language tree-sitter supports out of the gate. There are over a hundred parsers. I started wiring them all up. Halfway through I realised I was solving a problem I did not have, because nobody runs Prettier on Haskell. I cut the supported list down to TypeScript, JavaScript, PHP, JSON, and YAML, with a whitespace-only fallback for everything else. Languages can be added when somebody actually asks for them.&lt;/p&gt;

&lt;p&gt;Second, the first version of the AI prompt was way too clever. I had it doing a multi-step chain: summarise the diff, extract the intent, compare against guidelines, then write feedback. It was slow, it was expensive, and the reviews were not noticeably better than a single carefully written prompt that did the whole thing in one pass. I deleted the chain. The single-prompt version is faster, cheaper, and the comments are punchier because the model is not trying to fit its reasoning into a structured pipeline.&lt;/p&gt;

&lt;p&gt;Both of these are versions of the same lesson: you do not actually know what your tool needs to do until somebody real has tried to use it. Build the smallest thing that could possibly work, ship it, then let the actual usage tell you what to add.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;The roadmap inside the repo has the public version, but the short version is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-fix suggestions through GitHub's suggested-changes UI, so reviewers can click "commit suggestion" instead of copy-pasting from a comment.&lt;/li&gt;
&lt;li&gt;A small analytics dashboard so a tech lead can see which kinds of issues their team keeps making.&lt;/li&gt;
&lt;li&gt;Multi-repo support with shared guidelines, for teams that want one source of truth across many services.&lt;/li&gt;
&lt;li&gt;A pre-push IDE plugin, so you get a ClearPR review locally before you even open the PR.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some of that is in flight already. Some of it is still a checkbox in a markdown file. Either way, the project is open source and self-hosted by design, so if any of it is interesting to you, the repo is the place to start: &lt;a href="https://github.com/vineethkrishnan/clearpr" rel="noopener noreferrer"&gt;github.com/vineethkrishnan/clearpr&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The README has the install steps, the GitHub App setup, and the full list of config options. Full docs are at &lt;a href="https://clearpr-docs.vineethnk.in/" rel="noopener noreferrer"&gt;clearpr-docs.vineethnk.in&lt;/a&gt;. The Docker image is on Docker Hub at &lt;code&gt;vineethnkrishnan/clearpr&lt;/code&gt;. License is MIT, so do whatever you want with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Honestly, the thing I am most happy about with ClearPR is not the AST trick or the memory module or the LLM-provider abstraction. It is that I no longer scroll past twenty thousand lines of Prettier output to find a one-line bug fix. The first time I opened a PR after installing it on my own repos and saw the clean diff comment with the actual change highlighted, I just sat back and laughed. It was such a small thing. It saved me a real chunk of time. And then it did the same thing the next day, and the next.&lt;/p&gt;

&lt;p&gt;That is the whole reason any of this exists.&lt;/p&gt;

&lt;p&gt;Okay, that is enough from me for today. If any of this saved you some time, that is the whole point of writing it down. Until the next one, take it easy.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>nestjs</category>
      <category>githubapp</category>
      <category>aireview</category>
    </item>
    <item>
      <title>I blocked Tor exit nodes, then I opened Tor Browser</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Thu, 30 Apr 2026 13:23:21 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/i-blocked-tor-exit-nodes-then-i-opened-tor-browser-4121</link>
      <guid>https://forem.com/vineethnkrishnan/i-blocked-tor-exit-nodes-then-i-opened-tor-browser-4121</guid>
      <description>&lt;h1&gt;
  
  
  I blocked Tor exit nodes, then I opened Tor Browser
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Ftor-shield-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Ftor-shield-hero.png" alt="A fortified concrete wall at golden hour, with a locked iron main gate on the left and a small steel side door beside it standing wide open as warm sunlight pours through onto the concrete ground, photorealistic, no people."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A SaaS I work on has no business serving Tor traffic, and the box had no Tor block of any kind on it. A firewall-level deny felt like the clean, sufficient answer: drop the packets at the kernel, never let them touch the application, never argue with a user agent. So I wrote a small &lt;code&gt;setup_tor_block.sh&lt;/code&gt;, fewer than 50 lines, that pulled the Tor Project's bulk exit list into an &lt;code&gt;ipset&lt;/code&gt; and dropped matching packets at &lt;code&gt;INPUT&lt;/code&gt;. It looked like it worked. I just wanted to harden it before I let it loose under cron.&lt;/p&gt;

&lt;p&gt;Several hardening passes later, I deployed the new version on &lt;code&gt;admin@app-prod-1&lt;/code&gt;. To confirm everything was in place, I opened Tor Browser and pointed it at the application.&lt;/p&gt;

&lt;p&gt;The page loaded.&lt;/p&gt;

&lt;p&gt;That is where this story actually starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardening pass that felt great
&lt;/h2&gt;

&lt;p&gt;The first cut was the kind of thing you write in 20 minutes. No locking, no rollback, no validation, no question of what happens when &lt;code&gt;curl&lt;/code&gt; returns an HTML error page instead of a list of IPs. Fine for a one-off run on my own laptop, not fine for cron on a production box. So I went back in and made the responsible-adult version.&lt;/p&gt;

&lt;p&gt;It got &lt;code&gt;set -euo pipefail&lt;/code&gt;. It got a root check. It got a &lt;code&gt;flock&lt;/code&gt; so two cron jobs could not race each other. The list went into a temp &lt;code&gt;tor_new&lt;/code&gt; ipset first, got validated against a minimum-size threshold, and then atomic-swapped into the live &lt;code&gt;tor&lt;/code&gt; set. Worst case during a reload was zero dropped legitimate packets, not a half-loaded set.&lt;/p&gt;

&lt;p&gt;It got a backup step that wrote &lt;code&gt;iptables-save&lt;/code&gt; and &lt;code&gt;ipset save&lt;/code&gt; into &lt;code&gt;/var/backups/tor-block/&lt;/code&gt; with a timestamped filename and a &lt;code&gt;latest.env&lt;/code&gt; pointer, plus a &lt;code&gt;--rollback&lt;/code&gt; flag that restored both. Because firewalls have a way of meeting other firewalls in surprising orders at 11pm.&lt;/p&gt;

&lt;p&gt;It got a &lt;code&gt;--precheck&lt;/code&gt; mode that audited what was already on the box: existing &lt;code&gt;iptables&lt;/code&gt; rule counts, &lt;code&gt;ufw&lt;/code&gt; and &lt;code&gt;firewalld&lt;/code&gt; and &lt;code&gt;nftables&lt;/code&gt; state, &lt;code&gt;fail2ban&lt;/code&gt; jails, the &lt;code&gt;DOCKER-USER&lt;/code&gt; chain, and an optional Cloudflare or WAF probe via a &lt;code&gt;--domain&lt;/code&gt; flag. If you are about to be the third firewall on a server, you want to know who else is there.&lt;/p&gt;

&lt;p&gt;It even got around a small Ubuntu server thing where &lt;code&gt;iptables-save&lt;/code&gt; lives in &lt;code&gt;/usr/sbin&lt;/code&gt; and an unprivileged user's PATH does not include &lt;code&gt;/usr/sbin&lt;/code&gt;. The script now resolves binaries explicitly with a &lt;code&gt;resolve_bin()&lt;/code&gt; helper instead of trusting &lt;code&gt;$PATH&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I deployed it. Ran &lt;code&gt;--precheck&lt;/code&gt;. Clean. Ran the real thing. List downloaded, atomic swap fired, rule installed in &lt;code&gt;INPUT&lt;/code&gt;, no errors. Counter at zero, which is exactly what you would expect from a fresh deploy.&lt;/p&gt;

&lt;p&gt;I opened Tor Browser to confirm.&lt;/p&gt;

&lt;h2&gt;
  
  
  The page loaded
&lt;/h2&gt;

&lt;p&gt;Tor Browser routes every request out through some Tor exit node, which is exactly the traffic the block is meant to stop. The point of opening it was to see the connection get refused at the firewall. Instead, the page rendered. Login form, footer, the works.&lt;/p&gt;

&lt;p&gt;I went back to the box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables &lt;span class="nt"&gt;-L&lt;/span&gt; INPUT &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="nt"&gt;--line-numbers&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The rule that was supposed to drop everything matching &lt;code&gt;match-set tor src&lt;/code&gt; showed &lt;code&gt;pkts 0 bytes 0&lt;/code&gt;. Not a low number. Zero. Across the entire window since the deploy.&lt;/p&gt;

&lt;p&gt;So either my Tor Browser request was not reaching that chain, or the source address was not in the set. I asked the access logs which IP I had come in as.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2a0b:f4c2::27
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is an IPv6 address.&lt;/p&gt;

&lt;h2&gt;
  
  
  The IPv6 side door
&lt;/h2&gt;

&lt;p&gt;The IPv4 fortress was perfect. Atomic swap, signed list, rollback, the lot. The &lt;code&gt;tor&lt;/code&gt; ipset had family &lt;code&gt;inet&lt;/code&gt;, the rule was &lt;code&gt;iptables&lt;/code&gt;, the persistence was &lt;code&gt;iptables-persistent&lt;/code&gt;. All of it was IPv4.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ip6tables -L INPUT -n -v&lt;/code&gt; was empty. Policy &lt;code&gt;ACCEPT&lt;/code&gt;. Nothing on the IPv6 side at all. The box was dual-stacked, the application listened on both, and Tor's IPv6 path went straight in past the IPv4 wall like it was not there. Which it was not.&lt;/p&gt;

&lt;p&gt;The first instinct was to mirror the v4 work for v6. Pull a list, build a &lt;code&gt;tor6&lt;/code&gt; ipset with family &lt;code&gt;inet6&lt;/code&gt;, install an &lt;code&gt;ip6tables&lt;/code&gt; rule, done. The problem is that the list does not really exist.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;https://check.torproject.org/torbulkexitlist&lt;/code&gt; is IPv4-focused. You will see the occasional IPv6 in there, but mostly not. The cleanest IPv6 source is the Tor Project's own Onionoo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;https://onionoo.torproject.org/details?search=flag:exit&amp;amp;fields=exit_addresses
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That returns relays flagged as exits with their exit addresses, IPv4 and IPv6 mixed. On the snapshot I pulled at the time, the IPv6 count was depressingly small. Not because Tor does not have IPv6 exits, but because relay operators do not always advertise an IPv6 in the field this query returns, and &lt;code&gt;flag:exit&lt;/code&gt; throws away anything not currently flagged at the moment of the call.&lt;/p&gt;

&lt;p&gt;So the answer was not "swap one source for another". The answer was to merge several sources and accept that no single feed is complete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;torbulkexitlist&lt;/code&gt; for IPv4, the canonical bulk source&lt;/li&gt;
&lt;li&gt;Onionoo for IPv4 and IPv6 with the &lt;code&gt;flag:exit&lt;/code&gt; filter&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dan.me.uk/torlist/?exit&lt;/code&gt; as an additional feed for broader relay coverage, filtered by the Exit flag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Three sources, deduplicated into two persistent files (&lt;code&gt;tor_exit_nodes.txt&lt;/code&gt; and &lt;code&gt;tor_ipv6_exits.txt&lt;/code&gt;), each loaded into its own ipset, each enforced by the matching firewall, each backed up and rolled back together.&lt;/p&gt;
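
&lt;p&gt;A minimal sketch of that merge, standard library only. The real script does it in bash with retries, validation, and the atomic swap, and the Onionoo field layout here is my reading of its JSON at the time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ipaddress
import json
import urllib.request

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode()

def sort_into(addr, v4, v6):
    try:
        ip = ipaddress.ip_address(addr.strip())
    except ValueError:
        return  # rate-limited feeds sometimes hand back HTML, skip junk
    (v4 if ip.version == 4 else v6).add(str(ip))

v4, v6 = set(), set()

# Source 1: the canonical bulk list, effectively IPv4-only
for line in fetch("https://check.torproject.org/torbulkexitlist").splitlines():
    if line.strip():
        sort_into(line, v4, v6)

# Source 2: Onionoo exit relays; exit_addresses mixes v4 and v6
onionoo = json.loads(fetch("https://onionoo.torproject.org/details"
                           "?search=flag:exit&amp;amp;fields=exit_addresses"))
for relay in onionoo.get("relays", []):
    for addr in relay.get("exit_addresses", []):
        sort_into(addr, v4, v6)

# Deduplicated union, ready to feed the two ipsets
print(len(v4), "IPv4 exits,", len(v6), "IPv6 exits")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
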

&lt;p&gt;I rewrote the script around dual-stack. Two ipsets (&lt;code&gt;tor&lt;/code&gt; and &lt;code&gt;tor6&lt;/code&gt;). Two enforcement layers (&lt;code&gt;iptables&lt;/code&gt; and &lt;code&gt;ip6tables&lt;/code&gt;). One atomic swap per stack. Backup files for both. The Docker &lt;code&gt;DOCKER-USER&lt;/code&gt; chain got the same &lt;code&gt;match-set&lt;/code&gt; drop on both stacks, so containerised services were covered without per-container rules.&lt;/p&gt;

&lt;p&gt;Re-deployed. Re-opened Tor Browser. Connection refused at the firewall, finally. The counter started moving on both v4 and v6 rules within minutes.&lt;/p&gt;

&lt;p&gt;That was the actual ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  The thing I open-sourced as TorShield
&lt;/h2&gt;

&lt;p&gt;Once the dust settled I cleaned the script up, gave it a name, wrote a small BATS suite around the bash, and put it on GitHub as &lt;a href="https://github.com/vineethkrishnan/tor-shield" rel="noopener noreferrer"&gt;vineethkrishnan/tor-shield&lt;/a&gt;. It is the same idea, packaged so anyone with a Linux production box and no business answering Tor can drop it in without writing the same script for the third time.&lt;/p&gt;

&lt;p&gt;The shape of it is small on purpose. One main &lt;code&gt;setup.sh&lt;/code&gt; does everything. You run it once with &lt;code&gt;--install-deps&lt;/code&gt; to pull &lt;code&gt;ipset&lt;/code&gt;, &lt;code&gt;iptables-persistent&lt;/code&gt;, and &lt;code&gt;curl&lt;/code&gt;, then again without flags to apply. You can run &lt;code&gt;--precheck&lt;/code&gt; first to audit the existing firewall stack before changing anything. You can run &lt;code&gt;--rollback&lt;/code&gt; when, not if, you need to revert.&lt;/p&gt;

&lt;p&gt;A typical first install on a fresh box looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/vineethkrishnan/tor-shield.git
&lt;span class="nb"&gt;cd &lt;/span&gt;tor-shield

&lt;span class="c"&gt;# Audit the box first, no changes&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./setup.sh &lt;span class="nt"&gt;--precheck&lt;/span&gt;

&lt;span class="c"&gt;# Install dependencies and apply the blocks&lt;/span&gt;
&lt;span class="nb"&gt;sudo&lt;/span&gt; ./setup.sh &lt;span class="nt"&gt;--install-deps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first run takes about a minute. It downloads the lists, builds the ipsets, installs the rules, persists everything via &lt;code&gt;netfilter-persistent&lt;/code&gt;, and writes a backup so the rollback path exists from the moment the rules go live.&lt;/p&gt;

&lt;p&gt;Tor exit node lists change constantly, so the value of running this once is approximately zero. The value comes from running it on a schedule. The repo's getting-started has a cron block I use myself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Twice daily, skip the dan.me.uk source to avoid its rate limit
0 3,15 * * * /opt/tor-shield/setup.sh --skip-additional &amp;lt; /dev/null &amp;gt;&amp;gt; /var/log/torshield.log 2&amp;gt;&amp;amp;1

# Once a week, full enrichment from all three sources
0 4 * * 0 /opt/tor-shield/setup.sh &amp;lt; /dev/null &amp;gt;&amp;gt; /var/log/torshield.log 2&amp;gt;&amp;amp;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;&amp;lt; /dev/null&lt;/code&gt; is there because the script asks for confirmation when it detects an existing setup and cron has no TTY to type "yes" into. The &lt;code&gt;--skip-additional&lt;/code&gt; flag exists specifically because dan.me.uk rate-limits and will quietly start serving you HTML errors if you hit it more than once a day. Twice-daily refresh from the canonical sources, weekly enrichment from all three, log to a file, rotate weekly. That is the whole automation.&lt;/p&gt;

&lt;p&gt;If you ever need to back out, there are two ways. &lt;code&gt;sudo ./setup.sh --rollback&lt;/code&gt; restores the most recent backup. Or, the manual nuclear path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables  &lt;span class="nt"&gt;-D&lt;/span&gt; INPUT       &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--match-set&lt;/span&gt; tor  src &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;span class="nb"&gt;sudo &lt;/span&gt;iptables  &lt;span class="nt"&gt;-D&lt;/span&gt; DOCKER-USER &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--match-set&lt;/span&gt; tor  src &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;span class="nb"&gt;sudo &lt;/span&gt;ip6tables &lt;span class="nt"&gt;-D&lt;/span&gt; INPUT       &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--match-set&lt;/span&gt; tor6 src &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;span class="nb"&gt;sudo &lt;/span&gt;ipset destroy tor
&lt;span class="nb"&gt;sudo &lt;/span&gt;ipset destroy tor6
&lt;span class="nb"&gt;sudo &lt;/span&gt;netfilter-persistent save
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That hand-removes the rules and the sets. The backups stay in &lt;code&gt;/var/backups/tor-block/&lt;/code&gt; either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am taking away
&lt;/h2&gt;

&lt;p&gt;Three things, then I am out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An IPv4-only Tor block is theatre on a dual-stack box.&lt;/strong&gt; I had a perfectly engineered IPv4 firewall: atomic swap, validation, rollback, the lot. The counter sat at zero because the actual traffic walked in over IPv6. If you only block one stack and your origin answers on both, you have not blocked Tor. You have blocked the IPv4 half of Tor and labelled the box "secure". Next time you stand up any list-driven firewall, do v4 and v6 in the same change, or do not bother yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test by being the threat.&lt;/strong&gt; I would have caught this in five minutes if my first action after deploying had been to open Tor Browser and watch the counter, instead of reading my own log lines and feeling good about the deploy. "Did the rule install" is not "is the rule blocking". &lt;code&gt;pkts 0 bytes 0&lt;/code&gt; on a rule that should be popping is louder than any green log line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No single Tor list is complete.&lt;/strong&gt; The bulk exit list is IPv4. Onionoo is sparse on v6. dan.me.uk rate-limits. The way to get reasonable coverage is to merge several sources, dedupe, and accept that the union is bigger than any one feed will ever be. That is what TorShield does, and that is what kept it useful past day one.&lt;/p&gt;

&lt;p&gt;If you run a SaaS, an internal API, or anything with no legitimate Tor user, &lt;a href="https://github.com/vineethkrishnan/tor-shield" rel="noopener noreferrer"&gt;TorShield&lt;/a&gt; is on GitHub. Clone it, run &lt;code&gt;--precheck&lt;/code&gt;, drop the cron in. If you find a gap, or a better source, pull requests are welcome. Otherwise, see you when the next thing breaks in an interesting way.&lt;/p&gt;

</description>
      <category>tor</category>
      <category>iptables</category>
      <category>ipset</category>
      <category>ipv6</category>
    </item>
    <item>
      <title>The node_modules That Wouldn't Die</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:06:19 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-nodemodules-that-wouldnt-die-f35</link>
      <guid>https://forem.com/vineethnkrishnan/the-nodemodules-that-wouldnt-die-f35</guid>
      <description>&lt;h1&gt;
  
  
  The node_modules That Wouldn't Die
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fstale-node-modules-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fstale-node-modules-hero.png" alt="A ghostly translucent folder labelled node_modules hovering above a dusty old server rack, cobwebs in the corners, faint blue glow lighting the room." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - An internal app of mine refused to deploy because the build kept importing the wrong version of a Vite plugin. The lockfile said one thing, the build was doing another. I blamed the codegen. Then I blamed git. Both times I was wrong. The actual culprit was a &lt;code&gt;node_modules&lt;/code&gt; directory sitting on the deploy host from a previous era of the project, surviving every &lt;code&gt;git reset --hard&lt;/code&gt; because it was never tracked in the first place. Once I cleared that out, the build broke a second time for almost the same reason. Here is the story.&lt;/p&gt;

&lt;h2&gt;
  
  
  The error that started it
&lt;/h2&gt;

&lt;p&gt;Deploy of an internal app of mine fails at the build step with this beauty:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SyntaxError: The requested module './chunk-XYZ.js' does not provide an export named 'tanstackRouter'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I knew this one. &lt;code&gt;@tanstack/router-plugin&lt;/code&gt; renamed its main export from &lt;code&gt;TanStackRouterVite&lt;/code&gt; to &lt;code&gt;tanstackRouter&lt;/code&gt; at some point. The lockfile on &lt;code&gt;main&lt;/code&gt; was pinned to a version where the new name was correct. The Vite config was importing the new name. Everything on my machine was happy.&lt;/p&gt;

&lt;p&gt;So why was the live host trying to call the new name on an older module that did not export it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Suspect one, the codegen
&lt;/h2&gt;

&lt;p&gt;The app uses Orval to generate its API client off a Swagger spec. My first thought was that one of those generated files was importing the plugin somehow, and that the codegen had drifted on the host. I went hunting through the generated output. Nothing there even touched Vite plugins.&lt;/p&gt;

&lt;p&gt;Dead end. Time wasted. Moving on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Suspect two, git not really resetting
&lt;/h2&gt;

&lt;p&gt;The deploy script does &lt;code&gt;git fetch &amp;amp;&amp;amp; git reset --hard origin/main&lt;/code&gt; before building. So I started suspecting the reset was not really happening. Maybe the script was running in the wrong directory. Maybe the working tree was somehow detached and the reset was a no-op. I sshed in, ran the commands by hand, watched them tell me everything was clean.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who has stared at a "nothing to commit, working tree clean" and refused to believe it.&lt;/p&gt;

&lt;p&gt;The tree was clean. The lockfile was right. So what was I building from?&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual culprit
&lt;/h2&gt;

&lt;p&gt;Here is the line in the Dockerfile that I had not been thinking hard enough about:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; . .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That copies everything in the build context into the image. Including &lt;code&gt;node_modules&lt;/code&gt; if one happens to be sitting in the build context.&lt;/p&gt;

&lt;p&gt;And here is what I had completely forgotten about &lt;code&gt;git reset --hard&lt;/code&gt;. It does not delete untracked files. Neither does &lt;code&gt;git checkout -f&lt;/code&gt;. Both will happily clobber tracked files back to their committed state. But anything that was never committed in the first place is invisible to them. It just sits there. Forever. Quietly.&lt;/p&gt;

&lt;p&gt;Sitting on the deploy host, undisturbed across who knows how many deploys, was a &lt;code&gt;node_modules&lt;/code&gt; directory from a much older incarnation of the project. The &lt;code&gt;pnpm install&lt;/code&gt; step inside the Dockerfile was running, sure. But &lt;code&gt;COPY . .&lt;/code&gt; ran first and dropped a years-old &lt;code&gt;node_modules&lt;/code&gt; into the image, and whatever pnpm did on top of that was not enough to overwrite the bits that mattered. The version of &lt;code&gt;@tanstack/router-plugin&lt;/code&gt; that ended up in the final image was the one that had been sitting on the host since the previous era, where the export was still called &lt;code&gt;TanStackRouterVite&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A folder older than the bug. Quietly winning every deploy.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cleanup that broke things again
&lt;/h2&gt;

&lt;p&gt;Easy fix, right? &lt;code&gt;rm -rf node_modules&lt;/code&gt; on the host, redeploy, done.&lt;/p&gt;

&lt;p&gt;The build broke again, with a missing API client file this time. And then I saw it: the same blind spot was hiding two more freeloaders. The Orval output directory and a generated &lt;code&gt;swagger.json&lt;/code&gt;, both gitignored, both supposed to be regenerated by the build, were also surviving across deploys. They had been sitting on the host so long that nobody had noticed the build itself never actually ran the generators properly. The host filesystem was the only reason the app had a working API client at all.&lt;/p&gt;

&lt;p&gt;So I cleaned those out too, and then fixed the actual generation step in the Dockerfile. Because if a fresh checkout of the repo into a clean container could not produce a working build, that was the real problem all along.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed
&lt;/h2&gt;

&lt;p&gt;Three small things, none of them clever.&lt;/p&gt;

&lt;p&gt;A proper &lt;code&gt;.dockerignore&lt;/code&gt; in the repo. &lt;code&gt;node_modules&lt;/code&gt;, &lt;code&gt;dist&lt;/code&gt;, and the generated client directories all listed. The build context never sees the host's leftovers again.&lt;/p&gt;
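
&lt;p&gt;Mine looks roughly like this; adjust the generated paths to your own layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# .dockerignore
node_modules
dist
.git
src/api/generated/
swagger.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
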

&lt;p&gt;The Dockerfile now runs the generators itself. The API client is produced inside the build, off a &lt;code&gt;swagger.json&lt;/code&gt; that is also generated inside the build. No host artifact is load-bearing.&lt;/p&gt;
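
&lt;p&gt;The relevant part of the Dockerfile now looks something like this; the script names are placeholders for whatever your generators are called:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;# Dependencies first, driven by the lockfile, cache-friendly
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile

# Source next; .dockerignore keeps host leftovers out of the context
COPY . .

# Generators run inside the build, so no host artifact is load-bearing
RUN pnpm run generate:swagger &amp;amp;&amp;amp; pnpm run generate:client
RUN pnpm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
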

&lt;p&gt;One full cleanup of the deploy host, by hand, of every gitignored thing. Then a redeploy from scratch. It worked on the first try, which felt suspicious until I remembered that is what builds are supposed to do.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lesson
&lt;/h2&gt;

&lt;p&gt;A long-lived deploy host is a museum. Every gitignored thing you have ever built on it is still there unless you actively remove it. &lt;code&gt;git pull&lt;/code&gt;, &lt;code&gt;git reset&lt;/code&gt;, even &lt;code&gt;git clean&lt;/code&gt; without the &lt;code&gt;-x&lt;/code&gt; flag that includes ignored files, none of them touch the museum. Your Dockerfile does not know it is being lied to. Your lockfile does not know it is being overruled. The build just shrugs and ships you whatever the host happens to be wearing that day.&lt;/p&gt;

&lt;p&gt;Two rules from now on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anything gitignored is regenerated, never inherited.&lt;/strong&gt; If your build relies on a file the repo does not track, that file must be produced inside the build. Period. If you are shrugging at this rule because "it has been working fine", that is exactly what I was doing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;.dockerignore&lt;/code&gt; is not optional.&lt;/strong&gt; Without it, your build context is a snapshot of whatever weird state the host has accumulated, and &lt;code&gt;COPY . .&lt;/code&gt; is a great way to ship that weirdness into your image.&lt;/p&gt;

&lt;p&gt;The whole fiasco was three cleanups, an embarrassing number of wrong guesses, and a lesson I should have learned the first time I saw &lt;code&gt;git reset --hard&lt;/code&gt; and assumed it meant what it sounds like. It does not. Untracked is invisible.&lt;/p&gt;

&lt;p&gt;Not going to pretend this was a perfect writeup. But if even one part of it helped someone avoid the headache I went through, then it was worth putting down. See you in the next one.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>deployment</category>
      <category>git</category>
      <category>cicd</category>
    </item>
    <item>
      <title>The Sentry signup nobody could finish</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Tue, 28 Apr 2026 14:48:12 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-sentry-signup-nobody-could-finish-1o3p</link>
      <guid>https://forem.com/vineethnkrishnan/the-sentry-signup-nobody-could-finish-1o3p</guid>
      <description>&lt;h1&gt;
  
  
  The Sentry signup nobody could finish
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fsentry-invite-skipped-gmail-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fsentry-invite-skipped-gmail-hero.png" alt="2 envelopes flying toward 2 different houses at sunset, one slipping cleanly through an open mail slot of a small cottage, the other being silently bounced back from a stern security gate of a fortified mansion, cinematic illustration, warm side light, soft depth of field."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - A teammate pinged me on Slack saying he had signed up on our self-hosted Sentry but never got the verification email. I assumed PEBKAC because I had been receiving Sentry mail just fine for as long as I could remember. So I went and signed up myself from a Workspace account, and sure enough, nothing arrived. The bundled Exim container in our Sentry stack had been failing DMARC against every strict mail provider for a long time. 26 frozen messages were sitting in the queue waiting to bounce. The reason I had never noticed is that my own mailbox is on a lenient provider that does not enforce DMARC, so I had been getting Sentry mail the whole time while everyone else got nothing. The shell trick I used to get my own account in worked beautifully. The same trick for my teammate did not. This post is the whole arc, ending in the one shell command that actually got him in.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ping
&lt;/h2&gt;

&lt;p&gt;It was a perfectly ordinary message on Slack from a colleague.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I signed up on our self-hosted Sentry but I am not getting any email."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I almost told him to check spam. Sentry sends mail to me regularly, my weekly reports show up, alert emails show up, password reminders show up. So my first instinct was that he had typed the wrong address or his Workspace was filing things into a folder he had not opened.&lt;/p&gt;

&lt;p&gt;Before I sent that reply I caught myself. He is not new to email. He had checked the obvious places. If a teammate tells you twice that an email is not arriving, the email is not arriving.&lt;/p&gt;

&lt;p&gt;So I went and looked.&lt;/p&gt;

&lt;h2&gt;
  
  
  What was actually happening on the Sentry box
&lt;/h2&gt;

&lt;p&gt;Self-hosted Sentry runs its own little mail stack inside the Compose file. There is a bundled &lt;code&gt;smtp&lt;/code&gt; service that is just an Exim container. The &lt;code&gt;web&lt;/code&gt; and &lt;code&gt;worker&lt;/code&gt; containers hand outbound mail to it, and Exim delivers direct-to-MX for whatever recipient domain the message is bound to. Out of the box, no relay, no authentication, no DKIM signing.&lt;/p&gt;

&lt;p&gt;A read-only walk through the running stack confirmed exactly that.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; web sentry config get mail.backend
docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; web sentry config get mail.host
docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; web sentry config get mail.from
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;mail.backend&lt;/code&gt; was &lt;code&gt;smtp&lt;/code&gt;. &lt;code&gt;mail.host&lt;/code&gt; was literally &lt;code&gt;smtp&lt;/code&gt;, the bundled Exim container in the same Compose file. &lt;code&gt;mail.from&lt;/code&gt; was &lt;code&gt;sentry@mycompany.com&lt;/code&gt;. So Sentry was handing every outbound message to local Exim, which was then trying to deliver it itself, with no authenticated relay anywhere in the picture.&lt;/p&gt;

&lt;p&gt;The Exim main.log made the rest of the story clear in 3 lines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight email"&gt;&lt;code&gt;&lt;span class="nt"&gt;** signup-user@mycompany.com
   R=dnslookup T=remote_smtp H=aspmx.l.google.com
   550-5.7.26 Unauthenticated email from mycompany.com is not accepted
              due to domain's DMARC policy.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Google was rejecting the message at SMTP time. The reason given was DMARC. To know what that meant in our case I had to pull 3 TXT records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig +short TXT mycompany.com
dig +short TXT _dmarc.mycompany.com
dig +short TXT default._domainkey.mycompany.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What came back is the strictest DMARC configuration you can ship.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;SPF&lt;/span&gt;:    &lt;span class="n"&gt;v&lt;/span&gt;=&lt;span class="n"&gt;spf1&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;:&lt;span class="err"&gt;_&lt;/span&gt;&lt;span class="n"&gt;spf&lt;/span&gt;.&lt;span class="n"&gt;google&lt;/span&gt;.&lt;span class="n"&gt;com&lt;/span&gt; &lt;span class="n"&gt;include&lt;/span&gt;:&amp;lt;&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;few&lt;/span&gt; &lt;span class="n"&gt;marketing&lt;/span&gt;
        &lt;span class="n"&gt;and&lt;/span&gt; &lt;span class="n"&gt;ticketing&lt;/span&gt; &lt;span class="n"&gt;services&lt;/span&gt; &lt;span class="n"&gt;we&lt;/span&gt; &lt;span class="n"&gt;use&lt;/span&gt;&amp;gt; ~&lt;span class="n"&gt;all&lt;/span&gt;
&lt;span class="n"&gt;DMARC&lt;/span&gt;:  &lt;span class="n"&gt;v&lt;/span&gt;=&lt;span class="n"&gt;DMARC1&lt;/span&gt;; &lt;span class="n"&gt;p&lt;/span&gt;=&lt;span class="n"&gt;reject&lt;/span&gt;; &lt;span class="n"&gt;sp&lt;/span&gt;=&lt;span class="n"&gt;reject&lt;/span&gt;; &lt;span class="n"&gt;adkim&lt;/span&gt;=&lt;span class="n"&gt;s&lt;/span&gt;; &lt;span class="n"&gt;aspf&lt;/span&gt;=&lt;span class="n"&gt;s&lt;/span&gt;; &lt;span class="n"&gt;pct&lt;/span&gt;=&lt;span class="m"&gt;100&lt;/span&gt;; ...
&lt;span class="n"&gt;DKIM&lt;/span&gt;:   (&lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;, &lt;span class="n"&gt;but&lt;/span&gt; &lt;span class="n"&gt;only&lt;/span&gt; &lt;span class="n"&gt;for&lt;/span&gt; &lt;span class="n"&gt;Google&lt;/span&gt;&lt;span class="err"&gt;'&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;p=reject&lt;/code&gt; means "drop anything that fails". &lt;code&gt;pct=100&lt;/code&gt; means "every message, no sampling". &lt;code&gt;adkim=s&lt;/code&gt; and &lt;code&gt;aspf=s&lt;/code&gt; mean "the From domain has to align exactly". And SPF lists Google plus a couple of outbound services as the only authorised senders. Our Sentry server is not in any of those includes. The bundled Exim does not DKIM-sign. So mail leaving Sentry has neither a passing SPF nor a passing DKIM, and DMARC drops it on the floor. That is exactly what the &lt;code&gt;550-5.7.26&lt;/code&gt; line was telling me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bounces piling up sideways
&lt;/h2&gt;

&lt;p&gt;There was a second mess sitting next to the first. The Exim queue was holding 26 "frozen" messages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nb"&gt;exec &lt;/span&gt;smtp exim &lt;span class="nt"&gt;-bp&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Frozen, in Exim speak, means "I tried to deliver this and gave up, and I cannot even bounce it back to the sender". The original signup mails had &lt;code&gt;MAIL FROM: sentry@mycompany.com&lt;/code&gt;. That mailbox does not exist on our Workspace. So when Google rejected the original message, Exim dutifully tried to send a Delivery Status Notification to &lt;code&gt;sentry@mycompany.com&lt;/code&gt;, and Google rejected that too with &lt;code&gt;550-5.1.1 ... NoSuchUser&lt;/code&gt;. The DSNs had nowhere to go, and Exim parked them.&lt;/p&gt;

&lt;p&gt;2 independent failures wearing the same costume. Outbound mail failing DMARC. Inbound bounce notifications failing because the configured From has no mailbox. 26 of them sitting in line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lie my inbox had been telling me
&lt;/h2&gt;

&lt;p&gt;This is the part I want to dwell on.&lt;/p&gt;

&lt;p&gt;I had been receiving Sentry email forever. Weekly reports. Alert pings. Everything. So when a colleague said he was not getting mail, my prior was strongly that something on his side was wrong.&lt;/p&gt;

&lt;p&gt;Both things were true at the same time. Sentry was sending mail. Sentry was failing DMARC against every strict provider. The reason I was getting it and he was not is that my personal mailbox sits on a small mail host that does not enforce DMARC strictly. It accepts unauthenticated mail. Google does not. So I had a working pipe to my own inbox and a completely broken pipe to every Workspace inbox in the company, and there was no symptom anywhere I would have looked.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who has assumed something works because it works &lt;em&gt;for me&lt;/em&gt;, and missed a problem the rest of the team has been quietly living with for months.&lt;/p&gt;

&lt;p&gt;To prove this to myself I tried the signup flow again from a Workspace account I have access to. Same outcome. No email. Exim log showed the same DMARC reject line. The colleague was right. This had been broken for everyone except me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix I could not apply
&lt;/h2&gt;

&lt;p&gt;The clean answer is the obvious one. Stop sending direct-to-MX. Send through an authenticated relay that is allowed to sign mail for &lt;code&gt;mycompany.com&lt;/code&gt;. Google Workspace SMTP relay, SendGrid, Mailgun, anything that authenticates on the way in and DKIM-signs on the way out. With that in place, SPF passes, DKIM passes, DMARC aligns, Google delivers, life is good.&lt;/p&gt;
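
&lt;p&gt;For reference, pointing self-hosted Sentry at an authenticated relay is a handful of keys in &lt;code&gt;sentry/config.yml&lt;/code&gt;; these values are placeholders, not our real setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;mail.backend: 'smtp'
mail.host: 'smtp-relay.gmail.com'  # or your SendGrid / Mailgun endpoint
mail.port: 587
mail.username: 'sentry@mycompany.com'
mail.password: '&amp;lt;app password or relay API key&amp;gt;'
mail.use-tls: true
mail.from: 'sentry@mycompany.com'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
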

&lt;p&gt;What that needs from me is admin access to the Workspace console and the DNS provider. I have neither. Both are locked down on a separate account, which means the proper fix is a ticket through someone else's queue. The colleague waiting to get into Sentry does not particularly care about the reasons.&lt;/p&gt;

&lt;p&gt;So I went looking for a way to onboard him today, by hand, while the proper email fix waits its turn.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pulling the invite token out of Sentry directly
&lt;/h2&gt;

&lt;p&gt;Self-hosted Sentry's UI sometimes shows a "Copy invite link" action on each pending invite. On our version it does not. Only "Resend" is exposed. So you reach for the shell. Sentry has a pending invite stored as an &lt;code&gt;OrganizationMember&lt;/code&gt; row, complete with an unused token. You can read that out and assemble the accept URL yourself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; web sentry &lt;span class="nb"&gt;exec&lt;/span&gt; - &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="no"&gt;PY&lt;/span&gt;&lt;span class="sh"&gt;'
from sentry.models.organizationmember import OrganizationMember

email = "me@mycompany.com"

members = OrganizationMember.objects.filter(email=email, user_id__isnull=True)
for member in members:
    print(f"org={member.organization.slug}  id={member.id}")
    print(f"link={member.get_invite_link()}")
&lt;/span&gt;&lt;span class="no"&gt;PY
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;sentry exec -&lt;/code&gt; runs a Python snippet against the Sentry web process without dropping you into the interactive shell. The filter &lt;code&gt;user_id__isnull=True&lt;/code&gt; keeps it to invites that have not been accepted yet. The output is the URL you would have received in the email.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;org&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;mycompany  id=16&lt;/span&gt;
&lt;span class="py"&gt;link&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;https://sentry.mycompany.com/accept/16/&amp;lt;token&amp;gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I built the URL, opened it in the Workspace account I had been testing with, and got into Sentry. The accept link redirected to a login page, the page showed a Register tab next to Sign in, I registered through it, and the pending invite auto-bound to my new user on signup. Total time about 5 minutes. Treat the URL like a credential, by the way, because anyone who has it can claim that membership until it is used.&lt;/p&gt;

&lt;p&gt;That worked, so I did the same for the teammate. Pulled his invite link from the same shell. Sent it on a private DM. Calmly went back to my day-to-day work.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the same trick failed for the next person
&lt;/h2&gt;

&lt;p&gt;The Slack ping came back fairly quickly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It is not working. There is no Register or Signup option."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;He sent a screenshot.&lt;/p&gt;

&lt;p&gt;He was right. The link took him to the login page and there was nowhere to register. The same URL shape that had worked for me had no Register tab on his side. I rotated the token. Same thing. Created a fresh invite. Same thing. Whatever flow had worked for me 20 minutes ago was just not appearing for him.&lt;/p&gt;

&lt;p&gt;I will be honest, this is where I sat back in my chair. We had already burnt enough time on this. The clean thing to do was stop trying to make the invite flow work and just create his account directly. He could change the password the moment he got in.&lt;/p&gt;

&lt;p&gt;So I told him I would set him up on the server side and DM him a temp password.&lt;/p&gt;

&lt;h2&gt;
  
  
  The conflict in the database
&lt;/h2&gt;

&lt;p&gt;Before running &lt;code&gt;createuser&lt;/code&gt; I went back into the Sentry shell to see why the link approach had refused to play ball. When I looked at the rows for his email, there were extra entries. Old &lt;code&gt;OrganizationMember&lt;/code&gt; rows from earlier invite attempts, in a state that was confusing the accept flow. The token I had pulled was for the most recent row, but the older rows were tangled up in there too, and Sentry was not reliably attaching the invite token to the session in the redirect.&lt;/p&gt;

&lt;p&gt;I cleaned up the duplicates first. One pending member row, no orphaned entries, no half-claimed users.&lt;/p&gt;
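
&lt;p&gt;The cleanup used the same &lt;code&gt;sentry exec -&lt;/code&gt; pattern as before. Roughly this shape, keeping the newest pending row and deleting the stale ones (a sketch, not the exact statements I ran):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose exec -T web sentry exec - &amp;lt;&amp;lt;'PY'
from sentry.models.organizationmember import OrganizationMember

email = "mycolleague@&amp;lt;workspace&amp;gt;.com"

# newest first; keep rows[0], delete the older leftovers
rows = OrganizationMember.objects.filter(email=email, user_id__isnull=True).order_by("-id")
for stale in list(rows[1:]):
    print(f"deleting stale member row id={stale.id}")
    stale.delete()
PY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
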

&lt;p&gt;Then ran the one command that would have saved me an hour if I had reached for it sooner.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-T&lt;/span&gt; web sentry createuser &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--email&lt;/span&gt; mycolleague@&amp;lt;workspace&amp;gt;.com &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;temp password&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-superuser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That created the user account directly. Active, password set, ready to log in. No email, no token, no redirect dance. Sentry sees the matching email on first login, finds the pending &lt;code&gt;OrganizationMember&lt;/code&gt; row, binds them automatically, and the user shows up as a normal member with the role from the original invite.&lt;/p&gt;

&lt;p&gt;A quick sanity check after that, just to be sure I had not left any stale state behind.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentry.models.user&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentry.models.useremail&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UserEmail&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sentry.models.organizationmember&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OrganizationMember&lt;/span&gt;

&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mycolleague@&amp;lt;workspace&amp;gt;.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;User&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_emails:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UserEmail&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;members:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrganizationMember&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;objects&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One of each. Clean state. I sent him the login URL, the email, and the temp password in a DM, and told him to change the password from Account Settings the moment he got in. He did. Account works. Project access works. Done.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am taking away
&lt;/h2&gt;

&lt;p&gt;Three things, then I am out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A self-hosted thing that sends mail "directly" is a half-broken thing.&lt;/strong&gt; The bundled Exim container in self-hosted Sentry will keep dispatching messages forever, and a benevolent ISP-grade mail host will keep accepting some of them, and you will keep believing things work. They do not. The first day a Workspace user needs an email from it, the whole thing falls apart. If you run anything self-hosted that sends email, point it at an authenticated relay on day one, even if you "do not need email yet". You will, and finding out at 3 in the afternoon is not the moment to set up SPF.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"It works for me" can be a lie your own inbox is telling you.&lt;/strong&gt; Strict DMARC enforcement is a per-recipient choice. If your "evidence" of working email is one mailbox on a lenient provider, that is not evidence at all, that is survivorship bias. To check whether your mail setup is healthy, send a test message to a Gmail or a Microsoft 365 address and read the headers. The &lt;code&gt;Authentication-Results&lt;/code&gt; line will tell you immediately whether SPF, DKIM and DMARC pass.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reach for &lt;code&gt;createuser&lt;/code&gt; sooner.&lt;/strong&gt; When the pretty invite-link flow refuses to cooperate, do not spend an hour rotating tokens and chasing redirects. Self-hosted apps almost always have a backdoor command that does the thing directly. &lt;code&gt;sentry createuser&lt;/code&gt;, plus a quick check that the database does not have stale rows, would have saved me a chunk of time. I will reach for it first next time.&lt;/p&gt;

&lt;p&gt;So that is where I will stop on this one. If you have a different way of catching this kind of silent regression in your own self-hosted setup, I genuinely want to hear it - drop me a note. Otherwise, see you when the next interesting problem shows up.&lt;/p&gt;

</description>
      <category>sentry</category>
      <category>selfhosted</category>
      <category>dmarc</category>
      <category>smtp</category>
    </item>
    <item>
      <title>The sed that didn't stick</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Mon, 27 Apr 2026 14:41:16 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/the-sed-that-didnt-stick-49jm</link>
      <guid>https://forem.com/vineethnkrishnan/the-sed-that-didnt-stick-49jm</guid>
      <description>&lt;h1&gt;
  
  
  The sed that didn't stick
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fsed-hotfix-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fsed-hotfix-hero.png" alt="A man at a wooden desk staring at his MacBook Air as a list of project backups runs on screen, one row at the bottom marked FAILED in red, warm desk lamp light."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - The nightly backup on one of my self-hosted servers kept failing. I patched the running container with a single &lt;code&gt;sed&lt;/code&gt; command, ran the backup by hand, watched it succeed, and went to bed thinking I had it. The next morning's cron run failed all over again. Node's &lt;code&gt;require&lt;/code&gt; cache had quietly held on to the version it had loaded into memory at container start, and never read the patched file from disk. Fixing it the proper way then exposed a second problem: my production runtime image strips &lt;code&gt;npx&lt;/code&gt; for safety, so the upgrade migration step fell over the moment it had something to do. This is the story of both, and the small migrator Docker stage I added so neither one bites me again.&lt;/p&gt;

&lt;h2&gt;
  
  
  The cron that kept failing
&lt;/h2&gt;

&lt;p&gt;So there I was, opening the audit log on a quiet morning expecting another row of green ticks. Instead, a wall of red.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Command &lt;span class="s2"&gt;"pg_dump"&lt;/span&gt; failed: Command failed: pg_dump &lt;span class="nt"&gt;--host&lt;/span&gt; postgres
  &lt;span class="nt"&gt;--port&lt;/span&gt; 5432 &lt;span class="nt"&gt;--username&lt;/span&gt; psql-user &lt;span class="nt"&gt;--dbname&lt;/span&gt; myapp
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="nt"&gt;--file&lt;/span&gt; /data/backups/myapp/myapp_backup_20260418_040000.dump
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same error every night. The database in question was around 2 GB, not huge by anyone's standards but big enough that on a slow link the dump would crawl. The pattern made sense once I saw it. &lt;code&gt;pg_dump&lt;/code&gt; would start, run for a while, and then &lt;code&gt;backupctl&lt;/code&gt; would kill it because my own tool had a five-minute child-process timeout baked in.&lt;/p&gt;

&lt;p&gt;So that part was easy to diagnose. My helper had a &lt;code&gt;timeout = 300000&lt;/code&gt; sitting in the compiled JS at &lt;code&gt;/app/dist/common/helpers/child-process.util.js&lt;/code&gt;, and the real fix was to bump that number, recompile, and ship a new image.&lt;/p&gt;

&lt;p&gt;I did not have time for a release cycle that night.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sed that worked, for exactly one run
&lt;/h2&gt;

&lt;p&gt;Here is what I reached for, the way you would reach for a screwdriver in your kitchen drawer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; backupctl &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s1"&gt;'s/timeout = 300000/timeout = 1800000/'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /app/dist/common/helpers/child-process.util.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes to thirty. One line. No restart, no rebuild, no release. I ran &lt;code&gt;backupctl run myapp&lt;/code&gt; from the host. It chugged along for a bit, finished cleanly, the restic snapshot landed on the storage box, the Slack message fired, a clean green row in the audit table. I closed the laptop.&lt;/p&gt;

&lt;p&gt;The next morning, the 4 AM cron had failed. Same error. Same dump file. Same five-minute kill.&lt;/p&gt;

&lt;p&gt;I went back and checked the file inside the container. The patched line was &lt;em&gt;still there&lt;/em&gt;. &lt;code&gt;sed&lt;/code&gt; had done its job. The 1800000 was sitting in the bytes on disk. The scheduler running inside the same container was somehow ignoring it.&lt;/p&gt;
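
&lt;p&gt;For the record, that check is one grep away:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# confirm what the bytes on disk actually say
docker exec backupctl grep -n 'timeout =' /app/dist/common/helpers/child-process.util.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
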

&lt;p&gt;Tell me I am not the only one who has stared at a file with the right content while the running process insists it is wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the manual run worked but the cron did not
&lt;/h2&gt;

&lt;p&gt;The thing I had not been thinking about, and should have been, is how Node loads code.&lt;/p&gt;

&lt;p&gt;When the &lt;code&gt;backupctl&lt;/code&gt; container starts, NestJS boots up, and along the way Node reads &lt;code&gt;child-process.util.js&lt;/code&gt; from disk and parses it into memory. The &lt;code&gt;require()&lt;/code&gt; call that pulled it in is cached, by module path, for the lifetime of that process. From that point on, every other file inside the running app that asks for the helper gets the same in-memory object back. The disk version stops mattering.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sed&lt;/code&gt; had patched the disk. The long-running scheduler process inside the container was still using the parsed-and-cached version it had loaded at container start. It would happily go on using that cached version until the process died.&lt;/p&gt;

&lt;p&gt;The reason the manual &lt;code&gt;backupctl run&lt;/code&gt; had worked is the part I had missed at the time. The CLI command does not run inside the long-lived NestJS process. It spawns a fresh Node process, which loads the helper from disk, which is the patched version. So the manual run picked up the new timeout. The scheduler, sitting in the long-running process from before the patch, never did.&lt;/p&gt;

&lt;p&gt;Two different processes. Same container. Same file on disk. Different versions in memory.&lt;/p&gt;
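
&lt;p&gt;If you want to watch the cache do this in isolation, here is a minimal sketch you can run anywhere Node is installed, no container needed (paths hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mkdir -p /tmp/require-cache-demo &amp;amp;&amp;amp; cd /tmp/require-cache-demo
echo 'module.exports = { timeout: 300000 };' &amp;gt; helper.js
node -e '
const first = require("./helper.js");
console.log("loaded:", first.timeout);        // 300000
// patch the file on disk, exactly what sed did
require("fs").writeFileSync("./helper.js", "module.exports = { timeout: 1800000 };");
const second = require("./helper.js");
console.log("re-required:", second.timeout);  // still 300000, straight from the cache
'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
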

&lt;h2&gt;
  
  
  What I should have done from the start
&lt;/h2&gt;

&lt;p&gt;The proper fix was boring. Pull the next release that had the timeout configurable, restart the container so the scheduler picks up the new code, done.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;backupctl-manage.sh upgrade&lt;/code&gt; is the script I have for exactly this. Pull the new image, run any migrations, recreate the container, run a smoke test, fire a notification. So I ran it.&lt;/p&gt;

&lt;p&gt;And then the next thing broke.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second surprise: npx, missing in action
&lt;/h2&gt;

&lt;p&gt;The upgrade script chugged through its checklist, and then died on this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;5/7] Running database migrations
OCI runtime &lt;span class="nb"&gt;exec &lt;/span&gt;failed: &lt;span class="nb"&gt;exec &lt;/span&gt;failed: unable to start container process:
  &lt;span class="nb"&gt;exec&lt;/span&gt;: &lt;span class="s2"&gt;"npx"&lt;/span&gt;: executable file not found &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$PATH&lt;/span&gt;: unknown
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a moment I thought I had pulled the wrong image. I had not. The error was perfectly correct.&lt;/p&gt;

&lt;p&gt;A while back, when I was tightening up the production Docker image, I had added a line near the end of the runtime stage that strips npm and npx out of the final layer. Something close to this in the Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; /usr/local/lib/node_modules/npm &lt;span class="se"&gt;\
&lt;/span&gt;           /usr/local/bin/npm &lt;span class="se"&gt;\
&lt;/span&gt;           /usr/local/bin/npx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reasoning was simple enough. Production does not need a package manager. Pulling npm out makes the runtime image smaller, and gives anyone who breaks into it less to work with. Both genuine wins.&lt;/p&gt;

&lt;p&gt;Except my migration step was literally this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;backupctl npx typeorm migration:run &lt;span class="nt"&gt;-d&lt;/span&gt; dist/db/datasource.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script had been written before the npm strip. The two of them had never met in the wild because there had not been any new migrations to run since I added the strip. Now the upgrade finally hit the migration step, and the strip ate it alive, exactly as the error says. I got lucky on this run too: when I checked the audit DB, both migrations the new image carried were already applied, so the runner would have been a no-op even if it had worked. Pure luck.&lt;/p&gt;

&lt;p&gt;So my migration step had been quietly broken for who knows how long. That stops being acceptable the moment the next release actually adds a migration.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migrator stage
&lt;/h2&gt;

&lt;p&gt;The fix I went with is a separate Docker stage, sitting beside the runtime image, that exists only to run migrations.&lt;/p&gt;

&lt;p&gt;Here is the shape of it inside the same Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Migrator stage: kept around so production migrations have npm/npx&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:20-alpine3.22&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;migrator&lt;/span&gt;
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;ENV&lt;/span&gt;&lt;span class="s"&gt; NODE_ENV=production&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=deps /app/node_modules ./node_modules/&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/dist ./dist/&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["npx", "typeorm", "migration:run", "-d", "dist/db/datasource.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It reuses the install and build stages. It still has npm and npx because nothing strips them. It is opt-in via a Compose profile, so the default &lt;code&gt;docker compose up -d&lt;/code&gt; does not start it. It runs once, exits, and gets cleaned up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;migrator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Dockerfile&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migrator&lt;/span&gt;
    &lt;span class="na"&gt;profiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migrate"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no"&lt;/span&gt;
    &lt;span class="c1"&gt;# ...env, network, depends_on&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the upgrade script changed from this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;backupctl npx typeorm migration:run &lt;span class="nt"&gt;-d&lt;/span&gt; dist/db/datasource.js
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose &lt;span class="nt"&gt;--profile&lt;/span&gt; migrate run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--build&lt;/span&gt; migrator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--profile migrate&lt;/code&gt; activates the new service. &lt;code&gt;run --rm&lt;/code&gt; boots a one-off container, lets it run the migrations, and removes it on exit. &lt;code&gt;--build&lt;/code&gt; makes sure the migrator image is fresh against whatever release the upgrade is rolling out. Same one-line invocation, but now backed by an image that actually has the tools it needs.&lt;/p&gt;

&lt;p&gt;One small detail I tripped on while wiring this up. I had originally added &lt;code&gt;container_name: backupctl-migrator&lt;/code&gt; to the Compose service. &lt;code&gt;docker compose run --rm&lt;/code&gt; generates its own ephemeral container name, and a hard-coded &lt;code&gt;container_name&lt;/code&gt; will trip over itself the moment a previous run lingers. Drop the field, let Compose name the container, problem gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual in dev, automatic in prod, on purpose
&lt;/h2&gt;

&lt;p&gt;There is one detail I want to call out, because it took me a beat to get comfortable with.&lt;/p&gt;

&lt;p&gt;In dev, I do not auto-run migrations. I have a tiny helper at &lt;code&gt;scripts/dev.sh migrate:run&lt;/code&gt; that I call myself when I am ready. Sometimes I want to inspect a migration before it touches my local database. Sometimes I am rebasing a branch and the migration files are temporarily messy. The dev workflow leaves that decision to me, which is what I want for a workflow I touch every day.&lt;/p&gt;
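
&lt;p&gt;A helper like that can be as small as this. A sketch of the shape, not my actual script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env sh
# scripts/dev.sh - sketch; only the migrate:run subcommand from the post
case "$1" in
  migrate:run)
    npx typeorm migration:run -d dist/db/datasource.js
    ;;
  *)
    echo "usage: $0 migrate:run" &amp;gt;&amp;amp;2
    exit 1
    ;;
esac
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
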

&lt;p&gt;In production, the deploy and upgrade scripts auto-run the migrator service. I do not want a half-asleep version of me, in the middle of an incident, to forget the manual migration step. The cost of accidentally running a no-op migration is zero. The cost of forgetting one is downtime.&lt;/p&gt;

&lt;p&gt;Same domain, same migrations, same tool. Different harness on each end. It used to feel like a wart. Today I would call it the right shape. Humans get to choose in dev because choosing is cheap there, and machines do the safe thing in prod because forgetting is expensive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The follow-up: a timeout you can actually configure
&lt;/h2&gt;

&lt;p&gt;The migrator stage closed the loop on the upgrade side. The original problem, though, was a hard-coded five-minute child-process timeout. Even with the upgrade landed, that number was still going to bite the next project that grew past it.&lt;/p&gt;

&lt;p&gt;A handful of commits later, I made the dump timeout per-project. The same YAML that already names the database now takes an optional &lt;code&gt;dump_timeout_minutes&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;projects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
    &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
    &lt;span class="na"&gt;timeout_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;appdb&lt;/span&gt;
      &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;appuser&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${APP_DB_PASSWORD}&lt;/span&gt;
      &lt;span class="na"&gt;dump_timeout_minutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;120&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resolution order is deliberate. &lt;code&gt;database.dump_timeout_minutes&lt;/code&gt; wins first, &lt;code&gt;timeout_minutes&lt;/code&gt; next, the safe default last. A small project gets the default and never thinks about it. A medium project bumps &lt;code&gt;timeout_minutes&lt;/code&gt; for the whole run. A heavy one with a slow link sets &lt;code&gt;dump_timeout_minutes&lt;/code&gt; on just that database, without inflating the warning timer for everything else.&lt;/p&gt;
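
&lt;p&gt;Sketched as shell parameter expansion, with hypothetical variable names (the real resolution lives inside backupctl), the whole thing is nested fallbacks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# per-database wins, then per-project, then the tool's default
effective_minutes="${db_dump_timeout_minutes:-${project_timeout_minutes:-$DEFAULT_MINUTES}}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
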

&lt;p&gt;Paired with that, a &lt;code&gt;--verify-dump&lt;/code&gt; flag on the dry-run path. Plain &lt;code&gt;--dry-run&lt;/code&gt; only checks config and database connectivity. With &lt;code&gt;--verify-dump&lt;/code&gt;, the tool actually runs the dumper into a temp directory, verifies the file integrity, reports the duration and size, then cleans up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;backupctl run myapp &lt;span class="nt"&gt;--dry-run&lt;/span&gt; &lt;span class="nt"&gt;--verify-dump&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a project's database needs longer than the configured timeout, this is where you see it. On your terms, in a dry-run report you ran on purpose. Not in a 4 AM cron failure you find out about over coffee. The change I most wish I had made before the original incident.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two short lessons, then I am out
&lt;/h2&gt;

&lt;p&gt;If you are reading this and you are one &lt;code&gt;sed&lt;/code&gt; away from doing exactly what I did, here is what I want you to take with you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A patch on disk is not a patch in a running Node process.&lt;/strong&gt; If you &lt;code&gt;sed&lt;/code&gt; a &lt;code&gt;.js&lt;/code&gt; file inside a long-running container, the only thing that will pick up the change is a fresh process. The scheduler that has been holding &lt;code&gt;child-process.util.js&lt;/code&gt; in its require cache since boot does not care what your bytes look like now. Restart the container. Or, better, do not patch live containers in the first place.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A stripped runtime image needs a thinking partner.&lt;/strong&gt; If you have removed npm and npx from production for sensible reasons, you have also removed every script that was quietly assuming they were there. Migrations are the obvious one. Make a separate stage that has the tools, profile-gate it so it does not run when you do not want it to, and let your deploy script call it on purpose.&lt;/p&gt;

&lt;p&gt;That is pretty much it from my side today. Let me know what you think, or if you have been through something similar with a hotfix that quietly refused to take. Those stories are always the best ones. See you soon in the next blog.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>node</category>
      <category>backup</category>
      <category>ops</category>
    </item>
    <item>
      <title>I Mistook gpt-oss for an Image Generator. Now My Mac Runs FLUX Offline.</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Sat, 25 Apr 2026 18:32:03 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/i-mistook-gpt-oss-for-an-image-generator-now-my-mac-runs-flux-offline-ejk</link>
      <guid>https://forem.com/vineethnkrishnan/i-mistook-gpt-oss-for-an-image-generator-now-my-mac-runs-flux-offline-ejk</guid>
      <description>&lt;h1&gt;
  
  
  I Mistook gpt-oss for an Image Generator. Now My Mac Runs FLUX Offline.
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-hero.png" alt="A vintage brass compass and small circuit board resting on a weathered wooden desk in soft warm window light, photographic with shallow depth of field." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - I went down a small rabbit hole today after asking if gpt-oss could generate images. It cannot. It is a text-only language model. That detour ended with FLUX.1-schnell running locally on my Mac through Draw Things, exposed over a tiny HTTP API, and a one-line shell function I can call from anywhere. The hero image above? Generated by that exact setup. Below is the full walkthrough so anyone can replicate it without bumping into the same walls I did.&lt;/p&gt;

&lt;p&gt;So there I was, casually asking my local LM Studio if I could just hand it a prompt and get an image back.&lt;/p&gt;

&lt;p&gt;Spoiler: no.&lt;/p&gt;

&lt;p&gt;I was running gpt-oss locally and somehow expected it to also handle image generation. Which, in hindsight, is a bit like asking your calculator to play music. gpt-oss is a text-only language model. It generates tokens, not pixels. There is no image head bolted onto it. I knew this. I had just convinced myself otherwise for a few minutes.&lt;/p&gt;

&lt;p&gt;Anyway, that small confusion sent me looking at what it would actually take to do local image generation on my Mac. Pollinations.ai already covers most of my blog hero images, but it goes over the wire. I wanted something offline. Something I could call from a script when there is no internet. Something that uses the same FLUX family of models pollinations is built on, just running on my own hardware.&lt;/p&gt;

&lt;p&gt;What I ended up with surprised me a little. The setup is simpler than I expected. The latency is worse than I expected. And the conclusion is more boring than I expected.&lt;/p&gt;

&lt;p&gt;Let me walk you through every step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Draw Things and not ComfyUI
&lt;/h2&gt;

&lt;p&gt;If you have read anything about local image generation, ComfyUI shows up first. It is the node-based, fully-featured, every-knob-exposed option. Power users love it. I did not pick it.&lt;/p&gt;

&lt;p&gt;The reason is simple. I wanted the lowest-friction path to find out "do I even need this." ComfyUI on Mac means Python environments, model downloads, a queue server, custom workflow JSON, and a web UI to drive it. That is a lot of setup just to discover I would only use it once a month.&lt;/p&gt;

&lt;p&gt;Draw Things is the opposite of that. Free Mac App Store app. Native Apple Silicon. Built-in model manager. Click, install FLUX.1-schnell, click generate, done. The trade-off is less control. You get the knobs Draw Things decides to expose. For my use case, that was fine.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who picks the easier option first and only graduates to the harder one when the easier one breaks. That is basically my entire approach to tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install Draw Things from the Mac App Store
&lt;/h2&gt;

&lt;p&gt;Open the Mac App Store, search for &lt;strong&gt;"Draw Things"&lt;/strong&gt;, and pick the one by &lt;strong&gt;Draw Things, Inc.&lt;/strong&gt; with the astronaut-on-horseback icon. There are a few image apps with similar names floating around, so confirm the developer before clicking install.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-app-store.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-app-store.png" alt="Draw Things on the Mac App Store showing the developer Draw Things, Inc. with an astronaut on horseback icon and a 4.8 star rating." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Things worth noting from the listing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Size&lt;/strong&gt;: about 152 MB. The app itself is small. The big downloads happen later when you pick a model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platforms&lt;/strong&gt;: Mac, iPad, iPhone. Universal app, so the same purchase works across devices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Price&lt;/strong&gt;: free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update cadence&lt;/strong&gt;: active. An update landing around my install day had just added FLUX.2, LTX-2.3 and a few others. New models keep landing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Click Open after install. The app launches into a blank canvas with a settings panel on the left and a tools panel on the right. We are not generating anything yet. First we need a model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Pick the model, FLUX.1 [schnell]
&lt;/h2&gt;

&lt;p&gt;In the left settings panel, switch to the &lt;strong&gt;All&lt;/strong&gt; tab at the top. Scroll till you find the &lt;strong&gt;Model&lt;/strong&gt; dropdown. Click it. You will get a search field plus two sections, &lt;strong&gt;Local&lt;/strong&gt; and &lt;strong&gt;Official Models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="" class="article-body-image-wrapper"&gt;&lt;img&gt;&lt;/a&gt;
  src="/blog/local-flux-model-picker.png"&lt;br&gt;
  alt="Draw Things model picker dropdown open, showing FLUX.1 schnell selected at the top with the Local section listing one installed model and the Official Models section listing LTX-2.3 and ERNIE Image variants below."&lt;br&gt;
  style="max-width: 320px; width: 100%; height: auto; display: block; margin: 1.5rem auto;"&lt;br&gt;
/&amp;gt;&lt;/p&gt;

&lt;p&gt;Pick &lt;strong&gt;FLUX.1 [schnell]&lt;/strong&gt;. If it is not in your Local section yet, it will be in Official Models with a small download cloud icon. Click the cloud, wait for it to pull down (it is a few gigs, so go make tea), and once it lands it moves into Local.&lt;/p&gt;

&lt;p&gt;Why schnell and not the dev variant? Two reasons.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed.&lt;/strong&gt; schnell is the 4-step distilled version. dev needs 20 to 50 steps for the same quality. On a Mac, that difference is the gap between "I can use this" and "I will never use this."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;License.&lt;/strong&gt; schnell is Apache 2.0. dev is non-commercial. If you ever want to ship anything you generated, schnell is the safer pick.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The other models in that list, LTX-2.3, ERNIE Image, the various distilled and quantized variants, are tempting, but ignore them for now. Schnell is the one that maps cleanly to what Pollinations runs in the cloud, and it is the smallest path to a working pipeline.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: First image through the GUI
&lt;/h2&gt;

&lt;p&gt;Before touching the API, run one image through the app itself. This confirms the model is loaded, the engine works, and your machine has the juice for FLUX.&lt;/p&gt;

&lt;p&gt;Type a prompt into the box at the bottom of the canvas. I went with &lt;code&gt;a tired developer at a laptop late at night, glowing monitor, moody lighting&lt;/code&gt;. Click the small button with the sparkle icon at the bottom right. Wait. Watch the progress.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-drawthings-ui.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-drawthings-ui.jpeg" alt="Draw Things app generating a moody illustration of a tired developer at a laptop late at night." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Forty seconds later, an image. Mine came out as the tired developer above, lit by a green glow from a monitor in the dark. Not bad for clicking three buttons.&lt;/p&gt;

&lt;p&gt;A few things I noticed during the first run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The app uses your GPU. Activity Monitor will show a spike. Fan may kick in on smaller MacBooks.&lt;/li&gt;
&lt;li&gt;First generation after launch is slower because the model has to warm up. After that it stabilises.&lt;/li&gt;
&lt;li&gt;The output saves to wherever you set "Save Generated Media to" in settings. Mine goes to &lt;code&gt;~/Pictures/Flux Images&lt;/code&gt;. Worth setting this once so you can find your generations later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If this step works, the GUI half is done. The next step is to make the same engine reachable from a terminal.&lt;/p&gt;

&lt;p&gt;Tell me you also did the small "I touched the button and it worked" celebration the first time the image rendered. There is something satisfying about watching pixels appear out of math on your own laptop.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 4: Flip the HTTP API switch
&lt;/h2&gt;

&lt;p&gt;Draw Things has a built-in HTTP API server. It is off by default. Once you turn it on, it speaks the &lt;strong&gt;Stable Diffusion WebUI API spec&lt;/strong&gt;, which means anything that can talk to AUTOMATIC1111 can talk to Draw Things instead. Same endpoints, same JSON shape, mostly the same parameters.&lt;/p&gt;

&lt;p&gt;Open &lt;strong&gt;Settings&lt;/strong&gt; (the gear icon on the left rail), go to the &lt;strong&gt;Advanced&lt;/strong&gt; tab, and scroll down to &lt;strong&gt;API Server&lt;/strong&gt;. You will see a panel like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-api-settings.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-api-settings.jpeg" alt="Draw Things API Server settings panel showing Server Online toggled on, Protocol set to HTTP, Port 7860, IP set to 127.0.0.1 localhost only, Bridge Mode disabled." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Four switches matter here. Get them right or the curl will hang silently and you will spend an hour wondering why.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Server Online&lt;/td&gt;
&lt;td&gt;On (green)&lt;/td&gt;
&lt;td&gt;The actual on/off for the server.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Protocol&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;HTTP&lt;/strong&gt;, not gRPC&lt;/td&gt;
&lt;td&gt;Draw Things ships both. gRPC needs protobuf clients. HTTP is what curl, jq, and any normal script can talk to. This is the most common mistake.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Port&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7860&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Same as the WebUI default. Anything assuming AUTOMATIC1111 will hit this without config.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;It is local-only. Self-signed certs just break curl with no real benefit.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;127.0.0.1 (localhost only)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The default is "allow all connections" which exposes the server to your whole network. No reason for that. Lock it to localhost.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bridge Mode you can leave disabled. That is for routing through Draw Things' cloud, which defeats the whole "offline" point.&lt;/p&gt;

&lt;p&gt;Once those four are right and the toggle dot is green, you have an HTTP API live on &lt;code&gt;http://127.0.0.1:7860&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Step 5: The first sanity check, and the first gotcha
&lt;/h2&gt;

&lt;p&gt;I wanted to confirm the server was alive before sending a real prompt. The standard move in the Stable Diffusion world is to hit &lt;code&gt;/sdapi/v1/sd-models&lt;/code&gt;, which returns the list of installed models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; http://127.0.0.1:7860/sdapi/v1/sd-models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I got back a clean &lt;strong&gt;404&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A few minutes of confusion later, I figured it out. Draw Things implements the actually-useful endpoints, mainly &lt;code&gt;/txt2img&lt;/code&gt; and &lt;code&gt;/img2img&lt;/code&gt;. It does not bother with the introspection ones. The model is whatever you have loaded in the app at that moment, and they did not see the point of duplicating that into an API call.&lt;/p&gt;

&lt;p&gt;Which is fine, but it does mean the usual "is the server alive" check from the Stable Diffusion world does not work here. The way you actually verify the server is up is by sending a real generation request and seeing what comes back.&lt;/p&gt;

&lt;p&gt;If you ever hit this 404 yourself, you now know. It is not your config. It is just an endpoint Draw Things chose not to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: A real generation request
&lt;/h2&gt;

&lt;p&gt;Here is the smallest curl that gets you a working image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:7860/sdapi/v1/txt2img &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "prompt": "a red apple on a wooden table",
    "steps": 4,
    "width": 512,
    "height": 512,
    "cfg_scale": 1.0
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response is JSON with a base64-encoded PNG inside the &lt;code&gt;images&lt;/code&gt; array. Not a binary stream, not a multipart upload, just a JSON blob with the picture stuffed inside as base64. So the full path from prompt to viewable file is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:7860/sdapi/v1/txt2img &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"prompt":"a red apple","steps":4,"width":512,"height":512,"cfg_scale":1.0}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.images[0]'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /tmp/apple.png &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; open /tmp/apple.png
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run that and you get a session that looks roughly like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-curl-apple.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Flocal-flux-curl-apple.png" alt="Terminal session showing the curl + jq + base64 pipeline succeeding, with ls -lh and file confirming a 512x512 PNG was created at /tmp/apple.png." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first time I ran that and Preview popped open with an actual apple, I just sat back and smiled. These small wins are why I still enjoy this whole thing.&lt;/p&gt;

&lt;p&gt;A few notes on the parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;steps: 4&lt;/strong&gt; is the magic of FLUX.1-schnell. Most diffusion models need 20 to 50 steps. Schnell is distilled to do good work in four. If you push it higher, it will not get noticeably better, just slower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cfg_scale: 1.0&lt;/strong&gt; is correct for schnell. Higher values that work for SD1.5 or SDXL will produce burnt, oversaturated images here. Leave it at 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;width&lt;/strong&gt; and &lt;strong&gt;height&lt;/strong&gt; must be multiples of 64. 512x512 is the sweet spot for testing. Blog hero size 1200x630 works but is slower (more on that below).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 7: Anatomy of the JSON response
&lt;/h2&gt;

&lt;p&gt;If you run the curl without piping into jq, you will see something like this (truncated, because the base64 string is enormous).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"images"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"iVBORw0KGgoAAAANSUhEUgAAAgAAAAIACAIAAABLbSncAAA...{thousands more chars}...AAElFTkSuQmCC"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to know.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;images&lt;/code&gt; is an array. If you ask for a batch (&lt;code&gt;"batch_size": 4&lt;/code&gt;), you get four base64 strings back. Most of the time you only want index zero.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;parameters&lt;/code&gt; and &lt;code&gt;info&lt;/code&gt; come back empty in Draw Things. The Stable Diffusion WebUI fills these in; Draw Things implements only the endpoints it needs, no more.&lt;/li&gt;
&lt;li&gt;The base64 string is the entire PNG, including headers. &lt;code&gt;iVBORw0KGgo&lt;/code&gt; is the magic prefix for PNG when base64-encoded. If you ever see that, you know you got a valid image and not an error JSON.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is useful for debugging. If something is off, the response will not start with &lt;code&gt;iVBORw&lt;/code&gt;, it will start with &lt;code&gt;{&lt;/code&gt; and be a small JSON with an error. Pipe to &lt;code&gt;head -c 20&lt;/code&gt; if you want to peek.&lt;/p&gt;
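
&lt;p&gt;That peek, spelled out against the same request as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# iVBORw0KGgo at the front means a valid PNG came back
curl -s -X POST http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{"prompt":"a red apple","steps":4,"width":512,"height":512,"cfg_scale":1.0}' \
  | jq -r '.images[0] // "no image in response"' | head -c 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
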

&lt;h2&gt;
  
  
  Step 8: The data flow, end to end
&lt;/h2&gt;

&lt;p&gt;Here is the whole pipeline from typing a prompt to opening a PNG, in one diagram.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;your terminal                   Draw Things (Mac app)
     |                                  |
     |  POST /sdapi/v1/txt2img          |
     |  { prompt, steps, w, h, cfg }    |
     | -------------------------------&amp;gt; |
     |                                  |
     |                                  |  FLUX.1-schnell
     |                                  |  runs on GPU
     |                                  |  (Apple Silicon)
     |                                  |
     |  { "images": ["base64..."] }     |
     | &amp;lt;------------------------------- |
     |                                  |
     | jq -r '.images[0]'  -&amp;gt; base64    |
     | base64 -d           -&amp;gt; raw PNG   |
     | &amp;gt; /tmp/apple.png    -&amp;gt; file      |
     | open /tmp/apple.png              |
     |                                  |
     v                                  |
   Preview window pops open
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three tools, each doing one thing, composed into a single line. The Unix philosophy showing up in 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Wrap it in a zsh function
&lt;/h2&gt;

&lt;p&gt;I did not want to remember the curl every time, so this went into my &lt;code&gt;~/.zshrc&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dt-gen&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;/tmp/dt-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;.png&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:7860/sdapi/v1/txt2img &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; p &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s1"&gt;'{prompt:$p, steps:4, width:1024, height:1024, cfg_scale:1.0}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.images[0]'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; open &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now &lt;code&gt;dt-gen "a brass compass on weathered wood, cinematic, 50mm"&lt;/code&gt; from any terminal generates the image, saves it, opens it. Nothing fancy. Just a curl wrapped in a function so I do not have to think about JSON escaping every time.&lt;/p&gt;

&lt;p&gt;For blog hero images I use a slightly different variant that hits 1200x630.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dt-hero&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;local &lt;/span&gt;&lt;span class="nv"&gt;out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;2&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;/tmp/hero-&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;.png&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:7860/sdapi/v1/txt2img &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;jq &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;--arg&lt;/span&gt; p &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$prompt&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="s1"&gt;'{prompt:$p, steps:4, width:1200, height:630, cfg_scale:1.0}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.images[0]'&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; open &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$out&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After saving, run &lt;code&gt;source ~/.zshrc&lt;/code&gt; (or open a new terminal) and the function is available.&lt;/p&gt;

&lt;p&gt;One catch worth knowing. &lt;strong&gt;Draw Things must be open with the API server running for these to work.&lt;/strong&gt; Quit the app, the server stops. I do not have a launcher trick for this yet, and honestly for ad-hoc use it is fine. If I need it, I open the app first. The same way I open Postman before hitting an API while developing.&lt;/p&gt;
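
&lt;p&gt;A two-second probe saves you the silent hang when you are unsure whether the app is up. &lt;code&gt;nc&lt;/code&gt; ships with macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# is anything listening on the API port at all?
nc -z 127.0.0.1 7860 &amp;amp;&amp;amp; echo "Draw Things API is up" || echo "open Draw Things first"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
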

&lt;h2&gt;
  
  
  Speed reality, the part the demos do not show
&lt;/h2&gt;

&lt;p&gt;Now the bit nobody puts in the demo videos. Local image generation on a laptop is slow. Not "wait a beat" slow. Slow enough that you can make tea.&lt;/p&gt;

&lt;p&gt;Here is what I measured on my machine.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Image size&lt;/th&gt;
&lt;th&gt;Steps&lt;/th&gt;
&lt;th&gt;FLUX.1-schnell on Mac&lt;/th&gt;
&lt;th&gt;Pollinations cloud&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512 x 512&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~40s&lt;/td&gt;
&lt;td&gt;~6s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;768 x 768&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~75s&lt;/td&gt;
&lt;td&gt;~7s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024 x 1024&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~110s&lt;/td&gt;
&lt;td&gt;~8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1200 x 630 (blog hero)&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~90 to 150s&lt;/td&gt;
&lt;td&gt;~8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The hero image at the top of this blog took the upper end of the 1200x630 row. I generated it via the same API while writing this section.&lt;/p&gt;

&lt;p&gt;Pollinations comes back in under ten seconds for any of these. The reason is simple. They are running on actual GPU servers, and I am running on an M-series chip. FLUX is the same FLUX. The hardware is what changes.&lt;/p&gt;

&lt;p&gt;This is the part where I had to be honest with myself. If I am drafting a blog and want to iterate on hero prompts, two minutes per attempt will ruin the flow. If I am running a one-off script overnight, two minutes is nothing. So the decision is not "which one do I use", it is "which one suits the moment."&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting matrix
&lt;/h2&gt;

&lt;p&gt;Every problem I hit, plus the fix. Save this section, you will need at least one of these.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;curl: (7) Failed to connect to 127.0.0.1 port 7860&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Server toggle is off, or app is closed&lt;/td&gt;
&lt;td&gt;Open Draw Things, flip Server Online to green&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;404 Not Found&lt;/code&gt; on &lt;code&gt;/sdapi/v1/sd-models&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Endpoint not implemented in Draw Things&lt;/td&gt;
&lt;td&gt;Skip that check. Verify with a real &lt;code&gt;/txt2img&lt;/code&gt; request instead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Empty response, no error&lt;/td&gt;
&lt;td&gt;Protocol set to gRPC&lt;/td&gt;
&lt;td&gt;Switch Protocol to HTTP in API Server settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TLS handshake error&lt;/td&gt;
&lt;td&gt;TLS toggle is on with self-signed cert&lt;/td&gt;
&lt;td&gt;Turn TLS off for local use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hangs forever, no response&lt;/td&gt;
&lt;td&gt;First call after launch, model is warming up&lt;/td&gt;
&lt;td&gt;Wait 30 to 60 seconds. Subsequent calls are faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Burnt, oversaturated colours&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cfg_scale&lt;/code&gt; set too high for schnell&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;cfg_scale: 1.0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output looks like noise / not the prompt&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;steps&lt;/code&gt; set to 1 or 2&lt;/td&gt;
&lt;td&gt;Set &lt;code&gt;steps: 4&lt;/code&gt; for schnell&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;width or height not divisible by 64&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Custom size like 600x600&lt;/td&gt;
&lt;td&gt;Round to nearest 64. Use 576 or 640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;jq: parse error&lt;/code&gt; after curl&lt;/td&gt;
&lt;td&gt;Response was an HTML error page, not JSON&lt;/td&gt;
&lt;td&gt;Run curl without the pipe to see the raw response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Image saves but is 0 bytes&lt;/td&gt;
&lt;td&gt;base64 decode failed silently&lt;/td&gt;
&lt;td&gt;Check that &lt;code&gt;jq -r '.images[0]'&lt;/code&gt; returns a string starting with &lt;code&gt;iVBORw&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generations are slower than the table above&lt;/td&gt;
&lt;td&gt;Other GPU-heavy app open (Final Cut, Blender)&lt;/td&gt;
&lt;td&gt;Close them, retry. FLUX wants the GPU to itself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server reachable from other devices on Wi-Fi&lt;/td&gt;
&lt;td&gt;IP set to 0.0.0.0 (allow all)&lt;/td&gt;
&lt;td&gt;Change IP to 127.0.0.1 (localhost only)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;App freezes during generation&lt;/td&gt;
&lt;td&gt;Tried to switch model mid-generation&lt;/td&gt;
&lt;td&gt;Wait for current job to finish before changing model&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
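
&lt;p&gt;The &lt;code&gt;jq: parse error&lt;/code&gt; and zero-byte rows come down to one habit: look at the raw response before piping it anywhere. A scratch command I keep around, same endpoint as the functions above, throwaway values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Peek at the first 300 bytes of the raw response before jq or base64 touch it.
curl -s -X POST http://127.0.0.1:7860/sdapi/v1/txt2img \
  -H "Content-Type: application/json" \
  -d '{"prompt":"test","steps":4,"width":512,"height":512,"cfg_scale":1.0}' \
  | head -c 300; echo

# A healthy reply starts with {"images":["iVBORw... which is base64-encoded PNG.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
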

&lt;h2&gt;
  
  
  Things that bit me along the way
&lt;/h2&gt;

&lt;p&gt;A few smaller gotchas that did not need their own row in the table but are worth calling out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The app needs to stay open.&lt;/strong&gt; Draw Things is the API server. Quit Draw Things, the server dies. There is no &lt;code&gt;launchd&lt;/code&gt; daemon, no background process. For me this is fine because I batch my image work. If you want a true always-on local server, you are looking at the wrong tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model state matters.&lt;/strong&gt; The model the API uses is whichever model is currently selected in the app. If you switch models in the GUI, your next API call uses the new one. There is no way to specify a model in the request itself for the schnell endpoint. If you need that, you are graduating to ComfyUI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bridge Mode is a different beast.&lt;/strong&gt; I tried turning Bridge Mode on early because "more options" felt safer. Bridge Mode actually routes the request through Draw Things' cloud relay, which is the opposite of what I wanted. If you see references to Bridge Mode in the docs, that is a separate feature, not part of the local API path. Leave it off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save folder fills up fast.&lt;/strong&gt; Every generation through the GUI saves to your "Save Generated Media to" folder. After a couple of hours of testing prompts, mine had two hundred PNGs in it. Set up a cleanup script or be ready for Finder lag.&lt;/p&gt;
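
&lt;p&gt;My cleanup is a periodic one-liner, nothing smarter. The folder path is an assumption here; substitute whatever you picked in the save settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Delete generated PNGs older than seven days.
# ~/Pictures/DrawThings is a placeholder for your "Save Generated Media to" folder.
find ~/Pictures/DrawThings -name '*.png' -mtime +7 -delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
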

&lt;h2&gt;
  
  
  Where I actually landed
&lt;/h2&gt;

&lt;p&gt;Here is the part I did not see coming when I started.&lt;/p&gt;

&lt;p&gt;I was kind of expecting to switch my blog skill over to use Draw Things. Generate everything locally. No more Pollinations. Look at this, it is all on my own hardware, very impressive.&lt;/p&gt;

&lt;p&gt;I am not going to do that.&lt;/p&gt;

&lt;p&gt;Pollinations stays as the default for the blog. Latency is the deciding factor. When I am writing, I want hero image attempts in seconds, not minutes. Draw Things becomes the ad-hoc tool. Need an image when there is no internet? Use it. Trying out a stubborn prompt that needs ten attempts and I am okay leaving the laptop alone? Use it. Want to run image generation in a longer-running background script? Use it.&lt;/p&gt;

&lt;p&gt;Two tools, two clear use cases, no rewiring of anything that already works.&lt;/p&gt;

&lt;p&gt;If you have been through a similar "I will replace the working thing with the local thing" detour and ended up keeping both, I would genuinely like to hear it. Misery loves company on this one.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I am taking away
&lt;/h2&gt;

&lt;p&gt;A few things stuck with me from this whole detour.&lt;/p&gt;

&lt;p&gt;The simplest tool that does the job is usually the right starting point. Draw Things over ComfyUI was the right call for me, even though ComfyUI is technically more powerful.&lt;/p&gt;

&lt;p&gt;Local does not always mean better. It means different. Speed, control, and privacy all live on a triangle, and you only get to pick two depending on the situation.&lt;/p&gt;

&lt;p&gt;Documentation gaps are real. The Draw Things HTTP API is not as well documented as AUTOMATIC1111's, and a lot of what I figured out came from trial and error with curl. If you ever hit the same &lt;code&gt;/sd-models&lt;/code&gt; 404 confusion, now you know.&lt;/p&gt;

&lt;p&gt;The curl-jq-base64 pipeline is a beautiful little chain. Three tools, each doing one thing, composed into a single line. The Unix philosophy showing up in 2026.&lt;/p&gt;

&lt;p&gt;And the smallest one. Sometimes the right answer to "should I do X locally" is "yes, but keep the cloud version too." Both/and beats either/or more often than I think.&lt;/p&gt;

&lt;p&gt;Okay, that is enough from me for today. If any of this saved you some time, that is the whole point of writing it down. Until the next one, take it easy.&lt;/p&gt;

</description>
      <category>mac</category>
      <category>ai</category>
      <category>flux</category>
      <category>drawthings</category>
    </item>
    <item>
      <title>Cross-Posting My Blog to dev.to and Hashnode: What I Got Wrong</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Sat, 25 Apr 2026 13:07:47 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/cross-posting-my-blog-to-devto-and-hashnode-what-i-got-wrong-19pc</link>
      <guid>https://forem.com/vineethnkrishnan/cross-posting-my-blog-to-devto-and-hashnode-what-i-got-wrong-19pc</guid>
      <description>&lt;h1&gt;
  
  
  Cross-Posting My Blog to dev.to and Hashnode: What I Got Wrong
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fcross-posting-blog-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fcross-posting-blog-hero.png" alt="Flat illustration of a developer at a desk watching a single blog post split into multiple copies flowing through pipes to different destinations, with a couple of pipes leaking gently, soft warm colors." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - Setting up auto-syndication from my Astro blog to dev.to and Hashnode looked like a one-afternoon job. It turned into four pull requests, mostly because real APIs have rate limits, partial failures, and opinions about where you should keep state. Here is everything that broke after I shipped, and how I fixed each one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I bothered with this in the first place
&lt;/h2&gt;

&lt;p&gt;My blog lives on my own site. That is the source of truth, that is where the canonical URL lives, that is where I want readers to actually land. Problem is, almost nobody reads a personal portfolio blog directly. The two clicks it gets are usually from whoever I forwarded the link to on the family WhatsApp group. The actual traffic is on dev.to and Hashnode.&lt;/p&gt;

&lt;p&gt;So I needed to syndicate. Syndicate just means publishing the same post on multiple platforms, with one of them marked as the canonical original. My site stays the source of truth. dev.to and Hashnode get copies that link back to it. Search engines see the link and treat my version as the original, so the copies do not get penalised as duplicate content.&lt;/p&gt;

&lt;p&gt;I could copy-paste each post to both platforms by hand. I have done it before. It is the kind of task that should be quick, except every time you sit down to do it you are pasting markdown, fixing image links, picking tags, hitting publish, and then doing the whole thing again on the second platform. By the third post, I knew I was never going to keep this up.&lt;/p&gt;

&lt;p&gt;So I sat down one weekend to write a small script that would do it for me on every push to main. How hard could it be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first ship: a small script and a workflow
&lt;/h2&gt;

&lt;p&gt;The plan was simple. A Node script (&lt;code&gt;scripts/syndicate.mjs&lt;/code&gt;) reads any new MDX file under &lt;code&gt;src/content/blog/&lt;/code&gt;, parses the frontmatter, rewrites image paths to absolute URLs pointing back to my site, and posts the result to both platforms. dev.to has a normal REST API. Hashnode is GraphQL. Both let you set a canonical URL (&lt;code&gt;canonical_url&lt;/code&gt; on dev.to, &lt;code&gt;originalArticleURL&lt;/code&gt; on Hashnode) so search engines know my site is the original and not the copy.&lt;/p&gt;
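
&lt;p&gt;The dev.to half of that, boiled down to a curl sketch. The endpoint and field names are the real dev.to REST API; the values are placeholders, and the actual script builds the body from the parsed frontmatter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a dev.to article that declares my own site as the canonical original.
# DEVTO_API_KEY is generated under the dev.to account settings.
curl -s -X POST https://dev.to/api/articles \
  -H "api-key: ${DEVTO_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "article": {
      "title": "My post title",
      "body_markdown": "The full markdown body...",
      "canonical_url": "https://example.com/blog/my-post/",
      "published": false
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
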

&lt;p&gt;A GitHub Actions workflow runs the script on every push to main that touches the blog folder. To avoid re-publishing the same post over and over, the script keeps a state file called &lt;code&gt;.syndication.json&lt;/code&gt; with a content hash for each post and the IDs returned by each platform.&lt;/p&gt;
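
&lt;p&gt;For shape, the state file looks roughly like this. The field names are my sketch of it, not a spec anyone else needs to follow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "my-post-slug": {
    "hash": "sha256 of the post content",
    "devto_id": 123456,
    "hashnode_id": "abc123"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
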

&lt;p&gt;I shipped this in the first PR. Tested it on one post. Both platforms accepted it. Canonical URLs pointed back home. I closed my laptop and felt clever.&lt;/p&gt;

&lt;p&gt;The cleverness lasted about a day.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break one: partial failures
&lt;/h2&gt;

&lt;p&gt;The first time the workflow ran on a real push, dev.to accepted the post, then the Hashnode call failed for some reason that I no longer remember. The script crashed before writing the state file. So as far as the next run was concerned, the post had never been syndicated anywhere.&lt;/p&gt;

&lt;p&gt;You can guess what happened next. I pushed an unrelated commit, the workflow ran again, and it cheerfully created a brand new dev.to article on top of the one already there. Now I had two copies on dev.to and zero on Hashnode. Wonderful.&lt;/p&gt;

&lt;p&gt;The fix was a reconciliation step at the start of every run. Before doing any writes, the script now lists all the existing articles on each platform via their APIs, matches them to my blog posts using the canonical URL, and stitches the existing IDs back into state. So even if the state file gets nuked or a previous run died halfway, the script knows what is already out there. On the next run it sees "this slug already has a dev.to ID" and does an update instead of a create.&lt;/p&gt;
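
&lt;p&gt;For the dev.to side, the reconciliation starts from a listing call like this. The endpoint is real; the jq line is a sketch of the matching the Node script does against my slugs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List my articles (drafts included) as "id canonical_url" pairs,
# then rebuild state by matching each canonical URL to a local slug.
curl -s -H "api-key: ${DEVTO_API_KEY}" \
  "https://dev.to/api/articles/me/all?per_page=100" \
  | jq -r '.[] | "\(.id) \(.canonical_url)"'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
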

&lt;p&gt;The lesson here was annoying but obvious in hindsight. Any script that talks to two systems in sequence will eventually fail between them. You either need atomic writes across both, which is impossible with two separate APIs, or you need to make the script self-healing on the next run. I went with the second one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Break two: dev.to and the 429 wall
&lt;/h2&gt;

&lt;p&gt;A while later I tried to do a small backfill, syndicating a few older posts in one workflow run. dev.to rejected the second one with a 429. Then the third with a 429. Then it just kept failing.&lt;/p&gt;

&lt;p&gt;It turns out dev.to has a fairly aggressive rate limit on creating new articles, and the public docs are quiet about the exact numbers. The only useful signal is the &lt;code&gt;Retry-After&lt;/code&gt; header on the 429 response, which tells you how many seconds to wait before trying again.&lt;/p&gt;

&lt;p&gt;Tell me I am not the only one who learns about API rate limits from production. The kind that does not show up until you run the thing in anger.&lt;/p&gt;

&lt;p&gt;The fix did two things. One, an in-request retry loop on 429 that honours &lt;code&gt;Retry-After&lt;/code&gt; with a sensible fallback, and that gives up after a few attempts. Two, a small pause between writes when the script is processing more than one post in a single run. It is a conservative ten-second sleep, but the syndication side of my blog is not a real-time system, so a few extra seconds per post does no harm.&lt;/p&gt;
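
&lt;p&gt;Translated out of the Node script into shell shape so the logic is easy to see. The function name and the numbers are mine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Retry a write on 429, honouring Retry-After with a 30-second fallback.
post_with_retry() {
  local attempt wait
  for attempt in 1 2 3; do
    body=$(curl -s -D /tmp/last-headers "$@")
    if ! grep -qi "^HTTP/.* 429" /tmp/last-headers; then
      printf '%s\n' "$body"; return 0
    fi
    wait=$(awk 'tolower($1)=="retry-after:" {print $2+0}' /tmp/last-headers)
    sleep "${wait:-30}"
  done
  return 1
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
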

&lt;h2&gt;
  
  
  Break three: where do you actually keep the state file
&lt;/h2&gt;

&lt;p&gt;This is the one that took the longest to get right. The state file is small, but someone needs to keep it between runs. Where?&lt;/p&gt;

&lt;p&gt;My first instinct was the obvious one. Commit it back to the repo. The workflow runs, syndicates the post, then opens a small commit with the updated state. Simple. Except now every successful syndication created a "chore: update syndication state" commit on main, which is noisy. And the default &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; does not trigger downstream workflows when it pushes, which is fine, but I still felt uncomfortable mixing bot commits into my own history.&lt;/p&gt;

&lt;p&gt;So I pulled in a GitHub App, gave it just enough permission to push the state file, and used its token for the commit. That worked. But the commits were still ugly. Every push to main produced a sibling state-update commit a moment later. My git log started looking like a chat between me and a robot.&lt;/p&gt;

&lt;p&gt;The next PR finally moved state into GitHub Actions cache. The cache is keyed per workflow, persists between runs, and most importantly does not show up in git history at all. The cache key uses the run ID for writes, with a &lt;code&gt;restore-keys&lt;/code&gt; fallback so the next run picks up whatever the last run left behind.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Restore syndication state cache&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.syndication.json&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;syndication-state-v1-${{ github.run_id }}&lt;/span&gt;
    &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;syndication-state-v1-&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Caches can be evicted, sure. But that is exactly why the reconciliation step from break one matters so much. If the cache disappears, the next run rebuilds state by listing what is already on dev.to and Hashnode. The two fixes ended up reinforcing each other, which was a nice surprise.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell past me
&lt;/h2&gt;

&lt;p&gt;Three things, looking back.&lt;/p&gt;

&lt;p&gt;One, do not store anything in your repo that does not need history. State files are not history. They are checkpoints. Use a cache, a database, or even an external KV. Anything but git.&lt;/p&gt;

&lt;p&gt;Two, when you talk to two APIs in a row, you will eventually fail between them. Plan for it from the start. Either pick a single source of truth on each platform and reconcile against it, or make sure your "did it succeed" check is independent of in-memory state.&lt;/p&gt;

&lt;p&gt;Three, every public API has rate limits. The good ones publish them. The rest, you find by writing a backfill script and getting smacked by a 429. Read the rate limit docs before you scale up, not after.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it stands now
&lt;/h2&gt;

&lt;p&gt;Right now this script syndicates to dev.to and Hashnode only. That covers the two platforms where my readers actually are. I have thought about adding Medium, but their API has its own ceremonies and I am not sure the audience overlap justifies it yet. Maybe later.&lt;/p&gt;

&lt;p&gt;Either way, the saga is at a stopping point. Posts auto-syndicate on push, partial failures self-heal on the next run, rate limits get retried with backoff, and my git log no longer has a robot living in it. That feels about right for what was meant to be an afternoon job.&lt;/p&gt;

&lt;p&gt;So yeah, that is my take. Yours might be completely different, and that is exactly what makes this whole space interesting. Catch you in the next blog - should not be too long from now.&lt;/p&gt;

</description>
      <category>blog</category>
      <category>automation</category>
      <category>githubactions</category>
      <category>devops</category>
    </item>
    <item>
      <title>Building a per-repo wiki that actually gets read</title>
      <dc:creator>Vineeth N Krishnan</dc:creator>
      <pubDate>Fri, 24 Apr 2026 18:31:30 +0000</pubDate>
      <link>https://forem.com/vineethnkrishnan/building-a-per-repo-wiki-that-actually-gets-read-1kca</link>
      <guid>https://forem.com/vineethnkrishnan/building-a-per-repo-wiki-that-actually-gets-read-1kca</guid>
      <description>&lt;h1&gt;
  
  
  Building a per-repo wiki that actually gets read
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fper-repo-wiki-hero.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fvineethnk.in%2Fblog%2Fper-repo-wiki-hero.png" alt="A developer at a warm wooden desk with two open laptops, each showing a repository folder with a small wiki folder glowing inside, background shelves of sticky notes and chat bubbles being filed away, editorial flat cartoon style, muted retro colours" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; - Our docs were not missing. They were in READMEs, internal docs folders, and even in the comments of our CI/CD workflows. The CI/CD was fully automated. And yet one specific teammate kept pinging me before every deployment, before every local setup, before every env change. I was the human shortcut. The fix was to put a &lt;code&gt;wiki/&lt;/code&gt; folder inside each repo, PR-reviewed, auto-synced to the hidden &lt;code&gt;&amp;lt;repo&amp;gt;.wiki.git&lt;/code&gt; on every merge to main, and to change one habit on the team: answer the question, then write it down. Tooling was easy. The habit was the hard bit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The moment I knew something had to change
&lt;/h2&gt;

&lt;p&gt;So there I was, a little while back, at the end of a long day, finishing up a team-wide Slack note explaining why we were finally going to have a wiki for every repo. I had written the whole thing out. The "one person becomes the bottleneck" part. The "acknowledge every question first, then document the answer" part. The "duplicate across repos is okay, because a new joiner only opens one repo" part. I was pretty proud of it, honestly.&lt;/p&gt;

&lt;p&gt;Twenty minutes later, the one teammate who had inspired about half of that note pinged me in DM asking if I had a moment to hop on a QA call.&lt;/p&gt;

&lt;p&gt;A little later: "Where are you creating this wiki?"&lt;/p&gt;

&lt;p&gt;A little later again: "I did not know that creating a &lt;code&gt;wiki/&lt;/code&gt; folder in a repo adds it to the GitHub Wiki tab."&lt;/p&gt;

&lt;p&gt;(It does not, by the way. That is half of what this blog is about.)&lt;/p&gt;

&lt;p&gt;I am not telling this story to laugh at him. He is one of the nicest, most curious people on our team and he later wrote back the kindest message saying "you have often been the most responsive to my questions." That part is actually sweet. But the irony hit hard. I was literally in the middle of building a thing to stop one person from being the default answer desk, while being the default answer desk about the thing I was building.&lt;/p&gt;

&lt;p&gt;So yeah. Whatever I had been telling myself about the situation, that was the afternoon it stopped being funny and started being a project.&lt;/p&gt;

&lt;h2&gt;
  
  
  The uncomfortable part - the docs already existed
&lt;/h2&gt;

&lt;p&gt;Here is where I have to be honest, because the easy version of this story is "we had no docs, so we built a wiki and everyone lived happily ever after." That is not what happened.&lt;/p&gt;

&lt;p&gt;The docs existed.&lt;/p&gt;

&lt;p&gt;They were in the README of each repo. They were in a &lt;code&gt;docs/&lt;/code&gt; folder with some decent notes. They were literally in the comments at the top of our CI/CD workflow files, where anybody could open the YAML and read exactly what a deploy would do, line by line.&lt;/p&gt;

&lt;p&gt;And the CI/CD itself was already automated. Push to the right branch, the workflow fires, it deploys. No human in the middle for almost every case. The workflow has a log. It has a status badge. It tells you when it started, what it did, and whether it passed.&lt;/p&gt;

&lt;p&gt;And yet I would still get a DM before a deployment. "Hey, can you deploy this for me?" The thing that was already automated. The thing where the "deploy" was "merge the PR." I would link the workflow run, explain again that it was already running, and that whole exchange would take more time than it would have taken either of us to just look at the Actions tab.&lt;/p&gt;

&lt;p&gt;That was the signal that the problem was not a documentation problem. It was a "the documentation exists but it is not in the place the person looks first" problem. And the place they look first, every single time, is Slack. DMs to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why existing docs kept getting missed
&lt;/h2&gt;

&lt;p&gt;I thought about this for a while before writing the Slack note, because I did not want to be the guy who builds yet another documentation surface and then blames the team when it does not get used.&lt;/p&gt;

&lt;p&gt;A few reasons kept surfacing.&lt;/p&gt;

&lt;p&gt;The README was partly right. That is worse than being fully wrong, in a way. Once one thing in the README is stale, the reader quietly stops trusting the rest of it. Even the correct parts start feeling suspicious.&lt;/p&gt;

&lt;p&gt;The internal docs folder was a dumping ground. It had PRDs, architectural notes, feature specs, and, buried in the middle, one or two genuinely useful operational pages. If you did not already know the useful page existed, you would not find it by scrolling.&lt;/p&gt;

&lt;p&gt;The CI/CD workflow files were self-documenting, technically. But "self-documenting" means "if you know to open the file and read it." If the workflow is automatically doing the deployment and nobody has ever told you that it exists, you will keep asking a human to do the deployment for you. The workflow is only self-documenting to people who already know it is there.&lt;/p&gt;

&lt;p&gt;And the GitHub Wiki tab existed on every repo. It was also, basically, dead. A couple of pages, last touched so long ago that nobody could confidently say if any of it was still true. The UI for editing it is fine, honestly, but it sits outside the normal PR review flow, which means nobody was going to edit it on a Tuesday afternoon when they should be writing code.&lt;/p&gt;

&lt;p&gt;So docs existed. They just existed everywhere, at different levels of correctness, with no single first-stop the reader could trust. And when all the docs feel equally untrustworthy, the rational move for the reader is to skip all of them and ping the one human who will definitely give you the current answer. That human was me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The conversation that finally kicked it off
&lt;/h2&gt;

&lt;p&gt;In a quiet one-to-one with a colleague I trust, I said out loud the thing I had been avoiding saying. "The docs exist. The CI/CD is automated. I am still getting pinged every time a deployment has to go out. Something about how we are doing this is broken, and it is not a code problem." He said what I already knew. "That is a team problem, not a you problem. Go fix the team problem."&lt;/p&gt;

&lt;p&gt;That evening I wrote the Slack note. The short version of it is this.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When someone asks a support question, acknowledge it in-thread first. Do not ghost. The person is actually blocked and needs at least a hello.&lt;/li&gt;
&lt;li&gt;Answer the question.&lt;/li&gt;
&lt;li&gt;Then push the answer into the relevant repo's wiki before end of day.&lt;/li&gt;
&lt;li&gt;If the same answer applies to two repos, put it in both. The new hire joining the team next month only opens the repo they were assigned to, so that is where the answer needs to live.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A new dev only opens the repo they were assigned to. So the docs have to be there when they look.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Why we did not pick Confluence or Notion or a central wiki
&lt;/h2&gt;

&lt;p&gt;Honest pass at the alternatives. Confluence is fine. Notion is fine. A company-wide wiki is fine. We have a Notion workspace that does have useful stuff in it.&lt;/p&gt;

&lt;p&gt;But two things kept pulling us toward per-repo.&lt;/p&gt;

&lt;p&gt;The first is simply that a new joiner opens the repo. Not Notion. Not Confluence. The repo. That is where they are spending eight hours of their day. Documentation that is two clicks and one context-switch away is documentation they will not open. Putting the answer next to the code it describes is a huge win, even if that means the answer exists in two places.&lt;/p&gt;

&lt;p&gt;The second is that PR-reviewed Markdown beats a rich-text editor they have to context-switch into. A wiki page change becomes a Markdown diff on a feature branch. It gets reviewed. Typos are caught, stale facts are caught, "wait, we stopped doing that six months ago" is caught, all before the thing goes public.&lt;/p&gt;

&lt;p&gt;Duplication across repos is a feature, not a bug. If a port-collision note applies to both our older PHP repo and our newer NestJS one, it lives in both wikis. The reader does not have to know which one is the "correct" home for it, because they are already in the repo they care about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part most people do not know - the GitHub Wiki is a hidden repo
&lt;/h2&gt;

&lt;p&gt;This was the thing my teammate did not know, and honestly, most developers I have asked did not know it either.&lt;/p&gt;

&lt;p&gt;The GitHub Wiki is not a folder inside your repo. It is a separate git repo. It lives at &lt;code&gt;https://github.com/&amp;lt;owner&amp;gt;/&amp;lt;repo&amp;gt;.wiki.git&lt;/code&gt;. You can clone it like any other repo. You can push to it, commit to it, look at its git log. The Wiki tab on github.com is just a UI on top of that hidden repo.&lt;/p&gt;

&lt;p&gt;There is one tiny gotcha. GitHub does not create the wiki repo for you automatically. Somebody has to go to the Wiki tab once, publish a placeholder first page through the UI, and from that moment on &lt;code&gt;git clone &amp;lt;repo&amp;gt;.wiki.git&lt;/code&gt; works forever. If you have never touched the Wiki tab, that clone will just fail with "repository not found."&lt;/p&gt;
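
&lt;p&gt;You can check whether the hidden repo exists yet without opening a browser. A quick probe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Prints refs if the wiki repo has been published at least once,
# fails with "Repository not found" if nobody has touched the Wiki tab yet.
git ls-remote https://github.com/&amp;lt;owner&amp;gt;/&amp;lt;repo&amp;gt;.wiki.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
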

&lt;p&gt;Once you know this, the rest of the setup basically writes itself. You keep your docs in a &lt;code&gt;wiki/&lt;/code&gt; folder inside the main repo, because that is where PR review, history, and &lt;code&gt;git blame&lt;/code&gt; already live. Then you mirror &lt;code&gt;wiki/&lt;/code&gt; into the hidden &lt;code&gt;&amp;lt;repo&amp;gt;.wiki.git&lt;/code&gt; on every merge to main. The GitHub Wiki tab becomes a publishing destination, not a place you author things.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Here is what each repo ended up with. The folder listing from a larger repo, just as an example of how it grew.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wiki/
  _Footer.md
  _Sidebar.md
  Home.md
  Introduction.md
  Requirements.md
  Development-Setup.md
  Architecture.md
  API-Reference.md
  Deployment.md
  Server-Access.md
  CLI-Commands.md
  Modules.md
  Naming-Conventions.md
  Testing.md
  Performance.md
  Release-Automation.md
  Glossary.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A smaller repo started with four pages and that is completely fine. You do not have to have sixteen pages on day one. You do not even have to have sixteen pages on day one hundred. The point is that the pages that do exist are correct and current.&lt;/p&gt;

&lt;p&gt;And here is the GitHub Action that makes the whole thing work. This is the real thing from our repos, not a simplified version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/wiki-publish.yml&lt;/span&gt;
&lt;span class="c1"&gt;# Syncs the wiki/ folder in this repo to the GitHub wiki on every push to&lt;/span&gt;
&lt;span class="c1"&gt;# main. The code repo is the source of truth. The wiki repo is a sink.&lt;/span&gt;
&lt;span class="c1"&gt;# Never edit pages via the GitHub wiki UI. They will be overwritten.&lt;/span&gt;

&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Publish Wiki&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wiki/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/wiki-publish.yml'&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wiki-publish&lt;/span&gt;
  &lt;span class="na"&gt;cancel-in-progress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;publish&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code repo&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v6&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sparse-checkout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;wiki&lt;/span&gt;
          &lt;span class="na"&gt;sparse-checkout-cone-mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Clone wiki repo&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GH_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.WIKI_TOKEN || secrets.GITHUB_TOKEN }}&lt;/span&gt;
          &lt;span class="na"&gt;REPO&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.repository }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;git clone "https://x-access-token:${GH_TOKEN}@github.com/${REPO}.wiki.git" wiki-remote&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sync wiki content&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;rsync -av --delete --exclude='.git' wiki/ wiki-remote/&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push&lt;/span&gt;
        &lt;span class="na"&gt;working-directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;wiki-remote&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;SOURCE_REPO&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.repository }}&lt;/span&gt;
          &lt;span class="na"&gt;SOURCE_SHA&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;TRIGGERED_BY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.actor }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;git config user.name "github-actions[bot]"&lt;/span&gt;
          &lt;span class="s"&gt;git config user.email "41898282+github-actions[bot]@users.noreply.github.com"&lt;/span&gt;

          &lt;span class="s"&gt;git add -A&lt;/span&gt;

          &lt;span class="s"&gt;if git diff --cached --quiet; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "No wiki changes to publish."&lt;/span&gt;
            &lt;span class="s"&gt;exit 0&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;

          &lt;span class="s"&gt;SHORT_SHA="${SOURCE_SHA:0:7}"&lt;/span&gt;
          &lt;span class="s"&gt;git commit -m "docs(wiki): sync from ${SHORT_SHA}" \&lt;/span&gt;
                     &lt;span class="s"&gt;-m "Source: ${SOURCE_REPO}@${SOURCE_SHA}" \&lt;/span&gt;
                     &lt;span class="s"&gt;-m "Triggered by: ${TRIGGERED_BY}"&lt;/span&gt;
          &lt;span class="s"&gt;git push origin HEAD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth calling out in that file, because they were not obvious when I started.&lt;/p&gt;

&lt;p&gt;The path filter on the trigger (&lt;code&gt;paths: wiki/**&lt;/code&gt;) means the job only runs when someone actually changes a wiki file or the workflow itself. Without it, every push to main kicks off the job, which makes the Actions tab noisy for no reason.&lt;/p&gt;

&lt;p&gt;The sparse checkout only pulls the &lt;code&gt;wiki/&lt;/code&gt; folder. On big repos this is just nice to have, on really big repos it is genuinely faster.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;rsync --delete&lt;/code&gt; is the important bit. If a page is removed from &lt;code&gt;wiki/&lt;/code&gt;, it also disappears from the published wiki. Without &lt;code&gt;--delete&lt;/code&gt;, old pages would linger forever on the Wiki tab like a graveyard of old truth.&lt;/p&gt;

&lt;p&gt;The concurrency group means two merges that happen close together will serialize cleanly instead of racing each other and one winning silently.&lt;/p&gt;

&lt;p&gt;And then the footgun. The default &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; that Actions gives you for free cannot reliably push to the wiki repo. Depending on org settings and how the wiki repo was first created, it fails at the push step with permission errors that make no sense. So we use a separate &lt;code&gt;WIKI_TOKEN&lt;/code&gt; secret, a personal access token with repo write scope, and fall back to &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; only so the workflow does not explode in forks. This PAT is the one honest piece of ceremony in the whole setup. It expires, you will forget to rotate it, and one day the workflow will show a green "succeeded" badge while silently not pushing anything. Document the rotation somewhere. Ideally, on the wiki.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this actually works
&lt;/h2&gt;

&lt;p&gt;The thing I like about this setup is that nothing about it is clever. Every piece of it was already there before we did anything.&lt;/p&gt;

&lt;p&gt;Wiki pages now go through the same PR review as code. Someone spots the typo, someone spots the outdated claim, someone says "we stopped doing that, by the way" and the page gets fixed before it ever makes it to the Wiki tab.&lt;/p&gt;

&lt;p&gt;History is in regular Git. &lt;code&gt;git blame&lt;/code&gt; on a doc line tells you who wrote it, when, and in which PR. No opaque wiki version log. No mystery about why a sentence was added.&lt;/p&gt;

&lt;p&gt;Zero context switch. The developer is already in the repo, looking at their editor, looking at their branch. Adding a wiki page is editing a Markdown file and opening a PR. Same flow as everything else. Nobody has to remember a new tool.&lt;/p&gt;

&lt;p&gt;And the Wiki tab on the repo's GitHub page, which used to be a dead link nobody clicked, becomes a real place people open on purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The habit side, which is the harder side
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tooling gets you the plumbing. Habits get you the wiki.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The workflow is the easy part. The boring truth is that a wiki only stays useful if the team treats it like a living thing. That meant actually changing how we respond to questions.&lt;/p&gt;

&lt;p&gt;Three rules ended up sticking.&lt;/p&gt;

&lt;p&gt;Answer the question first. Always acknowledge in thread. Nobody wants their question ghosted with "it is in the wiki." Even if it is in the wiki, reply, link to it, say hi. The person asking is usually blocked, and being blocked while also being ignored is the worst combination.&lt;/p&gt;

&lt;p&gt;Then write the answer down. The reply and the wiki page edit are basically the same effort. You have already explained the thing in Slack. Copy-paste, clean it up, open a PR. The cost of documenting is "one extra paste and one extra PR." The cost of not documenting is "I will answer this same question again in six weeks."&lt;/p&gt;

&lt;p&gt;Wiki-only PRs do not need code review. The person writing the page is the expert on the thing they are writing about. Making them wait two days for two approvals to land a documentation fix kills the habit instantly. I still look at wiki PRs, but I do not block them.&lt;/p&gt;

&lt;p&gt;On our newer repo, that habit made it all the way into the Home page of the wiki itself. "If a teammate asks you something not covered here, answer in thread, then add the answer to the relevant page before end of day." That line sitting at the top of the wiki is a tiny, constant reminder that the wiki is everyone's job.&lt;/p&gt;

&lt;p&gt;Has this happened to you also? One person on the team quietly turns into the Slack search engine, and by the time everyone notices, they are already burnt out from it. I would genuinely love to hear how your team handled it, if you have been through this before.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest caveats
&lt;/h2&gt;

&lt;p&gt;I am not going to pretend this setup is free or that it solved everything.&lt;/p&gt;

&lt;p&gt;Duplication across repos is real and you do feel it. The same port-collision note, the same env-variable note, the same "do not forget to bump this version" note can end up in two or three different wikis. I decided early that I was okay with this. A shared page nobody can find is strictly worse than two slightly-duplicated pages that both readers find in the place they were already looking.&lt;/p&gt;

&lt;p&gt;Seeding takes effort. The first few pages are a slog. You sit there, cup of coffee, trying to remember all the bits of tribal knowledge that you know exist but have never actually written down. Do not expect the team to show up on day one with a flood of contributions. For a while, you will be writing the first version of most pages yourself. That is fine. Once there is stuff on the page, people find it much easier to add to.&lt;/p&gt;

&lt;p&gt;The PAT, as I mentioned, is a real footgun. It has repo-write scope. It expires. One day it will quietly stop working.&lt;/p&gt;

&lt;p&gt;And then the honest, slightly uncomfortable one. A wiki does not magically make people read documentation. If a teammate was not reading the README, they may not read the wiki either, at least not at first. What the wiki changes is the answer I give them. Instead of re-typing the explanation for the fifth time, I reply with "good question, this is on the wiki page for exactly this" and link it. Over time, the shape of the reply itself trains the team to check the wiki first. But it is a slow training, not an overnight fix.&lt;/p&gt;

&lt;p&gt;This whole thing only works if the team genuinely commits to the ask-answer-document habit. If the culture on your team is "answer in DMs, feel good about being helpful, move on," the wiki will rot the same way the old Wiki tab rotted. Tooling will not fix that. You need at least a couple of people, ideally including one tech lead, who will treat every answered question as an unwritten wiki page. Without that, skip the whole project.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;Not overthink the page structure on day one. Three pages is enough to start. Home. Development Setup. Deployment. That is it. The rest of the structure will emerge from the questions that actually come in. If you try to design the perfect information architecture on day zero, you will end up with a beautiful empty shell that nobody writes into.&lt;/p&gt;

&lt;p&gt;Put the path filter on the workflow from day one. I had it triggering on every push to main for a while, which made the Actions tab noisier than it needed to be.&lt;/p&gt;

&lt;p&gt;Add a one-line "edit this page via a PR to &lt;code&gt;wiki/&amp;lt;filename&amp;gt;.md&lt;/code&gt;" contributing note on every page from the start. It removes the confusion my teammate ran into, and it is the kind of thing you forget to add later.&lt;/p&gt;

&lt;p&gt;Spend less time arguing with the team about whether a per-repo wiki is the right shape, and more time just shipping page two. The first page is a statement of intent. The second page is when the team actually starts believing in it.&lt;/p&gt;

&lt;h2&gt;
  
  
  One small thing that changed
&lt;/h2&gt;

&lt;p&gt;The most recent new joiner ran the local stack without pinging me once. Not one message. That is the only metric I actually care about, honestly. And quietly, the teammate who triggered half of this is now leaving thoughtful comments on wiki PRs, which made me smile the first time I saw it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it on one repo
&lt;/h2&gt;

&lt;p&gt;Do not roll this out across your whole organisation next Monday. Nothing nukes a good idea faster than forcing it on twenty teams at once.&lt;/p&gt;

&lt;p&gt;Pick one repo. Ideally the one whose Slack threads you keep scrolling through to find the same answer you gave last month. Add a &lt;code&gt;wiki/&lt;/code&gt; folder. Copy the workflow above. Create the Wiki tab once through the GitHub UI so the hidden repo exists. Seed three pages. And see if anyone on the team opens a PR for page four.&lt;/p&gt;

&lt;p&gt;A wiki that sits right next to the codebase is not a radical idea. It is just an obvious one that somehow almost nobody does. Give it a go and see how it feels for your team.&lt;/p&gt;

&lt;p&gt;So yeah, that is my take. Yours might be completely different, and that is exactly what makes this whole space interesting. Catch you in the next blog - should not be too long from now.&lt;/p&gt;

</description>
      <category>documentation</category>
      <category>githubactions</category>
      <category>devops</category>
      <category>teamprocess</category>
    </item>
  </channel>
</rss>
