<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mustafa ERBAY</title>
    <description>The latest articles on Forem by Mustafa ERBAY (@merbayerp).</description>
    <link>https://forem.com/merbayerp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3921203%2Fe3a198a1-49a0-466f-99e6-74bdf202a867.png</url>
      <title>Forem: Mustafa ERBAY</title>
      <link>https://forem.com/merbayerp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/merbayerp"/>
    <language>en</language>
    <item>
      <title>Living on My Own Server: The Balance of Time and Freedom</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sun, 10 May 2026 13:05:24 +0000</pubDate>
      <link>https://forem.com/merbayerp/living-on-my-own-server-the-balance-of-time-and-freedom-15j1</link>
      <guid>https://forem.com/merbayerp/living-on-my-own-server-the-balance-of-time-and-freedom-15j1</guid>
      <description>&lt;p&gt;Recently, I woke up to three different Docker containers being OOM-killed consecutively on my own VPS. PagerDuty rang at 4:30 AM, and my first reaction was, "Again?" This was a typical morning start in what I call &lt;strong&gt;living on my own server&lt;/strong&gt;, where I manage all my personal projects and side products.&lt;/p&gt;

&lt;p&gt;This situation is a concrete example of a long-standing quest for balance: the time and effort I invest in exchange for the freedom and control I possess. Living on my own server has become a choice, a philosophy for me.&lt;/p&gt;

&lt;h2&gt;The Price of Freedom: Time and Effort&lt;/h2&gt;

&lt;p&gt;I have a long-standing habit: setting up and managing everything with my own hands. This is essentially a reflection of the experience I've gained in system architecture and operations since 2006, applied to my personal life. It holds true for this blog and projects like &lt;code&gt;hesapciyiz.com&lt;/code&gt;, &lt;code&gt;spamkalkani.com&lt;/code&gt;, and &lt;code&gt;islistesi.com&lt;/code&gt;, all hosted on my own server.&lt;/p&gt;

&lt;p&gt;This freedom, of course, comes at a price. Sometimes, as on April 28th when the disk hit 100% and I watched &lt;code&gt;kcompactd&lt;/code&gt; chew 92% CPU, I have to spend hours at the terminal, coffee in hand. &lt;code&gt;sshd&lt;/code&gt; refusing to &lt;code&gt;accept&lt;/code&gt; new connections, a Docker disk fire, or having to blacklist kernel modules like &lt;code&gt;algif_aead&lt;/code&gt; to mitigate CVEs: these are among the responsibilities this freedom brings.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ A Real-World Scenario&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One day, I accidentally left a &lt;code&gt;sleep 360&lt;/code&gt; running in the background on my VPS and later found it had been OOM-killed. Even a slip this small can ripple through the whole system. Moments like these keep nudging me toward more robust approaches such as polling-wait. To err is human; the important thing is to learn from the mistakes.&lt;/p&gt;
&lt;/blockquote&gt;
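
&lt;p&gt;The polling-wait idea from that note can be sketched in a few lines of shell. This is a minimal illustration rather than the exact script I use, and the &lt;code&gt;test -f /tmp/build-done&lt;/code&gt; readiness check is a hypothetical placeholder:&lt;/p&gt;

```shell
# Polling-wait: instead of a blind `sleep 360`, poll for the condition
# you are actually waiting for, and give up after a timeout.
wait_for() {
  # usage: wait_for TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
  timeout="$1"; interval="$2"; shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1  # condition never became true within the timeout
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Example: wait up to 360s, checking every 5s, for a build artifact
# wait_for 360 5 test -f /tmp/build-done
```

&lt;p&gt;Unlike a fixed &lt;code&gt;sleep&lt;/code&gt;, this returns as soon as the condition holds and fails loudly when it never does, so a hung step can't silently stall everything behind it.&lt;/p&gt;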

&lt;h2&gt;Why Do I Live on My Own Server? My Philosophy&lt;/h2&gt;

&lt;p&gt;So, despite all these problems and sleepless nights, why do I choose this path? There are a few fundamental philosophical reasons:&lt;/p&gt;

&lt;h3&gt;Full Control and Independence&lt;/h3&gt;

&lt;p&gt;On my own server, everything is my choice. From the operating system to the web server, from the database to the application layer, I choose every component. Astro, Node, SQLite, Nginx, systemd, GitHub Actions, and Cloudflare make up my daily toolset. I can shape this ecosystem according to my own needs, without being constrained by any vendor. This level of control is invaluable, especially when developing complex pipelines or specialized AI applications.&lt;/p&gt;

&lt;h3&gt;Continuous Learning and Experience&lt;/h3&gt;

&lt;p&gt;Even with 20 years of field experience, the world of technology is constantly changing. My own server acts as a sandbox for trying out new technologies, tools, and approaches. Moments when my Astro build consumes 2.5 GB of RAM and gets OOM-killed, or when Docker clogs the disk with 33 GB of build cache and 23 GB of unused images, transform theoretical knowledge into practical experience. These real-world scenarios constantly push me to find better solutions.&lt;/p&gt;

&lt;h3&gt;Economic Efficiency and Scalability&lt;/h3&gt;

&lt;p&gt;In some cases, hosting on my own server also offers cost advantages. Especially using self-hosted runners to avoid exceeding GitHub Actions quotas provides me with both flexibility and cost savings. The 13+ Docker containers I manage on my own VPS (including Postgres, Redis, Next.js applications) allow me to host multiple projects on a single server at a low cost. This is particularly important for side products.&lt;/p&gt;

&lt;h3&gt;Innovation and Paving My Own Way&lt;/h3&gt;

&lt;p&gt;When it comes to setting up AI-&lt;/p&gt;

</description>
      <category>life</category>
    </item>
    <item>
      <title>Nginx's Insidious DNS Trap: Unable to Reach Docker Containers</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sun, 10 May 2026 10:57:29 +0000</pubDate>
      <link>https://forem.com/merbayerp/nginxs-insidious-dns-trap-unable-to-reach-docker-containers-1o40</link>
      <guid>https://forem.com/merbayerp/nginxs-insidious-dns-trap-unable-to-reach-docker-containers-1o40</guid>
<description>&lt;p&gt;It was last Thursday morning when one of &lt;code&gt;hesapciyiz.com&lt;/code&gt;'s API endpoints suddenly started returning a 502 Bad Gateway error. My first thought, of course, was that the backend application had crashed.&lt;/p&gt;

</description>
      <category>nginx</category>
      <category>docker</category>
      <category>dns</category>
      <category>vps</category>
    </item>
    <item>
      <title>Docker Disk Storage Wars: A Guide to Data Integrity on a VPS</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sun, 10 May 2026 09:31:04 +0000</pubDate>
      <link>https://forem.com/merbayerp/docker-disk-storage-wars-a-guide-to-data-integrity-on-a-vps-2h2l</link>
      <guid>https://forem.com/merbayerp/docker-disk-storage-wars-a-guide-to-data-integrity-on-a-vps-2h2l</guid>
      <description>&lt;p&gt;Last month, on the morning of April 28th, I woke up to a "Disk Space Critical" email from my own VPS. When I ran &lt;code&gt;df -h&lt;/code&gt;, I saw that the &lt;code&gt;/&lt;/code&gt; directory was 100% full. As I suspected, Docker was the culprit again. When managing over 13 containers on the same server, if even one gets out of control, it's only a matter of time before the entire system goes into swap and gets OOM-killed.&lt;/p&gt;

&lt;p&gt;This situation is familiar to anyone who constantly struggles with disk storage issues, hosting multiple applications on their own server. In this post, I'll explain how I manage these "Docker Disk Storage Wars" on my own VPS, how I ensure data integrity, and how I optimize disk space. My goal is to guide you with practical solutions and experiences I've personally had.&lt;/p&gt;

&lt;h2&gt;Why Do Docker Disk Storage Wars Erupt?&lt;/h2&gt;

&lt;p&gt;My VPS's disk space suddenly filling up has almost become a routine for me. Most of the time, the root of the problem lies in unexpected container log growth, unnecessary images, or build caches. For example, at one point, the build caches of my Next.js applications reached 33 GB, and with unused images added on top, they consumed another 23 GB of disk space. That's when the disk hit 100%.&lt;/p&gt;

&lt;p&gt;Such situations can lead to severe performance degradation and even service outages, especially if you're using a VPS with limited resources. A container's disk I/O maxing out can lead to &lt;code&gt;kcompactd&lt;/code&gt; using 92% CPU and sshd being unable to accept new connections. That's why it's crucial to understand the root cause of the problem.&lt;/p&gt;

&lt;h3&gt;Common Disk Space Consumers&lt;/h3&gt;

&lt;p&gt;Knowing the biggest disk space consumers in the Docker ecosystem is the first step to solving the problem. In my experience, the main culprits are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dangling Images and Volumes:&lt;/strong&gt; Unused or disconnected images and volumes can unknowingly occupy gigabytes of space. This becomes inevitable, especially if you frequently rebuild images.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Build Cache:&lt;/strong&gt; Even if you don't use multi-stage builds, Docker creates intermediate layers at each build step. These caches can accumulate and reach enormous sizes. My 33 GB build cache issue was a prime example of this.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Container Logs:&lt;/strong&gt; Especially verbose applications or services running in debug mode can cause log files to grow uncontrollably. Gigabytes of logs can accumulate within a few days; I even witnessed critical services on an internal banking platform halt due to this.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ephemeral Data and Temporary Files:&lt;/strong&gt; Temporary files created by applications during operation can take up permanent disk space if not properly cleaned.&lt;/li&gt;
&lt;/ul&gt;
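
&lt;p&gt;For the container-log problem in particular, capping the default &lt;code&gt;json-file&lt;/code&gt; log driver goes a long way. A sketch of what that can look like in &lt;code&gt;/etc/docker/daemon.json&lt;/code&gt; (the sizes are illustrative, not a recommendation):&lt;/p&gt;

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

&lt;p&gt;Keep in mind that these limits apply only to containers created after the Docker daemon is restarted; existing containers keep whatever logging options they were started with.&lt;/p&gt;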

&lt;p&gt;The &lt;code&gt;docker system df&lt;/code&gt; command is very useful for identifying these issues. This command provides a detailed summary of Docker's disk usage, helping me understand which component occupies how much space.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker system &lt;span class="nb"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this command shows how much disk space each Docker component is using. For example, you can find detailed information under categories like "Images", "Containers", "Local Volumes", and "Build Cache". The "Reclaimable" area, in particular, shows the amount of disk space you can recover with manual intervention.&lt;/p&gt;
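
&lt;p&gt;When the "Reclaimable" column grows too large, this is roughly the cleanup pass I reach for. A hedged sketch: the &lt;code&gt;DRY_RUN&lt;/code&gt; wrapper is my own convention for previewing before deleting anything, and the 5GB build-cache floor is an arbitrary example:&lt;/p&gt;

```shell
# Reclaim Docker disk space step by step instead of a blanket `prune -a`.
# With DRY_RUN=1 (the default here) the commands are only printed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run docker container prune -f                   # remove stopped containers
run docker image prune -a -f                    # remove images no container uses
run docker builder prune -f --keep-storage 5GB  # trim build cache, keep 5GB
run docker volume prune -f                      # remove unreferenced volumes
```

&lt;p&gt;Note that &lt;code&gt;docker volume prune&lt;/code&gt; is the risky one for data integrity: a volume that merely looks unused to Docker may still be your only copy of something. Preview first, prune second.&lt;/p&gt;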

&lt;h2&gt;Data Integrity and Persistent Storage Strategies&lt;/h2&gt;

&lt;p&gt;One of the most crucial aspects when running applications in Docker is ensuring data persistence, even if containers die. Since I host my own sites and side products (like hesapciyiz.com, spamkalkani.com), data loss is critical. A wrong configuration or automatic cleanup can instantly wipe out your data.&lt;/p&gt;

&lt;p&gt;Therefore, it's essential to correctly understand Docker's storage mechanisms and develop proactive strategies. I generally prefer using &lt;code&gt;volumes&lt;/code&gt; because they are easier to manage and are a persistent storage solution designed within Docker itself.&lt;/p&gt;

&lt;h3&gt;Docker Volumes and Bind Mounts&lt;/h3&gt;

&lt;p&gt;Docker offers two primary methods for making data persistent: &lt;code&gt;volumes&lt;/code&gt; and &lt;code&gt;bind mounts&lt;/code&gt;. Both have their advantages and disadvantages, and the choice depends somewhat on the use case.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Volumes:&lt;/strong&gt; These are file systems managed by Docker. They are typically located under `/var/lib/&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>vps</category>
      <category>storage</category>
      <category>data-integrity</category>
    </item>
    <item>
      <title>VPS Swap Fire: A Nightmare Started by a Kernel CVE Patch</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sun, 10 May 2026 01:23:44 +0000</pubDate>
      <link>https://forem.com/merbayerp/vps-swap-fire-a-nightmare-started-by-a-kernel-cve-patch-3c4f</link>
      <guid>https://forem.com/merbayerp/vps-swap-fire-a-nightmare-started-by-a-kernel-cve-patch-3c4f</guid>
<description>&lt;p&gt;Last week, on a Monday morning, the "Critical Alert" notifications on my monitor made my stomach drop. The systems running on my VPS, especially my Docker containers, had suddenly started to slow down. Even SSH connections were lagging, and my commands were taking a long time to execute. Yet I hadn't changed anything; I had only applied the usual overnight updates.&lt;/p&gt;

&lt;p&gt;This sudden slowdown was a major problem for me. Because this VPS is my entire world. It runs over 13 Docker containers: a PostgreSQL database, Redis cache, my Next.js applications, and of course, the Astro site where this blog is published. Everything was running smoothly together. Until this morning. I started a deep dive to find the source of the problem.&lt;/p&gt;

&lt;h2&gt;Swap Usage Spiraling Out of Control&lt;/h2&gt;

&lt;p&gt;The first place I looked was the server's overall resource usage. The moment I ran the &lt;code&gt;htop&lt;/code&gt; command, I couldn't believe my eyes: Swap usage was nearing 100%. Normally, I keep my swap space very low, sometimes I even disable it. But this time, the situation was different. Such high swap usage indicated that the system's RAM was insufficient and it had started using the swap space on the disk. This, in turn, caused performance to plummet.&lt;/p&gt;

&lt;p&gt;Why had swap usage suddenly spiked so high? I immediately checked the &lt;code&gt;dmesg&lt;/code&gt; and &lt;code&gt;journalctl&lt;/code&gt; logs. I was seeing a lot of warnings related to &lt;code&gt;kcompactd&lt;/code&gt; and &lt;code&gt;oom-killer&lt;/code&gt;. I noticed that &lt;code&gt;kcompactd&lt;/code&gt; was consuming CPU at around 90%. This signaled that the kernel was experiencing a serious issue with memory management.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ The Dangers of Swap Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Swap space is a disk-based storage area that comes into play when physical RAM (memory) is insufficient. However, disks are much slower than RAM. Increased swap usage directly leads to a noticeable drop in system performance. Excessive swap usage can cause the server to freeze or processes to be abruptly terminated by the &lt;code&gt;oom-killer&lt;/code&gt; (Out-Of-Memory killer).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;A Nightmare Started by a Kernel CVE Patch&lt;/h2&gt;

&lt;p&gt;As I examined the logs in more detail, I realized the problem had started with the kernel update I applied overnight. I had specifically applied a patch related to &lt;em&gt;CVE-2026-31431&lt;/em&gt;. This CVE was intended to close a security vulnerability in the kernel's network stack. However, it seemed this patch had caused unexpected side effects on my system.&lt;/p&gt;

&lt;p&gt;This CVE patch was closely related to the kernel's memory management. It contained a fix specifically for the &lt;code&gt;algif_aead&lt;/code&gt; module, which is used in VPN and encryption operations. Although I wasn't making VPN connections directly, Docker's network operations and some firewall rules may have touched this module indirectly. None of this is a "corporate consultant" war story; it's entirely my own experience, one of those "these things happen" moments.&lt;/p&gt;

&lt;h2&gt;Identifying the Source of the Problem&lt;/h2&gt;

&lt;p&gt;The reason behind &lt;code&gt;kcompactd&lt;/code&gt; consuming so much CPU was the kernel's attempt to keep memory pages contiguous. However, this process was causing a bottleneck in memory management. Everything had started with the kernel update I applied overnight. In my case, this update was incompatible with my existing setup.&lt;/p&gt;

&lt;p&gt;At this point, I remembered times when my Astro build consumed a lot of RAM. In those situations, the system would also resort to swap. But this time, the problem was deeper, at the kernel level. &lt;code&gt;kcompactd&lt;/code&gt; reaching 92% CPU usage was not normal. This situation had rendered the server unable to even accept SSH connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# A snippet from dmesg logs (not the actual error message, for illustration purposes)&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;Mon May 09 06:15:32 2026] kcompactd0: highmem-intensive workload detected, entering compact mode
&lt;span class="o"&gt;[&lt;/span&gt;Mon May 09 06:16:01 2026] Out of memory: Kill process 12345 &lt;span class="o"&gt;(&lt;/span&gt;kworker/u8:1&lt;span class="o"&gt;)&lt;/span&gt; score 1000 or sacrifice child
&lt;span class="o"&gt;[&lt;/span&gt;Mon May 09 06:16:05 2026] systemd invoked oom-killer: &lt;span class="nv"&gt;gfp_mask&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0xd0, &lt;span class="nv"&gt;order&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;oom_score_adj&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Solutions and Trade-offs&lt;/h2&gt;

&lt;p&gt;The first thing that came to my mind to solve the problem was to revert the kernel update I had applied. However, this meant reintroducing a version with a security vulnerability back into the system. This was not an acceptable option. Instead, I needed to adopt a more secure approach.&lt;/p&gt;

&lt;p&gt;Another option was to adjust &lt;code&gt;kcompactd&lt;/code&gt;'s behavior. By changing kernel parameters, I could make the memory compaction process less aggressive. However, this would not be a long-term solution and could lead to other problems.&lt;/p&gt;

&lt;p&gt;Ultimately, I decided the most logical path was to find an alternative to the problematic CVE patch. It would take more time, but it was the safe option.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Adjusting Kernel Parameters (Be Careful!)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While it's possible to adjust the kernel's memory management, these operations must be done very carefully. An incorrect parameter change can lead to system instability or prevent it from booting. These settings are usually made via the &lt;code&gt;/etc/sysctl.conf&lt;/code&gt; file or files within the &lt;code&gt;/etc/sysctl.d/&lt;/code&gt; directory. However, this approach would be a short-term solution in my case.&lt;/p&gt;
&lt;/blockquote&gt;
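
&lt;p&gt;For illustration only, a fragment of the kind of tuning I mean might look like this. These are values I'm comfortable experimenting with on a small VPS, not recommendations:&lt;/p&gt;

```
# /etc/sysctl.d/99-memory-example.conf (illustrative values)
vm.swappiness = 10    # prefer reclaiming page cache over swapping out processes
```

&lt;p&gt;Changes under &lt;code&gt;/etc/sysctl.d/&lt;/code&gt; can be applied with &lt;code&gt;sysctl --system&lt;/code&gt;, and a single value can be tried temporarily with &lt;code&gt;sysctl -w&lt;/code&gt; before you persist it.&lt;/p&gt;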

&lt;h2&gt;Temporary Solution: Reducing Swap Usage&lt;/h2&gt;

&lt;p&gt;Until I got to the root of the problem, I had to implement some temporary solutions to keep the system running. First, I cleaned up unnecessary Docker images and build caches. I tried to free up disk space with the command &lt;code&gt;docker system prune -a&lt;/code&gt;. Then, I focused on optimizing the build process of my Astro project.&lt;/p&gt;

&lt;p&gt;During this time, I also recalled the &lt;code&gt;runner state corruption&lt;/code&gt; issue I experienced with GitHub Actions. In that case, deleting directories under &lt;code&gt;/home/runner/_work/_temp&lt;/code&gt; had resolved it. Such issues indicate imbalances in the current system.&lt;/p&gt;

&lt;p&gt;As a temporary solution, I considered writing a script that would automatically stop or lower the priority of certain operations when swap usage was very high. However, this was not a complete fix, just a preventive measure.&lt;/p&gt;
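
&lt;p&gt;That watchdog idea can be sketched in a few lines. This is an illustration under stated assumptions: the 80% threshold is arbitrary, and the &lt;code&gt;echo&lt;/code&gt; stands in for a real action such as renicing a build process:&lt;/p&gt;

```shell
# Report swap usage and nag when it crosses a threshold.
# Reads a /proc/meminfo-style file so the logic is easy to test.
swap_pct() {
  awk '/^SwapTotal:/ { total = $2 }
       /^SwapFree:/  { free  = $2 }
       END {
         if (total == 0) print 0
         else print int((total - free) * 100 / total)
       }' "$1"
}

THRESHOLD=80
if [ -r /proc/meminfo ]; then
  pct=$(swap_pct /proc/meminfo)
  if [ "$pct" -ge "$THRESHOLD" ]; then
    echo "swap at ${pct}%: renice or pause heavy jobs before oom-killer chooses for you"
  fi
fi
```

&lt;p&gt;Dropped into cron or wired to a systemd timer, even a crude check like this buys time before things escalate to the point where SSH stops responding.&lt;/p&gt;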

&lt;h2&gt;The Real Solution: CVE Patch Alternative&lt;/h2&gt;

&lt;p&gt;Instead of reverting the official patch for CVE-2026-31431, I decided to use an alternative kernel module that addressed a similar vulnerability without being tied to that specific patch. This required some research. I needed to find a more stable encryption module compatible with my system, instead of &lt;code&gt;algif_aead&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Finally, I found a kernel version that mitigated the impact of this specific CVE and ran stably on my system. I installed the new kernel version and restarted the system. My first check confirmed that swap usage had returned to normal levels, and &lt;code&gt;kcompactd&lt;/code&gt; was no longer straining the CPU. SSH connections had sped up.&lt;/p&gt;

&lt;p&gt;During this period, I also remembered the disk full issue I experienced on my own VPS on April 28th. At that time, there were 33 GB of build cache and 23 GB of unused images on the disk. While this current issue was more related to memory management, I once again saw how important regular cleaning and optimization are for the overall health of the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Pipeline Reliability Pattern: Preflight, Auto-fix, Dedup-Alert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When I encounter unexpected issues like this, I try to apply a general "pipeline reliability" pattern. This pattern is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Preflight Resource Guard:&lt;/strong&gt; Before starting an operation, check that resources (disk, RAM, CPU) are sufficient.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Auto-fix:&lt;/strong&gt; Automatically apply fixes for issues that can be resolved without intervention (disk cleanup, a simple service restart).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Dedup-Alert:&lt;/strong&gt; Prevent repeated alerts for the same issue; try to fix the problem first, and notify only if it cannot be resolved.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The AI-assisted content pipeline behind this blog was designed around the same pattern.&lt;/p&gt;
&lt;/blockquote&gt;
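
&lt;p&gt;The first step of that pattern, the preflight resource guard, is the cheapest to adopt. A minimal sketch assuming a free-disk check before a heavy build; the 500 MB figure is arbitrary:&lt;/p&gt;

```shell
# Preflight resource guard: refuse to start a heavy job when the disk
# is already too full to survive it.
preflight_disk() {
  # usage: preflight_disk NEEDED_MB PATH
  need_mb="$1"; path="$2"
  free_mb=$(df -Pm "$path" | awk 'NR == 2 { print $4 }')
  if [ "$free_mb" -lt "$need_mb" ]; then
    echo "preflight failed: ${free_mb}MB free on ${path}, need ${need_mb}MB"
    return 1
  fi
  return 0
}

# Example: require 500MB free before an Astro build
# preflight_disk 500 / || exit 1
```

&lt;p&gt;The same shape works for RAM or load average; the point is that the job refuses to start instead of dying halfway through and leaving debris behind.&lt;/p&gt;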

&lt;h2&gt;Lessons Learned and Future Steps&lt;/h2&gt;

&lt;p&gt;This experience taught me several important lessons. Firstly, I realized I need to be much more careful when applying kernel updates. Every update can lead to unexpected side effects on the system. Especially in production environments, testing updates in a staging environment is essential.&lt;/p&gt;

&lt;p&gt;Secondly, regular monitoring of system resources (RAM, swap) and early detection of anomalies are crucial. In addition to tools like &lt;code&gt;htop&lt;/code&gt;, &lt;code&gt;dmesg&lt;/code&gt;, and &lt;code&gt;journalctl&lt;/code&gt;, using more advanced monitoring systems can be beneficial. When managing so many containers on my own server, even a single container issue can affect the entire system.&lt;/p&gt;

&lt;p&gt;Finally, while it's important to apply CVE patches quickly and effectively, I must not forget that the patches themselves can cause problems. So when applying security fixes, I need to watch the system's overall behavior closely. Perhaps in my next post I'll put together a guide on the economics of self-hosted GitHub Actions runners and using a VPS to stay within quotas.&lt;/p&gt;

&lt;p&gt;Have you ever encountered such unexpected system issues? I'd love to hear about it in the comments.&lt;/p&gt;

</description>
      <category>vps</category>
      <category>swap</category>
      <category>kernel</category>
      <category>cve</category>
    </item>
    <item>
      <title>Cloudflare Cache's Blind Spot: The Cost of Bypass Rules</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sat, 09 May 2026 18:46:04 +0000</pubDate>
      <link>https://forem.com/merbayerp/cloudflare-caches-blind-spot-the-cost-of-bypass-rules-3223</link>
      <guid>https://forem.com/merbayerp/cloudflare-caches-blind-spot-the-cost-of-bypass-rules-3223</guid>
<description>&lt;p&gt;Last month, I started noticing unexplained spikes in server load on my blog, &lt;code&gt;mustafaerbay.com&lt;/code&gt;, especially when publishing new articles or when popular content received traffic. I manage over 13 Docker containers on my VPS, and when one misbehaved, the others would start swapping, &lt;code&gt;kcompactd&lt;/code&gt; would eat 90% of the CPU, and even SSH connections became impossible to establish. This was surprising for a site sitting behind Cloudflare. I even recall the Astro build process consuming 2.5 GB of RAM and straining the system's 7.6 GB. Surely Cloudflare's cache should have absorbed all of this, right?&lt;/p&gt;

&lt;p&gt;The problem wasn't Cloudflare itself, but rather my reliance on cache bypass rules and the blind spot created by the &lt;code&gt;Cache-Control&lt;/code&gt; headers coming from my origin server. It took me a while to realize how Astro's default &lt;code&gt;max-age=0&lt;/code&gt; setting was completely disrupting the caching logic I had established with Cloudflare. In this post, I'll explain how I diagnosed this cache blind spot, its causes, and how I gained control over cache behavior using Cloudflare and Nginx.&lt;/p&gt;

&lt;h2&gt;The Problem: Unexpected Side Effects of Cache Bypass Rules&lt;/h2&gt;

&lt;p&gt;On my own blog, I had defined some cache bypass rules on Cloudflare for specific dynamic-looking sections like &lt;code&gt;mustafaerbay.com/feed.xml&lt;/code&gt; or backend services like &lt;code&gt;/api/*&lt;/code&gt;. My goal was to ensure these particular URLs always fetched the latest data from the origin. It seemed logical. However, over time, I noticed that these rules were causing unintended side effects.&lt;/p&gt;

&lt;p&gt;Even static content in the &lt;code&gt;/blog/*&lt;/code&gt; path of my blog posts occasionally started hitting my origin server directly. Although I could see &lt;code&gt;cf-cache-status: HIT&lt;/code&gt; for some requests in the Cloudflare panel, my origin server's CPU and I/O load indicated that caching wasn't working as it should. This situation led to my server becoming overloaded, especially during peak traffic, and at one point, it escalated into a Docker disk fire. 33 GB of build cache and 23 GB of unused images filled the disk to 100%, causing the VPS to completely freeze. This wasn't just a caching issue; it became a chain reaction that turned into an operational nightmare.&lt;/p&gt;

&lt;h3&gt;Symptoms and Initial Observations&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;High Origin Load:&lt;/strong&gt; My server's CPU and RAM usage were significantly higher than expected, despite Cloudflare being in front. I was seeing &lt;code&gt;node&lt;/code&gt; processes peaking in &lt;code&gt;htop&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Inconsistent Page Load Times:&lt;/strong&gt; Some users experienced lightning-fast page loads, while others faced noticeable delays. This was a clear sign of inconsistent caching.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;code&gt;cf-cache-status: BYPASS&lt;/code&gt;:&lt;/strong&gt; In browser developer tools or during &lt;code&gt;curl&lt;/code&gt; tests, I started seeing the &lt;code&gt;cf-cache-status: BYPASS&lt;/code&gt; header for some static content. This was my first red flag.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Disk Fullness:&lt;/strong&gt; Increased traffic and unexpected origin requests caused log files to grow rapidly and temporary files for Docker containers to bloat. This led to the 100% disk usage mentioned earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Sharing Experience&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Performance issues like these rarely stem from a single point. While a cache bypass rule might be a trigger, as I experienced, underlying issues like poor disk management or application resource consumption can exacerbate the situation. I always try to see the whole picture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Diagnosis: Where Did I Start?&lt;/h2&gt;

&lt;p&gt;To understand the problem, I proceeded step-by-step. Instead of panicking, I used the data and tools at hand to figure out what was happening.&lt;/p&gt;

&lt;h3&gt;Cloudflare Logs and Analytics&lt;/h3&gt;

&lt;p&gt;First, I checked the Analytics section in the Cloudflare panel. While the edge cache hit ratios generally looked good, I noticed a high &lt;code&gt;Cache Bypass&lt;/code&gt; rate for certain URL patterns. This provided the initial clue as to which requests were reaching the origin. I specifically examined the &lt;code&gt;Security&lt;/code&gt; and &lt;code&gt;Firewall&lt;/code&gt; logs to see which rules were being triggered and how they were affecting cache behavior.&lt;/p&gt;

&lt;h3&gt;Origin Server Logs&lt;/h3&gt;

&lt;p&gt;I dove into the Nginx and Node.js (where my Astro application was running) logs. In Nginx's &lt;code&gt;access.log&lt;/code&gt; file, I started seeing continuous 200 OK responses for static blog post URLs that should have been served by Cloudflare. I confirmed whether a request came from Cloudflare or directly by looking at the &lt;code&gt;X-Forwarded-For&lt;/code&gt; header in these logs. Using &lt;code&gt;grep&lt;/code&gt; and &lt;code&gt;awk&lt;/code&gt; commands, I quickly compared request counts for specific URL patterns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Count requests containing "/blog/" in Nginx access log&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"/blog/"&lt;/span&gt; /var/log/nginx/access.log | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;

&lt;span class="c"&gt;# Filter requests from a specific IP (Cloudflare IP range)&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"172.68.XXX.XXX"&lt;/span&gt; /var/log/nginx/access.log | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"/blog/"&lt;/span&gt; | &lt;span class="nb"&gt;wc&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Tests with &lt;code&gt;curl&lt;/code&gt; and &lt;code&gt;httpie&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;I obtained the most concrete data through &lt;code&gt;curl&lt;/code&gt; commands. I tested cache behavior by sending different headers (especially &lt;code&gt;Cookie&lt;/code&gt; or &lt;code&gt;User-Agent&lt;/code&gt;). Cloudflare's &lt;code&gt;cf-cache-status&lt;/code&gt; header clearly indicates whether a request came from the cache (&lt;code&gt;HIT&lt;/code&gt;), went to the origin (&lt;code&gt;BYPASS&lt;/code&gt;), or was cached for the first time (&lt;code&gt;MISS&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check cache status for a blog post&lt;/span&gt;
curl &lt;span class="nt"&gt;-svo&lt;/span&gt; /dev/null https://mustafaerbay.com/blog/cloudflare-cachein-kor-noktasi-bypass-kuralinin-bedeli 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"cf-cache-status"&lt;/span&gt;

&lt;span class="c"&gt;# Test request with a Cookie (Cloudflare usually bypasses if Cookie is present)&lt;/span&gt;
curl &lt;span class="nt"&gt;-svo&lt;/span&gt; /dev/null &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Cookie: my_session_id=123"&lt;/span&gt; https://mustafaerbay.com/blog/cloudflare-cachein-kor-noktasi-bypass-kuralinin-bedeli 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"cf-cache-status"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Quick Check&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The command &lt;code&gt;curl -svo /dev/null https://mysite.com/path 2&amp;gt;&amp;amp;1 | grep -i "cf-cache-status"&lt;/code&gt; is an indispensable tool for instantly checking Cloudflare cache status.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;Browser Developer Tools&lt;/h3&gt;

&lt;p&gt;Finally, I opened my browser's developer tools (F12), went to the Network tab, and inspected the headers of the requests made when I refreshed the page. The &lt;code&gt;cf-cache-status&lt;/code&gt; and &lt;code&gt;cache-control&lt;/code&gt; headers were clearly visible here as well. Seeing the combination of &lt;code&gt;cf-cache-status: BYPASS&lt;/code&gt; and &lt;code&gt;Cache-Control: max-age=0&lt;/code&gt; made me understand the root of the problem much better.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Cause Analysis: Why Was It Bypassing?
&lt;/h2&gt;

&lt;p&gt;All these diagnostic steps led me to two primary issues: the broad scope of Cloudflare's cache bypass rules and the misinterpretation of &lt;code&gt;Cache-Control&lt;/code&gt; headers from my origin server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloudflare Page Rules vs. Cache Rules
&lt;/h3&gt;

&lt;p&gt;Cloudflare has two main rule sets: Page Rules and Cache Rules. Page Rules are older and more general-purpose, while Cache Rules are designed for more specific control over cache behavior. In my scenario, a rule defined in Page Rules had &lt;code&gt;Cache Level: Bypass&lt;/code&gt; set for specific paths. While this rule targeted dynamic endpoints like &lt;code&gt;*mustafaerbay.com/api/*&lt;/code&gt;, it unintentionally affected static content like &lt;code&gt;/blog/*&lt;/code&gt; due to its broad scope.&lt;/p&gt;

&lt;p&gt;Cloudflare's default caching behavior was also important here. It generally caches &lt;code&gt;GET&lt;/code&gt; requests, but it bypasses requests containing &lt;code&gt;Cookie&lt;/code&gt; headers, &lt;code&gt;Query String&lt;/code&gt;s, or directives like &lt;code&gt;Cache-Control: no-cache&lt;/code&gt; by default. When my bypass rule combined with this default behavior, even static content was being routed to the origin.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Role of the &lt;code&gt;Cache-Control&lt;/code&gt; Header
&lt;/h3&gt;

&lt;p&gt;The biggest blind spot was the &lt;code&gt;Cache-Control&lt;/code&gt; header that my Astro application was returning by default. Although Astro is a static site generator, when run as a Node.js-based SSR (Server-Side Rendering) or API endpoint, it can often return headers like &lt;code&gt;Cache-Control: max-age=0, must-revalidate&lt;/code&gt; by default.&lt;/p&gt;

&lt;p&gt;This header instructed Cloudflare to "revalidate this content immediately; don't serve it from cache." Even without my Cloudflare bypass rule, this &lt;code&gt;max-age=0&lt;/code&gt; directive from the origin was preventing Cloudflare from caching the content; what it needed to see instead was something like &lt;code&gt;Cache-Control: public, max-age=X&lt;/code&gt;. I can say that Astro's default &lt;code&gt;max-age=0&lt;/code&gt; gave me quite a bit of trouble on my own blog, rendering even Cloudflare's powerful caching mechanism ineffective.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Important Information&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Cache-Control&lt;/code&gt; headers from your origin server can even override your Cloudflare Edge Cache TTL settings. If the origin sends &lt;code&gt;max-age=0&lt;/code&gt; or &lt;code&gt;no-cache&lt;/code&gt;, Cloudflare will generally respect this directive and not cache the content.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Solution: Taking Control of Cache Behavior
&lt;/h2&gt;

&lt;p&gt;After understanding the root cause of the problem, I took steps to gain full control over cache behavior using Cloudflare and Nginx.&lt;/p&gt;

&lt;h3&gt;
  
  
  More Precise Control with Cloudflare Cache Rules
&lt;/h3&gt;

&lt;p&gt;The first step was to review the old Page Rules in Cloudflare and define more specific Cache Rules instead.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Review and Narrow Down Existing Bypass Rules:&lt;/strong&gt; I refined the &lt;code&gt;Cache Level: Bypass&lt;/code&gt; rules in the existing Page Rules to be specific only to paths that truly needed to be dynamic, like &lt;code&gt;/api/*&lt;/code&gt; or &lt;code&gt;/login/*&lt;/code&gt;. I excluded static content like &lt;code&gt;/blog/*&lt;/code&gt; from these rules.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Define Cache Rules for Static Content:&lt;/strong&gt; I created a new Cache Rule for all content under the &lt;code&gt;/blog/*&lt;/code&gt; path. This rule sets the &lt;code&gt;Edge Cache TTL&lt;/code&gt; to a specific duration (e.g., 1 hour or 1 day) and sets the &lt;code&gt;Cache Level&lt;/code&gt; to &lt;code&gt;Cache Everything&lt;/code&gt;. I also kept options like &lt;code&gt;Bypass Cache on Cookie&lt;/code&gt; active only where truly necessary (like &lt;code&gt;/account/*&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An example of a Cloudflare Cache Rule (when defining via API or Terraform, it would be similar to this format):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mustafa_blog_cache_rule"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cache_settings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"edge_cache_ttl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;hour&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cache_level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cache_everything"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"(http.request.uri.path contains &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;/blog/&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;) and (not http.cookie)"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mustafa_api_bypass_rule"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cache_settings"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"edge_cache_ttl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Bypass&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cache_level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bypass_cache"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http.request.uri.path contains &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;/api/&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example caches requests under the &lt;code&gt;/blog/&lt;/code&gt; path that do not contain cookies for 1 hour, while bypassing all requests under the &lt;code&gt;/api/&lt;/code&gt; path. Creating these rules through the Cloudflare UI is much easier and more visual.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nginx Override on the Origin Server
&lt;/h3&gt;

&lt;p&gt;Cloudflare Cache Rules are great for managing cache on Cloudflare's side. However, the &lt;code&gt;Cache-Control: max-age=0&lt;/code&gt; header from my origin server was still an issue. Cloudflare tended to respect this directive. Therefore, I decided to override this header using Nginx.&lt;/p&gt;

&lt;p&gt;Using Nginx's &lt;code&gt;proxy_hide_header&lt;/code&gt; and &lt;code&gt;add_header&lt;/code&gt; directives, I hid the &lt;code&gt;Cache-Control&lt;/code&gt; and &lt;code&gt;Pragma&lt;/code&gt; headers coming from the origin and added my own desired &lt;code&gt;Cache-Control&lt;/code&gt; header instead. This prevented Cloudflare from seeing the &lt;code&gt;max-age=0&lt;/code&gt; directive and allowed it to cache the content with my specified cache duration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;mustafaerbay.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;# Redirect HTTP requests from Cloudflare to HTTPS (optional, but generally good practice)&lt;/span&gt;
    &lt;span class="c1"&gt;# if ($http_x_forwarded_proto != 'https') {&lt;/span&gt;
    &lt;span class="c1"&gt;#     return 301 https://$host$request_uri;&lt;/span&gt;
    &lt;span class="c1"&gt;# }&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/blog/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Hide Cache-Control and Pragma from Astro&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_hide_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_hide_header&lt;/span&gt; &lt;span class="s"&gt;Pragma&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;# Add our own Cache-Control: public cache for 1 hour&lt;/span&gt;
        &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt; &lt;span class="s"&gt;"public,&lt;/span&gt; &lt;span class="s"&gt;max-age=3600"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;# Port where Astro app is running&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-For&lt;/span&gt; &lt;span class="nv"&gt;$proxy_add_x_forwarded_for&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-Proto&lt;/span&gt; &lt;span class="nv"&gt;$scheme&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Other location blocks (e.g., /api/ or root path)&lt;/span&gt;
    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-For&lt;/span&gt; &lt;span class="nv"&gt;$proxy_add_x_forwarded_for&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-Proto&lt;/span&gt; &lt;span class="nv"&gt;$scheme&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# ... other Nginx settings&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this Nginx configuration, for all requests under the &lt;code&gt;/blog/&lt;/code&gt; path, the &lt;code&gt;Cache-Control&lt;/code&gt; header from Astro is hidden by Nginx, and &lt;code&gt;Cache-Control: public, max-age=3600&lt;/code&gt; is added instead. This allows Cloudflare to cache this content for 1 hour.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Caution: Nginx Override&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You must be very careful when overriding cache headers from the origin. If you accidentally cache dynamic content as static, users might see old or incorrect data. Therefore, I only applied the override for paths that truly needed to be static.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Implementation Steps and Verification
&lt;/h2&gt;

&lt;p&gt;After applying these changes, I performed extensive tests to ensure everything was working as expected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Review and Clean Up Existing Rules
&lt;/h3&gt;

&lt;p&gt;I reviewed all Page Rules and Cache Rules in the Cloudflare panel. I either deleted rules that conflicted or seemed unnecessary, or I narrowed their scope. I specifically disabled all bypass rules affecting the &lt;code&gt;/blog/*&lt;/code&gt; path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Check Origin Headers
&lt;/h3&gt;

&lt;p&gt;Before changing the Nginx configuration, I checked the headers coming directly from the Astro application with commands like &lt;code&gt;curl -I http://localhost:3000/blog/post-name&lt;/code&gt;. I made sure I was seeing &lt;code&gt;Cache-Control: max-age=0&lt;/code&gt;.&lt;/p&gt;
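&lt;p&gt;That check is easy to automate as a preflight step. Below is a small guard of my own devising (the function name is mine; it assumes response headers arrive on stdin) that flags origin headers which will stop a CDN from caching:&lt;/p&gt;

```shell
# Hypothetical guard: fail when the origin emits cache-hostile directives.
# Reads response headers from stdin.
origin_cache_ok() {
  if grep -qiE '^cache-control:.*(max-age=0|no-cache|no-store|private)'; then
    echo "origin headers will prevent CDN caching"
    return 1
  fi
  echo "origin headers look cacheable"
}

# What my Astro origin was sending (the problem case):
printf 'cache-control: max-age=0, must-revalidate\n' | origin_cache_ok || true

# What I wanted it to send:
printf 'cache-control: public, max-age=3600\n' | origin_cache_ok
```

&lt;p&gt;In practice I would pipe &lt;code&gt;curl -sI http://localhost:3000/blog/post-name&lt;/code&gt; into it instead of the canned strings above.&lt;/p&gt;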

&lt;h3&gt;
  
  
  Step 3: Define Cloudflare Cache Rule
&lt;/h3&gt;

&lt;p&gt;I defined the &lt;code&gt;mustafa_blog_cache_rule&lt;/code&gt; mentioned above through the Cloudflare UI. I set the &lt;code&gt;Edge Cache TTL&lt;/code&gt; to 1 hour and &lt;code&gt;Cache Level&lt;/code&gt; to &lt;code&gt;Cache Everything&lt;/code&gt;. I disabled options like cookie bypass for this rule, ensuring it would cache under all conditions (provided the &lt;code&gt;Cache-Control&lt;/code&gt; header from Nginx was considered).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Nginx Configuration Changes
&lt;/h3&gt;

&lt;p&gt;I SSH'd into my VPS, edited the &lt;code&gt;/etc/nginx/sites-available/mustafaerbay.com&lt;/code&gt; file, and applied the Nginx configuration above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nano /etc/nginx/sites-available/mustafaerbay.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After saving the changes, I tested the Nginx configuration and restarted it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;nginx &lt;span class="nt"&gt;-t&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl reload nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Testing and Verification
&lt;/h3&gt;

&lt;p&gt;Finally, I re-tested using &lt;code&gt;curl&lt;/code&gt; and browser developer tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-svo&lt;/span&gt; /dev/null https://mustafaerbay.com/blog/cloudflare-cachein-kor-noktasi-bypass-kuralinin-bedeli 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"cf-cache-status&lt;/span&gt;&lt;span class="se"&gt;\|&lt;/span&gt;&lt;span class="s2"&gt;cache-control"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was now seeing the &lt;code&gt;cf-cache-status: HIT&lt;/code&gt; and &lt;code&gt;Cache-Control: public, max-age=3600&lt;/code&gt; headers. This meant Cloudflare was successfully caching the content and no longer putting unnecessary load on the origin. My server's CPU and RAM usage also dropped significantly. Even when traffic to my blog increased, origin requests in the Nginx access logs fell sharply: repeated checks returned &lt;code&gt;HIT&lt;/code&gt; instead of &lt;code&gt;BYPASS&lt;/code&gt;, and what little origin traffic remained was mostly &lt;code&gt;304 Not Modified&lt;/code&gt; revalidations.&lt;/p&gt;
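&lt;p&gt;To put a rough number on the improvement, I sampled the page repeatedly and tallied the statuses. The five sample values below are illustrative; in practice each line would come from one of the &lt;code&gt;curl&lt;/code&gt; probes above:&lt;/p&gt;

```shell
# Tally cf-cache-status values into a hit ratio.
# The sample statuses stand in for repeated curl probes.
printf 'HIT\nHIT\nBYPASS\nHIT\nMISS\n' |
  awk '{ n++; if ($1 == "HIT") h++ }
       END { printf "%d/%d requests were HITs (%.0f%%)\n", h, n, 100*h/n }'
```

&lt;p&gt;A few MISSes right after a deploy are expected; sustained BYPASSes are the red flag.&lt;/p&gt;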

&lt;h2&gt;
  
  
  Lessons Learned and Best Practices
&lt;/h2&gt;

&lt;p&gt;This incident taught me several important lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Headers Are Critical:&lt;/strong&gt; HTTP headers, especially &lt;code&gt;Cache-Control&lt;/code&gt;, can be the smallest yet most impactful details in a system architecture. When using a CDN like Cloudflare, it's crucial to understand how headers from the origin server are interpreted.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Thorough Testing is Essential:&lt;/strong&gt; Just because everything looks fine in the Cloudflare panel doesn't mean everything is truly okay. Testing different scenarios with &lt;code&gt;curl&lt;/code&gt; is essential to see the actual behavior.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Beware of Bypass Rules:&lt;/strong&gt; Cache bypass rules are powerful, but they should be kept as specific as possible. Broadly scoped rules can lead to unexpected outcomes, as I experienced. Always ask, "What will this rule affect?"&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Monitoring is Indispensable:&lt;/strong&gt; Continuously monitoring performance metrics is critical for early detection of issues like abnormal origin load or disk fullness. This is even more important for someone like me managing their own server. This situation reminded me once again of the "preflight resource guard" principle in pipeline reliability. It's always better to check a resource before loading it than to put out fires later.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Understand Trade-offs:&lt;/strong&gt; Caching isn't always good. Bypassing might be necessary for dynamic content. However, if we don't want static content to behave like dynamic content, managing this trade-off with an intermediate layer like Nginx is necessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Cloudflare's caching mechanism is a powerful tool, but managing bypass rules and &lt;code&gt;Cache-Control&lt;/code&gt; headers from the origin server correctly is vital for performance and operational stability. My experience demonstrated that even static content can overload the origin due to misconfigurations. By combining Cloudflare Cache Rules with Nginx's &lt;code&gt;proxy_hide_header&lt;/code&gt; and &lt;code&gt;add_header&lt;/code&gt; directives, I overcame this blind spot and ensured my blog ran much more stably.&lt;/p&gt;

&lt;p&gt;The lessons learned from this process carry important insights for those who, like me, manage their own servers and pay attention to every millisecond of performance. Do you have a similarly frustrating caching story? Or have you used different approaches to solve these problems? I'd love to hear about it in the comments.&lt;/p&gt;

</description>
      <category>cloudflare</category>
      <category>caching</category>
      <category>nginx</category>
      <category>performance</category>
    </item>
    <item>
      <title>My CI Runner Was Killed by My Own Script: The Dark Side of Cleanup</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sat, 09 May 2026 14:35:40 +0000</pubDate>
      <link>https://forem.com/merbayerp/my-ci-runner-was-killed-by-my-own-script-the-dark-side-of-cleanup-6o0</link>
      <guid>https://forem.com/merbayerp/my-ci-runner-was-killed-by-my-own-script-the-dark-side-of-cleanup-6o0</guid>
      <description>&lt;p&gt;Towards the end of last month, I started a build job on my self-hosted GitHub Actions runner. It was a job that normally took 10-15 minutes, but this time it just wouldn't finish. The job seemed stuck, and I wasn't getting any response from the runner. When I tried to connect to the server via SSH, the connection was refused. It felt similar to the OOM scenarios I'd experienced on my VPS where &lt;code&gt;sshd&lt;/code&gt; couldn't &lt;code&gt;accept&lt;/code&gt; connections, but this time my RAM usage was normal.&lt;/p&gt;

&lt;p&gt;After some digging, I realized my runner's heart had stopped beating. Looking at the GitHub Actions panel, I saw the runner was "Offline". The interesting thing was that the server itself was up, and my other Docker containers were running without issues. The only problem was my CI runner.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting to the Root of the Problem: A Cleanup Script Murder
&lt;/h2&gt;

&lt;p&gt;To understand why the runner had died, I connected to the server via console. My first task was to check the &lt;code&gt;dmesg&lt;/code&gt; output. There was nothing surprising there; no kernel-level error or OOM killer trigger was visible. When I checked the service status with the &lt;code&gt;systemctl status github-runner&lt;/code&gt; command, I encountered an even more interesting situation: the service was &lt;code&gt;active (exited)&lt;/code&gt;, and there were no error messages in the logs. It was as if someone had gracefully shut down the service.&lt;/p&gt;

&lt;p&gt;It was at this exact moment that the "innocent" cleanup script I'd added last week came to mind. I manage over 13 Docker containers on my own VPS, and disk space can sometimes become critical. Docker's &lt;code&gt;build cache&lt;/code&gt; alone once reached 33 GB, with another 23 GB of unused images, enough to fill my disk to 100%. Because of this, I had written a script to clean up old build outputs and unnecessary files in the &lt;code&gt;_work&lt;/code&gt; directory.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Chaos of My Own Making&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This kind of automation can be a lifesaver, yes. But if it's not tested sufficiently or if scenarios aren't well thought out, shooting yourself in the foot becomes inevitable. My scenario was exactly that.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Killer Script and the Victim Runner
&lt;/h3&gt;

&lt;p&gt;The script I wrote simply deleted files older than a certain age in the &lt;code&gt;_work&lt;/code&gt; directory. However, I had overlooked a small detail: the runner itself also operated within the &lt;code&gt;_work&lt;/code&gt; directory, and temporary directories like &lt;code&gt;_temp&lt;/code&gt;, or even in some cases the runner's own binaries or configuration files, could fall within this scope. I had previously experienced the pain of deleting directories inside &lt;code&gt;_work/_temp&lt;/code&gt; on a GitHub Actions runner, but this time I had gone even further.&lt;/p&gt;

&lt;p&gt;I hadn't used &lt;code&gt;find&lt;/code&gt; options like &lt;code&gt;-maxdepth&lt;/code&gt; or &lt;code&gt;-prune&lt;/code&gt; carefully enough in the script. While my goal was only build artifacts, the script had deleted files vital for the runner itself. The result: the runner service quietly shut down when it could no longer access the files it needed to keep operating. This was a resource management disaster, similar to my Astro build consuming 2.5 GB of RAM and hitting OOM, but this time it was disk and file system related.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# A snippet from the faulty cleanup script (simplified version)&lt;/span&gt;
&lt;span class="c"&gt;# This command was deleting all files older than 7 days under the _work directory.&lt;/span&gt;
&lt;span class="c"&gt;# However, the runner's own working files were also included in this scope.&lt;/span&gt;
find /home/runner/_work/ &lt;span class="nt"&gt;-type&lt;/span&gt; f &lt;span class="nt"&gt;-mtime&lt;/span&gt; +7 &lt;span class="nt"&gt;-delete&lt;/span&gt;
find /home/runner/_work/ &lt;span class="nt"&gt;-type&lt;/span&gt; d &lt;span class="nt"&gt;-empty&lt;/span&gt; &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command, `/home&lt;/p&gt;
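&lt;p&gt;What the script should have done is prune the runner's private directories before matching anything. The sketch below reproduces the fix on a throwaway directory tree so the logic can be exercised safely; in production &lt;code&gt;WORK_DIR&lt;/code&gt; would be &lt;code&gt;/home/runner/_work&lt;/code&gt;, and I would dry-run with &lt;code&gt;-print&lt;/code&gt; before ever switching back to &lt;code&gt;-delete&lt;/code&gt;:&lt;/p&gt;

```shell
# Simulate the runner's _work layout in a throwaway tree.
WORK_DIR=$(mktemp -d)
mkdir -p "$WORK_DIR/_temp" "$WORK_DIR/myrepo/build"
touch "$WORK_DIR/_temp/runner-state" "$WORK_DIR/myrepo/build/old.log"
# Age both files past the 7-day threshold (GNU touch).
touch -d '10 days ago' "$WORK_DIR/_temp/runner-state" "$WORK_DIR/myrepo/build/old.log"

# Prune the runner's own directories FIRST, then match stale files.
# Dry run with -print; only swap in -delete once the list looks right.
stale=$(find "$WORK_DIR" \
  \( -path "$WORK_DIR/_temp" -o -path "$WORK_DIR/_actions" -o -path "$WORK_DIR/_tool" \) -prune \
  -o -type f -mtime +7 -print)
printf '%s\n' "$stale"   # lists only myrepo/build/old.log, never _temp
```

&lt;p&gt;The &lt;code&gt;-prune&lt;/code&gt; branch stops &lt;code&gt;find&lt;/code&gt; from ever descending into the runner's own state, which is exactly the guard my original one-liner lacked.&lt;/p&gt;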

</description>
      <category>cicd</category>
      <category>githubactions</category>
      <category>cleanup</category>
      <category>selfinflicted</category>
    </item>
    <item>
      <title>System Architecture is a Bit About Paranoia</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sat, 09 May 2026 11:50:07 +0000</pubDate>
      <link>https://forem.com/merbayerp/system-architecture-is-a-bit-about-paranoia-299l</link>
      <guid>https://forem.com/merbayerp/system-architecture-is-a-bit-about-paranoia-299l</guid>
      <description>&lt;p&gt;Recently, a series of &lt;code&gt;OOM-killed&lt;/code&gt; errors in the AI generation pipeline running on my own VPS took me back to the old days. I once again saw how a &lt;code&gt;sleep 360&lt;/code&gt; command could wreak havoc on a system and the cost of a simple mistake. This situation made me realize that system architecture is, in fact, a bit about "paranoya."&lt;/p&gt;

&lt;p&gt;For me, this state of "paranoia" is like a way of life built on anticipating worst-case scenarios, accepting that anything can go wrong, and taking precautions accordingly. While it might sound a bit negative, my 20 years of experience in the field have repeatedly proven why this approach is indispensable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Roots of Paranoia: Past Scars
&lt;/h2&gt;

&lt;p&gt;This "paranoid" mindset isn't an empty delusion; it's a result of bitter experiences I've lived through and learned from. Over the years, I've seen many systems crash unexpectedly, slow down, or become completely inaccessible. These incidents formed the foundation of my architectural approach.&lt;/p&gt;

&lt;p&gt;I remember once, at a major Turkish e-commerce site, the database server completely locked up in the middle of a critical campaign. I'll never forget the helplessness and panic of that moment. I've experienced similar, though smaller-scale, crises on my own VPS; for example, my disk filling up to 100% on April 28th. Such events taught me how crucial it is to ask, "what if?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Incidents on My Own VPS
&lt;/h3&gt;

&lt;p&gt;I manage over 13 Docker containers on my own server. Sometimes, when even one starts behaving unexpectedly, I see a domino effect, with others also getting swapped out. These scenarios show how each part of the system interacts with one another and how the weakest link can affect the entire system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ VPS Overload and OOM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the most classic scenarios I've experienced on my own VPS is out-of-memory (OOM). Sometimes I've encountered situations like &lt;code&gt;kcompactd&lt;/code&gt; using 92% CPU, or &lt;code&gt;sshd&lt;/code&gt; being unable to accept new connections. This always reminds me how critical it is to monitor resources and know the limits.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once, I noticed that Docker's build cache had reached 33 GB, and unused images were taking up 23 GB. My server disk was 100% full, and I couldn't even SSH in. This situation painfully taught me that even a simple full disk could cripple the entire operation. Since that day, I regularly run the &lt;code&gt;docker system prune -a&lt;/code&gt; command.&lt;/p&gt;
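&lt;p&gt;To avoid being surprised again, a check like the following now runs on a schedule. It is a sketch; the 85% threshold and the message are my own choices, and on a real system the echo would be a notification instead:&lt;/p&gt;

```shell
# Sketch of a scheduled disk check: warn when the root filesystem
# crosses a chosen threshold (85% here) before it hits 100%.
usage=$(df -P / | awk 'NR==2 { gsub(/%/, "", $5); print $5 }')
if [ "$usage" -ge 85 ]; then
  echo "disk usage at ${usage}%: time to run 'docker system prune -a'"
else
  echo "disk usage at ${usage}%: OK"
fi
```

&lt;p&gt;Catching the climb at 85% leaves enough headroom to SSH in and prune calmly, instead of being locked out at 100%.&lt;/p&gt;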

&lt;h2&gt;
  
  
  Knowing That Everything Can Break
&lt;/h2&gt;

&lt;p&gt;As a system architect, my most fundamental principle is to accept the fact that everything, absolutely everything, will break one day. This could be a hardware failure, a software bug, or a simple human error. What's important is knowing that these breakdowns will happen and how we build a resilience mechanism against them.&lt;/p&gt;

&lt;p&gt;I once experienced &lt;code&gt;state corruption&lt;/code&gt; in my GitHub Actions runner due to &lt;code&gt;_work/_temp&lt;/code&gt; directories. The pain of deleting those directories and having to rebuild the entire pipeline showed me how fragile even automation systems can be. Such incidents explain why redundancy and fast recovery mechanisms are so valuable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience and Fault Tolerance
&lt;/h3&gt;

&lt;p&gt;This "paranoid" perspective drives me to focus more on the concepts of &lt;code&gt;resilience&lt;/code&gt; and &lt;code&gt;fault tolerance&lt;/code&gt; when designing systems. Planning how the system will remain operational if a component fails is one of the most critical steps in architecture.&lt;/p&gt;

&lt;p&gt;For example, this blog's Astro build process sometimes consumes 2.5 GB of RAM, pushing the total system RAM to 7.6 GB and resulting in an &lt;code&gt;OOM&lt;/code&gt;. In such a situation, I add a &lt;code&gt;preflight resource guard&lt;/code&gt; to the pipeline to check resources first. If resources are insufficient, I defer the operation and switch to &lt;code&gt;polling-wait&lt;/code&gt; mode. Last month, when I typed &lt;code&gt;sleep 360&lt;/code&gt; and got &lt;code&gt;OOM-killed&lt;/code&gt;, I had to activate this &lt;code&gt;polling-wait&lt;/code&gt; mechanism.&lt;/p&gt;
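&lt;p&gt;The preflight resource guard itself is nothing exotic. A minimal sketch (the 2500 MB threshold mirrors the build's observed 2.5 GB peak; the messages and variable names are mine) checks &lt;code&gt;MemAvailable&lt;/code&gt; before starting anything expensive:&lt;/p&gt;

```shell
# Sketch of a preflight resource guard (thresholds are illustrative).
# The Astro build peaks around 2.5 GB, so require that much headroom.
need_mb=2500

avail_mb=$(awk '/^MemAvailable:/ { print int($2/1024) }' /proc/meminfo)
if [ "$avail_mb" -ge "$need_mb" ]; then
  echo "preflight ok: ${avail_mb} MB available, starting build"
else
  # polling-wait instead of a blind `sleep 360`
  echo "preflight deferred: only ${avail_mb} MB available, polling until freed"
fi
```

&lt;p&gt;In the deferred branch, the real pipeline loops on this same check at short intervals rather than sleeping for a fixed six minutes and hoping.&lt;/p&gt;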

&lt;h3&gt;
  
  
  Cloudflare Cache Strategies
&lt;/h3&gt;

&lt;p&gt;Even when using Cloudflare, this "paranoia" comes into play. Astro's default &lt;code&gt;max-age=0&lt;/code&gt; wasn't giving me the performance I wanted for static content, so I implemented an &lt;code&gt;override&lt;/code&gt; in Nginx that defines longer &lt;code&gt;cache&lt;/code&gt; durations for specific paths. This is a &lt;code&gt;trade-off&lt;/code&gt; between content freshness and performance, and the point is to manage it consciously.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/_astro/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;# "expires 1y" alongside add_header would emit a duplicate Cache-Control&lt;/span&gt;
    &lt;span class="c"&gt;# header, so set the header explicitly just once&lt;/span&gt;
    &lt;span class="kn"&gt;add_header&lt;/span&gt; &lt;span class="s"&gt;Cache-Control&lt;/span&gt; &lt;span class="s"&gt;"public,&lt;/span&gt; &lt;span class="s"&gt;max-age=31536000,&lt;/span&gt; &lt;span class="s"&gt;immutable"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:4321&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, I set a one-year &lt;code&gt;cache&lt;/code&gt; lifetime for the static assets served under &lt;code&gt;/_astro/&lt;/code&gt;. This reduces unnecessary origin hits at the &lt;code&gt;CDN&lt;/code&gt; layer, improving performance and lightening the load on my server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security and Constant Vigilance
&lt;/h2&gt;

&lt;p&gt;One of the most prominent areas where "paranoia" is evident in system architecture is security. I always assume that attackers will try to find the weakest point in the system. Therefore, I try to take precautions not only against known vulnerabilities but also against potential risks.&lt;/p&gt;

&lt;p&gt;In recent years, I've seen many times how critical &lt;code&gt;CVE&lt;/code&gt;s can be. Even on my own system, I've tracked potential risks like &lt;code&gt;CVE-2026-31431&lt;/code&gt; and tried to close a possible vulnerability by &lt;code&gt;blacklisting&lt;/code&gt; kernel modules like &lt;code&gt;algif_aead&lt;/code&gt;. Such proactive steps strengthen the system's overall security posture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Proactive Security Measures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Steps like blacklisting kernel modules or disabling unnecessary services, while often seeming like minor details, can prevent major security incidents. Remember, the best security measure is to prevent an incident from happening.&lt;/p&gt;
&lt;/blockquote&gt;
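&lt;p&gt;For the curious, the module blacklist itself is just a two-line &lt;code&gt;modprobe.d&lt;/code&gt; fragment; the file name below is my convention, the directives are standard:&lt;/p&gt;

```
# /etc/modprobe.d/blacklist-algif_aead.conf
# "blacklist" stops automatic loading via alias resolution;
# "install ... /bin/false" also blocks explicit modprobe attempts.
blacklist algif_aead
install algif_aead /bin/false
```

&lt;p&gt;After dropping the file in place, &lt;code&gt;modprobe -r algif_aead&lt;/code&gt; unloads the module if it's already resident.&lt;/p&gt;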

&lt;h3&gt;
  
  
  Runner Economics on My Own VPS
&lt;/h3&gt;

&lt;p&gt;To avoid exceeding my GitHub Actions quota, I use a &lt;code&gt;self-hosted runner&lt;/code&gt; on my own VPS. While this provides a cost advantage, it also requires me to constantly track updates and security patches. This lets me treat "paranoia" as both a cost-optimization and a risk-management tool.&lt;/p&gt;

&lt;p&gt;Even when setting up my AI-powered content pipeline, I noticed errors occurring when slashes (&lt;code&gt;/&lt;/code&gt;) were used in &lt;code&gt;tags&lt;/code&gt; or when the &lt;code&gt;publishDate&lt;/code&gt; field wasn't a quoted &lt;code&gt;string&lt;/code&gt;. Even Turkish-specific details like the dotted/dotless &lt;code&gt;i&lt;/code&gt; casing problem show that unexpected issues can arise at every layer of the system. These kinds of "quirks" are part of my constant vigilance and attention to detail.&lt;/p&gt;
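&lt;p&gt;Those two frontmatter quirks are cheap to catch before they ever reach the pipeline. A rough sketch of the kind of pre-publish check I mean (the function and field layout are illustrative, not my actual pipeline code):&lt;/p&gt;

```shell
# Validate article frontmatter before it enters the pipeline.
# Catches the two failure modes mentioned above: slashes in tags
# and an unquoted publishDate.
check_frontmatter() {
  local file="$1"
  local ok=0

  # 1. Tag list entries must not contain slashes (they break URL routing)
  if grep -A 10 -E '^tags:' "$file" | grep -qE '^[[:space:]]*-[[:space:]]*.*/'; then
    echo "ERROR: slash found in a tag in $file"
    ok=1
  fi

  # 2. publishDate must be present and double-quoted,
  #    not a bare YAML timestamp
  if ! grep -qE '^publishDate:[[:space:]]*"' "$file"; then
    echo "ERROR: publishDate missing or not a quoted string in $file"
    ok=1
  fi

  return $ok
}
```

&lt;p&gt;A check like this sits at the front of the pipeline and fails fast, instead of letting a bad article break the build minutes later.&lt;/p&gt;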

&lt;h2&gt;
  
  
  Paranoia or Professionalism?
&lt;/h2&gt;

&lt;p&gt;So, what exactly is this "paranoia" in system architecture? For me, it's not about being constantly anxious or expecting a disaster at any moment. Rather, it's about understanding the inherent complexity and fragility of systems and taking conscious steps to minimize these risks. We could call this &lt;code&gt;risk-aware design&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is a kind of engineering perspective. Just as an engineer building a bridge designs with scenarios like earthquakes, storms, or excessive loads in mind, we also strive to make our systems resilient against potential failures. I'm not ashamed of my own mistakes; on the contrary, last month when I typed &lt;code&gt;sleep 360&lt;/code&gt; and got &lt;code&gt;OOM-killed&lt;/code&gt;, I learned from that error and developed the &lt;code&gt;polling-wait&lt;/code&gt; mechanism. This is about asking, "what can I do to prevent this from happening again?" instead of just shrugging it off as "it happens."&lt;/p&gt;

&lt;p&gt;For me, system architecture is a bit like this: being constantly vigilant and designing with the knowledge that everything can go wrong. Without this "paranoia," none of my side projects like hesapciyiz.com, spamkalkani.com, or islistesi.com would run so stably. Have you ever had such "paranoid" moments? I'd love to hear about them in the comments.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>paranoia</category>
      <category>operations</category>
      <category>psychology</category>
    </item>
    <item>
      <title>That Meaningless Stress After a Deploy</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sat, 09 May 2026 09:20:12 +0000</pubDate>
      <link>https://forem.com/merbayerp/that-meaningless-stress-after-a-deploy-32o</link>
      <guid>https://forem.com/merbayerp/that-meaningless-stress-after-a-deploy-32o</guid>
      <description>&lt;p&gt;Recently, I deployed a small CSS change for this blog. Normally, it's a simple tweak, just shifting a few pixels, but after hitting &lt;code&gt;git push&lt;/code&gt;, that inexplicable tension settled over me again. It was as if I'd deployed a critical banking system; the question of "what if" started swirling somewhere inside.&lt;/p&gt;

&lt;p&gt;This feeling is familiar to me; I've experienced it after every deploy for 20 years. Automatically, my hand reaches for the &lt;code&gt;tail -f /var/log/nginx/access.log&lt;/code&gt; command, and I open the Cloudflare dashboard in my browser to check cache hit ratios and error logs. Even if everything appears fine, I remain vigilant for a while longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Symptoms of That "What If" Feeling
&lt;/h2&gt;

&lt;p&gt;This tension that arises after a deploy is a situation many of us are familiar with. Sometimes it manifests as a minor twitch, other times as a mild paranoia lasting hours. There are even times when I wake up in the middle of the night with an urge to check, wondering, "Did I forget something?"&lt;/p&gt;

&lt;p&gt;I don't just experience this process with large projects. Even on my own VPS, in an environment where I manage over 13 Docker containers, I feel this way after a simple configuration change. Despite everything being automated, one still thinks, "What if, just maybe?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Past Painful Experiences and Triggers
&lt;/h3&gt;

&lt;p&gt;At the root of this "what if" feeling are, I believe, the painful experiences we've had in the past. Those moments are deeply etched in our brains and are triggered with every deploy. For me, some of these triggers are very clear.&lt;/p&gt;

&lt;p&gt;On my own VPS, I experienced this feeling most intensely on April 28th. I had deployed a new container, and the next morning the pipeline-health monitor sent a "DEGRADED" email. I found &lt;code&gt;kcompactd&lt;/code&gt; pinned at 92% CPU, with the system so choked it couldn't even accept SSH connections. The helplessness of that moment, and the hours of debugging that followed, explain where this tension comes from.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ Docker Disk Fire&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once, again on my own VPS, I experienced a Docker disk fire. The disk filled to 100% due to 33 GB of build cache and 23 GB of unused images. All my applications went down instantly, requiring urgent intervention. Such incidents are among the most significant reasons that reinforce that 'what if' feeling after a deploy.&lt;/p&gt;
&lt;/blockquote&gt;
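&lt;p&gt;The lesson from that disk fire was to prune on a schedule instead of after the fact. A sketch of the cleanup I have in mind, wrapped as a function you could call from cron (the 168-hour window is my own choice; the flags are standard Docker CLI):&lt;/p&gt;

```shell
# Weekly Docker cleanup sketch: reclaim build cache and unused images
# before they fill the disk. "until=168h" keeps anything used in the
# last 7 days; tune the window to taste.
docker_cleanup() {
  # Build cache: this was the 33 GB culprit in my incident
  docker builder prune --force --filter "until=168h"

  # Unused images: the 23 GB culprit
  docker image prune --all --force --filter "until=168h"

  # Stopped containers and orphaned networks
  docker container prune --force --filter "until=168h"
  docker network prune --force

  # Show what is still using space
  docker system df
}
```

&lt;p&gt;Running something like this weekly costs seconds; recovering a server whose disk hit 100% costs an evening.&lt;/p&gt;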

&lt;p&gt;There were also times when my Astro build consumed 2.5 GB of RAM, pushing the system's 7.6 GB RAM to its limits and causing an OOM (Out Of Memory) error. Or the pain of deleting directories inside &lt;code&gt;_work/_temp&lt;/code&gt; on a GitHub Actions runner... All these scenarios have repeatedly shown me that a system can react unexpectedly. That's why, no matter how prepared I am, that meaningless stress lingers with me for a while.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Balance of Risk and Control
&lt;/h2&gt;

&lt;p&gt;This situation is essentially a reflection of risk management and the need for control. Not knowing how the systems we build and operate will behave once they go live is what produces this stress. Even with robust tests, automation, and monitoring tools, the "production" environment always holds its own surprises.&lt;/p&gt;

&lt;p&gt;Striking this balance—finding a way between the desire for fast deploys and the goal of risk-free deploys—is often challenging. Sometimes we compromise on certain controls to gain speed, and we might pay the price later. On the other hand, trying to perfect everything also slows down the process.&lt;/p&gt;

&lt;h3&gt;
  
  
  My Coping Mechanisms
&lt;/h3&gt;

&lt;p&gt;Over the years, I've developed my own methods to cope with this stress. While it hasn't completely disappeared, I've managed to reduce its impact. Automation and comprehensive monitoring are at the forefront of these methods.&lt;/p&gt;

&lt;p&gt;I've set up automatic deploy processes with GitHub Actions. Every change is automatically pushed to production after passing tests. With Prometheus and Grafana, I monitor every corner of the system, and with Alertmanager, I receive instant notifications for anomalies. For pipeline reliability, I've specifically implemented preflight resource guards; these check if system resources are sufficient before a deploy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Small and Frequent Deploys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of large, monolithic deploys, I prefer small, atomic changes. This narrows the scope of a potential problem and makes rolling back much easier. When an issue arises, it becomes much simpler to pinpoint what changed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Rollback mechanisms are vitally important to me. When a deploy is found to be problematic, I need to be able to revert to the previous stable version with a single command. This sense of security somewhat alleviates that initial moment of stress. Furthermore, I'm not ashamed to make mistakes. Last month, when I wrote &lt;code&gt;sleep 360&lt;/code&gt; and got OOM-killed, I told myself, "this too was a lesson," and switched to a polling-wait mechanism. Learning from my self-created problems helps me be more careful in the next deploy.&lt;/p&gt;
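&lt;p&gt;For me that single-command rollback boils down to retagging the last known-good image and recreating the container. A sketch, with illustrative registry and service names (not my actual setup):&lt;/p&gt;

```shell
# One-command rollback sketch: point the "live" tag back at a
# known-good image and recreate the container from it.
# Registry path and service name are illustrative.
rollback() {
  local service="$1" good_tag="$2"

  # Retag the known-good image as the one the compose file runs
  docker tag "registry.example.com/${service}:${good_tag}" \
             "registry.example.com/${service}:live" || return 1

  # Recreate the container from the retagged image
  docker compose up -d --force-recreate "${service}" || return 1

  echo "rolled back ${service} to ${good_tag}"
}

# Usage: rollback blog 2026-05-08
```

&lt;p&gt;Knowing that command exists, and having rehearsed it, is what takes the edge off the first minutes after a deploy.&lt;/p&gt;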

&lt;h2&gt;
  
  
  The "It Happens" Philosophy and Acceptance
&lt;/h2&gt;

&lt;p&gt;Ultimately, a certain amount of risk and uncertainty is inherent in this line of work. There's no such thing as a perfect system; there can always be a vulnerability, a bug, or an unexpected interaction. Accepting this truth, embracing the "it happens" philosophy, reduces the pressure on me.&lt;/p&gt;

&lt;p&gt;Of course, this is not a state of complacency. On the contrary, it constantly pushes me to build better, more resilient, and more secure systems. There are times when I implement kernel module blacklists (like &lt;code&gt;algif_aead&lt;/code&gt; for CVE-2026-31431) as part of CVE mitigation; this is also part of the job. I learn from every mistake, every problem, and enter the next deploy better prepared.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ Self-Hosted Runner Economics&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To avoid exceeding GitHub Actions quotas, I use a self-hosted runner on my own VPS. This both reduces costs and gives me more control. However, it also brings its own operational overhead. Every decision has a trade-off.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This constant state of vigilance has, I suppose, become a part of my profession. Perhaps this situation is a source of motivation that drives us to build better systems. It's not about striving for perfection, but a continuous cycle of improvement and learning.&lt;/p&gt;

&lt;p&gt;Do you also have similar experiences after a deploy? How do you cope with that "what if" feeling? What's the first thing you do after a deploy? I'd love if you could share in the comments; perhaps we can learn from each other.&lt;/p&gt;

</description>
      <category>deploy</category>
      <category>stress</category>
      <category>psychology</category>
      <category>production</category>
    </item>
    <item>
      <title>The Hidden Dependency Hell of Cloud-Based Microservices</title>
      <dc:creator>Mustafa ERBAY</dc:creator>
      <pubDate>Sat, 09 May 2026 09:13:21 +0000</pubDate>
      <link>https://forem.com/merbayerp/the-hidden-dependency-hell-of-cloud-based-microservices-5g8m</link>
      <guid>https://forem.com/merbayerp/the-hidden-dependency-hell-of-cloud-based-microservices-5g8m</guid>
      <description>&lt;h2&gt;
  
  
  The Hidden Dependency Hell of Cloud-Based Microservices: A Guide to the Way Out
&lt;/h2&gt;

&lt;p&gt;Cloud-based microservice architectures have become an indispensable part of modern software development. While they offer flexibility, scalability, and rapid development, the complexity they bring cannot be overlooked either. One of the most insidious and exhausting forms of that complexity is what we might call "hidden dependency hell."&lt;/p&gt;

&lt;p&gt;This situation arises when different services in the system develop invisible, vague, and hard-to-manage dependencies on each other. Although these dependencies aren't noticed at first, over time they erode the stability of the system, make debugging impossible, and turn adding new features into something close to torture.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sources of Hidden Dependencies in Microservices
&lt;/h3&gt;

&lt;p&gt;Hidden dependencies can find their way into microservice architectures for various reasons. Understanding these reasons is the first step to getting to the root of the problem and producing solutions. Typically, the pressure for rapid development, lack of documentation, or sloppy choice of communication protocols between services lead to these kinds of issues.&lt;/p&gt;

&lt;p&gt;Another important source is services becoming indirectly dependent on the inner workings of one another. For instance, one service might expect another service to return a specific piece of data in a specific format. If that format changes, an unexpected chain of errors can follow.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;ℹ️ What Is a Dependency?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the context of software development, a dependency is when a component (a service, a library, etc.) needs another component in order to function. This need can be a direct call, or it can occur through an indirect data flow or a shared resource.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Symptoms and Effects of Hidden Dependencies
&lt;/h3&gt;

&lt;p&gt;The presence of hidden dependencies usually shows itself when sudden and unexplained errors appear in the system. A small change in one service can cause big problems in unexpected places. This situation leads to a serious loss of motivation and a drop in productivity for the development teams.&lt;/p&gt;

&lt;p&gt;Such dependencies also degrade the overall stability of the system. Debugging turns into a labyrinth because it's hard to tell which service actually caused the issue. New deployments end up being made under the constant worry of when the next big problem will erupt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Debugging Difficulty:&lt;/strong&gt; You have to inspect more than one service to find the source of the problem.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Slow Development Cycles:&lt;/strong&gt; It's hard to predict the possible effects of making a change.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Low System Stability:&lt;/strong&gt; Unexpected errors and crashes happen more often.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Increasing Operational Costs:&lt;/strong&gt; More resources are needed for troubleshooting and maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Do You Get Out of This Hell? Paths to a Solution
&lt;/h3&gt;

&lt;p&gt;To escape hidden dependency hell, you have to take a proactive approach. This requires being careful in many areas, from architectural decisions all the way to daily development practices. One of the most effective methods is to make communication between services explicit and standardized.&lt;/p&gt;

&lt;p&gt;Using an API Gateway creates a centralized point of control by preventing services from communicating directly with each other. Thanks to this, dependencies between services become more visible and easier to manage. Additionally, event-driven architectures can prevent these kinds of problems by encouraging loose coupling between services.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 The Role of the API Gateway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The API Gateway is a mediator that receives all client requests and routes them to the relevant services. Thanks to this, clients don't need to know how the services are laid out internally, and dependencies between services become easier to manage.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Communication Patterns and Standardization
&lt;/h4&gt;

&lt;p&gt;How communication between microservices is done plays a critical role in managing dependencies. Using standardized communication protocols such as RESTful APIs and gRPC allows services to understand each other more easily. These standards help dependencies become more explicit and predictable.&lt;/p&gt;

&lt;p&gt;It's also important to clearly define the data formats and messaging schemas the services use. These definitions should be documented and versioned. That way, when one service changes a data format, the other services can adapt to the change or be made aware of it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Traceability and Observability
&lt;/h4&gt;

&lt;p&gt;Being able to monitor the behavior of all services in the system and the interactions among them is one of the most effective ways to detect hidden dependencies. &lt;code&gt;Observability&lt;/code&gt; tools such as centralized logging, distributed tracing, and metrics collection provide valuable information about the overall health of the system.&lt;/p&gt;

&lt;p&gt;Thanks to these tools, you can track how a request travels across multiple services, easily understand which service is causing latency or where the error started. This lets you quickly diagnose problems caused by hidden dependencies.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;⚠️ The Importance of Observability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability is a property that lets you understand the internal state of the system by observing it from outside. In microservice architectures, this property is vital for proactively detecting and solving problems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Other Methods of Managing Dependencies
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Dependency Injection:&lt;/strong&gt; Providing the dependencies a service needs from the outside makes services more independent.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Circuit Breaker Pattern:&lt;/strong&gt; When a service repeatedly fails, it prevents the system from crashing by blocking calls from other services to that service.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Service Discovery:&lt;/strong&gt; Lets services find each other dynamically, which helps reduce static dependencies.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Regular Refactoring:&lt;/strong&gt; It's important to regularly review the architecture and make improvements aimed at reducing dependencies.&lt;/li&gt;
&lt;/ul&gt;
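&lt;p&gt;To make the circuit breaker concrete, here is a toy single-process sketch; the names and thresholds are mine, and a real implementation would live in a client library or service mesh, not in shell:&lt;/p&gt;

```shell
# Toy circuit breaker around an HTTP call. After CB_MAX_FAILS consecutive
# failures the circuit "opens" and calls are skipped until CB_COOLDOWN
# seconds pass. State lives in shell variables: a single-process sketch,
# not production code.
CB_FAILS=0
CB_OPEN_UNTIL=0
CB_MAX_FAILS=3
CB_COOLDOWN=30

cb_call() {
  local url="$1" now
  now=$(date +%s)

  # Open circuit: skip the call entirely instead of piling on a sick service
  if [ "$now" -lt "$CB_OPEN_UNTIL" ]; then
    echo "circuit open, skipping call to $url"
    return 1
  fi

  if curl -fsS --max-time 2 "$url" >/dev/null 2>/dev/null; then
    CB_FAILS=0            # success closes the circuit again
    return 0
  fi

  CB_FAILS=$((CB_FAILS + 1))
  if [ "$CB_FAILS" -ge "$CB_MAX_FAILS" ]; then
    CB_OPEN_UNTIL=$((now + CB_COOLDOWN))
    echo "circuit opened for ${CB_COOLDOWN}s after ${CB_FAILS} failures"
  fi
  return 1
}
```

&lt;p&gt;The design choice is the same regardless of language: a failing downstream service gets breathing room instead of an avalanche of retries.&lt;/p&gt;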

&lt;h3&gt;
  
  
  Conclusion: Continuous Effort for Healthier Microservices
&lt;/h3&gt;

&lt;p&gt;The flexibility and speed brought by cloud-based microservices, when not managed properly, can lead to growing complexity and to problems like "hidden dependency hell." Escaping this hell isn't possible with a one-time fix; it's possible only through continuous effort.&lt;/p&gt;

&lt;p&gt;Making your architectural decisions carefully, standardizing inter-service communication, using &lt;code&gt;observability&lt;/code&gt; tools effectively, and regularly reviewing your system will help you reach healthier, more stable, and more manageable microservice architectures. Remember, a well-designed microservice architecture forms the foundation of your future growth and innovation.&lt;/p&gt;

</description>
      <category>life</category>
      <category>microservices</category>
      <category>cloud</category>
      <category>dependency</category>
    </item>
  </channel>
</rss>
