<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Yuto Takashi</title>
    <description>The latest articles on Forem by Yuto Takashi (@tielec-takashi).</description>
    <link>https://forem.com/tielec-takashi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3683605%2Fe1040807-5532-4992-8f53-f3cd04762229.jpg</url>
      <title>Forem: Yuto Takashi</title>
      <link>https://forem.com/tielec-takashi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tielec-takashi"/>
    <language>en</language>
    <item>
      <title>Why SRE Investment Gets Undervalued (And How to Fix It)</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sat, 14 Feb 2026 09:01:21 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/why-sre-investment-gets-undervalued-and-how-to-fix-it-507c</link>
      <guid>https://forem.com/tielec-takashi/why-sre-investment-gets-undervalued-and-how-to-fix-it-507c</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you've ever heard these questions from management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The system is working fine, why do we need more SRE budget?"&lt;/li&gt;
&lt;li&gt;"Can't we just respond to incidents when they happen?"&lt;/li&gt;
&lt;li&gt;"Why not just hire more developers to speed up development?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're not alone. SRE and Platform Engineering investments are often undervalued, and there's a structural reason for it.&lt;/p&gt;

&lt;p&gt;This article explores why this happens and provides three concrete frameworks to justify infrastructure investment to non-technical executives.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Police Department Analogy
&lt;/h2&gt;

&lt;p&gt;Here's an interesting parallel: &lt;strong&gt;SRE work is similar to police, fire departments, and disaster response teams.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They protect society using public funds&lt;/li&gt;
&lt;li&gt;When nothing goes wrong, they train and prepare&lt;/li&gt;
&lt;li&gt;But people often say "what are they even doing?" and cut their budgets&lt;/li&gt;
&lt;li&gt;When problems occur, everyone asks "why didn't you prevent this?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The value of prevention is invisible.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ship a new feature → "We contributed to revenue!" (highly visible)&lt;/li&gt;
&lt;li&gt;System runs 24/7 without issues → "That's expected" (taken for granted)&lt;/li&gt;
&lt;li&gt;Prevented a major outage → &lt;strong&gt;Never happened, so nobody notices&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Same with police and fire departments. Low crime rates and no fires are actually the result of their work, but it looks like "they're doing nothing."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Negative Spiral
&lt;/h2&gt;

&lt;p&gt;Here's what's scary: this structure creates a &lt;strong&gt;negative spiral&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Budget cuts → Staff shortage → More incidents → "SRE is incompetent" → Further budget cuts&lt;/p&gt;

&lt;p&gt;This happens with police too: "Crime is rising, what are they doing?" → Budget cuts → Less patrol → More crime...&lt;/p&gt;

&lt;p&gt;The cycle continues like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Budget gets cut&lt;/li&gt;
&lt;li&gt;Fewer people, fewer tools&lt;/li&gt;
&lt;li&gt;Incidents increase&lt;/li&gt;
&lt;li&gt;"Why is SRE failing?"&lt;/li&gt;
&lt;li&gt;Even more budget cuts&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Chicken-and-Egg Problem
&lt;/h2&gt;

&lt;p&gt;There's an even trickier issue: &lt;strong&gt;No budget until problems occur&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Reactive funding&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Major outage happens → Emergency budget approved → Fix implemented → System stable → "We're good now, right?" → Budget cut&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: Prevention isn't valued&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Our database will hit limits in 6 months"&lt;/li&gt;
&lt;li&gt;"But it's working now, do we really need this?"&lt;/li&gt;
&lt;li&gt;(6 months later: outage)&lt;/li&gt;
&lt;li&gt;"Why didn't you predict this?!"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly like earthquake-proofing budgets. Before the earthquake: "waste of money." After: "why didn't we do this?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Platform Engineering: A New Approach
&lt;/h2&gt;

&lt;p&gt;In the past 2-3 years, &lt;strong&gt;Platform Engineering&lt;/strong&gt; has gained attention.&lt;/p&gt;

&lt;p&gt;It's about &lt;strong&gt;building "self-service infrastructure" so developers can manage infrastructure themselves&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This emerged because of the gap between &lt;strong&gt;DevOps ideals and reality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps ideal (2010s)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Developers do everything: build, deploy, operate!"&lt;/li&gt;
&lt;li&gt;"You build it, you run it"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reality&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operational burden concentrated on developers&lt;/li&gt;
&lt;li&gt;Learning curve too steep (Kubernetes, Terraform, monitoring tools...)&lt;/li&gt;
&lt;li&gt;Each team picks their own tools → chaos&lt;/li&gt;
&lt;li&gt;Eventually, load concentrates on "the few who can operate"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ &lt;strong&gt;"It was unrealistic to expect all developers to be infrastructure experts"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  SRE vs Platform Engineering
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;SRE&lt;/th&gt;
&lt;th&gt;Platform Engineering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Goal&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Protect service reliability&lt;/td&gt;
&lt;td&gt;Improve developer productivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;For Whom?&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;End users (customers)&lt;/td&gt;
&lt;td&gt;Internal developers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Key Activities&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Incident response, SLO management&lt;/td&gt;
&lt;td&gt;Self-service platform, tooling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Using the police analogy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional SRE&lt;/strong&gt;: Patrol cars responding to crimes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform Engineering&lt;/strong&gt;: Install streetlights, cameras, empower residents to protect themselves&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, &lt;strong&gt;SRE is shifting from "protector" to "enabler"&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  But Budget Issues Remain
&lt;/h2&gt;

&lt;p&gt;Here's what I realized:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Changing the approach doesn't solve the "how much is enough?" problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security camera example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 cameras → "Does this even work?"&lt;/li&gt;
&lt;li&gt;100 cameras → "Do we really need that many?"&lt;/li&gt;
&lt;li&gt;1,000 cameras → "Isn't this overkill?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Same with Platform Engineering:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build CI/CD pipeline → "Too much effort?"&lt;/li&gt;
&lt;li&gt;Developer portal → "How much does this license cost?"&lt;/li&gt;
&lt;li&gt;Monitoring for all services → "Do small services need this?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, Platform Engineering's value is harder to prove than SRE's incident response. You're proving &lt;strong&gt;"losses that didn't happen"&lt;/strong&gt; rather than "losses that did happen."&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Frameworks to Justify Investment
&lt;/h2&gt;

&lt;p&gt;So how do you explain the need for investment?&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Engineer Ratio Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb: 1 SRE per 10-20 developers&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50 developers → 3-5 SREs&lt;/li&gt;
&lt;li&gt;100 developers → 5-10 SREs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Varies by service scale and complexity&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Falling below this ratio increases the risk of entering a negative spiral.&lt;/p&gt;
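
&lt;p&gt;As a rough sketch, the rule of thumb above can be turned into a headcount range. The function name, the parameter names, and the choice to round up are assumptions made for illustration, not part of any standard:&lt;/p&gt;

```python
import math

def sre_headcount_range(developers, devs_per_sre_low=10, devs_per_sre_high=20):
    """Return (min_sres, max_sres) for the 1-SRE-per-10-to-20-developers rule of thumb."""
    # Sparse end of the range: one SRE per 20 developers, rounded up.
    min_sres = math.ceil(developers / devs_per_sre_high)
    # Dense end of the range: one SRE per 10 developers, rounded up.
    max_sres = math.ceil(developers / devs_per_sre_low)
    return min_sres, max_sres

for team_size in (50, 100):
    low, high = sre_headcount_range(team_size)
    print(f"{team_size} developers: {low}-{high} SREs")
```

&lt;p&gt;Rounding up is a deliberate choice here: a fractional SRE usually means the team is already under the ratio.&lt;/p&gt;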

&lt;h3&gt;
  
  
  2. IT Budget Percentage Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Rule of thumb: 10-20% of IT budget for operations (including SRE)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Annual IT budget $1M → $100K-$200K&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat this as a rough industry rule of thumb rather than a formal standard.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Downtime Cost Calculation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Formula:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Calculate revenue lost per hour of downtime&lt;/li&gt;
&lt;li&gt;Define acceptable annual downtime (e.g., 99.9% = 8.76 hours)&lt;/li&gt;
&lt;li&gt;Calculate potential annual loss&lt;/li&gt;
&lt;li&gt;Invest 10-20% of that amount&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly downtime loss: $50K&lt;/li&gt;
&lt;li&gt;Annual acceptable downtime: 8.76 hours&lt;/li&gt;
&lt;li&gt;Potential loss: $438K&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Investment (10-20%): roughly $44K-$88K&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If investment &amp;lt; expected loss, it's a rational investment.&lt;/p&gt;
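
&lt;p&gt;The four steps above can be sketched as a small calculation. This is a minimal illustration under stated assumptions: the function and parameter names are hypothetical, a year is taken as 8,760 hours, and the hourly loss figure is the one from the example.&lt;/p&gt;

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def allowed_downtime_hours(availability_target):
    """Hours per year a service may be down while still meeting the target."""
    return (1 - availability_target) * HOURS_PER_YEAR

def investment_range(hourly_loss, availability_target, low=0.10, high=0.20):
    """Return (potential_annual_loss, min_investment, max_investment) in dollars."""
    potential_loss = hourly_loss * allowed_downtime_hours(availability_target)
    return potential_loss, potential_loss * low, potential_loss * high

# Figures from the example: $50K lost per hour, 99.9% availability target.
loss, low, high = investment_range(hourly_loss=50_000, availability_target=0.999)
print(f"Allowed downtime: {allowed_downtime_hours(0.999):.2f} hours/year")
print(f"Potential annual loss: ${loss:,.0f}")
print(f"Suggested investment: ${low:,.0f} to ${high:,.0f}")
```

&lt;p&gt;The only input that needs real estimation is the hourly loss. The rest follows from the availability target you agree on with leadership.&lt;/p&gt;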

&lt;h2&gt;
  
  
  Making It Clear for Non-Technical Executives
&lt;/h2&gt;

&lt;p&gt;If your CTO or CEO has an engineering background, they'll understand. But when they don't, it gets tough.&lt;/p&gt;

&lt;p&gt;You might not even get time to explain everything. And even if you do, they might not fully grasp it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That's why we need to articulate the necessity of SRE and Platform Engineering at a level that non-engineers can understand.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What helps is something you can hand over and say "Read this first": a primer that builds foundational understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Executive Guide Available
&lt;/h3&gt;

&lt;p&gt;I created &lt;strong&gt;"SRE &amp;amp; Platform Engineering Guide for Executives"&lt;/strong&gt; with this in mind.&lt;/p&gt;

&lt;p&gt;The guide covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why digital services are "cities," not "buildings"&lt;/li&gt;
&lt;li&gt;What SRE is (police/fire department analogy)&lt;/li&gt;
&lt;li&gt;What Platform Engineering is (roads/utilities analogy)&lt;/li&gt;
&lt;li&gt;Why investment is necessary (visualizing "invisible losses")&lt;/li&gt;
&lt;li&gt;How much to invest (three frameworks)&lt;/li&gt;
&lt;li&gt;Common misconceptions&lt;/li&gt;
&lt;li&gt;Decision-making checklist&lt;/li&gt;
&lt;li&gt;Next actions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All written to be &lt;strong&gt;understandable by non-engineers in 10 minutes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The complete guide is available in the original article.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Use it as a resource for conversations with leadership.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Infrastructure is Investment, Not Cost
&lt;/h2&gt;

&lt;p&gt;SRE and Platform Engineering investment is like fire insurance.&lt;/p&gt;

&lt;p&gt;Companies pay hundreds of thousands annually for fire insurance. Nobody says "it's wasteful because no fire happened."&lt;/p&gt;

&lt;p&gt;Similarly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expected loss: $3M/year (outage risk)&lt;/li&gt;
&lt;li&gt;Investment: $500K (SRE)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;If investment &amp;lt; expected loss, it's rational&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But many companies only invest in SRE &lt;strong&gt;after experiencing a major outage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Police and fire departments get budgets &lt;strong&gt;before&lt;/strong&gt; major incidents happen because they're recognized as &lt;strong&gt;"social infrastructure"&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SRE should be recognized as "digital infrastructure" too.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Honestly, there's no absolute answer to "how much is enough?" It becomes a matter of organizational values and priorities.&lt;/p&gt;

&lt;p&gt;But at least we can provide a basis for thinking it through.&lt;/p&gt;

&lt;p&gt;How much does your organization invest in SRE and Platform Engineering?&lt;/p&gt;




&lt;p&gt;For more on this investment framework and the complete executive guide, check out the original article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://tielec.blog/en/tech/sre/why-sre-investment-undervalued" rel="noopener noreferrer"&gt;https://tielec.blog/en/tech/sre/why-sre-investment-undervalued&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Final Chapter: Jenkins EFS Problem Solved - From 100% to 0% Throughput Usage</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sat, 14 Feb 2026 04:35:45 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/final-chapter-jenkins-efs-problem-solved-from-100-to-0-throughput-usage-2km7</link>
      <guid>https://forem.com/tielec-takashi/final-chapter-jenkins-efs-problem-solved-from-100-to-0-throughput-usage-2km7</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;After three articles tracking down a Jenkins EFS performance issue, &lt;strong&gt;enabling Shared Library cache reduced throughput usage from 100% spikes to near 0%&lt;/strong&gt;. This article covers the final results and the complete SRE process from emergency response to permanent fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  Previous Episodes (Quick Recap)
&lt;/h2&gt;

&lt;p&gt;This is the final article in a 4-part series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Episode 1&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/how-git-temp-files-killed-our-jenkins-performance-efs-metadata-iops-hell-3ff4"&gt;How Git Temp Files Killed Our Jenkins Performance&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problem: Jenkins slowed down, Git clone failures, 504 errors&lt;/li&gt;
&lt;li&gt;Discovery: EFS metadata IOPS exhaustion&lt;/li&gt;
&lt;li&gt;Culprit: ~15GB of &lt;code&gt;tmp_pack_*&lt;/code&gt; files accumulating&lt;/li&gt;
&lt;li&gt;Emergency fix: Provisioned throughput 300 MiB/s + cleanup job&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Episode 2&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/aws-efs-emergency-response-how-i-spent-69-in-26-hours-and-how-to-avoid-it-5gb8"&gt;How I Spent $69 in 26 Hours (and How to Avoid It)&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost: $69 in 26 hours&lt;/li&gt;
&lt;li&gt;Lesson: Didn't know about Elastic Throughput (1/20th the cost)&lt;/li&gt;
&lt;li&gt;Learning: Decision process was sound, but not the optimal solution&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Episode 3&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/how-jenkins-slowly-drained-our-efs-burst-credits-over-2-weeks-4dg4"&gt;How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key finding: Symptom appeared on 1/26, but root cause started on 1/13&lt;/li&gt;
&lt;li&gt;Multiple factors:
&lt;ul&gt;
&lt;li&gt;Shared Library cache was disabled (existed before)&lt;/li&gt;
&lt;li&gt;Changed to disposable agent approach (1/13) → metadata IOPS increased&lt;/li&gt;
&lt;li&gt;Development accelerated in new year → more builds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Root fix: Enabled Shared Library cache, set Refresh time to 180 minutes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Result: Dramatic Improvement
&lt;/h2&gt;

&lt;p&gt;After enabling the Shared Library cache setting (&lt;code&gt;Cache fetched versions on controller for quick retrieval&lt;/code&gt;) and setting Refresh time to 180 minutes, the EFS throughput usage changed dramatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before the fix (left side, 06:00-12:00)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput usage &lt;strong&gt;spiking to 100%&lt;/strong&gt; frequently&lt;/li&gt;
&lt;li&gt;Almost constantly under high load&lt;/li&gt;
&lt;li&gt;Far exceeding the 75% warning zone&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After the fix (after 12:00)&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throughput usage dropped dramatically and stabilized&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Baseline near 0%&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Regular small spikes every 3 hours (max ~60%)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foamsiwkn7o9ouacv3rer.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foamsiwkn7o9ouacv3rer.png" alt="Throughput usage" width="800" height="296"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The 3-hour spikes are from the Shared Library cache refresh checks (Refresh time: 180 minutes). In other words, &lt;strong&gt;it's working exactly as expected&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Honestly, I didn't expect the effect to be this clean.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Complete Timeline: 5 Throughput Modes
&lt;/h2&gt;

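&lt;p&gt;Over the course of this incident, we moved through five EFS throughput configurations across the three throughput modes (Bursting, Provisioned, and Elastic), with the last step still planned:&lt;/p&gt;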

&lt;p&gt;&lt;strong&gt;Mode Progression&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bursting (original)
    ↓ 1/27 emergency response
Provisioned 300 MiB/s (26 hours)
    ↓ 1/28 cost reduction
Elastic Throughput (1 day)
    ↓ 1/29 verification
Provisioned 10 MiB/s (current)
    ↓ planned
Bursting (return to original)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost Comparison&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Duration&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bursting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Until 1/27&lt;/td&gt;
&lt;td&gt;Storage only&lt;/td&gt;
&lt;td&gt;Normal operation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provisioned 300 MiB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1/27 (26 hrs)&lt;/td&gt;
&lt;td&gt;~$69&lt;/td&gt;
&lt;td&gt;Emergency: ensure investigation could proceed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic Throughput&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1/28-1/29 (~1 day)&lt;/td&gt;
&lt;td&gt;~$8&lt;/td&gt;
&lt;td&gt;Cost reduction: pay per use&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provisioned 10 MiB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1/30-current&lt;/td&gt;
&lt;td&gt;~$2.3/day&lt;/td&gt;
&lt;td&gt;Verification: stable operation at low cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bursting (planned)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Soon&lt;/td&gt;
&lt;td&gt;Storage only&lt;/td&gt;
&lt;td&gt;Permanent fix: return to original&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why We Changed From Elastic Throughput
&lt;/h3&gt;

&lt;p&gt;Elastic Throughput turned out to be "surprisingly costly":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Daily cost: ~$8&lt;/li&gt;
&lt;li&gt;Monthly estimate: ~$240 (~¥35,000)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In contrast, Provisioned 10 MiB/s costs ~$72/month (~¥10,000). Given our current usage pattern (average throughput a few %, max ~60%), 10 MiB/s is sufficient.&lt;/p&gt;

&lt;p&gt;However, this is just for the verification period. We plan to eventually return to Bursting mode.&lt;/p&gt;
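
&lt;p&gt;The comparison above is simple arithmetic, sketched here using only the figures quoted in this section. The 30-day month and the variable names are assumptions for illustration; actual EFS pricing depends on region and usage.&lt;/p&gt;

```python
DAYS_PER_MONTH = 30  # assumption used for the monthly estimate

elastic_daily_usd = 8.0         # observed daily cost on Elastic Throughput
provisioned_monthly_usd = 72.0  # quoted monthly cost for Provisioned 10 MiB/s

elastic_monthly_usd = elastic_daily_usd * DAYS_PER_MONTH
monthly_saving_usd = elastic_monthly_usd - provisioned_monthly_usd
cost_ratio = elastic_monthly_usd / provisioned_monthly_usd

print(f"Elastic:     ~${elastic_monthly_usd:.0f}/month")
print(f"Provisioned: ~${provisioned_monthly_usd:.0f}/month")
print(f"Saving:      ~${monthly_saving_usd:.0f}/month ({cost_ratio:.1f}x cheaper)")
```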




&lt;h2&gt;
  
  
  Hypothesis Verification
&lt;/h2&gt;

&lt;p&gt;Let's verify the hypothesis from Episode 3.&lt;/p&gt;

&lt;h3&gt;
  
  
  Initial Hypothesis
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Did the change to disposable agent approach (1/13) cause the metadata IOPS spike?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Answer: &lt;strong&gt;Partially correct, but not the main culprit&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Culprit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Shared Library cache was disabled (existed before the incident)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just enabling the cache brought throughput usage down to near 0%. This means Shared Library's full fetch on every build was consuming the overwhelming majority of metadata IOPS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Impact of Disposable Agent Approach
&lt;/h3&gt;

&lt;p&gt;So was the disposable agent approach irrelevant?&lt;/p&gt;

&lt;p&gt;Not quite. I believe the change to disposable agents was &lt;strong&gt;one factor that accelerated Burst Credit depletion&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The combination of three factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared Library cache disabled (pre-existing) → Controller-side metadata IOPS on every build&lt;/li&gt;
&lt;li&gt;Disposable agent approach (from 1/13) → Agent-side metadata IOPS on every build&lt;/li&gt;
&lt;li&gt;Development acceleration in new year → increased build frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three factors together caused rapid Burst Credit depletion from 1/13, with symptoms appearing 2 weeks later on 1/26.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SRE Process: From Detection to Resolution
&lt;/h2&gt;

&lt;p&gt;Looking at the timeline of our response, you can see a clear SRE process:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Complete SRE Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Problem Detection&lt;/strong&gt; (1/27 morning)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Symptoms: Jenkins slow, Git clone failures, 504 errors&lt;/li&gt;
&lt;li&gt;Metrics check: EFS throughput usage at 100%&lt;/li&gt;
&lt;li&gt;Time: ~30 minutes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Emergency Response&lt;/strong&gt; (1/27 morning)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decision: Change to Provisioned throughput 300 MiB/s (executed next day)&lt;/li&gt;
&lt;li&gt;Goal: Ensure investigation could continue&lt;/li&gt;
&lt;li&gt;Tradeoff: High cost vs. continued development&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Impact Mitigation&lt;/strong&gt; (1/27 afternoon)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created cleanup job&lt;/li&gt;
&lt;li&gt;Planned &lt;code&gt;tmp_pack_*&lt;/code&gt; deletion&lt;/li&gt;
&lt;li&gt;Implemented recurrence prevention&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Root Cause Investigation&lt;/strong&gt; (1/27-1/30)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1: Found &lt;code&gt;tmp_pack_*&lt;/code&gt; accumulation&lt;/li&gt;
&lt;li&gt;Stage 2: Burst Credit Balance graph analysis revealed 1/13 as starting point&lt;/li&gt;
&lt;li&gt;Stage 3: Discovered Shared Library cache was disabled&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To be honest, I got stuck here. When I found &lt;code&gt;tmp_pack_*&lt;/code&gt;, I thought "this is it," but it was actually just part of the symptom. Reviewing the graphs chronologically led me to the true root cause.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Permanent Fix&lt;/strong&gt; (1/30)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enabled Shared Library cache&lt;/li&gt;
&lt;li&gt;Set Refresh time: 180 minutes&lt;/li&gt;
&lt;li&gt;Optimized throughput mode&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Effect Measurement&lt;/strong&gt; (1/30 onward)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirmed dramatic improvement in throughput usage&lt;/li&gt;
&lt;li&gt;3-hour spikes are as expected&lt;/li&gt;
&lt;li&gt;Continuing to monitor&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reflection &amp;amp; Knowledge Sharing&lt;/strong&gt; (this article)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reflected on cost decisions (didn't know about Elastic Throughput)&lt;/li&gt;
&lt;li&gt;Understood the multiple contributing factors&lt;/li&gt;
&lt;li&gt;Sharing knowledge with the organization&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This last part is surprisingly crucial. It's not just about solving the problem, but also about articulating "why it happened" and "how we decided" so the lessons can be applied to future situations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Outstanding Issues and Next Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Short-term Tasks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Return to Bursting Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're currently running on Provisioned 10 MiB/s, but plan to return to Bursting mode eventually.&lt;/p&gt;

&lt;p&gt;What to check before switching back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Has Burst Credit Balance recovered sufficiently?&lt;/li&gt;
&lt;li&gt;Are new &lt;code&gt;tmp_pack_*&lt;/code&gt; files being generated?&lt;/li&gt;
&lt;li&gt;Is the cleanup job working correctly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Strengthen Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This incident could have been caught earlier with proper monitoring.&lt;/p&gt;

&lt;p&gt;Alerts to set up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EFS throughput usage &amp;gt; 75%&lt;/li&gt;
&lt;li&gt;Burst Credit Balance &amp;lt; threshold (TBD)&lt;/li&gt;
&lt;li&gt;Abnormal storage capacity increase&lt;/li&gt;
&lt;/ul&gt;
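
&lt;p&gt;As one possible starting point for the Burst Credit Balance alert, the sketch below builds the arguments for CloudWatch's &lt;code&gt;put_metric_alarm&lt;/code&gt; call. The file system ID, SNS topic ARN, alarm name, evaluation settings, and the 1 TiB threshold are all hypothetical placeholders, since the real threshold is still TBD.&lt;/p&gt;

```python
def burst_credit_alarm_params(file_system_id, threshold_bytes, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm on EFS BurstCreditBalance."""
    return {
        "AlarmName": f"efs-{file_system_id}-burst-credit-low",  # hypothetical naming scheme
        "Namespace": "AWS/EFS",
        "MetricName": "BurstCreditBalance",  # reported in bytes
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Minimum",
        "Period": 300,           # seconds per datapoint (assumed)
        "EvaluationPeriods": 3,  # consecutive breaching datapoints (assumed)
        "Threshold": float(threshold_bytes),
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = burst_credit_alarm_params(
    file_system_id="fs-EXAMPLE",                          # placeholder ID
    threshold_bytes=1 * 1024**4,                          # hypothetical 1 TiB floor
    sns_topic_arn="arn:aws:sns:REGION:ACCOUNT_ID:TOPIC",  # placeholder ARN
)

# To actually create the alarm (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["MetricName"], params["ComparisonOperator"], params["Threshold"])
```

&lt;p&gt;Keeping the parameters in a plain dictionary makes the threshold easy to review and adjust before anything is created in AWS. The throughput usage alert would likely need CloudWatch metric math on top of this and is left out of the sketch.&lt;/p&gt;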

&lt;h3&gt;
  
  
  Long-term Considerations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Reconsidering the Disposable Agent Approach&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We're continuing with the disposable agent approach, but its impact on metadata IOPS can't be ignored.&lt;/p&gt;

&lt;p&gt;Options to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extend agent lifecycle slightly to reuse across multiple jobs&lt;/li&gt;
&lt;li&gt;Share Git cache on EFS across all agents&lt;/li&gt;
&lt;li&gt;Place cache in S3 and sync on startup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finding the right balance between cost and performance is the next challenge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Looking back at the timeline, here's what I learned.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technical Lessons
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;EFS Metadata IOPS Characteristics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large numbers of small-file operations are deadly&lt;/li&gt;
&lt;li&gt;File count matters more than storage capacity&lt;/li&gt;
&lt;li&gt;Burst Credit management is key&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Jenkins Caching Mechanisms&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Importance of Shared Library cache&lt;/li&gt;
&lt;li&gt;Setting the right Refresh time balance&lt;/li&gt;
&lt;li&gt;Hidden costs of disabled caching&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Throughput Mode Selection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elastic Throughput isn't a silver bullet&lt;/li&gt;
&lt;li&gt;Optimization based on usage patterns&lt;/li&gt;
&lt;li&gt;Importance of cost estimation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Process Lessons
&lt;/h3&gt;

&lt;p&gt;But more importantly, it's about "how we decided."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emergency Decision Making&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make decisions without perfect information&lt;/li&gt;
&lt;li&gt;Prioritize avoiding worst-case scenarios&lt;/li&gt;
&lt;li&gt;Clarify tradeoffs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Investigation Approach&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look at graphs chronologically, not just symptoms&lt;/li&gt;
&lt;li&gt;Form hypotheses, test them, move to next hypothesis if wrong&lt;/li&gt;
&lt;li&gt;Acknowledge when you're stuck&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Accountability&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Costs can be explained after the fact&lt;/li&gt;
&lt;li&gt;Articulate the decision process&lt;/li&gt;
&lt;li&gt;Share both successes and failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I regret not knowing about Elastic Throughput, but I don't regret the decision to "ensure investigation could continue."&lt;/p&gt;

&lt;p&gt;The $69 tuition might have been expensive, but I think I got more than that in learning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Related Articles
&lt;/h2&gt;

&lt;p&gt;This is the final article in the series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Episode 1&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/how-git-temp-files-killed-our-jenkins-performance-efs-metadata-iops-hell-3ff4"&gt;How Git Temp Files Killed Our Jenkins Performance&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episode 2&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/aws-efs-emergency-response-how-i-spent-69-in-26-hours-and-how-to-avoid-it-5gb8"&gt;How I Spent $69 in 26 Hours&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Episode 3&lt;/strong&gt;: &lt;a href="https://dev.to/tielec-takashi/how-jenkins-slowly-drained-our-efs-burst-credits-over-2-weeks-4dg4"&gt;How Jenkins Slowly Drained Our EFS Burst Credits&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;I write more about SRE decision-making processes and the thinking behind technical choices on my blog.&lt;br&gt;
Check it out: &lt;a href="https://tielec.blog/en/tech/sre/jenkins-efs-final-report" rel="noopener noreferrer"&gt;https://tielec.blog/en/tech/sre/jenkins-efs-final-report&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
      <category>performance</category>
    </item>
    <item>
      <title>How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sat, 31 Jan 2026 03:58:09 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/how-jenkins-slowly-drained-our-efs-burst-credits-over-2-weeks-4dg4</link>
      <guid>https://forem.com/tielec-takashi/how-jenkins-slowly-drained-our-efs-burst-credits-over-2-weeks-4dg4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shared Library cache was disabled (existing issue)&lt;/li&gt;
&lt;li&gt;Switched to disposable agents (1/13 change)&lt;/li&gt;
&lt;li&gt;Increased build frequency (New Year effect)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Result: 50x increase in metadata IOPS → EFS burst credits drained over 2 weeks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you're running Jenkins on EFS, this could happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time-series analysis of metrics is crucial.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mystery: Symptoms vs. Root Cause
&lt;/h2&gt;

&lt;p&gt;Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15GB of Git temporary files (&lt;code&gt;tmp_pack_*&lt;/code&gt;) accumulated on EFS, causing metadata IOPS exhaustion.&lt;/p&gt;

&lt;p&gt;We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?&lt;/p&gt;

&lt;p&gt;Not quite.&lt;/p&gt;

&lt;p&gt;When I checked the EFS Burst Credit Balance graph, I noticed something important:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The credit balance started declining around 1/13, but symptoms appeared on 1/26.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1/13&lt;/strong&gt;: Credit decline starts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/19&lt;/strong&gt;: Rapid decline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/26&lt;/strong&gt;: Credit bottoms out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1/26-27&lt;/strong&gt;: Symptoms appear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;tmp_pack_*&lt;/code&gt; accumulation was a symptom, not the root cause. Something changed on 1/13.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Changed on 1/13?
&lt;/h2&gt;

&lt;p&gt;Honestly, this stumped me. I had a few ideas, but nothing definitive:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agent Architecture Change
&lt;/h3&gt;

&lt;p&gt;Around 1/13, we changed our Jenkins agent strategy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before: Shared Agents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 instances: c5.large, etc.&lt;/li&gt;
&lt;li&gt;Multiple jobs share agents&lt;/li&gt;
&lt;li&gt;Workspace reuse&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git pull&lt;/code&gt; for incremental updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After: Disposable Agents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EC2 instances: t3.small, etc.&lt;/li&gt;
&lt;li&gt;One agent per job&lt;/li&gt;
&lt;li&gt;Destroy after use&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git clone&lt;/code&gt; for full clones every time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was cost reduction. We didn't consider metadata IOPS impact.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Post-New Year Development Rush
&lt;/h3&gt;

&lt;p&gt;Teams ramped up development after the New Year holiday, increasing overall Jenkins load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Math: 50x Metadata IOPS Increase
&lt;/h2&gt;

&lt;p&gt;Let me calculate the impact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Builds per day: 50 (estimated)
Files created per clone: 5,000

Shared agent approach:
  Clone once = 5,000 metadata operations

Disposable agent approach:
  50 builds × 5,000 files = 250,000 metadata operations/day
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;50x increase in metadata IOPS.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Add the New Year rush, and the numbers get even worse.&lt;/p&gt;
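
&lt;p&gt;As a rough sanity check, the same arithmetic fits in a few lines of Python. The inputs are the estimates from this post (50 builds/day, 5,000 files per clone), the function names are made up for illustration, and the shared-agent case assumes a single full clone with incremental pulls ignored:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough model of EFS metadata operations per day (estimates, not measurements).
BUILDS_PER_DAY = 50       # estimated in this post
FILES_PER_CLONE = 5_000   # estimated in this post


def shared_agent_ops(files_per_clone):
    # Workspace is reused: assume one full clone, then cheap incremental pulls (ignored here).
    return files_per_clone


def disposable_agent_ops(builds_per_day, files_per_clone):
    # Every build starts from an empty workspace and clones everything again.
    return builds_per_day * files_per_clone


shared = shared_agent_ops(FILES_PER_CLONE)
disposable = disposable_agent_ops(BUILDS_PER_DAY, FILES_PER_CLONE)
print(shared, disposable, disposable // shared)  # 5000 250000 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The ratio is simply the number of builds per day, so every extra build under the disposable model adds another full clone's worth of metadata operations.&lt;/p&gt;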

&lt;h2&gt;
  
  
  Understanding Git Cache in Jenkins
&lt;/h2&gt;

&lt;p&gt;During the investigation, I noticed the &lt;code&gt;/mnt/efs/jenkins/caches/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the Jenkins Git plugin's &lt;strong&gt;bare repository cache&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Git Caching Works
&lt;/h3&gt;

&lt;p&gt;Jenkins Git plugin optimizes clones by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Caching remote repos in &lt;code&gt;/mnt/efs/jenkins/caches/git-{hash}/&lt;/code&gt; as bare repositories&lt;/li&gt;
&lt;li&gt;Cloning to job workspaces using &lt;code&gt;git clone --reference&lt;/code&gt; from this cache&lt;/li&gt;
&lt;li&gt;The hash in the directory name is derived from the repository URL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Disposable agents might not benefit from this cache since they're new every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Smoking Gun: tmp_pack_* Location
&lt;/h2&gt;

&lt;p&gt;I revisited where &lt;code&gt;tmp_pack_*&lt;/code&gt; files were located:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
  └── 335abf.../root/.git/objects/pack/
      └── tmp_pack_WqmOyE  ← 100-300MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are in &lt;strong&gt;per-build directories&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs/sample-job/jobs/sample-pipeline/
└── builds/
    ├── 104/
    │   └── libs/.../tmp_pack_WqmOyE
    ├── 105/
    │   └── libs/.../tmp_pack_XYZ123
    └── 106/
        └── libs/.../tmp_pack_ABC456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Every build&lt;/strong&gt; was re-checking out the Pipeline Shared Library, generating &lt;code&gt;tmp_pack_*&lt;/code&gt; each time.&lt;/p&gt;

&lt;p&gt;Question: &lt;strong&gt;Why is Shared Library being fetched on every build?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Cause: Cache Setting Was OFF
&lt;/h2&gt;

&lt;p&gt;I checked Jenkins configuration and found the smoking gun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Shared Library setting &lt;code&gt;Cache fetched versions on controller for quick retrieval&lt;/code&gt; was unchecked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared Library cache completely disabled&lt;/li&gt;
&lt;li&gt;Full fetch from remote repository on every build&lt;/li&gt;
&lt;li&gt;Temporary files generated in &lt;code&gt;.git/objects/pack/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Massive metadata IOPS consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Fix: Enable Caching
&lt;/h2&gt;

&lt;p&gt;I immediately changed the settings:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enabled&lt;/strong&gt; &lt;code&gt;Cache fetched versions on controller for quick retrieval&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;Refresh time in minutes&lt;/code&gt; to &lt;strong&gt;180 minutes&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Choosing the Refresh Time
&lt;/h3&gt;

&lt;p&gt;This is actually important. Options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;60-120 min&lt;/strong&gt;: Fast updates, moderate IOPS reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;180 min (3 hours)&lt;/strong&gt;: Balanced - 8 updates/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;360 min (6 hours)&lt;/strong&gt;: Stable operation - 4 updates/day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1440 min (24 hours)&lt;/strong&gt;: Maximum IOPS reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why I chose 180 minutes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update checks happen at most ~8 times/day (9am, 12pm, 3pm, 6pm...)&lt;/li&gt;
&lt;li&gt;We can tolerate Shared Library changes taking up to half a day to show up, so a 3-hour window is comfortable&lt;/li&gt;
&lt;li&gt;Significant IOPS reduction (every build → once per 3 hours)&lt;/li&gt;
&lt;li&gt;The cache can be cleared manually for urgent changes&lt;/li&gt;
&lt;/ol&gt;
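
&lt;p&gt;To compare the options, here is a minimal sketch of the upper bound on library fetches per day for each refresh time. It assumes the cached library is refetched at most once per refresh window; the function name is hypothetical and actual behavior depends on when builds run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MINUTES_PER_DAY = 24 * 60


def max_fetches_per_day(refresh_minutes):
    # With caching on, assume at most one refetch per refresh window.
    return MINUTES_PER_DAY // refresh_minutes


for refresh in (60, 180, 360, 1440):
    print(refresh, max_fetches_per_day(refresh))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Compared with one fetch per build at an estimated 50 builds/day, even the 60-minute setting roughly halves the number of fetches, and 180 minutes caps it at 8.&lt;/p&gt;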

&lt;p&gt;Jenkins has a "force refresh" feature, so urgent changes aren't a problem. I documented this in our runbook so we don't forget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring the Impact
&lt;/h2&gt;

&lt;p&gt;Post-change monitoring plan:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short-term (24-48 hours)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No new &lt;code&gt;tmp_pack_*&lt;/code&gt; files generated&lt;/li&gt;
&lt;li&gt;EFS metadata IOPS decrease&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mid-term (1 week)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Burst Credit Balance recovery trend&lt;/li&gt;
&lt;li&gt;Stable build performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term (1 month)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credits remain stable&lt;/li&gt;
&lt;li&gt;No recurrence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Symptoms ≠ Root Cause Timeline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Symptom appearance: 1/26-1/27&lt;/li&gt;
&lt;li&gt;Root cause: Around 1/13&lt;/li&gt;
&lt;li&gt;Credit depletion: Gradual over 2 weeks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Time-series analysis is crucial.&lt;/strong&gt; Fixing only the visible symptoms leads to superficial solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Architecture Changes Have Hidden Costs
&lt;/h3&gt;

&lt;p&gt;The disposable agent change was for cost optimization. We did reduce EC2 costs, but created problems elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When changing architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Evaluate performance impact beforehand&lt;/li&gt;
&lt;li&gt;Set up monitoring before the change&lt;/li&gt;
&lt;li&gt;Continue tracking metrics after&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. EFS Metadata IOPS Characteristics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Mass creation/deletion of small files is deadly&lt;/li&gt;
&lt;li&gt;File count matters more than storage size&lt;/li&gt;
&lt;li&gt;Burst mode requires credit management&lt;/li&gt;
&lt;li&gt;Credit depletion happens gradually&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Especially with &lt;code&gt;.git/objects/&lt;/code&gt; containing thousands of small files, behavior differs drastically from normal file I/O.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Compound Root Causes
&lt;/h3&gt;

&lt;p&gt;This issue didn't have a single cause. Three factors combined:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Shared Library cache disabled (pre-existing)&lt;/li&gt;
&lt;li&gt;Disposable agent switch (1/13)&lt;/li&gt;
&lt;li&gt;Increased builds (New Year)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each alone might not have caused major issues, but together they exceeded the critical threshold.&lt;/p&gt;

&lt;h2&gt;
  
  
  Open Questions
&lt;/h2&gt;

&lt;p&gt;While we enabled Shared Library caching, we're still using disposable agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Can agent-side Git cache be utilized effectively with disposable agents?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Possibilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Share EFS Git cache across all agents&lt;/li&gt;
&lt;li&gt;Extend agent lifecycle slightly for reuse across jobs&lt;/li&gt;
&lt;li&gt;Cache in S3 and sync on startup&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finding the right balance between cost and performance remains a challenge.&lt;/p&gt;




&lt;p&gt;I write more about technical decision-making and engineering practices on my blog.&lt;br&gt;
Check it out: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
      <category>performance</category>
    </item>
    <item>
      <title>Python's Silent Failure: When `python -m` Does Nothing (And How to Fix It)</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sat, 31 Jan 2026 03:34:25 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/pythons-silent-failure-when-python-m-does-nothing-and-how-to-fix-it-4n30</link>
      <guid>https://forem.com/tielec-takashi/pythons-silent-failure-when-python-m-does-nothing-and-how-to-fix-it-4n30</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;Ever run &lt;code&gt;python -m your.module&lt;/code&gt; and get... nothing? No errors, exit code 0, but your code never actually runs? I just spent an hour debugging this exact problem.&lt;/p&gt;

&lt;p&gt;Here's what I learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;I was running a CLI command via Jenkins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; monitoring_sdk.core.cli metabase-check &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$?&lt;/span&gt;
0
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;reports/
&lt;span class="c"&gt;# Empty - nothing was created&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Exit code 0 means success, right? But nothing happened. No output, no files, no errors. Just... silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Root Cause
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Missing &lt;code&gt;__main__&lt;/code&gt; entrypoint.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you run &lt;code&gt;python -m module.path&lt;/code&gt;, Python executes the module's top-level code as &lt;code&gt;__main__&lt;/code&gt; but &lt;strong&gt;doesn't automatically call your &lt;code&gt;main()&lt;/code&gt; function&lt;/strong&gt;. You need to explicitly tell it what to run.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quick Fix
&lt;/h2&gt;

&lt;p&gt;Add one of these (or both):&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: Create &lt;code&gt;__main__.py&lt;/code&gt; (Recommended)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# monitoring_sdk/core/__main__.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;monitoring_sdk.core.cli&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Option 2: Add to your main file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# monitoring_sdk/core/cli.py
# ... your code ...
&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



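&lt;p&gt;That's it. Problem solved. One caveat: the two options answer to different commands. &lt;code&gt;__main__.py&lt;/code&gt; runs when you execute the package itself (&lt;code&gt;python -m monitoring_sdk.core&lt;/code&gt;), while the guard in &lt;code&gt;cli.py&lt;/code&gt; is what makes &lt;code&gt;python -m monitoring_sdk.core.cli&lt;/code&gt; call &lt;code&gt;main()&lt;/code&gt;.&lt;/p&gt;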

&lt;h2&gt;
  
  
  What Actually Happens
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;python -m monitoring_sdk.core.cli&lt;/code&gt;, Python:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Loads &lt;code&gt;monitoring_sdk/core/cli.py&lt;/code&gt; and runs it with &lt;code&gt;__name__&lt;/code&gt; set to &lt;code&gt;"__main__"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Executes module-level code (imports, decorators, function definitions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stops&lt;/strong&gt; - it doesn't call &lt;code&gt;main()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Exits with code 0&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So you get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Module imported&lt;/li&gt;
&lt;li&gt;✅ Functions defined&lt;/li&gt;
&lt;li&gt;❌ Nothing executed&lt;/li&gt;
&lt;li&gt;✅ Exit code 0 (looks like success!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is what I call a "silent failure" - the worst kind.&lt;/p&gt;
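
&lt;p&gt;To reproduce this without touching a real project, here is a minimal sketch that writes a throwaway package (the name &lt;code&gt;demo_pkg&lt;/code&gt; is made up) and runs it with &lt;code&gt;python -m&lt;/code&gt;, once without the guard and once with it. It assumes a default interpreter setup where the current directory is on the module search path:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import subprocess
import sys
import tempfile
from pathlib import Path

CLI_WITHOUT_GUARD = 'def main():\n    print("main ran")\n'
CLI_WITH_GUARD = CLI_WITHOUT_GUARD + '\nif __name__ == "__main__":\n    main()\n'


def run_as_module(cli_source):
    """Write a throwaway package and run `python -m demo_pkg.cli` against it."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "cli.py").write_text(cli_source)
        result = subprocess.run(
            [sys.executable, "-m", "demo_pkg.cli"],
            cwd=tmp, capture_output=True, text=True,
        )
        # Return the exit code and whatever the module printed.
        return result.returncode, result.stdout.strip()


print(run_as_module(CLI_WITHOUT_GUARD))
print(run_as_module(CLI_WITH_GUARD))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Both runs should exit with code 0, but only the second should print anything. The exit code alone cannot tell them apart.&lt;/p&gt;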

&lt;h2&gt;
  
  
  How I Debugged It
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Added logging everywhere
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;metabase_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting command&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# This never appeared
&lt;/span&gt;    &lt;span class="c1"&gt;# ... code ...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Checked what WAS running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ DEBUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; monitoring_sdk.core.cli metabase-check
DEBUG - Package initialized  &lt;span class="c"&gt;# Module imported&lt;/span&gt;
DEBUG - Monitors initialized &lt;span class="c"&gt;# Module imported&lt;/span&gt;
&lt;span class="c"&gt;# ... nothing else&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Package initialization logs appeared (because &lt;code&gt;__init__.py&lt;/code&gt; runs on import), but my function logs didn't. That's when I realized: &lt;strong&gt;the function was never called&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Tried &lt;code&gt;--help&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; monitoring_sdk.core.cli &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;span class="c"&gt;# No output at all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If even &lt;code&gt;--help&lt;/code&gt; doesn't work, the entrypoint is definitely missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Python Looks for Entrypoints
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;python -m module.path&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
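&lt;li&gt;If &lt;code&gt;module.path&lt;/code&gt; is a package, look for &lt;code&gt;module/path/__main__.py&lt;/code&gt; → Run it&lt;/li&gt;
&lt;li&gt;Otherwise, run &lt;code&gt;module/path.py&lt;/code&gt; as &lt;code&gt;__main__&lt;/code&gt; (nothing calls &lt;code&gt;main()&lt;/code&gt; unless the file has &lt;code&gt;if __name__ == "__main__"&lt;/code&gt;)&lt;/li&gt;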
&lt;/ol&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;How You Run It&lt;/th&gt;
&lt;th&gt;What You Need&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
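&lt;tr&gt;
&lt;td&gt;&lt;code&gt;python -m package&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;package/__main__.py&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;python -m package.module&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;if __name__ == "__main__"&lt;/code&gt; in the module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;python script.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;if __name__ == "__main__"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;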
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why This Bit Me (AI-Generated Code)
&lt;/h2&gt;

&lt;p&gt;Full transparency: I was using AI to generate code quickly. The AI created the CLI structure perfectly - decorators, commands, options - but forgot the entrypoint.&lt;/p&gt;

&lt;p&gt;I assumed "AI generated it, so it's complete." I didn't even check. That was my mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lesson learned&lt;/strong&gt;: AI is great for boilerplate, but you still need to verify the basics.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Fix in Action
&lt;/h2&gt;

&lt;p&gt;After adding the entrypoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; monitoring_sdk.core.cli metabase-check &lt;span class="nt"&gt;--config&lt;/span&gt; config.yaml
INFO - Starting card check
INFO - Card check completed: &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OK
INFO - Report generated: &lt;span class="nv"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;reports/summary.md
INFO - Command completed: &lt;span class="nv"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OK

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;reports/
summary.md  &lt;span class="c"&gt;# Finally!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;python -m&lt;/code&gt; requires an explicit entrypoint&lt;/li&gt;
&lt;li&gt;Exit code 0 doesn't mean your code ran&lt;/li&gt;
&lt;li&gt;Silent failures are hard to debug - add logging early&lt;/li&gt;
&lt;li&gt;Even with AI-generated code, verify the basics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The worst bugs are the simple ones you don't think to check.&lt;/p&gt;




&lt;p&gt;If you're interested in more troubleshooting processes and decision-making in engineering, I write about them here: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>cli</category>
      <category>beginners</category>
    </item>
    <item>
      <title>AWS EFS Emergency Response: How I Spent $69 in 26 Hours (And How to Avoid It)</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Fri, 30 Jan 2026 06:12:13 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/aws-efs-emergency-response-how-i-spent-69-in-26-hours-and-how-to-avoid-it-5gb8</link>
      <guid>https://forem.com/tielec-takashi/aws-efs-emergency-response-how-i-spent-69-in-26-hours-and-how-to-avoid-it-5gb8</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;During a Jenkins EFS incident, I switched to Provisioned Throughput (300 MiB/s) for emergency response. It cost &lt;strong&gt;$69 for just 26 hours&lt;/strong&gt;. If I had known about Elastic Throughput, it would have been around &lt;strong&gt;$3.50&lt;/strong&gt;. Here's what I learned about EFS throughput modes and cost optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Incident
&lt;/h2&gt;

&lt;p&gt;Last week, our Jenkins CI/CD pipeline came to a halt due to EFS metadata IOPS exhaustion. As an emergency measure, I changed the EFS throughput mode to Provisioned Throughput at 300 MiB/s to keep Jenkins running while investigating the root cause.&lt;/p&gt;

&lt;p&gt;The next day, I checked AWS Cost Explorer and saw:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;$69.00&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For 26 hours of usage. Ouch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you're running EFS for production workloads, understanding throughput modes is critical. A simple configuration choice can mean the difference between &lt;strong&gt;$3 and $69&lt;/strong&gt; for the same workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  EFS Throughput Modes: A Quick Comparison
&lt;/h2&gt;

&lt;p&gt;AWS EFS offers three throughput modes:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Bursting Throughput (Default)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Storage cost only&lt;/p&gt;

&lt;p&gt;Performance scales with storage size. You get baseline throughput based on your storage capacity, plus burst credits for temporary spikes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ No extra cost&lt;/li&gt;
&lt;li&gt;❌ Performance degrades when credits run out (our problem)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Provisioned Throughput
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Storage + Throughput cost&lt;/p&gt;

&lt;p&gt;Tokyo region: &lt;strong&gt;~$7.2 per MiB/s per month&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For 300 MiB/s:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monthly: 300 × $7.2 = &lt;strong&gt;$2,160&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;26 hours: $2,160 × (26/720) ≈ &lt;strong&gt;$78&lt;/strong&gt; (actual: $69)&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;✅ Guaranteed performance&lt;/li&gt;
&lt;li&gt;❌ Very expensive, billed even when idle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Elastic Throughput
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Storage + Actual usage&lt;/p&gt;

&lt;p&gt;Tokyo region:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read: $0.04/GB&lt;/li&gt;
&lt;li&gt;Write: $0.07/GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For 26 hours with ~50GB transferred, priced at the write rate as an upper bound:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50GB × $0.07 ≈ &lt;strong&gt;$3.50&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;✅ Pay-per-use, auto-scales&lt;/li&gt;
&lt;li&gt;❌ Harder to predict costs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;26-hour Cost&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Bursting&lt;/td&gt;
&lt;td&gt;$0 extra (storage only, ~$5.6/month)&lt;/td&gt;
&lt;td&gt;Normal operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Provisioned&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$69&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Constant high throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$3.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spike handling (best for most cases)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Difference&lt;/strong&gt;: ~$65 (~9,500 yen)&lt;/p&gt;
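
&lt;p&gt;For anyone who wants to plug in their own numbers, here is a small sketch of the two estimates. The default rates are the approximate Tokyo-region figures quoted in this post (not authoritative pricing), the function names are made up, and the Elastic figure assumes everything is billed at the write rate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;HOURS_PER_MONTH = 720  # 30-day month, as in the estimate in this post


def provisioned_cost(mib_per_s, hours, usd_per_mib_s_month=7.2):
    # Billed for the provisioned rate for as long as it is configured, used or not.
    return mib_per_s * usd_per_mib_s_month * hours / HOURS_PER_MONTH


def elastic_cost(gb_transferred, usd_per_gb=0.07):
    # Billed per GB actually transferred; 0.07 is the write rate quoted in this post.
    return gb_transferred * usd_per_gb


print(round(provisioned_cost(300, 26), 2))  # 78.0
print(round(elastic_cost(50), 2))           # 3.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Provisioned estimate depends only on time, while the Elastic one depends only on data moved. That is why a mostly idle emergency window is so much cheaper on Elastic.&lt;/p&gt;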

&lt;h2&gt;
  
  
  What I Should Have Done
&lt;/h2&gt;

&lt;p&gt;Instead of jumping to Provisioned Throughput, here's the better approach:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Switch to Elastic Throughput
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws efs update-file-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file-system-id&lt;/span&gt; fs-xxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--throughput-mode&lt;/span&gt; elastic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-scaled during investigation&lt;/li&gt;
&lt;li&gt;Cost only ~$3.50 for the same period&lt;/li&gt;
&lt;li&gt;No manual capacity planning needed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Investigate Root Cause
&lt;/h3&gt;

&lt;p&gt;While Elastic Throughput handles the spike automatically, investigate and fix the underlying issue (in our case, Git temporary files accumulating).&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Set Up Monitoring
&lt;/h3&gt;

&lt;p&gt;CloudWatch alarms for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PercentIOLimit&lt;/code&gt; &amp;gt; 75%&lt;/li&gt;
&lt;li&gt;Early warning before IOPS exhaustion&lt;/li&gt;
&lt;/ul&gt;
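
&lt;p&gt;As a starting point, here is a hedged sketch of such an alarm using boto3. The alarm name, period, and evaluation count are assumptions to tune for your workload, &lt;code&gt;fs-xxxxxx&lt;/code&gt; is a placeholder, and note that &lt;code&gt;PercentIOLimit&lt;/code&gt; is only reported for file systems in General Purpose performance mode:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def percent_io_limit_alarm(file_system_id, threshold=75.0):
    # Keyword arguments for CloudWatch put_metric_alarm; name and periods are illustrative.
    return {
        "AlarmName": f"efs-{file_system_id}-percent-io-limit",
        "Namespace": "AWS/EFS",
        "MetricName": "PercentIOLimit",
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Maximum",
        "Period": 300,             # seconds per datapoint (assumption)
        "EvaluationPeriods": 3,    # consecutive breaching datapoints (assumption)
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def create_alarm(file_system_id):
    import boto3  # imported lazily so the builder above works without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**percent_io_limit_alarm(file_system_id))


print(percent_io_limit_alarm("fs-xxxxxx")["AlarmName"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the parameters in a plain function makes them easy to review before &lt;code&gt;create_alarm&lt;/code&gt; touches a real account. In practice you would also attach a notification target such as an SNS topic.&lt;/p&gt;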

&lt;h2&gt;
  
  
  Why I Didn't Choose Elastic Throughput
&lt;/h2&gt;

&lt;p&gt;Honestly? &lt;strong&gt;I didn't know it existed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elastic Throughput was announced in 2022, but I hadn't updated my knowledge. During the emergency, my mental model was:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bursting = free but unreliable&lt;/li&gt;
&lt;li&gt;Provisioned = expensive but guaranteed&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I missed the third, better option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Was the Decision Wrong?
&lt;/h2&gt;

&lt;p&gt;Not entirely. Let's look at ROI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt;: $69 (~10,000 yen)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoided Loss&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 engineers × 3 hours waiting = 30 person-hours&lt;/li&gt;
&lt;li&gt;At ~$50/hour = &lt;strong&gt;$1,500&lt;/strong&gt; in productivity loss&lt;/li&gt;
&lt;li&gt;Plus deployment delays (hard to quantify)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ROI&lt;/strong&gt;: ~20x&lt;/p&gt;

&lt;p&gt;The decision to prioritize business continuity was correct. But knowing about Elastic Throughput would have achieved the same result for 1/20th the cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Always Research Current Options
&lt;/h3&gt;

&lt;p&gt;Don't rely on old knowledge during emergencies. Take 5 minutes to check AWS documentation for the latest features.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Cost Estimation is Part of the Response
&lt;/h3&gt;

&lt;p&gt;"Make it work first" is important, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;List all options&lt;/li&gt;
&lt;li&gt;Do a quick cost comparison&lt;/li&gt;
&lt;li&gt;Choose based on data, not urgency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Document and Share
&lt;/h3&gt;

&lt;p&gt;This $69 lesson becomes valuable when shared. Your team (and the community) can learn without paying the same price.&lt;/p&gt;

&lt;h2&gt;
  
  
  Action Items
&lt;/h2&gt;

&lt;p&gt;If you're using EFS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Check your current throughput mode&lt;/li&gt;
&lt;li&gt;[ ] Consider Elastic Throughput for variable workloads&lt;/li&gt;
&lt;li&gt;[ ] Set up CloudWatch alarms for &lt;code&gt;PercentIOLimit&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Document your throughput mode decision process&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Use Elastic Throughput for most production workloads.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's the best of both worlds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles spikes automatically&lt;/li&gt;
&lt;li&gt;Pay only for what you use&lt;/li&gt;
&lt;li&gt;No capacity planning required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Provisioned Throughput should be reserved for constant, predictable high-throughput scenarios.&lt;/p&gt;

&lt;p&gt;Next time I face a similar situation, I'll reach for Elastic Throughput first.&lt;/p&gt;




&lt;p&gt;I write more about technical decision-making and engineering practices on my blog.&lt;br&gt;
Check it out: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/efs/pricing/" rel="noopener noreferrer"&gt;Amazon EFS Pricing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/new-amazon-efs-elastic-throughput/" rel="noopener noreferrer"&gt;Announcing Amazon EFS Elastic Throughput&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>devops</category>
      <category>performance</category>
    </item>
    <item>
      <title>Measuring ROI of Forward-Looking Design Decisions with ADR</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Fri, 30 Jan 2026 06:08:08 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/measuring-roi-of-forward-looking-design-decisions-with-adr-jjd</link>
      <guid>https://forem.com/tielec-takashi/measuring-roi-of-forward-looking-design-decisions-with-adr-jjd</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;Ever built a feature "just in case" only to never use it? Or skipped implementing something flexible, only to refactor it weeks later?&lt;/p&gt;

&lt;p&gt;We all face this dilemma: &lt;strong&gt;YAGNI (You Aren't Gonna Need It) vs. forward-looking design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The problem? We make these decisions based on gut feeling, not data.&lt;/p&gt;

&lt;p&gt;This post shows how to make forward-looking design measurable using ADR (Architecture Decision Records).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;What we really want to know is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How often do our predictions about future requirements actually come true?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three dimensions to evaluate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prediction&lt;/td&gt;
&lt;td&gt;"We'll need X feature in the future"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;Code/design we built in advance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reality&lt;/td&gt;
&lt;td&gt;Did that requirement actually come?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We want to measure &lt;strong&gt;prediction accuracy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  YAGNI vs. Forward-Looking Design
&lt;/h2&gt;

&lt;p&gt;YAGNI means "You Aren't Gonna Need It &lt;strong&gt;now&lt;/strong&gt;", not "You'll Never Need It".&lt;/p&gt;

&lt;p&gt;The problem is paying heavy upfront costs for low-accuracy predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  When forward-looking design makes sense
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extension points (interfaces, hooks, plugin architecture)&lt;/li&gt;
&lt;li&gt;Database schema separation&lt;/li&gt;
&lt;li&gt;Schemas that make fields easy to add later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Low cost, low damage if wrong&lt;/p&gt;

&lt;h3&gt;
  
  
  When YAGNI is the answer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;UI implementation&lt;/li&gt;
&lt;li&gt;Complex business logic&lt;/li&gt;
&lt;li&gt;External integrations&lt;/li&gt;
&lt;li&gt;Permission/billing logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Needs a complete rewrite if wrong&lt;/p&gt;

&lt;h2&gt;
  
  
  The Estimation Problem
&lt;/h2&gt;

&lt;p&gt;You might think we could simply calculate the value like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value = (Probability × Cost_Saved_Later) − Cost_Now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But here's the catch: &lt;strong&gt;we can't estimate these accurately&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nobody knows the real probability&lt;/li&gt;
&lt;li&gt;Future costs are unknown until we do the work&lt;/li&gt;
&lt;li&gt;Even upfront costs grow during implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If we could estimate accurately, we wouldn't need this system.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning from inaccuracy
&lt;/h3&gt;

&lt;p&gt;We don't need perfect estimates.&lt;/p&gt;

&lt;p&gt;What we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn which areas have accurate predictions&lt;/li&gt;
&lt;li&gt;Learn which areas consistently miss&lt;/li&gt;
&lt;li&gt;Build organizational knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Tendency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database design&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;Forward-looking OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI specs&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Stick to YAGNI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External APIs&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Definitely YAGNI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Numbers are &lt;strong&gt;learning material&lt;/strong&gt;, not absolute truth.&lt;/p&gt;
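
&lt;p&gt;A table like this could be derived from recorded forecasts. The following is only a sketch: the &lt;code&gt;area&lt;/code&gt; tag and the function name &lt;code&gt;hit_rate_by_area&lt;/code&gt; are assumptions made for illustration, and only forecasts already resolved as hit or miss are counted.&lt;/p&gt;

```python
from collections import defaultdict


def hit_rate_by_area(forecasts):
    """Return {area: hits / resolved} for forecasts with status hit or miss."""
    hits = defaultdict(int)
    resolved = defaultdict(int)
    for forecast in forecasts:
        if forecast["status"] not in ("hit", "miss"):
            continue  # pending forecasts say nothing about accuracy yet
        resolved[forecast["area"]] += 1
        if forecast["status"] == "hit":
            hits[forecast["area"]] += 1
    return {area: hits[area] / resolved[area] for area in resolved}


# Illustrative records only; "area" is a hypothetical tag.
sample = [
    {"area": "database", "status": "hit"},
    {"area": "database", "status": "miss"},
    {"area": "ui", "status": "miss"},
    {"area": "ui", "status": "pending"},
]
print(hit_rate_by_area(sample))
```

&lt;p&gt;With only a handful of records per area the rates swing wildly, which is one more reason to read them as trends rather than verdicts.&lt;/p&gt;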

&lt;h2&gt;
  
  
  Recording Predictions in ADR
&lt;/h2&gt;

&lt;p&gt;To make forward-looking design measurable, we need to record our decisions.&lt;/p&gt;

&lt;p&gt;We use ADR (Architecture Decision Records).&lt;/p&gt;

&lt;p&gt;Example ADR with forecast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Context&lt;/span&gt;
Customer-specific permission management might be needed in the future

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
Keep simple role model for now

&lt;span class="gu"&gt;## Forecast&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Probability estimate: 30% (based on sales feedback)
&lt;span class="p"&gt;-&lt;/span&gt; Cost if later: 20 person-days (rough estimate)
&lt;span class="p"&gt;-&lt;/span&gt; Cost if now: 4 person-days (rough estimate)
&lt;span class="p"&gt;-&lt;/span&gt; Decision: Don't build it now (negative expected value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Estimates can be rough. What matters is &lt;strong&gt;recording the rationale&lt;/strong&gt;.&lt;/p&gt;
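
&lt;p&gt;As a rough check, the forecast numbers can be plugged into the value formula from earlier. This is only a sketch: the function name &lt;code&gt;expected_value_pd&lt;/code&gt; is made up for illustration, and it assumes &lt;code&gt;Cost_Saved_Later&lt;/code&gt; is simply the "cost if later" estimate.&lt;/p&gt;

```python
def expected_value_pd(probability, cost_later_pd, cost_now_pd):
    """Value = (Probability x Cost_Saved_Later) - Cost_Now, in person-days."""
    clamped = min(max(probability, 0.0), 1.0)
    if clamped != probability:
        raise ValueError("probability must be between 0 and 1")
    return probability * cost_later_pd - cost_now_pd


# Numbers from the example ADR: 30% chance, 20 pd later, 4 pd now.
value = expected_value_pd(0.3, 20, 4)
print(round(value, 2))
```

&lt;p&gt;For these inputs the result is about 2 person-days, a margin small enough that the recorded rationale matters more than the number itself.&lt;/p&gt;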

&lt;h2&gt;
  
  
  Making It Measurable
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;In our environment, a pipeline like this should work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Repositories (ADR + metadata.json)
  ↓
Jenkins (cross-repo scanning, diff collection)
  ↓
S3 (aggregated JSON)
  ↓
Microsoft Fabric (analysis &amp;amp; visualization)
  ↓
Dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We already have Jenkins scanning repos for code complexity. We can extend this for ADR metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Design
&lt;/h3&gt;

&lt;p&gt;Keep ADR content free-form. Standardize only the metadata for aggregation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docs/adr/ADR-023.meta.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ADR-023"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"probability_estimate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost_now_estimate_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost_late_estimate_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-11-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requirement_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual_cost_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimum fields needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;adr_id&lt;/code&gt;: unique identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type&lt;/code&gt;: forecast&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;probability_estimate&lt;/code&gt;: 0-1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_now_estimate_pd&lt;/code&gt;: upfront cost (person-days)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_late_estimate_pd&lt;/code&gt;: later cost (person-days)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;status&lt;/code&gt;: pending / hit / miss&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;outcome&lt;/code&gt;: actual results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat estimates as "estimates", not gospel truth.&lt;/p&gt;
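
&lt;p&gt;Since only the metadata is standardized, a small check in the collection job could catch malformed files early. The sketch below assumes the field list above is the whole contract; &lt;code&gt;validate_adr_metadata&lt;/code&gt; and its error messages are hypothetical, not part of any existing tool.&lt;/p&gt;

```python
import json

REQUIRED_FIELDS = (
    "adr_id", "type", "probability_estimate",
    "cost_now_estimate_pd", "cost_late_estimate_pd", "status", "outcome",
)
VALID_STATUSES = ("pending", "hit", "miss")


def validate_adr_metadata(text):
    """Parse one .meta.json document and return a list of problems found."""
    meta = json.loads(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in meta]
    if meta.get("status") not in VALID_STATUSES:
        problems.append("status must be pending, hit, or miss")
    probability = meta.get("probability_estimate")
    if not isinstance(probability, (int, float)) or probability != min(max(probability, 0), 1):
        problems.append("probability_estimate must be a number between 0 and 1")
    return problems


# The ADR-023 example from above, rebuilt as a JSON string.
example = json.dumps({
    "adr_id": "ADR-023", "type": "forecast", "probability_estimate": 0.3,
    "cost_now_estimate_pd": 4, "cost_late_estimate_pd": 20,
    "status": "pending", "outcome": {"requirement_date": None, "actual_cost_pd": None},
})
print(validate_adr_metadata(example))
```

&lt;p&gt;Returning a list of problems instead of raising on the first one lets the job report every issue in a file at once.&lt;/p&gt;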

&lt;h3&gt;
  
  
  Diff-Based Collection
&lt;/h3&gt;

&lt;p&gt;Full scans get expensive. Collect only diffs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Record last commit SHA&lt;/span&gt;
git diff &lt;span class="nt"&gt;--name-only&lt;/span&gt; &amp;lt;prev&amp;gt;..&amp;lt;now&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'*.meta.json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps collection cheap as the number of repos grows.&lt;/p&gt;
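
&lt;p&gt;If the collection step is scripted rather than piped through a shell filter, the same selection might look like the sketch below. It assumes the job already has the text output of &lt;code&gt;git diff --name-only&lt;/code&gt;; the function name &lt;code&gt;changed_metadata_files&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def changed_metadata_files(diff_output):
    """Keep only ADR metadata paths from `git diff --name-only` output."""
    paths = [line.strip() for line in diff_output.splitlines()]
    return [path for path in paths if path.endswith(".meta.json")]


# Example output for a commit range that touched three files.
sample_output = "src/app.py\ndocs/adr/ADR-023.meta.json\nREADME.md\n"
print(changed_metadata_files(sample_output))
```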

&lt;h3&gt;
  
  
  Comparing Predictions to Reality
&lt;/h3&gt;

&lt;p&gt;After 6-12 months, review:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Est. Prob&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;th&gt;Est. Cost&lt;/th&gt;
&lt;th&gt;Actual Cost&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F-001&lt;/td&gt;
&lt;td&gt;CSV bulk import&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;Came after 6mo&lt;/td&gt;
&lt;td&gt;15pd later&lt;/td&gt;
&lt;td&gt;1pd&lt;/td&gt;
&lt;td&gt;Hit &amp;amp; overestimated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F-002&lt;/td&gt;
&lt;td&gt;i18n&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;Didn't come&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F-003&lt;/td&gt;
&lt;td&gt;Advanced perms&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Came after 3mo&lt;/td&gt;
&lt;td&gt;20pd later&lt;/td&gt;
&lt;td&gt;25pd&lt;/td&gt;
&lt;td&gt;Hit &amp;amp; underestimated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Focus on &lt;strong&gt;trends and deviation reasons&lt;/strong&gt;, not absolute accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregation &amp;amp; Visualization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two Types of Output
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Raw facts&lt;/strong&gt; (NDJSON, append-only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ADR-023"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"hit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ADR-024"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Snapshot&lt;/strong&gt; (daily/weekly metrics):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-27"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"success_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_forecasts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"misses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"avg_cost_deviation_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-3.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
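
&lt;p&gt;The snapshot could be derived from the raw facts roughly as follows. This is a sketch under assumptions: only forecasts resolved as hit or miss are counted, the function name &lt;code&gt;build_snapshot&lt;/code&gt; is hypothetical, and &lt;code&gt;avg_cost_deviation_pd&lt;/code&gt; is left out because its exact definition isn't fixed yet.&lt;/p&gt;

```python
import json


def build_snapshot(ndjson_text, date):
    """Aggregate resolved forecasts (hit or miss) into one snapshot dict."""
    records = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    hits = sum(1 for record in records if record["status"] == "hit")
    misses = sum(1 for record in records if record["status"] == "miss")
    total = hits + misses
    return {
        "date": date,
        "metrics": {
            "success_rate": hits / total if total else None,
            "total_forecasts": total,
            "hits": hits,
            "misses": misses,
        },
    }


# Two raw facts in the same shape as the NDJSON example.
facts = "\n".join([
    '{"adr_id": "ADR-023", "type": "forecast", "status": "hit"}',
    '{"adr_id": "ADR-024", "type": "forecast", "status": "miss"}',
])
print(build_snapshot(facts, "2025-01-27")["metrics"]["success_rate"])
```

&lt;p&gt;Recomputing the snapshot from the append-only facts each time means a bug in the aggregation can be fixed later without losing any history.&lt;/p&gt;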



&lt;h3&gt;
  
  
  What Leadership Wants to See
&lt;/h3&gt;

&lt;p&gt;CTOs and executives probably care about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forecast success rate&lt;/strong&gt; (prediction accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost savings trend&lt;/strong&gt; (rough ROI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve&lt;/strong&gt; (are we getting better?)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Treat the ADR as a transaction log and handle visualization separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Do We Know It Was Worth It?
&lt;/h2&gt;

&lt;p&gt;Three evaluation moments:&lt;/p&gt;

&lt;h3&gt;
  
  
  ① When the requirement actually arrives
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Did that requirement actually happen?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;No → prediction missed&lt;/li&gt;
&lt;li&gt;Yes → move to next evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not "worth it" yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  ② When we measure implementation time (most important)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How fast/cheap could we implement it?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Additional work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;With forward-looking design&lt;/td&gt;
&lt;td&gt;1 person-day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without (estimated)&lt;/td&gt;
&lt;td&gt;10 person-days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is when we can say "forward-looking design paid off".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User adoption doesn't matter yet.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ③ When users get value
&lt;/h3&gt;

&lt;p&gt;This measures business value, but it also depends on marketing, sales, timing, and competition.&lt;/p&gt;

&lt;p&gt;For technical decisions, focus on &lt;strong&gt;② implementation cost difference&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Excel?
&lt;/h2&gt;

&lt;p&gt;Excel management fails because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates scatter across time&lt;/li&gt;
&lt;li&gt;Unclear ownership&lt;/li&gt;
&lt;li&gt;Diverges from decision log&lt;/li&gt;
&lt;li&gt;Nobody looks at it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excel becomes "create once, forget forever".&lt;/p&gt;

&lt;p&gt;Treat the ADR as the input device and visualization as a separate layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This system's goal isn't &lt;strong&gt;perfect estimates&lt;/strong&gt; or &lt;strong&gt;perfect predictions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goals are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Record decisions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn from results&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve prediction accuracy over time&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wrong estimates aren't failures. &lt;strong&gt;Making the same wrong decision repeatedly without learning&lt;/strong&gt; is the failure.&lt;/p&gt;

&lt;p&gt;Treat numbers as &lt;strong&gt;learning material&lt;/strong&gt;, not absolute truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;I'm planning to propose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finalize metadata.json schema&lt;/li&gt;
&lt;li&gt;PoC with 2-3 repos&lt;/li&gt;
&lt;li&gt;Build Jenkins → S3 → Fabric pipeline&lt;/li&gt;
&lt;li&gt;Start with hit rate &amp;amp; cost deviation&lt;/li&gt;
&lt;li&gt;Run for 3 months, evaluate learning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm not sure this will work, but it's worth trying to turn forward-looking design from a "personal skill" into an "organizational capability".&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>design</category>
      <category>documentation</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Measuring ROI of Forward-Looking Design Decisions with ADR</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Tue, 27 Jan 2026 13:05:05 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/measuring-roi-of-forward-looking-design-decisions-with-adr-5b36</link>
      <guid>https://forem.com/tielec-takashi/measuring-roi-of-forward-looking-design-decisions-with-adr-5b36</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;Ever built a feature "just in case" only to never use it? Or skipped implementing something flexible, only to refactor it weeks later?&lt;/p&gt;

&lt;p&gt;We all face this dilemma: &lt;strong&gt;YAGNI (You Aren't Gonna Need It) vs. forward-looking design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The problem? We make these decisions based on gut feeling, not data.&lt;/p&gt;

&lt;p&gt;This post shows how to make forward-looking design measurable using ADR (Architecture Decision Records).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;What we really want to know is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How often do our predictions about future requirements actually come true?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Three dimensions to evaluate:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Detail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prediction&lt;/td&gt;
&lt;td&gt;"We'll need X feature in the future"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implementation&lt;/td&gt;
&lt;td&gt;Code/design we built in advance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reality&lt;/td&gt;
&lt;td&gt;Did that requirement actually come?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We want to measure &lt;strong&gt;prediction accuracy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  YAGNI vs. Forward-Looking Design
&lt;/h2&gt;

&lt;p&gt;YAGNI means "You Aren't Gonna Need It &lt;strong&gt;now&lt;/strong&gt;", not "You'll Never Need It".&lt;/p&gt;

&lt;p&gt;The problem is paying heavy upfront costs for low-accuracy predictions.&lt;/p&gt;

&lt;h3&gt;
  
  
  When forward-looking design makes sense
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Extension points (interfaces, hooks, plugin architecture)&lt;/li&gt;
&lt;li&gt;Database schema separation&lt;/li&gt;
&lt;li&gt;Schemas that make fields easy to add later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Low cost, low damage if wrong&lt;/p&gt;

&lt;h3&gt;
  
  
  When YAGNI is the answer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;UI implementation&lt;/li&gt;
&lt;li&gt;Complex business logic&lt;/li&gt;
&lt;li&gt;External integrations&lt;/li&gt;
&lt;li&gt;Permission/billing logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ Needs a complete rewrite if wrong&lt;/p&gt;

&lt;h2&gt;
  
  
  The Estimation Problem
&lt;/h2&gt;

&lt;p&gt;You might think we could simply calculate the value like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value = (Probability × Cost_Saved_Later) − Cost_Now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;But here's the catch: &lt;strong&gt;we can't estimate these accurately&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nobody knows the real probability&lt;/li&gt;
&lt;li&gt;Future costs are unknown until we do the work&lt;/li&gt;
&lt;li&gt;Even upfront costs grow during implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If we could estimate accurately, we wouldn't need this system.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Learning from inaccuracy
&lt;/h3&gt;

&lt;p&gt;We don't need perfect estimates.&lt;/p&gt;

&lt;p&gt;What we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn which areas have accurate predictions&lt;/li&gt;
&lt;li&gt;Learn which areas consistently miss&lt;/li&gt;
&lt;li&gt;Build organizational knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Area&lt;/th&gt;
&lt;th&gt;Hit Rate&lt;/th&gt;
&lt;th&gt;Tendency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Database design&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;Forward-looking OK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI specs&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Stick to YAGNI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;External APIs&lt;/td&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;Definitely YAGNI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Numbers are &lt;strong&gt;learning material&lt;/strong&gt;, not absolute truth.&lt;/p&gt;
&lt;h2&gt;
  
  
  Recording Predictions in ADR
&lt;/h2&gt;

&lt;p&gt;To make forward-looking design measurable, we need to record our decisions.&lt;/p&gt;

&lt;p&gt;We use ADR (Architecture Decision Records). I've written about ADRs before in this post:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/tielec-takashi/why-did-we-choose-this-again-how-adrs-solved-our-documentation-problem-3n8a"&gt;"Why did we choose this again?" - How ADRs Solved Our Documentation Problem&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example ADR with forecast:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Context&lt;/span&gt;
Customer-specific permission management might be needed in the future

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
Keep simple role model for now

&lt;span class="gu"&gt;## Forecast&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Probability estimate: 30% (based on sales feedback)
&lt;span class="p"&gt;-&lt;/span&gt; Cost if later: 20 person-days (rough estimate)
&lt;span class="p"&gt;-&lt;/span&gt; Cost if now: 4 person-days (rough estimate)
&lt;span class="p"&gt;-&lt;/span&gt; Decision: Don't build it now (negative expected value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Estimates can be rough. What matters is &lt;strong&gt;recording the rationale&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Measurable
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;In our environment, a pipeline like this should work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Repositories (ADR + metadata.json)
  ↓
Jenkins (cross-repo scanning, diff collection)
  ↓
S3 (aggregated JSON)
  ↓
Microsoft Fabric (analysis &amp;amp; visualization)
  ↓
Dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We already have Jenkins scanning repos for code complexity. We can extend this for ADR metadata.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Design
&lt;/h3&gt;

&lt;p&gt;Keep ADR content free-form. Standardize only the metadata for aggregation.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docs/adr/ADR-023.meta.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ADR-023"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"probability_estimate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost_now_estimate_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cost_late_estimate_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"decision_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-11-01"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"outcome"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"requirement_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"actual_cost_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimum fields needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;adr_id&lt;/code&gt;: unique identifier&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;type&lt;/code&gt;: forecast&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;probability_estimate&lt;/code&gt;: 0-1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_now_estimate_pd&lt;/code&gt;: upfront cost (person-days)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cost_late_estimate_pd&lt;/code&gt;: later cost (person-days)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;status&lt;/code&gt;: pending / hit / miss&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;outcome&lt;/code&gt;: actual results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Treat estimates as "estimates", not gospel truth.&lt;/p&gt;
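
&lt;p&gt;One way to read the three numeric fields together is a simple expected-value comparison: weigh the later cost by its probability and subtract the upfront cost. This is an assumption about how the fields could be used, not something the schema prescribes, and &lt;code&gt;expected_saving_pd&lt;/code&gt; is a hypothetical helper name. A minimal sketch using the sample values above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical helper: probability-weighted later cost minus upfront cost, in person-days.
# A positive result suggests building now is cheaper on average; negative suggests waiting.
expected_saving_pd() {
    awk -v p="$1" -v now="$2" -v late="$3" 'BEGIN { printf "%.1f\n", p * late - now }'
}

# probability_estimate=0.3, cost_now_estimate_pd=4, cost_late_estimate_pd=20
expected_saving_pd 0.3 4 20   # prints 2.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The result is only as good as the inputs, so it is best read as a rough signal to record alongside the decision.&lt;/p&gt;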

&lt;h3&gt;
  
  
  Diff-Based Collection
&lt;/h3&gt;

&lt;p&gt;Full scans get expensive. Collect only diffs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Record last commit SHA&lt;/span&gt;
git diff &lt;span class="nt"&gt;--name-only&lt;/span&gt; &amp;lt;prev&amp;gt;..&amp;lt;now&amp;gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'*.meta.json'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



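&lt;p&gt;Because only changed files are read, collection cost tracks the size of the change rather than the size of the repo, so it scales as repos grow.&lt;/p&gt;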
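
&lt;p&gt;The "last commit SHA" has to be stored somewhere between runs. A minimal sketch, assuming it is kept as a Git ref inside the repository itself; the ref name &lt;code&gt;refs/adr-collector/last-collected&lt;/code&gt; and both function names are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical ref used to remember the last collected commit.
STATE_REF="refs/adr-collector/last-collected"

# Print metadata files changed between the last collected commit and HEAD.
changed_meta_files() {
    # On the first run the ref does not exist yet, so fall back to the
    # empty tree and treat every file as new.
    prev=$(git rev-parse --verify --quiet "$STATE_REF" || git hash-object -t tree /dev/null)
    # grep exits non-zero when nothing matches; "|| true" keeps that from being an error.
    git diff --name-only "$prev" HEAD | grep '\.meta\.json$' || true
}

# Advance the marker; call this only after collection succeeded.
mark_collected() {
    git update-ref "$STATE_REF" HEAD
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;changed_meta_files&lt;/code&gt; from inside the repository, process its output, then call &lt;code&gt;mark_collected&lt;/code&gt;. If a run fails partway, the marker stays put and the next run picks up the same diff again.&lt;/p&gt;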

&lt;h3&gt;
  
  
  Comparing Predictions to Reality
&lt;/h3&gt;

&lt;p&gt;After 6-12 months, review:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ID&lt;/th&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Est. Prob&lt;/th&gt;
&lt;th&gt;Actual&lt;/th&gt;
&lt;th&gt;Est. Cost&lt;/th&gt;
&lt;th&gt;Actual Cost&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;F-001&lt;/td&gt;
&lt;td&gt;CSV bulk import&lt;/td&gt;
&lt;td&gt;30%&lt;/td&gt;
&lt;td&gt;Came after 6mo&lt;/td&gt;
&lt;td&gt;15pd later&lt;/td&gt;
&lt;td&gt;1pd&lt;/td&gt;
&lt;td&gt;Hit &amp;amp; overestimated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F-002&lt;/td&gt;
&lt;td&gt;i18n&lt;/td&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;Didn't come&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;Miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F-003&lt;/td&gt;
&lt;td&gt;Advanced perms&lt;/td&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;Came after 3mo&lt;/td&gt;
&lt;td&gt;20pd later&lt;/td&gt;
&lt;td&gt;25pd&lt;/td&gt;
&lt;td&gt;Hit &amp;amp; underestimated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Focus on &lt;strong&gt;trends and deviation reasons&lt;/strong&gt;, not absolute accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aggregation &amp;amp; Visualization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Two Types of Output
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Raw facts&lt;/strong&gt; (NDJSON, append-only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ADR-023"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"hit"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"adr_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"ADR-024"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"forecast"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"miss"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Snapshot&lt;/strong&gt; (daily/weekly metrics):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2025-01-27"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"success_rate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"total_forecasts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"hits"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"misses"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"avg_cost_deviation_pd"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;-3.5&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
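
&lt;p&gt;The snapshot can be derived from the raw facts. A rough sketch, assuming one JSON object per line with a top-level &lt;code&gt;status&lt;/code&gt; field and at most one resolved (hit or miss) line per ADR; &lt;code&gt;summarize_forecasts&lt;/code&gt; is a hypothetical name, and pending entries are left out of the rate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Summarize an append-only NDJSON fact log into hit/miss counts and a success rate.
# Usage: summarize_forecasts path/to/facts.ndjson
summarize_forecasts() {
    awk '
        /"status" *: *"hit"/  { hits++ }
        /"status" *: *"miss"/ { misses++ }
        END {
            total = hits + misses
            # Avoid dividing by zero when nothing has been resolved yet.
            rate = (total == 0) ? 0 : hits / total
            printf "{\"total_forecasts\": %d, \"hits\": %d, \"misses\": %d, \"success_rate\": %.2f}\n", total, hits, misses, rate
        }
    ' "$1"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pattern matching on text is fragile compared with a real JSON parser, so treat this as a starting point for a PoC rather than a finished aggregator.&lt;/p&gt;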



&lt;h3&gt;
  
  
  What Leadership Wants to See
&lt;/h3&gt;

&lt;p&gt;CTOs and executives probably care about:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Forecast success rate&lt;/strong&gt; (prediction accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost savings trend&lt;/strong&gt; (rough ROI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Learning curve&lt;/strong&gt; (are we getting better?)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Treat the ADR as a transaction log and handle visualization separately.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Do We Know It Was Worth It?
&lt;/h2&gt;

&lt;p&gt;Three evaluation moments:&lt;/p&gt;

&lt;h3&gt;
  
  
  ① When the requirement actually arrives
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"Did that requirement actually happen?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;No → prediction missed&lt;/li&gt;
&lt;li&gt;Yes → move to next evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not "worth it" yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  ② When we measure implementation time (most important)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"How fast/cheap could we implement it?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Additional work&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;With forward-looking design&lt;/td&gt;
&lt;td&gt;1 person-day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without (estimated)&lt;/td&gt;
&lt;td&gt;10 person-days&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is when we can say "forward-looking design paid off".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User adoption doesn't matter yet.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  ③ When users get value
&lt;/h3&gt;

&lt;p&gt;This evaluates business value, but the result also depends on marketing, sales, timing, and competition.&lt;/p&gt;

&lt;p&gt;For technical decisions, focus on &lt;strong&gt;② implementation cost difference&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Not Excel?
&lt;/h2&gt;

&lt;p&gt;Excel management fails because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updates scatter across time&lt;/li&gt;
&lt;li&gt;Unclear ownership&lt;/li&gt;
&lt;li&gt;Diverges from decision log&lt;/li&gt;
&lt;li&gt;Nobody looks at it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Excel becomes "create once, forget forever".&lt;/p&gt;

&lt;p&gt;Treat the ADR as the input device and visualization as a separate layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This system's goal isn't &lt;strong&gt;perfect estimates&lt;/strong&gt; or &lt;strong&gt;perfect predictions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goals are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Record decisions&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Learn from results&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Improve prediction accuracy over time&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wrong estimates aren't failures. &lt;strong&gt;Making the same wrong decision repeatedly without learning&lt;/strong&gt; is the failure.&lt;/p&gt;

&lt;p&gt;Treat numbers as &lt;strong&gt;learning material&lt;/strong&gt;, not absolute truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;I'm planning to propose:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finalize metadata.json schema&lt;/li&gt;
&lt;li&gt;PoC with 2-3 repos&lt;/li&gt;
&lt;li&gt;Build Jenkins → S3 → Fabric pipeline&lt;/li&gt;
&lt;li&gt;Start with hit rate &amp;amp; cost deviation&lt;/li&gt;
&lt;li&gt;Run for 3 months, evaluate learning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm not sure this will work, but it's worth trying to turn forward-looking design from a "personal skill" into an "organizational capability".&lt;/p&gt;




&lt;p&gt;I write more about design decisions and engineering processes on my blog.&lt;br&gt;
If you're interested, check it out: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>documentation</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How Git Temp Files Killed Our Jenkins Performance (EFS Metadata IOPS Hell)</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Tue, 27 Jan 2026 11:29:23 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/how-git-temp-files-killed-our-jenkins-performance-efs-metadata-iops-hell-3ff4</link>
      <guid>https://forem.com/tielec-takashi/how-git-temp-files-killed-our-jenkins-performance-efs-metadata-iops-hell-3ff4</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you're running Jenkins on AWS EFS, you might hit this exact problem. Git clone operations start timing out, Jenkins UI becomes painfully slow, and you get cryptic "Bad file descriptor" errors. &lt;/p&gt;

&lt;p&gt;The culprit? Git temporary pack files accumulating over time, starving EFS of metadata IOPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Monday morning. Jenkins dashboard takes forever to load. Hit the Replay button, build starts, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fatal: write error: Bad file descriptor
fatal: fetch-pack: invalid index-pack output
ERROR: Error cloning remote repo 'origin'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transfer speed drops from 77KB/s to 51KB/s before timing out completely. 504 Gateway Timeout errors everywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial Investigation
&lt;/h2&gt;

&lt;p&gt;First thought: "Network issue?"&lt;/p&gt;

&lt;p&gt;But looking closer at the error logs, I noticed &lt;code&gt;pipeline-groovy-lib&lt;/code&gt; was failing during Shared Library checkout. That happens on the Jenkins Controller, not agents. So this is a Controller resource problem.&lt;/p&gt;

&lt;p&gt;Checked CloudWatch metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ CPU: Normal (0-5%, occasional 30% spikes)&lt;/li&gt;
&lt;li&gt;✅ Network: Nothing unusual&lt;/li&gt;
&lt;li&gt;✅ EBS disk latency: Stable at ~0.7s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Wait... this Jenkins uses EFS, not just EBS.&lt;/p&gt;

&lt;h2&gt;
  
  
  EFS Metrics Told the Real Story
&lt;/h2&gt;

&lt;p&gt;Checked EFS CloudWatch metrics and found:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Throughput utilization&lt;/strong&gt;: Hitting &lt;strong&gt;100% during 00:00-03:00&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IOPS&lt;/strong&gt;: Metadata operations dominating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt;: Growing from 14GB → 17GB&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;📊 See detailed metrics and graphs in the &lt;a href="https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue" rel="noopener noreferrer"&gt;full write-up&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;EFS was starving on metadata IOPS&lt;/strong&gt;, not storage capacity.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Metadata IOPS?
&lt;/h2&gt;

&lt;p&gt;In EFS, metadata operations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;File stat (checking size/timestamps)&lt;/li&gt;
&lt;li&gt;Directory listings&lt;/li&gt;
&lt;li&gt;File create/delete&lt;/li&gt;
&lt;li&gt;Permission changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words: &lt;strong&gt;lots of small file operations&lt;/strong&gt; consume metadata IOPS.&lt;/p&gt;

&lt;p&gt;Jenkins workloads are full of these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build logs (thousands of small files)&lt;/li&gt;
&lt;li&gt;Git repositories (.git/objects with tons of files)&lt;/li&gt;
&lt;li&gt;Shared Library clones&lt;/li&gt;
&lt;li&gt;Build fingerprints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's not about storage size. It's about &lt;strong&gt;file count&lt;/strong&gt;.&lt;/p&gt;
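
&lt;p&gt;Since file count is what matters, it helps to measure it directly. A minimal sketch that counts regular files under each top-level directory; &lt;code&gt;count_files_per_dir&lt;/code&gt; is a hypothetical name. Note that walking the tree is itself metadata-heavy, so on a throttled EFS this is better run off-peak:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print "count TAB directory" for each top-level directory, largest count first.
# Usage: count_files_per_dir /mnt/efs/jenkins
count_files_per_dir() {
    for dir in "$1"/*/; do
        # Skip the unexpanded pattern when there are no subdirectories.
        [ -d "$dir" ] || continue
        printf '%s\t%s\n' "$(find "$dir" -type f | wc -l | tr -d ' ')" "$dir"
    done | sort -rn
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Comparing this output with &lt;code&gt;du&lt;/code&gt; shows which directories are small in bytes but large in file count.&lt;/p&gt;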

&lt;h2&gt;
  
  
  Finding the Culprit
&lt;/h2&gt;

&lt;p&gt;Checked directory sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;du&lt;/span&gt; &lt;span class="nt"&gt;-sh&lt;/span&gt; /mnt/efs/jenkins/&lt;span class="k"&gt;*&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;342M    plugins
106M    war
332M    logs
373M    caches
174M    fingerprints
timeout jobs  # Suspicious timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only found ~1.3GB. &lt;strong&gt;Missing ~15.7GB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Searched for large files directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find /mnt/efs/jenkins &lt;span class="nt"&gt;-type&lt;/span&gt; f &lt;span class="nt"&gt;-size&lt;/span&gt; +100M 2&amp;gt;/dev/null &lt;span class="nt"&gt;-ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boom:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;125829120 Jan 27 01:50 .../builds/873/libs/.../root/.git/objects/pack/tmp_pack_S3GPJw
122683392 Jan 27 01:50 .../builds/872/libs/.../root/.git/objects/pack/tmp_pack_c4EAwd
(dozens more of these...)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;tmp_pack_*&lt;/code&gt; files everywhere. 100-300MB each.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root Cause
&lt;/h2&gt;

&lt;p&gt;Here's what was happening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Jenkins clones Pipeline Shared Library&lt;/li&gt;
&lt;li&gt;Git creates temporary pack files (&lt;code&gt;tmp_pack_*&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;EFS IOPS throttling causes timeout&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Temp files never get cleaned up&lt;/li&gt;
&lt;li&gt;This happens every build (nightly at 23:12)&lt;/li&gt;
&lt;li&gt;~200-300MB garbage per build × dozens of builds = &lt;strong&gt;~15GB&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Vicious cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EFS slow → Git fails → Files accumulate → EFS slower
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Immediate: Adjust EFS Throughput
&lt;/h3&gt;

&lt;p&gt;Changed from &lt;strong&gt;Bursting mode&lt;/strong&gt; to &lt;strong&gt;Provisioned throughput (300 MiB/s)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why provisioned?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable performance for metadata IOPS spikes&lt;/li&gt;
&lt;li&gt;No waiting for burst credits to recover&lt;/li&gt;
&lt;li&gt;Works during investigation (&lt;code&gt;find&lt;/code&gt;, &lt;code&gt;du&lt;/code&gt; commands)&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
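&lt;p&gt;⚠️ Note: EFS throughput mode changes have restrictions. After a change, AWS enforces a waiting period (documented as 24 hours) before you can switch modes again or decrease provisioned throughput. Plan accordingly.&lt;/p&gt;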
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cleanup Job
&lt;/h3&gt;

&lt;p&gt;Created a daily cleanup pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;
    &lt;span class="n"&gt;triggers&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;cron&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'0 4 * * *'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;stages&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Clean tmp_pack files'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'''
                    # -mmin +1440 matches files older than 24h (-mtime +1 only matches files at least 2 days old)
                    find "$JENKINS_HOME" -name "tmp_pack_*" -type f -mmin +1440 -delete
                    echo "Cleaned up tmp_pack_* files older than 1 day"
                '''&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
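
&lt;p&gt;Before letting a job delete anything, a dry run is a cheap safety check. A minimal sketch that only lists what would be removed, using the same name pattern as the cleanup job; &lt;code&gt;list_stale_tmp_packs&lt;/code&gt; is a hypothetical name, and &lt;code&gt;-mmin&lt;/code&gt; is assumed to be available (GNU and BSD &lt;code&gt;find&lt;/code&gt; support it, but it is not in POSIX):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Dry run: print stale tmp_pack_* files without deleting them.
# Usage: list_stale_tmp_packs ROOT [MIN_AGE_MINUTES]   (default age: 1440 minutes = 24h)
list_stale_tmp_packs() {
    root=$1
    min_age_minutes=${2:-1440}
    find "$root" -name 'tmp_pack_*' -type f -mmin "+$min_age_minutes" -print
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The age threshold matters because a clone that is still running also has a live &lt;code&gt;tmp_pack_*&lt;/code&gt; file; only files well past any realistic clone duration should be treated as garbage.&lt;/p&gt;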



&lt;h3&gt;
  
  
  Manual Cleanup (One-time)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl stop jenkins
find /mnt/efs/jenkins &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"tmp_pack_*"&lt;/span&gt; &lt;span class="nt"&gt;-type&lt;/span&gt; f &lt;span class="nt"&gt;-delete&lt;/span&gt;
systemctl start jenkins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Freed up ~15GB instantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Storage Size ≠ Performance
&lt;/h3&gt;

&lt;p&gt;Small files matter more than total GB on EFS. Metadata operations can bottleneck before you hit storage limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Bursting Mode Can Be Unpredictable
&lt;/h3&gt;

&lt;p&gt;When problems accumulate gradually ("silently"), burst credits can run out unexpectedly.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Always Have a Safety Net
&lt;/h3&gt;

&lt;p&gt;Changing to provisioned throughput bought us time to investigate properly without user impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Setup
&lt;/h2&gt;

&lt;p&gt;Added CloudWatch alarms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EFS throughput utilization &amp;gt; 75%&lt;/li&gt;
&lt;li&gt;Directory size monitoring (weekly reports)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Early detection prevents these surprises.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Surface symptom: Git clone errors&lt;br&gt;
→ Deeper cause: EFS metadata IOPS exhaustion&lt;br&gt;
→ Root cause: Git temp file accumulation&lt;/p&gt;

&lt;p&gt;Problem-solving is about peeling back layers. Each hypothesis, each metric check, gets you closer to the truth.&lt;/p&gt;

&lt;p&gt;If you found this useful, I write more about infrastructure debugging and SRE experiences here:&lt;br&gt;
&lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Full investigation details with metrics graphs:&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue" rel="noopener noreferrer"&gt;https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jenkins</category>
      <category>aws</category>
      <category>devops</category>
      <category>troubleshooting</category>
    </item>
    <item>
      <title>My 5-Year-Old Keyboard Died During a Winter Trip - Here's What I Learned</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Mon, 26 Jan 2026 02:48:53 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/my-5-year-old-keyboard-died-during-a-winter-trip-heres-what-i-learned-54kf</link>
      <guid>https://forem.com/tielec-takashi/my-5-year-old-keyboard-died-during-a-winter-trip-heres-what-i-learned-54kf</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you use a Bluetooth keyboard daily, it will eventually fail. Understanding why it happened and how to choose the next one can save you time and money.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5.5-year-old Filco Bluetooth keyboard died suddenly&lt;/li&gt;
&lt;li&gt;Hypothesis: Winter temperature shock (15-20°C drop) + aging&lt;/li&gt;
&lt;li&gt;Chose RealForce RC1 45g over HHKB due to ThinkPad compatibility&lt;/li&gt;
&lt;li&gt;Lesson: 5 years is the expected lifespan for most electronics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Happened
&lt;/h2&gt;

&lt;p&gt;I bought a Filco Majestouch MINILA Air (Bluetooth, Cherry MX Black) in July 2019. It worked perfectly for 5.5 years.&lt;/p&gt;

&lt;p&gt;Then, I left home for a week during a winter cold wave (heating off). When I returned, the keyboard was completely dead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No power&lt;/li&gt;
&lt;li&gt;No LED lights&lt;/li&gt;
&lt;li&gt;New batteries didn't help&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Troubleshooting Steps
&lt;/h2&gt;

&lt;p&gt;I tried everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;✓ Check power switch → ON&lt;/li&gt;
&lt;li&gt;✓ Check battery polarity → Correct&lt;/li&gt;
&lt;li&gt;✓ Clean battery contacts → No dirt&lt;/li&gt;
&lt;li&gt;✓ Try new batteries (multiple brands) → No change&lt;/li&gt;
&lt;li&gt;✓ Reset Bluetooth pairing → No response&lt;/li&gt;
&lt;li&gt;✓ Full discharge (remove batteries, press keys) → No change&lt;/li&gt;
&lt;li&gt;✓ Try pairing with smartphone → No response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Diagnosis: Power system failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Possible causes (based on research):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Power IC failure&lt;/li&gt;
&lt;li&gt;Solder cracks on power lines&lt;/li&gt;
&lt;li&gt;Bluetooth board power management failure&lt;/li&gt;
&lt;li&gt;Capacitor damage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conclusion: Not worth repairing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Out of warranty (5.5 years old)&lt;/li&gt;
&lt;li&gt;Filco's repair support ends after 5 years&lt;/li&gt;
&lt;li&gt;Repair cost would exceed new keyboard price&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Did It Fail?
&lt;/h2&gt;

&lt;p&gt;"How can a keyboard die just because I didn't use it for a week?"&lt;/p&gt;

&lt;p&gt;This question bothered me. Then I remembered: &lt;strong&gt;it was a winter cold wave, and I had the heating off&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature Changes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Before trip (heating on): Room temp ~20-25°C&lt;/li&gt;
&lt;li&gt;During trip (cold wave, no heating): Room temp ~5-10°C&lt;/li&gt;
&lt;li&gt;After return (heating on): Room temp ~20-25°C&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Temperature difference: 15-20°C&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hypothesis: Thermal Shock
&lt;/h3&gt;

&lt;p&gt;After 5.5 years of use, solder joints likely had microscopic cracks. Normal temperature changes (from keyboard heat during use) were fine, but &lt;strong&gt;rapid temperature cycling (shrink → expand) may have caused complete fractures&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Research found similar cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PC support companies report increased "keyboard not working" calls during cold waves&lt;/li&gt;
&lt;li&gt;Many cases involve keyboards 5+ years old&lt;/li&gt;
&lt;li&gt;Most are temporary (work after warming up), but mine was permanent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note: This is just a hypothesis. I have no definitive proof.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Next Keyboard
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Requirements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Wireless connection (Bluetooth or 2.4GHz)&lt;/li&gt;
&lt;li&gt;Quiet typing (I used to like clicky switches, but now prefer silence)&lt;/li&gt;
&lt;li&gt;Portable, but also used with ThinkPad&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last point turned out to be critical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Keyboards Considered
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Filco MINILA-R Convertible&lt;/strong&gt; (~$100)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cherry MX switches&lt;/li&gt;
&lt;li&gt;Compact with arrow keys&lt;/li&gt;
&lt;li&gt;Standard layout&lt;/li&gt;
&lt;li&gt;Familiar brand&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;HHKB (Happy Hacking Keyboard) Professional HYBRID Type-S&lt;/strong&gt; (~$250)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ultra-compact (60% keyboard)&lt;/li&gt;
&lt;li&gt;Lightweight (540g)&lt;/li&gt;
&lt;li&gt;Topre switches (electrostatic capacitive)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;But: no arrow keys, unique layout&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RealForce RC1&lt;/strong&gt; (~$230)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;70% keyboard&lt;/li&gt;
&lt;li&gt;Arrow keys + function keys&lt;/li&gt;
&lt;li&gt;Topre switches (electrostatic capacitive)&lt;/li&gt;
&lt;li&gt;Standard layout&lt;/li&gt;
&lt;li&gt;600g&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why I Rejected HHKB
&lt;/h3&gt;

&lt;p&gt;I was initially attracted to HHKB's compactness and portability.&lt;/p&gt;

&lt;p&gt;But the layout is very different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No arrow keys (use Fn + keys instead)&lt;/li&gt;
&lt;li&gt;Control key in unusual position&lt;/li&gt;
&lt;li&gt;No function keys (F1-F12)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; ThinkPad has a standard layout. If I use ThinkPad for work outside, switching between two different layouts would be confusing.&lt;/p&gt;

&lt;p&gt;I prioritized &lt;strong&gt;"ThinkPad compatibility"&lt;/strong&gt; over &lt;strong&gt;"portability"&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I Chose RealForce RC1
&lt;/h3&gt;

&lt;p&gt;Decision factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard layout (same as ThinkPad)&lt;/li&gt;
&lt;li&gt;70% size (compact but has arrow/function keys)&lt;/li&gt;
&lt;li&gt;Topre switches (quiet + durable)&lt;/li&gt;
&lt;li&gt;600g (desk-focused but portable)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Weight: 30g vs 45g
&lt;/h3&gt;

&lt;p&gt;RealForce RC1 comes in two versions: 30g and 45g.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;30g:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lighter feel&lt;/li&gt;
&lt;li&gt;Quieter&lt;/li&gt;
&lt;li&gt;Easier to mistype&lt;/li&gt;
&lt;li&gt;Big difference from ThinkPad&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;45g:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard weight (similar to ThinkPad)&lt;/li&gt;
&lt;li&gt;Less confusion when switching&lt;/li&gt;
&lt;li&gt;Fewer mistypes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I chose &lt;strong&gt;45g for ThinkPad compatibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final decision: RealForce RC1 45g, Japanese layout&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Price: ~$230. I'll wait for a sale. Until then, I'll use my old Filco wired keyboard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Expected Lifespan of Filco Keyboards
&lt;/h2&gt;

&lt;p&gt;I researched user experiences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage period (based on user reports):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4-5 years: Most common&lt;/li&gt;
&lt;li&gt;5-7 years: Above average&lt;/li&gt;
&lt;li&gt;10+ years: Very rare (mostly wired models)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Manufacturer's support:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Filco repair support: 5 years&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;After 5 years, spare parts are no longer guaranteed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bluetooth vs Wired:&lt;/strong&gt;&lt;br&gt;
Bluetooth models have more components (power IC, Bluetooth board), so more failure points.&lt;/p&gt;

&lt;p&gt;Estimated lifespan: 4-6 years&lt;/p&gt;

&lt;p&gt;My 5.5 years was actually pretty good for a Bluetooth model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. 5 Years is the Cutoff
&lt;/h3&gt;

&lt;p&gt;Manufacturer warranties and support typically end around 5 years, matching real-world failure rates.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Aging Electronics + Temperature Shock = Risk
&lt;/h3&gt;

&lt;p&gt;While I can't prove it, aged products may be more vulnerable to rapid temperature changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Prevention is Limited
&lt;/h3&gt;

&lt;p&gt;Realistic options during long trips:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store in temperature-stable location&lt;/li&gt;
&lt;li&gt;Use cardboard + towels for insulation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But for 5+-year-old products, there may be no preventing eventual failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Timing, Not Duration
&lt;/h3&gt;

&lt;p&gt;The root cause was &lt;strong&gt;5.5 years of aging&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Winter cold wave + week-long temperature cycling likely triggered the final failure.&lt;/p&gt;

&lt;p&gt;Even with continued use, it probably would have failed soon anyway.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Filco MINILA Air served me well for 5.5 years&lt;/li&gt;
&lt;li&gt;Failure likely due to aging + temperature shock (hypothesis)&lt;/li&gt;
&lt;li&gt;Next keyboard: RealForce RC1 45g (considering ThinkPad compatibility)&lt;/li&gt;
&lt;li&gt;Electronics typically last ~5 years; 10+ years is lucky&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;More about my decision-making process:&lt;br&gt;&lt;br&gt;
&lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>keyboards</category>
    </item>
    <item>
      <title>I Wanted Zero-Input Time Tracking. Here's What I Learned.</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Mon, 26 Jan 2026 01:49:14 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/i-wanted-zero-input-time-tracking-heres-what-i-learned-3cmd</link>
      <guid>https://forem.com/tielec-takashi/i-wanted-zero-input-time-tracking-heres-what-i-learned-3cmd</guid>
      <description>&lt;p&gt;Time tracking tools. Task trackers. There are tons of them out there.&lt;/p&gt;

&lt;p&gt;But I've never found one that truly clicked for me.&lt;/p&gt;

&lt;p&gt;Why? Because &lt;strong&gt;every tool assumes you'll input something&lt;/strong&gt;. And that input is exactly what I hate.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;If you've ever:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tried TaskChute or similar methods and gave up&lt;/li&gt;
&lt;li&gt;Wished for "automatic" time tracking&lt;/li&gt;
&lt;li&gt;Wondered why no tool just &lt;em&gt;knows&lt;/em&gt; what you're working on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...then this post is for you. I went down the rabbit hole so you don't have to.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Promise of Automatic Tracking
&lt;/h2&gt;

&lt;p&gt;Tools like &lt;a href="https://www.rescuetime.com/" rel="noopener noreferrer"&gt;RescueTime&lt;/a&gt;, &lt;a href="https://www.timely.com/" rel="noopener noreferrer"&gt;Timely&lt;/a&gt;, and &lt;a href="https://timingapp.com/" rel="noopener noreferrer"&gt;Timing&lt;/a&gt; (Mac) promise automatic tracking. No timers, no manual input.&lt;/p&gt;

&lt;p&gt;Sounds perfect, right?&lt;/p&gt;

&lt;p&gt;Well, I found 4 limitations that made me rethink everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitation 1: "Which App" ≠ "What For"
&lt;/h2&gt;

&lt;p&gt;Automatic trackers record &lt;em&gt;traces&lt;/em&gt;. Which app you used. Which site you visited.&lt;/p&gt;

&lt;p&gt;But here's the thing: I use Chrome for research, for social media, and for work docs. The tool can't tell the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "why" behind your actions? Only you know that.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitation 2: Idle Detection Doesn't Catch Everything
&lt;/h2&gt;

&lt;p&gt;RescueTime detects when you stop typing or moving your mouse. Smart.&lt;/p&gt;

&lt;p&gt;But what about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Waiting for a build to finish&lt;/li&gt;
&lt;li&gt;Reading logs&lt;/li&gt;
&lt;li&gt;Just... thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You're working, but not "doing" anything. The tool marks you as idle.&lt;/p&gt;
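
&lt;p&gt;To see why this happens, here is a minimal sketch of threshold-based idle detection in Python. This is not RescueTime's actual algorithm, just the general idea; the function name &lt;code&gt;idle_gaps&lt;/code&gt;, the 5-minute threshold, and the timestamps are all assumptions made up for illustration.&lt;/p&gt;

```python
def idle_gaps(event_times: list[float], threshold_s: float = 300) -> list[tuple[float, float]]:
    """Return (start, end) pairs where no input arrived for longer than threshold_s seconds."""
    gaps = []
    # Compare each input event with the one that follows it.
    for prev, cur in zip(event_times, event_times[1:]):
        if cur - prev > threshold_s:
            gaps.append((prev, cur))
    return gaps


# Hypothetical keyboard/mouse timestamps in seconds. The 10-minute build wait
# between t=60 and t=660 is indistinguishable from a coffee break.
print(idle_gaps([0, 30, 60, 660, 690]))  # [(60, 660)]
```

&lt;p&gt;A detector like this only sees input events, so it has no way to know whether the gap was a build, log reading, or thinking time.&lt;/p&gt;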




&lt;h2&gt;
  
  
  Limitation 3: Only the Active Window Gets Tracked
&lt;/h2&gt;

&lt;p&gt;Multiple monitors? Multiple browser windows? Too bad.&lt;/p&gt;

&lt;p&gt;Only the window you're actively interacting with gets logged. If you're reading docs on the left and coding on the right, only the coding side counts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your multitasking reality? Invisible.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitation 4: iPhone Doesn't Play Nice
&lt;/h2&gt;

&lt;p&gt;My setup: Windows + iPhone.&lt;/p&gt;

&lt;p&gt;Windows? RescueTime works great.&lt;br&gt;
iPhone? Apple doesn't allow background app monitoring. So RescueTime barely functions. You're stuck with Screen Time, which doesn't export or integrate with anything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two separate worlds. No unified view.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  So What Does Everyone Else Do?
&lt;/h2&gt;

&lt;p&gt;I wondered: surely someone has solved this?&lt;/p&gt;

&lt;p&gt;Turns out... not really.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most people don't bother tracking in detail&lt;/li&gt;
&lt;li&gt;TaskChute enthusiasts power through with willpower (rare)&lt;/li&gt;
&lt;li&gt;Managers have assistants do it for them&lt;/li&gt;
&lt;li&gt;Engineers use Git history as a proxy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Zero-input, perfect tracking? I couldn't find anyone doing it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  My Realistic Compromise
&lt;/h2&gt;

&lt;p&gt;Okay, perfect is impossible. So what can I accept?&lt;/p&gt;

&lt;p&gt;My non-negotiable: &lt;strong&gt;zero input&lt;/strong&gt;.&lt;br&gt;
My goal: understand &lt;em&gt;trends&lt;/em&gt; in how I spend time.&lt;br&gt;
My setup: Windows + iPhone.&lt;/p&gt;

&lt;p&gt;Solution: &lt;strong&gt;RescueTime (free) + check the dashboard once a week&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;What I'm giving up:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compromise&lt;/th&gt;
&lt;th&gt;How I'm Dealing With It&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;iPhone integration&lt;/td&gt;
&lt;td&gt;Just ignore it. PC is my main workspace.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What for" context&lt;/td&gt;
&lt;td&gt;"Which app" is good enough.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Perfect accuracy&lt;/td&gt;
&lt;td&gt;Trends are enough.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task-level tracking&lt;/td&gt;
&lt;td&gt;Category-level is fine.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Expected outcome: "This week, I spent X hours on productive stuff, Y hours drifting." That's it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;If you want perfect time tracking, you have to accept some manual input.&lt;/p&gt;

&lt;p&gt;If you want zero input, you have to let go of perfection.&lt;/p&gt;

&lt;p&gt;I'm trying RescueTime for a week. If it's not enough, I'll figure something else out.&lt;/p&gt;

&lt;p&gt;That's where I landed. Maybe it helps you too.&lt;/p&gt;




&lt;p&gt;I write about decisions and reflections like this on my blog.&lt;br&gt;
If you're interested: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>tools</category>
    </item>
    <item>
      <title>Windows Battery Shows 10% But Suddenly Shuts Down? Here's How to Fix It</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sun, 25 Jan 2026 07:52:31 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/windows-battery-shows-10-but-suddenly-shuts-down-heres-how-to-fix-it-3441</link>
      <guid>https://forem.com/tielec-takashi/windows-battery-shows-10-but-suddenly-shuts-down-heres-how-to-fix-it-3441</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;Ever been working on something important when your laptop suddenly shuts down, even though the battery showed 10%? Yeah, me too. Lost some unsaved work and got pretty frustrated.&lt;/p&gt;

&lt;p&gt;Turns out, it's not a bug—it's battery degradation messing with the displayed percentage. Here's how to diagnose it and prevent it from happening again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to check your actual battery health with one command&lt;/li&gt;
&lt;li&gt;Why your battery percentage lies to you&lt;/li&gt;
&lt;li&gt;How to adjust settings to avoid sudden shutdowns&lt;/li&gt;
&lt;li&gt;Best practices for extending battery life&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;My laptop showed &lt;strong&gt;10% battery remaining&lt;/strong&gt;. I was about to plug in the charger when—BAM—"Your PC will shut down in 1 minute" with no way to cancel. And it did. Hard shutdown, not even hibernate like it was supposed to.&lt;/p&gt;

&lt;p&gt;Wait, what? I had 10% left!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Check Your Battery Health
&lt;/h2&gt;

&lt;p&gt;Windows has a built-in command to generate a detailed battery report. Open Command Prompt as Administrator and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;powercfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;/batteryreport&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



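&lt;p&gt;This creates an HTML file in the current directory. In an elevated Command Prompt that is normally &lt;code&gt;C:\Windows\System32\battery-report.html&lt;/code&gt;; the command prints the exact path when it finishes.&lt;/p&gt;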

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; I initially typed &lt;code&gt;betteryreport&lt;/code&gt; and got an error. It's &lt;code&gt;batteryreport&lt;/code&gt; (with an 'a'). Don't be like me. 😅&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 2: Read the Report
&lt;/h2&gt;

&lt;p&gt;Open the HTML file in your browser. Look for these key numbers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;Example Value&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DESIGN CAPACITY&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;86,000 mWh&lt;/td&gt;
&lt;td&gt;Battery capacity when new&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FULL CHARGE CAPACITY&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;68,630 mWh&lt;/td&gt;
&lt;td&gt;Current maximum capacity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CYCLE COUNT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;287&lt;/td&gt;
&lt;td&gt;Number of charge cycles&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Calculate degradation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;68,630 ÷ 86,000 ≈ 0.798 = ~80% health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My battery had degraded by 20%. That's the culprit.&lt;/p&gt;
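
&lt;p&gt;The same calculation as a small Python sketch, in case you want to plug in your own numbers. The function name &lt;code&gt;battery_health&lt;/code&gt; is made up for illustration; the values are the example figures from the report table.&lt;/p&gt;

```python
def battery_health(full_charge_mwh: float, design_mwh: float) -> float:
    """Return full charge capacity as a fraction of design capacity."""
    if design_mwh == 0:
        raise ValueError("design capacity must be non-zero")
    return full_charge_mwh / design_mwh


health = battery_health(68_630, 86_000)
degradation = 1 - health
print(f"health: {health:.1%}, degradation: {degradation:.1%}")
# health: 79.8%, degradation: 20.2%
```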

&lt;h2&gt;
  
  
  Why This Causes Sudden Shutdowns
&lt;/h2&gt;

&lt;p&gt;Here's the thing: &lt;strong&gt;Windows shows percentages based on current capacity, not design capacity.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So when my laptop showed 10%:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Display: 10% of current capacity&lt;/li&gt;
&lt;li&gt;Reality: ~8% of original capacity (10% × 0.8)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The shutdown threshold was set at 5%. Going from 10% to 5% looks like a 5% buffer, but on a degraded battery it is only ~4% of the original design capacity. Under heavy load, that disappears in seconds.&lt;/p&gt;
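
&lt;p&gt;Here is that arithmetic as a short Python sketch. The helper name &lt;code&gt;as_design_percent&lt;/code&gt; is hypothetical, and the health value is rounded to 0.8 to match the numbers above.&lt;/p&gt;

```python
def as_design_percent(displayed_percent: float, health: float) -> float:
    """Convert a displayed percentage (of current capacity) to a percentage of design capacity."""
    return displayed_percent * health


health = 0.8     # roughly 80% battery health
shown = 10.0     # percentage displayed when the warning appeared
critical = 5.0   # critical battery level setting

buffer_shown = shown - critical
buffer_real = as_design_percent(shown, health) - as_design_percent(critical, health)
print(f"{buffer_shown:.0f}% shown, about {buffer_real:.0f}% of design capacity")
# 5% shown, about 4% of design capacity
```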

&lt;h2&gt;
  
  
  The Fix
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fix 1: Adjust Warning Levels
&lt;/h3&gt;

&lt;p&gt;Give yourself more buffer time before the forced shutdown.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Navigate to:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Control Panel 
→ Power Options 
→ Change plan settings 
→ Change advanced power settings 
→ Battery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Adjust these:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Low battery level&lt;/code&gt;: 15-20% (up from 10%)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Critical battery level&lt;/code&gt;: 7-10% (up from 5%)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Fix 2: Enable 80% Charge Limit (Lenovo)
&lt;/h3&gt;

&lt;p&gt;If you use your laptop plugged in most of the time, limit charging to 80% to reduce battery wear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Lenovo laptops:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open Lenovo Vantage app&lt;/li&gt;
&lt;li&gt;Go to Hardware Settings → Power&lt;/li&gt;
&lt;li&gt;Find "Battery Charge Threshold" or "Conservation Mode"&lt;/li&gt;
&lt;li&gt;Set maximum charge to 80%&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Other manufacturers have similar features—check your laptop's utility app.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix 3: Match Your Usage Pattern
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Mostly at a desk?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep it plugged in&lt;/li&gt;
&lt;li&gt;Set 80% charge limit&lt;/li&gt;
&lt;li&gt;Use a cooling pad if it gets hot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Mobile user?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Charge when it hits 20-30%&lt;/li&gt;
&lt;li&gt;Try to stay between 20-80%&lt;/li&gt;
&lt;li&gt;Avoid draining to 0%&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bonus: AC vs Battery Power
&lt;/h2&gt;

&lt;p&gt;Here's something that surprised me: &lt;strong&gt;AC power is actually better for your PC&lt;/strong&gt; (though not necessarily for the battery).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Battery power: voltage fluctuates as it drains&lt;/li&gt;
&lt;li&gt;AC power: stable voltage and current&lt;/li&gt;
&lt;li&gt;Your CPU/GPU prefer stable power&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So keeping it plugged in isn't bad for the &lt;em&gt;computer&lt;/em&gt;. Just set that 80% charge limit to protect the &lt;em&gt;battery&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Charge Cycles
&lt;/h2&gt;

&lt;p&gt;One charge cycle = a cumulative 100% of battery capacity used, not necessarily in a single discharge.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use 50%, charge → Use 50%, charge = 1 cycle&lt;/li&gt;
&lt;li&gt;Use 30%, charge × 3 times ≈ 1 cycle&lt;/li&gt;
&lt;/ul&gt;
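
&lt;p&gt;The two examples above can be sketched in Python like this. It is a simplification that just sums partial discharges; real battery firmware may count cycles differently, and &lt;code&gt;equivalent_cycles&lt;/code&gt; is a made-up name.&lt;/p&gt;

```python
def equivalent_cycles(discharges_percent: list[float]) -> float:
    """Sum partial discharges; every 100% of capacity used counts as one cycle."""
    return sum(discharges_percent) / 100


print(equivalent_cycles([50, 50]))      # 1.0
print(equivalent_cycles([30, 30, 30]))  # 0.9
```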

&lt;p&gt;&lt;strong&gt;When plugged in:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Battery hits 100% → charging stops&lt;/li&gt;
&lt;li&gt;Power comes directly from AC adapter&lt;/li&gt;
&lt;li&gt;Cycles barely increase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My 287 cycles meant I was actually using it unplugged quite a bit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using a Power Bank?
&lt;/h2&gt;

&lt;p&gt;Make sure it's powerful enough!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check your AC adapter:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example: OUTPUT: 20V 3.25A
→ 20V × 3.25A = 65W
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Power requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard laptops: 45-65W&lt;/li&gt;
&lt;li&gt;High-performance: 90-135W&lt;/li&gt;
&lt;li&gt;Gaming laptops: 135W+&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 60W power bank works for most regular laptops.&lt;/p&gt;
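
&lt;p&gt;If you want to check your own adapter, the wattage math looks like this in Python. Both function names are hypothetical, and the "covers" check is a deliberately strict assumption: it only asks whether the power bank matches the adapter's full rating.&lt;/p&gt;

```python
def adapter_watts(volts: float, amps: float) -> float:
    """Power in watts from the OUTPUT rating printed on the adapter."""
    return volts * amps


def power_bank_covers(bank_watts: float, volts: float, amps: float) -> bool:
    """True when the power bank's rated output is at least the adapter's."""
    return bank_watts >= adapter_watts(volts, amps)


print(adapter_watts(20, 3.25))          # 65.0
print(power_bank_covers(60, 20, 3.25))  # False
print(power_bank_covers(65, 20, 3.25))  # True
```

&lt;p&gt;A &lt;code&gt;False&lt;/code&gt; here doesn't necessarily mean the power bank is useless. A lower-rated bank may still charge the laptop, just more slowly or not at all under heavy load.&lt;/p&gt;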

&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Battery health levels (full charge capacity ÷ design capacity):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ 80%+: Healthy&lt;/li&gt;
&lt;li&gt;⚠️ 70-80%: Adjust settings&lt;/li&gt;
&lt;li&gt;🔴 Below 70%: Consider replacement&lt;/li&gt;
&lt;/ul&gt;
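
&lt;p&gt;As a rough Python sketch, the thresholds above map to a simple lookup. These cutoffs are my rule of thumb from this article, not an official standard, and &lt;code&gt;classify_health&lt;/code&gt; is a name made up for illustration.&lt;/p&gt;

```python
def classify_health(health_percent: float) -> str:
    """Map a battery health percentage to the rule-of-thumb levels in this article."""
    if health_percent >= 80:
        return "healthy"
    if health_percent >= 70:
        return "adjust settings"
    return "consider replacement"


print(classify_health(79.8))  # adjust settings
```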

&lt;p&gt;&lt;strong&gt;Action checklist:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Run &lt;code&gt;powercfg /batteryreport&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;[ ] Calculate actual battery health&lt;/li&gt;
&lt;li&gt;[ ] Raise low battery warning to 15-20%&lt;/li&gt;
&lt;li&gt;[ ] Set 80% charge limit if mostly plugged in&lt;/li&gt;
&lt;li&gt;[ ] Verify power bank wattage matches needs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;That mysterious shutdown? Not so mysterious anymore. Battery degradation is sneaky—your laptop thinks it has more juice than it actually does.&lt;/p&gt;

&lt;p&gt;One command (&lt;code&gt;powercfg /batteryreport&lt;/code&gt;) and a few setting tweaks can save you from lost work and frustration.&lt;/p&gt;

&lt;p&gt;Have you dealt with this issue? Drop a comment with your battery health percentage! 👇&lt;/p&gt;




&lt;p&gt;I share more thoughts on technical decisions and problem-solving approaches on my blog if you're interested: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>windows</category>
      <category>battery</category>
      <category>troubleshooting</category>
      <category>laptop</category>
    </item>
    <item>
      <title>"Why did we choose this again?" - How ADRs Solved Our Documentation Problem</title>
      <dc:creator>Yuto Takashi</dc:creator>
      <pubDate>Sun, 25 Jan 2026 06:34:27 +0000</pubDate>
      <link>https://forem.com/tielec-takashi/why-did-we-choose-this-again-how-adrs-solved-our-documentation-problem-3n8a</link>
      <guid>https://forem.com/tielec-takashi/why-did-we-choose-this-again-how-adrs-solved-our-documentation-problem-3n8a</guid>
      <description>&lt;h2&gt;
  
  
  Why You Should Care
&lt;/h2&gt;

&lt;p&gt;Ever had a new team member ask "why are we using this tool?" and you couldn't remember the exact reason? Or worse, spent hours digging through Slack threads and meeting notes trying to reconstruct a decision from 6 months ago?&lt;/p&gt;

&lt;p&gt;That was me until I discovered &lt;strong&gt;Architecture Decision Records (ADRs)&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Scattered Information
&lt;/h2&gt;

&lt;p&gt;I recently wrote about evaluating project management tools for our 30-person team. We had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redmine (database maintenance nightmare)&lt;/li&gt;
&lt;li&gt;Azure DevOps (too complex, nobody used half the features)&lt;/li&gt;
&lt;li&gt;Discussions scattered across Slack, Google Docs, and email&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Someone commented: "You should write an ADR for this."&lt;/p&gt;

&lt;p&gt;My reaction? &lt;strong&gt;"What's an ADR?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an ADR?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Architecture Decision Record&lt;/strong&gt; = A short document that captures &lt;strong&gt;why&lt;/strong&gt; you made a technical decision.&lt;/p&gt;

&lt;p&gt;That's it. Simple format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# 001. Evaluate Linear for Project Management&lt;/span&gt;

&lt;span class="gu"&gt;## Status&lt;/span&gt;
Proposed (evaluating)

&lt;span class="gu"&gt;## Context&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; 30-person team, tools fragmented
&lt;span class="p"&gt;-&lt;/span&gt; Redmine: maintenance issues
&lt;span class="p"&gt;-&lt;/span&gt; Azure DevOps: too complex, underutilized

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
Evaluate Linear as unified solution
&lt;span class="p"&gt;-&lt;/span&gt; Simple UI (learned from Azure DevOps)
&lt;span class="p"&gt;-&lt;/span&gt; Cost: ~$3,000/year vs $12,000 Redmine maintenance
&lt;span class="p"&gt;-&lt;/span&gt; Cross-team visibility

&lt;span class="gu"&gt;## Consequences&lt;/span&gt;
Pros: Simple, cheaper, unified
Cons: English UI, less customization
Unknown: Actual usage experience
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ADR vs Meeting Notes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Meeting notes&lt;/strong&gt; capture what happened:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2:00 PM - Alice: I think Linear is good
2:05 PM - Bob: But it's in English
2:10 PM - Carol: What about Jira?
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ADR&lt;/strong&gt; captures the decision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Decision: Linear&lt;/span&gt;

&lt;span class="gu"&gt;## Why&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Simple (avoiding Azure DevOps mistake)
&lt;span class="p"&gt;-&lt;/span&gt; Cost effective
&lt;span class="p"&gt;-&lt;/span&gt; Cross-team visibility

&lt;span class="gu"&gt;## Alternatives Rejected&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Jira: expensive, complex
&lt;span class="p"&gt;-&lt;/span&gt; GitHub Projects: weak reporting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six months later, which one helps your future self more?&lt;/p&gt;

&lt;h2&gt;
  
  
  Keeping ADRs Alive (The Real Challenge)
&lt;/h2&gt;

&lt;p&gt;The biggest risk? &lt;strong&gt;ADRs becoming dead documentation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's what works:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Keep It Simple (5-10 minutes max)
&lt;/h3&gt;

&lt;p&gt;If it takes an hour to write, you're doing it wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# 003. Use Linear&lt;/span&gt;

&lt;span class="gu"&gt;## Context&lt;/span&gt;
Tools fragmented, need unification

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
Linear - simple, affordable

&lt;span class="gu"&gt;## Consequences&lt;/span&gt;
Good: unified, cheap
Bad: English UI, less customization

Details: &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;Meeting notes&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="sx"&gt;link&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Make Them Useful
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Onboarding&lt;/strong&gt;: New members read them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Questions arise&lt;/strong&gt;: "Why this?" → "Check ADR-003"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quarterly review&lt;/strong&gt;: Update or deprecate&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Living Documents
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## Status&lt;/span&gt;
Accepted (2026-01-22)

&lt;span class="gu"&gt;## Update (2026-04-15)&lt;/span&gt;
After 3 months:
&lt;span class="p"&gt;-&lt;/span&gt; English UI was fine
&lt;span class="p"&gt;-&lt;/span&gt; BUT: needed spreadsheet alongside
  due to limited custom fields
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Don't Force It
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Culture comes from convenience, not mandates.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with one ADR. Share it. If someone says "this is helpful," you've won.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Perfection Trap
&lt;/h2&gt;

&lt;p&gt;"But what if I miss important information?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You don't need 100% completeness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your future teammates need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What you chose&lt;/li&gt;
&lt;li&gt;Why you chose it&lt;/li&gt;
&lt;li&gt;What else you considered&lt;/li&gt;
&lt;li&gt;Main trade-offs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;That's 80% of questions answered.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The other 20%? Link to meeting notes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Perfectionist approach:
- Time: 3 hours
- Completeness: 100%
- Written: 1-2x per year
→ Result: Most decisions undocumented

Pragmatic approach:
- Time: 5-10 minutes
- Completeness: 70-80%
- Written: 2-3x per month
→ Result: Most decisions documented
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;70% of the information beats 0%.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: Google's Design Docs
&lt;/h2&gt;

&lt;p&gt;Google has a similar practice called "&lt;a href="https://www.industrialempathy.com/posts/design-docs-at-google/" rel="noopener noreferrer"&gt;Design Docs&lt;/a&gt;":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A design doc is not a spec. It doesn't need to be perfect. It's a tool for discussion."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Their approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not required, but encouraged&lt;/li&gt;
&lt;li&gt;Bullet points are fine&lt;/li&gt;
&lt;li&gt;Start discussion before coding&lt;/li&gt;
&lt;li&gt;Review and improve together&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  My Takeaway
&lt;/h2&gt;

&lt;p&gt;When I wrote that project management tool article, I was already doing ADR-style thinking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problem context ✓&lt;/li&gt;
&lt;li&gt;Options considered ✓&lt;/li&gt;
&lt;li&gt;Trade-offs ✓&lt;/li&gt;
&lt;li&gt;Current decision ✓&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I just didn't know the term.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ADRs are letters to your future self and team.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They answer: "Why did we do this?" when everyone has forgotten, moved on, or left the company.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Try one ADR&lt;/strong&gt; for your next technical decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep it short&lt;/strong&gt; (5-10 minutes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store in Git&lt;/strong&gt; (&lt;code&gt;docs/adr/0001-your-decision.md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share with team&lt;/strong&gt; and see if they find it useful&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate&lt;/strong&gt; based on feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. No fancy tools needed. Just markdown files in your repo.&lt;/p&gt;
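
&lt;p&gt;If numbering files by hand gets tedious, a tiny optional helper can do it. The sketch below is one possible approach in Python, not an established ADR tool: &lt;code&gt;create_adr&lt;/code&gt;, &lt;code&gt;next_adr_path&lt;/code&gt;, and the template are assumptions for illustration, using the four-digit file naming and the Status / Context / Decision / Consequences sections shown earlier.&lt;/p&gt;

```python
import re
from pathlib import Path

# Hypothetical template with the same sections as the examples in this post.
TEMPLATE = """# {number:04d}. {title}

## Status
Proposed

## Context

## Decision

## Consequences
"""


def next_adr_path(adr_dir: Path, title: str) -> Path:
    """Build the next numbered path, e.g. docs/adr/0002-use-linear.md."""
    numbers = [
        int(match.group(1))
        for path in adr_dir.glob("*.md")
        if (match := re.match(r"(\d{4})-", path.name))
    ]
    number = max(numbers, default=0) + 1
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return adr_dir / f"{number:04d}-{slug}.md"


def create_adr(adr_dir: Path, title: str) -> Path:
    """Create the ADR directory if needed and write a new file from the template."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    path = next_adr_path(adr_dir, title)
    number = int(path.name[:4])
    path.write_text(TEMPLATE.format(number=number, title=title), encoding="utf-8")
    return path


# Example usage (writes a file, so it is left commented out):
# create_adr(Path("docs/adr"), "Use Linear")
```

&lt;p&gt;Plain copy-and-paste works just as well. The point is that the file lands in the repo next to the code it explains.&lt;/p&gt;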

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions" rel="noopener noreferrer"&gt;Original ADR post by Michael Nygard (2011)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://adr.github.io/" rel="noopener noreferrer"&gt;ADR templates and examples&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.industrialempathy.com/posts/design-docs-at-google/" rel="noopener noreferrer"&gt;Design Docs at Google&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: We actually implemented this at my company. Started with one ADR. Now it's standard practice. The trick? Don't mandate it—let the value speak for itself.&lt;/p&gt;

&lt;p&gt;What's your experience with decision documentation? How do you handle "why did we choose this?" questions? Drop a comment below! 👇&lt;/p&gt;




&lt;p&gt;I write more about decision-making and reflective practices for engineers.&lt;br&gt;
If you're interested, you can find more here: &lt;a href="https://tielec.blog/" rel="noopener noreferrer"&gt;https://tielec.blog/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>documentation</category>
      <category>productivity</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
