Forem: Yuto Takashi

Why SRE Investment Gets Undervalued (And How to Fix It)

Yuto Takashi — Sat, 14 Feb 2026 09:01:21 +0000

Why You Should Care

If you've ever heard these questions from management:

"The system is working fine, why do we need more SRE budget?"
"Can't we just respond to incidents when they happen?"
"Why not just hire more developers to speed up development?"

You're not alone. SRE and Platform Engineering investments are often undervalued, and there's a structural reason for it.

This article explores why this happens and provides three concrete frameworks to justify infrastructure investment to non-technical executives.

The Police Department Analogy

Here's an interesting parallel: SRE work is similar to police, fire departments, and disaster response teams.

Think about it:

They protect society using public funds
When nothing goes wrong, they train and prepare
But people often say "what are they even doing?" and cut their budgets
When problems occur, everyone asks "why didn't you prevent this?"

The value of prevention is invisible.

Ship a new feature → "We contributed to revenue!" (highly visible)
System runs 24/7 without issues → "That's expected" (taken for granted)
Prevented a major outage → Never happened, so nobody notices

Same with police and fire departments. Low crime rates and no fires are actually the result of their work, but it looks like "they're doing nothing."

The Negative Spiral

Here's what's scary: this structure creates a negative spiral.

Budget cuts → Staff shortage → More incidents → "SRE is incompetent" → Further budget cuts

This happens with police too: "Crime is rising, what are they doing?" → Budget cuts → Less patrol → More crime...

The cycle continues like this:

Budget gets cut
Fewer people, fewer tools
Incidents increase
"Why is SRE failing?"
Even more budget cuts

The Chicken-and-Egg Problem

There's an even trickier issue: No budget until problems occur.

Pattern 1: Reactive funding

Major outage happens → Emergency budget approved → Fix implemented → System stable → "We're good now, right?" → Budget cut

Pattern 2: Prevention isn't valued

"Our database will hit limits in 6 months"
"But it's working now, do we really need this?"
(6 months later: outage)
"Why didn't you predict this?!"

This is exactly like earthquake-proofing budgets. Before the earthquake: "waste of money." After: "why didn't we do this?"

Platform Engineering: A New Approach

In the past 2-3 years, Platform Engineering has gained attention.

It's about building "self-service infrastructure" so developers can manage infrastructure themselves.

This emerged because of the gap between DevOps ideals and reality.

DevOps ideal (2010s)

"Developers do everything: build, deploy, operate!"
"You build it, you run it"

Reality

Operational burden concentrated on developers
Learning curve too steep (Kubernetes, Terraform, monitoring tools...)
Each team picks their own tools → chaos
Eventually, load concentrates on "the few who can operate"

→ "It was unrealistic to expect all developers to be infrastructure experts"

SRE vs Platform Engineering

	SRE	Platform Engineering
Primary Goal	Protect service reliability	Improve developer productivity
For Whom?	End users (customers)	Internal developers
Key Activities	Incident response, SLO management	Self-service platform, tooling

Using the police analogy:

Traditional SRE: Patrol cars responding to crimes
Platform Engineering: Install streetlights, cameras, empower residents to protect themselves

In other words, SRE is shifting from "protector" to "enabler".

But Budget Issues Remain

Here's what I realized:

Changing the approach doesn't solve the "how much is enough?" problem.

Security camera example:

10 cameras → "Does this even work?"
100 cameras → "Do we really need that many?"
1,000 cameras → "Isn't this overkill?"

Same with Platform Engineering:

Build CI/CD pipeline → "Too much effort?"
Developer portal → "How much does this license cost?"
Monitoring for all services → "Do small services need this?"

Moreover, Platform Engineering's value is harder to prove than SRE's incident response. You're proving "losses that didn't happen" rather than "losses that did happen."

Three Frameworks to Justify Investment

So how do you explain the need for investment?

1. Engineer Ratio Approach

Rule of thumb: 1 SRE per 10-20 developers

50 developers → 3-5 SREs
100 developers → 5-10 SREs

Varies by service scale and complexity

Falling below this ratio increases the risk of entering a negative spiral.
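
To make the ratio rule concrete, here's a minimal sketch that turns a developer headcount into an SRE range. The function name and the rounding choices (lower bound rounded up, upper bound rounded down) are my assumptions, picked so the output matches the two examples above.

```python
import math


def sre_headcount_range(developers: int) -> tuple[int, int]:
    """Return (low, high) SRE headcount for the 1-per-10-to-20 rule of thumb."""
    low = math.ceil(developers / 20)  # 1 SRE per 20 developers, rounded up
    high = developers // 10           # 1 SRE per 10 developers, rounded down
    return low, high


print(sre_headcount_range(50))   # (3, 5)
print(sre_headcount_range(100))  # (5, 10)
```

Treat the result as a starting point for discussion, since the right number still varies with service scale and complexity.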

2. IT Budget Percentage Approach

Rule of thumb: 10-20% of IT budget for operations (including SRE)

Annual IT budget $1M → $100K-$200K

This is a rough industry standard.

3. Downtime Cost Calculation

Formula:

Calculate revenue lost per hour of downtime
Define acceptable annual downtime (e.g., 99.9% = 8.76 hours)
Calculate potential annual loss
Invest 10-20% of that amount

Example:

Hourly downtime loss: $50K
Annual acceptable downtime: 8.76 hours
Potential loss: $438K
Investment: $50K-$100K

If investment < expected loss, it's a rational investment.
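
The four steps above can be sketched as a small calculation. This is an illustration only: the function name and the dictionary keys are mine, and the inputs are the example figures from this section.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def downtime_budget(hourly_loss: float, availability: float) -> dict:
    """Allowed downtime, potential annual loss, and a 10-20% investment range."""
    allowed_hours = HOURS_PER_YEAR * (1 - availability)
    potential_loss = hourly_loss * allowed_hours
    return {
        "allowed_hours": round(allowed_hours, 2),
        "potential_loss": round(potential_loss),
        "invest_low": round(potential_loss * 0.10),
        "invest_high": round(potential_loss * 0.20),
    }


result = downtime_budget(hourly_loss=50_000, availability=0.999)
print(result)
```

With these inputs the strict 10-20% range comes out at roughly $44K-$88K, in the same ballpark as the rounded $50K-$100K in the example.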

Making It Clear for Non-Technical Executives

If your CTO or CEO has an engineering background, they'll understand. But when they don't, it gets tough.

You might not even get time to explain everything. And even if you do, they might not fully grasp it.

That's why we need to articulate the necessity of SRE and Platform Engineering at a level that non-engineers can understand.

It helps to have something you can hand over and say "Read this first": a primer that builds foundational understanding.

Executive Guide Available

I created "SRE & Platform Engineering Guide for Executives" with this in mind.

The guide covers:

Why digital services are "cities," not "buildings"
What SRE is (police/fire department analogy)
What Platform Engineering is (roads/utilities analogy)
Why investment is necessary (visualizing "invisible losses")
How much to invest (three frameworks)
Common misconceptions
Decision-making checklist
Next actions

All written to be understandable by non-engineers in 10 minutes.

The complete guide is available in the original article.

Use it as a resource for conversations with leadership.

Conclusion: Infrastructure is Investment, Not Cost

SRE and Platform Engineering investment is like fire insurance.

Companies pay hundreds of thousands annually for fire insurance. Nobody says "it's wasteful because no fire happened."

Similarly:

Expected loss: $3M/year (outage risk)
Investment: $500K (SRE)
If investment < expected loss, it's rational

But many companies only invest in SRE after experiencing a major outage.

Police and fire departments get budgets before major incidents happen. Because they're recognized as "social infrastructure".

SRE should be recognized as "digital infrastructure" too.

Honestly, there's no absolute answer to "how much is enough?" It becomes a matter of organizational values and priorities.

But at least we can provide "materials for thinking."

How much does your organization invest in SRE and Platform Engineering?

For more on this investment framework and the complete executive guide, check out the original article.

https://tielec.blog/en/tech/sre/why-sre-investment-undervalued

Final Chapter: Jenkins EFS Problem Solved - From 100% to 0% Throughput Usage

Yuto Takashi — Sat, 14 Feb 2026 04:35:45 +0000

TL;DR

After three articles tracking down a Jenkins EFS performance issue, enabling Shared Library cache reduced throughput usage from 100% spikes to near 0%. This article covers the final results and the complete SRE process from emergency response to permanent fix.

Previous Episodes (Quick Recap)

This is the final article in a 4-part series:

Episode 1: How Git Temp Files Killed Our Jenkins Performance
- Problem: Jenkins slowed down, Git clone failures, 504 errors
- Discovery: EFS metadata IOPS exhaustion
- Culprit: ~15GB of tmp_pack_* files accumulating
- Emergency fix: Provisioned throughput 300 MiB/s + cleanup job
Episode 2: How I Spent $69 in 26 Hours (and How to Avoid It)
- Cost: $69 in 26 hours
- Lesson: Didn't know about Elastic Throughput (1/20th the cost)
- Learning: Decision process was sound, but not the optimal solution
Episode 3: How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks
- Key finding: Symptom appeared on 1/26, but root cause started on 1/13
- Multiple factors:
  - Shared Library cache was disabled (existed before)
  - Changed to disposable agent approach (1/13) → metadata IOPS increased
  - Development accelerated in new year → more builds
- Root fix: Enabled Shared Library cache, set Refresh time to 180 minutes

The Result: Dramatic Improvement

After enabling the Shared Library cache setting ("Cache fetched versions on controller for quick retrieval") and setting the refresh time to 180 minutes, EFS throughput usage changed dramatically.

Before the fix (left side, 06:00-12:00):

Throughput usage spiking to 100% frequently
Almost constantly under high load
Far exceeding the 75% warning zone

After the fix (after 12:00):

Throughput usage dropped dramatically and stabilized
Baseline near 0%
Regular small spikes every 3 hours (max ~60%)

The 3-hour spikes are from the Shared Library cache refresh checks (Refresh time: 180 minutes). In other words, it's working exactly as expected.

Honestly, I didn't expect the effect to be this clean.

The Complete Timeline: 5 Throughput Modes

Over the course of this incident, we went through five EFS throughput configurations (the last one, a return to Bursting, is still planned):

Mode Progression:

Bursting (original)
    ↓ 1/27 emergency response
Provisioned 300 MiB/s (26 hours)
    ↓ 1/28 cost reduction
Elastic Throughput (1 day)
    ↓ 1/29 verification
Provisioned 10 MiB/s (current)
    ↓ planned
Bursting (return to original)

Cost Comparison:

Mode	Duration	Cost	Reason
Bursting	Until 1/27	Storage only	Normal operation
Provisioned 300 MiB/s	1/27 (26 hrs)	~$69	Emergency: ensure investigation could proceed
Elastic Throughput	1/28-1/29 (~1 day)	~$8	Cost reduction: pay per use
Provisioned 10 MiB/s	1/30-current	~$2.3/day	Verification: stable operation at low cost
Bursting (planned)	Soon	Storage only	Permanent fix: return to original

Why We Changed From Elastic Throughput

Elastic Throughput turned out to be "surprisingly costly":

Daily cost: ~$8
Monthly estimate: ~$240 (~¥35,000)

In contrast, Provisioned 10 MiB/s costs ~$72/month (~¥10,000). Given our current usage pattern (average throughput a few %, max ~60%), 10 MiB/s is sufficient.

However, this is just for the verification period. We plan to eventually return to Bursting mode.

Hypothesis Verification

Let's verify the hypothesis from Episode 3.

Initial Hypothesis

Did the change to disposable agent approach (1/13) cause the metadata IOPS spike?

Answer: Partially correct, but not the main culprit.

The Real Culprit

Shared Library cache was disabled (existed before the incident)

Just enabling the cache brought throughput usage down to near 0%. This means Shared Library's full fetch on every build was consuming the overwhelming majority of metadata IOPS.

Impact of Disposable Agent Approach

So was the disposable agent approach irrelevant?

Not quite. I believe the change to disposable agents was one factor that accelerated Burst Credit depletion.

The combination of three factors:

Shared Library cache disabled (pre-existing) → Controller-side metadata IOPS on every build
Disposable agent approach (from 1/13) → Agent-side metadata IOPS on every build
Development acceleration in new year → increased build frequency

These three factors together caused rapid Burst Credit depletion from 1/13, with symptoms appearing 2 weeks later on 1/26.

The SRE Process: From Detection to Resolution

Looking at the timeline of our response, you can see a clear SRE process:

The Complete SRE Workflow:

Problem Detection (1/27 morning)
- Symptoms: Jenkins slow, Git clone failures, 504 errors
- Metrics check: EFS throughput usage at 100%
- Time: ~30 minutes
Emergency Response (1/27 morning)
- Decision: Change to Provisioned throughput 300 MiB/s (executed next day)
- Goal: Ensure investigation could continue
- Tradeoff: High cost vs. continued development
Impact Mitigation (1/27 afternoon)
- Created cleanup job
- Planned tmp_pack_* deletion
- Implemented recurrence prevention
Root Cause Investigation (1/27-1/30)
- Stage 1: Found tmp_pack_* accumulation
- Stage 2: Burst Credit Balance graph analysis revealed 1/13 as starting point
- Stage 3: Discovered Shared Library cache was disabled

To be honest, I got stuck here. When I found tmp_pack_*, I thought "this is it," but it was actually just part of the symptom. Reviewing the graphs chronologically led me to the true root cause.

Permanent Fix (1/30)
- Enabled Shared Library cache
- Set Refresh time: 180 minutes
- Optimized throughput mode
Effect Measurement (1/30 onward)
- Confirmed dramatic improvement in throughput usage
- 3-hour spikes are as expected
- Continuing to monitor
Reflection & Knowledge Sharing (this article)
- Reflected on cost decisions (didn't know about Elastic Throughput)
- Understood the multiple contributing factors
- Sharing knowledge with the organization

This last part is surprisingly crucial. It's not just about solving the problem, but articulating "why it happened" and "how we decided" to apply to future situations.

Outstanding Issues and Next Steps

Short-term Tasks

1. Return to Bursting Mode

We're currently running on Provisioned 10 MiB/s, but plan to return to Bursting mode eventually.

What to check before switching back:

Has Burst Credit Balance recovered sufficiently?
Are new tmp_pack_* files being generated?
Is the cleanup job working correctly?
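
For the second check, a small script can list any leftover tmp_pack_* files. This is a minimal sketch, not the cleanup job we actually run; the function names are hypothetical. Keep in mind that walking a large tree on EFS consumes metadata IOPS itself, so run it sparingly.

```python
from pathlib import Path


def find_tmp_packs(root: str) -> list[tuple[Path, int]]:
    """Return (path, size_in_bytes) for every tmp_pack_* file under root."""
    return [
        (path, path.stat().st_size)
        for path in Path(root).rglob("tmp_pack_*")
        if path.is_file()
    ]


def summarize(found: list[tuple[Path, int]]) -> str:
    """One-line summary: how many files and how much space they occupy."""
    total_mib = sum(size for _, size in found) / 1024**2
    return f"{len(found)} tmp_pack_* files, {total_mib:.1f} MiB total"
```

For example, `print(summarize(find_tmp_packs("/mnt/efs/jenkins")))` would report on a Jenkins home mounted at that path; adjust the path to your own mount point.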

2. Strengthen Monitoring

This incident could have been caught earlier with proper monitoring.

Alerts to set up:

EFS throughput usage > 75%
Burst Credit Balance < threshold (TBD)
Abnormal storage capacity increase
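
As a starting point for the Burst Credit Balance alert, here's a sketch that builds the arguments for CloudWatch's put_metric_alarm. The alarm name, the evaluation window, and the 1 TiB threshold are placeholders I made up (our real threshold is still TBD); the metric and dimension names are the standard AWS/EFS ones.

```python
def burst_credit_alarm(file_system_id: str, threshold_bytes: float) -> dict:
    """Build keyword arguments for put_metric_alarm on EFS BurstCreditBalance."""
    return {
        "AlarmName": f"efs-{file_system_id}-burst-credit-low",  # hypothetical name
        "Namespace": "AWS/EFS",
        "MetricName": "BurstCreditBalance",  # reported in bytes
        "Dimensions": [{"Name": "FileSystemId", "Value": file_system_id}],
        "Statistic": "Minimum",
        "Period": 300,           # 5-minute datapoints
        "EvaluationPeriods": 3,  # alarm after 15 minutes below the threshold
        "Threshold": threshold_bytes,
        "ComparisonOperator": "LessThanThreshold",
    }


def create_alarm(params: dict) -> None:
    """Send the alarm to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder threshold: 1 TiB of remaining burst credits.
params = burst_credit_alarm("fs-xxxxxx", threshold_bytes=1 * 1024**4)
print(params["MetricName"], params["Threshold"])
```

Keeping the builder separate from the API call makes the alarm definition easy to review and test before anything is created in AWS.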

Long-term Considerations

Reconsidering the Disposable Agent Approach

We're continuing with the disposable agent approach, but its impact on metadata IOPS can't be ignored.

Options to consider:

Extend agent lifecycle slightly to reuse across multiple jobs
Share Git cache on EFS across all agents
Place cache in S3 and sync on startup

Finding the right balance between cost and performance is the next challenge.

Key Takeaways

Looking back at the timeline, here's what I learned.

Technical Lessons

EFS Metadata IOPS Characteristics
- Massive small file operations are deadly
- File count matters more than storage capacity
- Burst Credit management is key
Jenkins Caching Mechanisms
- Importance of Shared Library cache
- Setting the right Refresh time balance
- Hidden costs of disabled caching
Throughput Mode Selection
- Elastic Throughput isn't a silver bullet
- Optimization based on usage patterns
- Importance of cost estimation

Process Lessons

But more importantly, it's about "how we decided."

Emergency Decision Making:

Make decisions without perfect information
Prioritize avoiding worst-case scenarios
Clarify tradeoffs

Investigation Approach:

Look at graphs chronologically, not just symptoms
Form hypotheses, test them, move to next hypothesis if wrong
Acknowledge when you're stuck

Accountability:

Costs can be explained after the fact
Articulate the decision process
Share both successes and failures

I regret not knowing about Elastic Throughput, but I don't regret the decision to "ensure investigation could continue."

The $69 tuition might have been expensive, but I think I got more than that in learning.

This is the final article in the series:

Episode 1: How Git Temp Files Killed Our Jenkins Performance
Episode 2: How I Spent $69 in 26 Hours
Episode 3: How Jenkins Slowly Drained Our EFS Burst Credits

I write more about SRE decision-making processes and the thinking behind technical choices on my blog.
Check it out: https://tielec.blog/en/tech/sre/jenkins-efs-final-report

How Jenkins Slowly Drained Our EFS Burst Credits Over 2 Weeks

Yuto Takashi — Sat, 31 Jan 2026 03:58:09 +0000

TL;DR

Our Jenkins started failing on 1/26, but the root cause began on 1/13. We discovered three compounding issues:

Shared Library cache was disabled (existing issue)
Switched to disposable agents (1/13 change)
Increased build frequency (New Year effect)

Result: 50x increase in metadata IOPS → EFS burst credits drained over 2 weeks.

Why You Should Care

If you're running Jenkins on EFS, this could happen to you. The symptoms appear suddenly, but the root cause often starts weeks earlier. Time-series analysis of metrics is crucial.

The Mystery: Symptoms vs. Root Cause

Previously, I wrote about how Jenkins became slow and Git clones started failing. We found ~15GB of Git temporary files (tmp_pack_*) accumulated on EFS, causing metadata IOPS exhaustion.

We fixed it with Elastic Throughput and cleanup jobs. Case closed, right?

Not quite.

When I checked the EFS Burst Credit Balance graph, I noticed something important:

The credit started declining around 1/13, but symptoms appeared on 1/26.

Timeline:

1/13: Credit decline starts
1/19: Rapid decline
1/26: Credit bottoms out
1/26-27: Symptoms appear

The tmp_pack_* accumulation was a symptom, not the root cause. Something changed on 1/13.

What Changed on 1/13?

Honestly, this stumped me. I had a few ideas, but nothing definitive:

1. Agent Architecture Change

Around 1/13, we changed our Jenkins agent strategy:

Before: Shared Agents

EC2 instances: c5.large, etc.
Multiple jobs share agents
Workspace reuse
git pull for incremental updates

After: Disposable Agents

EC2 instances: t3.small, etc.
One agent per job
Destroy after use
git clone for full clones every time

The goal was cost reduction. We didn't consider metadata IOPS impact.

2. Post-New Year Development Rush

Teams ramped up development after the New Year holiday, increasing overall Jenkins load.

The Math: 50x Metadata IOPS Increase

Let me calculate the impact:

Builds per day: 50 (estimated)
Files created per clone: 5,000

Shared agent approach:
  Clone once = 5,000 metadata operations

Disposable agent approach:
  50 builds × 5,000 files = 250,000 metadata operations/day

50x increase in metadata IOPS.

Add the New Year rush, and the numbers get even worse.
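
The same estimate as a few lines of Python, in case you want to plug in your own numbers. It keeps the simplification used above: a shared agent is counted as one full clone, and incremental git pull operations are ignored. The function name is mine.

```python
def daily_metadata_ops(builds_per_day: int, files_per_clone: int,
                       full_clone_every_build: bool) -> int:
    """Rough count of file-creation metadata operations per day."""
    clones = builds_per_day if full_clone_every_build else 1
    return clones * files_per_clone


shared = daily_metadata_ops(50, 5_000, full_clone_every_build=False)
disposable = daily_metadata_ops(50, 5_000, full_clone_every_build=True)
print(shared, disposable, disposable // shared)  # 5000 250000 50
```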

Understanding Git Cache in Jenkins

During the investigation, I noticed the /mnt/efs/jenkins/caches/ directory:

/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e

This is the Jenkins Git plugin's bare repository cache.

How Git Caching Works

The Jenkins Git plugin optimizes clones by:

Caching remote repos in /mnt/efs/jenkins/caches/git-{hash}/ as bare repositories
Cloning to job workspaces using git clone --reference from this cache
Hash generated from repo URL + branch combination

The problem: Disposable agents might not benefit from this cache since they're new every time.
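
To show what a reference clone looks like, here's a sketch that builds the equivalent git command. It is not the plugin's actual implementation, only an illustration of the `git clone --reference` idea; the function names, the repository URL, and the workspace path are hypothetical.

```python
import subprocess


def reference_clone_cmd(cache_dir: str, repo_url: str, workspace: str) -> list[str]:
    """Build a `git clone --reference` command that borrows objects from a local cache."""
    return ["git", "clone", "--reference", cache_dir, repo_url, workspace]


def clone_with_cache(cache_dir: str, repo_url: str, workspace: str) -> None:
    """Run the clone; raises CalledProcessError if git fails."""
    subprocess.run(reference_clone_cmd(cache_dir, repo_url, workspace), check=True)


cmd = reference_clone_cmd(
    "/mnt/efs/jenkins/caches/git-3e9b32912840757a720f39230c221f0e",
    "https://example.com/org/repo.git",  # hypothetical repository URL
    "workspace/repo",                    # hypothetical workspace path
)
print(" ".join(cmd))
```

The clone only gets cheaper if the cache directory is reachable from the machine doing the clone, which is exactly the open question for disposable agents.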

The Smoking Gun: tmp_pack_* Location

I revisited where tmp_pack_* files were located:

jobs/sample-job/jobs/sample-pipeline/builds/104/libs/
  └── 335abf.../root/.git/objects/pack/
      └── tmp_pack_WqmOyE  ← 100-300MB

These are in per-build directories:

jobs/sample-job/jobs/sample-pipeline/
└── builds/
    ├── 104/
    │   └── libs/.../tmp_pack_WqmOyE
    ├── 105/
    │   └── libs/.../tmp_pack_XYZ123
    └── 106/
        └── libs/.../tmp_pack_ABC456

Every build was re-checking out the Pipeline Shared Library, generating tmp_pack_* each time.

Question: Why is Shared Library being fetched on every build?

Root Cause: Cache Setting Was OFF

I checked Jenkins configuration and found the smoking gun.

The Shared Library setting "Cache fetched versions on controller for quick retrieval" was unchecked.

This meant:

Shared Library cache completely disabled
Full fetch from remote repository on every build
Temporary files generated in .git/objects/pack/
Massive metadata IOPS consumption

The Fix: Enable Caching

I immediately changed the settings:

Enabled "Cache fetched versions on controller for quick retrieval"
Set "Refresh time in minutes" to 180

Choosing the Refresh Time

This is actually important. Options:

60-120 min: Fast updates, moderate IOPS reduction
180 min (3 hours): Balanced - 8 updates/day
360 min (6 hours): Stable operation - 4 updates/day
1440 min (24 hours): Maximum IOPS reduction

Why I chose 180 minutes:

Update checks run ~8 times/day (9am, 12pm, 3pm, 6pm...)
Waiting up to half a day for Shared Library changes to be reflected is fine for us
Significant IOPS reduction (every build → once per 3 hours)
Can manually clear cache for urgent changes

Jenkins has a "force refresh" feature, so urgent changes aren't a problem. I documented this in our runbook so we don't forget.
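
Here's a rough way to compare refresh times. It assumes the cache is refreshed at most once per interval and uses the estimated 50 builds per day from earlier, so the percentages are only indicative. Both function names are mine.

```python
def refreshes_per_day(refresh_minutes: int) -> float:
    """How many cache refreshes fit in 24 hours."""
    return 24 * 60 / refresh_minutes


def fetch_reduction(builds_per_day: int, refresh_minutes: int) -> float:
    """Fraction of Shared Library fetches avoided versus fetching on every build."""
    refreshes = min(builds_per_day, refreshes_per_day(refresh_minutes))
    return 1 - refreshes / builds_per_day


for minutes in (60, 180, 360, 1440):
    print(minutes, refreshes_per_day(minutes), f"{fetch_reduction(50, minutes):.0%}")
```

At 180 minutes this works out to 8 refreshes a day instead of about 50 fetches, which is why the longer settings add relatively little on top.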

Measuring the Impact

Post-change monitoring plan:

Short-term (24-48 hours)

No new tmp_pack_* files generated
EFS metadata IOPS decrease

Mid-term (1 week)

Burst Credit Balance recovery trend
Stable build performance

Long-term (1 month)

Credits remain stable
No recurrence

Lessons Learned

1. Symptoms ≠ Root Cause Timeline

Symptom appearance: 1/26-1/27
Root cause: Around 1/13
Credit depletion: Gradual over 2 weeks

Time-series analysis is crucial. Fixing only the visible symptoms leads to superficial solutions.

2. Architecture Changes Have Hidden Costs

The disposable agent change was for cost optimization. We did reduce EC2 costs, but created problems elsewhere.

When changing architecture:

Evaluate performance impact beforehand
Set up monitoring before the change
Continue tracking metrics after

3. EFS Metadata IOPS Characteristics

Mass creation/deletion of small files is deadly
File count matters more than storage size
Burst mode requires credit management
Credit depletion happens gradually

Especially with .git/objects/ containing thousands of small files, behavior differs drastically from normal file I/O.

4. Compound Root Causes

This issue wasn't a single cause but three factors combining:

Shared Library cache disabled (pre-existing)
Disposable agent switch (1/13)
Increased builds (New Year)

Each alone might not have caused major issues, but together they exceeded the critical threshold.

Open Questions

While we enabled Shared Library caching, we're still using disposable agents.

Can agent-side Git cache be utilized effectively with disposable agents?

Possibilities:

Share EFS Git cache across all agents
Extend agent lifecycle slightly for reuse across jobs
Cache in S3 and sync on startup

Finding the right balance between cost and performance remains a challenge.

I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/

Python's Silent Failure: When `python -m` Does Nothing (And How to Fix It)

Yuto Takashi — Sat, 31 Jan 2026 03:34:25 +0000

Why You Should Care

Ever run python -m your.module and get... nothing? No errors, exit code 0, but your code never actually runs? I just spent an hour debugging this exact problem.

Here's what I learned.

The Problem

I was running a CLI command via Jenkins:

$ python -m monitoring_sdk.core.cli metabase-check --config config.yaml
$ echo $?
0
$ ls reports/
# Empty - nothing was created

Exit code 0 means success, right? But nothing happened. No output, no files, no errors. Just... silence.

The Root Cause

Missing __main__ entrypoint.

When you run python -m module.path, Python imports the module but doesn't automatically call your main() function. You need to explicitly tell it what to run.

The Quick Fix

Add one of these (or both):

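Option 1: Create `__main__.py` (Recommended)

# monitoring_sdk/core/__main__.py
# Runs when the package itself is invoked: python -m monitoring_sdk.core
# (python -m monitoring_sdk.core.cli still relies on the guard in Option 2)
from monitoring_sdk.core.cli import main

if __name__ == "__main__":
    main()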

Option 2: Add to your main file

# monitoring_sdk/core/cli.py
# ... your code ...

if __name__ == "__main__":
    main()

That's it. Problem solved.

What Actually Happens

When you run python -m monitoring_sdk.core.cli, Python:

Imports monitoring_sdk/core/cli.py
Executes module-level code (imports, decorators, function definitions)
Stops - it doesn't call main()
Exits with code 0

So you get:

✅ Module imported
✅ Functions defined
❌ Nothing executed
✅ Exit code 0 (looks like success!)

This is what I call a "silent failure" - the worst kind.
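
You can reproduce the silent failure in isolation. This sketch writes a throwaway module called demo_cli (a made-up name) into a temporary directory and runs it with python -m, once without the guard and once with it.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

WITHOUT_GUARD = '''
def main():
    print("main() ran")
'''

WITH_GUARD = WITHOUT_GUARD + '''
if __name__ == "__main__":
    main()
'''


def run_as_module(source: str) -> tuple[int, str]:
    """Write source to demo_cli.py in a temp dir and run `python -m demo_cli` there."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_cli.py").write_text(source)
        result = subprocess.run(
            [sys.executable, "-m", "demo_cli"],
            cwd=tmp, capture_output=True, text=True,
        )
        return result.returncode, result.stdout.strip()


silent = run_as_module(WITHOUT_GUARD)
fixed = run_as_module(WITH_GUARD)
print(silent)  # (0, '')
print(fixed)   # (0, 'main() ran')
```

Both runs exit with code 0; only the second one actually does anything.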

How I Debugged It

Step 1: Added logging everywhere

import logging

logger = logging.getLogger(__name__)

def metabase_check(config, output):
    logger.info("Starting command")  # This never appeared
    # ... code ...

Step 2: Checked what WAS running

$ DEBUG=true python -m monitoring_sdk.core.cli metabase-check
DEBUG - Package initialized  # Module imported
DEBUG - Monitors initialized # Module imported
# ... nothing else

Package initialization logs appeared (because __init__.py runs on import), but my function logs didn't. That's when I realized: the function was never called.

Step 3: Tried `--help`

$ python -m monitoring_sdk.core.cli --help
# No output at all

If even --help doesn't work, the entrypoint is definitely missing.

How Python Looks for Entrypoints

When you run python -m module.path:

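If module.path is a package, run module/path/__main__.py
Otherwise, run module/path.py as __main__ (so it needs an if __name__ == "__main__" block that calls main())

How You Run It	What You Need
`python -m package`	`package/__main__.py`
`python -m package.module`	`if __name__ == "__main__"` in the module
`python script.py`	`if __name__ == "__main__"`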

Why This Bit Me (AI-Generated Code)

Full transparency: I was using AI to generate code quickly. The AI created the CLI structure perfectly - decorators, commands, options - but forgot the entrypoint.

I assumed "AI generated it, so it's complete." I didn't even check. That was my mistake.

Lesson learned: AI is great for boilerplate, but you still need to verify the basics.

The Fix in Action

After adding the entrypoint:

$ python -m monitoring_sdk.core.cli metabase-check --config config.yaml
INFO - Starting card check
INFO - Card check completed: status=OK
INFO - Report generated: file=reports/summary.md
INFO - Command completed: status=OK

$ ls reports/
summary.md  # Finally!

Key Takeaways

python -m requires an explicit entrypoint
Exit code 0 doesn't mean your code ran
Silent failures are hard to debug - add logging early
Even with AI-generated code, verify the basics

The worst bugs are the simple ones you don't think to check.

If you're interested in more troubleshooting processes and decision-making in engineering, I write about them here: https://tielec.blog/

AWS EFS Emergency Response: How I Spent $69 in 26 Hours (And How to Avoid It)

Yuto Takashi — Fri, 30 Jan 2026 06:12:13 +0000

TL;DR

During a Jenkins EFS incident, I switched to Provisioned Throughput (300 MiB/s) for emergency response. It cost $69 for just 26 hours. If I had known about Elastic Throughput, it would have been around $3.50. Here's what I learned about EFS throughput modes and cost optimization.

The Incident

Last week, our Jenkins CI/CD pipeline came to a halt due to EFS metadata IOPS exhaustion. As an emergency measure, I changed the EFS throughput mode to Provisioned Throughput at 300 MiB/s to keep Jenkins running while investigating the root cause.

The next day, I checked AWS Cost Explorer and saw:

$69.00

For 26 hours of usage. Ouch.

Why You Should Care

If you're running EFS for production workloads, understanding throughput modes is critical. A simple configuration choice can mean the difference between $3 and $69 for the same workload.

EFS Throughput Modes: A Quick Comparison

AWS EFS offers three throughput modes:

1. Bursting Throughput (Default)

Cost: Storage cost only

Performance scales with storage size. You get baseline throughput based on your storage capacity, plus burst credits for temporary spikes.

✅ No extra cost
❌ Performance degrades when credits run out (our problem)

2. Provisioned Throughput

Cost: Storage + Throughput cost

Tokyo region: ~$7.2 per MiB/s per month

For 300 MiB/s:

Monthly: 300 × $7.2 = $2,160
26 hours: $2,160 × (26/720) ≈ $78 (actual: $69)
✅ Guaranteed performance
❌ Very expensive, billed even when idle

3. Elastic Throughput

Cost: Storage + Actual usage

Tokyo region:

Read: $0.04/GB
Write: $0.07/GB

For 26 hours with ~50GB usage:

50GB × $0.07 ≈ $3.50
✅ Pay-per-use, auto-scales
❌ Harder to predict costs
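
The two estimates above can be reproduced with a short calculation. The prices are the approximate Tokyo-region figures quoted in this post, so check current pricing before relying on them. Like the estimate above, the Elastic example treats all 50GB as writes, and the function names are mine.

```python
HOURS_PER_MONTH = 720  # 30-day month, as in the estimate above


def provisioned_cost(mibps: float, hours: float,
                     price_per_mibps_month: float = 7.2) -> float:
    """Provisioned Throughput: billed for the provisioned rate, used or not."""
    return mibps * price_per_mibps_month * hours / HOURS_PER_MONTH


def elastic_cost(read_gb: float, write_gb: float,
                 read_price: float = 0.04, write_price: float = 0.07) -> float:
    """Elastic Throughput: billed per GB actually transferred."""
    return read_gb * read_price + write_gb * write_price


print(round(provisioned_cost(300, 26), 2))               # 78.0
print(round(elastic_cost(read_gb=0, write_gb=50), 2))    # 3.5
```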

Cost Comparison

Mode	26-hour Cost	When to Use
Bursting	Storage only (~$5.6/month)	Normal operations
Provisioned	$69	Constant high throughput
Elastic	$3.50	Spike handling (best for most cases)

Difference: ~$65 (~9,500 yen)

What I Should Have Done

Instead of jumping to Provisioned Throughput, here's the better approach:

Step 1: Switch to Elastic Throughput

aws efs update-file-system \
  --file-system-id fs-xxxxxx \
  --throughput-mode elastic

This would have:

Auto-scaled during investigation
Cost only ~$3.50 for the same period
No manual capacity planning needed

Step 2: Investigate Root Cause

While Elastic Throughput handles the spike automatically, investigate and fix the underlying issue (in our case, Git temporary files accumulating).

Step 3: Set Up Monitoring

CloudWatch alarms for:

PercentIOLimit > 75%
Early warning before IOPS exhaustion

Why I Didn't Choose Elastic Throughput

Honestly? I didn't know it existed.

Elastic Throughput was announced in 2022, but I hadn't updated my knowledge. During the emergency, my mental model was:

Bursting = free but unreliable
Provisioned = expensive but guaranteed

I missed the third, better option.

Was the Decision Wrong?

Not entirely. Let's look at ROI:

Cost: $69 (10,000 yen)

Avoided Loss:

10 engineers × 3 hours waiting = 30 person-hours
At ~$50/hour = $1,500 in productivity loss
Plus deployment delays (hard to quantify)

ROI: ~20x

The decision to prioritize business continuity was correct. But knowing about Elastic Throughput would have achieved the same result for 1/20th the cost.

Lessons Learned

1. Always Research Current Options

Don't rely on old knowledge during emergencies. Take 5 minutes to check AWS documentation for the latest features.

2. Cost Estimation is Part of the Response

"Make it work first" is important, but:

List all options
Quick cost comparison
Choose based on data, not urgency

3. Document and Share

This $69 lesson becomes valuable when shared. Your team (and the community) can learn without paying the same price.

Action Items

If you're using EFS:

[ ] Check your current throughput mode
[ ] Consider Elastic Throughput for variable workloads
[ ] Set up CloudWatch alarms for PercentIOLimit
[ ] Document your throughput mode decision process

Bottom Line

Use Elastic Throughput for most production workloads.

It's the best of both worlds:

Handles spikes automatically
Pay only for what you use
No capacity planning required

Provisioned Throughput should be reserved for constant, predictable high-throughput scenarios.

Next time I face a similar situation, I'll reach for Elastic Throughput first.

I write more about technical decision-making and engineering practices on my blog.
Check it out: https://tielec.blog/


Measuring ROI of Forward-Looking Design Decisions with ADR

Yuto Takashi — Fri, 30 Jan 2026 06:08:08 +0000

Why You Should Care

Ever built a feature "just in case" only to never use it? Or skipped implementing something flexible, only to refactor it weeks later?

We all face this dilemma: YAGNI (You Aren't Gonna Need It) vs. forward-looking design.

The problem? We make these decisions based on gut feeling, not data.

This post shows how to make forward-looking design measurable using ADR (Architecture Decision Records).

The Core Problem

What we really want to know is:

How often do our predictions about future requirements actually come true?

Three dimensions to evaluate:

Aspect	Detail
Prediction	"We'll need X feature in the future"
Implementation	Code/design we built in advance
Reality	Did that requirement actually come?

We want to measure prediction accuracy.

YAGNI vs. Forward-Looking Design

YAGNI means "You Aren't Gonna Need It now", not "You'll Never Need It".

The problem is paying heavy upfront costs for low-accuracy predictions.

When forward-looking design makes sense

Extension points (interfaces, hooks, plugin architecture)
Database schema separation
Fields that are easy to add later

→ Low cost, low damage if wrong

When YAGNI is the answer

UI implementation
Complex business logic
External integrations
Permission/billing logic

→ Will need complete rewrite if wrong

The Estimation Problem

You might think: calculate value like this.

Value = (Probability × Cost_Saved_Later) − Cost_Now

But here's the catch: we can't estimate these accurately.

Nobody knows the real probability
Future costs are unknown until we do the work
Even upfront costs grow during implementation

If we could estimate accurately, we wouldn't need this system.
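
To see how fragile the formula is, here is a minimal sketch in Python. The input numbers are hypothetical (5 person-days now, 15 person-days saved later), and the function names are mine, not part of any tool. It only shows that a modest shift in the probability guess flips the sign of the result.

```python
def forecast_value(probability, cost_saved_later_pd, cost_now_pd):
    """Value = (Probability x Cost_Saved_Later) - Cost_Now, in person-days."""
    return probability * cost_saved_later_pd - cost_now_pd


def break_even_probability(cost_saved_later_pd, cost_now_pd):
    """Probability at which building now and waiting have the same expected value."""
    return cost_now_pd / cost_saved_later_pd


# Hypothetical inputs: 5 person-days now, 15 person-days saved if the requirement arrives.
for p in (0.2, 0.5):
    print(f"p={p:.0%}: value = {forecast_value(p, 15, 5):+.1f} pd")
print(f"break-even probability = {break_even_probability(15, 5):.0%}")
```

If the gap between your guess and the break-even probability is smaller than your honest uncertainty, the formula alone can't make the call.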

Learning from inaccuracy

We don't need perfect estimates.

What we need:

Learn which areas have accurate predictions
Learn which areas consistently miss
Build organizational knowledge

Example:

Area	Hit Rate	Tendency
Database design	70%	Forward-looking OK
UI specs	20%	Stick to YAGNI
External APIs	10%	Definitely YAGNI

Numbers are learning material, not absolute truth.
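
A table like this could come from a very small aggregation. The sketch below assumes each reviewed forecast is reduced to an (area, came true?) pair; the records are made up for illustration and don't reproduce the percentages above.

```python
from collections import defaultdict

# Hypothetical review records: (area, did the predicted requirement arrive?)
records = [
    ("Database design", True),
    ("Database design", True),
    ("Database design", False),
    ("UI specs", False),
    ("UI specs", True),
    ("UI specs", False),
    ("UI specs", False),
]


def hit_rate_by_area(rows):
    """Return {area: hits / total} for resolved forecasts."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for area, came_true in rows:
        totals[area] += 1
        hits[area] += int(came_true)
    return {area: hits[area] / totals[area] for area in totals}


for area, rate in hit_rate_by_area(records).items():
    print(f"{area}: {rate:.0%}")
```

With only a handful of forecasts per area the rates will swing a lot, which is one more reason to read them as tendencies.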

Recording Predictions in ADR

To make forward-looking design measurable, we need to record our decisions.

We use ADR (Architecture Decision Records).

Example ADR with forecast:

## Context
Customer-specific permission management might be needed in the future

## Decision
Keep simple role model for now

## Forecast
- Probability estimate: 30% (based on sales feedback)
- Cost if later: 20 person-days (rough estimate)
- Cost if now: 4 person-days (rough estimate)
- Decision: Don't build it now (expected value is 0.3 × 20 − 4 = +2 person-days, too close to zero to trust with estimates this rough)

Estimates can be rough. What matters is recording the rationale.

Making It Measurable

Architecture

In our environment, a pipeline like this should work:

Repositories (ADR + metadata.json)
  ↓
Jenkins (cross-repo scanning, diff collection)
  ↓
S3 (aggregated JSON)
  ↓
Microsoft Fabric (analysis & visualization)
  ↓
Dashboard

We already have Jenkins scanning repos for code complexity. We can extend this for ADR metadata.

Metadata Design

Keep ADR content free-form. Standardize only the metadata for aggregation.

docs/adr/ADR-023.meta.json:

{
  "adr_id": "ADR-023",
  "type": "forecast",
  "probability_estimate": 0.3,
  "cost_now_estimate_pd": 4,
  "cost_late_estimate_pd": 20,
  "status": "pending",
  "decision_date": "2025-11-01",
  "outcome": {
    "requirement_date": null,
    "actual_cost_pd": null
  }
}

Minimum fields needed:

adr_id: unique identifier
type: forecast
probability_estimate: 0-1
cost_now_estimate_pd: upfront cost (person-days)
cost_late_estimate_pd: later cost (person-days)
status: pending / hit / miss
outcome: actual results

Treat estimates as "estimates", not gospel truth.
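
Since aggregation depends on these fields, a small check before collection might help. This is a sketch under my own assumptions: the function name and the error messages are hypothetical, and the rules only cover the minimum fields listed above.

```python
import json

REQUIRED_FIELDS = {
    "adr_id", "type", "probability_estimate",
    "cost_now_estimate_pd", "cost_late_estimate_pd", "status", "outcome",
}
ALLOWED_STATUS = {"pending", "hit", "miss"}


def validate_meta(meta):
    """Return a list of problems; an empty list means the record can be aggregated."""
    problems = [f"missing field: {name}" for name in sorted(REQUIRED_FIELDS - set(meta))]
    prob = meta.get("probability_estimate")
    if not isinstance(prob, (int, float)) or not 0 <= prob <= 1:
        problems.append("probability_estimate must be a number between 0 and 1")
    if meta.get("status") not in ALLOWED_STATUS:
        problems.append("status must be pending, hit, or miss")
    return problems


sample = json.loads("""
{
  "adr_id": "ADR-023", "type": "forecast", "probability_estimate": 0.3,
  "cost_now_estimate_pd": 4, "cost_late_estimate_pd": 20, "status": "pending",
  "decision_date": "2025-11-01",
  "outcome": {"requirement_date": null, "actual_cost_pd": null}
}
""")
print(validate_meta(sample))
```

Running this in the Jenkins scan would let a broken record fail loudly instead of silently skewing the dashboard.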

Diff-Based Collection

Full scans get expensive. Collect only diffs:

# Record the last scanned commit SHA, then list only metadata files changed since then
git diff --name-only <prev>..<now> | grep '\.meta\.json$'

Scales as repos grow.
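
The same filtering could live in the collector itself. A rough sketch, assuming the collector is a Python script: `diff_paths` shells out to `git diff --name-only`, and `changed_meta_files` keeps only paths ending in `.meta.json`. Both function names are hypothetical.

```python
import subprocess


def changed_meta_files(paths):
    """Keep only ADR metadata files from a list of changed paths."""
    return [p for p in paths if p.endswith(".meta.json")]


def diff_paths(repo_dir, prev_sha, curr_sha):
    """List files changed between two commits using `git diff --name-only`."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{prev_sha}..{curr_sha}"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Example with a hard-coded path list instead of a real repository.
sample = ["docs/adr/ADR-023.md", "docs/adr/ADR-023.meta.json", "src/meta.json.bak"]
print(changed_meta_files(sample))
```

In a real run, the output of `diff_paths(...)` would be passed to `changed_meta_files(...)` and the current SHA stored for the next scan.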

Comparing Predictions to Reality

After 6-12 months, review:

ID	Feature	Est. Prob	Actual	Est. Cost	Actual Cost	Result
F-001	CSV bulk import	30%	Came after 6mo	15pd later	1pd	Hit & overestimated
F-002	i18n	50%	Didn't come	-	-	Miss
F-003	Advanced perms	20%	Came after 3mo	20pd later	25pd	Hit & underestimated

Focus on trends and deviation reasons, not absolute accuracy.

Aggregation & Visualization

Two Types of Output

Raw facts (NDJSON, append-only):

{"adr_id":"ADR-023","type":"forecast","status":"hit",...}
{"adr_id":"ADR-024","type":"forecast","status":"miss",...}

Snapshot (daily/weekly metrics):

{
  "date": "2025-01-27",
  "metrics": {
    "success_rate": 0.30,
    "total_forecasts": 20,
    "hits": 6,
    "misses": 14,
    "avg_cost_deviation_pd": -3.5
  }
}
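
Going from the raw facts to a snapshot could look roughly like this. The records are hypothetical and flattened for brevity, and the definition of avg_cost_deviation_pd (actual cost minus estimated later cost, averaged over hits) is my assumption, not a settled part of the schema.

```python
import json

# Hypothetical raw facts, one JSON object per line (NDJSON).
ndjson = """\
{"adr_id": "ADR-023", "status": "hit", "cost_late_estimate_pd": 20, "actual_cost_pd": 25}
{"adr_id": "ADR-024", "status": "miss", "cost_late_estimate_pd": 10, "actual_cost_pd": null}
{"adr_id": "ADR-025", "status": "hit", "cost_late_estimate_pd": 15, "actual_cost_pd": 1}
{"adr_id": "ADR-026", "status": "pending", "cost_late_estimate_pd": 8, "actual_cost_pd": null}
"""


def build_snapshot(lines, date):
    """Aggregate resolved forecasts (hit or miss) into one snapshot record."""
    records = [json.loads(line) for line in lines.splitlines() if line.strip()]
    resolved = [r for r in records if r["status"] in ("hit", "miss")]
    hits = [r for r in resolved if r["status"] == "hit"]
    # Assumed definition: actual cost minus estimated later cost, averaged over hits.
    deviations = [r["actual_cost_pd"] - r["cost_late_estimate_pd"]
                  for r in hits if r["actual_cost_pd"] is not None]
    return {
        "date": date,
        "metrics": {
            "success_rate": len(hits) / len(resolved) if resolved else None,
            "total_forecasts": len(resolved),
            "hits": len(hits),
            "misses": len(resolved) - len(hits),
            "avg_cost_deviation_pd": sum(deviations) / len(deviations) if deviations else None,
        },
    }


snapshot = build_snapshot(ndjson, "2026-01-27")
print(json.dumps(snapshot, indent=2))
```

Pending forecasts are left out of the totals here so an unresolved prediction doesn't count as a miss.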

What Leadership Wants to See

CTOs and executives probably care about:

Forecast success rate (prediction accuracy)
Cost savings trend (rough ROI)
Learning curve (are we getting better?)

Treat ADR as transaction log. Handle visualization separately.

When Do We Know It Was Worth It?

Three evaluation moments:

① When requirement actually comes

"Did that requirement actually happen?"

No → prediction missed
Yes → move to next evaluation

Not "worth it" yet.

② When we measure implementation time (most important)

"How fast/cheap could we implement it?"

Case	Additional work
With forward-looking design	1 person-day
Without (estimated)	10 person-days

This is when we can say "forward-looking design paid off".

User adoption doesn't matter yet.

③ When users get value

This evaluates business value, but involves marketing, sales, timing, competition.

For technical decisions, focus on ② implementation cost difference.

Why Not Excel?

Excel management fails because:

Updates scatter across time
Unclear ownership
Diverges from decision log
Nobody looks at it

Excel becomes "create once, forget forever".

Treat ADR as input device, visualization as separate layer.

Summary

This system's goal isn't perfect estimates or perfect predictions.

Goals are:

Record decisions
Learn from results
Improve prediction accuracy over time

Wrong estimates aren't failures. Making the same wrong decision repeatedly without learning is the failure.

Treat numbers as learning material, not absolute truth.

Next Steps

Planning to propose:

Finalize metadata.json schema
PoC with 2-3 repos
Build Jenkins → S3 → Fabric pipeline
Start with hit rate & cost deviation
Run for 3 months, evaluate learning

Not sure if this will work, but worth trying to turn forward-looking design from "personal skill" into "organizational capability".

Measuring ROI of Forward-Looking Design Decisions with ADR

Yuto Takashi — Tue, 27 Jan 2026 13:05:05 +0000

Why You Should Care

Ever built a feature "just in case" only to never use it? Or skipped implementing something flexible, only to refactor it weeks later?

We all face this dilemma: YAGNI (You Aren't Gonna Need It) vs. forward-looking design.

The problem? We make these decisions based on gut feeling, not data.

This post shows how to make forward-looking design measurable using ADR (Architecture Decision Records).

The Core Problem

What we really want to know is:

How often do our predictions about future requirements actually come true?

Three dimensions to evaluate:

Aspect	Detail
Prediction	"We'll need X feature in the future"
Implementation	Code/design we built in advance
Reality	Did that requirement actually come?

We want to measure prediction accuracy.

YAGNI vs. Forward-Looking Design

YAGNI means "You Aren't Gonna Need It now", not "You'll Never Need It".

The problem is paying heavy upfront costs for low-accuracy predictions.

When forward-looking design makes sense

Extension points (interfaces, hooks, plugin architecture)
Database schema separation
Fields that are easy to add later

→ Low cost, low damage if wrong

When YAGNI is the answer

UI implementation
Complex business logic
External integrations
Permission/billing logic

→ Will need complete rewrite if wrong

The Estimation Problem

You might think: calculate value like this.

Value = (Probability × Cost_Saved_Later) − Cost_Now

But here's the catch: we can't estimate these accurately.

Nobody knows the real probability
Future costs are unknown until we do the work
Even upfront costs grow during implementation

If we could estimate accurately, we wouldn't need this system.

Learning from inaccuracy

We don't need perfect estimates.

What we need:

Learn which areas have accurate predictions
Learn which areas consistently miss
Build organizational knowledge

Example:

Area	Hit Rate	Tendency
Database design	70%	Forward-looking OK
UI specs	20%	Stick to YAGNI
External APIs	10%	Definitely YAGNI

Numbers are learning material, not absolute truth.

Recording Predictions in ADR

To make forward-looking design measurable, we need to record our decisions.

We use ADR (Architecture Decision Records). I've written about ADRs before in an earlier post: "Why did we choose this again?" - How ADRs Solved Our Documentation Problem.

Example ADR with forecast:

## Context
Customer-specific permission management might be needed in the future

## Decision
Keep simple role model for now

## Forecast
- Probability estimate: 30% (based on sales feedback)
- Cost if later: 20 person-days (rough estimate)
- Cost if now: 4 person-days (rough estimate)
- Decision: Don't build it now (expected value is 0.3 × 20 − 4 = +2 person-days, too close to zero to trust with estimates this rough)

Estimates can be rough. What matters is recording the rationale.

Making It Measurable

Architecture

In our environment, a pipeline like this should work:

Repositories (ADR + metadata.json)
  ↓
Jenkins (cross-repo scanning, diff collection)
  ↓
S3 (aggregated JSON)
  ↓
Microsoft Fabric (analysis & visualization)
  ↓
Dashboard

We already have Jenkins scanning repos for code complexity. We can extend this for ADR metadata.

Metadata Design

Keep ADR content free-form. Standardize only the metadata for aggregation.

docs/adr/ADR-023.meta.json:

{
  "adr_id": "ADR-023",
  "type": "forecast",
  "probability_estimate": 0.3,
  "cost_now_estimate_pd": 4,
  "cost_late_estimate_pd": 20,
  "status": "pending",
  "decision_date": "2025-11-01",
  "outcome": {
    "requirement_date": null,
    "actual_cost_pd": null
  }
}

Minimum fields needed:

adr_id: unique identifier
type: forecast
probability_estimate: 0-1
cost_now_estimate_pd: upfront cost (person-days)
cost_late_estimate_pd: later cost (person-days)
status: pending / hit / miss
outcome: actual results

Treat estimates as "estimates", not gospel truth.

Diff-Based Collection

Full scans get expensive. Collect only diffs:

# Record the last scanned commit SHA, then list only metadata files changed since then
git diff --name-only <prev>..<now> | grep '\.meta\.json$'

Scales as repos grow.

Comparing Predictions to Reality

After 6-12 months, review:

ID	Feature	Est. Prob	Actual	Est. Cost	Actual Cost	Result
F-001	CSV bulk import	30%	Came after 6mo	15pd later	1pd	Hit & overestimated
F-002	i18n	50%	Didn't come	-	-	Miss
F-003	Advanced perms	20%	Came after 3mo	20pd later	25pd	Hit & underestimated

Focus on trends and deviation reasons, not absolute accuracy.

Aggregation & Visualization

Two Types of Output

Raw facts (NDJSON, append-only):

{"adr_id":"ADR-023","type":"forecast","status":"hit",...}
{"adr_id":"ADR-024","type":"forecast","status":"miss",...}

Snapshot (daily/weekly metrics):

{
  "date": "2025-01-27",
  "metrics": {
    "success_rate": 0.30,
    "total_forecasts": 20,
    "hits": 6,
    "misses": 14,
    "avg_cost_deviation_pd": -3.5
  }
}

What Leadership Wants to See

CTOs and executives probably care about:

Forecast success rate (prediction accuracy)
Cost savings trend (rough ROI)
Learning curve (are we getting better?)

Treat ADR as transaction log. Handle visualization separately.

When Do We Know It Was Worth It?

Three evaluation moments:

① When requirement actually comes

"Did that requirement actually happen?"

No → prediction missed
Yes → move to next evaluation

Not "worth it" yet.

② When we measure implementation time (most important)

"How fast/cheap could we implement it?"

Case	Additional work
With forward-looking design	1 person-day
Without (estimated)	10 person-days

This is when we can say "forward-looking design paid off".

User adoption doesn't matter yet.

③ When users get value

This evaluates business value, but involves marketing, sales, timing, competition.

For technical decisions, focus on ② implementation cost difference.

Why Not Excel?

Excel management fails because:

Updates scatter across time
Unclear ownership
Diverges from decision log
Nobody looks at it

Excel becomes "create once, forget forever".

Treat ADR as input device, visualization as separate layer.

Summary

This system's goal isn't perfect estimates or perfect predictions.

Goals are:

Record decisions
Learn from results
Improve prediction accuracy over time

Wrong estimates aren't failures. Making the same wrong decision repeatedly without learning is the failure.

Treat numbers as learning material, not absolute truth.

Next Steps

Planning to propose:

Finalize metadata.json schema
PoC with 2-3 repos
Build Jenkins → S3 → Fabric pipeline
Start with hit rate & cost deviation
Run for 3 months, evaluate learning

Not sure if this will work, but worth trying to turn forward-looking design from "personal skill" into "organizational capability".

I write more about design decisions and engineering processes on my blog.
If you're interested, check it out: https://tielec.blog/

How Git Temp Files Killed Our Jenkins Performance (EFS Metadata IOPS Hell)

Yuto Takashi — Tue, 27 Jan 2026 11:29:23 +0000

Why You Should Care

If you're running Jenkins on AWS EFS, you might hit this exact problem. Git clone operations start timing out, Jenkins UI becomes painfully slow, and you get cryptic "Bad file descriptor" errors.

The culprit? Git temporary pack files accumulating over time, starving EFS of metadata IOPS.

The Problem

Monday morning. Jenkins dashboard takes forever to load. Hit the Replay button, build starts, then:

fatal: write error: Bad file descriptor
fatal: fetch-pack: invalid index-pack output
ERROR: Error cloning remote repo 'origin'

Transfer speed drops from 77KB/s to 51KB/s before timing out completely. 504 Gateway Timeout errors everywhere.

Initial Investigation

First thought: "Network issue?"

But looking closer at the error logs, I noticed pipeline-groovy-lib was failing during Shared Library checkout. That happens on the Jenkins Controller, not agents. So this is a Controller resource problem.

Checked CloudWatch metrics:

✅ CPU: Normal (0-5%, occasional 30% spikes)
✅ Network: Nothing unusual
✅ EBS disk latency: Stable at ~0.7s

Wait... this Jenkins uses EFS, not just EBS.

EFS Metrics Told the Real Story

Checked EFS CloudWatch metrics and found:

Throughput utilization: Hitting 100% during 00:00-03:00
IOPS: Metadata operations dominating
Storage: Growing from 14GB → 17GB

📊 See detailed metrics and graphs in the full write-up

EFS was starving on metadata IOPS, not storage capacity.

What's Metadata IOPS?

In EFS, metadata operations include:

File stat (checking size/timestamps)
Directory listings
File create/delete
Permission changes

In other words: lots of small file operations consume metadata IOPS.

Jenkins workloads are full of these:

Build logs (thousands of small files)
Git repositories (.git/objects with tons of files)
Shared Library clones
Build fingerprints

It's not about storage size. It's about file count.
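
One way to make "file count" visible is to count files per top-level directory instead of summing bytes. A minimal sketch follows; the function name is mine, and the demo runs on a throwaway directory rather than a real Jenkins home.

```python
import os
import tempfile


def file_counts(root):
    """Return {top-level subdirectory: number of files beneath it}."""
    counts = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                counts[entry.name] = sum(len(files) for _, _, files in os.walk(entry.path))
    return counts


# Demo on a throwaway directory; point file_counts() at the Jenkins home on a real system.
with tempfile.TemporaryDirectory() as root:
    for name, n in (("fingerprints", 5), ("logs", 2)):
        os.makedirs(os.path.join(root, name))
        for i in range(n):
            open(os.path.join(root, name, f"f{i}"), "w").close()
    demo_counts = file_counts(root)
    for name, count in sorted(demo_counts.items(), key=lambda kv: -kv[1]):
        print(f"{count:>8}  {name}")
```

Keep in mind that walking the tree is itself a burst of metadata operations, so on a struggling EFS volume this is something to run off-peak.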

Finding the Culprit

Checked directory sizes:

du -sh /mnt/efs/jenkins/*

342M    plugins
106M    war
332M    logs
373M    caches
174M    fingerprints
timeout jobs  # Suspicious timeout

These directories add up to only ~1.3GB of the 17GB that EFS reported, so ~15.7GB was still unaccounted for.

Searched for large files directly:

find /mnt/efs/jenkins -type f -size +100M 2>/dev/null -ls

Boom:

125829120 Jan 27 01:50 .../builds/873/libs/.../root/.git/objects/pack/tmp_pack_S3GPJw
122683392 Jan 27 01:50 .../builds/872/libs/.../root/.git/objects/pack/tmp_pack_c4EAwd
(dozens more of these...)

tmp_pack_* files everywhere. 100-300MB each.
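
To size the problem before deleting anything, a read-only scan is enough. The sketch below is a hypothetical helper, not what I ran at the time: it walks a tree, matches tmp_pack_* by name, and totals the bytes. The demo builds a small fake tree so it can run anywhere.

```python
import fnmatch
import os
import tempfile


def find_tmp_packs(root):
    """Return (path, size in bytes) for every tmp_pack_* file under root."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in fnmatch.filter(filenames, "tmp_pack_*"):
            path = os.path.join(dirpath, name)
            found.append((path, os.path.getsize(path)))
    return found


# Demo on a throwaway tree that imitates a .git/objects/pack directory.
with tempfile.TemporaryDirectory() as root:
    pack_dir = os.path.join(root, "builds", "1", ".git", "objects", "pack")
    os.makedirs(pack_dir)
    with open(os.path.join(pack_dir, "tmp_pack_abc123"), "wb") as f:
        f.write(b"x" * 2048)
    with open(os.path.join(pack_dir, "pack-keep.pack"), "wb") as f:
        f.write(b"x" * 10)
    found = find_tmp_packs(root)
    total_bytes = sum(size for _, size in found)
    print(f"{len(found)} file(s), {total_bytes} bytes")
```

Regular pack files are left out of the result, which matters because only the tmp_pack_* leftovers are safe candidates for cleanup.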

Root Cause

Here's what was happening:

Jenkins clones Pipeline Shared Library
Git creates temporary pack files (tmp_pack_*)
EFS IOPS throttling causes timeout
Temp files never get cleaned up
This happens every build (nightly at 23:12)
~200-300MB garbage per build × dozens of builds = ~15GB

Vicious cycle:

EFS slow → Git fails → Files accumulate → EFS slower

The Fix

Immediate: Adjust EFS Throughput

Changed from Bursting mode to Provisioned throughput (300 MiB/s).

Why provisioned?

Predictable performance for metadata IOPS spikes
No waiting for burst credits to recover
Works during investigation (find, du commands)

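⚠️ Note: EFS throughput mode changes have restrictions. After switching to Provisioned mode (or raising the provisioned amount), you can't switch back or lower the amount for 24 hours, so plan accordingly.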

Cleanup Job

Created a daily cleanup pipeline:

pipeline {
    agent any
    triggers {
        cron('0 4 * * *')
    }
    stages {
        stage('Clean tmp_pack files') {
            steps {
                sh '''
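                    # -mmin +1440 matches files last modified more than 24 hours ago
                    # (-mtime +1 would only match files at least 2 days old)
                    find "$JENKINS_HOME" -name "tmp_pack_*" -type f -mmin +1440 -delete
                    echo "Cleaned up tmp_pack_* files older than 1 day"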
                '''
            }
        }
    }
}

Manual Cleanup (One-time)

systemctl stop jenkins
find /mnt/efs/jenkins -name "tmp_pack_*" -type f -delete
systemctl start jenkins

Freed up ~15GB instantly.

Key Takeaways

1. Storage Size ≠ Performance

Small files matter more than total GB on EFS. Metadata operations can bottleneck before you hit storage limits.

2. Bursting Mode Can Be Unpredictable

When problems accumulate gradually ("silently"), burst credits can run out unexpectedly.

3. Always Have a Safety Net

Changing to provisioned throughput bought us time to investigate properly without user impact.

Monitoring Setup

Added CloudWatch alarms:

EFS throughput utilization > 75%
Directory size monitoring (weekly reports)

Early detection prevents these surprises.

Conclusion

Surface symptom: Git clone errors
→ Deeper cause: EFS metadata IOPS exhaustion
→ Root cause: Git temp file accumulation

Problem-solving is about peeling back layers. Each hypothesis, each metric check, gets you closer to the truth.

If you found this useful, I write more about infrastructure debugging and SRE experiences here:
https://tielec.blog/

Full investigation details with metrics graphs:
https://tielec.blog/en/tech/sre/jenkins-efs-metadata-iops-issue

My 5-Year-Old Keyboard Died During a Winter Trip - Here's What I Learned

Yuto Takashi — Mon, 26 Jan 2026 02:48:53 +0000

Why You Should Care

If you use a Bluetooth keyboard daily, it will eventually fail. Understanding why it happened and how to choose the next one can save you time and money.

TL;DR:

5.5-year-old Filco Bluetooth keyboard died suddenly
Hypothesis: Winter temperature shock (15-20°C drop) + aging
Chose RealForce RC1 45g over HHKB due to ThinkPad compatibility
Lesson: 5 years is the expected lifespan for most electronics

What Happened

I bought a Filco Majestouch MINILA Air (Bluetooth, Cherry MX Black) in July 2019. It worked perfectly for 5.5 years.

Then, I left home for a week during a winter cold wave (heating off). When I returned, the keyboard was completely dead.

No power
No LED lights
New batteries didn't help

Troubleshooting Steps

I tried everything:

✓ Check power switch → ON
✓ Check battery polarity → Correct
✓ Clean battery contacts → No dirt
✓ Try new batteries (multiple brands) → No change
✓ Reset Bluetooth pairing → No response
✓ Full discharge (remove batteries, press keys) → No change
✓ Try pairing with smartphone → No response

Diagnosis: Power system failure

Possible causes (based on research):

Power IC failure
Solder cracks on power lines
Bluetooth board power management failure
Capacitor damage

Conclusion: Not worth repairing

Out of warranty (5.5 years old)
Filco's repair support ends after 5 years
Repair cost would exceed new keyboard price

Why Did It Fail?

"How can a keyboard die just because I didn't use it for a week?"

This question bothered me. Then I remembered: it was a winter cold wave, and I had the heating off.

Temperature Changes

Before trip (heating on): Room temp ~20-25°C
During trip (cold wave, no heating): Room temp ~5-10°C
After return (heating on): Room temp ~20-25°C

Temperature difference: 15-20°C

Hypothesis: Thermal Shock

After 5.5 years of use, solder joints likely had microscopic cracks. Normal temperature changes (from keyboard heat during use) were fine, but rapid temperature cycling (shrink → expand) may have caused complete fractures.

Research found similar cases:

PC support companies report increased "keyboard not working" calls during cold waves
Many cases involve keyboards 5+ years old
Most are temporary (work after warming up), but mine was permanent

Note: This is just a hypothesis. I have no definitive proof.

Choosing the Next Keyboard

Requirements

Wireless connection (Bluetooth or 2.4GHz)
Quiet typing (I used to like clicky switches, but now prefer silence)
Portable, but also used with ThinkPad

The last point turned out to be critical.

Keyboards Considered

Filco MINILA-R Convertible (~$100)

Cherry MX switches
Compact with arrow keys
Standard layout
Familiar brand

HHKB (Happy Hacking Keyboard) Professional HYBRID Type-S (~$250)

Ultra-compact (60% keyboard)
Lightweight (540g)
Topre switches (electrostatic capacitive)
But: no arrow keys, unique layout

RealForce RC1 (~$230)

70% keyboard
Arrow keys + function keys
Topre switches (electrostatic capacitive)
Standard layout
600g

Why I Rejected HHKB

I was initially attracted to HHKB's compactness and portability.

But the layout is very different:

No arrow keys (use Fn + keys instead)
Control key in unusual position
No function keys (F1-F12)

Problem: ThinkPad has a standard layout. If I use ThinkPad for work outside, switching between two different layouts would be confusing.

I prioritized "ThinkPad compatibility" over "portability".

Why I Chose RealForce RC1

Decision factors:

Standard layout (same as ThinkPad)
70% size (compact but has arrow/function keys)
Topre switches (quiet + durable)
600g (desk-focused but portable)

Key Weight: 30g vs 45g

RealForce RC1 comes in two versions: 30g and 45g.

30g:

Lighter feel
Quieter
Easier to mistype
Big difference from ThinkPad

45g:

Standard weight (similar to ThinkPad)
Less confusion when switching
Fewer mistypes

I chose 45g for ThinkPad compatibility.

Final decision: RealForce RC1 45g, Japanese layout

Price: ~$230. I'll wait for a sale. Until then, I'll use my old Filco wired keyboard.

Expected Lifespan of Filco Keyboards

I researched user experiences:

Usage period (based on user reports):

4-5 years: Most common
5-7 years: Above average
10+ years: Very rare (mostly wired models)

Manufacturer's support:

Filco repair support: 5 years
After 5 years, spare parts are no longer guaranteed

Bluetooth vs Wired:
Bluetooth models have more components (power IC, Bluetooth board), so more failure points.

Estimated lifespan: 4-6 years

My 5.5 years was actually pretty good for a Bluetooth model.

Lessons Learned

1. 5 Years is the Cutoff

Manufacturer warranties and support typically end around 5 years, matching real-world failure rates.

2. Aging Electronics + Temperature Shock = Risk

While I can't prove it, aged products may be more vulnerable to rapid temperature changes.

3. Prevention is Limited

Realistic options during long trips:

Store in temperature-stable location
Use cardboard + towels for insulation

But for 5+-year-old products, there may be no preventing eventual failure.

4. Timing, Not Duration

The root cause was 5.5 years of aging.

Winter cold wave + week-long temperature cycling likely triggered the final failure.

Even with continued use, it probably would have failed soon anyway.

Takeaway

Filco MINILA Air served me well for 5.5 years
Failure likely due to aging + temperature shock (hypothesis)
Next keyboard: RealForce RC1 45g (considering ThinkPad compatibility)
Electronics typically last ~5 years; 10+ years is lucky

More about my decision-making process:

https://tielec.blog/

I Wanted Zero-Input Time Tracking. Here's What I Learned.

Yuto Takashi — Mon, 26 Jan 2026 01:49:14 +0000

Time tracking tools. Task trackers. There are tons of them out there.

But I've never found one that truly clicked for me.

Why? Because every tool assumes you'll input something. And that input is exactly what I hate.

Why You Should Care

If you've ever:

Tried TaskChute or similar methods and gave up
Wished for "automatic" time tracking
Wondered why no tool just knows what you're working on

...then this post is for you. I went down the rabbit hole so you don't have to.

The Promise of Automatic Tracking

Tools like RescueTime, Timely, and Timing (Mac) promise automatic tracking. No timers, no manual input.

Sounds perfect, right?

Well, I found 4 limitations that made me rethink everything.

Limitation 1: "Which App" ≠ "What For"

Automatic trackers record traces. Which app you used. Which site you visited.

But here's the thing: I use Chrome for research, for social media, and for work docs. The tool can't tell the difference.

The "why" behind your actions? Only you know that.

Limitation 2: Idle Detection Doesn't Catch Everything

RescueTime detects when you stop typing or moving your mouse. Smart.

But what about:

Waiting for a build to finish
Reading logs
Just... thinking

You're working, but not "doing" anything. The tool marks you as idle.

Limitation 3: Only the Active Window Gets Tracked

Multiple monitors? Multiple browser windows? Too bad.

Only the window you're actively interacting with gets logged. If you're reading docs on the left and coding on the right, only the coding side counts.

Your multitasking reality? Invisible.

Limitation 4: iPhone Doesn't Play Nice

My setup: Windows + iPhone.

Windows? RescueTime works great.
iPhone? Apple doesn't allow background app monitoring. So RescueTime barely functions. You're stuck with Screen Time, which doesn't export or integrate with anything.

Two separate worlds. No unified view.

So What Does Everyone Else Do?

I wondered: surely someone has solved this?

Turns out... not really.

Most people don't bother tracking in detail
TaskChute enthusiasts power through with willpower (rare)
Managers have assistants do it for them
Engineers use Git history as a proxy

Zero-input, perfect tracking? I couldn't find anyone doing it.

My Realistic Compromise

Okay, perfect is impossible. So what can I accept?

My non-negotiable: zero input.
My goal: understand trends in how I spend time.
My setup: Windows + iPhone.

Solution: RescueTime (free) + check the dashboard once a week.

What I'm giving up:

Compromise	How I'm Dealing With It
iPhone integration	Just ignore it. PC is my main workspace.
"What for" context	"Which app" is good enough.
Perfect accuracy	Trends are enough.
Task-level tracking	Category-level is fine.

Expected outcome: "This week, I spent X hours on productive stuff, Y hours drifting." That's it.

The Takeaway

If you want perfect time tracking, you have to accept some manual input.

If you want zero input, you have to let go of perfection.

I'm trying RescueTime for a week. If it's not enough, I'll figure something else out.

That's where I landed. Maybe it helps you too.

I write about decisions and reflections like this on my blog.
If you're interested: https://tielec.blog/

Windows Battery Shows 10% But Suddenly Shuts Down? Here's How to Fix It

Yuto Takashi — Sun, 25 Jan 2026 07:52:31 +0000

Why You Should Care

Ever been working on something important when your laptop suddenly shuts down, even though the battery showed 10%? Yeah, me too. Lost some unsaved work and got pretty frustrated.

Turns out, it's not a bug—it's battery degradation messing with the displayed percentage. Here's how to diagnose it and prevent it from happening again.

What you'll learn:

How to check your actual battery health with one command
Why your battery percentage lies to you
How to adjust settings to avoid sudden shutdowns
Best practices for extending battery life

The Problem

My laptop showed 10% battery remaining. I was about to plug in the charger when—BAM—"Your PC will shut down in 1 minute" with no way to cancel. And it did. Hard shutdown, not even hibernate like it was supposed to.

Wait, what? I had 10% left!

Step 1: Check Your Battery Health

Windows has a built-in command to generate a detailed battery report. Open Command Prompt as Administrator and run:

powercfg /batteryreport

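This creates battery-report.html in the current directory. From an Administrator Command Prompt that is usually C:\Windows\System32\battery-report.html, and the command prints the exact path when it finishes.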

Tip: I initially typed betteryreport and got an error. It's batteryreport (with an 'a'). Don't be like me. 😅

Step 2: Read the Report

Open the HTML file in your browser. Look for these key numbers:

Item	Example Value	Meaning
DESIGN CAPACITY	86,000 mWh	Battery capacity when new
FULL CHARGE CAPACITY	68,630 mWh	Current maximum capacity
CYCLE COUNT	287	Number of charge cycles

Calculate degradation:

68,630 ÷ 86,000 ≈ 0.798 = ~80% health

My battery had degraded by 20%. That's the culprit.
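
If you'd rather not do the division by hand, here is a small Python sketch. It assumes you copy the two values from the report as text (for example "68,630 mWh"); the function names are mine.

```python
def parse_mwh(text):
    """Turn a report value such as '68,630 mWh' into an integer number of mWh."""
    return int(text.replace("mWh", "").replace(",", "").strip())


def battery_health(full_charge_text, design_text):
    """Full charge capacity divided by design capacity (1.0 = like new)."""
    return parse_mwh(full_charge_text) / parse_mwh(design_text)


health = battery_health("68,630 mWh", "86,000 mWh")
print(f"health: {health:.1%}")
print(f"degraded by: {1 - health:.1%}")
```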

Why This Causes Sudden Shutdowns

Here's the thing: Windows shows percentages based on current capacity, not design capacity.

So when my laptop showed 10%:

Display: 10% of current capacity
Reality: ~8% of original capacity (10% × 0.8)

The shutdown threshold was set at 5%. From 10% to 5% is supposed to be a 5% buffer, but with degradation, it's really only ~4%. Under heavy load, that disappears in seconds.
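
The same arithmetic as a quick sketch, using the capacities from my report. The helper names are hypothetical; the point is just to convert displayed percentages into shares of the original battery and into actual energy.

```python
FULL_CHARGE_MWH = 68_630   # current maximum capacity from the report
DESIGN_MWH = 86_000        # capacity when new


def displayed_to_design_percent(displayed_percent):
    """Convert a percentage of current capacity into a percentage of design capacity."""
    return displayed_percent * FULL_CHARGE_MWH / DESIGN_MWH


def buffer_mwh(shown_percent, shutdown_percent):
    """Energy left between the displayed level and the shutdown threshold."""
    return (shown_percent - shutdown_percent) / 100 * FULL_CHARGE_MWH


print(f"10% shown is about {displayed_to_design_percent(10):.1f}% of a new battery")
print(f"10% -> 5% buffer is about {displayed_to_design_percent(5):.1f}% of a new battery")
print(f"that buffer holds about {buffer_mwh(10, 5):.0f} mWh")
```

Swap in your own two capacities to see how thin your buffer really is.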

The Fix

Fix 1: Adjust Warning Levels

Give yourself more buffer time before the forced shutdown.

Navigate to:

Control Panel 
→ Power Options 
→ Change plan settings 
→ Change advanced power settings 
→ Battery

Adjust these:

Low battery level: 15-20% (up from 10%)
Critical battery level: 7-10% (up from 5%)

Fix 2: Enable 80% Charge Limit (Lenovo)

If you use your laptop plugged in most of the time, limit charging to 80% to reduce battery wear.

For Lenovo laptops:

Open Lenovo Vantage app
Go to Hardware Settings → Power
Find "Battery Charge Threshold" or "Conservation Mode"
Set maximum charge to 80%

Other manufacturers have similar features—check your laptop's utility app.

Fix 3: Match Your Usage Pattern

Mostly at a desk?

Keep it plugged in
Set 80% charge limit
Use a cooling pad if it gets hot

Mobile user?

Charge when it hits 20-30%
Try to stay between 20-80%
Avoid draining to 0%

Bonus: AC vs Battery Power

Here's something that surprised me: AC power is actually better for your PC (though not necessarily for the battery).

Why?

Battery power: voltage fluctuates as it drains
AC power: stable voltage and current
Your CPU/GPU prefer stable power

So keeping it plugged in isn't bad for the computer. Just set that 80% charge limit to protect the battery.

About Charge Cycles

A charge cycle = 100% of battery capacity used.

Examples:

Use 50%, charge → Use 50%, charge = 1 cycle
Use 30%, charge × 3 times ≈ 1 cycle

When plugged in:

Battery hits 100% → charging stops
Power comes directly from AC adapter
Cycles barely increase

My 287 cycles meant I was actually using it unplugged quite a bit.
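
The cycle examples above boil down to adding up partial discharges. A tiny sketch, assuming you note how many percent you used before each recharge (the function name is mine):

```python
def equivalent_cycles(discharges_percent):
    """Sum partial discharges; every 100% of capacity used counts as one cycle."""
    return sum(discharges_percent) / 100


two_halves = equivalent_cycles([50, 50])          # use 50%, charge, twice
three_thirties = equivalent_cycles([30, 30, 30])  # use 30%, charge, three times
print(two_halves, three_thirties)
```

This is only the counting rule; how a given laptop's firmware actually increments CYCLE COUNT may differ.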

Using a Power Bank?

Make sure it's powerful enough!

Check your AC adapter:

Example: OUTPUT: 20V 3.25A
→ 20V × 3.25A = 65W

Power requirements:

Standard laptops: 45-65W
High-performance: 90-135W
Gaming laptops: 135W+

A 60W power bank works for most regular laptops.

Quick Reference

Battery health levels (full charge capacity ÷ design capacity):

✅ 80%+: Healthy
⚠️ 70-80%: Adjust settings
🔴 Below 70%: Consider replacement

Action checklist:

[ ] Run powercfg /batteryreport
[ ] Calculate actual battery health
[ ] Raise low battery warning to 15-20%
[ ] Set 80% charge limit if mostly plugged in
[ ] Verify power bank wattage matches needs

Wrapping Up

That mysterious shutdown? Not so mysterious anymore. Battery degradation is sneaky—your laptop thinks it has more juice than it actually does.

One command (powercfg /batteryreport) and a few setting tweaks can save you from lost work and frustration.

Have you dealt with this issue? Drop a comment with your battery health percentage! 👇

I share more thoughts on technical decisions and problem-solving approaches on my blog if you're interested: https://tielec.blog/

"Why did we choose this again?" - How ADRs Solved Our Documentation Problem

Yuto Takashi — Sun, 25 Jan 2026 06:34:27 +0000

Why You Should Care

Ever had a new team member ask "why are we using this tool?" and you couldn't remember the exact reason? Or worse, spent hours digging through Slack threads and meeting notes trying to reconstruct a decision from 6 months ago?

That was me until I discovered Architecture Decision Records (ADRs).

The Problem: Scattered Information

I recently wrote about evaluating project management tools for our 30-person team. We had:

Redmine (database maintenance nightmare)
Azure DevOps (too complex, nobody used half the features)
Discussions scattered across Slack, Google Docs, and email

Someone commented: "You should write an ADR for this."

My reaction? "What's an ADR?"

What is an ADR?

Architecture Decision Record = A short document that captures why you made a technical decision.

That's it. Simple format:

# 001. Evaluate Linear for Project Management

## Status
Proposed (evaluating)

## Context
- 30-person team, tools fragmented
- Redmine: maintenance issues
- Azure DevOps: too complex, underutilized

## Decision
Evaluate Linear as unified solution
- Simple UI (learned from Azure DevOps)
- Cost: ~$3,000/year vs $12,000 Redmine maintenance
- Cross-team visibility

## Consequences
Pros: Simple, cheaper, unified
Cons: English UI, less customization
Unknown: Actual usage experience

ADR vs Meeting Notes

Meeting notes capture what happened:

2:00 PM - Alice: I think Linear is good
2:05 PM - Bob: But it's in English
2:10 PM - Carol: What about Jira?
...

ADR captures the decision:

## Decision: Linear

## Why
- Simple (avoiding Azure DevOps mistake)
- Cost effective
- Cross-team visibility

## Alternatives Rejected
- Jira: expensive, complex
- GitHub Projects: weak reporting

Six months later, which one helps your future self more?

Keeping ADRs Alive (The Real Challenge)

The biggest risk? ADRs becoming dead documentation.

Here's what works:

1. Keep It Simple (5-10 minutes max)

If it takes an hour to write, you're doing it wrong.

# 003. Use Linear

## Context
Tools fragmented, need unification

## Decision
Linear - simple, affordable

## Consequences
Good: unified, cheap
Bad: English UI, less customization

Details: [Meeting notes](link)

2. Make Them Useful

Onboarding: New members read them
Questions arise: "Why this?" → "Check ADR-003"
Quarterly review: Update or deprecate
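For the quarterly review, a short script could list each ADR with its current status. This is only a sketch: it assumes every ADR is a markdown file with a `## Status` heading like the examples in this post, and `read_status` and `list_statuses` are hypothetical helper names.

```python
from pathlib import Path


def read_status(adr_text: str) -> str:
    """Return the first non-empty line under '## Status', or 'Unknown'."""
    lines = adr_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().lower() == "## status":
            for following in lines[i + 1:]:
                if following.startswith("#"):
                    break  # reached the next heading without finding a status
                if following.strip():
                    return following.strip()
    return "Unknown"


def list_statuses(adr_dir: Path) -> dict[str, str]:
    """Map each ADR filename in adr_dir to its status line."""
    return {
        path.name: read_status(path.read_text(encoding="utf-8"))
        for path in sorted(adr_dir.glob("*.md"))
    }


if __name__ == "__main__":
    # Assumes ADRs live in docs/adr; adjust to your repository layout
    for name, status in list_statuses(Path("docs/adr")).items():
        print(f"{name}: {status}")
```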

3. Living Documents

## Status
Accepted (2026-01-22)

## Update (2026-04-15)
After 3 months:
- English UI was fine
- BUT: needed spreadsheet alongside
  due to limited custom fields

4. Don't Force It

Culture comes from convenience, not mandates.

Start with one ADR. Share it. If someone says "this is helpful," you've won.

The Perfection Trap

"But what if I miss important information?"

You don't need 100% completeness.

Your future teammates need:

What you chose
Why you chose it
What else you considered
Main trade-offs

That's 80% of questions answered.

The other 20%? Link to meeting notes.

Comparison

Perfectionist approach:
- Time: 3 hours
- Completeness: 100%
- Written: 1-2x per year
→ Result: Most decisions undocumented

Pragmatic approach:
- Time: 5-10 minutes
- Completeness: 70-80%
- Written: 2-3x per month
→ Result: Most decisions documented

70% information beats 0% information.

Example: Google's Design Docs

Google has a similar practice called "Design Docs":

"A design doc is not a spec. It doesn't need to be perfect. It's a tool for discussion."

Their approach:

Not required, but encouraged
Bullet points are fine
Start discussion before coding
Review and improve together

My Takeaway

When I wrote that project management tool article, I was already doing ADR-style thinking:

Problem context ✓
Options considered ✓
Trade-offs ✓
Current decision ✓

I just didn't know the term.

ADRs are letters to your future self and team.

They answer: "Why did we do this?" when everyone has forgotten, moved on, or left the company.

Getting Started

Try one ADR for your next technical decision
Keep it short (5-10 minutes)
Store in Git (docs/adr/0001-your-decision.md)
Share with team and see if they find it useful
Iterate based on feedback

That's it. No fancy tools needed. Just markdown files in your repo.
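If you want to shave even the file-creation step, a tiny scaffold script can help. The sketch below is one possible approach, not a standard tool. It assumes the `docs/adr/` location and the four-digit filename prefix mentioned above, and the names `slugify`, `next_adr_number`, and `create_adr` are made up for illustration.

```python
import re
from pathlib import Path

# Section headings follow the simple format shown earlier in this post
ADR_TEMPLATE = """# {number:04d}. {title}

## Status
Proposed

## Context

## Decision

## Consequences
"""


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def next_adr_number(adr_dir: Path) -> int:
    """Return one more than the highest numeric filename prefix in adr_dir."""
    numbers = [
        int(match.group(1))
        for path in adr_dir.glob("*.md")
        if (match := re.match(r"(\d+)-", path.name))
    ]
    return max(numbers, default=0) + 1


def create_adr(title: str, adr_dir: Path = Path("docs/adr")) -> Path:
    """Write a new numbered ADR skeleton and return its path."""
    adr_dir.mkdir(parents=True, exist_ok=True)
    number = next_adr_number(adr_dir)
    path = adr_dir / f"{number:04d}-{slugify(title)}.md"
    path.write_text(ADR_TEMPLATE.format(number=number, title=title), encoding="utf-8")
    return path
```

Calling `create_adr("Use Linear")` in an empty repo would create `docs/adr/0001-use-linear.md`. From there you fill in the sections by hand.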

Resources

Update: We actually implemented this at my company. Started with one ADR. Now it's standard practice. The trick? Don't mandate it—let the value speak for itself.

What's your experience with decision documentation? How do you handle "why did we choose this?" questions? Drop a comment below! 👇

I write more about decision-making and reflective practices for engineers.
If you're interested, you can find more here: https://tielec.blog/
