Forem: Sreekanth Kuruba

CNI Plugins in Kubernetes Explained: The Networking Engine Behind Every Pod

Sreekanth Kuruba — Thu, 07 May 2026 09:50:58 +0000

You create a Pod.
It gets an IP address and can communicate with other Pods.

But how does that actually happen?

Kubernetes doesn’t wire up Pod networking itself. It delegates that job to CNI plugins, the invisible plumbing system of Kubernetes.

Kubernetes schedules Pods, but CNI plugins give them network identity and connectivity.

Let’s break it down clearly.

What is CNI?

Container Network Interface is a specification, not a single tool.

It defines a standard way for Kubernetes (and container runtimes) to configure networking for Pods.

When a Pod needs to be connected to the network, the container runtime (acting for the kubelet) calls a CNI plugin and says:

“Give this Pod an IP, set up connectivity, and make it work.”

Why Kubernetes Uses CNI

Networking needs vary across environments:

Simple setups for learning
High-performance production clusters
Strict security and network policies
Cloud provider integrations

CNI keeps Kubernetes networking pluggable: you can choose different plugins without changing Kubernetes itself.

How CNI Works (Step-by-Step)

When you create a Pod:

kubelet detects the new Pod scheduled to its node
kubelet asks the container runtime (containerd/CRI-O) to set up the Pod; the runtime creates the Pod's network namespace and then calls the CNI plugin
The plugin:

Sets up a veth pair (virtual cable between Pod and host)
Assigns an IP address (using IPAM)
Configures routing and interfaces

The Pod becomes ready and can communicate

This entire process usually takes milliseconds.
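The hand-off from runtime to plugin can be sketched in a few lines. Under the CNI specification, the runtime executes the plugin binary, passes parameters such as CNI_COMMAND, CNI_NETNS, and CNI_IFNAME as environment variables, and sends the network configuration as JSON on stdin; the plugin answers with a JSON result. The sketch below is a toy stand-in, not a real plugin: the function name, the network name demo-net, the plugin type demo-plugin, and the hard-coded address are all assumptions for illustration, and the result is a simplified subset of what a real plugin returns.

```python
import json


def handle_cni_call(env: dict, stdin_config: str) -> dict:
    """Toy stand-in for a CNI plugin binary (hypothetical, for illustration only)."""
    conf = json.loads(stdin_config)
    command = env["CNI_COMMAND"]
    if command == "ADD":
        # A real plugin would now create the veth pair inside env["CNI_NETNS"]
        # and ask its IPAM plugin for an address. Here the address is hard-coded.
        return {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
            "ips": [{"address": "10.244.1.5/24", "gateway": "10.244.1.1", "interface": 0}],
        }
    if command == "DEL":
        # Teardown returns no result payload in this sketch.
        return {}
    raise ValueError(f"unsupported CNI_COMMAND: {command}")


# What the runtime would provide (all values are made up for the example).
env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123",
    "CNI_NETNS": "/var/run/netns/demo",
    "CNI_IFNAME": "eth0",
}
config = json.dumps({"cniVersion": "1.0.0", "name": "demo-net", "type": "demo-plugin"})

result = handle_cni_call(env, config)
print(json.dumps(result, indent=2))
```

The point of the sketch is the contract: the plugin is a short-lived program that receives "what to do" through the environment and "how the network is configured" through stdin.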

Core Components of CNI

Network Namespace — Isolated network stack for each Pod
veth Pair — Virtual Ethernet cable connecting Pod to the host
Bridge / Router — Connects multiple Pods (Linux bridge or direct routing)
IPAM — IP Address Management (assigns and tracks IPs)
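To make the IPAM piece concrete, here is a minimal sketch of an allocator that hands out addresses from a node's Pod CIDR. It is an in-memory illustration only: the class name SimpleIPAM, the CIDR, and the Pod names are assumptions, and real IPAM plugins persist their leases so they survive restarts.

```python
import ipaddress


class SimpleIPAM:
    """In-memory IP allocator for one Pod CIDR (illustrative, not a real IPAM plugin)."""

    def __init__(self, pod_cidr: str):
        self.network = ipaddress.ip_network(pod_cidr)
        hosts = iter(self.network.hosts())
        self.gateway = next(hosts)   # first usable address is reserved for the gateway
        self._free = list(hosts)     # remaining usable addresses, in order
        self._leases = {}            # pod id -> assigned address

    def allocate(self, pod_id: str):
        # Asking twice for the same Pod returns the same lease.
        if pod_id in self._leases:
            return self._leases[pod_id]
        if not self._free:
            raise RuntimeError("pod CIDR exhausted")
        ip = self._free.pop(0)
        self._leases[pod_id] = ip
        return ip

    def release(self, pod_id: str) -> None:
        # Return the address to the pool when the Pod is deleted.
        ip = self._leases.pop(pod_id, None)
        if ip is not None:
            self._free.append(ip)


ipam = SimpleIPAM("10.244.1.0/29")
print(ipam.gateway)             # 10.244.1.1
print(ipam.allocate("pod-a"))   # 10.244.1.2
print(ipam.allocate("pod-b"))   # 10.244.1.3
```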

Popular CNI Plugins (2026 Guide)

Plugin	Type	Best For	Strengths	Best Used When
Calico	Routing + Policy	Most production clusters	Excellent NetworkPolicy, scalable	You need strong security
Cilium	eBPF-based	Performance + Security	Kernel-level networking, observability	You want modern, high-performance
Flannel	Overlay	Learning & small clusters	Extremely easy to set up	Just getting started
AWS VPC CNI	Native	AWS EKS	Native AWS performance	Running on AWS

Recommendation:

Beginners → Flannel
Production → Calico or Cilium

Overlay vs Routing vs eBPF

Overlay (Flannel, Weave): Easy but adds encapsulation overhead
Routing (Calico): Better performance using real routing protocols
eBPF (Cilium): Modern approach — extremely fast with powerful security

Debugging CNI Issues

# Check running CNI pods
kubectl get pods -n kube-system | grep -E "calico|cilium|flannel"

# View CNI config
ls /etc/cni/net.d/

# Check Pod networking
kubectl exec -it <pod> -- ip addr

# Kubelet logs for CNI errors
journalctl -u kubelet | grep -i cni

Summary

CNI plugins are the networking engine of Kubernetes.
They handle IP assignment, interface creation, routing, and connectivity using Linux kernel primitives.

Understanding CNI helps you:

Choose the right networking solution
Debug connectivity issues faster
Design better Kubernetes clusters

Next in Series:
Kubernetes Services & kube-proxy Internals

Dockerfile & Image Build Internals: From Layers to Lightning-Fast Builds

Sreekanth Kuruba — Tue, 05 May 2026 12:31:48 +0000

You write a Dockerfile, run docker build, and get an image.

But what’s really happening under the hood? Docker isn’t just “building” your app — it’s assembling a stack of immutable filesystem layers.

Docker doesn’t build applications — it builds filesystem snapshots layer by layer.

Let’s break it down.

1. What is a Docker Image, Really?

A Docker image is not a single file.
It’s a stack of read-only layers.

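Every instruction in your Dockerfile adds a step to the image, but only some of them add filesystem content:

FROM → Base layer
RUN → Executes command and snapshots the result
COPY / ADD → Adds files into a new layer
ENV, WORKDIR, CMD → Mostly image metadata (configuration rather than new filesystem content)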

These layers are:

Immutable
Content-addressed (using SHA256)
Reusable across images and builds

This design is what makes Docker fast and efficient.
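A small sketch shows why content addressing makes layers reusable. It is a simplification: real registries compute digests over layer archives, while this example hashes raw bytes, and the LayerStore class and the sample contents are assumptions for illustration.

```python
import hashlib


def layer_digest(content: bytes) -> str:
    """Identify a layer by the SHA-256 hash of its content."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


class LayerStore:
    """Stores each distinct layer once, keyed by its digest (illustrative)."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        digest = layer_digest(content)
        # Identical content maps to the same digest, so it is stored only once.
        self._blobs.setdefault(digest, content)
        return digest

    def __len__(self) -> int:
        return len(self._blobs)


store = LayerStore()
base = b"debian base filesystem"
image_a = [store.put(base), store.put(b"app A files")]
image_b = [store.put(base), store.put(b"app B files")]

print(image_a[0] == image_b[0])  # True: both images point at the same base layer
print(len(store))                # 3 blobs stored for 4 layer references
```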

2. How Docker Build Works (Step by Step)

When you run docker build .:

Docker CLI sends the build context (files + Dockerfile) to the daemon.
BuildKit (Docker’s modern build engine) takes control.
Dockerfile is read from top to bottom.
For each instruction:

Docker checks the cache.
Cache hit → Reuses existing layer (very fast).
Cache miss → Executes the instruction and creates a new layer.
All layers are stacked to create the final image.

3. Layer Caching – The Real Superpower

Docker follows one strict rule:
If a layer changes, Docker invalidates that layer and all subsequent layers.

Bad Order (Slow Builds)

FROM node:20
# Source code changes frequently, so this layer is rebuilt on almost every build
COPY . .
# ...which forces npm install to run every time
RUN npm install

Good Order (Fast Builds)

FROM node:20
# Dependency manifests rarely change
COPY package*.json ./
# Cached unless the manifests above change
RUN npm install
COPY . .

Rule of Thumb: Put stable things (dependencies) at the top. Put frequently changing things (your code) at the bottom.
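The invalidation rule can be simulated in a few lines. This is a toy model under stated assumptions, not how BuildKit computes its cache keys: here each step's key is a hash of the previous step's key, the instruction text, and a stand-in string for the files the instruction consumes.

```python
import hashlib


def build(instructions, inputs, cache):
    """Simulate one build; returns a list of (instruction, 'CACHED' or 'BUILT')."""
    results = []
    parent_key = ""
    for instr in instructions:
        # The key chains in the parent key, so one miss invalidates every later step.
        material = parent_key + "|" + instr + "|" + inputs.get(instr, "")
        key = hashlib.sha256(material.encode()).hexdigest()
        if key in cache:
            results.append((instr, "CACHED"))
        else:
            cache.add(key)
            results.append((instr, "BUILT"))
        parent_key = key
    return results


bad = ["FROM node:20", "COPY . .", "RUN npm install"]
good = ["FROM node:20", "COPY package*.json ./", "RUN npm install", "COPY . ."]

v1 = {"COPY . .": "src-v1", "COPY package*.json ./": "deps-v1"}
v2 = {"COPY . .": "src-v2", "COPY package*.json ./": "deps-v1"}  # only source code changed

bad_cache, good_cache = set(), set()
build(bad, v1, bad_cache)    # first build populates the cache
build(good, v1, good_cache)

bad_rebuild = build(bad, v2, bad_cache)
good_rebuild = build(good, v2, good_cache)
print(bad_rebuild)   # npm install is BUILT again
print(good_rebuild)  # npm install stays CACHED; only the final COPY is BUILT
```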

4. BuildKit vs Legacy Builder

Feature	Legacy Builder	BuildKit (Recommended)
Speed	Slow	Much Faster
Parallel Execution	No	Yes
Cache Intelligence	Basic	Advanced
Multi-platform Build	Difficult	Easy
Secret Handling	Risky	Secure

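Enable BuildKit (it is already the default builder in Docker Engine 23.0 and later; on older versions set it explicitly):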

DOCKER_BUILDKIT=1 docker build .

5. Multi-Stage Builds (The Pro Move)

# Build Stage
FROM node:20 AS builder
WORKDIR /app
COPY package*.json ./
RUN npm install
COPY . .
RUN npm run build

# Production Stage
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
CMD ["node", "dist/server.js"]

Benefits: Smaller image, faster deployment, better security.

Multi-stage builds ensure only the final artifacts are kept — everything else is discarded.

6. Quick Debugging Tips

Build is slow → Reorder your Dockerfile
Suspect a stale cache → docker build --no-cache . (forces a full rebuild)
Image too big → Use multi-stage + .dockerignore
See detailed output → docker build --progress=plain .

7. Under the Hood (How Layers Actually Work)

Docker uses a Union File System (like OverlayFS) to combine layers.

Lower layers → read-only
Top layer → writable (when container runs)

To you, it looks like a single filesystem.
Internally, it’s multiple layers merged together.
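The lookup behavior of a union filesystem can be imitated with Python's ChainMap, which searches a list of mappings from top to bottom and sends all writes to the first one. This is an analogy under loud assumptions: the paths and contents are made up, and real OverlayFS works on directories and files (and uses whiteout entries for deletions), not on dictionaries.

```python
from collections import ChainMap

# Read-only image layers, listed bottom to top (contents are illustrative).
base_layer = {"/bin/sh": "shell binary", "/etc/os-release": "Debian"}
app_layer = {"/app/server.js": "v1", "/etc/os-release": "Debian (patched)"}


def start_container(*image_layers):
    """Stack a fresh writable layer on top of the shared image layers."""
    writable = {}
    # ChainMap searches the first mapping first, so the top layer wins.
    return ChainMap(writable, *reversed(image_layers))


container = start_container(base_layer, app_layer)
container["/tmp/cache"] = "scratch data"      # new file lands in the writable layer
container["/app/server.js"] = "hot-patched"   # "copy-up": the image layer is untouched

print(container["/etc/os-release"])   # Debian (patched): upper layer shadows lower
print(container["/app/server.js"])    # hot-patched
print(app_layer["/app/server.js"])    # v1: the image layer itself did not change

second = start_container(base_layer, app_layer)
print(second["/app/server.js"])       # v1: other containers still see the original
```

Because every container gets its own writable layer on top of shared read-only layers, many containers can start from one image without copying it.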

Summary

A Dockerfile is not just a list of commands.
It’s a performance blueprint for building layered, cached, and efficient images.

Master layer order and caching, and your builds will go from slow and frustrating to fast and predictable.

🔜 Next in Series

Docker Storage & Volumes Internals – Why containers eat disk space and how to control it.

Failover Sounds Good… Until It Doesn’t Work

Sreekanth Kuruba — Mon, 04 May 2026 12:32:53 +0000

“We have failover.”

That sounds reassuring.

But when real failure hits…

many systems still go down — hard.

Why?

Because failover is easy to configure — but extremely hard to make reliable at global scale.

Here are the most common ways failover fails in production:

❌ 1. Failover That Was Never Tested

RDS Multi-AZ enabled
Kubernetes failover configured

Looks good on paper.

Reality:

Takes minutes instead of seconds
Gets stuck
Or doesn’t trigger at all

Lesson: Untested failover = fake failover.

❌ 2. Failover Works… But Breaks Something Else

Sudden traffic spike crashes the secondary instance
Connection storms overload the database
DNS cache delays routing

Result: Failover triggers… but the system still suffers.

❌ 3. Manual Failover at the Worst Time

Someone has to manually promote the replica
Or run a script under pressure

At 3 AM with global users watching — this turns seconds into minutes of downtime.

❌ 4. Partial Failover Strategy

You protected the application ✔️

But forgot:

Database
Cache (Redis)
Message queue
Secrets manager
CI/CD pipeline

One missing piece = entire system impacted.

How to Make Failover Actually Work

Test it regularly — simulate real failures every month
Automate everything — zero human dependency
Reduce failover time — lower DNS TTL, fast retries, pre-warm instances
Handle traffic spikes — add rate limiting and circuit breakers
Run team drills — everyone must know what to do
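As one example of the traffic-spike protection mentioned above, here is a minimal circuit breaker sketch. The thresholds, class name, and error types are assumptions for illustration; production systems usually rely on a maintained library or a service mesh feature rather than a hand-rolled class.

```python
import time


class CircuitBreaker:
    """Fails fast after repeated errors, then allows a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: protect the struggling dependency by not calling it at all.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: allow one trial call; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def query_secondary():
    # Hypothetical dependency that is overloaded right after a failover.
    raise ConnectionError("secondary overloaded")


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
for attempt in range(3):
    try:
        breaker.call(query_secondary)
    except ConnectionError:
        print(f"attempt {attempt}: dependency error")
    except RuntimeError as exc:
        print(f"attempt {attempt}: {exc}")
```

After two real failures the third attempt never reaches the dependency, which is what keeps a freshly promoted secondary from being crushed by retries.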

🌟 Final Thought

Failover is not a checkbox you tick once.

It’s a capability that only proves itself when everything is on fire.

At global scale, the difference between a 10-second blip and a 40-minute outage is usually one thing:

How well your failover actually works under pressure.

💬 What’s the biggest failover issue you’ve seen?

Drop your experience below 👇

Why Most Systems Still Have Hidden Single Points of Failure (SPOF) – Even in 2026

Sreekanth Kuruba — Tue, 21 Apr 2026 12:45:58 +0000

Your system has replicas.

You use auto-scaling.

You have a load balancer.

So you’re safe… right?

Not really.

👉 Most outages don’t come from what you planned for.

Even well-architected systems can collapse because of hidden Single Points of Failure — the ones that look harmless until they bring everything down.

Here are the most dangerous hidden SPOFs that still exist in production systems at global scale:

🗄️ 1. Database Single Point of Failure (Most Critical)

Only one writer instance (even with read replicas)
No automatic failover configured
Backup exists but restore was never tested
Single connection string pointing to one endpoint

At global scale: One DB failure = entire application becomes unusable for millions of users.

🌐 2. DNS / Domain Resolution SPOF

All traffic pointing to one domain without proper failover routing
Single DNS provider with no backup
Missing TTL optimization or latency-based routing

⚖️ 3. Load Balancer / API Gateway SPOF

Single load balancer sitting in one Availability Zone
Weak or missing health checks
All traffic routed through one target group

🔄 4. CI/CD Pipeline SPOF

Single pipeline responsible for all production deployments
No proper rollback strategy
Pipeline failure = whole team blocked

📦 5. Secret & Configuration Management SPOF

Hardcoded secrets or environment variables
Single secrets manager without high availability
Configuration stored in one central place with no versioning

🛠️ 6. Monitoring & Alerting SPOF

All alerts going to one person or one Slack channel
Single monitoring tool with no redundancy
No proper escalation policy

🧠 The Hard Truth

Most systems don’t fail because of obvious SPOFs.

They fail because of the ones no one noticed.

At global scale, even a small hidden SPOF can impact users across multiple countries and time zones.

🛡️ How to Find and Fix Hidden SPOFs

Conduct a regular SPOF Audit
Ask the question: “What if this one component completely fails?”
Add redundancy + automation
Test failure scenarios regularly
Review architecture every quarter
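A SPOF audit can start as something as simple as a script over your component inventory. The sketch below is a hypothetical starting point: the Component fields, the three checks, and the sample inventory are assumptions, and a real audit would pull this data from your cloud provider or service catalog.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    replicas: int
    zones: frozenset
    auto_failover: bool


def find_spofs(components):
    """Return (name, reasons) for every component that fails a redundancy check."""
    findings = []
    for c in components:
        reasons = []
        if c.replicas < 2:
            reasons.append("single instance")
        if len(c.zones) < 2:
            reasons.append("single availability zone")
        if not c.auto_failover:
            reasons.append("no automatic failover")
        if reasons:
            findings.append((c.name, reasons))
    return findings


# Made-up inventory for illustration.
inventory = [
    Component("web", replicas=6, zones=frozenset({"az-1", "az-2"}), auto_failover=True),
    Component("primary-db", replicas=1, zones=frozenset({"az-1"}), auto_failover=False),
    Component("load-balancer", replicas=2, zones=frozenset({"az-1"}), auto_failover=True),
]

for name, reasons in find_spofs(inventory):
    print(f"{name}: {', '.join(reasons)}")
```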

🌟 Final Thought

The most dangerous Single Point of Failure is assuming you don’t have any.

Real resilience begins when you stop looking only at the obvious and start hunting for the hidden ones.

💬 What’s one SPOF that caused a real outage for you?

Let’s discuss 👇

How to Build Systems That Don’t Collapse at Global Scale

Sreekanth Kuruba — Mon, 20 Apr 2026 03:13:05 +0000

Modern systems rarely fail because of one small bug.

They fail when there’s no plan for when things inevitably go wrong.

In 2026, with global teams, multi-cloud environments, and millions of users, resilience isn’t optional — it’s foundational.

⚠️ A Real-World Incident (Why This Matters)

A primary database crashed during peak hours.

There was a backup
There was monitoring

But the critical gaps were:

No automatic failover
The restore process had never been properly tested

Result?

~40 minutes of downtime, manual recovery under pressure, frustrated users, and real business impact.

Lesson Learned:

Having tools and backups is not enough.

They must be automated, tested, and ready when real stress hits.

Here are the core DevOps (and SRE-inspired) principles for building production-ready, resilient systems:

🧩 1. Eliminate Single Points of Failure (SPOF)

One weak link can bring down the entire system.

Common SPOFs:

Single server handling all traffic
One database without replication
Critical service with no fallback

Solution:

Run multiple replicas
Deploy across multiple availability zones or regions
Use load balancers

Mindset: Always design systems assuming failure will happen.

🔄 2. Build Intelligent Failover Mechanisms

When one component fails, the system should recover automatically — without manual intervention.

Key practices:

Database replication (primary + replicas, with automatic promotion when the primary fails)
Auto-scaling groups
Kubernetes self-healing (automatic pod restart & rescheduling)
Multi-region active-active architecture

🧪 3. Test Failure Before It Tests You

Most systems look stable… until real-world traffic hits.

Don’t just test success scenarios.

Instead:

Load testing — simulate real user traffic
Stress testing — push the system beyond limits
Chaos Engineering — deliberately inject failures (e.g., Chaos Monkey style)

👉 If you don’t test failure, failure will test you at the worst possible time.

📡 4. Invest in Observability, Not Just Monitoring

You can’t fix what you can’t see.

True observability includes:

Metrics — CPU, memory, latency, error rates
Logs — detailed application behavior
Traces — end-to-end request flow across services

Plus:

Smart alerting (avoid alert fatigue)
On-call rotations with clear runbooks
Actionable dashboards

🧱 5. Plan for Failure as the Default

“Everything is fine” is never a strategy.

Must-have practices:

Regular backup and restore testing
Disaster Recovery planning (clear RTO & RPO targets)
Blameless postmortems after every incident

👉 Treat reliability as a core feature, not an afterthought.

🧭 DevOps Resilience Checklist

No single point of failure
Multi-zone / multi-region deployment
Auto-scaling + load balancing
Full observability + smart alerting
Backup & disaster recovery regularly tested
Chaos engineering practiced
Incident response plan ready

🌟 Final Thought

Reliability is not about eliminating failure completely.

It’s about anticipating failure, detecting it early, and recovering gracefully.

The best DevOps teams don’t just ship faster —

they build systems that stay up when everything else is breaking.

That’s what separates good systems from truly resilient ones at global scale.

💬 What’s one resilience practice that saved your system during a real outage?

Or what’s the biggest reliability challenge you’re facing right now?

Let’s discuss 👇

DevOps vs Platform Engineering in 2026

Sreekanth Kuruba — Wed, 15 Apr 2026 12:29:40 +0000

DevOps transformed how teams build and ship software.
It helped organizations move faster with automation, CI/CD, and shared ownership.

But as companies scale across countries and teams, new challenges start to appear.

What worked for small teams doesn’t always work at global scale.

Why Global Companies Are Quietly Shifting

Main Theme:
At global scale, traditional DevOps starts to crack. Platform Engineering is the next evolution that makes DevOps truly scalable, consistent, and effective across countries and large teams.

Imagine this:

A company has engineering teams in India, the US, Europe, and Singapore.
Hundreds of developers working across time zones.

Yet, releasing even a small feature still takes weeks — not because the developers are slow, but because they’re stuck fighting:

Setting up environments
Fixing inconsistent CI/CD pipelines
Waiting for approvals
Dealing with tool chaos across teams

👉 This is the reality when traditional DevOps tries to scale internationally.

🧩 What DevOps Solved — And Where It Breaks at Global Scale

DevOps was revolutionary. It brought developers and operations together through:

Automation & CI/CD
Infrastructure as Code (IaC)
Shared responsibility (“You build it, you run it”)

It works beautifully for small and mid-sized teams.

But here’s the uncomfortable truth:
👉 At large scale, many developers become part-time infrastructure managers instead of product builders.

At global enterprise scale, DevOps starts showing serious cracks:

Every team picks different tools → massive tool sprawl
Same problems get solved repeatedly
Compliance and regulations (GDPR, data sovereignty, etc.) become extremely hard to manage
Developers waste more time on infrastructure than on actual features
DevOps fatigue kicks in — frustration, burnout, and slower delivery

🏗️ Platform Engineering: The Next Evolution

Here’s the sharper truth:
While DevOps focuses on collaboration, Platform Engineering focuses on developer productivity at scale.

Think of it like this:

DevOps = Every team manages their own kitchen
Platform Engineering = One professional central kitchen with ready tools, standard recipes, and built-in safety

So developers can stop worrying about setup and just focus on cooking great features.

⚙️ What Platform Engineering Actually Delivers

A dedicated platform team builds an Internal Developer Platform (IDP) that offers:

🚀 Self-service environment creation (minutes instead of days/weeks)
🛤️ “Golden Paths” — safe, standardized, and recommended workflows
🔐 Security, compliance, and observability built-in by default
🧭 A clean developer portal for easy self-service

👉 Often powered by tools like Backstage and Crossplane, alongside core DevOps tools.

Result: Developers get guided freedom instead of complete chaos or total restriction.

⚖️ DevOps vs Platform Engineering – Clear Comparison

Aspect	DevOps	Platform Engineering
Main Focus	Collaboration between Dev & Ops	Developer productivity & experience at scale
Ownership	Shared by all teams	Dedicated platform team
Approach	Flexible (every team does it their way)	Standardized with smart guardrails
Best Suited For	Small to mid-size teams	Large global organizations
Key Metric	Deployment frequency & speed	Time saved + Developer Experience (DevEx)

One-line summary:
DevOps gives freedom.
Platform Engineering gives freedom that actually scales globally.

🌍 Why Global Companies Are Making This Shift in 2026

At an international level, complexity explodes: multi-cloud setups, different regulations, time zone differences, and 100+ engineering teams.

Platform Engineering solves these by:

Drastically reducing repetitive work and cognitive load
Bringing consistency across countries and clouds
Making security & compliance automatic
Improving developer happiness and retention
Enabling faster feature delivery with lower risk

👉 This is exactly why Platform Engineering roles are becoming some of the highest-paying and most strategic positions in 2026.

⚠️ Challenges & Smart Way to Adopt

It’s not effortless. Common pitfalls:

Building the platform without real developer feedback
Making it too rigid
Ignoring legacy systems

Better approach:

Start small (fix one major pain point first)
Treat developers as customers
Iterate continuously based on feedback

🧭 What Should You Learn?

If you're an engineer (especially aiming for global or remote opportunities):

Step 1: Master DevOps fundamentals
→ Docker, Kubernetes, Terraform, CI/CD

Step 2: Level up to Platform Engineering
→ Internal Developer Platforms (IDP)
→ Developer portals (e.g., Backstage)
→ Developer Experience (DevEx) mindset

💡 Pro tip: Build even a small internal platform project. It gives you a massive edge in interviews and on LinkedIn.

🔮 Final Thought

DevOps is not going away.

But the companies winning in 2026 are not just “doing DevOps”.
They are building Platform Engineering on top of it — turning DevOps into something scalable, structured, and developer-first at global scale.

👉 The future is DevOps made effortless through smart platforms.

💬 What about you?

What is the biggest time-waster in your current DevOps setup?

Environment setup delays?
CI/CD issues?
Too many tools?

Drop your real experience in the comments — curious to see what teams are struggling with most 👇

Types of APIs Explained: REST, GraphQL, gRPC & SOAP (With Real-World Examples)

Sreekanth Kuruba — Thu, 09 Apr 2026 12:07:06 +0000

When beginners start learning APIs, they usually think there’s only one kind:

“Send a request → Get a response.”

But in reality, there are multiple types of APIs, each built for different purposes — speed, flexibility, security, or simplicity.

In this guide, you’ll learn the main types of APIs with simple explanations, code examples, and real-world use cases.

🧠 What is an API? (Quick Recap)

An API (Application Programming Interface) is a set of rules that allows different software systems to communicate with each other.

One system sends a request → another system processes it → and returns a response.

The style and protocol of this communication decide the type of API.

🔹 Main Types of APIs by Architecture Style

1. REST APIs – The Most Popular Type

REST (Representational State Transfer) is the most widely used API style in 2026.

It uses standard HTTP methods:

GET – Fetch data
POST – Create new data
PUT / PATCH – Update data
DELETE – Delete data

Example:

GET /users/1

Response:

{
  "id": 1,
  "name": "Sreekanth",
  "email": "sreekanth@example.com"
}

Best For: Public APIs, mobile apps, and web applications

Popular Examples: Stripe, Razorpay, GitHub, Google Maps

Why it’s popular: Simple, scalable, and works everywhere.

2. GraphQL – Get Exactly What You Need

GraphQL solves a major problem of REST called over-fetching.

Instead of getting extra data, the client can request exactly the fields it needs.

Example Query:

{
  user(id: 1) {
    name
    email
    posts {
      title
      createdAt
    }
  }
}

Best For: Modern frontend and mobile apps

Popular Examples: Facebook, Shopify, GitHub, Airbnb

Big Advantage: Faster responses and better control for developers.

3. gRPC – The Fastest for Microservices

gRPC is a high-performance framework developed by Google.

It uses Protocol Buffers (binary format) instead of JSON, making it much faster and lighter.

Key Strengths:

Extremely fast and low latency
Smaller data size
Strongly typed
Supports streaming

Best For: Internal microservices communication and high-traffic systems

Popular Examples: Uber, Netflix, Google, Kubernetes

When to choose gRPC: When you need maximum speed between services.

4. SOAP – The Secure Enterprise Option

SOAP (Simple Object Access Protocol) is an older but still important protocol, especially in large organizations.

It uses XML and supports message-level security through standards such as WS-Security.

Still Used In:

Traditional banking core systems
Government and highly regulated industries

Important Note for India:

Modern systems like UPI, BBPS, and most fintech apps primarily use REST APIs with ISO 20022 standards. They have largely moved away from SOAP for better speed and flexibility.

🔹 Types of APIs by Access Level

Public APIs → Open to everyone (Example: Weather API, Google Maps)
Private/Internal APIs → Used only inside a company
Partner APIs → Shared with specific business partners

🔄 Real-World Architecture Insight

Most modern applications use a hybrid approach:

External-facing (apps & websites) → REST or GraphQL
Internal microservices → gRPC (for high speed)
Legacy systems → SOAP (for security & compliance)

In India’s fintech ecosystem:

UPI and public integrations → REST APIs
High-volume internal services → gRPC
Old core banking systems → Often still use SOAP or hybrid setups

🎯 Final Takeaway

There is no single best API type — each has its own strengths:

REST → Best for simplicity and wide compatibility
GraphQL → Best for flexibility and precise data fetching
gRPC → Best for speed and microservices
SOAP → Best for security in enterprise environments

Understanding these types of APIs helps you design better systems and choose the right tool for every situation.

💬 Your Turn:

Which type of API have you used the most?

Which one do you want to learn next?

Drop your answers in the comments below! 👇

API Explained: From Basics to Real-World Systems (UPI Deep Dive)

Sreekanth Kuruba — Wed, 08 Apr 2026 07:21:54 +0000

When you send ₹100 using PhonePe or Google Pay, it feels instant.

But behind that single tap, multiple systems communicate in real time across different banks.

👉 This seamless communication is powered by APIs.

🧠 What is an API?

An Application Programming Interface (API) is a set of rules that allows one software system to request another system to perform an action and return a result.

In simple terms:

Request → Process → Response

🍽️ Simple Analogy

Think of a restaurant:

You → Client
Waiter → API
Kitchen → Backend

You don’t enter the kitchen yourself

You just place an order, the waiter handles everything, and you get your food

APIs work exactly the same way between different systems.

🔍 Types of APIs (Quick Overview)

REST APIs — Most common (uses HTTP methods: GET, POST, PUT, DELETE)
GraphQL — Client decides exactly what data it needs
gRPC / SOAP — Used in high-performance or enterprise systems

In this blog, we’ll mainly focus on REST APIs, as they power most modern applications including UPI.

⚙️ Basic API Example

Here’s a very simple API call:

GET /users/1

Response:

{
  "name": "Sreekanth",
  "role": "DevOps Engineer"
}

The client requests data → the server processes it → and sends back the response.

📲 Real-World Example: UPI Payment Flow

Let’s see what actually happens when you send ₹100:

App → Payer Bank → NPCI → Payee Bank → Response

Step-by-step:

App collects amount + UPI PIN (encrypted)
App sends a secure API request to your bank
Your bank validates:
- UPI PIN
- Account balance
- Daily limits
Request is forwarded to NPCI (National Payments Corporation of India)
NPCI routes the request to the payee’s bank
Payee’s bank credits the amount
Success response flows back to both apps

👉 Total time: Usually under 2–3 seconds ⚡

[Diagrams omitted: the high-level UPI transaction flow, plus a simplified Payer PSP → NPCI → Payee PSP view]

🔔 When Things Are Not Instant (Webhooks)

Sometimes the bank takes longer.

Instead of waiting:

Transaction marked Pending ⏳
Bank/NPCI sends a Webhook callback 🔔 once completed

👉 Think of it as:
“Don’t call us, we’ll call you.”

This makes systems asynchronous and scalable.

⚙️ Sample UPI API Request

POST /v1/payments/upi HTTP/1.1
Host: api.phonepe.com
Authorization: Bearer <access_token>
Content-Type: application/json

{
  "amount": 10000,
  "payeeVpa": "friend@oksbi",
  "remarks": "Lunch money",
  "txnId": "txn-12345"
}

Here amount is in paise (10000 = ₹100), and txnId doubles as the idempotency key.

Response:

{
  "status": "SUCCESS",
  "transactionId": "UPI987654321",
  "responseCode": "00"
}

🧩 API Request/Response Lifecycle (End-to-End Flow)

Every API call follows a clear lifecycle, from the moment the request is sent until the response is received.

🔧 What Happens Inside the Backend

When an API receives a request, it goes through several important layers:

API Gateway – Handles rate limiting & routing
Authentication – JWT, OAuth2, or API keys
Input Validation – Checks request format and data
Business Logic – Balance check, fraud detection, rules
Database Operations – Secure debit/credit (ACID transaction)
External API Calls – To NPCI or other banks
Logging & Monitoring – For debugging and observability
Response – Sent back to the client

👉 To prevent duplicate payments, systems use idempotency keys (txnId).
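Here is a minimal sketch of how an idempotency key prevents a double debit when a client retries. It is an in-memory illustration: the PaymentService class, its starting balance, and the response fields are assumptions, and a real system would keep the processed keys in a durable store and perform the check and the debit inside one database transaction.

```python
class PaymentService:
    """Processes each txn_id at most once (illustrative, in-memory only)."""

    def __init__(self, balance_paise: int = 50_000):
        self.balance_paise = balance_paise
        self._processed = {}   # txn_id -> response returned the first time

    def pay(self, txn_id: str, amount_paise: int) -> dict:
        # A retry with a known txn_id gets the stored response; no second debit.
        if txn_id in self._processed:
            return self._processed[txn_id]
        if amount_paise > self.balance_paise:
            response = {"status": "FAILED", "reason": "INSUFFICIENT_FUNDS"}
        else:
            self.balance_paise -= amount_paise
            response = {"status": "SUCCESS", "txnId": txn_id}
        self._processed[txn_id] = response
        return response


service = PaymentService()
first = service.pay("txn-12345", 10_000)
retry = service.pay("txn-12345", 10_000)   # e.g. the app timed out and resent the request

print(first == retry)          # True: same response both times
print(service.balance_paise)   # 40000: debited once, not twice
```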

🏦 Why ACID Matters in Payments

Banking systems rely on ACID properties:

Atomicity: Either full transaction happens or none
Consistency: Total money remains correct
Isolation: Millions can transact simultaneously safely
Durability: Once success is returned, it’s permanent

👉 This ensures no “money lost” scenarios.
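Atomicity is easiest to see in code. The sketch below uses Python's built-in sqlite3 module as a stand-in for a bank's database; the table layout, account names, and amounts are assumptions. The credit is deliberately applied before the debit so that a failed debit shows the rollback undoing the earlier credit.

```python
import sqlite3


def transfer(conn, payer: str, payee: str, amount_paise: int) -> bool:
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount_paise, payee),
            )
            # The CHECK constraint rejects a negative balance and aborts the transfer.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount_paise, payer),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn) -> dict:
    return dict(conn.execute("SELECT id, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("payer", 15_000), ("payee", 0)])
conn.commit()

ok = transfer(conn, "payer", "payee", 10_000)
after_first = balances(conn)
rejected = transfer(conn, "payer", "payee", 10_000)   # payer only has 5000 left
after_second = balances(conn)

print(ok, after_first)
print(rejected, after_second)   # balances unchanged: the credit was rolled back
```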

🔄 Microservices Architecture

Modern apps like PhonePe are not built as one single block. They are divided into independent microservices:

User Service
Payment Service
Notification Service
Fraud Detection Service

These services talk to each other using internal APIs or message queues.

This makes systems scalable and fault-tolerant.

⚡ Scaling APIs for Millions of Transactions

Apps like PhonePe and Google Pay handle crores of transactions every day using:

Load balancing
Horizontal scaling (Kubernetes)
Caching (Redis)
Message queues (Kafka)
Rate limiting
Circuit breakers

🎯 Final Takeaway

APIs are not just a technical concept — they are the invisible engine behind everything:

Sending money 💰
Logging in 🔐
Fetching data 📊

Once you understand APIs,
you start seeing the architecture behind every app you use.

Why "Just Restart It" Stopped Working

Sreekanth Kuruba — Tue, 24 Mar 2026 07:58:32 +0000

A eulogy for the universal debugging technique

The Universal Truth

Every engineer has said it.

Every engineer has heard it.

Three words that have debugged more systems than all monitoring tools combined:

"Have you tried restarting it?"

It worked for decades. So well we turned it into a meme. A joke. A badge of honor.

"Did you turn it off and on again?"

We laughed because it was true.

When Restarting Made Sense

Once upon a time, a server was a physical thing.

One machine. One process. One problem.

When something broke:

Service stops responding
→ SSH into the box
→ ps aux | grep myapp
→ PID still there? Process hung?
→ kill -9 PID
→ ./start-myapp.sh
→ Everything works again

Total time: 2 minutes
Total stress: Minimal
Total sleep lost: None

Why did this work?

Because the problem was usually temporary.

A memory leak. A deadlock. A bad connection that timed out wrong.

The code had a bug, sure. But restarting reset the state to before the bug happened.

It wasn't elegant. It wasn't permanent.

But at 3 AM, that's all anyone cared about.

The First Sign of Trouble

Then we got more servers.

One box became ten.

Ten became a hundred.

Restarting stopped being a single command.

It became a deployment.

# Restart every host listed in servers.txt, one after another
while read -r server; do
    ssh -n "$server" "systemctl restart myapp"
done < servers.txt

This worked. Mostly.

Until the day it didn't.

The Cascade

I watched this happen once.

02:15 - Pager: "Database connections failing"

The on-call engineer checks the logs.

Database is overwhelmed. Too many connections.

The solution, burned into muscle memory from years of single-server debugging:

"Restart the database."

One command. One mistake.

systemctl restart postgresql

The database came back in 45 seconds.

In those 45 seconds:

All 200 application servers lost their connection pools
All 200 retried simultaneously, using identical retry logic
All 200 failed their health checks
The load balancer marked them all unhealthy
The site went down

The database was fine.

The app servers were fine.

The connections were gone.

The restart fixed nothing and broke everything.

One restart.

47 minutes of downtime.
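The "identical retry logic" in that incident is the classic thundering herd. One common mitigation is exponential backoff with jitter, so that clients spread their reconnects out instead of arriving together. The sketch below is an illustration under assumptions: the function name, base delay, cap, and per-server seeds are made up, and it only computes delays rather than performing real reconnects.

```python
import random


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0, rng=random):
    """Full-jitter backoff: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(attempts)]


# 200 app servers, each with its own random source (seeded here only for repeatability).
schedules = [backoff_delays(5, rng=random.Random(seed)) for seed in range(200)]
first_retries = [schedule[0] for schedule in schedules]

print(f"first retry spread: {min(first_retries):.3f}s to {max(first_retries):.3f}s")
print(f"distinct first-retry times: {len(set(first_retries))}")
```

With identical fixed delays all 200 servers would hit the database at the same instant after every failure; with jitter the same load arrives as a trickle the database can absorb.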

Why Restarting Broke

Restarting worked when:

State lived in one place
Dependencies were simple
Recovery was faster than finding root cause

Restarting broke when:

State moved to databases, caches, message queues
Services started calling other services
"Just restart it" became "restart everything in the right order with the right delays and pray"

A restart is no longer a local action.

It's a distributed event.

You don't restart one thing.

You restart a graph of dependencies.

What Happens When You Restart Now

You restart Service A
↓
Service A disconnects from database
↓
Database releases locks
↓
Service B loses connection to Service A
↓
Service B retries aggressively
↓
Retries overwhelm Service C
↓
Service C crashes
↓
Everything is on fire

All because you restarted "just one thing."

The Lie We Tell Ourselves

"Restarting is harmless."

It isn't.

Every restart is:

A forced state reset
A connection teardown
A potential cascade trigger
A temporary partial outage (even if small)

We accepted restarts as "free" because the cost was invisible.

Until it wasn't.

What Replaced Restarting

The industry didn't ban restarts.

It made them unnecessary.

Health checks

Detect problems before users do.

# Kubernetes liveness probe example
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

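If a service is unhealthy, stop sending it traffic (in Kubernetes that is the job of a readinessProbe; a failing livenessProbe like the one above makes the kubelet restart the container)
Let it recover or replace it
Users rarely see the failure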

Graceful degradation

Fail partially, not completely.

Cache down? Serve stale data
Database slow? Queue writes, serve reads
Something broke? Everything else keeps running
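One way to read "serve stale data" in code: keep the last good answer and hand it back when the backend fails. This is a minimal Python sketch under assumptions. `StaleOnErrorCache`, `fetch_price`, and the `sku-1` key are made up for illustration, and a real version would also bound how old a stale value may get.

```python
class StaleOnErrorCache:
    """Wrap a fetch function; when it fails, serve the last good value."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable that hits the real backend
        self._last_good = {}     # key -> last value fetched successfully

    def get(self, key):
        """Return (value, is_stale)."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._last_good:
                return self._last_good[key], True   # degraded, but still an answer
            raise                                   # nothing to fall back on
        self._last_good[key] = value
        return value, False


backend = {"up": True}


def fetch_price(key):
    # Stand-in for a database or remote call.
    if not backend["up"]:
        raise ConnectionError("backend unavailable")
    return 42


cache = StaleOnErrorCache(fetch_price)
print(cache.get("sku-1"))   # (42, False)
backend["up"] = False
print(cache.get("sku-1"))   # (42, True)
```

The caller gets a flag that says the value is stale, so it can decide whether stale is good enough.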

Automatic replacement

Never restart. Always replace.

Pod dies? New one starts
Node fails? Pods move
Same binary. Clean state. No cascade

Rolling restarts

One at a time, with verification.

Restart server 1 of 10
Wait for health check
Restart server 2 of 10
Never lose more than one server's worth of capacity
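The steps above fit in a short loop. This is a sketch, not a deployment tool: `restart` and `is_healthy` are hypothetical callables you would supply (an SSH command, an HTTP check), and `max_checks` and `interval` are names chosen here.

```python
import time


def rolling_restart(servers, restart, is_healthy, max_checks=5, interval=1.0):
    """Restart servers one at a time; halt at the first one that stays unhealthy.

    Returns the list of servers that were restarted and verified.
    """
    verified = []
    for server in servers:
        restart(server)
        for _ in range(max_checks):
            if is_healthy(server):
                break
            time.sleep(interval)  # give the server time to come back
        else:
            # The health check never passed: stop so the rest keep serving.
            raise RuntimeError(
                f"{server} unhealthy after restart; "
                f"halted at {len(verified)}/{len(servers)}"
            )
        verified.append(server)
    return verified


restarted = []
servers = [f"web-{i:02d}" for i in range(1, 4)]
print(rolling_restart(servers, restarted.append, lambda s: True, interval=0))
# ['web-01', 'web-02', 'web-03']
```

The important line is the `raise`. If server 2 does not come back, servers 3 through 10 are never touched.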

The Systems That Don't Need Restarts

Netflix doesn't restart. It terminates and replaces.

Google doesn't restart. It shifts load and repairs.

Your bank doesn't restart. It fails over to another region.

These aren't magic.

They're design choices.

They assumed from day one that "restart" was not a strategy.

The Honest Confession

I still say "have you tried restarting it?"

Sometimes it's the fastest path to "it works now."

But I don't pretend it's a fix anymore.

It's a diagnostic.

A temporary patch.

A way to buy time until the real problem reveals itself.

The difference is:

I know the difference now.

What You Can Do Monday

For your most critical service:

Find the last time it was restarted
Ask: "Why did that restart happen?"
Ask: "Could we have avoided it?"

If yes, build the automation.

If no, document why (so next time you know).

For your next outage:

Resist the restart reflex
Check dependencies first
Check connections second
Check logs third
Restart only when you understand what you're about to break

The Question

When was the last time you restarted something

and didn't know exactly what would happen when it came back?

Be honest.

This is part of a series on operations in the age of distributed systems. Next up: "The Pager Should Not Exist."

From Process Management to State Reconciliation

Sreekanth Kuruba — Tue, 24 Feb 2026 03:09:46 +0000

I used to restart servers at 2AM… Kubernetes made that job disappear

02:15 AM — Pager goes off
“nginx is down on web-01”

You wake up.
Grab your laptop.
SSH into the server.
Run a few commands. Restart the process.

02:22 AM — It’s back.

Try to sleep again.

This used to be normal.

Then Kubernetes changed the rules.

🧱 The old world: Process-driven operations

Before Kubernetes, everything revolved around processes.

A service was:

A Linux process
Running on a specific machine
Identified by a PID
Restarted manually (or via basic supervisors)

The assumptions were simple:

Machines are stable
Failures are rare
Humans fix problems

And when something broke…
👉 you fixed it

Availability depended on:

How fast someone could wake up and respond.

🐳 Containers helped… but didn’t solve the real problem

With tools like Docker, things improved:

Consistent environments
Faster deployments
Fewer “works on my machine” issues

But let’s be honest…

If a container crashed:

Maybe it restarted
Maybe it didn’t

If the node died?

You’re still in trouble

If dependencies failed?

Still your problem

👉 Containers improved portability
👉 They did NOT guarantee reliability

🔄 Kubernetes changed the question

Kubernetes doesn’t ask:

“Is this process running?”

It asks:

“Is the system in the state I declared?”

That’s a massive shift.

Instead of managing processes…
you define desired state.

⚙️ The magic: State reconciliation

You declare:

“I want 3 replicas”
“They should always be running”
“They should be healthy”

Kubernetes continuously checks:

Current state
Desired state

If something breaks…
👉 it fixes it automatically

Not later.
Not after a pager alert.
Continuously.
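The idea is easier to see as a loop. Below is a toy Python sketch of a single reconciliation pass. It is not how the Kubernetes controllers are implemented; `reconcile` and the pod naming are assumptions made for illustration.

```python
import itertools

_ids = itertools.count(1)   # source of fresh pod names for this toy example


def reconcile(desired_replicas, pods):
    """One pass of a toy control loop: make the healthy pod count match desired.

    `pods` maps pod name -> healthy flag. Returns the new pod map.
    """
    # Drop anything unhealthy (replace, don't repair).
    current = {name: ok for name, ok in pods.items() if ok}
    # Too few pods: start new ones.
    while len(current) < desired_replicas:
        current[f"pod-{next(_ids)}"] = True
    # Too many pods: remove the extras.
    while len(current) > desired_replicas:
        current.popitem()
    return current


pods = reconcile(3, {})
print(sorted(pods))              # ['pod-1', 'pod-2', 'pod-3']
pods[sorted(pods)[0]] = False    # simulate a crash of pod-1
pods = reconcile(3, pods)
print(sorted(pods))              # ['pod-2', 'pod-3', 'pod-4']
```

Nobody told the loop that pod-1 crashed. It only compared what exists with what was declared, and closed the gap.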

🔄 Traditional vs Kubernetes mindset

🧠 Why Kubernetes doesn’t care about PIDs

In traditional systems:

PID = identity

In Kubernetes:

PID = irrelevant

Because a PID is:

Local to a machine
Temporary
Lost on restart

Kubernetes doesn’t track processes.

It tracks:

Desired outcomes

You don’t ask:

“What’s the PID?”

You ask:

“Do I have 3 healthy pods?”

👉 That’s the difference between instance thinking and system thinking

💥 The real shift: Replace, don’t repair

Old mindset:

Fix the broken process

New mindset:

Replace it

👉 Failure is handled through replacement, not repair.

Kubernetes doesn’t try to “save” things.

It simply ensures:

The system matches your declared state

🧪 Jobs are different too

Before:

Run jobs manually
Monitor externally
Retry manually

Now:

Define a Job
Kubernetes runs it to completion
Retries automatically, up to the Job's backoffLimit
Tracks success/failure

👉 You define intent.
👉 System enforces outcome.
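As a rough model of "define intent, system enforces outcome", here is a toy Python sketch. The parameter names `completions` and `backoff_limit` echo fields of the Job spec, but `run_job` itself is hypothetical and skips what the real controller does (Pods, scheduling, delays between retries).

```python
def run_job(task, completions=1, backoff_limit=6):
    """Run `task` until it has succeeded `completions` times.

    Gives up once the number of failures exceeds `backoff_limit`.
    """
    succeeded = failed = 0
    while succeeded < completions:
        try:
            task()
            succeeded += 1
        except Exception:
            failed += 1
            if failed > backoff_limit:
                return {"status": "Failed", "succeeded": succeeded, "failed": failed}
    return {"status": "Complete", "succeeded": succeeded, "failed": failed}


attempts = {"n": 0}


def flaky():
    # Fails twice, then succeeds: a stand-in for a transient error.
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient failure")


print(run_job(flaky))
# {'status': 'Complete', 'succeeded': 1, 'failed': 2}
```

The caller states what "done" means. The loop owns the retrying and the bookkeeping.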

⚠️ Failure is not an exception anymore

At scale, failure is constant.

Systems like Google’s Borg (Kubernetes’ ancestor) proved this:

Machines fail
Networks break
Processes crash

Not if
But how often

Kubernetes is built for this reality.

It assumes:

Nodes will disappear
Pods will die
Networks will glitch

And it’s okay with that.

🔁 What actually changed?

Before Kubernetes:

You maintained systems
You fixed failures
You reacted

After Kubernetes:

You define intent
The system maintains itself
Recovery is automatic

👉 Your job shifts from:
operator → system designer

🏁 Final thought

Kubernetes doesn’t remove failure.

It removes panic.

The system doesn’t ask:

“Who will fix this?”

It asks:

“What should this look like?”

And then it makes it happen.

💬 Your turn

What’s the last thing you had to fix manually at 2AM?

And could Kubernetes have handled it for you?

How Platform Engineering Changes the Game

Sreekanth Kuruba — Tue, 27 Jan 2026 14:45:50 +0000

DevOps isn't dying.

But the "central DevOps team doing everything" model is hitting limits at scale.

Here's what's replacing it — and why it works.

🧱 What Platform Teams Actually Build

(Not just theory)

1. Internal Developer Platforms (IDPs)

Single control plane for deployments, from dev → prod
Example: Backstage (Spotify), Internal Developer Portal
Result: 60% less time spent on deployment setup (Humanitec data)

2. Golden Paths, Not Guardrails

Pre-approved Terraform modules for AWS/GCP/Azure
Standardized K8s configurations with sane defaults
Security/compliance baked in, not bolted on
Outcome: 83% faster infra provisioning (Gartner)

3. Self-Service, Not Ticket-Based

Developers deploy via UI/API/Git push — no tickets
Automated approval workflows replace manual reviews
Impact: 10x more deployments with same team size

🏢 Real-World Example: Amazon's "You Build It, You Run It"

The famous mandate works because of the invisible platform:

What developers see:

git push → running service
Built-in monitoring, logging, alerting
One-click rollback, canary deployments

What platform provides:

CodePipeline templates (not custom Jenkins)
CDK constructs (not raw CloudFormation)
Internal service catalog
Standardized observability stack

The result:

150M+ deployments/year
Teams deploy thousands of times daily
No central bottleneck

⚙️ The Tooling Shift

OLD DevOps Stack:

Jenkins → Ansible → Custom scripts → Slack alerts → Manual dashboards

NEW Platform Stack:

Backstage (UI) → ArgoCD (GitOps) → Crossplane (Control Plane) → OpenTelemetry (Observability) → Internal APIs

Key differences:

Declarative over imperative
Git as source of truth for everything
API-first everything

📊 The Numbers Don't Lie

Companies with mature platforms report:

50% fewer production incidents (DORA)
75% faster mean time to recovery (MTTR)
40% less time spent on "keeping lights on"
3x higher developer satisfaction (SPACE metrics)

🤖 Where AI Actually Helps Today

Not: "AI will write your Terraform"

But: "AI explains why your deployment failed"

Useful patterns right now:

AI-driven failure analysis in CI/CD logs
Cost optimization suggestions for cloud resources
Security misconfiguration detection
Documentation generation from code changes

Still needed:

Platform engineers to design the systems AI operates on
Human judgment for architecture decisions
Cultural change management

🚨 The Hard Parts (Nobody Talks About)

1. Platform adoption isn't automatic

Need developer buy-in
Must be better than the DIY alternative
Requires investment in UX

2. Platform teams get it wrong when:

They build what they think devs need (not what they actually need)
They create another complex tool (instead of simplifying)
They over-standardize and kill innovation

3. Success metrics are tricky

Not: "How many services use our platform?"
But: "How much faster can teams ship?"
And: "How many outages did we prevent?"

🎯 The Real Shift

From:

"Submit a ticket, wait 3 days, get your dev environment"

To:

"Click button, get environment, start coding in 5 minutes"

From:

"Ops owns stability, Dev owns features" (siloed)

To:

"Teams own their services, platform provides safety nets" (aligned)

💡 If You Remember One Thing

Platform engineering isn't about building tools.

It's about reducing cognitive load for developers.

The best platform is the one developers don't even notice —

because it just gets out of their way.

🔍 Are you building or using an internal platform?

What's the ONE thing that made it successful (or painful)?

Companies like Spotify (with Backstage) and Netflix scaled DevOps exactly this way — by building platforms instead of doing everything centrally.

Sreekanth Kuruba — Tue, 06 Jan 2026 12:00:40 +0000

Why Traditional DevOps Stops Scaling

#discuss #devops #platformengineering #career
