Forem: Faisal Dilawar

What if Your Commute Had a Co-Rider? Building CommuteShare for Earth Day

Faisal Dilawar — Sat, 18 Apr 2026 21:38:59 +0000

This is a submission for Weekend Challenge: Earth Day Edition

What I Built

CommuteShare

A cycling route-matching app that helps you find other commuters who share a meaningful stretch of road with you.

Why

I honestly believe that if we can make cycling easier as a mode of commuting, we can help the environment, not just by reducing fossil fuel consumption but also by reducing noise pollution. It will also reduce road congestion and improve commute times over short distances. And if we can help more people adopt cycling as their mode of commute, we will be helping our planet a little more.
As an avid cyclist I have faced lots of issues on my commute rides: lack of infrastructure, lack of empathy towards cyclists, lack of shower/changing facilities in workplaces.
But the issue that I think stops quite a lot of people, even after they have dipped their toes into commuting by bike, is how boring it can be. Riding twice a day, up to 5 times a week, on the same route day in and day out can become boring and demotivating fast.
To overcome the boredom, I came up with this app, which brings people together to ride and enjoy the company, or just the quiet confidence that someone else is there with them. That motivation is a big emotional booster.

What's new?

Most ride-sharing apps match on origin and destination. CommuteShare matches on the full route geometry: two riders are compatible if they share a real stretch of road and cycle at compatible speeds. The idea: if you're both riding the same 3 km or 15-minute corridor every morning, why not do it together?

How it works

Draw your cycling route directly on the map (click to add waypoints)
Set your speed range and departure window
Hit Find Matches — the app finds compatible riders and highlights the exact shared road segment
Click any match to fly the map to that route
New rides from other users appear in real time as fuchsia dashed lines — no page refresh needed

What makes it technically interesting

Routes are stored as GPS polylines. Checking overlap isn't as simple as "do these lines cross?" — two routes can cross at a single point without sharing any meaningful road. What you actually want to know is: how long would these two riders spend side by side?
CommuteShare estimates this by sampling 50 evenly-spaced points along each candidate route and counting how many fall within 400m of the query route. That ratio gives an estimated shared distance, which converts to shared riding time at the rider's average speed. If you'd spend at least 5 minutes riding together — it's a match.

Live ride updates are powered by Server-Sent Events — a lightweight real-time push mechanism that doesn't need WebSockets. Open two browser tabs, post a ride in one, and it appears on the other map within a second.

Tech stack

Layer	Technology
Backend	Go 1.22, chi router
Database	PostgreSQL 15 + PostGIS
Frontend	React 19, Vite
Map	Leaflet + react-leaflet
Real-time	Server-Sent Events (SSE)

Demo

Code

GitHub Repo

How I Built It

I started with the hardest part first — the matching algorithm — and worked outward from there.

The matching algorithm

The first question was: how do you check if two cycling routes share a meaningful stretch of road?

Routes are stored as GPS polylines in PostGIS. The obvious answer is ST_Intersection — compute the geometric overlap directly. I tried it. It silently returned empty geometries. Turns out GEOS 3.9.0 (the geometry library bundled in the standard PostGIS Docker image) has a bug where ST_Intersection returns empty when a line is fully contained inside a polygon. Dead end.

The workaround: point-sampling. Instead of clipping geometries, I interpolate 50 evenly-spaced points along each candidate route and count how many fall within 400m of the query route. That ratio estimates shared distance, which converts to shared riding time. If you'd spend at least 5 minutes riding together — it's a match. Same result, no broken geometry functions.

The shared segment overlay (the white line showing exactly where two routes overlap) is built the same way — collect the close-sampled points and join them into a LineString with ST_MakeLine.
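The sampling idea can be sketched outside the database as well. The snippet below is a minimal pure-Python illustration, not the project's PostGIS query: the function names, the equirectangular projection, and the `speed_kmh` parameter are assumptions made for the example, while the 50 samples, 400 m threshold, and 5-minute cutoff come from the description above.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def to_xy(lat, lon, lat0):
    """Project degrees to local metres (equirectangular approximation)."""
    x = math.radians(lon) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat) * EARTH_RADIUS_M
    return x, y


def point_segment_distance(p, a, b):
    """Distance in metres from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.dist(p, a)
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))  # clamp to the segment ends
    return math.dist(p, (a[0] + t * dx, a[1] + t * dy))


def sample_points(route, n):
    """Return n points evenly spaced by distance along a polyline, plus its length."""
    segments = list(zip(route, route[1:]))
    lengths = [math.dist(a, b) for a, b in segments]
    total = sum(lengths)
    points = []
    for i in range(n):
        target = total * i / (n - 1)
        walked = 0.0
        for (a, b), length in zip(segments, lengths):
            if length > 0 and walked + length >= target:
                t = (target - walked) / length
                points.append((a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1])))
                break
            walked += length
        else:
            points.append(route[-1])  # rounding fallback: end of the route
    return points, total


def shared_ride(query, candidate, speed_kmh, samples=50,
                threshold_m=400.0, min_minutes=5.0):
    """Estimate shared distance and time between two (lat, lon) routes."""
    lat0 = query[0][0]
    q = [to_xy(lat, lon, lat0) for lat, lon in query]
    c = [to_xy(lat, lon, lat0) for lat, lon in candidate]
    points, length_m = sample_points(c, samples)
    close = [p for p in points
             if min(point_segment_distance(p, a, b)
                    for a, b in zip(q, q[1:])) <= threshold_m]
    shared_m = length_m * len(close) / samples      # ratio -> shared distance
    minutes = shared_m / (speed_kmh * 1000 / 60)    # distance -> riding time
    return {"shared_m": shared_m, "minutes": minutes,
            "match": minutes >= min_minutes, "close_points": close}
```

A route that merely crosses the query route contributes only a handful of close samples, so its estimated shared time stays under the cutoff. The `close_points` list is the raw material for the overlay line.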

Backend

Go with chi router, three strict layers: handler → service → repository. Handlers never touch the DB. Repository never touches HTTP. This kept the code easy to reason about under time pressure.

The SSE live feed was the last piece. The hub is a simple map[chan []byte]struct{} with a mutex — new rides are pre-marshalled to JSON and fanned out to all connected subscribers non-blocking. If a client is slow, it gets dropped. For a local demo, that's fine.
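For readers who want to see the fan-out pattern in isolation, here is a rough Python analogue of that hub. It is not the project's Go code: bounded `queue.Queue` objects stand in for the Go channels, and the class name, method names, and `buffer_size` parameter are assumptions for the sketch.

```python
import json
import queue
import threading


class Hub:
    """Fan-out hub: every subscriber owns a small bounded queue."""

    def __init__(self, buffer_size=8):
        self._lock = threading.Lock()
        self._subscribers = set()
        self._buffer_size = buffer_size

    def subscribe(self):
        q = queue.Queue(maxsize=self._buffer_size)
        with self._lock:
            self._subscribers.add(q)
        return q

    def unsubscribe(self, q):
        with self._lock:
            self._subscribers.discard(q)

    def subscriber_count(self):
        with self._lock:
            return len(self._subscribers)

    def publish(self, ride):
        payload = json.dumps(ride).encode()  # marshal once, share the bytes
        with self._lock:
            for q in list(self._subscribers):
                try:
                    q.put_nowait(payload)    # never block the publisher
                except queue.Full:
                    self._subscribers.discard(q)  # slow client: drop it
```

The trade-off is the same one described above: the publisher never waits, at the price of disconnecting any subscriber whose buffer is full.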

One gotcha: the Vite dev proxy doesn't reliably forward SSE streams. The frontend EventSource connects directly to localhost:8080, not through the proxy — with CORS explicitly allowing localhost:5173.

Frontend

React 19 with react-leaflet. All state lives in App.jsx. The map has two modes — draw mode (click to add waypoints) and view mode (show posted ride, matches, shared segments, live rides). Switching between them is a single boolean.

The render layer order in JSX matters for z-index: live rides sit above the posted ride but below match routes, so they're visible over seed routes but don't obscure match labels.

Architecture Summary

View full architecture breakdown

Using Claude as a collaborator

I used Claude Code throughout — not to write code blindly, but as a pair programmer. I'd describe what I was trying to build, we'd work through the approach together, and I'd push back when something didn't fit. The PostGIS bug investigation, the SSE hub design, the color scheme iterations — all of that happened in conversation.

The architecture documentation you see in this post came out of those sessions too. Working with an AI that could hold the full context of the project across a weekend made it possible to ship all three features (shared segments, click-to-focus, live feed) in the time available.

Prize Categories

Not Applicable

Data Security Fundamentals: A Developer's Guide from Principles to Production

Faisal Dilawar — Thu, 09 Apr 2026 03:19:24 +0000

The Grim Reality

Let's start with the uncomfortable truth: data breaches aren't theoretical risks that happen to "other people or companies". They're devastating realities that have destroyed everything in their way: businesses, money, user trust. Here are four cautionary tales every developer should know.

Sony Pictures (2014): The Plain Text Disaster
Sony Pictures stored passwords and private encryption keys in plain text files and spreadsheets. Yup! When attackers gained access, they didn't need to crack anything, just open a CSV file.
The damage: Massive data exposure, embarrassing internal emails leaked publicly, and a security reputation that took years to rebuild. Estimated at over $100 million in remediation, legal fees, and lost business.
Heartbleed (2014): The Tiny Bug with Massive Impact
A minor coding error in the OpenSSL encryption library—just a missing bounds check—allowed attackers to read server memory. This meant they could extract encryption keys, passwords, and sensitive data from millions of servers worldwide.
The damage: Affected approximately 17% of all secure web servers globally (around 500,000 servers). The bug had existed for two years before discovery, meaning countless credentials and keys were potentially compromised. Companies spent millions patching systems, rotating certificates, and forcing password resets. The reputational damage to OpenSSL and affected organizations was immeasurable.
Code Spaces (2014): The Single Point of Failure
Code Spaces, a source code hosting company, stored everything—including their encryption keys—with a single cloud service provider. When an attacker gained access to their AWS console, they had complete control. The attacker deleted backups, destroyed data, and held the company hostage.
The damage: Code Spaces shut down permanently. The company couldn't recover. Their customers lost access to their repositories. Years of business building, gone in hours. This wasn't just a security failure; it was a business extinction event.
Equifax (2017): The Unpatched Vulnerability
Equifax failed to encrypt personal information for 147 million people and left a known vulnerability in the Apache Struts web framework unpatched for months after the fix was available. Attackers exploited this gap and walked away with Social Security numbers, birth dates, addresses, and driver's license numbers.
The damage: The breach cost Equifax over $1.4 billion in remediation and settlements. Their CEO resigned. The company's stock plummeted. But the real victims were the 147 million people whose personal information—data that can't be changed like a password—was permanently compromised. Identity theft risks that will follow them for life.

Why This Matters to You

If you're reading this thinking "but it hasn't happened to me", you're missing the point. These were major corporations with security budgets and dedicated InfoSec teams. They failed because somewhere in the chain, developers made architectural decisions that created vulnerabilities.

Here's the uncomfortable truth: Security isn't just for the InfoSec team.

As developers, we handle the actual data path—the flow, storage, and transformation of sensitive information. We build the doors. Every API endpoint, database connection, and file system interaction is a door we create. We're responsible for securing them properly.

Defense in Depth Starts Here

Layered security begins with our code. Network controls and firewalls are important, but they're not enough if our implementation is weak. If an attacker bypasses authentication and reaches your database, what's protecting the data? If someone gains access to your server, are your encryption keys sitting in environment variables, easily readable?

The breaches above happened because someone, somewhere, made a decision:

"Let's just put the keys in a spreadsheet for now"
"We'll patch that vulnerability next sprint"
"One cloud provider is fine, they are the best"
"Encryption is too complex, we'll add it later"

Those decisions had consequences. Your decisions will too.

Understanding the Basics: Key Terms

Before we dive into security strategies, let's establish a common vocabulary. These terms get thrown around interchangeably, but the distinctions matter.

Encryption vs. Encoding

Figure: Encryption vs Encoding

Encryption is hiding data to prevent unauthorized access. It's like placing your data behind a strong lock that requires a specific key to open.

Encoding is converting data from one format to another for system compatibility. It's transformation, not protection—anyone can decode it. e.g. Base64 encoding.
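A quick way to see why encoding is not protection: Base64 can be reversed by anyone, with no key at all. The sample message below is made up for the illustration.

```python
import base64

message = b"user=alice;note=hello"

encoded = base64.b64encode(message)  # looks scrambled, but no key is involved
decoded = base64.b64decode(encoded)  # anyone can reverse it

print(encoded)
print(decoded)
```

Encryption differs exactly here: without the key, the reverse step is not available.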

Encryption at Rest vs. In Transit

In Transit: Data moving over networks between systems. This is protected by TLS/SSL protocols during transmission—your HTTPS connections, API calls between services, database connections over the network.

At Rest: Data sitting on disk, in databases, or backup storage. This requires encryption as the final defense line—your database tables, log files, backups, cached data.

Why both matter: TLS protects data while it's moving, but once it reaches the server and gets written to disk, that protection ends. If an attacker bypasses authentication and gains access to your database files or backups, TLS does nothing to stop them from reading the data.
Network controls like firewalls aren't enough. If an attacker gets through, encryption is the last line of defense for your users' data.

The 5 Levels of Encryption Security Maturity

Figure: Security Maturity Levels Pyramid

Not all data requires the same level of protection, and not all organizations have the same operational capacity. Security is a spectrum, and understanding where you fall—and where you should fall—is critical.

Here's a broad classification of data security levels, progressing from highly insecure to advanced security postures:

Level 1: Hardcoded Keys

Keys embedded directly in source code. Highly insecure—anyone with code access has the keys.
When this might be acceptable: Temporary files, non-sensitive development data, throwaway prototypes that will never see production. Even then, it's risky.

Level 2: Environment Variables

Keys stored on-host in environment variables. Better than hardcoding, but still accessible to anyone with server access.
When this might be acceptable: Internal tools with limited access, development environments, low-sensitivity data where the risk of exposure is minimal and the impact is contained.

Level 3: Secrets Management

Centralized systems like HashiCorp Vault or AWS Secrets Manager. Keys stored separately, access controlled, audit trails maintained.
When this is necessary: Any user PII (personally identifiable information), business-critical data, anything subject to regulatory compliance (GDPR, HIPAA, PCI-DSS). This is the minimum acceptable baseline for sensitive data.

Level 4: Envelope Encryption

Data encrypted with data keys (DEKs), which are themselves encrypted by master keys (KEKs). Limits blast radius of key compromise.
When this is necessary: Financial services, healthcare records, highly regulated industries, any scenario where a single key compromise could expose massive amounts of sensitive data. Banking and fintech typically operate here.

Level 5: Zero-Trust Dynamic Keys

Keys rotated automatically, short-lived credentials, assume breach mindset. Most secure but operationally complex.
When this is necessary: Government systems, defense contractors, cryptocurrency platforms, any system where the data is so sensitive that you must assume attackers are already inside your perimeter.

The key insight: Moving up this ladder increases security but also increases operational complexity and cost. The goal is to match your security level to your actual risk profile: don't over-engineer for trivial data, and don't under-secure critical information.

Choosing Your Approach: It's Not One-Size-Fits-All

The answer to "which security level should I use?" is always: "It depends."

Security requirements vary based on:

Data sensitivity: Is this public information, internal data, or deeply personal user data?
Regulatory compliance: Are you subject to GDPR, HIPAA, PCI-DSS, or other regulations?
Threat model: Who are your adversaries? Random hackers, organized crime, nation-states?
Operational constraints: What's your team's capacity? What's your budget? What's your scale?

Key Compromise: When, Not If

Here's the hard truth about key compromise: it's not theoretical, it's a reality. It doesn't just happen to "others". Being prepared isn't optional. IT IS MANDATORY.
Your security analysis and setup must account for both the likelihood and the impact of compromise. Design systems that minimize damage even when keys are exposed.

Beyond the Single Strong Wall

Figure: Castle Defense (Multi-layered Security)

Effective security isn't a single strong wall. It should be a multi-layered mechanism requiring deep architectural thinking and continuous vigilance.

Think of medieval castle defenses: they didn't just build one massive wall and call it secure. They built multiple walls, each protecting the next. They added moats, drawbridges, gates, towers, and inner keeps. Breaching one layer didn't compromise the whole castle. More importantly, they had a plan for what to do when a breach happened.

Modern security demands the same intricate design. Each layer protects the next, and breaching one doesn't compromise the whole system. This is defense in depth:

Network firewalls (outer wall)
Authentication and authorization (the gate)
Application-level security (inner walls)
Encryption at rest (the keep where the treasure is stored)
Key management (the vault within the keep)

If an attacker gets through your firewall, your authentication should stop them. If they bypass authentication, encryption should protect the data. If they somehow get a key, envelope encryption limits what that key can decrypt.

The Sample Challenge: Building a Secure Messaging Platform

Now let's move from theory to practice. We're going to walk through a real-world scenario, making decisions and observing their effects.

The Problem Statement

You're building a secure messaging platform. Your requirements are:

End-to-End Privacy: Protect both message text and file attachments from unauthorized access at rest and in transit. Users trust you with deeply personal conversations—any leak is a total breach of that trust.

Cost-Effective Storage: Leverage AWS S3 for scalable, economical object storage while maintaining security.

High Sensitivity: Messages are deeply personal. Unlike a data breach of email addresses (bad but recoverable), a breach of private messages can affect personal lives - medical discussions, confidential business negotiations, relationship conversations.

How do you architect this system?

Understanding the Players: Advanced Key Management

Figure: Key Management Players

Before we solve this problem, we need to understand a few things that make secure encryption at scale possible.

Key Vault

A centralized key management service (AWS KMS, HashiCorp Vault) that stores and protects your most sensitive cryptographic keys with hardware security. These systems use Hardware Security Modules (HSMs)—specialized, tamper-resistant hardware designed specifically for cryptographic operations.

KEK (Key Encryption Key / Master Key)

The Key Encryption Key never leaves the vault. This is your most powerful credential—it encrypts other keys. No application code or user ever reads it. It lives in the HSM, protected by hardware-level security.

DEK (Data Encryption Key / Worker Key)

The Data Encryption Key is for single-purpose use. These are short-lived keys that do the actual work of encrypting your application data, then get discarded. Your application uses these, not the master key.

The Core Principle: Envelope Encryption

Figure: Envelope Encryption Flow

Envelope encryption ensures your master key (KEK) never touches application servers, dramatically reducing attack surface. Here's how it works:

Your application requests a DEK from the vault
The vault generates a random DEK and encrypts it with the KEK
The vault returns both the plaintext DEK and the encrypted DEK to your application
Your application uses the plaintext DEK to encrypt data
Your application stores the encrypted data alongside the encrypted DEK
Your application immediately wipes the plaintext DEK from memory
When you need to decrypt, you send the encrypted DEK back to the vault
The vault decrypts it with the KEK and returns the plaintext DEK
You decrypt your data and immediately wipe the DEK again

Why this matters: If an attacker compromises your application server, they can't decrypt old data because they don't have the KEK. They only get access to data encrypted with DEKs they can obtain after the compromise. Your historical data remains protected.
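The flow can be sketched end to end in a few lines. Everything here is an assumption made for illustration: `ToyVault` is a hypothetical in-process stand-in for a real key vault, and `_toy_cipher` is a deliberately simple SHA-256 keystream XOR that only exists so the example runs on the standard library. It is not a vetted cipher; a real system would use an authenticated cipher such as AES-GCM from a reviewed library.

```python
import hashlib
import secrets


def _toy_cipher(key, nonce, data):
    """XOR data with a SHA-256 keystream. Illustration only, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key, plaintext):
    nonce = secrets.token_bytes(16)
    return nonce + _toy_cipher(key, nonce, plaintext)


def unseal(key, blob):
    return _toy_cipher(key, blob[:16], blob[16:])


class ToyVault:
    """Hypothetical vault: the KEK is created here and never leaves this object."""

    def __init__(self):
        self._kek = secrets.token_bytes(32)

    def generate_data_key(self):
        dek = secrets.token_bytes(32)
        return dek, seal(self._kek, dek)       # plaintext DEK + encrypted DEK

    def decrypt_data_key(self, encrypted_dek):
        return unseal(self._kek, encrypted_dek)


def encrypt_message(vault, message):
    dek, encrypted_dek = vault.generate_data_key()
    ciphertext = seal(dek, message)
    del dek  # drop the reference; true wiping needs a mutable buffer
    # Only the ciphertext and the *encrypted* DEK are stored together.
    return {"ciphertext": ciphertext, "encrypted_dek": encrypted_dek}


def decrypt_message(vault, record):
    dek = vault.decrypt_data_key(record["encrypted_dek"])
    return unseal(dek, record["ciphertext"])
```

Note what the stored record contains: nothing in it is usable without a round trip to the vault.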

The Data Encryption Lifecycle

Figure: Data Encryption Lifecycle (Encryption Flow)
Request DEK → Vault generates & encrypts DEK with KEK → Returns plaintext + encrypted DEK → Encrypt data with plaintext DEK → Store encrypted data + encrypted DEK → Wipe plaintext DEK from memory

Figure: Data Decryption Lifecycle (Decryption Flow)
Retrieve encrypted data + encrypted DEK → Send encrypted DEK to vault → Vault decrypts with KEK → Returns plaintext DEK → Decrypt data → Wipe plaintext DEK from memory

Developer responsibility: The "wipe" step is critical. You must ensure plaintext keys don't linger in memory, logs, or error messages. A key accidentally logged during an error is a key that's compromised. Memory dumps during crashes can expose keys. Proper key hygiene is non-negotiable.
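As a best-effort illustration of the wipe step, the sketch below keeps the key in a mutable `bytearray` and zeroes it when the work is done, even if an exception is raised. The helper name `ephemeral_key` is hypothetical. In a garbage-collected runtime like Python this cannot guarantee that no other copy exists (the immutable `bytes` passed in, for example, is untouched), so treat it as hygiene rather than a guarantee.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_key(key_bytes):
    """Yield a mutable copy of the key and zero it on exit, even on error."""
    buf = bytearray(key_bytes)
    try:
        yield buf
    finally:
        for i in range(len(buf)):
            buf[i] = 0  # overwrite in place instead of waiting for GC


# Usage: the key is only meaningful inside the block.
with ephemeral_key(b"\x01" * 32) as dek:
    key_length = len(dek)  # ... encrypt or decrypt with dek here ...
```

The same discipline applies to logs and error messages: never format the key into a string.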

Key Rotation: The Mandatory Refresh Cycle

Figure: Key Rotation Comparison

Keys have lifespans. The longer a key exists, the more opportunities an attacker has to compromise it. Rotation limits credential lifespan—if a key is compromised today, rotation ensures it becomes useless tomorrow.

KEK Rotation

Handled by: Vault infrastructure

Frequency: Annually or on compromise

Impact: Transparent to applications—the vault handles re-encryption of all DEKs internally.

DEK Rotation

Handled by: Application code

Frequency: 30-90 days recommended

Impact: Requires re-encrypting data with new keys, tracking old keys for decryption

DEK rotation is more complex. You need to:

Generate new DEKs
Re-encrypt data with the new DEKs
Keep old DEKs available for decrypting data that hasn't been re-encrypted yet
Track which DEK encrypted which data
Eventually phase out old DEKs once all data is re-encrypted.

Situation 1: Low Scale Foundation (~1,000 messages/day)

Figure: Situation 1 Architecture (Low Scale)

You're just launching. You have about 1,000 messages per day. How do you architect encryption?

The Problem

Minimize blast radius—each compromised key should expose minimal data. If an attacker gets one key, you want them to decrypt as few messages as possible.

The Solution

Generate a unique DEK per message. Store the encrypted DEK in S3 metadata alongside the encrypted content.

Here's the flow:

User sends a message
Your application requests a DEK from the vault
Encrypt the message with the DEK
Encrypt the DEK with the KEK (vault does this)
Store the encrypted message in S3
Store the encrypted DEK in the S3 object's metadata
Wipe the plaintext DEK from memory

Why this works: If a single DEK is compromised, only one message is exposed. The blast radius is minimal.

The New Problem

This works beautifully... until it doesn't.

Your app goes viral. Suddenly you're at 10,000 messages per day. Then 100,000. Each message requires a vault API call to generate a DEK. Vault services charge per API call.

At 1,000 messages daily, the cost is negligible—maybe $10/month. But at 100,000 messages per day, you're making 3 million vault API calls per month. Your security bill is now $3,000/month and climbing. And you're hitting API rate limits that throttle your application's performance.

Your security architecture that was perfect at low scale is now a liability.

Situation 2: Scaling the Wall (1,000 requests/second)

Figure: Situation 2 Architecture (Scaling)

You're successful. You're now handling 1,000 requests per second. That's 86.4 million messages per day.

The Problem

1,000 req/sec creates massive vault bills and API rate limits that throttle performance. The per-message DEK approach is financially and operationally unsustainable.

The Solution: The Pragmatism Pivot

Cache a single DEK for 1-hour windows. All messages sent within that hour share one key—dramatically reducing vault calls.

Instead of 86.4 million vault calls per day, you make 24. Your vault bill drops from $86,000/month to $2/month. Throttling disappears.

This is the "juice vs. squeeze" decision in action. You're trading perfect security (one key per message) for operational feasibility (one key per hour).
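One way to picture the hourly cache is a small wrapper that only calls the vault when the clock crosses into a new window. This is a sketch under assumptions: `fetch_dek` is a hypothetical callable that asks the vault for a fresh DEK, and the injectable `clock` exists only to make the behaviour easy to test.

```python
import threading
import time


class WindowedDekCache:
    """Reuse one DEK per time window; fetch a new one when the window changes."""

    def __init__(self, fetch_dek, window_seconds=3600, clock=time.time):
        self._fetch_dek = fetch_dek
        self._window_seconds = window_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._window_id = None
        self._dek = None
        self.vault_calls = 0

    def get(self):
        """Return (window_id, dek) for the current window."""
        window_id = int(self._clock() // self._window_seconds)
        with self._lock:
            if window_id != self._window_id:
                self._dek = self._fetch_dek()   # the only vault round trip
                self._window_id = window_id
                self.vault_calls += 1
            return self._window_id, self._dek
```

Storing the returned `window_id` next to each message is one way to remember which key encrypted it.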

The New Problem: Blast Radius

Your blast radius just exploded. If a single hourly key is compromised, an attacker can decrypt every message sent during that hour.

Before: 1 compromised key = 1 message exposed
Now: 1 compromised key = 3.6 million messages exposed
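The volume figures behind this trade-off are easy to recompute. The helper names below are made up for the sketch, and prices are left out on purpose because they depend on the vault provider.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def vault_calls_per_day(messages_per_second, key_window_seconds=None):
    """One call per message, or one call per key window when keys are cached."""
    if key_window_seconds is None:
        return messages_per_second * SECONDS_PER_DAY
    return SECONDS_PER_DAY // key_window_seconds


def messages_per_key(messages_per_second, key_window_seconds=None):
    """Blast radius: how many messages a single compromised DEK exposes."""
    if key_window_seconds is None:
        return 1
    return messages_per_second * key_window_seconds


per_message_calls = vault_calls_per_day(1000)        # 86,400,000
hourly_calls = vault_calls_per_day(1000, 3600)       # 24
hourly_blast_radius = messages_per_key(1000, 3600)   # 3,600,000
```

The two numbers move in lockstep: the factor by which vault calls shrink is the factor by which the blast radius grows.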

Figure: Blast Radius Comparison

Is this acceptable? It depends on your operational capacity and the kind of data you are working with.

Rotation Cost Analysis

Key rotation becomes complex. If you need to rotate a compromised hourly key, you must:

Identify every message encrypted with that key
Re-encrypt 3.6 million messages
Do this without taking your service offline

Without proper indexing, identifying which S3 objects used which key becomes a nightmare, because S3 has no efficient way to search objects by metadata.

This is where architectural decisions start cascading into other systems.

Situation 3: The Searchability Trap (Massive Scale)

Figure: Situation 3 Architecture (Massive Scale with Mapping)

You're now at massive scale. Millions of users, billions of messages. One day, you detect suspicious activity. A DEK might be compromised.

The Problem: Incident Response Paralysis

A DEK is compromised, but S3 metadata isn't searchable at scale. How do you quickly identify which files need re-encryption?

You can't iterate through billions of S3 objects checking metadata. That would take days or weeks, so you can't rotate the key quickly either. During that time, the compromised data remains vulnerable.

The Solution: Mapping Infrastructure

Build a database table linking S3 object paths to their DEK identifiers, enabling rapid queries during security incidents.

message_encryption_map
- message_id (primary key)
- s3_object_path
- dek_id
- encrypted_at (timestamp)
- key_rotation_status

Now when a DEK is compromised, you can query: "Give me all messages encrypted with DEK-12345" and get instant results. You can prioritize re-encryption, track progress, and complete the rotation in hours instead of weeks.
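To make the lookup concrete, here is a small sketch of that table using SQLite from the Python standard library. The column names come from the listing above; the column types, the index name, the status values ('current', 'pending'), and the sample rows are assumptions for the example, and a production system would use whichever database it settles on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE message_encryption_map (
    message_id          TEXT PRIMARY KEY,
    s3_object_path      TEXT NOT NULL,
    dek_id              TEXT NOT NULL,
    encrypted_at        TEXT NOT NULL,
    key_rotation_status TEXT NOT NULL DEFAULT 'current'
);
-- The index is what turns "which objects used this key?" into a fast lookup.
CREATE INDEX idx_map_dek_id ON message_encryption_map (dek_id);
""")

conn.executemany(
    "INSERT INTO message_encryption_map "
    "(message_id, s3_object_path, dek_id, encrypted_at) VALUES (?, ?, ?, ?)",
    [
        ("m1", "messages/m1.bin", "DEK-12345", "2026-04-01T10:00:00Z"),
        ("m2", "messages/m2.bin", "DEK-12345", "2026-04-01T10:05:00Z"),
        ("m3", "messages/m3.bin", "DEK-67890", "2026-04-01T11:00:00Z"),
    ],
)
conn.commit()


def objects_for_dek(conn, dek_id):
    """List the S3 paths of every message encrypted with the given DEK."""
    rows = conn.execute(
        "SELECT s3_object_path FROM message_encryption_map "
        "WHERE dek_id = ? ORDER BY message_id",
        (dek_id,),
    )
    return [row[0] for row in rows]


def mark_pending(conn, dek_id):
    """Flag every row for a compromised DEK so rotation progress can be tracked."""
    cur = conn.execute(
        "UPDATE message_encryption_map SET key_rotation_status = 'pending' "
        "WHERE dek_id = ?",
        (dek_id,),
    )
    conn.commit()
    return cur.rowcount
```

During an incident, `objects_for_dek(conn, "DEK-12345")` gives the work list and the status column shows how much of it is left.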

The New Problem: Database Selection

Which database handles 1,000 writes/sec during rotation without incurring prohibitive I/O costs?

Rotation cost: High I/O expenses for scanning or bulk-updating mappings across millions of records. You're now spending significant engineering time and infrastructure cost just to maintain the ability to rotate keys.

Every DB comes with its own pros and cons:

PostgreSQL: Great for complex queries, but write-heavy workloads at this scale get expensive.
DynamoDB: Optimized for high-throughput writes, but limited query flexibility.
Cassandra: Excellent for write-heavy workloads and horizontal scaling, but operationally complex to manage.

The Broader Implications: Advanced Data Management

Notice how a security decision (key rotation requirements) has now forced you to make data architecture decisions. A few examples:

Database selection: Evaluating PostgreSQL vs. DynamoDB vs. Aurora for different workloads
Leveraging S3: Exploring S3 tables for analytics, cold storage, and data lake integration
Archiving strategies: Designing efficient methods for archiving data from PostgreSQL to S3 while maintaining integrity and accessibility
Hybrid approaches: Considering hybrid data storage solutions to balance performance, cost, and security
Data lifecycle management: Implementing processes for cleaning up PostgreSQL records after corresponding object deletions to ensure consistency
Object updates: Addressing the complexities of updating encrypted objects and their associated key metadata
Search limitations: Strategies for restricted searchability on encrypted data without compromising end-to-end encryption principles

Security isn't isolated from the rest of your architecture. Your encryption strategy ripples through your entire data management approach. This is why security decisions need to be made early and with full awareness of their downstream implications.

The Nuclear Option: KEK Compromise

Figure: KEK Compromise Impact Visualization

Let's talk about the worst-case scenario: your master key (KEK) gets compromised.

Why This Matters

Remember, the KEK encrypts all your DEKs. If an attacker gets the KEK, they can decrypt every DEK you've ever created. Every message, every file, every piece of encrypted data in your system is now exposed.

How This Could Happen

KEKs are stored in hardened vaults with HSM backing, but compromise is still possible through an insider threat, a vault provider breach, misconfiguration, or even a supply chain attack.

The Recovery Process

Detect the compromise: Hopefully through monitoring and audit logs, not through data showing up on the dark web
Generate a new KEK: The vault creates a fresh master key
Re-encrypt every DEK: Every single DEK in your system must be re-encrypted with the new KEK
Rotate all DEKs: Since the old KEK was compromised, you can't trust any DEK it encrypted
Re-encrypt all data: Every message, every file, everything must be re-encrypted with new DEKs

The Cost

Computational resources: Re-encrypting billions of objects requires massive compute. You're spinning up hundreds of workers, running them for days, weeks, or even months.

Storage I/O: Reading and writing billions of objects generates enormous I/O costs. S3 charges for requests, and you're making billions of them.

Engineering time: Your entire team drops everything to manage this crisis. Weeks or months of productivity lost.

Downtime: Depending on your architecture, you might need to take services offline or operate in degraded mode during re-encryption.

Business impact: Users can't access messages during re-encryption. Customer support is overwhelmed. Trust is shattered.

Total cost: Depending on your scale, the direct cost (compute, storage, engineering time) could run into the millions, in addition to lost business and reputational damage.

The Permanent Damage

Even after spending all this money and effort, the data that was accessed during the compromise is gone. If an attacker extracted messages before you detected the breach, those messages are compromised forever. No amount of money or engineering effort can undo that.

Why We Pay for Hardened Vaults

This catastrophic scenario explains why enterprise-grade vaults with HSM backing command premium pricing. The cost of the vault is insurance against the cost of KEK compromise.

A vault bill in the thousands of dollars seems expensive until you compare it to the millions in recovery cost plus permanent reputational damage.

Strategic Considerations: Your Security Cheat Sheet

After walking through the messaging platform evolution, here are the key principles to guide your security decisions:

1. Prepare for Eventualities

What happens if a key is compromised? What if data is exposed? Do you need recovery capabilities? Plan for worst-case scenarios.

Don't just have a theoretical incident response plan. Actually test it. Can you execute a key rotation under pressure? Do you have the infrastructure to re-encrypt data quickly? Have you practiced the runbook?

2. Define Blast Radius

How much damage is acceptable during a breach? Limit the scope of potential compromise.

Design your system so that the attacker needs to work for every piece of data.

3. Runbooks Are Vital

Avoid "headless chicken" mode during incidents. Document response procedures, rotation steps, and recovery processes.

Your runbook should include:

How to detect a compromise
Who to notify and in what order
Step-by-step rotation procedures
Scripts and tools for bulk operations
Communication templates for users
Post-incident review process

Test your runbook regularly. A runbook that's never been executed is just wishful thinking.

4. Think Like a Thief

Adopt an attacker's perspective. How would you break into your own system? Where are the weak points?

Conduct threat modeling exercises:

What's the most valuable data in your system?
What's the easiest way to access it?
What would you do if you compromised a developer's laptop?
What if you got access to the production database?
What if you social-engineered your way into the vault?

Find your vulnerabilities before attackers do.

5. Pragmatism: Juice vs. Squeeze

Don't over-engineer for non-sensitive data. Don't destroy SLAs with complexity. Don't build unfeasible solutions. Balance security with operational reality. Temp files don't need envelope encryption. User passwords do.

6. The Security Baseline

For any sensitive data, start at Level 3 minimum (Centralized Secrets Management). Anything lower requires documented justification.
"It's too complex" isn't a justification. "We don't have time" isn't a justification. "It's too expensive" might be, but you need to quantify the cost of the security measure vs. the cost of a breach.

Conclusion: Security as an Ongoing Conversation

You now have the framework to make informed security decisions. You understand the fundamentals, the maturity levels, the trade-offs, and the real-world implications of your choices.

But here's the final truth: security is never "done."

Security is an ongoing conversation between architecture and operational reality. The "perfect" system today might be your biggest vulnerability in two years.

Your job as a developer isn't to achieve perfect security—it's to make informed trade-offs, build defense in depth, plan for compromise, and continuously adapt as your system evolves.

You own the data path. You build the doors. Lock them well, but know that locks can be picked. Build multiple doors, multiple locks, and have a plan for when someone gets through.

Make better decisions.

About the Author: Faisal Dilawar is a Lead Technology Consultant at Technogise with experience building secure, scalable systems.

Investigating Performance Issues In A Library project

Faisal Dilawar — Tue, 07 Apr 2026 10:06:47 +0000

Part 2 of 2: This piece covers library projects. Part 1 covers deployed applications and services, which come with a different set of constraints.

The Fundamental Difference

In Part 1, we talked about investigating performance in a deployed system: one where we control the runtime, have monitoring, and can trace requests end to end.

Libraries are different beasts altogether. We ship code. Someone else runs it.

We don't control the thread pool size, the hardware, or how many times our function gets called. We don't have dashboards. Usually we don't have logs. And the person filing the bug report often says something along the lines of "your library is slow", with no reproducible scenario, no profiler output, and no context about how they're using it.

This is the library performance problem. And it requires a different mindset.

Don't try to fix it....

Before we go any further, I want to put this out there: if you don't own the library code, have no access to an SME, and on top of that have no access to production data such as logs and monitoring, don't attempt to fix the performance issue. You will most probably fail to find and fix the root cause.
If you are under pressure and have to stop the bleeding without those tools, this article won't help you. Say a prayer, start debugging blindly, and hope you find a band-aid that stops the immediate bleeding.
In this article I will mention a few conditions where it's better to stop and ask for more details.

Where Most People Go Wrong

Just as in Part 1, the instinct is to open the codebase and start looking for "obviously slow" things. Maybe there's an allocation in a hot loop. Maybe a regex is being compiled on every call. You find something, fix it, release a patch, and close the issue (and you forgot to say a prayer this time).

Two weeks later, the user says it's still slow.

What happened? You probably optimized a piece of code that wasn't the bottleneck in their specific usage pattern. Your benchmark showed improvements. Their workload did not.

The trap is the same as Part 1 — you acted on intuition instead of data. But in a library, the data is harder to get,
which makes the trap easier to fall into.

The Prejudice Problem (Library Edition)

The same trap from Part 1 applies here, but with an extra layer: you're tempted to assume the problem is in the client's
code, not yours.

"They must be calling it wrong." "They're not reusing the object." "Their environment is misconfigured."

Sometimes that's true. But start with the assumption that the problem is real and in your library. Prove otherwise with
data.

Before You Start: The 5 Things You Need (Library Edition)

These are different from Part 1. Some overlap, but the constraints change what's actually achievable.

1. A clear problem statement from the reporter
"Your library is slow" is not actionable. You need: Which API? What input size? What does slow mean: latency, throughput, memory? Push back until you have specifics. A good problem statement is the foundation of everything that follows.
If a clear problem statement is not available, don't proceed.

2. A reproducible scenario you control
Unlike Part 1, you probably can't look into someone else's production environment. You need to build the scenario yourself: a benchmark or test that demonstrates the reported problem under controlled conditions. If you can't reproduce it, you can't fix it and you can't verify the fix. Building it yourself is always better than asking users to test your changes and report back on whether they worked.
It's not a blocker, but it is vital for having confidence in your fix without resorting to gut feeling.

3. Understanding of your own library's design
This sounds obvious. It isn't. Libraries accumulate complexity. The person investigating may not be the original author.
Know the hot paths: the APIs that get called most frequently, the ones that process large inputs, the ones that are called in loops. These are your candidates.
Here an SME can be really helpful.

4. Knowledge of common usage patterns
You don't control how clients use your library, but you can study it. If possible, look at your documentation examples, your issue tracker, and your GitHub discussions. How do people actually call your APIs? What input sizes are typical? What do they call in loops? This shapes where you look.
This usually reduces your debugging time.

5. Defined performance targets
Same as Part 1: "fast" is not a target. Define what acceptable looks like: throughput at a given input size, memory allocation per operation, latency at P99. Without this, you can't tell whether you have achieved your target.
This will be your goalpost.

Once you have these, several other things become discoverable:

Typical input characteristics: size, shape, edge cases. A library that handles 1KB payloads efficiently may fall apart at 100MB.
Call frequency patterns: is your API called once at startup or thousands of times per second in a hot loop? (A hot loop is a heavily executed block of code that repeats rapidly, so even tiny inefficiencies multiply into significant bottlenecks.) The answer changes what matters. As in Part 1, we don't worry much about a single call at startup.
Runtime environment assumptions: JVM version, GC settings, available memory. You can't control these, but you can document what you've tested against and what you assume. It also helps to document known issues with specific runtime environments.

The First Thing You Build: A Reproducible Benchmark

Before touching any code, build a benchmark that demonstrates the problem, as discussed in the prerequisites.

This is your equivalent of the reproducible scenario from Part 1 — but in a library context, it's entirely your
responsibility to construct. The reporter won't hand it to you.

A good benchmark answers:

Which API are we measuring? (e.g. Parser.parse(input))
With what input? (e.g. a 10MB JSON document)
Under what call pattern? (e.g. called 1,000 times in a loop)
What does passing look like? (e.g. throughput > 500 ops/sec)

Use a proper benchmarking tool — JMH for Java, timeit/pytest-benchmark for Python.

Hot Tip: Warm up the runtime before measuring. JIT compilers, class loaders and caches all affect early measurements. You would be surprised how skewed your benchmark will be otherwise.
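
To make the warm-up advice concrete, here is a minimal sketch using only the JDK. It is not a substitute for JMH, which handles many more pitfalls. The class name, the method name, and the string-splitting workload are all made up for illustration; the workload merely stands in for a call such as Parser.parse(input).

```java
import java.util.function.Supplier;

/** Hand-rolled benchmark loop; a sketch only, JMH is the better tool for real measurements. */
final class WarmupBenchmarkSketch {

    // Written on every call so the JIT is less likely to discard the workload's result.
    static volatile Object sink;

    /**
     * Runs the workload warmupCalls times without timing it, then times
     * measuredCalls further calls and returns the throughput in ops/sec.
     */
    static double measureOpsPerSecond(Supplier<?> workload, int warmupCalls, int measuredCalls) {
        for (int i = 0; i < warmupCalls; i++) {
            sink = workload.get();          // warm-up phase: JIT, class loading, caches
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredCalls; i++) {
            sink = workload.get();          // measured phase
        }
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return measuredCalls * 1_000_000_000.0 / elapsedNanos;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for a library call such as Parser.parse(input).
        String input = "42,".repeat(10_000);
        Supplier<Integer> workload = () -> input.split(",").length;

        double opsPerSecond = measureOpsPerSecond(workload, 1_000, 1_000);
        System.out.printf("throughput: %.0f ops/sec%n", opsPerSecond);
    }
}
```

Running the same measurement with warmupCalls set to 0 is a quick way to see how much the early, unoptimized calls skew the result on your machine.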

The Investigation Process

Step 1 — Again Reproduce First, Theorize Later

Run your benchmark. Confirm the problem exists under controlled conditions.

If you can't reproduce it, you have three options:

Go back to the reporter and get more detail about their environment and usage pattern
Expand your benchmark to cover more scenarios until you find the one that triggers it
Don't proceed with optimization.

Do not skip this step. Do not start reading code looking for problems until you have a benchmark that shows the problem.
Otherwise you're optimizing in the dark.

Step 2 — Profile, Don't Guess

Once you can reproduce the problem, profile it. Don't read the code — profile it.

Attach a profiler to your benchmark run and look at where time is actually spent, for example JFR (Java Flight Recorder) for Java/Kotlin, or py-spy or cProfile for Python.

What you're looking for is a flame graph (a visual representation of the call stack in which the width of each block shows how much CPU time a function and its children consumed) or a call tree that shows you which functions consume the most time. The thing you thought was slow may not be. The thing you never suspected could be.

Figure 1: Flame Graph

Step 3 — Identify the Hot Path in Your Library

From the profiler output, identify which internal functions are on the critical path. These are the ones worth optimizing.

Ask:

Is the time in your code, or in a dependency you're calling?
Is it CPU time (computation) or wall time (waiting on I/O, locks, or allocations)?
Is it one slow call, or many fast calls that add up?

The last one is very common in libraries. A single call to your API might look fine. But if the client calls it 10,000 times per second, 50 microseconds of allocation overhead per call adds up to 500ms of GC pressure every second. (GC pressure is the performance penalty caused by the garbage collector frequently pausing the application to clean up a high volume of rapidly created, short-lived objects.)
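
As a quick check of that arithmetic, the calculation can be written out as below. The class and method names are invented for this sketch; only the two input numbers come from the text.

```java
/** Back-of-the-envelope arithmetic for per-call overhead at a given call rate. */
final class PerCallOverhead {

    /** Total overhead, in milliseconds, accumulated during one second of calls. */
    static double overheadMillisPerSecond(long callsPerSecond, double microsPerCall) {
        return callsPerSecond * microsPerCall / 1_000.0;
    }

    public static void main(String[] args) {
        // The numbers from the text: 10,000 calls/sec at 50 microseconds each.
        System.out.println(overheadMillisPerSecond(10_000, 50)); // 500.0
    }
}
```

In other words, half of every second goes to overhead that no single call would ever reveal.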

Step 4 — Categorize the Bottleneck

Same categories as Part 1, but with library-specific nuances:

CPU-bound: Heavy computation per call. Common in parsing, serialization, cryptography, compression. Look for algorithmic improvements — better data structures, avoiding redundant work, caching computed results.
Allocation / GC pressure: Creating too many short-lived objects. This is the most common library performance problem. The client pays the GC cost, not you. Look for object pooling, reusable buffers, or returning primitives instead of boxed types.
I/O-bound: Less common in pure libraries, but relevant if your library wraps file, network, or database access. Look at whether you're doing unnecessary I/O or whether async patterns would help.
Concurrency / thread safety overhead: If your library uses locks to be thread-safe, those locks may be contention points under concurrent load. Look at whether the locking granularity is appropriate, or whether lock-free structures are viable.
Initialization cost amortization: Some libraries do expensive work at construction time (loading configs, compiling regexes, building lookup tables) so that every subsequent call is much faster. If clients construct your objects in a loop instead of reusing them, the fix might be documentation rather than code, or making the expensive object clearly reusable.
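
The initialization-cost category is the easiest to show in code. The sketch below contrasts a regex that is recompiled on every call with one compiled once and reused. The validator class and the order-ID format are hypothetical; the JDK behavior it relies on (String.matches compiling the pattern on each call, Pattern being immutable and thread-safe) is standard.

```java
import java.util.regex.Pattern;

/** Contrasts per-call initialization with a one-time, reusable setup. */
final class OrderIdValidator {

    // Hypothetical format for illustration: "ORD-" followed by six digits.
    private static final String ORDER_ID_REGEX = "ORD-\\d{6}";

    // Compiled once when the class loads; Pattern is immutable and thread-safe.
    private static final Pattern ORDER_ID = Pattern.compile(ORDER_ID_REGEX);

    /** Slow variant: String.matches compiles the regex again on every call. */
    static boolean isValidRecompiling(String candidate) {
        return candidate.matches(ORDER_ID_REGEX);
    }

    /** Amortized variant: reuses the precompiled pattern. */
    static boolean isValid(String candidate) {
        return ORDER_ID.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("ORD-123456"));            // true
        System.out.println(isValidRecompiling("ORD-123456")); // true
        System.out.println(isValid("ORD-12"));                // false
    }
}
```

Both variants return the same answers, which is exactly why this kind of cost goes unnoticed until a profiler shows the compile step on the hot path.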

Step 5 — Validate Before You Fix

Same discipline as Part 1. Before writing a fix:

Can your benchmark reproduce the problem consistently?
Can you explain why this specific thing is causing the slowness?
Does the profiler output support it?

If yes to all three — fix it. If not, keep profiling.

One extra check for libraries: make sure the fix doesn't break correctness. Performance optimizations in libraries could
involve caching, mutability, or reduced copying — all of which can introduce subtle bugs. Your fix needs to pass the full
test suite, not just the benchmark.

Step 6 — Verify and Document

Run your benchmark again after the fix. Measure the delta. Does it match your expectation?

Then document it:

What was the problem?
What was the fix?
What input sizes and call patterns does the improvement apply to?
Are there any trade-offs? (e.g., higher memory usage for better throughput)

This matters because library users need to understand when they'll see the benefit. A fix that helps at 10MB inputs may
not matter at 1KB inputs. Be honest and realistic about the scope.

Getting Closer to Production Visibility (Optional, But Powerful)

One of the hardest parts of library performance work is that you're investigating blind. The client has the production
environment. You have a benchmark. There's a gap between those two things, and that gap is where a lot of investigations
stall.

There are a few ways to close it.

Build optional diagnostic logging into your library.
Most logging frameworks support a concept of named loggers at configurable levels. If your library uses one (like SLF4J in Java) clients can enable debug-level output from your library without changing your code. Use this. Log things that matter for performance: input sizes, time spent in expensive operations, cache hit/miss rates, retry counts. Keep it off by default. But make it easy to turn on.
When a client reports a performance issue, your first ask can be: "Can you enable debug logging for our library and share the output?" That single step can replace hours of guessing.
Expose timing hooks or callbacks.
Some libraries go further and expose explicit instrumentation hooks: callbacks or interfaces that clients can implement to receive timing data. This lets clients pipe your library's internal timings directly into their existing monitoring system, the same dashboards they use for everything else. You get visibility into their production environment without needing access to it. They get metrics without having to instrument your code themselves. Something like:

library.setMetricsListener(event -> {
    myMonitoringSystem.record(event.operationName(), event.durationMs());
});
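
On the library side, a hook like the one above might be wired up roughly as follows. This is a sketch under assumptions: MetricsEvent, MetricsListener, InstrumentedLibrary, and the sum operation are invented names, and only setMetricsListener, operationName(), and durationMs() mirror the client snippet.

```java
import java.util.ArrayList;
import java.util.List;

/** Timing data handed to the client; accessor names follow the client snippet. */
final class MetricsEvent {
    private final String operationName;
    private final long durationMs;

    MetricsEvent(String operationName, long durationMs) {
        this.operationName = operationName;
        this.durationMs = durationMs;
    }

    String operationName() { return operationName; }
    long durationMs() { return durationMs; }
}

/** Callback the client implements, typically as a lambda. */
@FunctionalInterface
interface MetricsListener {
    void onEvent(MetricsEvent event);
}

/** Hypothetical library entry point that reports how long each operation took. */
final class InstrumentedLibrary {
    private static final MetricsListener NO_OP = event -> { };

    // No-op by default, so clients who never register a listener pay almost nothing.
    private volatile MetricsListener listener = NO_OP;

    void setMetricsListener(MetricsListener newListener) {
        if (newListener == null) {
            listener = NO_OP;
        } else {
            listener = newListener;
        }
    }

    /** Example operation: sums the input and reports its own duration. */
    long sum(List<Integer> values) {
        long start = System.nanoTime();
        try {
            long total = 0;
            for (int value : values) {
                total += value;
            }
            return total;
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            listener.onEvent(new MetricsEvent("sum", elapsedMs));
        }
    }

    public static void main(String[] args) {
        InstrumentedLibrary library = new InstrumentedLibrary();
        List<String> recorded = new ArrayList<>();
        library.setMetricsListener(event ->
                recorded.add(event.operationName() + " took " + event.durationMs() + "ms"));

        System.out.println(library.sum(List.of(1, 2, 3))); // 6
        System.out.println(recorded.size());               // 1
    }
}
```

A real library would probably also guard against a listener that throws or blocks, so that a client's monitoring bug cannot break or slow down the library call itself.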

Provide a built-in diagnostic mode (optional but useful).

A step beyond logging: a mode that, when enabled, collects and reports a structured summary of what the library did —
operations performed, time spent, allocations made, retries triggered. Think of it as a flight recorder. The client runs
their workload with diagnostic mode on, exports the report, and sends it to you.

This is more work to build, but for libraries where performance is a core concern, it's worth it. It's the closest thing
you'll get to having your own monitoring in someone else's production.
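
A diagnostic mode of this kind could start out as small as the sketch below: a recorder that is off by default and, once enabled, accumulates call counts and total time per operation. DiagnosticRecorder and its methods are hypothetical names, and a production version would likely track more (allocations, retries) and export a structured format rather than plain text.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

/** Opt-in recorder that accumulates per-operation call counts and total time. */
final class DiagnosticRecorder {

    private static final class Stats {
        final LongAdder calls = new LongAdder();
        final LongAdder nanos = new LongAdder();
    }

    private volatile boolean enabled;   // off by default
    // Sorted and thread-safe, so the report has a stable order.
    private final Map<String, Stats> statsByOperation = new ConcurrentSkipListMap<>();

    void enable() {
        enabled = true;
    }

    /** Called by library internals after each operation; does nothing when disabled. */
    void record(String operation, long elapsedNanos) {
        if (!enabled) {
            return;
        }
        Stats stats = statsByOperation.computeIfAbsent(operation, key -> new Stats());
        stats.calls.increment();
        stats.nanos.add(elapsedNanos);
    }

    /** Plain-text summary the client can attach to a bug report. */
    String report() {
        StringBuilder out = new StringBuilder();
        statsByOperation.forEach((operation, stats) -> out
                .append(operation)
                .append(": calls=").append(stats.calls.sum())
                .append(", totalMs=").append(stats.nanos.sum() / 1_000_000)
                .append('\n'));
        return out.toString();
    }

    public static void main(String[] args) {
        DiagnosticRecorder recorder = new DiagnosticRecorder();
        recorder.enable();
        recorder.record("parse", 2_000_000);
        recorder.record("parse", 4_000_000);
        recorder.record("retry", 1_000_000);
        System.out.print(recorder.report());
        // parse: calls=2, totalMs=6
        // retry: calls=1, totalMs=1
    }
}
```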

The key principle: you can't add monitoring to a client's production environment, but you can make your library observable
enough that the client can do it for you. Design for observability from the start — it's much harder to retrofit.

The Unique Challenge: You Can't See Their Production

Figure 2: Production environment is a black box for library project

The hardest part of library performance work is that you're always working with incomplete information. The reporter's
production environment is a black box.

A few things that help:

Ask for a heap dump or profiler output from their side. Even a rough flame graph from their environment is worth more than your best guess.
Provide a diagnostic mode or logging hooks. This is especially valuable for intermittent issues you can't reproduce.
Test against a range of environments. Different JVM versions, GC algorithms, and OS schedulers behave differently.
Be explicit about your performance contract. Document what you've benchmarked, under what conditions, and what the expected characteristics are.

Summary

Library performance investigation is harder than service performance investigation because you don't own the runtime. But
the discipline is the same: follow the data, not your gut.

The process:

Get a clear problem statement — which API, what input, what "slow" means
Build a reproducible benchmark before touching any code
Profile the benchmark — don't read code looking for problems
Identify the hot path from profiler output
Categorize the bottleneck type
Validate your hypothesis before fixing
Verify the fix with the benchmark
Document the improvement, its scope, and any trade-offs

The mindset shift from Part 1: you can't observe production, so your benchmark and profiler are your only sources of truth. Invest in making them accurate.

Part 1 covers the same topic for deployed services — where you have monitoring, distributed tracing, and control over the runtime.

Investigating Performance Issues In An Existing System: 101

Faisal Dilawar — Sat, 14 Mar 2026 21:10:50 +0000

Part 1 of 2 — This piece covers deployed applications and services. Part 2 covers library projects, which come with a different set of constraints.

Where Most People Go Wrong

You join a project and someone says, "We also have performance issues." Where do you look first? Someone files a ticket: "The system feels slow." Then someone comes in and asks whether you have looked at the database connection pool settings, tweaked thread counts, adjusted timeout configs, or optimized queries. Sounds familiar?

Two weeks later, latency dropped by 5%. Everyone claps. But the system still feels slow.

Figure 1: "If you torture the data long enough, it will confess to anything." — Ronald Coase

What happened? We, as developers, walked in with a theory and found evidence to support it. The DB query was slightly inefficient. We also increased thread counts just in case. Maybe we even added some resources. Fixing these did help a little. But the real bottleneck was a missing cache on the most-used workflow, which would have taken two days to fix.

This is the trap. And it's remarkably easy to fall into — even for experienced engineers.

The goal of this article is to give you a systematic approach so you're following data, not intuition. It's not a playbook but more of a starting point. Performance issues are almost always unique. And no system is perfect. You can always find small issues in every system. But fixing them may not yield the desired results.

Figure 2

Before You Start: These Are The 5 Things You Absolutely Need

You cannot do a meaningful performance investigation without these. If any are missing, get them first — otherwise you're guessing in the dark.

1. Access to the codebase
You need to be able to trace execution paths, not just read dashboards. Dashboards tell you that something is slow. The code tells you why.

2. A monitoring system
Even basic metrics (request latency, error rate, CPU usage) are non-negotiable for a deployed service. Without them, you're navigating blind. (For libraries, this is different; we cover that in Part 2.)
If monitoring is not in place, as is the case in some systems, create it. You need concrete proof of what your changes achieved. You may even have made things worse. A monitoring system is the mirror that tells you the truth about your changes.

3. Understanding of the codebase, or access to a subject matter expert (SME)
This is the one people underestimate most. You cannot optimize code or fix a system you don't understand. If it's not your codebase, find the person who knows it and treat them as a key collaborator.
Hot tip: If possible, use AI agents to analyze your codebase and generate a comprehensive design of each flow, even if you know the codebase or have an SME at hand. (Use AI as a starting point, but trust your own tracing more. Also, make sure your organization is comfortable with an AI agent analyzing its codebase.)

Figure 3: You can't fix a system you don't understand

4. Knowledge of the most-used workflows
Not every feature gets equal traffic, and fixing a performance issue in a rarely used workflow may not be worthwhile right now. A bug in the login flow matters more than a bug in the settings page. Your monitoring system will usually tell you this directly: look at request frequency, not just latency.

5. Defined performance targets
"Fast" is not a target. "P99 latency under 200ms for search requests under normal load" is a target. Without a specific number, you can't declare victory and you can't prioritize.
If a target is not defined, work with someone to agree on a number that is achievable. You can't do 10 DB queries and achieve a 10ms latency. This number will be your true north, guiding you towards the end goal.

Think of these as your entry conditions. Once you have them, several other things become discoverable through investigation rather than needing to be handed to you upfront:

Infrastructure topology: visible from deployment configs, the cloud console, or a conversation with DevOps. How many instances are deployed? What resources does each pod or DB have? One pod with 2GB RAM and a 2-core CPU will not perform the same as two pods with 1GB RAM and a 1-core CPU each.
Dependency performance map: which DBs, caches, queues, and external APIs does this service call, and what are their typical latencies? You can usually get this from code and configuration files. If it's already documented, even better.
Data characteristics: volume, growth rate, and shape of data flowing through the system. Processing 100KB messages is different from processing 10GB messages. A configuration that works for 10,000 requests per hour can be completely useless at 10 million requests per hour.
A reproducible test scenario: more on this below

Figure 4: System Performance Framework

The First Thing You Build: A Reproducible Scenario

Before touching a single line of code or configuration, build a controlled test that demonstrates the performance problem.

This sounds obvious. Most people skip it.

Here's why it matters: without a reproducible scenario, you can't verify that anything you did actually helped. You might deploy a fix, check production metrics an hour later, and see latency improved. But was that your fix? Or lower traffic? Or a cache that warmed up? You don't know.

The scenario is your measuring stick. It's the equivalent of a failing test in TDD — you're not done until it passes, and you can't call it passing if you can't run it.

A good scenario answers:

What operation are we measuring? (e.g., GET /patients?name=smith)
Under what load? (e.g., 50 concurrent users)
With what data? (e.g., 1 million patient records in the DB)
What does "passing" look like? (e.g., P95 < 150ms)
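
The four questions above map fairly directly onto code. The sketch below simulates concurrent users against a stand-in operation and checks the P95 against a target. Everything named here is an assumption made for illustration: LoadScenarioSketch, its methods, and the Thread.sleep stand-in, which a real scenario would replace with the actual request (or with a dedicated load-testing tool).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Minimal load scenario: N concurrent users, each issuing sequential requests. */
final class LoadScenarioSketch {

    /** Nearest-rank percentile over an ascending-sorted list (percentile in 1..100). */
    static long percentile(List<Long> sortedAscending, int percentile) {
        int rank = (percentile * sortedAscending.size() + 99) / 100; // integer ceiling
        return sortedAscending.get(Math.max(rank, 1) - 1);
    }

    /** Runs the scenario and returns all observed latencies in ms, sorted ascending. */
    static List<Long> run(Runnable operation, int concurrentUsers, int requestsPerUser) {
        // One simulated user: call the operation repeatedly and time each call.
        Callable<List<Long>> userSession = () -> {
            List<Long> latencies = new ArrayList<>();
            for (int i = 0; i < requestsPerUser; i++) {
                long start = System.nanoTime();
                operation.run();
                latencies.add((System.nanoTime() - start) / 1_000_000);
            }
            return latencies;
        };

        ExecutorService pool = Executors.newFixedThreadPool(concurrentUsers);
        try {
            List<Future<List<Long>>> sessions = new ArrayList<>();
            for (int user = 0; user < concurrentUsers; user++) {
                sessions.add(pool.submit(userSession));
            }
            List<Long> all = new ArrayList<>();
            for (Future<List<Long>> session : sessions) {
                all.addAll(session.get());
            }
            Collections.sort(all);
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("load scenario failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the real request, e.g. GET /patients?name=smith.
        Runnable operation = () -> {
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        List<Long> latencies = run(operation, 50, 20);   // 50 concurrent users
        long p95 = percentile(latencies, 95);
        System.out.println("P95 = " + p95 + "ms, passing = " + (p95 < 150));
    }
}
```

The important property is that the same scenario can be rerun unchanged after every fix, so before and after numbers are directly comparable.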

The Investigation Process

Step 1 — Measure First, Theorize Later

Pull up your monitoring and answer these questions with data:

Which endpoints or operations are slow? If multiple operations miss their SLA, pick the one with the largest gap between the SLA and actual performance. (Look at latency percentiles, not averages.)
Is it constant or spiky? Spiky usually points to GC pauses, lock contention, or cache misses. Constant usually points to an algorithmic or query problem. Knowing which one you have helps you focus on the real issue. (Spiky latency can also be caused by network jitter or cold caches, but let's ignore that for now.)
Is it correlated with load? If latency is fine at 10 req/s but degrades at 100 req/s, you most probably have a concurrency or resource saturation problem.
When did it start? A sudden change usually means a deployment, a configuration change, or a data volume threshold being crossed. It could also be a change in a third-party service or an upgrade to a newer version of a library.

Figure 5: Do not form a hypothesis yet. Just collect facts.

Step 2 — Identify the Hot Path

Not everything in the system is equally important. Find the operations that are:

Called frequently
Slow (high latency)
High impact to the user

The overlap of all three is where you focus. A rarely called admin endpoint that takes 2 seconds is less important than a core API that takes 300ms and is called 500 times per second.

Figure 6: AI generated this messy image. Still learning how to write good prompts to generate relevant images. (This line is not generated by AI 😛)

Step 3 — Trace the Request End to End

For the hot path you identified, trace a single request through every layer:

Client → Load Balancer → App Server → [Business Logic] → Database/Cache/External API → Response

Figure 7: Usual path of a single request

At each layer, ask: how much time does this layer contribute? Is it acceptable?

Distributed tracing tools (Jaeger, Zipkin, Datadog APM) show you this as a flame graph or waterfall. If you don't have these, your logs may tell you. If even that is not possible, add logs to get these details. Again, don't assume that your business logic isn't consuming time and that it can only be the DB or a third-party API.

What you're looking for is where time is actually spent, not where you assume it's spent.

A common finding: 80% of latency is in one DB query. Another common finding: 30% is in serialization you'd never have guessed. Another: a slow 3rd party API call sitting in the middle of what should be a fast operation.

Figure 8: Time breakdown across layers

Once your trace tells you WHICH layer is slow, you need to look at the 'shape' of that slowness to categorize it.

Step 4 — Categorize the Bottleneck

Once you've found where time is spent, categorize it. Each category needs a very different solution.

CPU-bound
The service is doing heavy computation.
Symptoms: High CPU utilization, scales linearly with load.
Example: Running validation or transformation on every request without caching the result where possible.
I/O-bound
Time is spent waiting on DB, network, or disk.
Symptoms: CPU is low but latency is high, thread pool exhaustion under load.
Example: An N+1 query — fetching a list of 100 items then making 100 individual DB calls for related data.
Memory / GC pressure
Lots of object allocation causing garbage collection pauses.
Symptoms: Latency spikes rather than constant slowness, heap usage that grows and drops periodically.
Example: Creating large intermediate collections in a loop that runs thousands of times per request.
Concurrency / contention
Threads waiting on each other.
Symptoms: High thread count, low CPU, latency that gets much worse under concurrent load.
Example: A shared resource protected by a synchronized block that every request needs to acquire.
Data volume
Queries or algorithms that worked at 10k records fall apart at 10M.
Symptoms: Gradual degradation over time, correlated with data growth.
Example: A missing index, a full table scan, or an in-memory sort of a result set that used to be small.

These are just the usual categories, not an exhaustive list.
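
Of these categories, the N+1 query is probably the easiest to demonstrate. The sketch below uses an in-memory stand-in for a database that simply counts round trips. FakeOrderDb and its methods are invented for illustration; with a real database each round trip also carries network and query overhead, which is where the latency comes from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory stand-in for a database that counts round trips (hypothetical API). */
final class FakeOrderDb {
    int roundTrips = 0;
    private final Map<Integer, String> customerNameByOrderId = new HashMap<>();

    FakeOrderDb(int orders) {
        for (int id = 1; id <= orders; id++) {
            customerNameByOrderId.put(id, "customer-" + id);
        }
    }

    List<Integer> findOrderIds() {
        roundTrips++;                       // the "1" in N+1
        return new ArrayList<>(customerNameByOrderId.keySet());
    }

    String findCustomerName(int orderId) {
        roundTrips++;                       // one query per order: the "N"
        return customerNameByOrderId.get(orderId);
    }

    Map<Integer, String> findCustomerNames(List<Integer> orderIds) {
        roundTrips++;                       // one batched query, e.g. WHERE id IN (...)
        Map<Integer, String> result = new HashMap<>();
        for (int id : orderIds) {
            result.put(id, customerNameByOrderId.get(id));
        }
        return result;
    }
}

final class NPlusOneDemo {

    /** Fetches the list, then issues one query per item. */
    static int roundTripsNPlusOne(FakeOrderDb db) {
        for (int orderId : db.findOrderIds()) {
            db.findCustomerName(orderId);
        }
        return db.roundTrips;
    }

    /** Fetches the list, then loads all related data in a single batched query. */
    static int roundTripsBatched(FakeOrderDb db) {
        db.findCustomerNames(db.findOrderIds());
        return db.roundTrips;
    }

    public static void main(String[] args) {
        System.out.println(roundTripsNPlusOne(new FakeOrderDb(100))); // 101
        System.out.println(roundTripsBatched(new FakeOrderDb(100)));  // 2
    }
}
```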

Figure 9: Categorize the issue

Step 5 — Validate Before You Fix

Before writing a single line of fix code, validate your hypothesis:

Can you reproduce the slow behavior in your reproducible scenario?
Can you explain why this specific thing is causing the slowness?
Does the data support it? (e.g., slow query logs, profiler output, GC logs)

If you can answer yes to all three, you've found the root cause. Now fix it.

If not, go back to Step 3 and keep tracing. At this stage you may find multiple issues rather than a single root cause. Use your judgement to pick your fights. Your primary focus is the root cause.

Figure 10: Validate before you fix

The Prejudice Problem

There is one trap that I see very often (and I fell into it myself when I was less experienced).

If you start a performance investigation already believing you know the answer ("it's the DB", "it's the thread pool", "it's the network", "it's the 3rd party API"), you will almost always find evidence to support that belief. No system is perfect. If you look hard enough at any layer, you'll find something to improve. And improving it will most likely help a little.

But "a little" is not the same as fixing the root cause. And chasing the wrong thing costs weeks of effort while users continue to experience slowness.

The discipline is to stay in data-collection mode until the data points clearly at something. Your hypothesis should be the last thing that forms, not the first.

Figure 11: Tackling low hanging fruits may not be the best solution for performance enhancements

A Note on Performance Targets

One thing that kills performance investigations: nobody defined what "good" looks like. You fix something, latency improves, but no one can say whether it is enough.

Before you start, establish numbers. Some useful ones (Look these terms up if you are not sure):

P50 / P95 / P99 latency — average hides outliers; percentiles don't
Throughput at peak load — requests per second the system must handle
Error rate under load — a system that's fast but drops 2% of requests isn't performing well
Resource utilization ceiling — at what CPU/memory level does performance degrade?

These become your success criteria. The reproducible scenario you built before starting the investigation should test against these.
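
To see why percentiles matter more than averages, consider the made-up sample in the sketch below: 95 requests at 20ms and 5 at 2000ms. The class name and the data are invented for illustration, and the percentile uses the simple nearest-rank method; monitoring tools may interpolate differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Shows how an average hides outliers that percentiles expose. */
final class PercentileVsAverage {

    static double mean(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return (double) sum / values.size();
    }

    /** Nearest-rank percentile (percentile in 1..100). */
    static long percentile(List<Long> values, int percentile) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        int rank = (percentile * sorted.size() + 99) / 100; // integer ceiling
        return sorted.get(Math.max(rank, 1) - 1);
    }

    public static void main(String[] args) {
        // Made-up sample: 95 fast requests at 20ms and 5 slow ones at 2000ms.
        List<Long> latenciesMs = new ArrayList<>(Collections.nCopies(95, 20L));
        latenciesMs.addAll(Collections.nCopies(5, 2000L));

        System.out.println(mean(latenciesMs));            // 119.0
        System.out.println(percentile(latenciesMs, 50));  // 20
        System.out.println(percentile(latenciesMs, 95));  // 20
        System.out.println(percentile(latenciesMs, 99));  // 2000
    }
}
```

The 119ms average describes no request that actually happened: most users saw 20ms, and one in twenty waited two seconds. Only the P99 makes that second group visible.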

Summary

Performance investigation done right is less glamorous than people expect. It's mostly measurement, tracing, and resisting the urge to jump to a solution.

The process, stripped down:

1. Establish the 5 prerequisites before starting
2. Build a reproducible scenario first
3. Measure — let data tell you where time is spent
4. Identify the hot path
5. Trace end to end across layers
6. Categorize the bottleneck type
7. Validate your hypothesis before fixing
8. Verify the fix using your reproducible scenario

The mindset that makes this work: follow the data, not your gut. Your intuition about where the problem is might be right. But until the data confirms it, it's just a theory.

Part 2 covers the same topic for library projects — where you don't have a deployment, monitoring is your responsibility to build, and "production" is someone else's process.

Upgrading Java Libraries: A Developer’s Guide to Compatibility

Faisal Dilawar — Mon, 09 Mar 2026 11:16:41 +0000

In software engineering, we often treat "upgrading" as a purely positive step—new features, better performance, and patched vulnerabilities. However, when your project is used as a library by other applications, an upgrade can be a minefield.

Most experienced developers are wary of upgrading core or major libraries. And most of the people maintaining library projects (bless them!!) don't think about the actual devs using them.

While you control your own codebase, you don't control the hundreds or thousands of downstream projects that depend on your API. Here we will talk about upgrading your Java (or any other language) library projects without breaking the world.

Upgrading an Application

When you upgrade an internal service, you have complete visibility. If you rename a method, you can refactor every caller in one go. You have a single team to coordinate with and an immediate feedback loop.

Upgrading a Library

Library maintainers face "unknown unknowns." Your code is used in ways you never envisioned by teams you've never met. Every change must be viewed through the lens of backward compatibility because you cannot coordinate with all consumers simultaneously.

Consumer Pain Points: What Breaking Changes Feel Like

Breaking changes aren't just technical hurdles; they are business costs.

Compilation Failures: When signatures change or classes disappear, consumer velocity grinds to a halt as developers hunt through changelogs.
Documentation Gaps: Missing Javadocs (@param, @return, @throws) turn an API into a guessing game.
Intuition Failures: Method names that suggest one behavior but implement another erode trust in your library.
Internal Behavior Shifts: The "Silent Killer." The code compiles, but logic breaks at runtime. These cause the most expensive production incidents.

The Gold Standard: Maintaining Backward Compatibility

Follow a simple mantra for as long as possible: "Don't break what's working."

1. Add, Don't Remove

Instead of modifying the signature of an existing method, introduce a new overload.

Before:

public class Calculator {
    public int calculate(int a) {
        return a * 2;
    }
}

After (Safe Evolution):

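public class Calculator {
    // Original method delegates to the new implementation with defaults
    public int calculate(int a) {
        return calculate(a, Config.DEFAULT);
    }

    // New overload provides enhanced functionality
    public int calculate(int a, Config c) {
        return a * c.multiplier;
    }
}

// Minimal Config, assumed here so the example compiles; DEFAULT keeps the old "a * 2" behavior
class Config {
    static final Config DEFAULT = new Config(2);
    final int multiplier;

    Config(int multiplier) {
        this.multiplier = multiplier;
    }
}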

2. Deprecate gracefully, in a phased manner

Mark obsolete APIs with the @Deprecated annotation. Use the since attribute and forRemoval=true (both available from Java 9) to signal intent. Always link to the replacement in the Javadoc. Use this only when the older method still works but the newer one is the better choice for all practical purposes.

/**
 * @deprecated Use {@link #newMethod()} instead.
 * This method will be removed in version 3.0.
 */
@Deprecated(since = "2.0", forRemoval = true)
public void oldMethod() {
    // Legacy implementation maintained for 2-3 major versions
}
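One way to keep a deprecated method cheap to maintain during the phase-out is to let it delegate to its replacement, so there is only one implementation to fix. The sketch below assumes a hypothetical ReportService with made-up method names. It is an illustration of the idea, not a prescribed pattern.

```java
// Hypothetical service: class and method names are assumptions for illustration.
final class ReportService {

    /**
     * @deprecated Use {@link #render(String)} instead.
     * This method will be removed in version 3.0.
     */
    @Deprecated(since = "2.0", forRemoval = true)
    public String renderTitle(String title) {
        // Delegate so that old and new callers always get identical results
        return render(title);
    }

    /** Replacement API: trims the title and formats it as a heading. */
    public String render(String title) {
        return "# " + title.trim();
    }
}
```

Consumers still calling renderTitle get a compiler warning that points them to render, while the behavior stays the same until the method is removed.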

3. The "Silent Killer": Internal Behavior Changes

The most problematic breaking changes are those that pass the compiler but fail at runtime. Consider a method that used to signal "not found" by returning null and now returns an Optional or throws a custom exception.

The Scenario:
A library method used to return null if a user wasn't found. Now, it returns Optional<User>.

The Consumer's Code:

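// A caller that declares `User user = finder.findUser("123")` gets a compile error, which is the lucky case.
// A caller that only checks for null keeps compiling:
if (finder.findUser("123") != null) {
    // This check is now PERMANENTLY true because an empty Optional is still an object!
    grantAccess("123"); // hypothetical consumer method; now also runs for users that don't exist
}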

We hit this specific issue once while upgrading Java in our project, which forced us to upgrade another library along the way. Luckily, we maintained both the library and the application code.
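Combining this with "Add, Don't Remove" gives one possible way out: keep the null-returning method as it is and expose the Optional-based contract under a new name. The sketch below assumes a hypothetical UserFinder backed by an in-memory map. The name findUserOptional is an assumption, not an established convention.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical domain type for illustration
final class User {
    final String id;

    User(String id) {
        this.id = id;
    }
}

final class UserFinder {
    // Stand-in for a real data source
    private final Map<String, User> users = Map.of("123", new User("123"));

    // Existing contract preserved: still returns null when the user is missing
    public User findUser(String id) {
        return findUserOptional(id).orElse(null);
    }

    // New method carries the new contract, so no existing caller changes meaning
    public Optional<User> findUserOptional(String id) {
        return Optional.ofNullable(users.get(id));
    }
}
```

Existing null checks keep working, and consumers can migrate to the Optional variant on their own schedule.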

Hard Breaks & Dependency Hell

Sometimes a hard break is unavoidable due to security flaws or architectural debt. In these cases, adding details to the Javadocs is the most helpful thing you can do for end users:

Explain the Why: Was it a security patch? A performance bottleneck?
Explain the How: Provide clear migration scripts or "Search and Replace" instructions if possible.

The Diamond Dependency Problem

When your library and the app require different versions of the same third-party dependency, consumers often face NoSuchMethodError. You can help them by documenting how to use Maven exclusions:

<dependency>
    <groupId>com.yourlibrary</groupId>
    <artifactId>awesome-lib</artifactId>
    <version>2.0.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.conflicting.lib</groupId>
            <artifactId>clash-artifact</artifactId>
        </exclusion>
    </exclusions>
</dependency>

If nothing else, your documentation and release notes will help them plan the migration better.
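When a NoSuchMethodError does show up, it helps to confirm which copy of a class actually won on the classpath. The sketch below uses only standard JDK calls (Class.getProtectionDomain and CodeSource.getLocation). The helper name JarLocator is an assumption, and consumers would pass in the class named in their stack trace.

```java
import java.security.CodeSource;

final class JarLocator {
    // Returns the JAR or directory a class was loaded from,
    // or "<unknown>" when no code source is available (for example JDK classes)
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "<unknown>";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Replace JarLocator.class with the class from the failing stack trace
        System.out.println(locationOf(JarLocator.class));
    }
}
```

Mentioning a check like this in your migration notes gives consumers a quick way to see whether an exclusion actually took effect.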

The Pre-Publish Checklist

Before you publish your next release, ask yourself:

Did I test existing consumer code?
Is my Javadoc complete (@param, @return, @throws)?
Are deprecations clearly marked with a timeline?
Did I explain the "Why" for any hard breaks in the release notes?
Have I checked for behavioral contract shifts?

In the current environment of AI dependency, agents also use your documentation as their guiding light when working with your library, since its codebase is often not available to them.
Library development is an exercise of responsibility. Every change affects a developer who trusted your API contract. Ship with their needs in mind.

Two Versions, One Project - A Guide to Java Dependency Shading

Faisal Dilawar — Mon, 11 Aug 2025 04:02:06 +0000

Here, we'll learn how to use two versions of the same library in a single JVM project.

Problem statement:

In modern Java development we have the luxury of build automation tools like Maven or Gradle, which handle our dependencies once we tell them what we need. They are amazing at managing multiple libraries and transitive dependencies.
But sometimes we run into a situation where they cannot pick the correct dependency version.

Consider the following example.
A library has 2 versions:

Version 1 (v1) is the older one and has 3 methods
- methodA, methodB and methodC
Version 2 (v2) is the newer one and has 3 methods
- methodA, methodB (with slightly different logic but same signature) and methodD

Your application is set up with Maven and needs version 1, specifically methodC.
Your application also depends on a 3rd party library which needs version 2 for methodD, or maybe even the newer methodB.

What do you do in that case? Maven ensures that only one version ends up on the classpath, either v1 or v2, using its "nearest wins" strategy. And you can't have two classes with the same package name and class name in your application. This is the classic diamond dependency problem.
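To make "nearest wins" concrete, here is a toy model in plain Java. It is not Maven code. It only mimics the rule that the occurrence closest to your pom wins and that, on equal depth, the first declaration wins. The class and method names are made up for this illustration.

```java
import java.util.List;

final class NearestWins {
    // One occurrence of the library in the dependency tree.
    // depth 1 = declared directly in your pom, depth 2 = pulled in by one of your dependencies.
    static final class Candidate {
        final String version;
        final int depth;

        Candidate(String version, int depth) {
            this.version = version;
            this.depth = depth;
        }
    }

    // Simplified mediation: the shallowest occurrence wins,
    // and on equal depth the one declared first wins
    static String mediate(List<Candidate> inDeclarationOrder) {
        Candidate winner = inDeclarationOrder.get(0);
        for (Candidate candidate : inDeclarationOrder) {
            if (candidate.depth < winner.depth) {
                winner = candidate;
            }
        }
        return winner.version;
    }

    public static void main(String[] args) {
        List<Candidate> tree = List.of(
                new Candidate("v1", 1),  // your application needs methodC
                new Candidate("v2", 2)); // the 3rd party library needs methodD
        System.out.println(mediate(tree)); // v1 wins, so methodD is missing at runtime
    }
}
```

In the scenario above v1 is selected, which is exactly why the 3rd party library later fails when it looks for methodD.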

Shading to the rescue

In a nutshell, it packs the complete version 1 of the library (and its dependencies if needed) into your JAR.
It also renames the v1 package/path. So com.organization.project.library becomes something like com.organization.project.shaded.library.
While packaging, Maven replaces all references (like import statements and fully qualified names) to the non-shaded (v1) package name with the shaded path.
The 3rd party library will obviously keep on using the non-shaded (v2) path, as it has not been modified by Maven.

It's like my wife telling me not to buy any new bike, and me trying to convince her it's not exactly a bike but a lawn mower (partially true story).

FYI: Shading plugin basically rewrites the bytecode.
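The renaming rule itself is easy to picture. The sketch below is a simplified model of what a relocation does to a fully qualified class name. It is not the plugin's implementation (which works on bytecode), and the Relocator helper is a name invented for this example.

```java
final class Relocator {
    // Maps a class name the way a relocation rule does: names under `pattern`
    // move under `shadedPattern`, everything else is left untouched
    static String relocate(String className, String pattern, String shadedPattern) {
        if (className.equals(pattern) || className.startsWith(pattern + ".")) {
            return shadedPattern + className.substring(pattern.length());
        }
        return className;
    }

    public static void main(String[] args) {
        // prints com.organization.project.shaded.library.Parser
        System.out.println(relocate(
                "com.organization.project.library.Parser",
                "com.organization.project.library",
                "com.organization.project.shaded.library"));
    }
}
```

Every reference that matches the pattern is rewritten the same way, which is why the shaded copy and the untouched v2 classes can sit side by side without clashing.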

A real-world example

Guava is a high-quality utility library used in many projects. But it suffers from one major flaw: it often breaks backward compatibility. Let's say your application uses v32, but you include a 3rd party library that relies on v23. This can lead to java.lang.NoSuchMethodError.
To handle this, you bundle a copy of the Guava v23 classes inside your final JAR and relocate its packages from com.google.common.* to a private path like com.myapp.shaded.guava.*

Sample pom for above mentioned example (check the build/plugin part):

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.myapp</groupId>
    <artifactId>shading-example</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
        <guava.version>32.1.3-jre</guava.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>${guava.version}</version>
        </dependency>
        <dependency>
            <groupId>com.example</groupId>
            <artifactId>some-data-library</artifactId>
            <version>1.2.0</version>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <artifactSet>
                                <includes>
                                    <include>com.example:some-data-library</include>
                                </includes>
                            </artifactSet>
                            <relocations>
                                <relocation>
                                    <pattern>com.google.common</pattern>
                                    <shadedPattern>com.myapp.shaded.guava</shadedPattern>
                                </relocation>
                            </relocations>
                            <createDependencyReducedPom>false</createDependencyReducedPom>
                        </configuration>
                    </execution>
                </executions>
                <dependencies>
                    <dependency>
                        <groupId>com.google.guava</groupId>
                        <artifactId>guava</artifactId>
                        <version>23.0</version>
                    </dependency>
                </dependencies>
            </plugin>
        </plugins>
    </build>
</project>

Jackson (com.fasterxml.jackson) and Kryo (com.esotericsoftware.kryo) are two more libraries where developers can face similar issues.

Shading is Simple. Right?

A more complex problem statement

What if your application itself needs both versions of the library simultaneously? I know it’s a very unusual scenario. And I hope you don’t have to face something similar. But we faced this and lived to tell the tale (you are reading it, right?).
Here, the standard shading approach fails, because the shade plugin rewrites all occurrences of the original package to the new shaded path. And we don’t want that. We want some references to use v1 and others to use v2.

I sincerely hope you are not in a scenario where you have to support 3 different versions. If you are, do write an article about your own miserable coding life.

In our case, the environment itself where we had to deploy the application was providing us with a runtime dependency. The newer version of the library was essential for our application to function in the evolving runtime environment. At the same time, we needed the older version of that library to read existing, persisted data. So the older version was mandatory for us. And of course, no backward compatibility (you thought this would be easy?).

Solution

To solve this, we moved all the code that relied on the older library version into its own, separate library project.
In the pom.xml of our library project, we shaded the old dependency much like the Guava example (with a small difference we'll explain later).
We then generated two distinct artifacts from this project: the original, standard JAR and the new, shaded JAR.
This shaded JAR (containing the old library) and the regular, non-shaded newer version of the library were both added as dependencies to our main application.
The code in our main application that needed v1 was updated to import the new, relocated packages from our custom-shaded JAR. Any code that needed v2 was kept as is.

Library pom.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.myapp.wrappers</groupId>
    <artifactId>guava-v23-wrapper</artifactId>
    <version>1.0.0</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>17</maven.compiler.source>
        <maven.compiler.target>17</maven.compiler.target>
    </properties>

    <dependencies>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>23.0</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.5.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <shadedArtifactAttached>true</shadedArtifactAttached>
                            <shadedClassifierName>shaded-guava-v23</shadedClassifierName>

                            <relocations>
                                <relocation>
                                    <pattern>com.google.common</pattern>
                                    <shadedPattern>com.myapp.shaded.guava.v23</shadedPattern>
                                </relocation>
                            </relocations>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

The major change is the pair of tags shadedArtifactAttached and shadedClassifierName. Setting shadedArtifactAttached to true tells the maven-shade-plugin not to replace the main artifact. Instead, it attaches a second JAR and appends the classifier, -shaded-guava-v23, to its name.
In application pom.xml:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>${guava.latest.version}</version>
</dependency>
<dependency>
    <groupId>com.myapp.wrappers</groupId>
    <artifactId>guava-v23-wrapper</artifactId>
    <version>1.0.0</version>
    <classifier>shaded-guava-v23</classifier>
</dependency>

In your application code, you can now do this:

import com.google.common.base.Strings;

public class GuavaVersionHandler {

    public String useNewGuava(String text) {
        return Strings.padEnd(text, 15, '.');
    }

    public boolean useOldGuava(String text) {
        return com.myapp.shaded.guava.v23.base.Strings.isNullOrEmpty(text);
    }

    public static void main(String[] args) {
        GuavaVersionHandler handler = new GuavaVersionHandler();

        String result1 = handler.useNewGuava("modern");
        System.out.println("New Guava result: " + result1);

        boolean result2 = handler.useOldGuava("");
        System.out.println("Old Guava result: " + result2);
    }
}

Final thoughts

So there you have it. Dependency shading is a powerful, if slightly deceptive, tool in the fight against dependency hell. Sometimes, you just need to put a cat costume on a library to keep a third-party dependency happy. Other times, you have to convince your own application that your new bicycle is actually a lawnmower to maintain backward compatibility.

While it shouldn't be your first resort, as it adds complexity and size to your project, knowing how to effectively shade a dependency is the perfect escape hatch for otherwise impossible version conflicts. Use this power wisely, and happy (and less miserable) coding! And remember: if you dabble in shading, test the hell out of your application/code.

Would love to hear if you have used any other tools or strategies besides shading to resolve version conflicts in your projects.