Forem: Mohammad-Idrees

Designing Systems by Questioning from First Principles

Mohammad-Idrees — Thu, 22 Jan 2026 12:51:10 +0000

Why this blog exists

Most system design explanations jump straight to:

tables
services
Kafka
microservices
“best practices”

That’s intimidating — and worse, it hides how good designs are actually created.

Strong engineers don’t start with solutions.
They start with questions.

This blog teaches you:

How to think, not what to memorize
Which questions to ask, in which order
How to reason from first principles
How interviewers expect you to reason, even if they never say it

No prior knowledge required.

What is “first principles” thinking?

First principles thinking means:

Breaking a problem down to what must be true, before deciding how to implement anything.

Instead of:

“I’ll use X because everyone uses X”

You ask:

“What does this system fundamentally need to do?”

This applies to any problem:

backend
frontend
infra
data
even non-technical problems

The Core Idea: Design is a Questioning Process

Good design emerges from progressively sharper questions.

Think of it like peeling an onion:

What exists?
What changes?
What must never break?
What can happen at the same time?
What happens if things arrive late, twice, or out of order?

Each layer removes ambiguity.

The 6 First-Principles Questions (Problem-Agnostic)

These six questions work for any system design problem.

1️⃣ What are the things that exist?

Before tables, before APIs — ask:

“What are the nouns in this system?”

Examples:

user
order
referral
payment
message

These are entities.

💡 Rule:

If you can point at it or name it, it’s probably an entity.

Do not think about storage yet.

2️⃣ Which of these change over time?

Now ask:

“Which entities evolve?”

Examples:

order → created → paid → shipped
referral → sent → joined (or failed)

This introduces state.

💡 Rule:

If something changes, you must model how it changes.

This is where many designs fail — they ignore time.
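One way to "model how it changes" is to write the allowed transitions down as data. The sketch below is illustrative only: it uses the order example above, and the names ORDER_TRANSITIONS and advance are made up for this post.

```python
# Allowed transitions for the order lifecycle: created -> paid -> shipped.
# The table itself is the model; anything not listed is illegal.
ORDER_TRANSITIONS = {
    "created": {"paid"},
    "paid": {"shipped"},
    "shipped": set(),  # terminal: nothing may follow
}


def advance(current: str, new: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ORDER_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

Calling advance("created", "shipped") raises, because the order was never paid. The point is not the code, it is that time is now part of the model.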

3️⃣ Which data is identity vs state?

This is a critical mental separation.

Ask:

“Which fields define what this thing is?”
“Which fields define where it is in a process?”

Identity

IDs
relationships
who is involved
usually set once

State

status
progress
lifecycle
changes often

💡 Rule:

Identity answers “what is it?”
State answers “what is happening to it?”

You don’t need separate tables yet — just separate thinking.

4️⃣ What events can happen independently?

Now introduce time and reality.

Ask:

“Can these things happen at the same time?”
“Can they arrive out of order?”
“Can they be retried?”

Examples:

install event
signup event
payment confirmation

Even in one service, these can be async.

💡 Rule:

Async is about timing, not microservices.

5️⃣ What must never be allowed to happen?

These are invariants.

Ask:

“What states are illegal?”
“What combinations should never exist?”

Examples:

reward given twice
joined without a user
paid order without payment record

💡 Rule:

Invariants are stronger than code — enforce them in data models when possible.
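As a rough illustration of enforcing an invariant in the data model, here is "reward given twice" made impossible by a uniqueness constraint. This is a sketch using SQLite from Python; the rewards table, its columns, and the helper names are assumptions for the example, not part of any real system.

```python
import sqlite3


def make_rewards_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # One row per referral: the primary key makes a second reward impossible.
    conn.execute(
        "CREATE TABLE rewards (referral_id INTEGER PRIMARY KEY, amount INTEGER NOT NULL)"
    )
    return conn


def grant_reward(conn: sqlite3.Connection, referral_id: int, amount: int) -> bool:
    """Return True if the reward was granted, False if one already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO rewards (referral_id, amount) VALUES (?, ?)",
                (referral_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Even if application code forgets to check, or two requests race, the database refuses the second insert.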

6️⃣ Where can the system safely lose information?

This is subtle and very important.

Ask:

“If two things race, is it okay if one wins?”
“Do I need the full history, or only the outcome?”

Examples:

Final status (joined vs not joined) → overwrite OK
Money ledger → overwrite NOT OK

💡 Rule:

If overwriting loses truth, you need append-only data.
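A tiny sketch of what "append-only" means for the money example. The Ledger class and its method names are hypothetical; the idea is that the balance is derived from the history and never stored over it.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    # Every change is a new entry; existing entries are never edited.
    entries: list[tuple[str, int]] = field(default_factory=list)

    def record(self, reason: str, amount_cents: int) -> None:
        self.entries.append((reason, amount_cents))

    def balance(self) -> int:
        # The outcome is computed from the full history.
        return sum(amount for _, amount in self.entries)
```

A deposit of 500 followed by a refund of -200 gives a balance of 300, and both facts are still there to audit. An overwritten balance field would keep only the 300.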

Case Study: Referral System (Simplified)

Let’s apply the questions to a concrete example.

Problem (simplified)

Users invite friends.
Friend either:

joins using the referral
or doesn’t

Step 1: Entities

Referral
User

Step 2: What changes?

Referral has a lifecycle.

Step 3: Identity vs State

Identity

referrer_user_id
referee_user_id (once joined)

State

invite-sent
joined
not-joined

Step 4: Async reality

Events:

invite sent
signup
code applied / missed

These can race.

Step 5: Invariants

joined ⇒ referee_user_id exists
not-joined ⇒ referee_user_id is null

Step 6: Can we lose intermediate info?

Yes.

We only care about:

joined
not joined

We don’t need:

install timestamp
intermediate states

Resulting Design

A single table is enough:

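-- MySQL-style sketch; column types are illustrative
CREATE TABLE referrals (
  referral_id      BIGINT PRIMARY KEY,
  referrer_user_id BIGINT NOT NULL,
  referee_user_id  BIGINT NULL,  -- set once the friend joins
  status           ENUM('INVITE_SENT', 'JOINED', 'NOT_JOINED') NOT NULL
);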

Why this works:

JOINED and NOT_JOINED are terminal states
outcomes are mutually exclusive
last write wins is acceptable

No over-engineering.
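The Step 5 invariants can live inside that single table too. Below is a sketch in SQLite via Python, under a few assumptions: SQLite has no ENUM, so a CHECK constraint plays that role, and the integer column types and the helper name are picked only for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE referrals (
    referral_id      INTEGER PRIMARY KEY,
    referrer_user_id INTEGER NOT NULL,
    referee_user_id  INTEGER,
    status           TEXT NOT NULL
        CHECK (status IN ('INVITE_SENT', 'JOINED', 'NOT_JOINED')),
    -- joined => referee_user_id exists
    CHECK (status != 'JOINED' OR referee_user_id IS NOT NULL),
    -- not-joined => referee_user_id is null
    CHECK (status != 'NOT_JOINED' OR referee_user_id IS NULL)
)
"""


def open_referrals_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn
```

With this schema, inserting a JOINED row without a referee fails with sqlite3.IntegrityError, so the illegal state cannot be stored no matter which code path tries.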

Why This Thinking Wins Interviews

Interviewers are not testing:

syntax
frameworks
memorized architectures

They are testing:

reasoning
clarity
trade-off awareness

If you explain:

“I chose X because these states are terminal and overwrites are safe”

You sound senior — even if the solution is simple.

Common Mistakes Junior Engineers Make

❌ Starting with databases
❌ Overusing “microservices”
❌ Adding Kafka without events
❌ Designing for scale without defining scale
❌ Not questioning requirements

The One-Page Interview Checklist

You can memorize this.

Before Designing Anything, Ask:

What are the entities?
Which ones change over time?
What is identity vs state?
Which events are independent / async?
What invariants must never break?
Can overwrites lose truth?

If you answer these clearly, the design almost writes itself.

Final Thought

Great system design is not about being clever.

It’s about being:

clear
deliberate
honest about trade-offs

If you learn to ask better questions, you will:

design better systems
perform better in interviews
grow faster as an engineer

And most importantly — you’ll know why your design works.

Contrast sync vs async failure classes using first principles

Mohammad-Idrees — Tue, 13 Jan 2026 05:15:43 +0000

1. Start from First Principles: What Is a “Failure Class”?

A failure class is not:

a bug
a timeout
an outage

A failure class is:

A category of things that can go wrong because of how responsibility, time, and state are structured

So we ask:

What must be true for correctness?
What assumptions does the model silently make?
What breaks when those assumptions are false?

2. Core Difference (One Sentence)

Synchronous systems fail by blocking and cascading.
Asynchronous systems fail by duplication, reordering, and invisibility.

Everything else is a consequence.

3. Synchronous Systems — Failure Classes

Definition (First Principles)

A synchronous system assumes:

“The caller waits while the callee finishes the work.”

This couples:

time
availability
correctness

Failure Class 1: Blocking Amplification

Question asked:

What happens while the system waits?

Reality:

Threads blocked
Connections held
Memory retained

Failure mode:

Load increases → latency increases → throughput collapses

This is not just “slow.”
It is non-linear failure.

Failure Class 2: Cascading Failure

Question asked:

What if a dependency slows down?

Because everything is waiting:

Agent slows → backend slows
Backend slows → frontend retries
Retries amplify load

Failure mode:

One slow dependency can take down the entire system

Failure Class 3: Availability Coupling

Question asked:

Can the system function if the dependency is down?

Answer in sync systems: no.

Failure mode:

Partial outage becomes total outage

Summary: Sync Failure Classes

Category	Root Cause
Blocking	Time is coupled
Cascades	Dependencies are inline
Global outage	Availability is transitive

4. Asynchronous Systems — Failure Classes

Definition (First Principles)

An async system assumes:

“Work can finish later, possibly multiple times, possibly out of order.”

This decouples time but removes guarantees.

Failure Class 1: Duplicate Execution

Question asked:

What happens if work is retried?

Reality:

At-least-once delivery
Worker crashes
Message reprocessed

Failure mode:

Same logical action happens multiple times

This breaks:

Exactly-once semantics
Idempotency assumptions
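A minimal sketch of the usual defence: remember which message IDs have already been handled and skip repeats. The class and field names are hypothetical, and everything is kept in memory to stay short; a real system would need the check and the side effect to be stored durably and atomically.

```python
class IdempotentHandler:
    def __init__(self) -> None:
        self.processed: set[str] = set()  # IDs we have already acted on
        self.effects: list[str] = []      # stands in for real side effects

    def handle(self, message_id: str, payload: str) -> bool:
        """Apply the payload once per message_id. Return False for a repeat."""
        if message_id in self.processed:
            return False  # redelivery: do nothing
        self.effects.append(payload)
        self.processed.add(message_id)
        return True
```

Delivering the same message twice now produces one effect, which is what the business logic assumed all along.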

Failure Class 2: Ordering Violations

Question asked:

What defines sequence?

Reality:

Queues don’t know business order
Workers process independently

Failure mode:

Effects appear out of logical order

For chat systems:

Responses based on future messages
Context corruption

Failure Class 3: Completion Invisibility

Question asked:

How does the user know when work is done?

Reality:

No direct signal
Polling or guessing

Failure mode:

Users wait blindly or see stale state

Failure Class 4: Orphaned Work

Question asked:

What if the user disappears?

Reality:

Job keeps running
Response stored but never consumed

Failure mode:

Wasted compute, leaked state

Summary: Async Failure Classes

Category	Root Cause
Duplication	Retries
Reordering	Decoupled execution
Invisibility	No direct completion path
Orphans	Detached lifecycles

5. Side-by-Side Contrast (Mental Model)

Dimension	Synchronous	Asynchronous
Time	Coupled	Decoupled
Failure style	Blocking, cascades	Duplication, disorder
Availability	All-or-nothing	Partial
Correctness risk	Latency-based	Logic-based
Debugging	Easier	Harder

6. Deep Insight (This Is the Interview Gold)

Synchronous systems fail loudly and immediately.
Asynchronous systems fail quietly and later.

Sync failures are obvious (timeouts, errors)
Async failures are subtle (double writes, wrong order)

7. Why Neither Is “Better”

From first principles:

Sync systems protect causality but sacrifice availability
Async systems protect availability but sacrifice causality

Real systems exist to reintroduce the lost property:

Async systems add idempotency, ordering, state machines
Sync systems add timeouts, circuit breakers, fallbacks
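To make one of those additions concrete, here is a rough circuit breaker sketch. It is a simplified illustration, not a production implementation: the class names, the thresholds, and the injectable clock are all assumptions made for this example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

After enough consecutive failures the caller stops waiting on the dependency and fails fast instead, which breaks the blocking and cascading chain described earlier.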

8. One-Line Rule to Remember

Sync breaks under load.
Async breaks under ambiguity.

Applying First-Principles Questioning to a Real Company Interview Question

Mohammad-Idrees — Tue, 13 Jan 2026 04:56:05 +0000

Case Study: Designing a Chat System (Meta / WhatsApp–Style)

This section answers a common follow-up interview request:

“Okay, now apply this thinking to a real problem.”

We will do exactly that — without jumping to tools or architectures first.

The goal is not to “design WhatsApp,” but to demonstrate how interviewers expect you to think.

The Interview Question (Realistic & Common)

“Design a chat system like WhatsApp.”

This is a real company interview question asked (in variants) at:

Meta
Uber
Amazon
Stripe

Most candidates fail this question not because it’s hard, but because they start in the wrong place.

What Most Candidates Do (Wrong Start)

Typical opening:

“We’ll use WebSockets”
“We’ll use Kafka”
“We’ll shard by user ID”

This skips reasoning.

A strong candidate pauses and applies the checklist.

Applying the First-Principles Checklist Live

We will apply the same five questions, in order, and show what problems naturally surface.

1. State

“Where does state live? When is it durable?”

Ask This Out Loud in the Interview

What information must the chat system remember for it to function correctly?

Identify Required State (No Design Yet)

Users
Conversations
Messages
Message delivery status

Now ask:

Which of this state must never be lost?

Answer:

Messages (core product)
Conversation membership

First-Principles Conclusion

Messages must be persisted
In-memory-only solutions are insufficient

What the Interviewer Sees

You identified correctness-critical state before touching architecture.

2. Time

“How long does each step take?”

Now we introduce time.

Break the Chat Flow

User sends message
Message is stored
Message is delivered to recipient(s)

Ask:

Which of these must be fast?

Sending a message → must feel instant
Delivery → may be delayed (offline users)

Critical Question

Does the sender wait for delivery confirmation?

If yes:

Latency depends on recipient availability

If no:

Sending and delivery are time-decoupled

First-Principles Conclusion

Message acceptance must be fast
Delivery can happen later

This naturally introduces asynchrony, without naming any tools.

3. Failure

“What breaks independently?”

Now assume failures — explicitly.

Ask

What happens if the system crashes after accepting a message but before delivery?

Possible states:

Message stored
Recipient not notified yet

Now ask:

Can delivery be retried safely?

This surfaces a key invariant:

A message must never be lost, and a retry must never make it appear more than once.

Failure Scenarios Discovered

Duplicate delivery
Message loss
Inconsistent delivery status

First-Principles Conclusion

Message delivery must be idempotent
Storage and delivery failures must be decoupled

The interviewer now sees you understand distributed failure, not just happy paths.

4. Order

“What defines correct sequence?”

Now introduce multiple messages.

Ask

Does message order matter in a conversation?

Answer:

Yes — chat messages must appear in order

Now ask the dangerous question:

Does arrival order equal delivery order?

In distributed systems:

No guarantee

Messages can:

Be processed by different servers
Experience different delays

First-Principles Conclusion

Ordering is part of correctness
It must be explicitly modeled (e.g., sequence per conversation)

This is a senior-level insight, derived from questioning alone.
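Here is a sketch of what "sequence per conversation" can look like on the receiving side. It assumes the server has already stamped each message with a sequence number starting at 1; the ConversationBuffer class is hypothetical.

```python
class ConversationBuffer:
    """Releases messages in sequence order, whatever order they arrive in."""

    def __init__(self) -> None:
        self.next_seq = 1                 # next sequence number to show
        self.pending: dict[int, str] = {} # early arrivals, keyed by sequence

    def receive(self, seq: int, text: str) -> list[str]:
        """Accept one message; return the messages now ready to show, in order."""
        if seq < self.next_seq or seq in self.pending:
            return []  # duplicate or already shown
        self.pending[seq] = text
        ready = []
        while self.next_seq in self.pending:
            ready.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return ready
```

If message 2 arrives before message 1, it is held back. When 1 arrives, both are released in order. Order is now a property of the data, not of network timing.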

5. Scale

“What grows fastest under load?”

Now — and only now — do we talk about scale.

Ask

As usage grows, what increases fastest?

Likely answers:

Number of messages
Concurrent active connections
Offline message backlog

Now ask:

What happens during spikes (e.g., group chats, viral events)?

You discover:

Hot conversations
Uneven load
Memory pressure from live connections

First-Principles Conclusion

The system must scale on messages, not users
Load is not uniform

What We Have Discovered (Before Any Design)

Without choosing any tools, we now know:

Messages must be durable
Sending and delivery must be decoupled
Failures must not cause duplicates or loss
Ordering is a correctness requirement
Message volume, not user count, dominates scale

This is exactly what interviewers want to hear before you propose architecture.

What Comes Next (And Why It’s Easy Now)

Only after this reasoning does it make sense to talk about:

Persistent storage
Async delivery
Streaming connections
Partitioning strategies

At this point, architecture choices are obvious, not arbitrary.

Why This Approach Scores High in Interviews

Interviewers are evaluating:

How you reason under ambiguity
Whether you surface hidden constraints
Whether you understand failure modes

They are not testing whether you know WhatsApp’s internals.

This method shows:

Structured thinking
Calm problem decomposition
Senior-level judgment

Common Candidate Mistakes (Seen in This Question)

Jumping to WebSockets without discussing durability
Ignoring offline users
Assuming message order “just works”
Treating retries as harmless
Talking about scale before correctness

Every one of these mistakes is prevented by the checklist.

Final Reinforcement: The Checklist (Again)

Use this verbatim in interviews:

Where does state live? When is it durable?
Which steps are fast vs slow?
What can fail independently?
What defines correct order?
What grows fastest under load?

Final Mental Model

Strong candidates design systems.
Exceptional candidates design reasoning.

How to Question Any System Design Problem (With Live Interview Walkthrough)

Mohammad-Idrees — Tue, 13 Jan 2026 04:50:41 +0000

Thinking in First Principles:

Most system design interview failures are not caused by missing knowledge of tools.

They are caused by missing questions.

Strong candidates do not start by designing systems.
They start by interrogating the problem.

This post teaches you:

How to question a system from first principles
How to apply that questioning live in an interview
What mistakes candidates commonly make
A printable one-page checklist you can memorize and reuse

No prior system design experience required.

What “First Principles” Means in System Design

First principles means reducing a problem to fundamental truths that must always hold, regardless of:

Programming language
Framework
Infrastructure
Scale

Every system—chat apps, payment systems, video processing pipelines—must answer the same core questions about:

State
Time
Failure
Order
Scale

If a design cannot answer one of these, it is incomplete.

The 5-Step First-Principles Questioning Framework

You will apply these questions in order.

State – Where does information live? When is it durable?
Time – How long does each step take?
Failure – What breaks independently?
Order – What defines correct sequence?
Scale – What grows fastest under load?

This is not a checklist you recite.
It is a thinking sequence.

Let’s walk through each one.

1. State — Where Does It Live? When Is It Durable?

The Question

Where does the system’s information exist, and when is it safe from loss?

This is always the first question because nothing else matters if data disappears.

What You’re Really Asking

Is data stored in memory or persisted?
What survives a crash or restart?
What is the source of truth?

Example Case

Imagine a system that accepts user requests and processes them later.

If the request only lives in memory:

A restart loses it
A crash loses it
Another instance can’t see it

You have discovered a correctness problem, not a performance one.

Key Insight

If state only exists in a running process, it does not exist.

2. Time — How Long Does Each Step Take?

Once state exists, time becomes unavoidable.

The Question

Which steps are fast, and which are slow?

You are comparing orders of magnitude, not exact numbers.

What You’re Really Asking

Is there long-running work?
Does the user wait for it?
Is fast work blocked by slow work?

Example Case

A system:

Accepts a request (milliseconds)
Performs heavy processing (seconds)

If the request waits for processing:

Latency is dominated by the slowest step
Throughput collapses under load

Key Insight

The slowest step defines the user experience.

3. Failure — What Breaks Independently?

Now assume something goes wrong. It always will.

The Question

Which parts of the system can fail without the others failing?

What You’re Really Asking

What if the system crashes mid-operation?
What if work is retried?
Can the same work run twice?

Example Case

If work can be retried:

It may run twice
Side effects may duplicate
State may become inconsistent

This is not a bug.
It is the default behavior of distributed systems.

Key Insight

Distributed systems fail partially, not cleanly.

4. Order — What Defines Correct Sequence?

Ordering issues appear only after state, time, and failure are considered.

The Question

Does correctness depend on the order of operations?

What You’re Really Asking

Does arrival order equal processing order?
Can later work finish earlier?
Does that matter?

Example Case

Two requests arrive:

A then B

If B completes before A:

Is the system still correct?

If the answer is “no,” order must be explicitly enforced.

Key Insight

If order matters, it must be designed—not assumed.

5. Scale — What Grows Fastest?

Only now do we talk about scale.

The Question

As usage increases, which dimension grows fastest?

What You’re Really Asking

Requests?
Stored data?
Concurrent operations?
Waiting work?

Example Case

If each request waits on slow work:

Concurrent waiting grows with latency
Resources exhaust quickly

Key Insight

Systems fail at the fastest-growing dimension.

Live Mock Interview Case Study (Detailed)

Interviewer

“Design a system where users submit tasks and receive results later.”

Candidate (Correct Approach)

Candidate:
Before designing, I’d like to understand what state the system must preserve.

Step 1: State

Candidate:
We must store:

The user’s request
The result
A way to associate them

This state must survive crashes, so it needs to be persisted.

Interviewer:
Good. Continue.

Step 2: Time

Candidate:
Submitting a request is likely fast.
Producing a result could be slow.

If we make users wait for result generation, latency will be high and throughput limited.

So the system likely separates request acceptance from processing.

Step 3: Failure

Candidate:
Now I’ll assume failures.

If processing crashes mid-way:

The request still exists
Processing may retry

That means the same task could execute twice.

So we must consider whether duplicate execution is safe.

Step 4: Order

Candidate:
If users submit multiple tasks:

Does order matter?

If yes:

Arrival order ≠ completion order
We need to explicitly preserve sequence

If no:

Tasks can be processed independently

Step 5: Scale

Candidate:
Under load, the fastest-growing dimension is:

Pending background work

If processing is slow, the backlog grows quickly.

So the system must degrade gracefully under that pressure.
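One possible reading of "degrade gracefully" is to put a hard bound on the backlog and say no when it is full. That interpretation is an assumption, and so is the submit helper below. It uses Python's standard queue module.

```python
import queue


def submit(backlog: queue.Queue, task: str) -> bool:
    """Try to enqueue a task. Return False instead of letting the backlog grow."""
    try:
        backlog.put_nowait(task)
        return True
    except queue.Full:
        return False  # caller can tell the user to retry later
```

With queue.Queue(maxsize=2), the third submit returns False. Rejecting early keeps the pending work bounded instead of letting it pile up until something else breaks.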

Interviewer Assessment

The candidate:

Asked structured questions
Identified real failure modes
Avoided premature tools
Demonstrated systems thinking

No tools were required to pass this interview.

Common Mistakes Candidates Make

1. Jumping to Solutions

❌ “We’ll use Kafka”
✅ “What happens if work runs twice?”

2. Treating State as Implementation Detail

❌ “We’ll store it somewhere”
✅ “What must never be lost?”

3. Ignoring Failure

❌ “Retries should work”
✅ “What if retries duplicate effects?”

4. Assuming Order

❌ “Requests are processed in order”
✅ “What enforces that order?”

5. Talking About Scale Too Early

❌ “Millions of users”
✅ “Which dimension explodes first?”

Printable One-Page Interview Checklist

You can print or memorize this.

First-Principles System Design Checklist

Ask these in order:

State

What information must exist?
Where does it live?
When is it durable?

Time

Which steps are fast?
Which are slow?
Does slow work block fast work?

Failure

What can fail independently?
Can work be retried?
What happens if it runs twice?

Order

Does correctness depend on sequence?
Is arrival order preserved?
What enforces ordering?

Scale

What grows fastest?
How does the system fail under load?

Final Mental Model

Great system design is not about building systems.
It is about exposing hidden assumptions.

This framework helps you do that—calmly, systematically, and convincingly.

Thinking in First Principles: How to Question an Async Queue–Based Design

Mohammad-Idrees — Tue, 13 Jan 2026 04:04:43 +0000

Async queues are one of the most commonly suggested “solutions” in system design interviews.

But many candidates jump straight to using queues without understanding:

What problems they actually solve
What new problems they introduce
How to systematically discover those problems

This post teaches a first-principles questioning process you can apply to any async queue design—without assuming prior knowledge.

Why This Matters

In interviews, interviewers are not evaluating whether you know Kafka, SQS, or RabbitMQ.

They are evaluating whether you can:

Reason about time
Reason about failure
Reason about order
Reason about user experience

Async queues change all four.

What “First Principles” Means Here

First principles means:

We do not start with solutions
We do not assume correctness
We ask basic, unavoidable questions that every system must answer

Async queues feel correct because they remove blocking—but correctness is not guaranteed by intuition.

The Reference Mental Model (Abstract)

We will reason about this abstract pattern, not a specific product:

User → API → Storage → Queue → Worker → Storage

No domain assumptions. This could be:

Chat messages
Emails
Payments
Notifications
Image processing

The questioning process stays the same.

Step 1: The Root Question (Always Start Here)

What is the system responsible for completing before it can respond?

This is the most important question in system design.

Why?
Because it defines:

Request boundaries
Latency expectations
Responsibility

In an async queue design, the implicit answer is:

“The request is complete once the work is enqueued.”

This is different from synchronous designs, where the request completes after work finishes.

So far, this seems good.

Step 2: Introduce Time (What Happens Later?)

Now ask:

Which part of the work happens after the request is done?

Answer:

The worker processing

This leads to an important realization:

The system has split work across time

Time separation is powerful—but it creates new questions.

Step 3: Causality Question (Identity Across Time)

Once work happens later, we must ask:

How does the system know which output belongs to which input?

This question always appears when time is decoupled.

Typical answer:

IDs in the job payload (request ID, entity ID)

This introduces a new invariant:

Each input must produce exactly one correct output

Now we test whether the system can guarantee this.

Step 4: Failure Question (The Queue Reality)

Now ask the most important async-specific question:

What happens if the worker crashes mid-processing?

Realistic answers:

The job is retried
The work may run again
The output may be produced twice

This leads to a critical realization:

Async queues are usually at-least-once, not exactly-once

This is not a tooling issue.
It is a fundamental property of distributed systems.

Step 5: Duplication Question (Invariant Violation)

Now ask:

What happens if the same job is processed twice?

Consequences:

Duplicate outputs
Duplicate side effects
Conflicting state

This violates the earlier invariant:

“Exactly one output per input”

At this point, we have discovered a correctness problem, not a performance problem.
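To see the violation in miniature, compare two ways a worker could store its output. Both functions are hypothetical; the only difference is whether the write is keyed by the job ID.

```python
def process_append(job_id: str, payload: str, outputs: list) -> None:
    # Appending: every delivery adds another output.
    outputs.append((job_id, payload.upper()))


def process_keyed(job_id: str, payload: str, outputs: dict) -> None:
    # Writing by job_id: a redelivery overwrites the same entry.
    outputs[job_id] = payload.upper()
```

Deliver the same job twice and the first version holds two outputs while the second still holds one. Keyed writes are one way to get back to "one output per input", as long as reprocessing produces the same result.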

Step 6: Ordering Question (Time Without Synchrony)

Now consider multiple inputs.

Ask:

What defines the order of processing?

Important realization:

Queue order ≠ business order
Different workers process at different speeds
Later inputs may finish first

Now ask:

Does correctness depend on order?

If yes (and many systems do):

Async queues alone are insufficient

This problem emerges only when you question order explicitly.

Step 7: Visibility Question (User Experience)

Now switch perspectives.

How does the user know the work is finished?

Possible answers:

Polling
Guessing
Timeouts

Each answer reveals a problem:

Polling wastes resources
Guessing is unreliable
Timeouts fail under load

This violates a core system principle:

Users should not wait blindly

Case Study: A Simple Example (Problem-Agnostic)

Imagine a system where users upload photos to be processed.

Flow:

User uploads photo
API stores metadata
Job is enqueued
Worker processes photo
Result is stored

Now apply the questions:

When does the upload request complete? → After enqueue
What if the worker crashes? → Job retried
What if it runs twice? → Two processed images
What if two photos depend on order? → Order not guaranteed
How does the user know processing is done? → Polling

None of these issues are about images.
They are about time, failure, identity, and visibility.

What Async Queues Actually Trade

Async queues solve one problem:

They remove blocking from the request path

But they introduce others:

Solved	Introduced
Blocking	Duplicate work
Latency coupling	Ordering ambiguity
Resource exhaustion	Completion uncertainty

This is not bad.
It just must be understood and handled.

The One-Page Interview Checklist (Memorize This)

For any async queue design, ask these five questions:

What completes the request?
What runs later?
What happens if it runs twice?
What defines order?
How does the user observe completion?

If you cannot answer all five clearly, the design is incomplete.

Final Mental Model

Async systems remove time coupling but destroy causality by default

Your job as an engineer is not to “use queues”
Your job is to restore correctness explicitly

That is what interviewers are looking for.

How to Identify System Design Problems from First Principles

Mohammad-Idrees — Tue, 13 Jan 2026 03:05:21 +0000

Why This Matters

In system design interviews, candidates often fail not because they don’t know tools, but because they don’t know how to ask the right questions.

Strong designers:

Discover problems before proposing solutions
Reason about failures without running systems
Explain why an architecture breaks under load

This post teaches how to identify system design problems from first principles, using a repeatable questioning process.

First Principles: What Are We Actually Designing?

Before any architecture, clarify this:

A system is a machine that accepts requests and holds resources until it finishes work.

So every design must answer:

What work must be done?
How long does it take?
What resources are held while it runs?
What happens when things slow or fail?

The Root Question (Always Start Here)

What must the system finish before it can respond?

This is the first and most important question.

Why?

It defines request boundaries
It determines latency
It determines failure coupling
It determines scalability

If you don’t answer this explicitly, the system will answer it implicitly — usually incorrectly.

The Question Ladder (Mental Checklist)

Once the root question is answered, follow this exact sequence:

1️⃣ What defines request completion?

When is the request “done”?
What must succeed before responding?

2️⃣ Which step is the slowest?

Database write? (ms)
Network call? (hundreds of ms)
External service / ML model? (seconds)

The slowest step dominates the system.

3️⃣ What resources are held while waiting?

Ask concretely:

Is a goroutine/thread blocked?
Is an HTTP connection open?
Is memory retained?
Is a DB connection reserved?

If resources are held → risk exists.

4️⃣ What scales with traffic vs latency?

This is the diagnostic question that exposes blocking.

Does resource usage grow when traffic increases — or when latency increases?

This distinction is critical.

The Core Principle (Memorize This)

Healthy systems scale with traffic.
Broken systems scale with latency.

Latency is unpredictable and unbounded.
Traffic can be controlled.

If your system scales with latency, it will collapse under real-world conditions.

Case Study 1: Synchronous API with External Dependency

Scenario

Client → API → External Service → API → Client

The API waits for the external service before responding.

Apply the Question Ladder

1. What defines completion?
→ External service response

2. Slowest step?
→ External service (seconds)

3. Resources held?
→ Open HTTP request, goroutine, memory

4. What scales with latency?

Example:

10 requests/sec
External latency = 5 sec

After 5 seconds:

50 concurrent requests
50 goroutines blocked

Latency increases → concurrency explodes.

Failure Identified

Concurrency scales with latency, not traffic

This is blocking — even if traffic is low.
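The arithmetic above is Little's law: in steady state, the average number of requests in flight equals arrival rate times time spent in the system. A one-function sketch (the function name is made up here):

```python
def in_flight(arrival_rate_per_sec: float, latency_sec: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return arrival_rate_per_sec * latency_sec
```

in_flight(10, 5) gives the 50 blocked requests from the example. Keep traffic at 10 requests/sec and double the latency to 10 seconds, and the number doubles to 100 with no new users at all.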

Case Study 2: Database Transaction Around Long Work

Scenario

Begin Transaction
→ Write data
→ Call external service
→ Write result
→ Commit

Apply the Questions

Completion defined by: external service
Slowest step: external service
Resources held: DB transaction, locks, connection
Scaling factor: external latency

Failure Identified

Slow work is holding scarce resources

This leads to:

Lock contention
Connection pool exhaustion
Cascading failures
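A sketch of the same flow with the slow call moved outside the transaction. It uses SQLite from Python, and the jobs table, the status values, and the function names are assumptions for illustration.

```python
import sqlite3
from typing import Callable


def make_jobs_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, input TEXT, output TEXT, status TEXT)"
    )
    return conn


def handle(
    conn: sqlite3.Connection,
    job_id: int,
    data: str,
    call_external: Callable[[str], str],
) -> None:
    with conn:  # short transaction 1: record the request
        conn.execute(
            "INSERT INTO jobs (id, input, status) VALUES (?, ?, 'PENDING')",
            (job_id, data),
        )
    # Slow work happens here with no transaction or lock held.
    result = call_external(data)
    with conn:  # short transaction 2: record the result
        conn.execute(
            "UPDATE jobs SET output = ?, status = 'DONE' WHERE id = ?",
            (result, job_id),
        )
```

The scarce resource is now held for milliseconds, whatever the external latency is. The trade-off: a crash between the two transactions leaves a PENDING row, so something has to retry it, and retries bring their own duplicate-execution questions.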

Case Study 3: Real-Time Streaming System

Scenario

Client ⇄ WebSocket ⇄ Server ⇄ Generator

Tokens stream incrementally.

Apply the Questions

Completion defined by: final token
Slowest step: generation duration
Resources held: socket, memory buffers
Latency impact: long streams = long-held connections

New Failure Mode

Intermediate states now exist

Questions arise:

What if connection drops mid-stream?
What defines “done”?
Can streams overlap?

Streaming solves latency perception but introduces state complexity.

Why These Problems Are Found Before Architecture

Notice:

No queues were introduced
No specific products were chosen
No architecture was proposed

Problems emerged purely by reasoning about time, state, and resources.

This is why first-principles thinking works across:

Chat systems
Payments
Notifications
File uploads
ML inference pipelines

The One-Page Interview Checklist

Use this on any system:

What must complete before responding?
What is the slowest step?
What resources are held while waiting?
What grows when latency increases?
Can failures occur independently?
What happens under partial completion?

If latency appears in #4 → design flaw detected

Final Takeaway

Blocking is not a performance issue.
It is a semantic mistake about responsibility boundaries.

The system incorrectly believes:

“I must finish long work before I can reply.”

Great system designers question that belief first — before proposing solutions.

Use this post as a thinking reference, not a pattern catalog.
If you can explain these questions clearly in an interview, you are already in the top tier.

🧱 The Blueprint of Success: Mastering the Technical Requirements Document (TRD)

Mohammad-Idrees — Wed, 26 Nov 2025 17:14:13 +0000

Hello Future Engineers! I’m here to talk about one of the most critical documents you'll encounter in your career: the Technical Requirements Document (TRD). You might also hear it called a Technical Specification Document (TSD) or System Design Document (SDD).

As an Architect/Principal Engineer, I can tell you that a well-written TRD is the difference between a smooth, successful project and months of frustrating, expensive rework. Think of the TRD as the detailed engineering blueprint that translates a designer's sketch (the PRD) into a constructible building. Your mission is to master this blueprint.

🧐 Why the TRD is Your Lifeline

A common mistake for junior engineers is jumping straight into coding. Don't do it! The TRD forces you to think deeply about the system and its constraints before you write the first line of code.

Here's why it's essential:

Clarity and Alignment: It serves as the single source of truth, ensuring every developer, QA engineer, and product manager is aligned on exactly what and how the feature will be built.
Prevents Scope Creep: By clearly defining what’s in-scope and, critically, what’s out-of-scope (Non-Goals), you prevent last-minute feature additions that derail schedules.
Facilitates Code Review & Testing: The TRD provides the acceptance criteria and the technical context needed for QA to design test cases and for senior engineers to conduct thorough, meaningful code reviews.
Enables Tradeoff Justification: It’s where you document your architectural choices and defend them against the system's needs (e.g., "We chose X database over Y because of the Z latency requirement").

📑 The Gold Standard TRD Structure (Section by Section)

A robust TRD follows a logical flow, moving from high-level context to specific implementation details.

I. Document Context and Administration 📝

This is the metadata that keeps the project organized.

Title & ID: A descriptive name and a unique ID (e.g., TRD-USER-AUTH-001).
Revision History: Crucial. Log every change, who made it, and why. This tracks the evolution of the design.
Summary & Business Context: Briefly state the problem you are solving and link to the source Product Requirements Document (PRD).
Stakeholders & Approvers: List the owners (Product, Engineering Lead, QA Lead) who must sign off on the design.
Goals (In-Scope): State the measurable outcomes (e.g., Implement secure user registration and login functionality.).
Non-Goals (Out-of-Scope): Explicitly list what you are not building (e.g., This TRD does not include social sign-in or password recovery functionality.).

II. Functional Requirements (The "What" to Build) 🔨

This section is derived directly from the PRD, but translated into technical language.

Describe the behavior of the system. Break it down by user story or use case.
Example:
- The system must validate the user's password against the defined complexity rules (at least 8 characters, 1 uppercase letter, 1 number).
- The API endpoint /api/v1/user/register must accept a POST request with email and password fields and return an HTTP 201 response on success.
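
Requirements written this precisely translate almost directly into code. The following Go sketch is one possible reading of the two example requirements, using only the standard library. Treating "8 characters" as a minimum length, the 400 and 405 error responses, and the names `validatePassword` and `registerHandler` are assumptions, and persistence and password hashing are left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"unicode"
)

// validatePassword checks the example complexity rules. Treating "8 characters"
// as a minimum length is an assumption.
func validatePassword(pw string) error {
	var hasUpper, hasDigit bool
	for _, r := range pw {
		switch {
		case unicode.IsUpper(r):
			hasUpper = true
		case unicode.IsDigit(r):
			hasDigit = true
		}
	}
	switch {
	case len([]rune(pw)) < 8:
		return errors.New("password must be at least 8 characters")
	case !hasUpper:
		return errors.New("password must contain an uppercase letter")
	case !hasDigit:
		return errors.New("password must contain a number")
	}
	return nil
}

type registerRequest struct {
	Email    string `json:"email"`
	Password string `json:"password"`
}

// registerHandler sketches POST /api/v1/user/register. The requirement only
// specifies 201 on success; the 400 and 405 responses are assumptions.
func registerHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	var req registerRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Email == "" {
		http.Error(w, "email and password are required", http.StatusBadRequest)
		return
	}
	if err := validatePassword(req.Password); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusCreated)
}

func main() {
	// Exercise the handler in-process with a recorded response.
	body := strings.NewReader(`{"email":"a@example.com","password":"Passw0rdX"}`)
	req := httptest.NewRequest(http.MethodPost, "/api/v1/user/register", body)
	rec := httptest.NewRecorder()
	registerHandler(rec, req)
	fmt.Println("status:", rec.Code)
}
```

Notice how every branch in the code maps to a phrase in the requirement. Where the code had to guess (the error responses), the requirement was incomplete, which is exactly the kind of gap a TRD review should catch.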

III. Non-Functional Requirements (NFRs) (The "How Well" It Must Be Built) 🌟

These are the quality attributes that truly define the engineering challenge. As a junior engineer, you must learn to think in NFRs, as they drive every architectural decision.

Category	Description	Example of a Measurable Requirement
Performance	Speed and efficiency under a given workload.	API endpoint X must respond in under 150 milliseconds for 99% of requests.
Scalability	Ability to handle future increases in workload.	System must support a 5x traffic increase over the next year.
Security	Protecting the system and data from unauthorized access.	All sensitive data must be encrypted at rest using AES-256.
Availability	The percentage of time the system is operational.	Target 99.99% uptime (at most about 52.6 minutes of downtime per year).
Maintainability	Ease of fixing bugs, evolving the code, and monitoring.	Detailed logs must be retained for 90 days.
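
An availability target only becomes meaningful once it is converted into a downtime budget. As a quick sanity check, this Go sketch does the conversion; the helper name `downtimeBudget` and the 365-day year are assumptions:

```go
package main

import (
	"fmt"
	"time"
)

// year assumes 365 days; leap years are ignored for this estimate.
const year = 365 * 24 * time.Hour

// downtimeBudget converts an availability target (for example 0.9999) into the
// downtime allowed over the given period.
func downtimeBudget(availability float64, period time.Duration) time.Duration {
	return time.Duration((1 - availability) * float64(period))
}

func main() {
	for _, a := range []float64{0.99, 0.999, 0.9999} {
		fmt.Printf("%.2f%% uptime -> %v of downtime per year\n",
			a*100, downtimeBudget(a, year).Round(time.Second))
	}
}
```

For 99.99% this works out to roughly 52.6 minutes per year, and each extra "nine" cuts the budget by a factor of ten, which is why availability targets drive architecture and cost so strongly.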

IV. System Architecture and Design (The Blueprint) 📐

This is the core engineering section. It details how you plan to meet the NFRs and Functional Requirements.

High-Level Architecture: Use diagrams to show where the new feature sits in the overall ecosystem.

[Image of Microservices Architecture Diagram]

Component Design: Detail the new services, modules, or libraries being created.
Data Model/Schema Changes:
- Show the new database tables, fields, indexes, and relationships.
- If using NoSQL, show the document structure.
API Specifications: Document the full contract for new or modified APIs (URL, Method, Request Body, Response Body, Error Codes).
Technology Usage & Tradeoff Justification:
- This is where you earn your stripes. Document the final technology choice (e.g., "We chose Redis for session management") and justify it by linking it back to the NFRs (e.g., "because the 150 ms performance NFR requires an in-memory cache solution for fast read times").
- Also, explicitly mention the tradeoff (e.g., "The tradeoff is higher cost compared to using the main database, but this is accepted due to the critical performance need.").
Assumptions, Constraints, & Dependencies:
- Assumptions: What you are taking for granted (e.g., "We assume the networking layer is already configured with load balancing").
- Constraints: Strict limitations (e.g., "Must be deployed on Kubernetes only," or "Budget limited to $X per month for hosting").

V. Testing, Deployment, and Operations ⚙️

Building it is only half the battle; operating it is the other half.

Acceptance Criteria (AC): These are the final, testable conditions for each major requirement. Example: AC for successful registration: A new user record exists in the database with a hashed password, and an email confirmation event is queued.
Testing Strategy: High-level plan (e.g., unit tests must cover 80% of the new service; performance tests must validate the 150 ms latency NFR).
Monitoring & Alerting: How will you know if the feature is broken in production? What are the key metrics and who gets paged?
Deployment & Rollback Plan: A step-by-step release process (e.g., using blue/green deployment, feature flags) and the specific, tested steps for quickly reverting the change if an issue occurs.
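
A performance test that validates a latency NFR ultimately reduces to a percentile check over measured response times. The Go sketch below shows one way to do that with the nearest-rank method. The sample data is synthetic, and the helper names (`percentile`, `meetsLatencyNFR`, `syntheticSamples`, `ms`) are hypothetical:

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// ms is a small convenience for building durations in milliseconds.
func ms(n int) time.Duration { return time.Duration(n) * time.Millisecond }

// percentile returns the p-th percentile (0 < p <= 100) of the samples using
// the nearest-rank method. It returns 0 for an empty input.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...) // copy so the input is not reordered
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// meetsLatencyNFR reports whether the 99th percentile is under the limit.
func meetsLatencyNFR(samples []time.Duration, limit time.Duration) bool {
	return percentile(samples, 99) < limit
}

// syntheticSamples returns made-up measurements: 98 fast requests and 2 slow ones.
func syntheticSamples() []time.Duration {
	samples := make([]time.Duration, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, ms(40))
	}
	return append(samples, ms(400), ms(900))
}

func main() {
	samples := syntheticSamples()
	fmt.Println("p99:", percentile(samples, 99))
	fmt.Println("meets 150 ms NFR:", meetsLatencyNFR(samples, ms(150)))
}
```

With this synthetic data the average looks healthy, yet the check fails because 2 of 100 requests are slow. That is the reason NFRs are usually stated as percentiles rather than averages.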

🚀 Suggested Mind Map: Drafting a Simple, Effective TRD

When facing a new feature, use this four-step mind map to quickly structure your thinking and draft a solid TRD. Start with the "What" and work your way to the "How."

Start with the PRD (The "WHAT"):
- Translate the PRD into Functional Requirements.
- Define Acceptance Criteria for each requirement.
Define the Quality (The "HOW WELL"):
- Determine the Non-Functional Requirements (NFRs): Performance, Security, Scale, Reliability.
Design the Solution (The "HOW"):
- Select the Technology Stack.
- Justify Tradeoffs (e.g., Postgres vs. Mongo → which NFR does it satisfy?).
- Diagram the Architecture and Data Model.
Operationalize (The "HOW TO RUN IT"):
- Define Testing Strategy.
- Determine Monitoring/Alerts.
- Document the Deployment/Rollback steps.