<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohammad-Idrees</title>
    <description>The latest articles on Forem by Mohammad-Idrees (@mohammadidrees).</description>
    <link>https://forem.com/mohammadidrees</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1046680%2F443afdb2-0830-44b3-b1fe-f31bdc08c6fa.png</url>
      <title>Forem: Mohammad-Idrees</title>
      <link>https://forem.com/mohammadidrees</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohammadidrees"/>
    <language>en</language>
    <item>
      <title>Designing Systems by Questioning from First Principles</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Thu, 22 Jan 2026 12:51:10 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/designing-systems-by-questioning-from-first-principles-1p6g</link>
      <guid>https://forem.com/mohammadidrees/designing-systems-by-questioning-from-first-principles-1p6g</guid>
      <description>&lt;h2&gt;
  
  
  Why this blog exists
&lt;/h2&gt;

&lt;p&gt;Most system design explanations jump straight to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;tables&lt;/li&gt;
&lt;li&gt;services&lt;/li&gt;
&lt;li&gt;Kafka&lt;/li&gt;
&lt;li&gt;microservices&lt;/li&gt;
&lt;li&gt;“best practices”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s intimidating — and worse, it hides &lt;strong&gt;how good designs are actually created&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Strong engineers don’t start with solutions.&lt;br&gt;
They start with &lt;strong&gt;questions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This blog teaches you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to think&lt;/strong&gt;, not what to memorize&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Which questions to ask&lt;/strong&gt;, in which order&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to reason from first principles&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;How interviewers expect you to reason, even if they never say it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No prior knowledge required.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is “first principles” thinking?
&lt;/h2&gt;

&lt;p&gt;First principles thinking means:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Breaking a problem down to what &lt;em&gt;must&lt;/em&gt; be true, before deciding &lt;em&gt;how&lt;/em&gt; to implement anything.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I’ll use X because everyone uses X”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What does this system fundamentally need to do?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This applies to &lt;strong&gt;any problem&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;backend&lt;/li&gt;
&lt;li&gt;frontend&lt;/li&gt;
&lt;li&gt;infra&lt;/li&gt;
&lt;li&gt;data&lt;/li&gt;
&lt;li&gt;even non-technical problems&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  The Core Idea: Design is a Questioning Process
&lt;/h2&gt;

&lt;p&gt;Good design emerges from &lt;strong&gt;progressively sharper questions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Think of it like peeling an onion:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What exists?&lt;/li&gt;
&lt;li&gt;What changes?&lt;/li&gt;
&lt;li&gt;What must never break?&lt;/li&gt;
&lt;li&gt;What can happen at the same time?&lt;/li&gt;
&lt;li&gt;What happens if things arrive late, twice, or out of order?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each layer removes ambiguity.&lt;/p&gt;


&lt;h2&gt;
  
  
  The 6 First-Principles Questions (Problem-Agnostic)
&lt;/h2&gt;

&lt;p&gt;These six questions work for &lt;strong&gt;any system design problem&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  1️⃣ What are the &lt;strong&gt;things&lt;/strong&gt; that exist?
&lt;/h3&gt;

&lt;p&gt;Before tables, before APIs — ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What are the nouns in this system?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user&lt;/li&gt;
&lt;li&gt;order&lt;/li&gt;
&lt;li&gt;referral&lt;/li&gt;
&lt;li&gt;payment&lt;/li&gt;
&lt;li&gt;message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are &lt;strong&gt;entities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you can point at it or name it, it’s probably an entity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; think about storage yet.&lt;/p&gt;


&lt;h3&gt;
  
  
  2️⃣ Which of these &lt;strong&gt;change over time&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which entities evolve?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;order → created → paid → shipped&lt;/li&gt;
&lt;li&gt;referral → sent → joined or failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This introduces &lt;strong&gt;state&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If something changes, you must model &lt;em&gt;how&lt;/em&gt; it changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where many designs fail — they ignore time.&lt;/p&gt;
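
&lt;p&gt;A minimal Python sketch of modeling &lt;em&gt;how&lt;/em&gt; an order changes. The state names come from the order example above; the &lt;code&gt;ALLOWED&lt;/code&gt; table and the &lt;code&gt;advance&lt;/code&gt; helper are hypothetical names chosen for illustration, not a prescribed implementation.&lt;/p&gt;

```python
# Allowed transitions for the order lifecycle: created, then paid, then shipped.
# ALLOWED and advance() are illustrative names, not part of any library.
ALLOWED = {
    "created": {"paid"},
    "paid": {"shipped"},
    "shipped": set(),  # terminal: nothing may follow
}


def advance(current, target):
    """Return the new state, or raise if the transition is not modeled."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} to {target}")
    return target


state = advance("created", "paid")
state = advance(state, "shipped")
```

&lt;p&gt;Skipping a step, such as &lt;code&gt;advance("created", "shipped")&lt;/code&gt;, raises an error instead of silently producing a state that ignores time.&lt;/p&gt;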


&lt;h3&gt;
  
  
  3️⃣ Which data is &lt;strong&gt;identity&lt;/strong&gt; vs &lt;strong&gt;state&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;This is a critical mental separation.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Which fields define &lt;em&gt;what this thing is&lt;/em&gt;?”&lt;br&gt;
“Which fields define &lt;em&gt;where it is in a process&lt;/em&gt;?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Identity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IDs&lt;/li&gt;
&lt;li&gt;relationships&lt;/li&gt;
&lt;li&gt;who is involved&lt;/li&gt;
&lt;li&gt;usually set once&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;status&lt;/li&gt;
&lt;li&gt;progress&lt;/li&gt;
&lt;li&gt;lifecycle&lt;/li&gt;
&lt;li&gt;changes often&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Identity answers &lt;strong&gt;“what is it?”&lt;/strong&gt;&lt;br&gt;
State answers &lt;strong&gt;“what is happening to it?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You don’t need separate tables yet — just separate thinking.&lt;/p&gt;


&lt;h3&gt;
  
  
  4️⃣ What events can happen &lt;strong&gt;independently&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;Now introduce time and reality.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can these things happen at the same time?”&lt;br&gt;
“Can they arrive out of order?”&lt;br&gt;
“Can they be retried?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;install event&lt;/li&gt;
&lt;li&gt;signup event&lt;/li&gt;
&lt;li&gt;payment confirmation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even in &lt;strong&gt;one service&lt;/strong&gt;, these can be async.&lt;/p&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Async is about &lt;strong&gt;timing&lt;/strong&gt;, not microservices.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  5️⃣ What must &lt;strong&gt;never be allowed&lt;/strong&gt; to happen?
&lt;/h3&gt;

&lt;p&gt;These are &lt;strong&gt;invariants&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“What states are illegal?”&lt;br&gt;
“What combinations should never exist?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reward given twice&lt;/li&gt;
&lt;li&gt;joined without a user&lt;/li&gt;
&lt;li&gt;paid order without payment record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Invariants are stronger than code — enforce them in data models when possible.&lt;/p&gt;
&lt;/blockquote&gt;
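
&lt;p&gt;One way to enforce “reward given twice” in the data model is a uniqueness constraint. The sketch below uses Python’s built-in &lt;code&gt;sqlite3&lt;/code&gt; module with an in-memory database; the &lt;code&gt;rewards&lt;/code&gt; table, its columns, and &lt;code&gt;grant_reward&lt;/code&gt; are assumptions made up for this example.&lt;/p&gt;

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rewards ("
    " referral_id INTEGER NOT NULL UNIQUE,"  # at most one reward per referral
    " amount INTEGER NOT NULL)"
)


def grant_reward(referral_id, amount):
    """Insert a reward; return False if this referral was already rewarded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO rewards (referral_id, amount) VALUES (?, ?)",
                (referral_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = grant_reward(42, 100)
second = grant_reward(42, 100)  # a retry or a race: rejected by the schema
```

&lt;p&gt;The second call fails regardless of which code path issued it, which is the sense in which the invariant is stronger than code.&lt;/p&gt;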


&lt;h3&gt;
  
  
  6️⃣ Where can the system &lt;strong&gt;safely lose information&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;This is subtle and very important.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If two things race, is it okay if one wins?”&lt;br&gt;
“Do I need the full history, or only the outcome?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Final status (joined vs not joined) → overwrite OK&lt;/li&gt;
&lt;li&gt;Money ledger → overwrite NOT OK&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 Rule:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If overwriting loses truth, you need append-only data.&lt;/p&gt;
&lt;/blockquote&gt;
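
&lt;p&gt;The contrast can be sketched in a few lines of Python. The dictionary and list below stand in for real storage, and the account name and amounts are invented for illustration.&lt;/p&gt;

```python
# Overwrite model: only the latest value survives.
referral_status = {}
referral_status["r1"] = "INVITE_SENT"
referral_status["r1"] = "JOINED"  # earlier value is gone, acceptable here

# Append-only model: every entry is kept and the balance is derived.
ledger = []


def record(account, delta_cents):
    """Append an immutable entry; existing entries are never edited."""
    ledger.append((account, delta_cents))


def balance(account):
    """Derive the current balance by summing all entries for the account."""
    return sum(delta for acct, delta in ledger if acct == account)


record("alice", 500)
record("alice", -200)
record("alice", 150)
```

&lt;p&gt;Here &lt;code&gt;balance("alice")&lt;/code&gt; is 450, and all three entries remain available to explain how that number came about.&lt;/p&gt;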


&lt;h2&gt;
  
  
  Case Study: Referral System (Simplified)
&lt;/h2&gt;

&lt;p&gt;Let’s apply the questions to a concrete example.&lt;/p&gt;
&lt;h3&gt;
  
  
  Problem (simplified)
&lt;/h3&gt;

&lt;p&gt;Users invite friends.&lt;br&gt;
Friend either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;joins using the referral&lt;/li&gt;
&lt;li&gt;or doesn’t&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 1: Entities
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Referral&lt;/li&gt;
&lt;li&gt;User&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 2: What changes?
&lt;/h3&gt;

&lt;p&gt;Referral has a lifecycle.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 3: Identity vs State
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Identity&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;referrer_user_id&lt;/li&gt;
&lt;li&gt;referee_user_id (once joined)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invite-sent&lt;/li&gt;
&lt;li&gt;joined&lt;/li&gt;
&lt;li&gt;not-joined&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Step 4: Async reality
&lt;/h3&gt;

&lt;p&gt;Events:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;invite sent&lt;/li&gt;
&lt;li&gt;signup&lt;/li&gt;
&lt;li&gt;code applied / missed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These can race.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step 5: Invariants
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;joined ⇒ referee_user_id exists&lt;/li&gt;
&lt;li&gt;not-joined ⇒ referee_user_id is null&lt;/li&gt;
&lt;/ul&gt;
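
&lt;p&gt;These two implications can be pushed into the schema as &lt;code&gt;CHECK&lt;/code&gt; constraints. The following is a sketch using Python’s &lt;code&gt;sqlite3&lt;/code&gt; module; SQLite, the column types, and the &lt;code&gt;try_insert&lt;/code&gt; helper are assumptions for the example, not something the problem statement dictates.&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE referrals (
        referral_id      INTEGER PRIMARY KEY,
        referrer_user_id INTEGER NOT NULL,
        referee_user_id  INTEGER,
        status           TEXT NOT NULL
            CHECK (status IN ('INVITE_SENT', 'JOINED', 'NOT_JOINED')),
        -- joined implies a referee; not-joined implies none
        CHECK (status != 'JOINED' OR referee_user_id IS NOT NULL),
        CHECK (status != 'NOT_JOINED' OR referee_user_id IS NULL)
    )
    """
)


def try_insert(referrer, referee, status):
    """Return True if the row satisfies the invariants, else False."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO referrals (referrer_user_id, referee_user_id, status)"
                " VALUES (?, ?, ?)",
                (referrer, referee, status),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok_joined = try_insert(1, 2, "JOINED")
bad_joined = try_insert(1, None, "JOINED")       # joined without a referee
bad_not_joined = try_insert(1, 3, "NOT_JOINED")  # not-joined with a referee
```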


&lt;h3&gt;
  
  
  Step 6: Can we lose intermediate info?
&lt;/h3&gt;

&lt;p&gt;Yes.&lt;/p&gt;

&lt;p&gt;We only care about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;joined&lt;/li&gt;
&lt;li&gt;not joined&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We don’t need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;install timestamp&lt;/li&gt;
&lt;li&gt;intermediate states&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Resulting Design
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;single table&lt;/strong&gt; is enough:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;referrals&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;referral_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;referrer_user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;referee_user_id&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;ENUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'INVITE_SENT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'JOINED'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'NOT_JOINED'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the outcome states (JOINED, NOT_JOINED) are terminal&lt;/li&gt;
&lt;li&gt;outcomes are mutually exclusive&lt;/li&gt;
&lt;li&gt;last write wins is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No over-engineering.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Thinking Wins Interviews
&lt;/h2&gt;

&lt;p&gt;Interviewers are not testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;syntax&lt;/li&gt;
&lt;li&gt;frameworks&lt;/li&gt;
&lt;li&gt;memorized architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are testing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reasoning&lt;/li&gt;
&lt;li&gt;clarity&lt;/li&gt;
&lt;li&gt;trade-off awareness&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you explain:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I chose X because these states are terminal and overwrites are safe”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You sound senior — even if the solution is simple.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes Junior Engineers Make
&lt;/h2&gt;

&lt;p&gt;❌ Starting with databases&lt;br&gt;
❌ Overusing “microservices”&lt;br&gt;
❌ Adding Kafka without events&lt;br&gt;
❌ Designing for scale without defining scale&lt;br&gt;
❌ Not questioning requirements&lt;/p&gt;




&lt;h2&gt;
  
  
  The One-Page Interview Checklist
&lt;/h2&gt;

&lt;p&gt;You can memorize this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before Designing Anything, Ask:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What are the entities?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Which ones change over time?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What is identity vs state?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Which events are independent / async?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What invariants must never break?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Can overwrites lose truth?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you answer these clearly, &lt;strong&gt;the design almost writes itself&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Great system design is not about being clever.&lt;/p&gt;

&lt;p&gt;It’s about being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;clear&lt;/li&gt;
&lt;li&gt;deliberate&lt;/li&gt;
&lt;li&gt;honest about trade-offs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you learn to &lt;strong&gt;ask better questions&lt;/strong&gt;, you will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;design better systems&lt;/li&gt;
&lt;li&gt;perform better in interviews&lt;/li&gt;
&lt;li&gt;grow faster as an engineer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly — you’ll know &lt;em&gt;why&lt;/em&gt; your design works.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>beginners</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Contrast sync vs async failure classes using first principles</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Tue, 13 Jan 2026 05:15:43 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/contrast-sync-vs-async-failure-classes-using-first-principles-d12</link>
      <guid>https://forem.com/mohammadidrees/contrast-sync-vs-async-failure-classes-using-first-principles-d12</guid>
      <description>&lt;h2&gt;
  
  
  1. Start from First Principles: What Is a “Failure Class”?
&lt;/h2&gt;

&lt;p&gt;A &lt;em&gt;failure class&lt;/em&gt; is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a bug&lt;/li&gt;
&lt;li&gt;a timeout&lt;/li&gt;
&lt;li&gt;an outage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A failure class is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A category of things that can go wrong because of how responsibility, time, and state are structured&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So we ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;em&gt;must&lt;/em&gt; be true for correctness?&lt;/li&gt;
&lt;li&gt;What assumptions does the model silently make?&lt;/li&gt;
&lt;li&gt;What breaks when those assumptions are false?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Core Difference (One Sentence)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Synchronous systems fail by blocking and cascading.&lt;br&gt;
Asynchronous systems fail by duplication, reordering, and invisibility.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else is a consequence.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Synchronous Systems — Failure Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Definition (First Principles)
&lt;/h3&gt;

&lt;p&gt;A synchronous system assumes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“The caller waits while the callee finishes the work.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This couples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time&lt;/li&gt;
&lt;li&gt;availability&lt;/li&gt;
&lt;li&gt;correctness&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Failure Class 1: Blocking Amplification
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens while the system waits?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Reality:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Threads blocked&lt;/li&gt;
&lt;li&gt;Connections held&lt;/li&gt;
&lt;li&gt;Memory retained&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Load increases → latency increases → throughput collapses&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is &lt;strong&gt;not&lt;/strong&gt; just “slow.”&lt;br&gt;
It is &lt;strong&gt;non-linear failure&lt;/strong&gt;.&lt;/p&gt;
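
&lt;p&gt;One way to see the non-linearity is a textbook single-server queue (M/M/1), where mean time in the system is 1 / (service_rate - arrival_rate). That model, the 100 requests-per-second capacity, and the function name below are assumptions for illustration; real services will differ in detail, but the shape of the curve is the point.&lt;/p&gt;

```python
def mean_latency_seconds(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrivals meet or exceed capacity")
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # requests per second one worker can finish (assumed)

at_50_percent = mean_latency_seconds(50.0, SERVICE_RATE)
at_90_percent = mean_latency_seconds(90.0, SERVICE_RATE)
at_99_percent = mean_latency_seconds(99.0, SERVICE_RATE)
```

&lt;p&gt;Under these assumed numbers, going from 50% to 99% utilization roughly doubles the load but multiplies mean latency by 50 (from 0.02 s to 1 s).&lt;/p&gt;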




&lt;h3&gt;
  
  
  Failure Class 2: Cascading Failure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if a dependency slows down?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because everything is waiting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agent slows → backend slows&lt;/li&gt;
&lt;li&gt;Backend slows → frontend retries&lt;/li&gt;
&lt;li&gt;Retries amplify load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;One slow dependency can take down the entire system&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Failure Class 3: Availability Coupling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can the system function if the dependency is down?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Answer in sync systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Partial outage becomes total outage&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Summary: Sync Failure Classes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Blocking&lt;/td&gt;
&lt;td&gt;Time is coupled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cascades&lt;/td&gt;
&lt;td&gt;Dependencies are inline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global outage&lt;/td&gt;
&lt;td&gt;Availability is transitive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  4. Asynchronous Systems — Failure Classes
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Definition (First Principles)
&lt;/h3&gt;

&lt;p&gt;An async system assumes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Work can finish later, possibly multiple times, possibly out of order.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This decouples time but &lt;strong&gt;removes guarantees&lt;/strong&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  Failure Class 1: Duplicate Execution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens if work is retried?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At-least-once delivery&lt;/li&gt;
&lt;li&gt;Worker crashes&lt;/li&gt;
&lt;li&gt;Message reprocessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Same logical action happens multiple times&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This breaks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once semantics&lt;/li&gt;
&lt;li&gt;Idempotency assumptions&lt;/li&gt;
&lt;/ul&gt;
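
&lt;p&gt;A common defense is to make the handler idempotent by remembering which message IDs have already been applied. The sketch below keeps that record in an in-memory set purely for illustration; the message shape and &lt;code&gt;handle_credit&lt;/code&gt; are hypothetical, and a real system would need the check and the side effect to be durable and atomic.&lt;/p&gt;

```python
processed_ids = set()  # in production this would need durable storage
balances = {"alice": 0}


def handle_credit(message):
    """Apply a credit at most once per message_id, even if redelivered."""
    if message["message_id"] in processed_ids:
        return False  # duplicate delivery: skip the side effect
    balances[message["account"]] += message["amount"]
    processed_ids.add(message["message_id"])
    return True


msg = {"message_id": "m-1", "account": "alice", "amount": 50}
applied_first = handle_credit(msg)
applied_again = handle_credit(msg)  # simulated at-least-once redelivery
```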




&lt;h3&gt;
  
  
  Failure Class 2: Ordering Violations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What defines sequence?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queues don’t know business order&lt;/li&gt;
&lt;li&gt;Workers process independently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Effects appear out of logical order&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For chat systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Responses based on future messages&lt;/li&gt;
&lt;li&gt;Context corruption&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Failure Class 3: Completion Invisibility
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does the user know when work is done?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No direct signal&lt;/li&gt;
&lt;li&gt;Polling or guessing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Users wait blindly or see stale state&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Failure Class 4: Orphaned Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Question asked:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if the user disappears?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Reality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job keeps running&lt;/li&gt;
&lt;li&gt;Response stored but never consumed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Failure mode:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Wasted compute, leaked state&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Summary: Async Failure Classes
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Root Cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Duplication&lt;/td&gt;
&lt;td&gt;Retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reordering&lt;/td&gt;
&lt;td&gt;Decoupled execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Invisibility&lt;/td&gt;
&lt;td&gt;No direct completion path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orphans&lt;/td&gt;
&lt;td&gt;Detached lifecycles&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  5. Side-by-Side Contrast (Mental Model)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Synchronous&lt;/th&gt;
&lt;th&gt;Asynchronous&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Time&lt;/td&gt;
&lt;td&gt;Coupled&lt;/td&gt;
&lt;td&gt;Decoupled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Failure style&lt;/td&gt;
&lt;td&gt;Blocking, cascades&lt;/td&gt;
&lt;td&gt;Duplication, disorder&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Availability&lt;/td&gt;
&lt;td&gt;All-or-nothing&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Correctness risk&lt;/td&gt;
&lt;td&gt;Latency-based&lt;/td&gt;
&lt;td&gt;Logic-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Debugging&lt;/td&gt;
&lt;td&gt;Easier&lt;/td&gt;
&lt;td&gt;Harder&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Deep Insight (This Is the Interview Gold)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Synchronous systems fail loudly and immediately.&lt;br&gt;
Asynchronous systems fail quietly and later.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Sync failures are obvious (timeouts, errors)&lt;/li&gt;
&lt;li&gt;Async failures are subtle (double writes, wrong order)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Why Neither Is “Better”
&lt;/h2&gt;

&lt;p&gt;From first principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sync systems protect &lt;strong&gt;causality&lt;/strong&gt; but sacrifice &lt;strong&gt;availability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Async systems protect &lt;strong&gt;availability&lt;/strong&gt; but sacrifice &lt;strong&gt;causality&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Real systems add mechanisms to &lt;strong&gt;reintroduce the lost property&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async systems add idempotency, ordering, state machines&lt;/li&gt;
&lt;li&gt;Sync systems add timeouts, circuit breakers, fallbacks&lt;/li&gt;
&lt;/ul&gt;
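
&lt;p&gt;As an illustration of the sync side, here is a deliberately small, count-based circuit breaker with a fallback value. The class, its &lt;code&gt;max_failures&lt;/code&gt; parameter, and the failing dependency are all hypothetical; production breakers usually also have a timed half-open state, which this sketch leaves out.&lt;/p&gt;

```python
class CircuitBreaker:
    """Count-based breaker: opens after max_failures consecutive failures."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def is_open(self):
        return self.failures >= self.max_failures

    def call(self, func, fallback):
        if self.is_open:
            return fallback  # fail fast: do not touch the dependency
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback
        self.failures = 0  # any success resets the count
        return result


calls_made = 0


def flaky_dependency():
    """Stand-in for a slow downstream call that always times out."""
    global calls_made
    calls_made += 1
    raise TimeoutError("dependency too slow")


breaker = CircuitBreaker(max_failures=3)
results = [breaker.call(flaky_dependency, "cached-response") for _ in range(5)]
```

&lt;p&gt;After three failures the breaker stops calling the dependency, so the last two requests are answered from the fallback without adding load to it.&lt;/p&gt;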




&lt;h2&gt;
  
  
  8. One-Line Rule to Remember
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Sync breaks under load.&lt;br&gt;
Async breaks under ambiguity.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;





</description>
      <category>architecture</category>
      <category>computerscience</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Applying First-Principles Questioning to a Real Company Interview Question</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Tue, 13 Jan 2026 04:56:05 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/applying-first-principles-questioning-to-a-real-company-interview-question-2c0j</link>
      <guid>https://forem.com/mohammadidrees/applying-first-principles-questioning-to-a-real-company-interview-question-2c0j</guid>
      <description>&lt;h2&gt;
  
  
  Case Study: Designing a Chat System (Meta / WhatsApp–Style)
&lt;/h2&gt;

&lt;p&gt;This section answers a common follow-up interview request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Okay, now apply this thinking to a real problem.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We will do exactly that — &lt;strong&gt;without jumping to tools or architectures first&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The goal is not to “design WhatsApp,” but to demonstrate &lt;strong&gt;how interviewers expect you to think&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Interview Question (Realistic &amp;amp; Common)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Design a chat system like WhatsApp.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a &lt;strong&gt;real company interview question&lt;/strong&gt; asked (in variants) at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Meta&lt;/li&gt;
&lt;li&gt;Uber&lt;/li&gt;
&lt;li&gt;Amazon&lt;/li&gt;
&lt;li&gt;Stripe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most candidates fail this question not because it’s hard, but because they &lt;strong&gt;start in the wrong place&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Most Candidates Do (Wrong Start)
&lt;/h2&gt;

&lt;p&gt;Typical opening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“We’ll use WebSockets”&lt;/li&gt;
&lt;li&gt;“We’ll use Kafka”&lt;/li&gt;
&lt;li&gt;“We’ll shard by user ID”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This skips reasoning.&lt;/p&gt;

&lt;p&gt;A strong candidate &lt;strong&gt;pauses&lt;/strong&gt; and applies the checklist.&lt;/p&gt;




&lt;h1&gt;
  
  
  Applying the First-Principles Checklist Live
&lt;/h1&gt;

&lt;p&gt;We will apply the &lt;strong&gt;same five questions&lt;/strong&gt;, in order, and show what problems naturally surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. State
&lt;/h2&gt;

&lt;h3&gt;
  
  
  “Where does state live? When is it durable?”
&lt;/h3&gt;

&lt;h3&gt;
  
  
  Ask This Out Loud in the Interview
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;What information must the chat system remember for it to function correctly?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Identify Required State (No Design Yet)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Users&lt;/li&gt;
&lt;li&gt;Conversations&lt;/li&gt;
&lt;li&gt;Messages&lt;/li&gt;
&lt;li&gt;Message delivery status&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which of this state must never be lost?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages (core product)&lt;/li&gt;
&lt;li&gt;Conversation membership&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First-Principles Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Messages must be &lt;strong&gt;persisted&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;In-memory-only solutions are insufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What the Interviewer Sees
&lt;/h3&gt;

&lt;p&gt;You identified &lt;strong&gt;correctness-critical state&lt;/strong&gt; before touching architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  “How long does each step take?”
&lt;/h3&gt;

&lt;p&gt;Now we introduce time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Break the Chat Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;User sends message&lt;/li&gt;
&lt;li&gt;Message is stored&lt;/li&gt;
&lt;li&gt;Message is delivered to recipient(s)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which of these must be fast?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Sending a message → must feel instant&lt;/li&gt;
&lt;li&gt;Delivery → may be delayed (offline users)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Critical Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Does the sender wait for delivery confirmation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If yes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency depends on recipient availability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If no:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sending and delivery are time-decoupled&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First-Principles Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Message acceptance must be fast&lt;/li&gt;
&lt;li&gt;Delivery can happen later&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This naturally introduces &lt;strong&gt;asynchrony&lt;/strong&gt;, without naming any tools.&lt;/p&gt;
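
&lt;p&gt;The decoupling can be sketched without any particular tool. In the Python below, a list stands in for durable storage and a &lt;code&gt;deque&lt;/code&gt; for pending work; the function names and user names are invented for the example.&lt;/p&gt;

```python
from collections import deque

stored_messages = []        # stands in for durable storage
pending_delivery = deque()  # work to be done later


def accept_message(sender, recipient, text):
    """Persist and enqueue, then return at once without waiting on delivery."""
    message_id = len(stored_messages) + 1
    stored_messages.append((message_id, sender, recipient, text))
    pending_delivery.append(message_id)
    return message_id


def deliver_pending(online_users):
    """Deliver what can be delivered now; keep the rest queued."""
    delivered = []
    for _ in range(len(pending_delivery)):
        message_id = pending_delivery.popleft()
        recipient = stored_messages[message_id - 1][2]
        if recipient in online_users:
            delivered.append(message_id)
        else:
            pending_delivery.append(message_id)  # retry on a later pass
    return delivered


first_id = accept_message("ana", "bo", "hi")
second_id = accept_message("ana", "cy", "hello")
delivered_now = deliver_pending(online_users={"bo"})
delivered_later = deliver_pending(online_users={"bo", "cy"})
```

&lt;p&gt;Both sends return immediately; the message for the offline recipient simply waits in the queue until a later delivery pass.&lt;/p&gt;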




&lt;h2&gt;
  
  
  3. Failure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  “What breaks independently?”
&lt;/h3&gt;

&lt;p&gt;Now assume failures — explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ask
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens if the system crashes after accepting a message but before delivery?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Possible states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message stored&lt;/li&gt;
&lt;li&gt;Recipient not notified yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can delivery be retried safely?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This surfaces a key invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A message must never be lost, and a retried delivery must never appear to the recipient as a duplicate.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Failure Scenarios Discovered
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate delivery&lt;/li&gt;
&lt;li&gt;Message loss&lt;/li&gt;
&lt;li&gt;Inconsistent delivery status&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First-Principles Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Message delivery must be &lt;strong&gt;idempotent&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Storage and delivery failures must be decoupled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interviewer now sees you understand &lt;strong&gt;distributed failure&lt;/strong&gt;, not just happy paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Order
&lt;/h2&gt;

&lt;h3&gt;
  
  
  “What defines correct sequence?”
&lt;/h3&gt;

&lt;p&gt;Now introduce &lt;strong&gt;multiple messages&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ask
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Does message order matter in a conversation?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Yes — chat messages must appear in order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now ask the dangerous question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Does arrival order equal delivery order?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In distributed systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No guarantee&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Messages can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Be processed by different servers&lt;/li&gt;
&lt;li&gt;Experience different delays&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First-Principles Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ordering is part of correctness&lt;/li&gt;
&lt;li&gt;It must be explicitly modeled (e.g., sequence per conversation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a &lt;strong&gt;senior-level insight&lt;/strong&gt;, derived from questioning alone.&lt;/p&gt;
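
&lt;p&gt;One way to picture “sequence per conversation” is a counter that is scoped to a conversation and assigned when a message is accepted. This is only a sketch: &lt;code&gt;ConversationSequencer&lt;/code&gt; and &lt;code&gt;in_display_order&lt;/code&gt; are made-up names, and a real system would have to keep the counter durable and safe under concurrent writers:&lt;/p&gt;

```python
from collections import defaultdict


class ConversationSequencer:
    """Hypothetical counter: one independent sequence per conversation."""

    def __init__(self):
        self._next = defaultdict(int)

    def assign(self, conversation_id: str) -> int:
        """Return the next sequence number for this conversation."""
        seq = self._next[conversation_id]
        self._next[conversation_id] += 1
        return seq


def in_display_order(messages):
    """Sort received (seq, text) pairs by sequence, not by arrival time."""
    return [text for _, text in sorted(messages)]


sequencer = ConversationSequencer()
a = (sequencer.assign("chat-42"), "Are you free?")
b = (sequencer.assign("chat-42"), "Never mind")

# The second message arrives first because it took a faster path.
arrived = [b, a]
print(in_display_order(arrived))
```

&lt;p&gt;The client sorts by the assigned sequence, so the order on screen no longer depends on which server or network path happened to be faster.&lt;/p&gt;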




&lt;h2&gt;
  
  
  5. Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  “What grows fastest under load?”
&lt;/h3&gt;

&lt;p&gt;Now — and only now — do we talk about scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ask
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;As usage grows, what increases fastest?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Likely answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of messages&lt;/li&gt;
&lt;li&gt;Concurrent active connections&lt;/li&gt;
&lt;li&gt;Offline message backlog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens during spikes (e.g., group chats, viral events)?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You discover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hot conversations&lt;/li&gt;
&lt;li&gt;Uneven load&lt;/li&gt;
&lt;li&gt;Memory pressure from live connections&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  First-Principles Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The system must scale on &lt;strong&gt;messages&lt;/strong&gt;, not users&lt;/li&gt;
&lt;li&gt;Load is not uniform&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  What We Have Discovered (Before Any Design)
&lt;/h1&gt;

&lt;p&gt;Without choosing any tools, we now know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Messages must be durable&lt;/li&gt;
&lt;li&gt;Sending and delivery must be decoupled&lt;/li&gt;
&lt;li&gt;Failures must not cause duplicates or loss&lt;/li&gt;
&lt;li&gt;Ordering is a correctness requirement&lt;/li&gt;
&lt;li&gt;Message volume, not user count, dominates scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly what interviewers want to hear &lt;strong&gt;before&lt;/strong&gt; you propose architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Next (And Why It’s Easy Now)
&lt;/h2&gt;

&lt;p&gt;Only &lt;em&gt;after&lt;/em&gt; this reasoning does it make sense to talk about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Persistent storage&lt;/li&gt;
&lt;li&gt;Async delivery&lt;/li&gt;
&lt;li&gt;Streaming connections&lt;/li&gt;
&lt;li&gt;Partitioning strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point, architecture choices are &lt;strong&gt;obvious&lt;/strong&gt;, not arbitrary.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why This Approach Scores High in Interviews
&lt;/h1&gt;

&lt;p&gt;Interviewers are evaluating:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How you reason under ambiguity&lt;/li&gt;
&lt;li&gt;Whether you surface hidden constraints&lt;/li&gt;
&lt;li&gt;Whether you understand failure modes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are &lt;strong&gt;not&lt;/strong&gt; testing whether you know WhatsApp’s internals.&lt;/p&gt;

&lt;p&gt;This method shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Structured thinking&lt;/li&gt;
&lt;li&gt;Calm problem decomposition&lt;/li&gt;
&lt;li&gt;Senior-level judgment&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Candidate Mistakes (Seen in This Question)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Jumping to WebSockets without discussing durability&lt;/li&gt;
&lt;li&gt;Ignoring offline users&lt;/li&gt;
&lt;li&gt;Assuming message order “just works”&lt;/li&gt;
&lt;li&gt;Treating retries as harmless&lt;/li&gt;
&lt;li&gt;Talking about scale before correctness&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these mistakes is prevented by the checklist.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Reinforcement: The Checklist (Again)
&lt;/h1&gt;

&lt;p&gt;Use this verbatim in interviews:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Where does state live? When is it durable?&lt;/li&gt;
&lt;li&gt;Which steps are fast vs slow?&lt;/li&gt;
&lt;li&gt;What can fail independently?&lt;/li&gt;
&lt;li&gt;What defines correct order?&lt;/li&gt;
&lt;li&gt;What grows fastest under load?&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Final Mental Model
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Strong candidates design systems.&lt;br&gt;
Exceptional candidates design &lt;em&gt;reasoning&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>career</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to Question Any System Design Problem (With Live Interview Walkthrough)</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Tue, 13 Jan 2026 04:50:41 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/how-to-question-any-system-design-problem-with-live-interview-walkthrough-2cd4</link>
      <guid>https://forem.com/mohammadidrees/how-to-question-any-system-design-problem-with-live-interview-walkthrough-2cd4</guid>
      <description>&lt;h1&gt;
  
  
  Thinking in First Principles
&lt;/h1&gt;

&lt;p&gt;Most system design interview failures are not caused by missing knowledge of tools.&lt;/p&gt;

&lt;p&gt;They are caused by &lt;strong&gt;missing questions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Strong candidates do not start by designing systems.&lt;br&gt;
They start by &lt;strong&gt;interrogating the problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This post teaches you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to question a system from &lt;strong&gt;first principles&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;How to apply that questioning &lt;strong&gt;live in an interview&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;What mistakes candidates commonly make&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;printable one-page checklist&lt;/strong&gt; you can memorize and reuse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No prior system design experience required.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “First Principles” Means in System Design
&lt;/h2&gt;

&lt;p&gt;First principles means reducing a problem to &lt;strong&gt;fundamental truths that must always hold&lt;/strong&gt;, regardless of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Programming language&lt;/li&gt;
&lt;li&gt;Framework&lt;/li&gt;
&lt;li&gt;Infrastructure&lt;/li&gt;
&lt;li&gt;Scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every system—chat apps, payment systems, video processing pipelines—must answer the same core questions about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Failure&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Order&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a design cannot answer one of these, it is incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 5-Step First-Principles Questioning Framework
&lt;/h2&gt;

&lt;p&gt;You will apply these questions &lt;strong&gt;in order&lt;/strong&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;State&lt;/strong&gt; – Where does information live? When is it durable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time&lt;/strong&gt; – How long does each step take?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure&lt;/strong&gt; – What breaks independently?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order&lt;/strong&gt; – What defines correct sequence?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt; – What grows fastest under load?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is not a checklist you recite.&lt;br&gt;
It is a &lt;strong&gt;thinking sequence&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s walk through each one.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. State — Where Does It Live? When Is It Durable?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does the system’s information exist, and when is it safe from loss?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is always the &lt;strong&gt;first question&lt;/strong&gt; because nothing else matters if data disappears.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You’re Really Asking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Is data stored in memory or persisted?&lt;/li&gt;
&lt;li&gt;What survives a crash or restart?&lt;/li&gt;
&lt;li&gt;What is the source of truth?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Case
&lt;/h3&gt;

&lt;p&gt;Imagine a system that accepts user requests and processes them later.&lt;/p&gt;

&lt;p&gt;If the request only lives in memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A restart loses it&lt;/li&gt;
&lt;li&gt;A crash loses it&lt;/li&gt;
&lt;li&gt;Another instance can’t see it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have discovered a &lt;strong&gt;correctness problem&lt;/strong&gt;, not a performance one.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If state only exists in a running process, it does not exist.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  2. Time — How Long Does Each Step Take?
&lt;/h2&gt;

&lt;p&gt;Once state exists, time becomes unavoidable.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Which steps are fast, and which are slow?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You are comparing &lt;strong&gt;orders of magnitude&lt;/strong&gt;, not exact numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What You’re Really Asking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Is there long-running work?&lt;/li&gt;
&lt;li&gt;Does the user wait for it?&lt;/li&gt;
&lt;li&gt;Is fast work blocked by slow work?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Case
&lt;/h3&gt;

&lt;p&gt;A system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accepts a request (milliseconds)&lt;/li&gt;
&lt;li&gt;Performs heavy processing (seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the request waits for processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency is dominated by the slowest step&lt;/li&gt;
&lt;li&gt;Throughput collapses under load&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;The slowest step defines the user experience.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. Failure — What Breaks Independently?
&lt;/h2&gt;

&lt;p&gt;Now assume something goes wrong. It always will.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Which parts of the system can fail without the others failing?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What You’re Really Asking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What if the system crashes mid-operation?&lt;/li&gt;
&lt;li&gt;What if work is retried?&lt;/li&gt;
&lt;li&gt;Can the same work run twice?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Case
&lt;/h3&gt;

&lt;p&gt;If work can be retried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It may run twice&lt;/li&gt;
&lt;li&gt;Side effects may duplicate&lt;/li&gt;
&lt;li&gt;State may become inconsistent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a bug.&lt;br&gt;
It is the &lt;strong&gt;default behavior&lt;/strong&gt; of distributed systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Distributed systems fail partially, not cleanly.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  4. Order — What Defines Correct Sequence?
&lt;/h2&gt;

&lt;p&gt;Ordering issues appear only after state, time, and failure are considered.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Does correctness depend on the order of operations?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What You’re Really Asking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Does arrival order equal processing order?&lt;/li&gt;
&lt;li&gt;Can later work finish earlier?&lt;/li&gt;
&lt;li&gt;Does that matter?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Case
&lt;/h3&gt;

&lt;p&gt;Two requests arrive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A then B&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If B completes before A:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the system still correct?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the answer is “no,” order must be explicitly enforced.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;If order matters, it must be designed—not assumed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  5. Scale — What Grows Fastest?
&lt;/h2&gt;

&lt;p&gt;Only now do we talk about scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Question
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;As usage increases, which dimension grows fastest?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  What You’re Really Asking
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Requests?&lt;/li&gt;
&lt;li&gt;Stored data?&lt;/li&gt;
&lt;li&gt;Concurrent operations?&lt;/li&gt;
&lt;li&gt;Waiting work?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Example Case
&lt;/h3&gt;

&lt;p&gt;If each request waits on slow work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of requests waiting at the same time grows in proportion to latency&lt;/li&gt;
&lt;li&gt;Resources exhaust quickly&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Insight
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Systems fail at the fastest-growing dimension.&lt;/p&gt;
&lt;/blockquote&gt;
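
&lt;p&gt;The link between latency and concurrent waiting can be made concrete with Little’s Law from queueing theory: average requests in flight equals arrival rate times average latency. The numbers below are illustrative only, and &lt;code&gt;requests_in_flight&lt;/code&gt; is a name invented for this sketch:&lt;/p&gt;

```python
def requests_in_flight(arrival_rate_per_s: float, latency_s: float) -> float:
    """Little's Law: average concurrency = arrival rate x average latency."""
    return arrival_rate_per_s * latency_s


# Illustrative numbers only: 200 requests/second at two different latencies.
fast = requests_in_flight(200, 0.05)   # request returns after 50 ms
slow = requests_in_flight(200, 2.0)    # request waits 2 s for slow work
print(fast, slow)
```

&lt;p&gt;At the same arrival rate, making each request wait 2 seconds instead of 50 ms means roughly 40 times as many requests holding resources at once, which is why waiting work is so often the dimension that grows fastest.&lt;/p&gt;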




&lt;h1&gt;
  
  
  Live Mock Interview Case Study (Detailed)
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Interviewer
&lt;/h3&gt;

&lt;p&gt;“Design a system where users submit tasks and receive results later.”&lt;/p&gt;




&lt;h3&gt;
  
  
  Candidate (Correct Approach)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
Before designing, I’d like to understand what state the system must preserve.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 1: State
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
We must store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user’s request&lt;/li&gt;
&lt;li&gt;The result&lt;/li&gt;
&lt;li&gt;A way to associate them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This state must survive crashes, so it needs to be persisted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interviewer:&lt;/strong&gt;&lt;br&gt;
Good. Continue.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Time
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
Submitting a request is likely fast.&lt;br&gt;
Producing a result could be slow.&lt;/p&gt;

&lt;p&gt;If we make users wait for result generation, latency will be high and throughput limited.&lt;/p&gt;

&lt;p&gt;So the system likely separates request acceptance from processing.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 3: Failure
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
Now I’ll assume failures.&lt;/p&gt;

&lt;p&gt;If processing crashes mid-way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The request still exists&lt;/li&gt;
&lt;li&gt;Processing may retry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means the same task could execute twice.&lt;/p&gt;

&lt;p&gt;So we must consider whether duplicate execution is safe.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 4: Order
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
If users submit multiple tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does order matter?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If yes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrival order ≠ completion order&lt;/li&gt;
&lt;li&gt;We need to explicitly preserve sequence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If no:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks can be processed independently&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 5: Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Candidate:&lt;/strong&gt;&lt;br&gt;
Under load, the fastest-growing dimension is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pending background work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If processing is slow, the backlog grows quickly.&lt;/p&gt;

&lt;p&gt;So the system must degrade gracefully under that pressure.&lt;/p&gt;
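
&lt;p&gt;One possible reading of “degrade gracefully” is to bound the backlog and reject new work quickly once it is full, instead of letting it grow without limit. The sketch below uses &lt;code&gt;queue.Queue&lt;/code&gt; from the Python standard library; the limit of 3 and the &lt;code&gt;submit&lt;/code&gt; function are assumptions chosen to keep the example small:&lt;/p&gt;

```python
import queue

# Hypothetical limit; a real value would come from measured worker capacity.
MAX_PENDING = 3
backlog = queue.Queue(maxsize=MAX_PENDING)


def submit(task_id: str) -> str:
    """Accept the task if there is room, otherwise reject it immediately."""
    try:
        backlog.put_nowait(task_id)
        return "accepted"
    except queue.Full:
        return "rejected: try again later"


results = [submit(f"task-{i}") for i in range(5)]
print(results)
print(backlog.qsize())
```

&lt;p&gt;Rejecting early is a trade-off rather than the only option: the user gets a fast, honest answer, and the tasks already accepted are not starved by an ever-growing queue.&lt;/p&gt;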




&lt;h3&gt;
  
  
  Interviewer Assessment
&lt;/h3&gt;

&lt;p&gt;The candidate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Asked structured questions&lt;/li&gt;
&lt;li&gt;Identified real failure modes&lt;/li&gt;
&lt;li&gt;Avoided premature tools&lt;/li&gt;
&lt;li&gt;Demonstrated systems thinking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No tools were required to pass this interview.&lt;/p&gt;




&lt;h2&gt;
  
  
  Common Mistakes Candidates Make
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Jumping to Solutions
&lt;/h3&gt;

&lt;p&gt;❌ “We’ll use Kafka”&lt;br&gt;
✅ “What happens if work runs twice?”&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Treating State as Implementation Detail
&lt;/h3&gt;

&lt;p&gt;❌ “We’ll store it somewhere”&lt;br&gt;
✅ “What must never be lost?”&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Ignoring Failure
&lt;/h3&gt;

&lt;p&gt;❌ “Retries should work”&lt;br&gt;
✅ “What if retries duplicate effects?”&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Assuming Order
&lt;/h3&gt;

&lt;p&gt;❌ “Requests are processed in order”&lt;br&gt;
✅ “What enforces that order?”&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Talking About Scale Too Early
&lt;/h3&gt;

&lt;p&gt;❌ “Millions of users”&lt;br&gt;
✅ “Which dimension explodes first?”&lt;/p&gt;




&lt;h1&gt;
  
  
  Printable One-Page Interview Checklist
&lt;/h1&gt;

&lt;p&gt;You can print or memorize this.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;First-Principles System Design Checklist&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ask these in order:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;What information must exist?&lt;/li&gt;
&lt;li&gt;Where does it live?&lt;/li&gt;
&lt;li&gt;When is it durable?&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Which steps are fast?&lt;/li&gt;
&lt;li&gt;Which are slow?&lt;/li&gt;
&lt;li&gt;Does slow work block fast work?&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;&lt;strong&gt;Failure&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;What can fail independently?&lt;/li&gt;
&lt;li&gt;Can work be retried?&lt;/li&gt;
&lt;li&gt;What happens if it runs twice?&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Order&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Does correctness depend on sequence?&lt;/li&gt;
&lt;li&gt;Is arrival order preserved?&lt;/li&gt;
&lt;li&gt;What enforces ordering?&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;What grows fastest?&lt;/li&gt;
&lt;li&gt;How does the system fail under load?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Mental Model
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Great system design is not about building systems.&lt;br&gt;
It is about exposing hidden assumptions.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This framework helps you do that—calmly, systematically, and convincingly.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>career</category>
      <category>interview</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Thinking in First Principles: How to Question an Async Queue–Based Design</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Tue, 13 Jan 2026 04:04:43 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/thinking-in-first-principles-how-to-question-an-async-queue-based-design-5cf1</link>
      <guid>https://forem.com/mohammadidrees/thinking-in-first-principles-how-to-question-an-async-queue-based-design-5cf1</guid>
      <description>&lt;p&gt;Async queues are one of the most commonly suggested “solutions” in system design interviews.&lt;/p&gt;

&lt;p&gt;But many candidates jump straight to &lt;em&gt;using&lt;/em&gt; queues without understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What problems they actually solve&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What new problems they introduce&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to systematically discover those problems&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post teaches a &lt;strong&gt;first-principles questioning process&lt;/strong&gt; you can apply to &lt;em&gt;any&lt;/em&gt; async queue design—without assuming prior knowledge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;In interviews, interviewers are not evaluating whether you know Kafka, SQS, or RabbitMQ.&lt;/p&gt;

&lt;p&gt;They are evaluating whether you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reason about &lt;strong&gt;time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reason about &lt;strong&gt;failure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reason about &lt;strong&gt;order&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Reason about &lt;strong&gt;user experience&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Async queues change all four.&lt;/p&gt;




&lt;h2&gt;
  
  
  What “First Principles” Means Here
&lt;/h2&gt;

&lt;p&gt;First principles means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We do &lt;strong&gt;not&lt;/strong&gt; start with solutions&lt;/li&gt;
&lt;li&gt;We do &lt;strong&gt;not&lt;/strong&gt; assume correctness&lt;/li&gt;
&lt;li&gt;We ask &lt;strong&gt;basic, unavoidable questions&lt;/strong&gt; that every system must answer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Async queues &lt;em&gt;feel&lt;/em&gt; correct because they remove blocking—but correctness is not guaranteed by intuition.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Reference Mental Model (Abstract)
&lt;/h2&gt;

&lt;p&gt;We will reason about this &lt;strong&gt;abstract pattern&lt;/strong&gt;, not a specific product:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → API → Storage → Queue → Worker → Storage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No domain assumptions. This could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat messages&lt;/li&gt;
&lt;li&gt;Emails&lt;/li&gt;
&lt;li&gt;Payments&lt;/li&gt;
&lt;li&gt;Notifications&lt;/li&gt;
&lt;li&gt;Image processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The questioning process stays the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: The Root Question (Always Start Here)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is the system responsible for completing before it can respond?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the most important question in system design.&lt;/p&gt;

&lt;p&gt;Why?&lt;br&gt;
Because it defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request boundaries&lt;/li&gt;
&lt;li&gt;Latency expectations&lt;/li&gt;
&lt;li&gt;Responsibility&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  In an async queue design, the implicit answer is:
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;“The request is complete once the work is enqueued.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is different from synchronous designs, where the request completes after work finishes.&lt;/p&gt;

&lt;p&gt;So far, this seems good.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Introduce Time (What Happens Later?)
&lt;/h2&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Which part of the work happens &lt;em&gt;after&lt;/em&gt; the request is done?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The worker processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to an important realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system has split work across time&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Time separation is powerful—but it creates new questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Causality Question (Identity Across Time)
&lt;/h2&gt;

&lt;p&gt;Once work happens later, we must ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does the system know which output belongs to which input?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This question always appears when time is decoupled.&lt;/p&gt;

&lt;p&gt;Typical answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IDs in the job payload (request ID, entity ID)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This introduces a new invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Each input must produce exactly one correct output&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now we test whether the system can &lt;em&gt;guarantee&lt;/em&gt; this.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Failure Question (The Queue Reality)
&lt;/h2&gt;

&lt;p&gt;Now ask the most important async-specific question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What happens if the worker crashes mid-processing?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Realistic answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The job is retried&lt;/li&gt;
&lt;li&gt;The work may run again&lt;/li&gt;
&lt;li&gt;The output may be produced twice&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to a critical realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Async queues are usually &lt;em&gt;at-least-once&lt;/em&gt;, not &lt;em&gt;exactly-once&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is not a tooling issue.&lt;br&gt;
It is a &lt;strong&gt;fundamental property of distributed systems&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Duplication Question (Invariant Violation)
&lt;/h2&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What happens if the same job is processed twice?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Consequences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate outputs&lt;/li&gt;
&lt;li&gt;Duplicate side effects&lt;/li&gt;
&lt;li&gt;Conflicting state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This violates the earlier invariant:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Exactly one output per input”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, we have discovered a &lt;strong&gt;correctness problem&lt;/strong&gt;, not a performance problem.&lt;/p&gt;
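
&lt;p&gt;The sketch below simulates the failure described in Steps 4 and 5: a worker finishes the work, crashes before acknowledging the job, and the job is redelivered. It is a toy model rather than the behavior of any particular queue product, and names such as &lt;code&gt;run_at_least_once&lt;/code&gt; and &lt;code&gt;CrashBeforeAck&lt;/code&gt; are invented for illustration. It also contrasts an append-style side effect with a write keyed by job ID:&lt;/p&gt;

```python
class CrashBeforeAck(Exception):
    """Stands in for a worker dying after doing the work but before acking."""


def run_at_least_once(job, handler, max_attempts=3):
    """Redeliver the job until the handler returns without crashing."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(job, attempt)
            return attempt
        except CrashBeforeAck:
            continue  # no ack was recorded, so the queue redelivers
    raise RuntimeError("job exhausted its attempts")


appended = []   # naive side effect: one entry per execution
keyed = {}      # idempotent side effect: one entry per job ID


def handler(job, attempt):
    appended.append(job["result"])
    keyed[job["id"]] = job["result"]
    if attempt == 1:
        raise CrashBeforeAck()  # work is done, but the ack never happens


attempts = run_at_least_once({"id": "job-7", "result": "thumbnail.png"}, handler)
print(attempts, len(appended), len(keyed))
```

&lt;p&gt;The job runs twice, so the appended output is duplicated, while the write keyed by job ID still holds exactly one result. The invariant is restored by the shape of the side effect, not by the queue.&lt;/p&gt;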




&lt;h2&gt;
  
  
  Step 6: Ordering Question (Time Without Synchrony)
&lt;/h2&gt;

&lt;p&gt;Now consider multiple inputs.&lt;/p&gt;

&lt;p&gt;Ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What defines the order of processing?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Important realization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Queue order ≠ business order&lt;/li&gt;
&lt;li&gt;Different workers process at different speeds&lt;/li&gt;
&lt;li&gt;Later inputs may finish first&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does correctness depend on order?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If yes (and many systems do):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async queues alone are insufficient&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This problem emerges &lt;em&gt;only&lt;/em&gt; when you question order explicitly.&lt;/p&gt;
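
&lt;p&gt;A small arithmetic sketch of why queue order and completion order diverge, assuming each job runs on its own worker. The timings are made up, and &lt;code&gt;completion_order&lt;/code&gt; is a hypothetical helper:&lt;/p&gt;

```python
def completion_order(jobs):
    """Return job names sorted by finish time.

    Each job is (name, start_time_s, duration_s); every job is assumed to
    run on its own worker, so nothing waits for an earlier job.
    """
    finished = [(start + duration, name) for name, start, duration in jobs]
    return [name for _, name in sorted(finished)]


# Illustrative timings: A is dequeued first but takes longer than B.
jobs = [
    ("A", 0.0, 5.0),   # dequeued at t=0, finishes at t=5
    ("B", 1.0, 1.0),   # dequeued at t=1, finishes at t=2
]
print(completion_order(jobs))
```

&lt;p&gt;The queue handed out A before B, yet B’s result is written first. If B depends on A, something other than the queue has to enforce that.&lt;/p&gt;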




&lt;h2&gt;
  
  
  Step 7: Visibility Question (User Experience)
&lt;/h2&gt;

&lt;p&gt;Now switch perspectives.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does the user know the work is finished?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Possible answers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polling&lt;/li&gt;
&lt;li&gt;Guessing&lt;/li&gt;
&lt;li&gt;Timeouts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each answer reveals a problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polling wastes resources&lt;/li&gt;
&lt;li&gt;Guessing is unreliable&lt;/li&gt;
&lt;li&gt;Timeouts fail under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This violates a core system principle:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Users should not wait blindly&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;
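
&lt;p&gt;One possible way to avoid blind waiting is to give every job an explicit, observable status, and optionally to notify the client when it completes. The sketch below is an assumption about how that might be modeled; &lt;code&gt;JobTracker&lt;/code&gt;, &lt;code&gt;Status&lt;/code&gt;, and the callback are hypothetical and held in memory for brevity:&lt;/p&gt;

```python
from enum import Enum


class Status(Enum):
    QUEUED = "queued"
    PROCESSING = "processing"
    DONE = "done"


class JobTracker:
    """Hypothetical store that makes job progress observable."""

    def __init__(self):
        self._status = {}
        self._listeners = {}

    def enqueue(self, job_id, on_done):
        self._status[job_id] = Status.QUEUED
        self._listeners[job_id] = on_done

    def status(self, job_id):
        """Pull model: the client asks for the current state."""
        return self._status[job_id]

    def start(self, job_id):
        self._status[job_id] = Status.PROCESSING

    def finish(self, job_id):
        """Push model: the registered callback fires once on completion."""
        self._status[job_id] = Status.DONE
        self._listeners.pop(job_id)(job_id)


notified = []
tracker = JobTracker()
tracker.enqueue("job-1", notified.append)
before = tracker.status("job-1")
tracker.start("job-1")
tracker.finish("job-1")
print(before.value, tracker.status("job-1").value, notified)
```

&lt;p&gt;Whether the client pulls the status or is pushed a notification, the point is the same: completion becomes a fact the system records, not something the user has to guess.&lt;/p&gt;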




&lt;h2&gt;
  
  
  Case Study: A Simple Example (Problem-Agnostic)
&lt;/h2&gt;

&lt;p&gt;Imagine a system where users upload photos to be processed.&lt;/p&gt;

&lt;p&gt;Flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User uploads photo&lt;/li&gt;
&lt;li&gt;API stores metadata&lt;/li&gt;
&lt;li&gt;Job is enqueued&lt;/li&gt;
&lt;li&gt;Worker processes photo&lt;/li&gt;
&lt;li&gt;Result is stored&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now apply the questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When does the upload request complete? → After enqueue&lt;/li&gt;
&lt;li&gt;What if the worker crashes? → Job retried&lt;/li&gt;
&lt;li&gt;What if it runs twice? → Two processed images&lt;/li&gt;
&lt;li&gt;What if two photos depend on order? → Order not guaranteed&lt;/li&gt;
&lt;li&gt;How does the user know processing is done? → Polling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these issues are about images.&lt;br&gt;
They are about &lt;strong&gt;time, failure, identity, order, and visibility&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Async Queues Actually Trade
&lt;/h2&gt;

&lt;p&gt;Async queues solve one problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;They remove blocking from the request path&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But they introduce others:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Introduced&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Blocking&lt;/td&gt;
&lt;td&gt;Duplicate work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency coupling&lt;/td&gt;
&lt;td&gt;Ordering ambiguity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Resource exhaustion&lt;/td&gt;
&lt;td&gt;Completion uncertainty&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is not bad.&lt;br&gt;
It just must be &lt;strong&gt;understood and handled&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The One-Page Interview Checklist (Memorize This)
&lt;/h2&gt;

&lt;p&gt;For &lt;strong&gt;any async queue design&lt;/strong&gt;, ask these five questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What completes the request?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What runs later?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What happens if it runs twice?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;What defines order?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How does the user observe completion?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you cannot answer all five clearly, the design is incomplete.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Mental Model
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Async systems remove time coupling but destroy causality by default&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Your job as an engineer is not to “use queues.”&lt;br&gt;
Your job is to &lt;strong&gt;restore correctness explicitly&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is what interviewers are looking for.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>interview</category>
      <category>learning</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>How to Identify System Design Problems from First Principles</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Tue, 13 Jan 2026 03:05:21 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/how-to-identify-system-design-problems-from-first-principles-4n28</link>
      <guid>https://forem.com/mohammadidrees/how-to-identify-system-design-problems-from-first-principles-4n28</guid>
      <description>&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;In system design interviews, candidates often fail &lt;strong&gt;not because they don’t know tools&lt;/strong&gt;, but because they &lt;strong&gt;don’t know how to ask the right questions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Strong designers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover problems &lt;em&gt;before&lt;/em&gt; proposing solutions&lt;/li&gt;
&lt;li&gt;Reason about failures without running systems&lt;/li&gt;
&lt;li&gt;Explain &lt;em&gt;why&lt;/em&gt; an architecture breaks under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This post teaches &lt;strong&gt;how to identify system design problems from first principles&lt;/strong&gt;, using a repeatable questioning process.&lt;/p&gt;




&lt;h2&gt;
  
  
  First Principles: What Are We Actually Designing?
&lt;/h2&gt;

&lt;p&gt;Before any architecture, clarify this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A system is a machine that accepts requests and holds resources until it finishes work.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So every design must answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What work must be done?&lt;/li&gt;
&lt;li&gt;How long does it take?&lt;/li&gt;
&lt;li&gt;What resources are held while it runs?&lt;/li&gt;
&lt;li&gt;What happens when things slow or fail?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Root Question (Always Start Here)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What must the system finish before it can respond?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;first and most important question&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It defines request boundaries&lt;/li&gt;
&lt;li&gt;It determines latency&lt;/li&gt;
&lt;li&gt;It determines failure coupling&lt;/li&gt;
&lt;li&gt;It determines scalability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t answer this explicitly, the system will answer it &lt;em&gt;implicitly&lt;/em&gt; — usually incorrectly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question Ladder (Mental Checklist)
&lt;/h2&gt;

&lt;p&gt;Once the root question is answered, follow this &lt;strong&gt;exact sequence&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  1️⃣ What defines request completion?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When is the request “done”?&lt;/li&gt;
&lt;li&gt;What must succeed before responding?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2️⃣ Which step is the slowest?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Database write? (ms)&lt;/li&gt;
&lt;li&gt;Network call? (hundreds of ms)&lt;/li&gt;
&lt;li&gt;External service / ML model? (seconds)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The slowest step dominates the system.&lt;/p&gt;




&lt;h3&gt;
  
  
  3️⃣ What resources are held while waiting?
&lt;/h3&gt;

&lt;p&gt;Ask concretely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is a goroutine/thread blocked?&lt;/li&gt;
&lt;li&gt;Is an HTTP connection open?&lt;/li&gt;
&lt;li&gt;Is memory retained?&lt;/li&gt;
&lt;li&gt;Is a DB connection reserved?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If resources are held → risk exists.&lt;/p&gt;




&lt;h3&gt;
  
  
  4️⃣ What scales with traffic vs latency?
&lt;/h3&gt;

&lt;p&gt;This is the &lt;strong&gt;diagnostic question&lt;/strong&gt; that exposes blocking.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does resource usage grow when traffic increases — or when latency increases?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This distinction is critical.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Principle (Memorize This)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Healthy systems scale with traffic.&lt;br&gt;
Broken systems scale with latency.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Latency is unpredictable and, without timeouts, effectively unbounded.&lt;br&gt;
Traffic can be controlled, for example with rate limiting or load shedding.&lt;/p&gt;

&lt;p&gt;If your system scales with latency, it will collapse under real-world conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case Study 1: Synchronous API with External Dependency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → API → External Service → API → Client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API waits for the external service before responding.&lt;/p&gt;




&lt;h3&gt;
  
  
  Apply the Question Ladder
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. What defines completion?&lt;/strong&gt;&lt;br&gt;
→ External service response&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Slowest step?&lt;/strong&gt;&lt;br&gt;
→ External service (seconds)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Resources held?&lt;/strong&gt;&lt;br&gt;
→ Open HTTP request, goroutine, memory&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. What scales with latency?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;10 requests/sec&lt;/li&gt;
&lt;li&gt;External latency = 5 sec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 5 seconds (10 requests/sec × 5 sec, all still in flight):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50 concurrent requests&lt;/li&gt;
&lt;li&gt;50 goroutines blocked&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Latency increases → concurrency explodes.&lt;/p&gt;
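&lt;p&gt;The arithmetic above is Little's Law: requests in flight = arrival rate × time each request spends in the system. A minimal Python sketch, assuming steady-state averages (the function name and the sample latencies are illustrative, not taken from any library):&lt;/p&gt;

```python
def in_flight_requests(arrival_rate_per_sec, latency_sec):
    """Little's Law: average requests in flight = arrival rate * time in system."""
    return arrival_rate_per_sec * latency_sec


# Same traffic (10 requests/sec), different downstream latency (illustrative values).
for latency_sec in (0.05, 0.5, 5, 30):
    blocked = in_flight_requests(10, latency_sec)
    print(f"latency={latency_sec}s  blocked handlers={blocked}")
```

&lt;p&gt;At a fixed 10 requests/sec, the number of blocked handlers is set entirely by how slow the dependency is, which is why this resource cannot be budgeted from traffic alone.&lt;/p&gt;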




&lt;h3&gt;
  
  
  Failure Identified
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Concurrency scales with latency, not traffic&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is blocking — even if traffic is low.&lt;/p&gt;




&lt;h2&gt;
  
  
  Case Study 2: Database Transaction Around Long Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Begin Transaction
→ Write data
→ Call external service
→ Write result
→ Commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Apply the Questions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Completion defined by:&lt;/strong&gt; the commit, which waits on the external service&lt;br&gt;
&lt;strong&gt;Slowest step:&lt;/strong&gt; external service&lt;br&gt;
&lt;strong&gt;Resources held:&lt;/strong&gt; DB transaction, locks, connection&lt;br&gt;
&lt;strong&gt;Scaling factor:&lt;/strong&gt; external latency&lt;/p&gt;




&lt;h3&gt;
  
  
  Failure Identified
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Slow work is holding scarce resources&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lock contention&lt;/li&gt;
&lt;li&gt;Connection pool exhaustion&lt;/li&gt;
&lt;li&gt;Cascading failures&lt;/li&gt;
&lt;/ul&gt;
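
&lt;p&gt;One way to see the pool exhaustion risk is to compute the throughput ceiling a connection pool imposes: each connection can complete at most 1 / hold-time transactions per second. A rough sketch, assuming a hypothetical pool of 20 connections, about 5 ms of database work, and a 5-second external call (all numbers are illustrative assumptions):&lt;/p&gt;

```python
def max_transactions_per_sec(pool_size, hold_time_sec):
    """Upper bound on throughput when each transaction pins one connection for hold_time_sec."""
    return pool_size / hold_time_sec


POOL_SIZE = 20  # hypothetical pool size

# Transaction wraps only the writes (assumed 5 ms of database work).
short_hold = max_transactions_per_sec(POOL_SIZE, 0.005)

# Transaction also wraps an assumed 5-second external call.
long_hold = max_transactions_per_sec(POOL_SIZE, 5.0)

print(f"writes only:          {short_hold:.0f} tx/sec")
print(f"external call inside: {long_hold:.0f} tx/sec")
```

&lt;p&gt;Under these assumed numbers, the same pool drops from roughly 4000 to 4 transactions per second once the slow call sits inside the transaction, before any lock contention is counted.&lt;/p&gt;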




&lt;h2&gt;
  
  
  Case Study 3: Real-Time Streaming System
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client ⇄ WebSocket ⇄ Server ⇄ Generator
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tokens stream incrementally.&lt;/p&gt;




&lt;h3&gt;
  
  
  Apply the Questions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Completion defined by:&lt;/strong&gt; final token&lt;br&gt;
&lt;strong&gt;Slowest step:&lt;/strong&gt; generation duration&lt;br&gt;
&lt;strong&gt;Resources held:&lt;/strong&gt; socket, memory buffers&lt;br&gt;
&lt;strong&gt;Latency impact:&lt;/strong&gt; long streams = long-held connections&lt;/p&gt;




&lt;h3&gt;
  
  
  New Failure Mode
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Intermediate states now exist&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Questions arise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if connection drops mid-stream?&lt;/li&gt;
&lt;li&gt;What defines “done”?&lt;/li&gt;
&lt;li&gt;Can streams overlap?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Streaming solves latency perception but introduces &lt;strong&gt;state complexity&lt;/strong&gt;.&lt;/p&gt;
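
&lt;p&gt;One way to make those intermediate states explicit is a small state machine per stream. The sketch below is illustrative only: the class, the state names, and the idea of resuming from a token offset are assumptions, not part of any WebSocket API.&lt;/p&gt;

```python
from enum import Enum


class StreamState(Enum):
    STREAMING = "streaming"
    COMPLETED = "completed"
    INTERRUPTED = "interrupted"


class TokenStream:
    """Tracks what has been delivered so that 'done' and 'dropped' are explicit states."""

    def __init__(self):
        self.state = StreamState.STREAMING
        self.tokens = []

    def on_token(self, token):
        if self.state is not StreamState.STREAMING:
            raise RuntimeError(f"cannot accept tokens while {self.state.value}")
        self.tokens.append(token)

    def on_final_token(self, token):
        # "Done" is defined by the final token, not by the socket closing.
        self.on_token(token)
        self.state = StreamState.COMPLETED

    def on_disconnect(self):
        # A dropped connection only matters if the stream had not finished.
        if self.state is StreamState.STREAMING:
            self.state = StreamState.INTERRUPTED

    def resume_offset(self):
        # Number of tokens already recorded; a reconnect could continue from here.
        return len(self.tokens)


stream = TokenStream()
stream.on_token("Hello")
stream.on_token(" world")
stream.on_disconnect()
print(stream.state.value, stream.resume_offset())
```

&lt;p&gt;The point of the sketch is that a mid-stream drop becomes a named state with a known offset, rather than an ambiguous half-finished response.&lt;/p&gt;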




&lt;h2&gt;
  
  
  Why These Problems Are Found &lt;em&gt;Before&lt;/em&gt; Architecture
&lt;/h2&gt;

&lt;p&gt;Notice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No queues&lt;/li&gt;
&lt;li&gt;No database products&lt;/li&gt;
&lt;li&gt;No technologies mentioned&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Problems emerged &lt;strong&gt;purely by reasoning about time, state, and resources&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is why first-principles thinking works across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chat systems&lt;/li&gt;
&lt;li&gt;Payments&lt;/li&gt;
&lt;li&gt;Notifications&lt;/li&gt;
&lt;li&gt;File uploads&lt;/li&gt;
&lt;li&gt;ML inference pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The One-Page Interview Checklist
&lt;/h2&gt;

&lt;p&gt;Use this on any system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What must complete before responding?&lt;/li&gt;
&lt;li&gt;What is the slowest step?&lt;/li&gt;
&lt;li&gt;What resources are held while waiting?&lt;/li&gt;
&lt;li&gt;What grows when latency increases?&lt;/li&gt;
&lt;li&gt;Can failures occur independently?&lt;/li&gt;
&lt;li&gt;What happens under partial completion?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If latency appears in #4 → &lt;strong&gt;design flaw detected&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Takeaway
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Blocking is not a performance issue.&lt;br&gt;
It is a semantic mistake about responsibility boundaries.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system incorrectly believes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I must finish long work before I can reply.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Great system designers question that belief first — &lt;em&gt;before&lt;/em&gt; proposing solutions.&lt;/p&gt;




&lt;p&gt;Use this post as a &lt;strong&gt;thinking reference&lt;/strong&gt;, not a pattern catalog.&lt;br&gt;
If you can explain these questions clearly in an interview, you are already in the top tier.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>interview</category>
      <category>systemdesign</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>🧱 The Blueprint of Success: Mastering the Technical Requirements Document (TRD)</title>
      <dc:creator>Mohammad-Idrees</dc:creator>
      <pubDate>Wed, 26 Nov 2025 17:14:13 +0000</pubDate>
      <link>https://forem.com/mohammadidrees/the-blueprint-of-success-mastering-the-technical-requirements-document-trd-306g</link>
      <guid>https://forem.com/mohammadidrees/the-blueprint-of-success-mastering-the-technical-requirements-document-trd-306g</guid>
      <description>&lt;h2&gt;
  
  
  🧱 The Blueprint of Success: Mastering the Technical Requirements Document (TRD)
&lt;/h2&gt;

&lt;p&gt;Hello Future Engineers! I’m here to talk about one of the most critical documents you'll encounter in your career: the &lt;strong&gt;Technical Requirements Document (TRD)&lt;/strong&gt;. You might also hear it called a Technical Specification Document (TSD) or System Design Document (SDD).&lt;/p&gt;

&lt;p&gt;As an Architect/Principal Engineer, I can tell you that a well-written TRD is the difference between a smooth, successful project and months of frustrating, expensive rework. Think of the TRD as the &lt;strong&gt;detailed engineering blueprint&lt;/strong&gt; that translates a designer's sketch (the PRD) into a constructible building. Your mission is to master this blueprint.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧐 Why the TRD is Your Lifeline
&lt;/h2&gt;

&lt;p&gt;A common mistake for junior engineers is jumping straight into coding. Don't do it! The TRD forces you to think deeply about the &lt;strong&gt;system&lt;/strong&gt; and its &lt;strong&gt;constraints&lt;/strong&gt; &lt;em&gt;before&lt;/em&gt; you write the first line of code.&lt;/p&gt;

&lt;p&gt;Here's why it's essential:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Clarity and Alignment:&lt;/strong&gt; It serves as the single source of truth, ensuring every developer, QA engineer, and product manager is aligned on &lt;strong&gt;exactly what&lt;/strong&gt; will be built and &lt;strong&gt;how&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Prevents Scope Creep:&lt;/strong&gt; By clearly defining what’s &lt;em&gt;in-scope&lt;/em&gt; and, critically, what’s &lt;strong&gt;out-of-scope (Non-Goals)&lt;/strong&gt;, you prevent last-minute feature additions that derail schedules.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Facilitates Code Review &amp;amp; Testing:&lt;/strong&gt; The TRD provides the &lt;strong&gt;acceptance criteria&lt;/strong&gt; and the technical context needed for QA to design test cases and for senior engineers to conduct thorough, meaningful code reviews.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Enables Tradeoff Justification:&lt;/strong&gt; It’s where you document your architectural choices and defend them against the system's needs (e.g., "We chose X database over Y because of the Z latency requirement").&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  📑 The Gold Standard TRD Structure (Section by Section)
&lt;/h2&gt;

&lt;p&gt;A robust TRD follows a logical flow, moving from high-level context to specific implementation details.&lt;/p&gt;

&lt;h3&gt;
  
  
  I. Document Context and Administration 📝
&lt;/h3&gt;

&lt;p&gt;This is the metadata that keeps the project organized.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Title &amp;amp; ID:&lt;/strong&gt; A descriptive name and a unique ID (e.g., &lt;code&gt;TRD-USER-AUTH-001&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Revision History:&lt;/strong&gt; &lt;strong&gt;Crucial.&lt;/strong&gt; Log every change, who made it, and why. This tracks the evolution of the design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Summary &amp;amp; Business Context:&lt;/strong&gt; Briefly state the problem you are solving and link to the source &lt;strong&gt;Product Requirements Document (PRD)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stakeholders &amp;amp; Approvers:&lt;/strong&gt; List the owners (Product, Engineering Lead, QA Lead) who must sign off on the design.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Goals (In-Scope):&lt;/strong&gt; State the measurable outcomes (e.g., &lt;em&gt;Implement secure user registration and login functionality.&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-Goals (Out-of-Scope):&lt;/strong&gt; Explicitly list what you are &lt;strong&gt;not&lt;/strong&gt; building (e.g., &lt;em&gt;This TRD does not include social sign-in or password recovery functionality.&lt;/em&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  II. Functional Requirements (The "What" to Build) 🔨
&lt;/h3&gt;

&lt;p&gt;This section is derived directly from the PRD, but translated into technical language.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Describe the behavior of the system. Break it down by user story or use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;System must validate the user's password against the defined complexity rules (at least 8 characters, 1 uppercase letter, 1 number).&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;The API endpoint &lt;code&gt;/api/v1/user/register&lt;/code&gt; must accept a POST request with &lt;code&gt;email&lt;/code&gt; and &lt;code&gt;password&lt;/code&gt; fields and return an HTTP 201 response on success.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
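
&lt;p&gt;A requirement written this precisely can be turned directly into a testable check. A minimal sketch, assuming "8 characters" means a minimum length of 8 and that "uppercase" and "number" follow Python's &lt;code&gt;str.isupper&lt;/code&gt; and &lt;code&gt;str.isdigit&lt;/code&gt;; the function name is hypothetical:&lt;/p&gt;

```python
MIN_LENGTH = 8  # assumption: the rule is read as "at least 8 characters"


def is_valid_password(password):
    """Check the example complexity rules: length, one uppercase letter, one digit."""
    long_enough = len(password) >= MIN_LENGTH
    has_uppercase = any(ch.isupper() for ch in password)
    has_digit = any(ch.isdigit() for ch in password)
    return long_enough and has_uppercase and has_digit


for candidate in ("Passw0rdX", "short1A", "alllowercase1"):
    print(candidate, is_valid_password(candidate))
```

&lt;p&gt;If the rule cannot be written as a check like this, the requirement is probably still too vague for QA to test.&lt;/p&gt;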

&lt;h3&gt;
  
  
  III. Non-Functional Requirements (NFRs) (The "How Well" It Must Be Built) 🌟
&lt;/h3&gt;

&lt;p&gt;These are the &lt;strong&gt;quality attributes&lt;/strong&gt; that truly define the engineering challenge. As a junior engineer, you must learn to think in NFRs, as they drive every architectural decision.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example of a Measurable Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Speed and efficiency under a given workload.&lt;/td&gt;
&lt;td&gt;API endpoint X must respond in &lt;strong&gt;under 150 milliseconds&lt;/strong&gt; for 99% of requests.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ability to handle future increases in workload.&lt;/td&gt;
&lt;td&gt;System must support a 5x traffic increase over the next year.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Protecting the system and data from unauthorized access.&lt;/td&gt;
&lt;td&gt;All sensitive data must be encrypted &lt;strong&gt;at rest&lt;/strong&gt; using AES-256.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The percentage of time the system is operational.&lt;/td&gt;
&lt;td&gt;Target &lt;strong&gt;99.99% uptime&lt;/strong&gt; (about 52.6 minutes of downtime per year).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Maintainability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ease of fixing bugs, evolving the code, and monitoring.&lt;/td&gt;
&lt;td&gt;Detailed logs must be retained for 90 days.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
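
&lt;p&gt;An availability target translates into a downtime budget by simple arithmetic. A short sketch, assuming a 365-day year (the function name is illustrative):&lt;/p&gt;

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability target."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for target in (99.0, 99.9, 99.99):
    budget = downtime_minutes_per_year(target)
    print(f"{target}% uptime allows {budget:.2f} minutes of downtime per year")
```

&lt;p&gt;For 99.99% this works out to about 52.56 minutes per year, and each additional nine cuts the budget by a factor of ten.&lt;/p&gt;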

&lt;h3&gt;
  
  
  IV. System Architecture and Design (The Blueprint) 📐
&lt;/h3&gt;

&lt;p&gt;This is the core engineering section. It details &lt;em&gt;how&lt;/em&gt; you plan to meet the NFRs and Functional Requirements.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;High-Level Architecture:&lt;/strong&gt; Use diagrams to show where the new feature sits in the overall ecosystem. &lt;/li&gt;
&lt;/ol&gt;

&lt;ol start="2"&gt;
&lt;li&gt; &lt;strong&gt;Component Design:&lt;/strong&gt; Detail the new services, modules, or libraries being created.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data Model/Schema Changes:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Show the new database tables, fields, indexes, and relationships.&lt;/li&gt;
&lt;li&gt;If using NoSQL, show the document structure.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;API Specifications:&lt;/strong&gt; Document the full contract for new or modified APIs (URL, Method, Request Body, Response Body, Error Codes).&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Technology Usage &amp;amp; Tradeoff Justification:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;This is where you earn your stripes.&lt;/strong&gt; Document the final technology choice (e.g., "We chose &lt;strong&gt;Redis&lt;/strong&gt; for session management") and &lt;strong&gt;justify it by linking it back to the NFRs&lt;/strong&gt; (e.g., "because the &lt;strong&gt;150 ms performance NFR&lt;/strong&gt; requires an in-memory cache solution for fast read times").&lt;/li&gt;
&lt;li&gt;Also, explicitly mention the &lt;strong&gt;tradeoff&lt;/strong&gt; (e.g., "The tradeoff is higher cost compared to using the main database, but this is accepted due to the critical performance need.").&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Assumptions, Constraints, &amp;amp; Dependencies:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assumptions:&lt;/strong&gt; What are you taking for granted (e.g., "We assume the networking layer is already configured with load balancing").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Constraints:&lt;/strong&gt; Strict limitations (e.g., "Must be deployed on Kubernetes only," or "Budget limited to $X per month for hosting").&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  V. Testing, Deployment, and Operations ⚙️
&lt;/h3&gt;

&lt;p&gt;Building it is only half the battle; operating it is the other half.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acceptance Criteria (AC):&lt;/strong&gt; These are the final, testable conditions for each major requirement. &lt;strong&gt;Example:&lt;/strong&gt; &lt;em&gt;AC for successful registration: A new user record exists in the database with a hashed password, and an email confirmation event is queued.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Testing Strategy:&lt;/strong&gt; High-level plan (e.g., unit tests must cover 80% of the new service; performance tests must validate the 150 ms latency NFR).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring &amp;amp; Alerting:&lt;/strong&gt; How will you know if the feature is broken in production? What are the key metrics and who gets paged?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment &amp;amp; Rollback Plan:&lt;/strong&gt; A step-by-step release process (e.g., using blue/green deployment, feature flags) and the specific, tested steps for &lt;strong&gt;quickly reverting&lt;/strong&gt; the change if an issue occurs.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚀 Suggested Mind Map: Drafting a Simple, Effective TRD
&lt;/h2&gt;

&lt;p&gt;When facing a new feature, use this four-step mind map to quickly structure your thinking and draft a solid TRD. Start with the "What" and work your way to the "How to Run It."&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Start with the PRD (The "WHAT"):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Translate the PRD into &lt;strong&gt;Functional Requirements&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Define &lt;strong&gt;Acceptance Criteria&lt;/strong&gt; for each requirement.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Define the Quality (The "HOW WELL"):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Determine the &lt;strong&gt;Non-Functional Requirements (NFRs)&lt;/strong&gt;: Performance, Security, Scale, Reliability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Design the Solution (The "HOW"):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Select the &lt;strong&gt;Technology Stack&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Justify Tradeoffs&lt;/strong&gt; (e.g., Postgres vs. Mongo → which NFR does it satisfy?).&lt;/li&gt;
&lt;li&gt;Diagram the &lt;strong&gt;Architecture&lt;/strong&gt; and &lt;strong&gt;Data Model&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Operationalize (The "HOW TO RUN IT"):&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Define &lt;strong&gt;Testing Strategy&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Determine &lt;strong&gt;Monitoring/Alerts&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Document the &lt;strong&gt;Deployment/Rollback&lt;/strong&gt; steps.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>architecture</category>
      <category>career</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
