<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aisalkyn Aidarova</title>
    <description>The latest articles on Forem by Aisalkyn Aidarova (@jumptotech).</description>
    <link>https://forem.com/jumptotech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3549986%2F206d9086-7051-445f-8608-9cfccb349e11.png</url>
      <title>Forem: Aisalkyn Aidarova</title>
      <link>https://forem.com/jumptotech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jumptotech"/>
    <language>en</language>
    <item>
      <title>THE MOST IMPORTANT CONCEPT: MEASURING RELIABILITY: SLO, SLA, SLI</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 07 May 2026 13:55:56 +0000</pubDate>
      <link>https://forem.com/jumptotech/the-most-important-concept-measuring-reliability-slo-sla-sli-n0b</link>
      <guid>https://forem.com/jumptotech/the-most-important-concept-measuring-reliability-slo-sla-sli-n0b</guid>
      <description>&lt;p&gt;Site Reliability Engineering is not just monitoring or fixing servers.&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Applying software engineering principles to operations to make systems reliable at scale.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You don’t manually fix things → you &lt;strong&gt;automate&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You don’t guess → you &lt;strong&gt;measure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You don’t react → you &lt;strong&gt;design for failure&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Core mindset
&lt;/h2&gt;

&lt;p&gt;A normal engineer asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Is the system working?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An SRE asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How well is it working, how often does it fail, and how much failure is acceptable?”&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Before SRE existed, companies said:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System should be reliable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That means nothing.&lt;/p&gt;

&lt;p&gt;SRE changed that to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reliability must be measurable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where &lt;strong&gt;SLI, SLO, SLA&lt;/strong&gt; come in.&lt;/p&gt;




&lt;h1&gt;
  
  
  🧠 PART 3 — SLI (SERVICE LEVEL INDICATOR)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What it really is
&lt;/h2&gt;

&lt;p&gt;An SLI is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A real measurement of user experience&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not system metrics like CPU — but &lt;strong&gt;user-facing metrics&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU = 70%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We measure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Request success rate
Request latency
Error rate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real example
&lt;/h2&gt;

&lt;p&gt;Imagine your API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1000 requests&lt;/li&gt;
&lt;li&gt;990 succeed&lt;/li&gt;
&lt;li&gt;10 fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your SLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Success rate = 99%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
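&lt;p&gt;The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation; the function name &lt;code&gt;success_rate&lt;/code&gt; is hypothetical and the request counts are the ones from the example.&lt;/p&gt;

```python
def success_rate(total_requests, failed_requests):
    """Return the share of successful requests as a percentage."""
    if total_requests == 0:
        # With no traffic there is nothing to measure.
        raise ValueError("no requests observed, SLI is undefined")
    succeeded = total_requests - failed_requests
    return 100.0 * succeeded / total_requests


print(success_rate(1000, 10))  # 99.0
```

&lt;p&gt;In practice the two counts would come from request logs or a metrics system rather than literals.&lt;/p&gt;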






&lt;h2&gt;
  
  
  Important rule
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLI must reflect USER experience
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If users are unhappy while the SLI looks healthy, the SLI is measuring the wrong thing.&lt;/p&gt;




&lt;h1&gt;
  
  
  🧠 PART 4 — SLO (SERVICE LEVEL OBJECTIVE)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What it really is
&lt;/h2&gt;

&lt;p&gt;SLO is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A target value you set for an SLI, measured over a defined time window&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;You define:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.9% of requests must succeed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is your SLO.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why SLO exists
&lt;/h2&gt;

&lt;p&gt;Because perfection is impossible.&lt;/p&gt;

&lt;p&gt;So instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System must never fail ❌
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System can fail within limits ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Another example
&lt;/h2&gt;

&lt;p&gt;Latency SLO:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of requests &amp;lt; 200ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Key idea
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SLO defines acceptable reliability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 5 — SLA (SERVICE LEVEL AGREEMENT)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What it really is
&lt;/h2&gt;

&lt;p&gt;SLA is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Business contract based on SLO&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If uptime &amp;lt; 99.9% → customer gets refund
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Important difference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SLI&lt;/td&gt;
&lt;td&gt;measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLO&lt;/td&gt;
&lt;td&gt;internal goal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLA&lt;/td&gt;
&lt;td&gt;external contract&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  🧠 PART 6 — ERROR BUDGET (THIS IS SENIOR LEVEL)
&lt;/h1&gt;

&lt;p&gt;This is the most important concept in SRE.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it is
&lt;/h2&gt;

&lt;p&gt;If your SLO is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99.9% uptime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.1% failure is allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is your &lt;strong&gt;error budget&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  In real time
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;~43 minutes downtime per month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
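&lt;p&gt;The 43-minute figure follows directly from the SLO. A minimal sketch, assuming a 30-day month; the function name &lt;code&gt;error_budget_minutes&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of downtime allowed by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60  # 30 days = 43,200 minutes
    return window_minutes * (100.0 - slo_percent) / 100.0


print(round(error_budget_minutes(99.9), 1))  # 43.2
```

&lt;p&gt;A longer or shorter window changes the budget proportionally, so the window should be stated alongside the SLO.&lt;/p&gt;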






&lt;h2&gt;
  
  
  Why it matters
&lt;/h2&gt;

&lt;p&gt;It creates balance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developers → want speed
SRE → want stability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Error budget decides:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If budget remains → deploy
If exhausted → stop releases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
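&lt;p&gt;The deploy decision can be expressed as a simple gate. This is a sketch under stated assumptions: the budget is counted in failed requests rather than minutes, and the names &lt;code&gt;remaining_budget&lt;/code&gt; and &lt;code&gt;can_deploy&lt;/code&gt; are hypothetical, not part of any tool.&lt;/p&gt;

```python
def remaining_budget(slo_percent, total_requests, failed_requests):
    """Failed requests still allowed before the SLO is violated."""
    allowed_failures = total_requests * (100.0 - slo_percent) / 100.0
    return allowed_failures - failed_requests


def can_deploy(slo_percent, total_requests, failed_requests):
    """Allow releases only while some error budget remains."""
    return remaining_budget(slo_percent, total_requests, failed_requests) > 0


print(can_deploy(99.9, 1_000_000, 400))    # True
print(can_deploy(99.9, 1_000_000, 1500))   # False
```

&lt;p&gt;A real policy would usually add nuance, for example still allowing reliability fixes when the budget is exhausted.&lt;/p&gt;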






&lt;h2&gt;
  
  
  Real rule
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No error budget = no deployments
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 7 — HOW WE MEASURE AVAILABILITY
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Formula
&lt;/h2&gt;

&lt;p&gt;Availability =&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(Total time - downtime) / total time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;30 days = 720 hours&lt;br&gt;
Downtime = 2 hours&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(720 - 2) / 720 = 99.72%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
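&lt;p&gt;The same calculation as a small Python function, using the numbers from the example. The function name &lt;code&gt;availability_percent&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def availability_percent(total_hours, downtime_hours):
    """Availability = (total time - downtime) / total time, as a percentage."""
    return 100.0 * (total_hours - downtime_hours) / total_hours


print(round(availability_percent(720, 2), 2))  # 99.72
```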






&lt;h2&gt;
  
  
  SRE levels
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;basic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;production&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.99%&lt;/td&gt;
&lt;td&gt;critical&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.999%&lt;/td&gt;
&lt;td&gt;extreme&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  🧠 PART 8 — LATENCY (WHY AVERAGE IS WRONG)
&lt;/h1&gt;

&lt;p&gt;Average lies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;99 requests = 100ms
1 request = 10 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The average is about 199 ms, which looks fine, yet one user waited 10 seconds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Solution
&lt;/h2&gt;

&lt;p&gt;Use percentiles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P50 → the typical (median) request&lt;/li&gt;
&lt;li&gt;P95 → the slowest 5% of requests start here&lt;/li&gt;
&lt;li&gt;P99 → the slowest 1% of requests start here&lt;/li&gt;
&lt;/ul&gt;
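&lt;p&gt;A minimal sketch of how percentiles expose a slow tail that the average hides. The data set is hypothetical (980 fast requests and 20 slow ones), and &lt;code&gt;percentile&lt;/code&gt; uses the simple nearest-rank method; monitoring systems often use interpolation or histogram buckets instead.&lt;/p&gt;

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    # Integer ceiling of p% of the sample count, at least 1.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 2% of requests are very slow.
latencies_ms = [100] * 980 + [5000] * 20

print(sum(latencies_ms) / len(latencies_ms))  # 198.0
print(percentile(latencies_ms, 50))           # 100
print(percentile(latencies_ms, 95))           # 100
print(percentile(latencies_ms, 99))           # 5000
```

&lt;p&gt;The mean and the median both look healthy here; only the P99 value shows that some users wait five seconds.&lt;/p&gt;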




&lt;h2&gt;
  
  
  Real SLO
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;95% of requests &amp;lt; 200ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 9 — MONITORING (WHAT SRE ACTUALLY WATCHES)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Golden Signals (Google SRE)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Traffic&lt;/li&gt;
&lt;li&gt;Errors&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;

&lt;p&gt;You monitor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;How fast?
How many?
How broken?
How loaded?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus&lt;/li&gt;
&lt;li&gt;Grafana&lt;/li&gt;
&lt;li&gt;CloudWatch&lt;/li&gt;
&lt;li&gt;ELK&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 PART 10 — ALERTING (VERY IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;Bad alert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CPU &amp;gt; 80%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good alert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error rate &amp;gt; 5% for 5 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
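&lt;p&gt;The "for 5 minutes" part matters as much as the threshold: it keeps a single bad sample from paging anyone. A minimal sketch, assuming one error-rate sample per minute; the function name &lt;code&gt;should_alert&lt;/code&gt; and its parameters are hypothetical.&lt;/p&gt;

```python
def should_alert(error_rates, threshold=0.05, window=5):
    """Fire only if every one of the last `window` samples exceeds the threshold.

    error_rates holds one sample per minute, oldest first, as fractions (0.05 = 5%).
    """
    recent = error_rates[-window:]
    return len(recent) == window and all(rate > threshold for rate in recent)


print(should_alert([0.01, 0.06, 0.07, 0.08, 0.09, 0.10]))  # True
print(should_alert([0.06, 0.06, 0.01, 0.06, 0.06]))        # False
```

&lt;p&gt;The second call returns False because the error rate dipped below the threshold inside the window, so the condition was not sustained.&lt;/p&gt;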






&lt;h2&gt;
  
  
  Rule
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert only when users are impacted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 11 — INCIDENT MANAGEMENT
&lt;/h1&gt;

&lt;p&gt;Incident = system failure affecting users&lt;/p&gt;




&lt;h2&gt;
  
  
  SRE process
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Detect&lt;/li&gt;
&lt;li&gt;Respond&lt;/li&gt;
&lt;li&gt;Fix&lt;/li&gt;
&lt;li&gt;Learn&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Postmortem
&lt;/h2&gt;

&lt;p&gt;Must be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Blameless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  You document
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;timeline&lt;/li&gt;
&lt;li&gt;root cause&lt;/li&gt;
&lt;li&gt;impact&lt;/li&gt;
&lt;li&gt;fix&lt;/li&gt;
&lt;li&gt;prevention&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 PART 12 — RELIABILITY ENGINEERING
&lt;/h1&gt;

&lt;p&gt;You design systems that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Expect failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;Instead of 1 server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB → multiple EC2 → DB replicas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Goal
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No single point of failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 13 — SCALING
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Vertical
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bigger machine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Horizontal
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;more machines
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE prefers
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Horizontal scaling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🧠 PART 14 — NETWORKING (WHAT YOU DID)
&lt;/h1&gt;

&lt;p&gt;You must understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC&lt;/li&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;NAT vs IGW&lt;/li&gt;
&lt;li&gt;TGW&lt;/li&gt;
&lt;li&gt;PrivateLink&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 PART 15 — AUTOMATION
&lt;/h1&gt;

&lt;p&gt;Rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If you repeat it → automate it
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Tools
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Terraform&lt;/li&gt;
&lt;li&gt;Bash&lt;/li&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 PART 16 — CI/CD
&lt;/h1&gt;

&lt;p&gt;You must know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pipelines&lt;/li&gt;
&lt;li&gt;deployments&lt;/li&gt;
&lt;li&gt;rollback&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Strategies
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;rolling&lt;/li&gt;
&lt;li&gt;blue/green&lt;/li&gt;
&lt;li&gt;canary&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧠 FINAL UNDERSTANDING
&lt;/h1&gt;

&lt;p&gt;SRE is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Measure → Define → Monitor → Improve → Automate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  💬 PERFECT INTERVIEW ANSWER
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;SRE focuses on maintaining system reliability by defining measurable objectives like SLOs, monitoring system health, managing incidents, and automating infrastructure while balancing system stability with development velocity.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>What is AWS PrivateLink?</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 07 May 2026 13:49:18 +0000</pubDate>
      <link>https://forem.com/jumptotech/what-is-aws-privatelink-2mgo</link>
      <guid>https://forem.com/jumptotech/what-is-aws-privatelink-2mgo</guid>
      <description>&lt;p&gt;&lt;strong&gt;AWS PrivateLink&lt;/strong&gt; lets you access AWS services or your own services hosted in another VPC &lt;em&gt;privately&lt;/em&gt; — traffic never leaves the AWS network, never touches the internet, and the two VPCs don't need to be peered or connected via Transit Gateway.&lt;/p&gt;

&lt;p&gt;The core idea: instead of exposing a service publicly or opening up full VPC-to-VPC networking, PrivateLink creates a &lt;strong&gt;one-way, private endpoint&lt;/strong&gt;. The consumer VPC gets a private IP in its own subnet that tunnels traffic to the provider service. That's it. No route tables to manage, no CIDR conflicts to worry about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 1: How PrivateLink works.&lt;/strong&gt; The core mechanism is an Interface Endpoint (an ENI in your subnet) that maps to the provider's service via AWS's internal network.&lt;/p&gt;

&lt;p&gt;The consumer VPC creates an &lt;strong&gt;Interface Endpoint&lt;/strong&gt;, which is just a private IP (ENI) sitting in its own subnet. DNS resolves the service name to that private IP. Traffic flows through AWS's internal backbone to the provider's Network Load Balancer, then to the actual service. The two VPCs never need to peer, share route tables, or even know each other's CIDR ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 2: The three ways PrivateLink is used&lt;/strong&gt; in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Uses of PrivateLink
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Accessing AWS-managed services privately.&lt;/strong&gt; Services like S3, SQS, ECR, Secrets Manager, KMS, and 100+ others support PrivateLink. Instead of your EC2 or Lambda hitting &lt;code&gt;s3.amazonaws.com&lt;/code&gt; over the internet, an Interface Endpoint gives the service a private IP inside your subnet. For S3 and DynamoDB specifically, there is also a free Gateway Endpoint. It solves the same problem but is not PrivateLink: it adds a route-table entry toward the service instead of placing an ENI in your subnet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Consuming a partner SaaS service.&lt;/strong&gt; Vendors like Datadog, Splunk, Snowflake, and many others publish themselves as PrivateLink Endpoint Services. You create an Interface Endpoint in your VPC pointing to their service name, and your traffic to them never touches the internet. The vendor's VPC and your VPC never peer — they can't see your network at all, only receive the specific calls you make.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Publishing your own internal service.&lt;/strong&gt; You place a Network Load Balancer in front of your service, register it as an Endpoint Service, then whitelist which AWS accounts can connect. Other teams or customers create Interface Endpoints in their own VPCs pointing at yours. This is how internal platform teams build shared services — auth, payments, data APIs — without opening full VPC-to-VPC access.&lt;/p&gt;




&lt;h2&gt;
  
  
  PrivateLink vs the Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PrivateLink&lt;/th&gt;
&lt;th&gt;VPC Peering&lt;/th&gt;
&lt;th&gt;Transit Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic path&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIDR conflicts&lt;/td&gt;
&lt;td&gt;No problem&lt;/td&gt;
&lt;td&gt;Breaks everything&lt;/td&gt;
&lt;td&gt;Overlapping VPCs cannot route to each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access scope&lt;/td&gt;
&lt;td&gt;Single service only&lt;/td&gt;
&lt;td&gt;Full VPC-to-VPC&lt;/td&gt;
&lt;td&gt;All attached VPCs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;One-way (consumer → provider)&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-region&lt;/td&gt;
&lt;td&gt;Yes (via interface EP)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (TGW peering)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Exposing a specific service privately&lt;/td&gt;
&lt;td&gt;Small number of VPCs needing full access&lt;/td&gt;
&lt;td&gt;Large-scale hub-and-spoke networking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: PrivateLink is surgical. VPC Peering and Transit Gateway open up &lt;em&gt;networking&lt;/em&gt; — entire VPCs can talk to each other. PrivateLink exposes only &lt;em&gt;one service&lt;/em&gt; through a single endpoint. If you just want your app to call an internal payments API without routing everything through a shared network, PrivateLink is exactly the right tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS Transit Gateway?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Transit Gateway (TGW)&lt;/strong&gt; is a central network hub that connects multiple VPCs, on-premises networks, and AWS accounts together — like a cloud router that everything plugs into.&lt;/p&gt;

&lt;p&gt;Think of it this way: without Transit Gateway, if you have 5 VPCs that all need to talk to each other, you'd need a mesh of VPC peering connections. With Transit Gateway, every VPC just connects to &lt;em&gt;one&lt;/em&gt; hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 1: The problem Transit Gateway solves.&lt;/strong&gt; Without it, VPCs connecting to each other require a full mesh of peering connections that grows unmanageable fast.&lt;/p&gt;

&lt;p&gt;The key pain point with VPC peering: it is &lt;em&gt;non-transitive&lt;/em&gt;. If VPC A peers with VPC B, and VPC B peers with VPC C, VPC A still cannot talk to VPC C. You'd need a direct peering for every pair. With 10 VPCs that's 45 connections to manage. Transit Gateway removes the mesh: everything connects through one hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 2: What Transit Gateway connects.&lt;/strong&gt; It's not just VPC-to-VPC. It acts as the central router for your entire network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Need Transit Gateway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Scale without the mesh chaos.&lt;/strong&gt; VPC peering connections grow as N×(N-1)/2 — 4 VPCs need 6 connections, 10 VPCs need 45. TGW keeps it at N connections regardless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Transitivity.&lt;/strong&gt; VPC peering is not transitive — traffic can't hop through an intermediate VPC. TGW routes traffic across all attached networks as a proper router would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Centralized on-premises connectivity.&lt;/strong&gt; Without TGW, each VPC needs its own VPN tunnel to your data center. With TGW, one VPN attachment serves all VPCs attached to the gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Traffic isolation via route tables.&lt;/strong&gt; The TGW has its own route tables. You can define that your Dev VPC can only reach other Dev VPCs and a shared-services VPC, while a security VPC sees everything for traffic inspection. This is impossible with VPC peering alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Cross-account and cross-region.&lt;/strong&gt; TGW works with AWS Resource Access Manager to share the gateway across multiple accounts. TGW peering connects gateways across regions.&lt;/p&gt;
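&lt;p&gt;The connection counts quoted in reason 1 follow from the pairwise formula N×(N-1)/2. A minimal sketch to check them; the function names are hypothetical and the model ignores anything beyond one attachment per VPC.&lt;/p&gt;

```python
def peering_connections(vpc_count):
    """Full-mesh VPC peering needs one connection per pair of VPCs."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """With a Transit Gateway, each VPC needs a single attachment to the hub."""
    return vpc_count


for n in (4, 10, 50):
    print(n, peering_connections(n), tgw_attachments(n))
# 4 6 4
# 10 45 10
# 50 1225 50
```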




&lt;h2&gt;
  
  
  TGW vs VPC Peering — When to Use Which
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2–3 VPCs that need to talk&lt;/td&gt;
&lt;td&gt;VPC Peering (simpler, cheaper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4+ VPCs, especially growing&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account, cross-region networking&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises + multiple VPCs&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need centralized firewall/inspection&lt;/td&gt;
&lt;td&gt;Transit Gateway (route all traffic through security VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Just two VPCs, same account&lt;/td&gt;
&lt;td&gt;VPC Peering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core mental model: &lt;strong&gt;VPC Peering is a direct cable between two VPCs. Transit Gateway is a router that every VPC plugs into.&lt;/strong&gt; Once you have more than a handful of VPCs, the router approach wins every time.&lt;/p&gt;

&lt;p&gt;PrivateLink, VPC Peering, and Transit Gateway are easy to confuse because all three keep traffic on the AWS backbone. The real difference is &lt;em&gt;what problem each one solves&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one question that drives the choice:&lt;/strong&gt; what scope of access do you need?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Mental Model
&lt;/h2&gt;

&lt;p&gt;Think of it in terms of &lt;strong&gt;scope of access&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PrivateLink&lt;/strong&gt; = expose &lt;em&gt;one service&lt;/em&gt; to another VPC. The consumer gets a single private IP endpoint — nothing else. They cannot reach any other resource in your VPC. This is surgical, zero-trust access. Use it when two VPCs don't need to talk to each other broadly — they just need one service to be reachable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Peering&lt;/strong&gt; = full network access between exactly &lt;em&gt;two VPCs&lt;/em&gt;. Both VPCs can reach any resource in the other (subject to security groups and NACLs). Simple to set up, no cost for the connection itself, but does not scale — every new VPC pair needs its own peering, and it's non-transitive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transit Gateway&lt;/strong&gt; = full network access between &lt;em&gt;many VPCs, accounts, and on-premises networks&lt;/em&gt;, all routed through one hub. More complex and costs money per attachment + data processed, but scales linearly and supports centralized routing policies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1 — Startup with 2 VPCs (prod + dev)&lt;/strong&gt;&lt;br&gt;
Dev team occasionally needs to pull from a shared database in prod. → Use &lt;strong&gt;VPC Peering&lt;/strong&gt;. Two VPCs, simple, done in minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2 — Company with 8 VPCs across 3 AWS accounts&lt;/strong&gt;&lt;br&gt;
Networking team needs all VPCs to reach a shared DNS resolver and a centralized logging service, plus a VPN back to the data center. → Use &lt;strong&gt;Transit Gateway&lt;/strong&gt;. One VPN connection shared by all, one hub to manage routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3 — Platform team building an internal payments API&lt;/strong&gt;&lt;br&gt;
Other teams' VPCs need to call the payments API — but the platform team doesn't want those VPCs to have any other access into their network. → Use &lt;strong&gt;PrivateLink&lt;/strong&gt;. Expose only the API endpoint. Consumer VPCs get a single private IP, nothing more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4 — VPC Peering is impossible (overlapping CIDRs)&lt;/strong&gt;&lt;br&gt;
Two VPCs both use &lt;code&gt;10.0.0.0/16&lt;/code&gt;, so peering is blocked. But one VPC needs to call a service in the other. → Use &lt;strong&gt;PrivateLink&lt;/strong&gt;. CIDR conflicts don't matter because no routes are exchanged: the consumer only talks to an endpoint IP in its own subnet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 5 — Lambda in a private VPC needs to write to S3&lt;/strong&gt;&lt;br&gt;
Lambda is in a VPC with no internet access. You need to reach S3 without adding a NAT Gateway. → Use a &lt;strong&gt;VPC endpoint&lt;/strong&gt;. A Gateway Endpoint for S3 is free; an Interface Endpoint, which is the PrivateLink option, also works but is billed. Traffic stays entirely within AWS.&lt;/p&gt;
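&lt;p&gt;Scenario 4 depends on knowing whether two CIDR ranges overlap. A minimal sketch using Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module; the helper name &lt;code&gt;cidrs_overlap&lt;/code&gt; is hypothetical, and the second range is an assumed example rather than one from the scenarios.&lt;/p&gt;

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share any addresses."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(cidrs_overlap("10.0.0.0/16", "10.0.0.0/16"))  # True
print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))  # False
```

&lt;p&gt;A check like this can run before requesting a peering connection, so an overlap is caught early and PrivateLink can be considered instead.&lt;/p&gt;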




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PrivateLink&lt;/th&gt;
&lt;th&gt;VPC Peering&lt;/th&gt;
&lt;th&gt;Transit Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access scope&lt;/td&gt;
&lt;td&gt;Single service&lt;/td&gt;
&lt;td&gt;Whole VPC&lt;/td&gt;
&lt;td&gt;Whole network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIDR conflicts&lt;/td&gt;
&lt;td&gt;No issue&lt;/td&gt;
&lt;td&gt;Breaks it&lt;/td&gt;
&lt;td&gt;Overlapping VPCs cannot route&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;Unlimited consumers&lt;/td&gt;
&lt;td&gt;Up to ~125 peers&lt;/td&gt;
&lt;td&gt;Thousands of VPCs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (VPN/DX)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Per endpoint + data&lt;/td&gt;
&lt;td&gt;Data transfer only&lt;/td&gt;
&lt;td&gt;Per attachment + data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;One-way&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern most large companies end up with: &lt;strong&gt;Transit Gateway as the backbone&lt;/strong&gt; for VPC-to-VPC and on-premises connectivity, with &lt;strong&gt;PrivateLink layered on top&lt;/strong&gt; for exposing specific internal services securely to teams or customers who shouldn't have broad network access.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Cloud Service Models — Full SRE Lecture: IaaS, PaaS, SaaS</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 07 May 2026 13:47:09 +0000</pubDate>
      <link>https://forem.com/jumptotech/cloud-service-models-full-sre-lecture-iaas-paas-saas-44do</link>
      <guid>https://forem.com/jumptotech/cloud-service-models-full-sre-lecture-iaas-paas-saas-44do</guid>
      <description>

&lt;h2&gt;
  
  
  🌐 The Big Picture First
&lt;/h2&gt;

&lt;p&gt;Think of cloud service models as a &lt;strong&gt;spectrum of responsibility&lt;/strong&gt;. The more you move right, the less you manage — but also the less control you have.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;YOUR RESPONSIBILITY
◄────────────────────────────────────────────►
Maximum                                Minimum

On-Premises → IaaS → PaaS → SaaS → Serverless
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A useful analogy is &lt;strong&gt;pizza&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On-Premises  = Make pizza at home (you own everything)
IaaS         = Order dough &amp;amp; ingredients (you cook it)
PaaS         = Order pizza kit (just assemble &amp;amp; bake)
SaaS         = Order delivery (just eat it)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🏗️ Layer 1 — IaaS (Infrastructure as a Service)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;You rent raw infrastructure — servers, storage, networking — from a cloud provider. The provider manages the physical hardware. You manage &lt;strong&gt;everything above the hypervisor.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Responsibility Split
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provider manages:        You manage:
─────────────────        ───────────────────────
Physical servers    →    Operating System
Data centers        →    Runtime &amp;amp; middleware
Networking HW       →    Applications
Hypervisor          →    Data
Storage HW          →    Security patches
                    →    Scaling
                    →    Backups
                    →    Monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real Examples
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;IaaS Products&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;EC2, EBS, VPC, S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;Compute Engine, Cloud Storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Virtual Machines, Azure Blob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  IaaS Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Lift &amp;amp; shift migrations from on-prem&lt;/li&gt;
&lt;li&gt;Custom OS configurations needed&lt;/li&gt;
&lt;li&gt;High performance computing (HPC)&lt;/li&gt;
&lt;li&gt;Full control over networking required&lt;/li&gt;
&lt;li&gt;Legacy applications that can't be containerized&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  IaaS Code Example — Terraform EC2
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You provision and OWN this — classic IaaS&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ami-0c55b159cbfafe1f0"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;

  &lt;span class="c1"&gt;# YOU are responsible for everything inside this machine&lt;/span&gt;
  &lt;span class="nx"&gt;user_data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
    #!/bin/bash
    apt-get update
    apt-get install -y nginx
    systemctl start nginx
&lt;/span&gt;&lt;span class="no"&gt;  EOF

&lt;/span&gt;  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"company-web-server"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Layer 2 — PaaS (Platform as a Service)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;The provider manages OS, runtime, middleware, and scaling. You focus purely on &lt;strong&gt;writing and deploying your application code.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Responsibility Split
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provider manages:        You manage:
─────────────────        ───────────────────────
Physical servers    →    Application code
Data centers        →    Data
Hypervisor          →    User access
Operating System    →    Configurations
Runtime             →    (sometimes) scaling rules
Middleware          →
Patching            →
Scaling infra       →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real Examples
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;PaaS Products&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AWS&lt;/td&gt;
&lt;td&gt;Elastic Beanstalk, RDS, Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;App Engine, Cloud Run, Cloud SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Azure&lt;/td&gt;
&lt;td&gt;Azure App Service, Azure SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Others&lt;/td&gt;
&lt;td&gt;Heroku, Render, Railway&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  PaaS Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Startups moving fast without dedicated DevOps&lt;/li&gt;
&lt;li&gt;Managed databases (RDS handles patching, backups)&lt;/li&gt;
&lt;li&gt;Web apps where you don't care about OS&lt;/li&gt;
&lt;li&gt;Rapid prototyping&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PaaS Code Example — AWS Elastic Beanstalk
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You just push code — platform handles the rest&lt;/span&gt;
&lt;span class="c1"&gt;# .ebextensions/app.config&lt;/span&gt;

&lt;span class="na"&gt;option_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;aws:autoscaling:asg&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;MinSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;MaxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;aws:elasticbeanstalk:environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;EnvironmentType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalanced&lt;/span&gt;
  &lt;span class="na"&gt;aws:ec2:instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;InstanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;t3.medium&lt;/span&gt;

&lt;span class="c1"&gt;# No manual OS setup or nginx config&lt;/span&gt;
&lt;span class="c1"&gt;# Patching is handled once managed platform updates are enabled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  💻 Layer 3 — SaaS (Software as a Service)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it is
&lt;/h3&gt;

&lt;p&gt;A fully managed application delivered over the internet. You don't manage infrastructure, OS, runtime, or the app itself. You just &lt;strong&gt;use it.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Responsibility Split
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Provider manages:        You manage:
─────────────────        ───────────────────────
Everything          →    Your data
Infrastructure      →    User access/permissions
OS &amp;amp; runtime        →    Configurations within app
Application         →    Integrations
Updates &amp;amp; patches   →
Scaling             →
Security            →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Real Examples
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;SaaS Products&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Datadog, New Relic, PagerDuty&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication&lt;/td&gt;
&lt;td&gt;Slack, Gmail, Zoom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CI/CD&lt;/td&gt;
&lt;td&gt;GitHub Actions, CircleCI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Okta, CrowdStrike&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Dropbox, Google Drive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  SaaS from SRE perspective
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You don't manage the app BUT you must manage:
✅ API integrations with your systems
✅ SSO/SAML configuration
✅ Data retention policies
✅ Vendor SLA monitoring
✅ Cost &amp;amp; license management
✅ Data backup (vendor may not guarantee YOUR data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🔄 Shared Responsibility Model — Deep Dive
&lt;/h2&gt;

&lt;p&gt;This is &lt;strong&gt;critical for SRE engineers&lt;/strong&gt; to understand deeply.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ON-PREM   IaaS    PaaS    SaaS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Applications          YOU      YOU     YOU    VENDOR
Data                  YOU      YOU     YOU     YOU ⚠️
Runtime               YOU      YOU    VENDOR  VENDOR
Middleware            YOU      YOU    VENDOR  VENDOR
OS                    YOU      YOU    VENDOR  VENDOR
Virtualization        YOU     VENDOR  VENDOR  VENDOR
Servers               YOU     VENDOR  VENDOR  VENDOR
Storage               YOU     VENDOR  VENDOR  VENDOR
Networking            YOU     VENDOR  VENDOR  VENDOR
Data Center           YOU     VENDOR  VENDOR  VENDOR
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;⚠️ &lt;strong&gt;Data is always YOUR responsibility&lt;/strong&gt;, regardless of model. Even in SaaS, if the vendor loses your data, recovering from that loss is still your operational problem.&lt;/p&gt;
&lt;/blockquote&gt;
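
&lt;p&gt;The matrix above can also be written as data, which is handy when you want to list what your team still operates under each model. This is a minimal sketch: the names &lt;code&gt;VENDOR_FROM&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;, and &lt;code&gt;your_layers&lt;/code&gt; are made up for illustration, and the rows are transcribed from the table rather than taken from any provider's official wording.&lt;/p&gt;

```python
# Service models ordered left to right, as in the matrix above.
MODELS = ("on_prem", "iaas", "paas", "saas")

# First model at which the vendor takes over each layer.
# None means the layer never moves to the vendor.
VENDOR_FROM = {
    "applications": "saas",
    "data": None,
    "runtime": "paas",
    "middleware": "paas",
    "os": "paas",
    "virtualization": "iaas",
    "servers": "iaas",
    "storage": "iaas",
    "networking": "iaas",
    "data_center": "iaas",
}


def owner(layer: str, model: str) -> str:
    """Return "YOU" or "VENDOR" for a layer under a given service model."""
    handover = VENDOR_FROM[layer]
    if handover is None:
        return "YOU"
    return "VENDOR" if MODELS.index(model) >= MODELS.index(handover) else "YOU"


def your_layers(model: str) -> list[str]:
    """List every layer you still operate under the given model."""
    return [layer for layer in VENDOR_FROM if owner(layer, model) == "YOU"]


print(your_layers("paas"))  # ['applications', 'data']
```

&lt;p&gt;Whatever model you pass in, &lt;code&gt;data&lt;/code&gt; always comes back as yours, which is the point of the warning above.&lt;/p&gt;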




&lt;h2&gt;
  
  
  🧠 What an SRE Must Know About Each Model
&lt;/h2&gt;




&lt;h3&gt;
  
  
  IaaS — SRE Responsibilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. OS Hardening &amp;amp; Patching&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# You own this with IaaS&lt;/span&gt;
&lt;span class="c"&gt;# Automated patching with SSM&lt;/span&gt;
aws ssm send-command &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--document-name&lt;/span&gt; &lt;span class="s2"&gt;"AWS-RunPatchBaseline"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--targets&lt;/span&gt; &lt;span class="s2"&gt;"Key=tag:Environment,Values=production"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--parameters&lt;/span&gt; &lt;span class="s1"&gt;'{"Operation":["Install"]}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Auto Scaling &amp;amp; Self Healing&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_autoscaling_group"&lt;/span&gt; &lt;span class="s2"&gt;"web"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;min_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="nx"&gt;max_size&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
  &lt;span class="nx"&gt;desired_capacity&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

  &lt;span class="nx"&gt;health_check_type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ELB"&lt;/span&gt;
  &lt;span class="nx"&gt;health_check_grace_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;

  &lt;span class="c1"&gt;# Self-healing comes from the ELB health check above; instance_refresh rolls out launch template changes&lt;/span&gt;
  &lt;span class="nx"&gt;instance_refresh&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Rolling"&lt;/span&gt;
    &lt;span class="nx"&gt;preferences&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;min_healthy_percentage&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Monitoring You Must Set Up Yourself&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus scrape config for IaaS EC2&lt;/span&gt;
&lt;span class="na"&gt;scrape_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ec2-instances'&lt;/span&gt;
    &lt;span class="na"&gt;ec2_sd_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9100&lt;/span&gt;  &lt;span class="c1"&gt;# node_exporter port&lt;/span&gt;
    &lt;span class="na"&gt;relabel_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;source_labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;__meta_ec2_tag_Environment&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;target_label&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;environment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Backup Strategy&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# EBS snapshots — YOUR responsibility in IaaS&lt;/span&gt;
aws ec2 create-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--volume-id&lt;/span&gt; vol-xxxxxxxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--description&lt;/span&gt; &lt;span class="s2"&gt;"Daily backup &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y-%m-%d&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
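
&lt;p&gt;Taking snapshots is only half of a backup strategy; you also need a retention rule so old snapshots do not pile up. The sketch below shows only the selection logic, without calling AWS. The 7-day window and the function name &lt;code&gt;snapshots_to_delete&lt;/code&gt; are assumptions chosen for illustration.&lt;/p&gt;

```python
from datetime import date, timedelta


def snapshots_to_delete(
    snapshot_dates: list[date], today: date, keep_days: int = 7
) -> list[date]:
    """Return snapshot dates older than the retention window, oldest first.

    The newest snapshot is always kept, even if it is older than the window,
    so a stalled backup job never leaves you with zero restore points.
    """
    if not snapshot_dates:
        return []
    cutoff = today - timedelta(days=keep_days)
    newest = max(snapshot_dates)
    expired = [d for d in snapshot_dates if cutoff > d and d != newest]
    return sorted(expired)


today = date(2026, 5, 7)
dates = [today - timedelta(days=n) for n in (0, 3, 8, 30)]
# The 8-day-old and 30-day-old snapshots fall outside the 7-day window.
print(snapshots_to_delete(dates, today))
```

&lt;p&gt;In practice you would feed this from your snapshot inventory, or use a managed lifecycle policy instead of custom code.&lt;/p&gt;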






&lt;h3&gt;
  
  
  PaaS — SRE Responsibilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Monitor What the Platform Exposes&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RDS is PaaS — you monitor metrics, not the OS
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Key RDS metrics to alert on
&lt;/span&gt;&lt;span class="n"&gt;rds_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CPUUtilization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;# &amp;gt; 80% = alert
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;FreeStorageSpace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# bytes; alert below 20% of allocated storage
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DatabaseConnections&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# near max = alert
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ReadLatency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# &amp;gt; 20ms = investigate
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WriteLatency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# &amp;gt; 20ms = investigate
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ReplicaLag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;# &amp;gt; 30s = alert
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
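
&lt;p&gt;The thresholds in the comments above can be turned into a small rule table. This is a sketch that evaluates already-fetched datapoints and does not call CloudWatch. &lt;code&gt;RULES&lt;/code&gt;, &lt;code&gt;evaluate&lt;/code&gt;, and the derived &lt;code&gt;FreeStoragePercent&lt;/code&gt; key are hypothetical names. &lt;code&gt;DatabaseConnections&lt;/code&gt; is left out because "near max" depends on the instance class.&lt;/p&gt;

```python
import operator

# (metric, comparison, threshold, action), taken from the comments in the list above.
# Latency thresholds are in seconds because CloudWatch reports RDS latency in seconds.
RULES = [
    ("CPUUtilization", operator.gt, 80.0, "alert"),
    ("FreeStoragePercent", operator.lt, 20.0, "alert"),
    ("ReadLatency", operator.gt, 0.020, "investigate"),
    ("WriteLatency", operator.gt, 0.020, "investigate"),
    ("ReplicaLag", operator.gt, 30.0, "alert"),
]


def free_storage_percent(free_bytes: float, allocated_gib: float) -> float:
    """Convert FreeStorageSpace (bytes) into a percentage of allocated storage."""
    return 100.0 * free_bytes / (allocated_gib * 1024**3)


def evaluate(datapoints: dict[str, float]) -> list[tuple[str, str]]:
    """Return (metric, action) for every rule whose threshold is breached."""
    return [
        (metric, action)
        for metric, breached, threshold, action in RULES
        if metric in datapoints and breached(datapoints[metric], threshold)
    ]


sample = {
    "CPUUtilization": 91.0,
    "FreeStoragePercent": free_storage_percent(10 * 1024**3, allocated_gib=100),
    "ReadLatency": 0.004,
    "ReplicaLag": 12.0,
}
print(evaluate(sample))  # [('CPUUtilization', 'alert'), ('FreeStoragePercent', 'alert')]
```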



&lt;p&gt;&lt;strong&gt;2. Understand Platform Limits&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS RDS Limits you MUST know as SRE:
- Max connections per instance type
- Storage autoscaling thresholds
- Failover time (~60-120 seconds)
- Backup retention (1-35 days)
- Maintenance windows impact

If you don't know these → you'll miss incidents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Runbook for PaaS Failures&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## RDS Failover Runbook&lt;/span&gt;
&lt;span class="p"&gt;
1.&lt;/span&gt; Alert fires: RDS_ReplicaLag &amp;gt; 30s
&lt;span class="p"&gt;2.&lt;/span&gt; Check: AWS Console → RDS → Events
&lt;span class="p"&gt;3.&lt;/span&gt; If primary unhealthy → failover triggers automatically
&lt;span class="p"&gt;4.&lt;/span&gt; Expected downtime: 60-120 seconds
&lt;span class="p"&gt;5.&lt;/span&gt; Verify: application reconnects (check connection pooling)
&lt;span class="p"&gt;6.&lt;/span&gt; Notify: stakeholders if &amp;gt; 2 min downtime
&lt;span class="p"&gt;7.&lt;/span&gt; Postmortem: if failover was unexpected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
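
&lt;p&gt;Step 5 of the runbook only works if the application actually retries its database connection during the failover window. Below is a minimal sketch of capped exponential backoff. The function name, the 180-second budget, and the delay values are assumptions for illustration; real applications usually get this behaviour from their connection pool or driver settings.&lt;/p&gt;

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_wait_s: float = 180.0,
    base_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry connect() with capped exponential backoff until max_wait_s is used up."""
    waited = 0.0
    delay = base_delay_s
    while True:
        try:
            return connect()
        except ConnectionError:
            if waited + delay > max_wait_s:
                raise  # failover took longer than the budget: escalate
            sleep(delay)
            waited += delay
            delay = min(delay * 2, max_delay_s)


# Demo with a fake connection that fails three times, as it might during a failover.
attempts = {"count": 0}


def flaky_connect() -> str:
    attempts["count"] += 1
    if attempts["count"] > 3:
        return "connected"
    raise ConnectionError("primary is failing over")


print(connect_with_backoff(flaky_connect, sleep=lambda s: None))  # connected
```

&lt;p&gt;The &lt;code&gt;sleep&lt;/code&gt; parameter is injectable so the retry schedule can be tested without waiting.&lt;/p&gt;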






&lt;h3&gt;
  
  
  SaaS — SRE Responsibilities
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Vendor SLA Tracking&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Track your SaaS vendors' uptime against their SLA
&lt;/span&gt;&lt;span class="n"&gt;vendors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datadog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sla_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;99.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://status.datadoghq.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# no monitoring if down
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pagerduty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sla_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;99.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://status.pagerduty.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRITICAL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# no alerting if down
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;github&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sla_target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;99.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://githubstatus.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impact&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HIGH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# no deploys if down
&lt;/span&gt;    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
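
&lt;p&gt;A dictionary of targets becomes useful once you compare it with what you actually observed. The sketch below assumes you record vendor downtime in minutes over a 30-day window. The vendor names and downtime figures are invented for illustration and are not real vendor data.&lt;/p&gt;

```python
def uptime_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Measured availability over the period, as a percentage."""
    total_minutes = period_days * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_breaches(vendors: dict, downtime_minutes: dict, period_days: int = 30) -> list[str]:
    """Names of vendors whose measured uptime fell below their sla_target."""
    return [
        name
        for name, info in vendors.items()
        if info["sla_target"] > uptime_percent(downtime_minutes.get(name, 0.0), period_days)
    ]


# Hypothetical numbers for illustration only.
tracked = {"vendor_a": {"sla_target": 99.9}, "vendor_b": {"sla_target": 99.9}}
observed = {"vendor_a": 20.0, "vendor_b": 90.0}  # minutes of downtime in 30 days
print(sla_breaches(tracked, observed))  # ['vendor_b']
```

&lt;p&gt;A 99.9% target over 30 days allows about 43 minutes of downtime, so 20 minutes passes and 90 minutes does not.&lt;/p&gt;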



&lt;p&gt;&lt;strong&gt;2. SaaS Dependency Risk&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;As SRE you must ask:
❓ What happens if Datadog goes down?
   → Do we have fallback monitoring?

❓ What happens if PagerDuty goes down?
   → Do we have SMS/phone tree backup?

❓ What happens if GitHub goes down?
   → Can we still deploy hotfixes?

❓ What happens if Okta goes down?
   → Can engineers still access production?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Data Backup for SaaS&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Even SaaS data needs a backup: the vendor may not guarantee it&lt;/span&gt;
&lt;span class="c"&gt;# Example: mirror every repository in a GitHub org&lt;/span&gt;

&lt;span class="nv"&gt;ORGS&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;&lt;span class="s2"&gt;"company-org"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;org &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ORGS&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nv"&gt;repos&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;gh repo list &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$org&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 1000 &lt;span class="nt"&gt;--json&lt;/span&gt; name &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'.[].name'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;repo &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="nv"&gt;$repos&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;git clone &lt;span class="nt"&gt;--mirror&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
      https://github.com/&lt;span class="nv"&gt;$org&lt;/span&gt;/&lt;span class="nv"&gt;$repo&lt;/span&gt;.git &lt;span class="se"&gt;\&lt;/span&gt;
      /backups/github/&lt;span class="nv"&gt;$org&lt;/span&gt;/&lt;span class="nv"&gt;$repo&lt;/span&gt;.git
  &lt;span class="k"&gt;done
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  📊 SLO/SLI Design Per Model
&lt;/h2&gt;

&lt;p&gt;This is where SRE expertise really shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IaaS — You define AND measure everything:
  SLI: Custom metrics from your app + infra
  SLO: 99.9% availability (you control this)
  Error Budget: You own it fully

PaaS — Platform gives you some metrics:
  SLI: Mix of platform metrics + app metrics
  SLO: Limited by platform's own SLA
  Error Budget: Platform failures count against YOU

SaaS — You mostly observe:
  SLI: API response times, login success rate
  SLO: Constrained by vendor SLA
  Error Budget: Vendor downtime burns YOUR budget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
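
&lt;p&gt;Two pieces of arithmetic sit behind the summary above: how much downtime an SLO allows, and why a platform or vendor SLA caps the SLO you can promise. The sketch below assumes a 30-day window and dependencies that must all be up at once; the function names are illustrative.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Minutes of allowed unavailability for an SLO over the period."""
    return period_days * 24 * 60 * (1 - slo_percent / 100)


def serial_availability(*percents: float) -> float:
    """Best-case availability when every listed dependency must be up."""
    result = 1.0
    for p in percents:
        result *= p / 100
    return result * 100


print(round(error_budget_minutes(99.9), 1))         # 43.2
print(round(serial_availability(99.95, 99.95), 2))  # 99.9
```

&lt;p&gt;So if your own code could reach 99.95% and it runs on a platform that offers 99.95%, the best you can realistically commit to is about 99.9%, and every minute of platform downtime comes out of your 43.2-minute budget.&lt;/p&gt;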






&lt;h2&gt;
  
  
  🔥 Real Incident Scenarios by Model
&lt;/h2&gt;

&lt;h3&gt;
  
  
  IaaS Incident
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert: High CPU on EC2 fleet (95%)
SRE Actions:
1. SSH into instance → top → find runaway process
2. Check ASG → is it scaling?
3. Check ALB → redistribute traffic
4. Patch if OS-level issue
5. You have FULL access to diagnose

Resolution time: Fast if skilled, slow if not
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PaaS Incident
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert: RDS connections maxed out
SRE Actions:
1. Check CloudWatch → DatabaseConnections metric
2. Check application → connection pool config
3. Scale instance class (takes minutes; expect a brief outage or failover)
4. Add read replica to distribute load
5. You CANNOT ssh into RDS — limited visibility

Resolution time: Dependent on platform tooling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SaaS Incident
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert: Datadog not receiving metrics
SRE Actions:
1. Check status.datadoghq.com
2. Check your agent → is it running?
3. If vendor issue → wait + use backup monitoring
4. You have ZERO control over their infrastructure

Resolution time: Entirely up to vendor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
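
&lt;p&gt;The first two checks in the SaaS scenario decide whether the problem is the vendor's or yours. A small sketch of that decision follows. The function name and the wording of the returned actions are assumptions, and the final fallback branch is a suggestion rather than part of the scenario above.&lt;/p&gt;

```python
def triage_missing_metrics(vendor_incident_open: bool, agent_running: bool) -> str:
    """Map the two checks from the scenario above to a next action."""
    if vendor_incident_open:
        # Nothing to fix on your side: fall back and wait for the vendor.
        return "vendor issue: switch to backup monitoring and wait"
    if not agent_running:
        # The vendor is healthy, so the gap is local and you can fix it.
        return "local issue: restart the agent and check its logs"
    # Assumption: neither check explains the gap, so widen the search.
    return "unclear: check network egress and API key, then open a vendor ticket"


print(triage_missing_metrics(vendor_incident_open=False, agent_running=False))
```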






&lt;h2&gt;
  
  
  💡 Key SRE Takeaways
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;IaaS&lt;/th&gt;
&lt;th&gt;PaaS&lt;/th&gt;
&lt;th&gt;SaaS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Toil level&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Typical fault origin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You caused it&lt;/td&gt;
&lt;td&gt;Shared&lt;/td&gt;
&lt;td&gt;Vendor caused it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MTTR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You control&lt;/td&gt;
&lt;td&gt;Partly you&lt;/td&gt;
&lt;td&gt;Vendor controls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pay per resource&lt;/td&gt;
&lt;td&gt;Pay per usage&lt;/td&gt;
&lt;td&gt;Pay per seat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual/ASG&lt;/td&gt;
&lt;td&gt;Automatic (you set the rules)&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Patching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;You&lt;/td&gt;
&lt;td&gt;Vendor&lt;/td&gt;
&lt;td&gt;Vendor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Debugging&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full access&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;API/logs only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  🎓 Senior SRE Mental Model
&lt;/h2&gt;

&lt;p&gt;With six years of experience, you should think about it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IaaS = Maximum flexibility, maximum toil
       → Use when you NEED control
       → Automate everything or drown

PaaS = Sweet spot for most workloads
       → Understand platform limits deeply
       → Know exactly what you can't control

SaaS = Treat vendors like internal services
       → Track their SLAs
       → Build fallbacks for critical ones
       → Own YOUR data always

Modern SRE reality:
Most companies use ALL THREE simultaneously
Your job = understand the boundary of responsibility
           at each layer and build reliability
           within those constraints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>beginners</category>
      <category>cloudcomputing</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>What is a VPC in AWS? VPC peering, transit</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Wed, 06 May 2026 20:48:13 +0000</pubDate>
      <link>https://forem.com/jumptotech/what-is-a-vpc-in-aws-vpc-peering-transit-2fjn</link>
      <guid>https://forem.com/jumptotech/what-is-a-vpc-in-aws-vpc-peering-transit-2fjn</guid>
      <description>&lt;p&gt;&lt;strong&gt;VPC (Virtual Private Cloud)&lt;/strong&gt; is your own logically isolated network within AWS — think of it as your private data center inside AWS's infrastructure, where you control the IP ranges, subnets, routing, and security.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Do We Need a VPC?
&lt;/h2&gt;

&lt;p&gt;Without a VPC, all your AWS resources would be on a shared public network — anyone could potentially reach them. VPC solves this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolation&lt;/strong&gt; — your resources are invisible to other AWS accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security&lt;/strong&gt; — you control what traffic comes in and goes out&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom networking&lt;/strong&gt; — define your own IP ranges, subnets, and routes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt; — meet regulatory requirements by keeping data in private networks&lt;/li&gt;
&lt;/ul&gt;





&lt;h2&gt;
  
  
  Key VPC Building Blocks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Subnets&lt;/strong&gt; divide your VPC into sections. A &lt;em&gt;public subnet&lt;/em&gt; has a route to the Internet Gateway, so resources there can receive inbound traffic. A &lt;em&gt;private subnet&lt;/em&gt; has no direct internet route — resources there are unreachable from outside unless you explicitly allow it.&lt;/p&gt;
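
&lt;p&gt;To make the subnet split concrete, here is a short sketch using Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. The &lt;code&gt;10.0.0.0/16&lt;/code&gt; range, the subnet names, and the choice of /24 blocks are assumptions for illustration only.&lt;/p&gt;

```python
import ipaddress

# Hypothetical VPC range and layout, chosen only for illustration.
vpc = ipaddress.ip_network("10.0.0.0/16")
blocks = list(vpc.subnets(new_prefix=24))  # 256 non-overlapping /24 blocks

plan = {
    "public-a": blocks[0],
    "public-b": blocks[1],
    "private-a": blocks[10],
    "private-b": blocks[11],
}

for name, net in plan.items():
    # AWS reserves 5 addresses in every subnet (the first four and the last one).
    usable = net.num_addresses - 5
    print(f"{name}: {net} ({usable} usable addresses)")
```

&lt;p&gt;Whether a block counts as public or private is not a property of the CIDR itself; it depends on the route table associated with the subnet.&lt;/p&gt;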

&lt;p&gt;&lt;strong&gt;Internet Gateway (IGW)&lt;/strong&gt; is the front door. It attaches to your VPC and allows two-way communication with the internet — but only for resources in public subnets that also have a public IP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NAT Gateway&lt;/strong&gt; lets private-subnet resources (like your database) make &lt;em&gt;outbound&lt;/em&gt; calls (e.g. downloading patches) without exposing them to inbound internet traffic. Traffic flows: Private EC2 → NAT GW → IGW → Internet, but never the reverse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route Tables&lt;/strong&gt; are the GPS of your VPC. Each subnet is associated with a route table that tells AWS where to send traffic — public subnets route &lt;code&gt;0.0.0.0/0&lt;/code&gt; to the IGW, private subnets route it to the NAT GW.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security Groups&lt;/strong&gt; act as virtual firewalls at the instance level — you define which ports and IPs are allowed in/out for each resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Endpoints&lt;/strong&gt; let services like Lambda or EC2 talk to S3, DynamoDB, or Secrets Manager &lt;em&gt;without&lt;/em&gt; traffic leaving AWS's backbone — no IGW, no NAT, faster and cheaper.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Services Connect to a VPC
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;How it connects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EC2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Launched directly inside a subnet — has a private IP, optionally a public one&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Placed in a DB subnet group (typically 2+ private subnets across AZs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;By default runs outside any VPC; you can &lt;em&gt;attach&lt;/em&gt; it to a VPC for private access&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ECS / EKS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tasks/pods run inside subnets like EC2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 / DynamoDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Public services; access via VPC Endpoint keeps traffic private&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ALB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lives in public subnets, forwards to private-subnet targets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Public vs Private — When to Use Which
&lt;/h2&gt;

&lt;p&gt;Use a &lt;strong&gt;public subnet&lt;/strong&gt; for: load balancers, bastion hosts, NAT Gateways — anything that genuinely needs to receive internet traffic.&lt;/p&gt;

&lt;p&gt;Use a &lt;strong&gt;private subnet&lt;/strong&gt; for: databases, application servers, Lambda functions, internal microservices — anything that should &lt;em&gt;never&lt;/em&gt; be directly reachable from the internet.&lt;/p&gt;

&lt;p&gt;The general rule: put as little as possible in the public subnet. The smaller your public surface, the harder it is to attack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS Transit Gateway?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Transit Gateway (TGW)&lt;/strong&gt; is a central network hub that connects multiple VPCs, on-premises networks, and AWS accounts together — like a cloud router that everything plugs into.&lt;/p&gt;

&lt;p&gt;Think of it this way: without Transit Gateway, if you have 5 VPCs that all need to talk to each other, you'd need a mesh of VPC peering connections. With Transit Gateway, every VPC just connects to &lt;em&gt;one&lt;/em&gt; hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 1: The problem Transit Gateway solves.&lt;/strong&gt; Without it, VPCs connecting to each other require a full mesh of peering connections that grows unmanageable fast.&lt;/p&gt;

&lt;p&gt;The key pain point with VPC peering: it is &lt;em&gt;non-transitive&lt;/em&gt;. If VPC A peers with VPC B, and VPC B peers with VPC C, VPC A still cannot talk to VPC C. You'd need a direct peering for every pair. With 10 VPCs that's 45 connections to manage. Transit Gateway fixes this entirely: everything connects through one hub.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 2: What Transit Gateway connects.&lt;/strong&gt; It's not just VPC-to-VPC: it acts as the central router for your entire network.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Need Transit Gateway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Scale without the mesh chaos.&lt;/strong&gt; VPC peering connections grow as N×(N-1)/2 — 4 VPCs need 6 connections, 10 VPCs need 45. TGW keeps it at N connections regardless.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Transitivity.&lt;/strong&gt; VPC peering is not transitive — traffic can't hop through an intermediate VPC. TGW routes traffic across all attached networks as a proper router would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Centralized on-premises connectivity.&lt;/strong&gt; Without TGW, each VPC needs its own VPN tunnel to your data center. With TGW, one VPN attachment serves all VPCs attached to the gateway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Traffic isolation via route tables.&lt;/strong&gt; The TGW has its own route tables. You can define that your Dev VPC can only reach other Dev VPCs and a shared-services VPC, while a security VPC sees everything for traffic inspection. This is impossible with VPC peering alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Cross-account and cross-region.&lt;/strong&gt; TGW works with AWS Resource Access Manager to share the gateway across multiple accounts. TGW peering connects gateways across regions.&lt;/p&gt;
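
&lt;p&gt;The scaling argument in point 1 is plain arithmetic, so it can be checked with a small sketch. The function names below are hypothetical and only illustrate the two growth rates: one peering per pair of VPCs versus one Transit Gateway attachment per VPC.&lt;/p&gt;

```python
def peering_connections(vpc_count):
    """Full-mesh VPC peering needs one connection per pair: N * (N - 1) / 2."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """Hub-and-spoke through a Transit Gateway needs one attachment per VPC."""
    return vpc_count


if __name__ == "__main__":
    # Compare how the two designs grow as VPCs are added.
    for n in (2, 4, 10, 50):
        mesh = peering_connections(n)
        hub = tgw_attachments(n)
        print(f"{n} VPCs: {mesh} peerings vs {hub} TGW attachments")
```

&lt;p&gt;For 4 and 10 VPCs this reproduces the 6 and 45 peering connections quoted above, against 4 and 10 attachments for the hub design.&lt;/p&gt;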




&lt;h2&gt;
  
  
  TGW vs VPC Peering — When to Use Which
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2 VPCs that need to talk&lt;/td&gt;
&lt;td&gt;VPC Peering (simpler, cheaper)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4+ VPCs, especially growing&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account, cross-region networking&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises + multiple VPCs&lt;/td&gt;
&lt;td&gt;Transit Gateway&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need centralized firewall/inspection&lt;/td&gt;
&lt;td&gt;Transit Gateway (route all traffic through security VPC)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Just two VPCs, same account&lt;/td&gt;
&lt;td&gt;VPC Peering&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core mental model: &lt;strong&gt;VPC Peering is a direct cable between two VPCs. Transit Gateway is a router that every VPC plugs into.&lt;/strong&gt; Once you have more than a handful of VPCs, the router approach wins every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is VPC Peering?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;VPC Peering&lt;/strong&gt; is a networking connection between two Virtual Private Clouds (VPCs) that allows them to communicate with each other using private IP addresses, as if they were on the same network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key ideas:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC&lt;/strong&gt; = an isolated private network within a cloud provider (like AWS, Google Cloud, or Azure)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peering&lt;/strong&gt; = linking two of those networks together directly, without traffic going over the public internet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traffic between peered VPCs travels through the cloud provider's internal backbone network, making it fast, private, and secure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting a development VPC to a production VPC&lt;/li&gt;
&lt;li&gt;Sharing services (like a database) across teams or accounts&lt;/li&gt;
&lt;li&gt;Connecting VPCs across different regions or different accounts within the same cloud provider&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Non-transitive&lt;/strong&gt; — if VPC A peers with VPC B, and VPC B peers with VPC C, VPC A cannot talk to VPC C through B. Each connection must be explicitly set up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No overlapping CIDR blocks&lt;/strong&gt; — the IP address ranges of the two VPCs cannot overlap&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not a VPN&lt;/strong&gt; — it's a private cloud-internal connection, not an encrypted tunnel over the internet&lt;/li&gt;
&lt;/ul&gt;
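
&lt;p&gt;The overlapping-CIDR limitation is easy to test before you request a peering. The sketch below uses Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module; the helper name &lt;code&gt;cidrs_overlap&lt;/code&gt; is an assumption made for illustration, not an AWS API.&lt;/p&gt;

```python
import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True if the two CIDR blocks share at least one address."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    return net_a.overlaps(net_b)


if __name__ == "__main__":
    print(cidrs_overlap("10.0.0.0/16", "10.0.0.0/16"))    # identical ranges
    print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/17"))  # second range sits inside the first
    print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))    # disjoint ranges
```

&lt;p&gt;Only the last pair is a candidate for peering; the first two overlap and would be rejected.&lt;/p&gt;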

&lt;p&gt;&lt;strong&gt;Quick analogy:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of two office buildings (VPCs) in the same city. VPC peering is like building a private hallway directly between them, so employees can walk between buildings without going outside (the public internet).&lt;/p&gt;

&lt;p&gt;PrivateLink, VPC Peering, and Transit Gateway are easy to confuse because all three keep traffic on the AWS backbone. The real difference is &lt;em&gt;what problem each one solves&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one question that drives the choice:&lt;/strong&gt; what scope of access do you need?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Mental Model
&lt;/h2&gt;

&lt;p&gt;Think of it in terms of &lt;strong&gt;scope of access&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PrivateLink&lt;/strong&gt; = expose &lt;em&gt;one service&lt;/em&gt; to another VPC. The consumer gets a single private IP endpoint — nothing else. They cannot reach any other resource in your VPC. This is surgical, zero-trust access. Use it when two VPCs don't need to talk to each other broadly — they just need one service to be reachable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPC Peering&lt;/strong&gt; = full network access between exactly &lt;em&gt;two VPCs&lt;/em&gt;. Both VPCs can reach any resource in the other (subject to security groups and NACLs). Simple to set up, no cost for the connection itself, but does not scale — every new VPC pair needs its own peering, and it's non-transitive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transit Gateway&lt;/strong&gt; = full network access between &lt;em&gt;many VPCs, accounts, and on-premises networks&lt;/em&gt;, all routed through one hub. More complex and costs money per attachment + data processed, but scales linearly and supports centralized routing policies.&lt;/p&gt;
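
&lt;p&gt;The scope-of-access rule of thumb can be written down as a tiny decision helper. This is only a sketch: the function name, its parameters, and the cut-off of exactly two VPCs are assumptions that mirror the three descriptions above, not an official AWS rule.&lt;/p&gt;

```python
def choose_connectivity(single_service_only, vpc_count, needs_on_prem=False):
    """Pick a connectivity option from the scope of access that is needed.

    Hypothetical helper: it only encodes the rules of thumb from the text.
    """
    if single_service_only:
        # Narrowest scope: expose one service and nothing else.
        return "PrivateLink"
    if vpc_count == 2 and not needs_on_prem:
        # Full network access between exactly two VPCs.
        return "VPC Peering"
    # Many VPCs, several accounts, or an on-premises link: use the hub.
    return "Transit Gateway"


if __name__ == "__main__":
    print(choose_connectivity(False, 2))                      # two VPCs, full access
    print(choose_connectivity(False, 8, needs_on_prem=True))  # many VPCs plus a data center
    print(choose_connectivity(True, 5))                       # one shared API only
```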




&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Scenario 1 — Startup with 2 VPCs (prod + dev)&lt;/strong&gt;&lt;br&gt;
Dev team occasionally needs to pull from a shared database in prod. → Use &lt;strong&gt;VPC Peering&lt;/strong&gt;. Two VPCs, simple, done in minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 2 — Company with 8 VPCs across 3 AWS accounts&lt;/strong&gt;&lt;br&gt;
Networking team needs all VPCs to reach a shared DNS resolver and a centralized logging service, plus a VPN back to the data center. → Use &lt;strong&gt;Transit Gateway&lt;/strong&gt;. One VPN connection shared by all, one hub to manage routing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 3 — Platform team building an internal payments API&lt;/strong&gt;&lt;br&gt;
Other teams' VPCs need to call the payments API — but the platform team doesn't want those VPCs to have any other access into their network. → Use &lt;strong&gt;PrivateLink&lt;/strong&gt;. Expose only the API endpoint. Consumer VPCs get a single private IP, nothing more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 4 — VPC Peering is impossible (overlapping CIDRs)&lt;/strong&gt;&lt;br&gt;
Two VPCs both use &lt;code&gt;10.0.0.0/16&lt;/code&gt; — peering is blocked. But one VPC needs to call a service in the other. → Use &lt;strong&gt;PrivateLink&lt;/strong&gt;. CIDR conflicts don't matter since there's no route table overlap.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario 5 — Lambda in a private VPC needs to write to S3&lt;/strong&gt;&lt;br&gt;
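Lambda is in a VPC with no internet access. You need to reach S3 without adding a NAT Gateway. → Use a &lt;strong&gt;VPC Gateway Endpoint&lt;/strong&gt; for S3 (free). Strictly speaking this is a VPC endpoint rather than PrivateLink: gateway endpoints work through route table entries, while PrivateLink powers Interface Endpoints. Either way, traffic stays entirely within AWS.&lt;/p&gt;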




&lt;h2&gt;
  
  
  Quick Reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PrivateLink&lt;/th&gt;
&lt;th&gt;VPC Peering&lt;/th&gt;
&lt;th&gt;Transit Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Access scope&lt;/td&gt;
&lt;td&gt;Single service&lt;/td&gt;
&lt;td&gt;Whole VPC&lt;/td&gt;
&lt;td&gt;Whole network&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIDR conflicts&lt;/td&gt;
&lt;td&gt;No issue&lt;/td&gt;
&lt;td&gt;Breaks it&lt;/td&gt;
&lt;td&gt;Overlapping VPCs can't route to each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;Unlimited consumers&lt;/td&gt;
&lt;td&gt;Up to ~125 peers&lt;/td&gt;
&lt;td&gt;Thousands of VPCs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;On-premises&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (VPN/DX)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Per endpoint + data&lt;/td&gt;
&lt;td&gt;Data transfer only&lt;/td&gt;
&lt;td&gt;Per attachment + data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;One-way&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complexity&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very low&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern most large companies end up with: &lt;strong&gt;Transit Gateway as the backbone&lt;/strong&gt; for VPC-to-VPC and on-premises connectivity, with &lt;strong&gt;PrivateLink layered on top&lt;/strong&gt; for exposing specific internal services securely to teams or customers who shouldn't have broad network access.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS PrivateLink?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AWS PrivateLink&lt;/strong&gt; lets you access AWS services or your own services hosted in another VPC &lt;em&gt;privately&lt;/em&gt; — traffic never leaves the AWS network, never touches the internet, and the two VPCs don't need to be peered or connected via Transit Gateway.&lt;/p&gt;

&lt;p&gt;The core idea: instead of exposing a service publicly or opening up full VPC-to-VPC networking, PrivateLink creates a &lt;strong&gt;one-way, private endpoint&lt;/strong&gt;. The consumer VPC gets a private IP in its own subnet that tunnels traffic to the provider service. That's it. No route tables to manage, no CIDR conflicts to worry about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 1: How PrivateLink works.&lt;/strong&gt; The core mechanism is an Interface Endpoint (an ENI in your subnet) that maps to the provider's service via AWS's internal network.&lt;/p&gt;

&lt;p&gt;The consumer VPC creates an &lt;strong&gt;Interface Endpoint&lt;/strong&gt;, which is just a private IP (ENI) sitting in its own subnet. DNS resolves the service name to that private IP. Traffic flows through AWS's internal backbone to the provider's Network Load Balancer, then to the actual service. The two VPCs never need to peer, share route tables, or even know each other's CIDR ranges.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diagram 2: The three ways PrivateLink is used&lt;/strong&gt; in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Uses of PrivateLink
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Accessing AWS-managed services privately.&lt;/strong&gt; Services like S3, SQS, ECR, Secrets Manager, KMS, and 100+ others support PrivateLink. Instead of your EC2 or Lambda hitting &lt;code&gt;s3.amazonaws.com&lt;/code&gt; over the internet, an Interface Endpoint gives it a private IP inside your subnet. For S3 and DynamoDB specifically, there's also a free alternative called a Gateway Endpoint. It serves the same goal but is implemented differently: it adds a route table entry instead of an ENI and does not use PrivateLink.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Consuming a partner SaaS service.&lt;/strong&gt; Vendors like Datadog, Splunk, Snowflake, and many others publish themselves as PrivateLink Endpoint Services. You create an Interface Endpoint in your VPC pointing to their service name, and your traffic to them never touches the internet. The vendor's VPC and your VPC never peer — they can't see your network at all, only receive the specific calls you make.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Publishing your own internal service.&lt;/strong&gt; You place a Network Load Balancer in front of your service, register it as an Endpoint Service, then whitelist which AWS accounts can connect. Other teams or customers create Interface Endpoints in their own VPCs pointing at yours. This is how internal platform teams build shared services — auth, payments, data APIs — without opening full VPC-to-VPC access.&lt;/p&gt;




&lt;h2&gt;
  
  
  PrivateLink vs the Alternatives
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;PrivateLink&lt;/th&gt;
&lt;th&gt;VPC Peering&lt;/th&gt;
&lt;th&gt;Transit Gateway&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traffic path&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;td&gt;AWS backbone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CIDR conflicts&lt;/td&gt;
&lt;td&gt;No problem&lt;/td&gt;
&lt;td&gt;Breaks everything&lt;/td&gt;
&lt;td&gt;Overlapping VPCs can't route to each other&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access scope&lt;/td&gt;
&lt;td&gt;Single service only&lt;/td&gt;
&lt;td&gt;Full VPC-to-VPC&lt;/td&gt;
&lt;td&gt;All attached VPCs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Direction&lt;/td&gt;
&lt;td&gt;One-way (consumer → provider)&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-account&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-region&lt;/td&gt;
&lt;td&gt;Yes (via interface EP)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (TGW peering)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Exposing a specific service privately&lt;/td&gt;
&lt;td&gt;Small number of VPCs needing full access&lt;/td&gt;
&lt;td&gt;Large-scale hub-and-spoke networking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: PrivateLink is surgical. VPC Peering and Transit Gateway open up &lt;em&gt;networking&lt;/em&gt; — entire VPCs can talk to each other. PrivateLink exposes only &lt;em&gt;one service&lt;/em&gt; through a single endpoint. If you just want your app to call an internal payments API without routing everything through a shared network, PrivateLink is exactly the right tool.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>project #1: Company: *FinTrust Bank (digital banking platform) your role: 👉 Site Reliability Engineer (SRE)</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:25:26 +0000</pubDate>
      <link>https://forem.com/jumptotech/project-1-company-fintrust-bank-digital-banking-platform-your-role-site-reliability-23fp</link>
      <guid>https://forem.com/jumptotech/project-1-company-fintrust-bank-digital-banking-platform-your-role-site-reliability-23fp</guid>
      <description>&lt;p&gt;What the bank does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Online banking (accounts, transfers, payments)&lt;/li&gt;
&lt;li&gt;Mobile + web applications&lt;/li&gt;
&lt;li&gt;Real-time transactions&lt;/li&gt;
&lt;li&gt;Strict security &amp;amp; compliance (PCI-DSS, encryption)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  👩‍💻 YOUR ROLE
&lt;/h1&gt;

&lt;p&gt;Title:&lt;/p&gt;

&lt;p&gt;👉 Site Reliability Engineer (SRE)&lt;/p&gt;




&lt;h2&gt;
  
  
  Your responsibility
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ensure &lt;strong&gt;99.99% uptime&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Protect &lt;strong&gt;sensitive financial data&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Prevent &lt;strong&gt;unauthorized access&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Ensure &lt;strong&gt;low latency transactions&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handle &lt;strong&gt;incidents quickly&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Maintain &lt;strong&gt;secure architecture&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
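
&lt;p&gt;To make the 99.99% target concrete, it helps to convert it into an allowed-downtime budget. The sketch below assumes a 30-day window, which is a common choice but not something stated in this project; the function name is hypothetical.&lt;/p&gt;

```python
from fractions import Fraction


def downtime_budget_minutes(slo_percent, window_days=30):
    """Allowed downtime, in minutes, for an availability target over a window.

    The 30-day default is an assumption; use whatever window your SLO names.
    """
    # Fractions avoid floating-point noise in values such as 99.99.
    unavailable = 1 - Fraction(str(slo_percent)) / 100
    return float(unavailable * window_days * 24 * 60)


if __name__ == "__main__":
    for target in ("99.9", "99.99"):
        minutes = downtime_budget_minutes(target)
        print(f"{target}% over 30 days allows {minutes} minutes of downtime")
```

&lt;p&gt;Under that assumption, 99.99% leaves about 4.32 minutes of downtime per 30 days, which is why manual recovery alone cannot meet it.&lt;/p&gt;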




&lt;h1&gt;
  
  
  🏗️ PROJECT NAME
&lt;/h1&gt;

&lt;p&gt;👉 &lt;strong&gt;Secure Multi-Tier Banking Infrastructure on AWS with High Availability and Zero-Trust Networking&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  🧠 CORE IDEA
&lt;/h1&gt;

&lt;p&gt;Banking system MUST:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✔ Never expose database
✔ Encrypt all traffic
✔ Restrict access strictly
✔ Handle failures instantly
✔ Be fully observable
✔ Support multi-region design
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🏗️ ARCHITECTURE
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (Mobile / Web)
   ↓
DNS (Route 53)
   ↓
WAF + Shield
   ↓
CloudFront (CDN + TLS)
   ↓
Application Load Balancer (DMZ / Public)
   ↓
App Layer (Private Subnets)
   ↓
Transaction Services (Private)
   ↓
Database (Private DB Subnet, encrypted)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔐 SECURITY (MOST IMPORTANT FOR BANK)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What you implemented
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Network isolation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;VPC with private architecture&lt;/li&gt;
&lt;li&gt;No public IPs for app or DB&lt;/li&gt;
&lt;li&gt;Only ALB exposed&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Firewall design
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ALB SG → allow 443 from internet&lt;/li&gt;
&lt;li&gt;App SG → allow only from ALB&lt;/li&gt;
&lt;li&gt;DB SG → allow only from app&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Zero trust model&lt;/p&gt;
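
&lt;p&gt;The chained security groups can be modelled in a few lines to see which paths are open. This is a hypothetical sketch: the group names and the app and DB ports (8080 and 5432) are assumptions for illustration; only 443 on the ALB comes from the list above.&lt;/p&gt;

```python
# Inbound rules per security group, as (allowed source, port) pairs.
# Ports 8080 and 5432 are assumed; the source only specifies 443 on the ALB.
RULES = {
    "alb-sg": [("internet", 443)],
    "app-sg": [("alb-sg", 8080)],
    "db-sg": [("app-sg", 5432)],
}


def is_allowed(source, target_sg, port):
    """Return True if target_sg has an inbound rule for this source and port."""
    return (source, port) in RULES.get(target_sg, [])


if __name__ == "__main__":
    print(is_allowed("internet", "alb-sg", 443))   # public entry point
    print(is_allowed("alb-sg", "app-sg", 8080))    # ALB to app tier
    print(is_allowed("internet", "db-sg", 5432))   # internet straight to DB
    print(is_allowed("alb-sg", "db-sg", 5432))     # ALB skipping the app tier
```

&lt;p&gt;Each tier accepts traffic only from the tier directly in front of it, so the last two checks come back closed.&lt;/p&gt;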




&lt;h3&gt;
  
  
  3. Encryption
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;HTTPS everywhere (TLS)&lt;/li&gt;
&lt;li&gt;DB encryption (at rest)&lt;/li&gt;
&lt;li&gt;Secrets stored securely&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  4. WAF protection
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;blocked SQL injection&lt;/li&gt;
&lt;li&gt;blocked bots&lt;/li&gt;
&lt;li&gt;rate limiting&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🌐 NETWORKING (WHAT YOU BUILT)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  VPC design
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subnets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Public (DMZ):
- ALB
- NAT

Private App:
- Banking APIs

Private DB:
- RDS (transactions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
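
&lt;p&gt;One way to carve the &lt;code&gt;10.0.0.0/16&lt;/code&gt; range into these tiers is sketched below with Python's &lt;code&gt;ipaddress&lt;/code&gt; module. The /24 subnet size, the two Availability Zones, and the AZ labels are assumptions; the project only fixes the VPC CIDR and the three tiers.&lt;/p&gt;

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("10.0.0.0/16")

# Assumptions for illustration: /24 subnets, two AZs per tier, made-up AZ labels.
TIERS = ["public", "private-app", "private-db"]
AZS = ["az-a", "az-b"]


def plan_subnets(vpc=VPC_CIDR, tiers=TIERS, azs=AZS, new_prefix=24):
    """Assign consecutive subnets of the VPC to each (tier, AZ) pair."""
    blocks = vpc.subnets(new_prefix=new_prefix)
    return {(tier, az): next(blocks) for tier in tiers for az in azs}


if __name__ == "__main__":
    for (tier, az), cidr in plan_subnets().items():
        print(f"{tier:12} {az}  {cidr}")
```

&lt;p&gt;Spreading every tier across two zones is what lets the Multi-AZ failover described later work.&lt;/p&gt;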






&lt;h2&gt;
  
  
  Routing
&lt;/h2&gt;

&lt;p&gt;Public route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → IGW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → NAT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DB route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NO internet access
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
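
&lt;p&gt;How a route table picks a target can be sketched as a longest-prefix match. The tables below copy the three above and add the implicit &lt;code&gt;local&lt;/code&gt; route for the VPC CIDR; the table names and the &lt;code&gt;next_hop&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import ipaddress

# Route tables from this section, plus the implicit "local" route for the VPC CIDR.
# Target names are plain labels, not real resource IDs.
ROUTE_TABLES = {
    "public": [("10.0.0.0/16", "local"), ("0.0.0.0/0", "igw")],
    "private-app": [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat")],
    "private-db": [("10.0.0.0/16", "local")],
}


def next_hop(table_name, destination_ip):
    """Longest-prefix match: the most specific matching route wins."""
    dest = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in ROUTE_TABLES[table_name]
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route, so the traffic goes nowhere
    return max(matches, key=lambda match: match[0].prefixlen)[1]


if __name__ == "__main__":
    for table in ROUTE_TABLES:
        print(table, "to 8.8.8.8:", next_hop(table, "8.8.8.8"))
```

&lt;p&gt;The DB table has no default route, so internet-bound traffic from the database subnet has no next hop at all, while traffic inside the VPC still matches the local route.&lt;/p&gt;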






&lt;h2&gt;
  
  
  Private access
&lt;/h2&gt;

&lt;p&gt;Used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC Endpoint for S3&lt;/li&gt;
&lt;li&gt;VPC Endpoint for Secrets Manager&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 No internet dependency&lt;/p&gt;




&lt;h1&gt;
  
  
  ⚖️ HIGH AVAILABILITY (BANK REQUIREMENT)
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Multi-AZ deployment&lt;/li&gt;
&lt;li&gt;ALB distributes traffic&lt;/li&gt;
&lt;li&gt;Auto Scaling enabled&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Failure handling
&lt;/h2&gt;

&lt;p&gt;If one AZ fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traffic shifts automatically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  📡 MULTI-VPC / ENTERPRISE DESIGN
&lt;/h1&gt;

&lt;p&gt;You designed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Core banking VPC&lt;/li&gt;
&lt;li&gt;Shared services VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connected using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPC Peering&lt;/li&gt;
&lt;li&gt;AWS Transit Gateway&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🔒 PRIVATELINK (VERY STRONG POINT)
&lt;/h1&gt;

&lt;p&gt;Used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS PrivateLink&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;internal fraud detection API exposed privately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 No full VPC exposure&lt;/p&gt;




&lt;h1&gt;
  
  
  🏢 HYBRID (REAL BANKING)
&lt;/h1&gt;

&lt;p&gt;Bank has on-prem systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;legacy transaction systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connected using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VPN&lt;/li&gt;
&lt;li&gt;Direct Connect (concept)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  📊 OBSERVABILITY (SRE CORE)
&lt;/h1&gt;

&lt;p&gt;You implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudWatch metrics&lt;/li&gt;
&lt;li&gt;ALB access logs&lt;/li&gt;
&lt;li&gt;VPC Flow Logs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What you monitor
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;li&gt;traffic spikes&lt;/li&gt;
&lt;li&gt;blocked requests&lt;/li&gt;
&lt;li&gt;DB connections&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚨 INCIDENTS YOU HANDLED
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Example 1 — Payment API down
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ALB 503&lt;/li&gt;
&lt;li&gt;found unhealthy targets&lt;/li&gt;
&lt;li&gt;restarted service&lt;/li&gt;
&lt;li&gt;fixed health check&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example 2 — Transaction delay
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;high latency detected&lt;/li&gt;
&lt;li&gt;traced to DB slow query&lt;/li&gt;
&lt;li&gt;optimized query&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example 3 — Security alert
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;WAF blocked traffic spike&lt;/li&gt;
&lt;li&gt;identified bot attack&lt;/li&gt;
&lt;li&gt;tuned rules&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example 4 — Private EC2 lost internet
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;NAT route missing&lt;/li&gt;
&lt;li&gt;fixed route table&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example 5 — DNS misrouting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;wrong ALB target&lt;/li&gt;
&lt;li&gt;updated Route 53&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🧑‍🤝‍🧑 TEAM STRUCTURE
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;2 SREs&lt;/li&gt;
&lt;li&gt;5 backend engineers&lt;/li&gt;
&lt;li&gt;2 frontend engineers&lt;/li&gt;
&lt;li&gt;1 security engineer&lt;/li&gt;
&lt;li&gt;1 DevOps/platform engineer&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🤝 YOUR COLLABORATION
&lt;/h1&gt;

&lt;p&gt;You worked with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;backend → debugging API failures&lt;/li&gt;
&lt;li&gt;security → WAF rules, compliance&lt;/li&gt;
&lt;li&gt;DevOps → deployments&lt;/li&gt;
&lt;li&gt;product → outage impact&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  📅 YOUR DAILY WORK
&lt;/h1&gt;

&lt;p&gt;Morning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check dashboards&lt;/li&gt;
&lt;li&gt;review alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fix incidents&lt;/li&gt;
&lt;li&gt;optimize performance&lt;/li&gt;
&lt;li&gt;deploy updates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On-call:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;respond to outages&lt;/li&gt;
&lt;li&gt;troubleshoot quickly&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🏆 YOUR ACHIEVEMENTS
&lt;/h1&gt;

&lt;p&gt;You can say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;achieved 99.99% uptime&lt;/li&gt;
&lt;li&gt;reduced downtime by resolving recurring issues&lt;/li&gt;
&lt;li&gt;secured architecture (no public DB)&lt;/li&gt;
&lt;li&gt;improved performance&lt;/li&gt;
&lt;li&gt;reduced costs using VPC endpoints&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  💬 STRONG INTERVIEW ANSWER
&lt;/h1&gt;

&lt;p&gt;Say this:&lt;/p&gt;

&lt;p&gt;“I worked as an SRE on a banking platform where I designed and maintained a secure multi-tier AWS architecture. I implemented private networking using VPC, subnets, and NAT Gateway, and ensured that only the load balancer was exposed publicly. I secured communication using security groups and WAF, and placed the database in isolated private subnets with no internet access. I integrated DNS using Route 53 and implemented private access to AWS services using VPC endpoints. I also designed multi-VPC connectivity using Transit Gateway and PrivateLink for secure service exposure. As part of my SRE responsibilities, I monitored system health using CloudWatch and logs, handled incidents such as load balancer failures and database connectivity issues, and ensured high availability and performance for critical banking transactions.”&lt;/p&gt;




&lt;h1&gt;
  
  
  🔥 WHY THIS PROJECT IS POWERFUL
&lt;/h1&gt;

&lt;p&gt;Because it shows:&lt;/p&gt;

&lt;p&gt;✔ Security (bank-level)&lt;br&gt;
✔ Networking (deep)&lt;br&gt;
✔ Reliability (SRE core)&lt;br&gt;
✔ Real-world scenarios&lt;br&gt;
✔ Troubleshooting&lt;/p&gt;

</description>
    </item>
    <item>
      <title>VPC, subnets, IGW, NAT, routing, firewall, DMZ, private DB, and troubleshooting part #3</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:18:11 +0000</pubDate>
      <link>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-part-3-35n7</link>
      <guid>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-part-3-35n7</guid>
      <description>&lt;h1&gt;
  
  
  Real Outage Simulation: SRE Networking Debugging
&lt;/h1&gt;

&lt;p&gt;Architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
 ↓
Route 53 DNS
 ↓
ALB public subnet / DMZ
 ↓
Web EC2 private subnet
 ↓
DB private subnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your SRE troubleshooting order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. DNS
2. WAF / ALB
3. Target Group health
4. Security Groups
5. Route Tables
6. NAT / IGW
7. EC2 / Nginx
8. DB
9. Logs / Flow Logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  OUTAGE 1 — Website Completely Down
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;User says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.company.com is not opening.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Browser shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This site can’t be reached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1 — Check DNS
&lt;/h2&gt;

&lt;p&gt;From your laptop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nslookup app.company.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected good output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: app.company.com
Address: ALB-DNS or ALB IPs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server can't find app.company.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;Route 53 record deleted or wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route 53 → Hosted Zone → Create Record
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record type: A
Alias: Yes
Target: Application Load Balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;DNS was not resolving to the ALB, so traffic never reached AWS infrastructure.&lt;/p&gt;
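
&lt;p&gt;The same DNS check can be scripted. This is a minimal sketch using only the Python standard library; the &lt;code&gt;resolve&lt;/code&gt; function name is an assumption made for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket


def resolve(hostname):
    """Return the sorted, unique IP addresses for hostname, or [] when it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the failure case above, &lt;code&gt;resolve("app.company.com")&lt;/code&gt; would return an empty list, which matches the "server can't find" output from nslookup.&lt;/p&gt;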




&lt;h1&gt;
  
  
  OUTAGE 2 — ALB Returns 503
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Browser opens, but shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;503 Service Temporarily Unavailable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Meaning
&lt;/h2&gt;

&lt;p&gt;ALB is reachable, but it has no healthy backend targets.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Check Target Group
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 → Target Groups → sre-app-tg → Targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unhealthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Check health reason
&lt;/h2&gt;

&lt;p&gt;Possible reasons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Health checks failed
Request timed out
Target.ResponseCodeMismatch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
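
&lt;p&gt;Each reason points to a different next step. The sketch below maps reason codes to a suggested check. The sample data is made up but shaped like the &lt;code&gt;TargetHealthDescriptions&lt;/code&gt; list returned by the ELBv2 DescribeTargetHealth API; the advice strings and the &lt;code&gt;triage&lt;/code&gt; name are assumptions of this sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Likely next step for each reason code; the advice strings are this sketch's own wording.
NEXT_STEP = {
    "Target.Timeout": "check security groups, NACLs and whether the service is listening",
    "Target.ResponseCodeMismatch": "check the health check path and expected status code",
    "Target.FailedHealthChecks": "check the service status and logs on the instance",
}


def triage(descriptions):
    """Return (target_id, reason, advice) for every target that is not healthy."""
    findings = []
    for item in descriptions:
        health = item["TargetHealth"]
        if health["State"] == "healthy":
            continue
        reason = health.get("Reason", "unknown")
        advice = NEXT_STEP.get(reason, "inspect the target manually")
        findings.append((item["Target"]["Id"], reason, advice))
    return findings


# Made-up sample data with hypothetical instance IDs.
sample = [
    {"Target": {"Id": "i-0aaa", "Port": 80}, "TargetHealth": {"State": "healthy"}},
    {"Target": {"Id": "i-0bbb", "Port": 80},
     "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
]
for finding in triage(sample):
    print(finding)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;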



&lt;h2&gt;
  
  
  Step 3 — Check app server
&lt;/h2&gt;

&lt;p&gt;Connect to the private EC2 instance using SSM Session Manager or a bastion host.&lt;/p&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;inactive (dead)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start nginx
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl localhost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello from Web Server 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;ALB returned 503 because the target group had no healthy instances. The web service was stopped, so the health check failed.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 3 — ALB Target Unhealthy Because Security Group Is Wrong
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;ALB returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;503
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Target group shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Unhealthy
Health check timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check Security Group
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 → Security Groups → web-sg → Inbound rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correct rule should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 from alb-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad rule example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 from your IP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;The ALB cannot reach the web server because web-sg does not allow traffic from alb-sg.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Edit web-sg inbound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Type: HTTP
Port: 80
Source: alb-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait 1–2 minutes.&lt;/p&gt;

&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target health: Healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;The application itself was fine, but the firewall blocked ALB-to-web traffic.&lt;/p&gt;
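
&lt;p&gt;The rule comparison can also be done in code. The sketch below checks whether a set of inbound rules allows a port from a given source security group. The rule dictionaries are made up but shaped like the &lt;code&gt;IpPermissions&lt;/code&gt; list of an EC2 security group; &lt;code&gt;sg-alb&lt;/code&gt; and the function name are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def allows_from_group(permissions, port, source_group_id):
    """True when any inbound rule allows TCP `port` from `source_group_id`."""
    for rule in permissions:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":
            port_matches = True  # "-1" means all protocols and all ports
        elif protocol == "tcp":
            port_matches = port in range(rule["FromPort"], rule["ToPort"] + 1)
        else:
            port_matches = False
        groups = {pair["GroupId"] for pair in rule.get("UserIdGroupPairs", [])}
        if port_matches and source_group_id in groups:
            return True
    return False


# Made-up rules: the bad one allows a single IP, the good one allows the ALB group.
bad_rules = [{"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
              "IpRanges": [{"CidrIp": "203.0.113.10/32"}], "UserIdGroupPairs": []}]
good_rules = [{"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
               "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-alb"}]}]

print(allows_from_group(bad_rules, 80, "sg-alb"))   # False
print(allows_from_group(good_rules, 80, "sg-alb"))  # True
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;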




&lt;h1&gt;
  
  
  OUTAGE 4 — Private EC2 Cannot Install Packages
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;On private EC2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Temporary failure resolving
or
Connection timed out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1 — Check route table
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Route Tables → private-rt → Routes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No default route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2 — Check NAT
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → NAT Gateways
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;State: Available
Subnet: Public subnet
Elastic IP: attached
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Check public route table
&lt;/h2&gt;

&lt;p&gt;Public subnet must have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Add route:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private Route Table
0.0.0.0/0 → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;Private EC2 had no outbound internet because the private subnet was missing the NAT route.&lt;/p&gt;
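
&lt;p&gt;A route table can be classified the same way in code. This sketch inspects made-up routes shaped like the &lt;code&gt;Routes&lt;/code&gt; list of an EC2 route table and reports where the default route points. The function name and the IDs are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def default_route(routes):
    """Classify where 0.0.0.0/0 points: 'nat', 'igw', 'blackhole', 'other' or 'missing'."""
    for route in routes:
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") == "blackhole":
            return "blackhole"  # the target (for example a deleted NAT Gateway) is gone
        if route.get("NatGatewayId"):
            return "nat"
        if route.get("GatewayId", "").startswith("igw-"):
            return "igw"
        return "other"
    return "missing"


# Made-up routes with hypothetical IDs.
private_rt_bad = [{"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"}]
private_rt_good = private_rt_bad + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123", "State": "active"}
]

print(default_route(private_rt_bad))   # missing
print(default_route(private_rt_good))  # nat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A private route table should report &lt;code&gt;nat&lt;/code&gt; and a public one &lt;code&gt;igw&lt;/code&gt;. A &lt;code&gt;blackhole&lt;/code&gt; result means the route exists but its target no longer does.&lt;/p&gt;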




&lt;h1&gt;
  
  
  OUTAGE 5 — Public EC2 / ALB Not Reachable
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Browser cannot reach ALB DNS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check ALB Security Group
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 → Security Groups → alb-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correct inbound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 from 0.0.0.0/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No inbound rule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 from 0.0.0.0/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;The ALB was healthy, but its security group blocked public HTTP traffic.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 6 — Web Server Cannot Connect to DB
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Application shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Database connection failed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 1 — From web EC2 test DB port
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nc &lt;span class="nt"&gt;-vz&lt;/span&gt; &amp;lt;db-private-ip&amp;gt; 3306
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;connection timed out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;succeeded
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
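
&lt;p&gt;The difference between a timeout and a refusal is a useful signal. A timeout usually means packets are being dropped on the way (security group, NACL, or routing), while a refusal means the host is reachable but nothing is listening on the port. Below is a minimal Python equivalent of the &lt;code&gt;nc&lt;/code&gt; test; the &lt;code&gt;probe&lt;/code&gt; name and its return labels are assumptions of this sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import socket


def probe(host, port, timeout=3.0):
    """Return 'open', 'refused', 'timeout' or 'error' for a TCP connection attempt."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # host reachable, nothing listening on the port
    except socket.timeout:
        return "timeout"  # packets silently dropped, typical of a blocking SG or NACL
    except OSError:
        return "error"    # for example no route to host or a name that does not resolve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From the web EC2, calling &lt;code&gt;probe&lt;/code&gt; with the DB private IP and port 3306 should return &lt;code&gt;open&lt;/code&gt; once the DB security group is correct.&lt;/p&gt;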



&lt;h2&gt;
  
  
  Step 2 — Check DB SG
&lt;/h2&gt;

&lt;p&gt;Correct inbound rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MySQL 3306 from web-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MySQL 3306 from your IP
or
No MySQL rule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Edit db-sg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
MySQL/Aurora
Port: 3306
Source: web-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;The database was private and secure, but the app tier was not allowed by the DB security group.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 7 — One Web Server Down, Site Still Works
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Stop one EC2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sre-web-1 stopped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;User still sees website.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why?
&lt;/h2&gt;

&lt;p&gt;ALB sends traffic only to healthy targets.&lt;/p&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target group:
web-1 → unused/unhealthy
web-2 → healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;This proves high availability. One instance failed, but ALB removed it from rotation and continued sending traffic to the healthy instance.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 8 — Both Web Servers in Same AZ
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Application works normally, but during AZ failure everything goes down.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;Both web servers are in one Availability Zone.&lt;/p&gt;

&lt;p&gt;Bad design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web-1 → us-east-1a
web-2 → us-east-1a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Good design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web-1 → us-east-1a
web-2 → us-east-1b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Launch web servers across different private subnets in different AZs.&lt;/p&gt;

&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;High availability requires spreading resources across multiple Availability Zones.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 9 — Wrong Health Check Path
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Website works manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://private-ip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello from Web Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But ALB target is unhealthy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check health check path
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target Group → Health checks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But app only serves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Change health check path to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or add a &lt;code&gt;/health&lt;/code&gt; endpoint to the app.&lt;/p&gt;

&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;The application was running, but ALB health check was using a path that did not exist.&lt;/p&gt;
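
&lt;p&gt;The health check compares the response status code with the target group's success codes, which can be a single value, a comma-separated list, or a range. The sketch below imitates that comparison; the &lt;code&gt;matches&lt;/code&gt; function is a hypothetical illustration, not the ALB's actual implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def matches(status_code, matcher="200"):
    """True when status_code satisfies a matcher such as '200', '200,302' or '200-299'."""
    for part in matcher.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-")
            if status_code in range(int(low), int(high) + 1):
                return True
        elif part and status_code == int(part):
            return True
    return False


print(matches(404, "200"))      # False: /health does not exist, so the target is unhealthy
print(matches(200, "200"))      # True: / exists
print(matches(302, "200-399"))  # True: a wider matcher accepts redirects
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;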




&lt;h1&gt;
  
  
  OUTAGE 10 — NACL Blocks Return Traffic
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Security groups look correct. Route tables look correct. Still traffic times out.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check NACL
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Network ACLs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NACL is stateless
Inbound and outbound both must be allowed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For HTTP, allow:&lt;/p&gt;

&lt;p&gt;Inbound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TCP 80 from the client or ALB subnets (incoming requests)
TCP 1024-65535 ephemeral ports (responses to connections the subnet itself opens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outbound:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TCP 80 (requests the subnet itself opens)
TCP 1024-65535 ephemeral ports (responses to incoming requests)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;NACL allowed inbound request but blocked return traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Allow the ephemeral port range (1024-65535) in the outbound NACL rules so responses can leave the subnet.&lt;/p&gt;

&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;Security groups are stateful, but NACLs are stateless. Return traffic must be explicitly allowed.&lt;/p&gt;
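
&lt;p&gt;Stateless evaluation is easier to see in a small simulation. NACL rules are evaluated in ascending rule-number order, the first match decides, and anything unmatched is denied. The rule tuples and the &lt;code&gt;nacl_allows&lt;/code&gt; function below are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def nacl_allows(rules, port):
    """Evaluate rules in ascending rule-number order; the first match decides."""
    for number, low, high, action in sorted(rules):
        if port in range(low, high + 1):
            return action == "allow"
    return False  # the implicit final rule denies everything else


# Made-up rule sets: (rule number, first port, last port, action).
inbound = [(100, 80, 80, "allow")]
outbound_broken = [(100, 80, 80, "allow")]
outbound_fixed = outbound_broken + [(110, 1024, 65535, "allow")]

client_port = 50000  # example ephemeral source port chosen by the client

print(nacl_allows(inbound, 80))                   # True: the request gets in
print(nacl_allows(outbound_broken, client_port))  # False: the response is dropped
print(nacl_allows(outbound_fixed, client_port))   # True: the response gets out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;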




&lt;h1&gt;
  
  
  OUTAGE 11 — VPC Peering Not Working
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;EC2 in VPC A cannot reach EC2 in VPC B.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check 1 — Peering status
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Peering Connections
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pending acceptance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check 2 — Route tables
&lt;/h2&gt;

&lt;p&gt;VPC A route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.1.0.0/16 → peering connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VPC B route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → peering connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check 3 — CIDR overlap
&lt;/h2&gt;

&lt;p&gt;Bad:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A: 10.0.0.0/16
VPC B: 10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A peering connection cannot be created between VPCs whose CIDR ranges overlap.&lt;/p&gt;
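
&lt;p&gt;Overlap can be checked before a peering request is created. A minimal sketch using Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module follows; the &lt;code&gt;cidrs_overlap&lt;/code&gt; name is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """True when the two IPv4 ranges share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(cidrs_overlap("10.0.0.0/16", "10.0.0.0/16"))  # True: cannot be peered
print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))  # False: safe to peer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;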

&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;VPC peering requires non-overlapping CIDR ranges, active peering, routes on both sides, and firewall rules allowing traffic.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 12 — PrivateLink Works for One Service Only
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Consumer VPC can access one API through endpoint, but cannot reach other private EC2s in provider VPC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Explanation
&lt;/h2&gt;

&lt;p&gt;This is expected.&lt;/p&gt;

&lt;p&gt;PrivateLink is not full network connectivity.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PrivateLink → service-level access
VPC Peering → network-level access
Transit Gateway → large network hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;PrivateLink exposes only a specific service through an endpoint. It does not allow full VPC-to-VPC communication.&lt;/p&gt;




&lt;h1&gt;
  
  
  OUTAGE 13 — VPN Tunnel Down
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;On-prem users cannot reach AWS private app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check
&lt;/h2&gt;

&lt;p&gt;In AWS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Site-to-Site VPN Connections
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Bad output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tunnel 1: DOWN
Tunnel 2: DOWN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Customer gateway public IP correct?
On-prem firewall allows IPsec?
BGP routes advertised?
AWS route table has route to on-prem CIDR?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;VPN issues usually come from tunnel status, BGP route advertisement, firewall rules, or missing route table entries.&lt;/p&gt;
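
&lt;p&gt;Tunnel status and BGP route counts can be summarised together. The sketch below works on made-up data shaped like the &lt;code&gt;VgwTelemetry&lt;/code&gt; list of a Site-to-Site VPN connection. The function names and the "degraded" label are assumptions of this sketch.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def tunnel_state(telemetry):
    """Summarise tunnel status as 'up', 'degraded' (only some tunnels up) or 'down'."""
    up = [t for t in telemetry if t.get("Status") == "UP"]
    if not up:
        return "down"
    if len(up) != len(telemetry):
        return "degraded"
    return "up"


def tunnels_without_routes(telemetry):
    """Outside IPs of tunnels that are UP but have accepted zero BGP routes."""
    return [t["OutsideIpAddress"] for t in telemetry
            if t.get("Status") == "UP" and t.get("AcceptedRouteCount", 0) == 0]


# Made-up telemetry using documentation IP addresses.
sample = [
    {"OutsideIpAddress": "198.51.100.1", "Status": "DOWN", "AcceptedRouteCount": 0},
    {"OutsideIpAddress": "198.51.100.2", "Status": "UP", "AcceptedRouteCount": 0},
]
print(tunnel_state(sample))            # degraded
print(tunnels_without_routes(sample))  # ['198.51.100.2']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A tunnel that is UP with zero accepted routes points at BGP advertisement rather than IPsec.&lt;/p&gt;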




&lt;h1&gt;
  
  
  OUTAGE 14 — DNS Points to Old ALB
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;New deployment completed, but users still hit old app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig app.company.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare with current ALB DNS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Root cause
&lt;/h2&gt;

&lt;p&gt;The Route 53 record still points to the old ALB, or cached DNS answers have not yet expired (TTL).&lt;/p&gt;

&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

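&lt;p&gt;Update the Route 53 alias record to point to the new ALB.&lt;/p&gt;

&lt;p&gt;Lower the TTL before a planned migration. This applies to non-alias records; an alias record takes its TTL from the target resource.&lt;/p&gt;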

&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;The application was not broken. DNS was pointing users to the wrong load balancer.&lt;/p&gt;
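
&lt;p&gt;Comparing the record target with the current ALB DNS name is easy to get wrong by eye. The sketch below normalises both names first. It assumes the alias target may carry a trailing dot and a &lt;code&gt;dualstack.&lt;/code&gt; prefix, as alias targets for load balancers often do; the host names and function names are made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def normalise(dns_name):
    """Lower-case the name and drop the trailing dot and any 'dualstack.' prefix."""
    name = dns_name.strip().lower().rstrip(".")
    prefix = "dualstack."
    return name[len(prefix):] if name.startswith(prefix) else name


def points_to(record_target, alb_dns_name):
    """True when the alias target and the ALB DNS name refer to the same load balancer."""
    return normalise(record_target) == normalise(alb_dns_name)


# Made-up names for illustration.
record = "dualstack.old-alb-123.us-east-1.elb.amazonaws.com."
current = "new-alb-456.us-east-1.elb.amazonaws.com"
print(points_to(record, current))  # False: DNS still targets the old ALB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;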




&lt;h1&gt;
  
  
  OUTAGE 15 — WAF Blocks Real Users
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Symptom
&lt;/h2&gt;

&lt;p&gt;Some users get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;403 Forbidden
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Check
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AWS WAF → Web ACL → Logs / Sampled requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Blocked rule
Source IP
URI path
User agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fix
&lt;/h2&gt;

&lt;p&gt;Options:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Adjust managed rule
Add allowlist
Change rule priority
Tune rate limit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SRE explanation
&lt;/h2&gt;

&lt;p&gt;WAF protects the app, but rules can create false positives. SRE must verify blocked requests before disabling protection.&lt;/p&gt;
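
&lt;p&gt;Counting blocks per rule shows quickly which rule is producing false positives. The sketch below uses made-up records with field names from AWS WAF logs (&lt;code&gt;action&lt;/code&gt;, &lt;code&gt;terminatingRuleId&lt;/code&gt;, &lt;code&gt;httpRequest&lt;/code&gt;). The rule name &lt;code&gt;rate-limit&lt;/code&gt; and the function name are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import Counter


def blocked_by_rule(records):
    """Count BLOCK decisions per terminating rule."""
    return Counter(r["terminatingRuleId"] for r in records if r.get("action") == "BLOCK")


# Made-up records for illustration.
sample = [
    {"action": "BLOCK", "terminatingRuleId": "rate-limit",
     "httpRequest": {"clientIp": "203.0.113.7", "uri": "/login"}},
    {"action": "BLOCK", "terminatingRuleId": "rate-limit",
     "httpRequest": {"clientIp": "203.0.113.7", "uri": "/login"}},
    {"action": "ALLOW", "terminatingRuleId": "Default_Action",
     "httpRequest": {"clientIp": "198.51.100.9", "uri": "/"}},
]
print(blocked_by_rule(sample).most_common(1))  # [('rate-limit', 2)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;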




&lt;h1&gt;
  
  
  Final SRE Outage Debugging Script
&lt;/h1&gt;

&lt;p&gt;In an interview, say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When an outage happens, I do not guess. I follow the request path. First I check DNS resolution, then ALB status, listener, security group, target group health, application service, route tables, NAT or IGW, then database connectivity. I also use ALB logs, VPC Flow Logs, CloudWatch metrics, and application logs to prove where the traffic is failing.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Best Practice Summary
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DNS issue       → dig / nslookup
ALB issue       → listener, SG, target group
503             → no healthy targets
504             → backend timeout
Private no net  → NAT route
DB issue        → DB SG from web SG
NACL issue      → remember stateless
Peering issue   → routes both sides
VPN issue       → tunnel + BGP + routes
WAF issue       → check blocked rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Very strong interview answer
&lt;/h1&gt;

&lt;p&gt;I troubleshoot production outages by following the traffic path from user to backend: DNS, WAF, load balancer, target group, security groups, route tables, NACLs, EC2 service, and database. I verify each layer with tools like dig, curl, nc, ALB health checks, CloudWatch metrics, ALB logs, and VPC Flow Logs. My goal is to quickly identify whether the problem is DNS, routing, firewall, load balancer, application, or database, then restore service and document the root cause.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>VPC, subnets, IGW, NAT, routing, firewall, DMZ, private DB, and troubleshooting part #2</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:16:02 +0000</pubDate>
      <link>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-part-2-5a7g</link>
      <guid>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-part-2-5a7g</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User (Internet)
   ↓
DNS (Route 53)
   ↓
WAF (optional)
   ↓
Load Balancer (Public / DMZ)
   ↓
Private Web Tier (EC2 / App)
   ↓
Private DB Tier
   ↓
Private AWS Services (via VPC Endpoint)

Cross-VPC / Hybrid:
   ↔ VPC Peering / Transit Gateway
   ↔ PrivateLink
   ↔ VPN / Direct Connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🚀 STEP 11 — ADD DNS (Route 53)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why SRE adds this
&lt;/h2&gt;

&lt;p&gt;Users should not access the ALB DNS name directly.&lt;br&gt;
They use a domain such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.company.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route 53 → Hosted Zones → Create Hosted Zone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you already have a domain, use it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Create record
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Record name: app
Type: A
Alias: YES
Target: ALB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Expected result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nslookup app.yourdomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Name: app.yourdomain.com
Address: ALB IP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Now the flow is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → DNS → ALB → Web
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE troubleshooting
&lt;/h2&gt;

&lt;p&gt;If the site is down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig app.yourdomain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does the name resolve?&lt;/li&gt;
&lt;li&gt;Does it point to the correct ALB?&lt;/li&gt;
&lt;li&gt;Is an old answer still cached within its TTL?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 12 — ADD WAF (SECURITY LAYER)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;p&gt;Security Groups = network firewall&lt;br&gt;
WAF = application firewall (Layer 7)&lt;/p&gt;


&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WAF → Create Web ACL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Attach to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Add rules
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AWS Managed Rules&lt;/li&gt;
&lt;li&gt;Rate limiting (1000 req/min)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;Now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad traffic blocked BEFORE app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE troubleshooting
&lt;/h2&gt;

&lt;p&gt;If legitimate users are blocked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;check the WAF logs&lt;/li&gt;
&lt;li&gt;check rule priority&lt;/li&gt;
&lt;li&gt;look for false positives&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 13 — ADD CLOUDWATCH + ALB LOGS
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;p&gt;An SRE must be able to &lt;strong&gt;see traffic&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Enable ALB logs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 → Load Balancer → Attributes → Enable Access Logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store the logs in an S3 bucket.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key log fields (simplified)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client_ip request_path target_status_code latency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why important
&lt;/h2&gt;

&lt;p&gt;You can debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;500 errors&lt;/li&gt;
&lt;li&gt;slow requests&lt;/li&gt;
&lt;li&gt;bad clients&lt;/li&gt;
&lt;/ul&gt;
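
&lt;p&gt;A real access log entry has more fields than the simplified view above, with quoted strings for the request and user agent. The sketch below names the leading fields of one entry. The sample line, the load balancer name &lt;code&gt;sre-alb&lt;/code&gt;, and the &lt;code&gt;parse_entry&lt;/code&gt; function are made up, and the line is truncated after the user agent.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import shlex

FIELDS = [
    "type", "time", "elb", "client", "target",
    "request_processing_time", "target_processing_time", "response_processing_time",
    "elb_status_code", "target_status_code", "received_bytes", "sent_bytes",
    "request", "user_agent",
]


def parse_entry(line):
    """Map the leading fields of one access log entry to names; later fields are ignored."""
    return dict(zip(FIELDS, shlex.split(line)))


# Made-up entry, truncated after the user agent.
line = ('http 2026-04-30T00:16:02.000000Z app/sre-alb/0123456789abcdef '
        '203.0.113.7:51000 10.0.3.10:80 0.000 2.345 0.000 200 200 120 512 '
        '"GET http://app.company.com:80/ HTTP/1.1" "curl/8.0.0"')
entry = parse_entry(line)
print(entry["client"], entry["target_status_code"], entry["target_processing_time"])
# 203.0.113.7:51000 200 2.345
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A high &lt;code&gt;target_processing_time&lt;/code&gt; points at the backend, while a 5xx in &lt;code&gt;elb_status_code&lt;/code&gt; with no target status points at the load balancer layer.&lt;/p&gt;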




&lt;h1&gt;
  
  
  🚀 STEP 14 — ADD VPC FLOW LOGS
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Already partially covered — now use it
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Flow Logs → Create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Example output (simplified)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ACCEPT TCP 10.0.3.10 → 10.0.5.20 3306
REJECT TCP 1.2.3.4 → 10.0.5.20 3306
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;You can prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;traffic allowed&lt;/li&gt;
&lt;li&gt;traffic blocked&lt;/li&gt;
&lt;/ul&gt;
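
&lt;p&gt;Actual flow log records in the default format are space-separated lines with fourteen fields. The sketch below parses them and lists rejected connections to a port. The sample records, account ID, and interface ID are made up; the function names are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr", "srcport",
    "dstport", "protocol", "packets", "bytes", "start", "end", "action", "log_status",
]


def parse_record(line):
    """Split one default-format flow log record into a dict keyed by field name."""
    return dict(zip(FIELDS, line.split()))


def rejected(lines, dstport):
    """Return (srcaddr, dstaddr) pairs for REJECT records aimed at dstport."""
    records = [parse_record(line) for line in lines]
    return [(r["srcaddr"], r["dstaddr"]) for r in records
            if r["action"] == "REJECT" and r["dstport"] == str(dstport)]


# Made-up records in the default (version 2) format; protocol 6 is TCP.
lines = [
    "2 123456789012 eni-0aaa 10.0.3.10 10.0.5.20 50000 3306 6 10 840 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0aaa 1.2.3.4 10.0.5.20 50001 3306 6 1 40 1700000000 1700000060 REJECT OK",
]
print(rejected(lines, 3306))  # [('1.2.3.4', '10.0.5.20')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;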




&lt;h1&gt;
  
  
  🚀 STEP 15 — ADD VPC ENDPOINT (PRIVATE AWS ACCESS)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;p&gt;Private EC2 instances should NOT go through the internet to reach AWS services.&lt;/p&gt;




&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Endpoints → Create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3
Type: Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Attach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private route table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2 → S3 (no NAT, no internet)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE importance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;secure (traffic stays on the AWS network)&lt;/li&gt;
&lt;li&gt;cheaper (S3 traffic bypasses the NAT Gateway)&lt;/li&gt;
&lt;li&gt;commonly required in enterprise environments&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 16 — ADD VPC PEERING (MULTI-VPC)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;You have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC-A → your app
VPC-B → shared services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Create second VPC
&lt;/h2&gt;

&lt;p&gt;CIDR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.1.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Peering → Create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Update routes BOTH SIDES
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC-A route table: 10.1.0.0/16 → peering
VPC-B route table: 10.0.0.0/16 → peering
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private communication between VPCs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE troubleshooting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;routes missing?&lt;/li&gt;
&lt;li&gt;SG blocking?&lt;/li&gt;
&lt;li&gt;CIDR overlap?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 17 — ADD TRANSIT GATEWAY (ENTERPRISE LEVEL)
&lt;/h1&gt;

&lt;p&gt;Instead of managing many peering connections, attach every VPC to one central hub.&lt;/p&gt;




&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Transit Gateway → Create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Attach VPCs
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attach VPC-A
Attach VPC-B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Central network hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why SRE uses this
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;scalable&lt;/li&gt;
&lt;li&gt;cleaner architecture&lt;/li&gt;
&lt;li&gt;used in large companies&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 18 — ADD PRIVATELINK (ADVANCED)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;Expose ONLY one service, not the full network.&lt;/p&gt;




&lt;h2&gt;
  
  
  Flow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Consumer VPC → Endpoint → NLB → Service VPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;secure&lt;/li&gt;
&lt;li&gt;no full VPC access&lt;/li&gt;
&lt;li&gt;SaaS architecture&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Difference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peering → full network
PrivateLink → one service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🚀 STEP 19 — ADD VPN (HYBRID CLOUD)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;The company has on-premises servers that must reach AWS privately.&lt;/p&gt;




&lt;h2&gt;
  
  
  Go to:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → VPN → Create Site-to-Site VPN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On-prem → encrypted → AWS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  SRE checks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;tunnel UP?&lt;/li&gt;
&lt;li&gt;routes correct?&lt;/li&gt;
&lt;li&gt;firewall open?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 20 — ADD DIRECT CONNECT (THEORY)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What
&lt;/h2&gt;

&lt;p&gt;A dedicated private network connection between your data center and AWS that does not cross the public internet.&lt;/p&gt;




&lt;h2&gt;
  
  
  When used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;banks&lt;/li&gt;
&lt;li&gt;large companies&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Difference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPN → internet
Direct Connect → private line
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🚀 STEP 21 — FINAL SRE TESTING (REAL SCENARIOS)
&lt;/h1&gt;




&lt;h2&gt;
  
  
  Scenario 1 — ALB down
&lt;/h2&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DNS → OK?
ALB → Active?
Target → Healthy?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
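
&lt;p&gt;The order of these checks matters: start at the outermost layer and move inward. A small Python sketch of that triage order follows; the layer names and the function first_failure are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Checks are ordered from the outermost layer inward, as in Scenario 1.
CHECK_ORDER = ["dns", "alb", "targets"]


def first_failure(results):
    """Return the first failing layer, or None if every check passed.

    results maps a layer name to True (passed) or False (failed).
    A missing layer counts as not yet verified, so it is reported too.
    """
    for layer in CHECK_ORDER:
        if not results.get(layer, False):
            return layer
    return None


# Hypothetical outcome: DNS resolves, the ALB is active, targets are unhealthy.
print(first_failure({"dns": True, "alb": True, "targets": False}))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;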






&lt;h2&gt;
  
  
  Scenario 2 — App slow
&lt;/h2&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB logs
Latency
DB connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 3 — DB not reachable
&lt;/h2&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SG rules
Port 3306
Private routing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 4 — Private EC2 no internet
&lt;/h2&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAT
Route table
IGW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 5 — DNS issue
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig app.domain.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔥 WHAT YOU HAVE NOW (REAL SRE LEVEL)
&lt;/h1&gt;

&lt;p&gt;You built:&lt;/p&gt;

&lt;p&gt;✔ Multi-tier architecture&lt;br&gt;
✔ DMZ design&lt;br&gt;
✔ Private networking&lt;br&gt;
✔ Load balancing&lt;br&gt;
✔ Firewall (SG + WAF)&lt;br&gt;
✔ DNS routing&lt;br&gt;
✔ Observability (logs + flow logs)&lt;br&gt;
✔ Private AWS access (VPC endpoint)&lt;br&gt;
✔ Multi-VPC (peering + transit)&lt;br&gt;
✔ Service exposure (PrivateLink)&lt;br&gt;
✔ Hybrid cloud (VPN)&lt;/p&gt;




&lt;h1&gt;
  
  
  💬 FINAL INTERVIEW ANSWER
&lt;/h1&gt;

&lt;p&gt;You say:&lt;/p&gt;

&lt;p&gt;I built a production-grade AWS architecture with DNS using Route 53, public access through an Application Load Balancer in DMZ subnets, private application and database tiers, secure communication using security groups, outbound internet via NAT Gateway, private AWS access via VPC endpoints, and network observability using VPC Flow Logs and ALB logs. I also implemented multi-VPC connectivity using VPC peering and Transit Gateway, and secure service exposure using PrivateLink, along with hybrid connectivity using VPN.&lt;/p&gt;


</description>
    </item>
    <item>
      <title>Full SRE Networking Lecture: What You Must Know After Basic VPC</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Thu, 30 Apr 2026 00:00:32 +0000</pubDate>
      <link>https://forem.com/jumptotech/full-sre-networking-lecture-what-you-must-know-after-basic-vpc-35op</link>
      <guid>https://forem.com/jumptotech/full-sre-networking-lecture-what-you-must-know-after-basic-vpc-35op</guid>
      <description>&lt;h2&gt;
  
  
  1. DNS and Route 53
&lt;/h2&gt;

&lt;p&gt;DNS is one of the most important networking topics for SRE. Many production outages look like “application is down,” but the real issue is DNS.&lt;/p&gt;

&lt;p&gt;DNS translates a name into an IP address or another DNS name.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;jumptotech.com → ALB DNS name → EC2 targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In AWS, the main DNS service is &lt;strong&gt;Amazon Route 53&lt;/strong&gt;. Route 53 is used to manage domain records and route users to AWS resources like ALB, CloudFront, S3 static websites, or failover endpoints.&lt;/p&gt;

&lt;p&gt;Important DNS record types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A record     → domain to IPv4 address
AAAA record  → domain to IPv6 address
CNAME        → domain to another domain
ALIAS        → Route 53 record pointing to AWS resources like ALB or CloudFront
TXT          → verification, SPF, DKIM, security records
MX           → email routing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, for an application, the flow is usually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
 ↓
Route 53
 ↓
Application Load Balancer
 ↓
Private application servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As an SRE, when a website is not reachable, you should not immediately check EC2. First check DNS.&lt;/p&gt;

&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nslookup example.com
dig example.com
dig example.com +short
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
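
&lt;p&gt;When reading the output of dig +short, it helps to separate names (for example a CNAME target) from addresses. The Python sketch below does that for a pasted output string. The sample hostname and IPs are made up for illustration, and classify_answers is a hypothetical helper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ipaddress


def classify_answers(dig_short_output):
    """Label each line of `dig +short` output as A, AAAA, or name."""
    labelled = []
    for line in dig_short_output.splitlines():
        value = line.strip()
        if not value:
            continue
        try:
            version = ipaddress.ip_address(value).version
        except ValueError:
            # Not an IP literal, so it is a name such as a CNAME target.
            labelled.append((value, "name"))
            continue
        labelled.append((value, "A" if version == 4 else "AAAA"))
    return labelled


# Made-up sample output: one CNAME target followed by two addresses.
sample = """my-alb-123.us-east-1.elb.amazonaws.com.
192.0.2.10
192.0.2.11
"""
for value, kind in classify_answers(sample):
    print(kind, value)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An empty result means the name did not resolve at all, which answers the first troubleshooting question before you look at anything else.&lt;/p&gt;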



&lt;p&gt;Troubleshooting questions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Does the domain resolve?
Does it point to the correct ALB?
Was DNS changed recently?
Is TTL too long?
Is Route 53 health check failing?
Is the record public or private hosted zone?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Route 53 can also use health checks and failover routing. AWS recommends evaluating target health for alias records when using health-based DNS routing; otherwise, Route 53 may still route traffic to unhealthy resources. (&lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover-complex-configs.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;Route 53 is the AWS DNS service. I use it to route user traffic to AWS resources such as ALB or CloudFront. As an SRE, I troubleshoot DNS by checking resolution, record type, TTL, hosted zone, and whether the DNS target is healthy.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Load Balancing: ALB vs NLB
&lt;/h2&gt;

&lt;p&gt;A load balancer distributes traffic across multiple targets. In production, users should not directly access EC2 instances. They should access a load balancer.&lt;/p&gt;

&lt;p&gt;Main AWS load balancers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB = Application Load Balancer
NLB = Network Load Balancer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Application Load Balancer
&lt;/h3&gt;

&lt;p&gt;ALB works at Layer 7, the application layer. It understands HTTP and HTTPS.&lt;/p&gt;

&lt;p&gt;Use ALB for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;web applications
APIs
path-based routing
host-based routing
HTTPS termination
microservices
containerized apps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;app.example.com      → frontend target group
api.example.com      → backend target group
/example/orders      → orders service
/example/payments    → payment service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
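
&lt;p&gt;The routing example above can be sketched as a first-match rule list in Python. This is a simplification under stated assumptions: a real ALB evaluates rules by priority number and supports wildcard path patterns, while this sketch uses list order and plain prefix matching. The target group names and the default target are hypothetical.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Each rule: (host or None, path prefix or None, target group name).
# Names mirror the example above; the default target is an assumption.
RULES = [
    (None, "/example/orders", "orders-service"),
    (None, "/example/payments", "payment-service"),
    ("app.example.com", None, "frontend"),
    ("api.example.com", None, "backend"),
]
DEFAULT_TARGET = "default"


def pick_target_group(host, path):
    """Return the target group of the first rule whose conditions all match."""
    for rule_host, rule_prefix, target in RULES:
        host_ok = rule_host is None or rule_host == host
        path_ok = rule_prefix is None or path.startswith(rule_prefix)
        if host_ok and path_ok:
            return target
    # No rule matched: fall through to the default action.
    return DEFAULT_TARGET


print(pick_target_group("app.example.com", "/"))
print(pick_target_group("api.example.com", "/example/orders/42"))
print(pick_target_group("unknown.example.com", "/"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rule order matters here: the path rules are listed first, so a request for /example/orders goes to the orders service even when the host is api.example.com.&lt;/p&gt;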



&lt;p&gt;ALB uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Listener
Rule
Target Group
Health Check
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A listener receives traffic on a port, usually 80 or 443. A rule decides where to forward the request. A target group contains EC2, ECS tasks, IPs, or Lambda targets. Health checks decide whether the target should receive traffic. AWS documentation says ALB target groups route requests to registered targets and health checks are configured per target group. (&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Production flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
 ↓
Route 53
 ↓
ALB in public subnet
 ↓
EC2/ECS/EKS app in private subnet
 ↓
RDS database in private subnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As an SRE, if the ALB returns &lt;code&gt;503&lt;/code&gt;, it usually means the target group has no registered or healthy targets to serve the request.&lt;/p&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Are targets registered?
Are targets healthy?
Is health check path correct?
Is app listening on correct port?
Does app security group allow traffic from ALB security group?
Is the app returning 200 on health check path?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-I&lt;/span&gt; http://alb-dns-name
curl http://private-app-ip:8080/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Network Load Balancer
&lt;/h3&gt;

&lt;p&gt;NLB works at Layer 4. It handles TCP, UDP, and TLS traffic.&lt;/p&gt;

&lt;p&gt;Use NLB for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;very high performance
TCP applications
static IP requirement
low latency
non-HTTP protocols
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kafka
database proxy
game servers
TCP services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NLB health checks determine whether targets are available. AWS documentation says NLB uses active and passive health checks. By default, each load balancer node routes traffic only to healthy targets in its own Availability Zone; with cross-zone load balancing enabled, it routes to healthy targets in all enabled Availability Zones. (&lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;I use ALB for HTTP and HTTPS applications because it supports Layer 7 routing, TLS termination, host-based and path-based rules. I use NLB for high-performance TCP or UDP workloads where low latency or static IP is required.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Target Groups and Health Checks
&lt;/h2&gt;

&lt;p&gt;Health checks are critical for reliability.&lt;/p&gt;

&lt;p&gt;A load balancer should not send traffic to a broken server. That is why every target group has a health check.&lt;/p&gt;

&lt;p&gt;Example health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Protocol: HTTP
Path: /health
Success code: 200
Interval: 30 seconds
Healthy threshold: 3
Unhealthy threshold: 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
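
&lt;p&gt;The thresholds mean that one failed check does not remove a target; the state only changes after enough consecutive results. A minimal Python sketch of that logic, using the example values above, is shown here. The class TargetHealth and its state names are assumptions for illustration, not the load balancer's internal implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class TargetHealth:
    """Tracks one target's state from consecutive health check results."""

    def __init__(self, healthy_threshold=3, unhealthy_threshold=2):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.state = "initial"
        self.successes = 0
        self.failures = 0

    def record(self, passed):
        """Feed one check result (True or False) and return the new state."""
        if passed:
            self.successes += 1
            self.failures = 0
            if self.successes == self.healthy_threshold:
                self.state = "healthy"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures == self.unhealthy_threshold:
                self.state = "unhealthy"
        return self.state


target = TargetHealth()
for result in [True, True, True, False, False]:
    print(result, target.record(result))

# Detection time is roughly the interval times the unhealthy threshold.
print(30 * 2, "seconds")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With a 30 second interval and an unhealthy threshold of 2, a broken target keeps receiving traffic for about a minute. Lower values detect failures faster but react more to short blips.&lt;/p&gt;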



&lt;p&gt;Bad health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? The homepage can return 200 while the database connection is broken.&lt;/p&gt;

&lt;p&gt;Better health check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/health
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This endpoint should check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;application running
database reachable
required dependencies available
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But do not make health checks too heavy. If &lt;code&gt;/health&lt;/code&gt; runs expensive database queries every few seconds, the health check itself can overload the app.&lt;/p&gt;

&lt;p&gt;As an SRE, when a deployment causes an outage, check target group health first.&lt;/p&gt;

&lt;p&gt;Troubleshooting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target unhealthy because timeout?
Target unhealthy because 403?
Target unhealthy because 500?
Wrong port?
Wrong path?
Security group blocking ALB?
App binding to localhost only?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common mistake:&lt;/p&gt;

&lt;p&gt;Application listens on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But it should listen on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
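
&lt;p&gt;A quick way to catch this mistake is to check whether the listen address is loopback-only. The Python sketch below assumes the host:port form shown above (IPv4 only); reachable_from_load_balancer is a hypothetical helper name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ipaddress


def reachable_from_load_balancer(listen_address):
    """Return False when the app is bound to a loopback-only address.

    listen_address uses the host:port form shown above (IPv4 only here).
    """
    host, _, _port = listen_address.rpartition(":")
    return not ipaddress.ip_address(host).is_loopback


print(reachable_from_load_balancer("127.0.0.1:8080"))
print(reachable_from_load_balancer("0.0.0.0:8080"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;An app bound to 127.0.0.1 accepts connections only from the same machine, so the health check from the load balancer times out even though curl on the instance itself works.&lt;/p&gt;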



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;Health checks allow the load balancer to remove unhealthy targets from rotation. As an SRE, I always verify the health check path, port, response code, security group, and application logs.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. VPC Endpoints
&lt;/h2&gt;

&lt;p&gt;VPC endpoints allow private resources to access AWS services without using the public internet.&lt;/p&gt;

&lt;p&gt;AWS documentation says VPC endpoints privately connect your VPC to supported AWS services without requiring an internet gateway, NAT device, VPN, or Direct Connect. Traffic stays on the AWS network backbone. (&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/building-scalable-secure-multi-vpc-network-infrastructure/centralized-access-to-vpc-private-endpoints.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Without VPC endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2
 ↓
NAT Gateway
 ↓
Internet path
 ↓
S3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With VPC endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2
 ↓
VPC Endpoint
 ↓
S3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gateway Endpoint   → S3, DynamoDB
Interface Endpoint → SSM, ECR, CloudWatch, Secrets Manager, STS, KMS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
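
&lt;p&gt;Endpoint service names follow the pattern com.amazonaws.REGION.SERVICE. The Python sketch below builds the name and picks the endpoint type using only the split listed above. The region is an assumption, endpoint_plan is a hypothetical helper, and the sketch ignores the fact that some services (S3, for example) can also be reached through interface endpoints.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Services named in the list above; anything else falls back to Interface.
GATEWAY_SERVICES = {"s3", "dynamodb"}


def endpoint_plan(region, service):
    """Return (service_name, endpoint_type) for a VPC endpoint request."""
    endpoint_type = "Gateway" if service in GATEWAY_SERVICES else "Interface"
    return ("com.amazonaws." + region + "." + service, endpoint_type)


# The region "us-east-1" is only an example value.
for service in ["s3", "dynamodb", "ssm", "ecr.api", "logs"]:
    print(endpoint_plan("us-east-1", service))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The type decides what you troubleshoot later: gateway endpoints depend on route table entries, while interface endpoints depend on security groups and private DNS.&lt;/p&gt;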



&lt;p&gt;Why SRE uses VPC endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;increase security
reduce NAT Gateway dependency
reduce NAT data processing cost
allow private subnet access to AWS services
support private architecture
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Very important real-world example:&lt;/p&gt;

&lt;p&gt;You have private EC2 with no public IP. You want to connect using AWS Systems Manager Session Manager.&lt;/p&gt;

&lt;p&gt;You may need interface endpoints for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssm
ssmmessages
ec2messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your private EC2 needs to pull Docker images from ECR, you may need endpoints for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ecr.api
ecr.dkr
s3
logs (CloudWatch Logs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Troubleshooting endpoint issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is endpoint created in correct VPC?
Is private DNS enabled?
Is security group allowing HTTPS 443 to endpoint?
Is route table configured for gateway endpoint?
Does IAM policy allow access?
Is endpoint policy blocking access?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;I use VPC endpoints when private workloads need to access AWS services without going through the public internet or NAT Gateway. This improves security and can reduce cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. AWS PrivateLink
&lt;/h2&gt;

&lt;p&gt;PrivateLink is related to VPC endpoints, but it is more advanced.&lt;/p&gt;

&lt;p&gt;AWS PrivateLink allows private connectivity between VPCs, AWS services, services in other AWS accounts, and Marketplace services without using public internet, NAT, VPN, or Direct Connect. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Use case:&lt;/p&gt;

&lt;p&gt;Company A exposes service privately.&lt;/p&gt;

&lt;p&gt;Company B consumes it privately.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Consumer VPC
 ↓
Interface Endpoint
 ↓
PrivateLink
 ↓
Provider NLB
 ↓
Provider service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PrivateLink is service-level access, not full network access.&lt;/p&gt;

&lt;p&gt;This is very important.&lt;/p&gt;

&lt;p&gt;Difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC Peering     → connects networks
Transit Gateway → connects many networks
PrivateLink     → exposes only one service privately
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters:&lt;/p&gt;

&lt;p&gt;With VPC peering, VPCs can potentially route to many internal resources.&lt;/p&gt;

&lt;p&gt;With PrivateLink, the consumer can access only the specific service exposed through the endpoint.&lt;/p&gt;

&lt;p&gt;When to use PrivateLink:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SaaS provider exposing private API
shared internal service
cross-account service access
security-sensitive architecture
avoid full VPC-to-VPC routing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;PrivateLink is used when we want to expose a specific service privately without giving full network access between VPCs. It is more controlled than VPC peering.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. VPC Peering
&lt;/h2&gt;

&lt;p&gt;VPC peering connects two VPCs using private IPs.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A: 10.0.0.0/16
VPC B: 10.1.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After peering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EC2 in VPC A → private IP → EC2 in VPC B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CIDR blocks cannot overlap
Peering must be accepted
Routes must be added on both sides
Security groups must allow traffic
NACLs must allow traffic
Peering is not transitive
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
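
&lt;p&gt;The first rule is easy to verify before you request a peering connection. A minimal Python sketch using the standard ipaddress module is shown here; cidrs_overlap is a hypothetical helper and the second pair of CIDRs is an invented clash.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ipaddress


def cidrs_overlap(cidr_a, cidr_b):
    """Return True when two CIDR blocks share at least one address."""
    return ipaddress.ip_network(cidr_a).overlaps(ipaddress.ip_network(cidr_b))


print(cidrs_overlap("10.0.0.0/16", "10.1.0.0/16"))  # the example VPCs
print(cidrs_overlap("10.0.0.0/16", "10.0.128.0/20"))  # hypothetical clash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running this check across every planned VPC CIDR early is cheaper than discovering an overlap after workloads are deployed.&lt;/p&gt;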



&lt;p&gt;Not transitive means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A peers with VPC B
VPC B peers with VPC C
VPC A cannot automatically reach VPC C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use peering when:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;only two or a few VPCs need communication
simple architecture
low operational complexity
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not use peering when you have many VPCs. It becomes hard to manage.&lt;/p&gt;

&lt;p&gt;Troubleshooting peering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is peering active?
Are CIDRs overlapping?
Does VPC A route table point to peering connection?
Does VPC B route table point back?
Do SG/NACL allow traffic?
Is DNS resolution enabled if using private DNS names?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;VPC peering is private connectivity between two VPCs. It is simple and low-latency, but it does not support transitive routing and does not scale well for many VPCs.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Transit Gateway
&lt;/h2&gt;

&lt;p&gt;Transit Gateway is like a cloud router.&lt;/p&gt;

&lt;p&gt;Instead of creating many VPC peering connections, you attach VPCs to one central hub.&lt;/p&gt;

&lt;p&gt;Without Transit Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A ↔ VPC B
VPC A ↔ VPC C
VPC A ↔ VPC D
VPC B ↔ VPC C
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This becomes messy.&lt;/p&gt;
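
&lt;p&gt;How messy? Because peering is not transitive, full connectivity needs one peering connection per pair of VPCs, while a hub needs one attachment per VPC. A short Python sketch of the arithmetic (function names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC reaches every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def hub_attachments(vpc_count):
    """Attachments needed when each VPC connects to one central hub."""
    return vpc_count


for count in [4, 10, 50]:
    print(count, full_mesh_peerings(count), hub_attachments(count))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ten VPCs already need 45 peering connections, each with routes on both sides, compared with 10 hub attachments.&lt;/p&gt;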

&lt;p&gt;With Transit Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A                ↔ Transit Gateway
VPC B                ↔ Transit Gateway
VPC C                ↔ Transit Gateway
VPN / Direct Connect ↔ Transit Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Transit Gateway for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;many VPCs
multi-account architecture
shared services VPC
hybrid cloud
centralized firewall inspection
enterprise networks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS VPC connectivity documentation lists Transit Gateway, VPC peering, PrivateLink, VPN, and Direct Connect as major private connectivity options. (&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/amazon-vpc-to-amazon-vpc-connectivity-options.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;As an SRE, you need to know that Transit Gateway has route tables too.&lt;/p&gt;

&lt;p&gt;Troubleshooting Transit Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is VPC attached to TGW?
Is attachment available?
Is route propagated?
Is route associated with correct TGW route table?
Do subnet route tables point to TGW?
Do SG/NACL allow traffic?
Is there asymmetric routing?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;Transit Gateway is used as a central network hub to connect many VPCs and hybrid networks. It is better than VPC peering when the environment has many VPCs or accounts.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. VPN and Direct Connect
&lt;/h2&gt;

&lt;p&gt;These are used for hybrid cloud: connecting on-premises data centers to AWS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Site-to-Site VPN
&lt;/h3&gt;

&lt;p&gt;VPN creates encrypted tunnels over the public internet.&lt;/p&gt;

&lt;p&gt;Flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On-prem router/firewall
 ↓ encrypted tunnel
AWS VPN Gateway / Transit Gateway
 ↓
VPC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use VPN when you need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;quick setup
lower cost
backup connection
encrypted connection over internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Limitations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;internet-dependent
latency can vary
bandwidth limited compared to Direct Connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Direct Connect
&lt;/h3&gt;

&lt;p&gt;Direct Connect is a dedicated private network connection from your data center or colocation to AWS.&lt;/p&gt;

&lt;p&gt;Use Direct Connect when:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stable latency required
large data transfer
enterprise hybrid cloud
more predictable performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS documentation describes connectivity options using Direct Connect, Site-to-Site VPN, and Transit Gateway for remote network to VPC connectivity. (&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-vpc-connectivity-options/introduction.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Production design often uses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Direct Connect as primary
VPN as backup
Transit Gateway as hub
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Troubleshooting hybrid connectivity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is tunnel up?
Is BGP established?
Are routes advertised?
Are security groups allowing traffic?
Are on-prem firewalls allowing traffic?
Is return route correct?
Is DNS resolving private names?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;VPN provides encrypted connectivity over the internet, while Direct Connect provides a dedicated private connection to AWS. In production, companies often use Direct Connect for stable performance and VPN as backup.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. AWS WAF
&lt;/h2&gt;

&lt;p&gt;WAF means Web Application Firewall.&lt;/p&gt;

&lt;p&gt;Security Group controls ports and IP access.&lt;/p&gt;

&lt;p&gt;NACL controls subnet-level traffic.&lt;/p&gt;

&lt;p&gt;WAF protects web applications at Layer 7.&lt;/p&gt;

&lt;p&gt;WAF can block:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SQL injection
cross-site scripting
bad bots
malicious IPs
rate-based attacks
suspicious headers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common placement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
 ↓
Route 53
 ↓
CloudFront or ALB (WAF web ACL attached, inspects each request)
 ↓
Application
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use WAF when:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public web application
API exposed to internet
compliance requirement
need Layer 7 protection
need rate limiting
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Troubleshooting WAF:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is WAF blocking legitimate traffic?
Check WAF logs
Check rule priority
Check managed rule false positives
Check rate limit
Check IP reputation list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;WAF protects applications at Layer 7 from web attacks such as SQL injection, XSS, and bad bots. I use it in front of ALB or CloudFront for internet-facing applications.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. CloudFront and CDN Basics
&lt;/h2&gt;

&lt;p&gt;CloudFront is the AWS CDN.&lt;/p&gt;

&lt;p&gt;CDN means content delivery network.&lt;/p&gt;

&lt;p&gt;It caches content close to users.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Without CloudFront:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User in California → ALB in Virginia
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With CloudFront:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User in California → nearest edge location → origin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use CloudFront for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;static websites
images
videos
frontend apps
API acceleration
global users
DDoS protection with Shield
TLS termination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CloudFront origin can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3
ALB
EC2
API Gateway
custom HTTP origin (any reachable web server)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;SRE troubleshooting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is cache serving old content?
Is origin healthy?
Is behavior path correct?
Is HTTPS certificate valid?
Is WAF blocking request?
Is DNS pointing to CloudFront?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common issue:&lt;/p&gt;

&lt;p&gt;You deploy a new frontend, but users still see the old version.&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CloudFront invalidation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;CloudFront improves performance by caching content at edge locations closer to users. As an SRE, I troubleshoot CloudFront by checking cache behavior, origin health, invalidations, certificates, WAF, and DNS.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Network Observability
&lt;/h2&gt;

&lt;p&gt;An SRE must be able to prove what is happening in the network.&lt;/p&gt;

&lt;p&gt;Important tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC Flow Logs
ALB access logs
CloudWatch metrics
CloudTrail
Route 53 query logs
WAF logs
Transit Gateway flow logs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  VPC Flow Logs
&lt;/h3&gt;

&lt;p&gt;VPC Flow Logs capture IP traffic metadata for network interfaces.&lt;/p&gt;

&lt;p&gt;They help answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Was traffic accepted or rejected?
Which source IP connected?
Which destination port?
Which ENI?
Which subnet?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use VPC Flow Logs for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;security investigation
NACL troubleshooting
SG troubleshooting
network visibility
unexpected traffic analysis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REJECT TCP 10.0.3.10 10.0.5.20 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



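&lt;p&gt;This simplified record tells you that TCP traffic from 10.0.3.10 to 10.0.5.20 on port 5432 (the PostgreSQL default) was rejected, which points to a security group or NACL rule.&lt;/p&gt;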
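&lt;p&gt;A real record in the default flow log format carries more fields than the short line above. The sketch below assumes the default version 2 field order and a made-up sample record (the account ID and ENI ID are placeholders); &lt;code&gt;parse_flow_log&lt;/code&gt; and &lt;code&gt;summarize&lt;/code&gt; are illustrative names, not part of any AWS SDK.&lt;/p&gt;

```python
# Field order of the default (version 2) VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

# IANA protocol numbers seen most often in flow logs.
PROTOCOLS = {"1": "ICMP", "6": "TCP", "17": "UDP"}


def parse_flow_log(line):
    """Split one space-separated default-format record into a dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    record = dict(zip(FIELDS, values))
    record["protocol_name"] = PROTOCOLS.get(record["protocol"], record["protocol"])
    return record


def summarize(record):
    """Render the record in the short form used in this section."""
    return "{action} {protocol_name} {srcaddr} {dstaddr} {dstport}".format(**record)


# Made-up sample record: account ID and ENI ID are placeholders.
sample = ("2 123456789012 eni-0abc1234 10.0.3.10 10.0.5.20 "
          "49152 5432 6 3 180 1714400000 1714400060 REJECT OK")
print(summarize(parse_flow_log(sample)))
```

&lt;p&gt;Keeping the parsed record as a dict makes it easy to filter on &lt;code&gt;action&lt;/code&gt; or &lt;code&gt;dstport&lt;/code&gt; when you scan many records for rejected database traffic.&lt;/p&gt;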

&lt;h3&gt;
  
  
  ALB access logs
&lt;/h3&gt;

&lt;p&gt;ALB logs show:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;client IP
request path
target status code
load balancer status code
response time
target processing time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Useful for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;502
503
504
slow requests
bad target responses
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SRE book mindset
&lt;/h3&gt;

&lt;p&gt;Google’s SRE book emphasizes that monitoring should help decide what should interrupt a human and what should not. Good monitoring is not collecting everything; good monitoring detects user-impacting issues. (&lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Google SRE&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;For network observability, I use VPC Flow Logs, ALB logs, Route 53 logs, WAF logs, and CloudWatch metrics. These help me identify whether the issue is DNS, routing, firewall, load balancer, target health, or application.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Full Production Network Architecture
&lt;/h2&gt;

&lt;p&gt;This is the architecture you must be able to explain in interviews.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Users
 ↓
Route 53
 ↓
CloudFront + WAF
 ↓
Application Load Balancer - public subnets
 ↓
Application servers / ECS / EKS - private app subnets
 ↓
RDS / ElastiCache - private DB subnets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supporting components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAT Gateway       → private outbound internet
VPC Endpoint      → private AWS service access
Transit Gateway   → multi-VPC connectivity
VPN/DX            → on-prem connectivity
VPC Flow Logs     → network observability
CloudWatch        → metrics and alarms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB goes in public subnets
App goes in private subnets
DB goes in private subnets
NAT Gateway goes in public subnet
Private servers do not get public IPs
DB is never open to internet
Use SG references instead of hardcoded IPs
Use Multi-AZ for availability
Use VPC endpoints for private AWS service access
Use WAF for internet-facing apps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  13. SRE Troubleshooting Framework
&lt;/h2&gt;

&lt;p&gt;When production is down, do not guess.&lt;/p&gt;

&lt;p&gt;Follow this order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. DNS
2. CDN / WAF
3. Load Balancer
4. Security Groups
5. Route Tables
6. NACL
7. Target Health
8. Application Logs
9. Database
10. Dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
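&lt;p&gt;The point of a fixed order is to stop at the first broken layer instead of jumping around. A minimal sketch of that idea, assuming each layer has a check that returns true or false; the lambdas here are hypothetical stand-ins for real checks such as &lt;code&gt;dig&lt;/code&gt;, &lt;code&gt;curl&lt;/code&gt;, or AWS API calls.&lt;/p&gt;

```python
def first_failing_layer(checks):
    """Run checks in order; return the name of the first layer that fails.

    checks is a list of (layer_name, callable) pairs; each callable
    returns True when the layer looks healthy.
    """
    for layer, check in checks:
        if not check():
            return layer
    return None


# Hypothetical stand-ins: real checks would call dig, curl, the AWS APIs, etc.
checks = [
    ("DNS", lambda: True),
    ("CDN / WAF", lambda: True),
    ("Load Balancer", lambda: True),
    ("Security Groups", lambda: False),  # simulated failure
    ("Route Tables", lambda: True),
]

print(first_failing_layer(checks))
```

&lt;p&gt;Layers after the first failure are not evaluated, which mirrors how you would work the list by hand.&lt;/p&gt;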



&lt;h3&gt;
  
  
  Scenario 1: Website is down
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dig app.example.com
curl &lt;span class="nt"&gt;-I&lt;/span&gt; https://app.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is Route 53 pointing to correct ALB/CloudFront?
Is certificate valid?
Is WAF blocking?
Is ALB reachable?
Are targets healthy?
Is app running?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: ALB returns 503
&lt;/h3&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No healthy targets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Target group health
Health check path
Security group from ALB to app
App port
App logs
Deployment status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: ALB returns 504
&lt;/h3&gt;

&lt;p&gt;Meaning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gateway timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App too slow?
DB slow?
Target not responding?
Timeout configuration?
Network path blocked?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 4: Private EC2 cannot access internet
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private route table has 0.0.0.0/0 → NAT Gateway
NAT Gateway is available
NAT Gateway has Elastic IP
NAT is in public subnet
Public subnet routes 0.0.0.0/0 → IGW
SG allows outbound
NACL allows ephemeral ports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 5: EC2 cannot access S3 privately
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Gateway endpoint exists?
Route table associated?
Bucket policy allows endpoint?
IAM role allows S3?
Region correct?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 6: App cannot connect to RDS
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RDS running?
Correct endpoint?
Correct port?
DB SG allows app SG?
App subnet route table has local route?
NACL allows traffic both directions?
Credentials correct?
DB max connections reached?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 7: VPC peering not working
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peering active?
CIDR non-overlapping?
Routes added both sides?
SG allows remote CIDR or SG?
NACL allows?
DNS resolution enabled?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  14. What You Must Memorize for Interview
&lt;/h2&gt;

&lt;p&gt;You must know this table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route 53        → DNS
CloudFront      → CDN / edge caching
WAF             → Layer 7 web protection
ALB             → HTTP/HTTPS load balancing
NLB             → TCP/UDP load balancing
VPC             → private network
Subnet          → network segment in one AZ
Route Table     → controls traffic direction
IGW             → internet access for public subnets
NAT Gateway     → outbound internet for private subnets
SG              → stateful resource firewall
NACL            → stateless subnet firewall
VPC Endpoint    → private access to AWS services
PrivateLink     → private service exposure
VPC Peering     → private VPC-to-VPC network connection
Transit Gateway → central router for many VPCs
VPN             → encrypted hybrid connection over internet
Direct Connect  → private dedicated hybrid connection
Flow Logs       → network traffic visibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  15. Strong Final Interview Answer
&lt;/h2&gt;

&lt;p&gt;Use this answer:&lt;/p&gt;

&lt;p&gt;I design AWS networking using layered architecture. I use Route 53 for DNS, CloudFront and WAF for edge performance and security, ALB for public application entry, private subnets for application workloads, and isolated private subnets for databases. I use NAT Gateway only for outbound internet from private subnets and VPC endpoints when private workloads need AWS service access without internet. For multi-VPC communication, I choose VPC peering for simple cases, Transit Gateway for large enterprise hub-and-spoke architecture, and PrivateLink when only one private service should be exposed. As an SRE, I troubleshoot from DNS to load balancer, route tables, security groups, NACLs, target health, logs, and application dependencies.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Observability, Reliability, and Incident Management (Production-Level)</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Wed, 29 Apr 2026 23:53:07 +0000</pubDate>
      <link>https://forem.com/jumptotech/observability-reliability-and-incident-management-production-level-3a8l</link>
      <guid>https://forem.com/jumptotech/observability-reliability-and-incident-management-production-level-3a8l</guid>
      <description>&lt;h1&gt;
  
  
  1. What SRE Actually Does (Real World)
&lt;/h1&gt;

&lt;p&gt;After networking (VPC, subnets, routing), your system is running.&lt;/p&gt;

&lt;p&gt;Now SRE responsibility starts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the system working?&lt;/li&gt;
&lt;li&gt;Is it fast?&lt;/li&gt;
&lt;li&gt;Is it reliable?&lt;/li&gt;
&lt;li&gt;Can we detect problems early?&lt;/li&gt;
&lt;li&gt;Can we recover quickly?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is called &lt;strong&gt;reliability engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A simple way to think:&lt;/p&gt;

&lt;p&gt;DevOps builds the system&lt;br&gt;
SRE keeps it alive under stress&lt;/p&gt;




&lt;h1&gt;
  
  
  2. Observability — Deep Explanation
&lt;/h1&gt;

&lt;p&gt;Observability is not just “monitoring.”&lt;br&gt;
Monitoring tells you &lt;strong&gt;something is wrong&lt;/strong&gt;&lt;br&gt;
Observability tells you &lt;strong&gt;why it is wrong&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability is commonly defined as:&lt;/p&gt;

&lt;p&gt;The ability to understand a system's internal state from its external outputs&lt;/p&gt;

&lt;p&gt;These outputs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;metrics&lt;/li&gt;
&lt;li&gt;logs&lt;/li&gt;
&lt;li&gt;traces&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2.1 Metrics (Deep)
&lt;/h2&gt;

&lt;p&gt;Metrics are &lt;strong&gt;numerical time-series data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage = 70%&lt;/li&gt;
&lt;li&gt;requests/sec = 200&lt;/li&gt;
&lt;li&gt;error rate = 5%&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why we use metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;detect anomalies&lt;/li&gt;
&lt;li&gt;trigger alerts&lt;/li&gt;
&lt;li&gt;track performance trends&lt;/li&gt;
&lt;li&gt;capacity planning&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Tool: Amazon CloudWatch
&lt;/h3&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;collects metrics from AWS services (EC2, ALB, RDS)&lt;/li&gt;
&lt;li&gt;stores time-series data&lt;/li&gt;
&lt;li&gt;creates alarms&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How we use it (real scenario)
&lt;/h3&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;You deploy an application on EC2&lt;/p&gt;

&lt;p&gt;CloudWatch automatically gives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPUUtilization&lt;/li&gt;
&lt;li&gt;NetworkIn/Out&lt;/li&gt;
&lt;li&gt;DiskReadOps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you create an alarm:&lt;/p&gt;

&lt;p&gt;IF CPU &amp;gt; 80% for 5 minutes → trigger alert&lt;/p&gt;
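&lt;p&gt;The alarm rule above can be sketched as a small function. This is a simplified model that assumes one datapoint per minute and requires every one of the last five to breach the threshold; &lt;code&gt;alarm_state&lt;/code&gt; is an illustrative name, and only the three state names are borrowed from CloudWatch.&lt;/p&gt;

```python
def alarm_state(datapoints, threshold=80.0, periods=5):
    """Return "ALARM" when the last `periods` datapoints all exceed threshold.

    datapoints: one average CPU percentage per minute, oldest first.
    """
    recent = datapoints[-periods:]
    if len(recent) != periods:
        return "INSUFFICIENT_DATA"
    if all(value > threshold for value in recent):
        return "ALARM"
    return "OK"


print(alarm_state([40, 85, 90, 88, 92, 95]))   # last five all above 80
print(alarm_state([85, 90, 70, 92, 95, 96]))   # a dip to 70 breaks the streak
print(alarm_state([85, 90]))                   # not enough data yet
```

&lt;p&gt;Requiring several consecutive breaches is what keeps a single short spike from paging anyone.&lt;/p&gt;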




&lt;h3&gt;
  
  
  When to use CloudWatch
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS native monitoring&lt;/li&gt;
&lt;li&gt;quick setup&lt;/li&gt;
&lt;li&gt;infrastructure-level metrics&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Limitations (SRE thinking)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;custom application metrics take extra work (you publish them yourself through the API or the agent)&lt;/li&gt;
&lt;li&gt;limited visualization compared to Prometheus + Grafana&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2.2 Metrics Tool (Advanced): Prometheus
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What it does
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;pulls metrics from applications&lt;/li&gt;
&lt;li&gt;stores time-series data&lt;/li&gt;
&lt;li&gt;supports powerful queries (PromQL)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why SRE prefers Prometheus
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;better for microservices&lt;/li&gt;
&lt;li&gt;supports custom metrics&lt;/li&gt;
&lt;li&gt;integrates with Kubernetes&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  How we use it
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Application exposes metrics endpoint:&lt;br&gt;
&lt;code&gt;/metrics&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prometheus scrapes it&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You query:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;request latency&lt;/li&gt;
&lt;li&gt;error rates&lt;/li&gt;
&lt;li&gt;DB connections&lt;/li&gt;
&lt;/ul&gt;
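&lt;p&gt;What Prometheus scrapes from &lt;code&gt;/metrics&lt;/code&gt; is plain text. A minimal sketch of that text format for counters, written by hand for illustration; the metric name &lt;code&gt;http_requests_total&lt;/code&gt; and the helper &lt;code&gt;render_metrics&lt;/code&gt; are assumptions, and a real service would normally use an official client library instead.&lt;/p&gt;

```python
def render_metrics(counters):
    """Render counters in the Prometheus text exposition format.

    counters maps (metric_name, help_text) to {label_tuple: value},
    where label_tuple is a tuple of (label, value) pairs.
    """
    lines = []
    for (name, help_text), samples in counters.items():
        lines.append("# HELP %s %s" % (name, help_text))
        lines.append("# TYPE %s counter" % name)
        for labels, value in samples.items():
            label_text = ",".join('%s="%s"' % pair for pair in labels)
            lines.append("%s{%s} %s" % (name, label_text, value))
    return "\n".join(lines) + "\n"


# Hypothetical request counters, split by method and status code.
counters = {
    ("http_requests_total", "Total HTTP requests handled."): {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "GET"), ("status", "500")): 3,
    },
}
print(render_metrics(counters), end="")
```

&lt;p&gt;Error rate is then a query over these two series rather than something the application computes itself.&lt;/p&gt;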




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;You detect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;high latency&lt;/li&gt;
&lt;li&gt;normal CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ issue is NOT infrastructure&lt;br&gt;
→ issue is application&lt;/p&gt;




&lt;h3&gt;
  
  
  Troubleshooting using metrics
&lt;/h3&gt;

&lt;p&gt;Case:&lt;/p&gt;

&lt;p&gt;Website slow&lt;/p&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU high → scaling issue&lt;/li&gt;
&lt;li&gt;latency high → app issue&lt;/li&gt;
&lt;li&gt;error rate high → bug or DB problem&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2.3 Logs (Deep)
&lt;/h2&gt;

&lt;p&gt;Logs are &lt;strong&gt;detailed events&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;user login failed&lt;/li&gt;
&lt;li&gt;DB connection error&lt;/li&gt;
&lt;li&gt;API returned 500&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Tool: ELK Stack
&lt;/h3&gt;

&lt;p&gt;Components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elasticsearch → storage&lt;/li&gt;
&lt;li&gt;Logstash → processing&lt;/li&gt;
&lt;li&gt;Kibana → visualization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why logs are critical
&lt;/h3&gt;

&lt;p&gt;Metrics say:&lt;br&gt;
“Error rate increased”&lt;/p&gt;

&lt;p&gt;Logs say:&lt;br&gt;
“Database connection timeout”&lt;/p&gt;




&lt;h3&gt;
  
  
  How we use logs
&lt;/h3&gt;

&lt;p&gt;Good logs must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timestamp&lt;/li&gt;
&lt;li&gt;service name&lt;/li&gt;
&lt;li&gt;log level (INFO, ERROR)&lt;/li&gt;
&lt;li&gt;request ID&lt;/li&gt;
&lt;/ul&gt;
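&lt;p&gt;One common way to guarantee those four fields is to emit each log line as JSON. A minimal sketch using only the standard library; the service name &lt;code&gt;checkout-api&lt;/code&gt;, the request ID, and the helper &lt;code&gt;log_line&lt;/code&gt; are made up for illustration.&lt;/p&gt;

```python
import json
from datetime import datetime, timezone


def log_line(service, level, message, request_id, now=None):
    """Build one JSON log line carrying the four fields listed above."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "timestamp": now.isoformat(),
        "service": service,
        "level": level,
        "request_id": request_id,
        "message": message,
    })


print(log_line("checkout-api", "ERROR", "DB connection timeout", "req-7f3a"))
```

&lt;p&gt;With a request ID on every line, you can pull all log entries for one failing request across services.&lt;/p&gt;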




&lt;h3&gt;
  
  
  AWS logging: CloudWatch Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;collects logs from EC2, Lambda&lt;/li&gt;
&lt;li&gt;integrates with CloudWatch metrics&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Troubleshooting using logs
&lt;/h3&gt;

&lt;p&gt;Case:&lt;/p&gt;

&lt;p&gt;App returns 500&lt;/p&gt;

&lt;p&gt;Steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;check logs&lt;/li&gt;
&lt;li&gt;find error message&lt;/li&gt;
&lt;li&gt;identify root cause&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
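&lt;li&gt;“connection refused” → likely the DB is down or not listening on that port&lt;/li&gt;
&lt;li&gt;“timeout” → likely a network issue (SG, NACL, routing) or an overloaded dependency&lt;/li&gt;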
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2.4 Tracing (Deep)
&lt;/h2&gt;

&lt;p&gt;Tracing follows a single request across services&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;User request path:&lt;/p&gt;

&lt;p&gt;User → ALB → API → Service → DB&lt;/p&gt;




&lt;h3&gt;
  
  
  Tool: AWS X-Ray
&lt;/h3&gt;




&lt;h3&gt;
  
  
  Why tracing matters
&lt;/h3&gt;

&lt;p&gt;In microservices:&lt;/p&gt;

&lt;p&gt;You don’t know where latency happens&lt;/p&gt;

&lt;p&gt;Tracing shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which service is slow&lt;/li&gt;
&lt;li&gt;where failure occurs&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Request takes 3 seconds&lt;/p&gt;

&lt;p&gt;Tracing shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API: 50ms&lt;/li&gt;
&lt;li&gt;Service: 100ms&lt;/li&gt;
&lt;li&gt;DB: 2.8s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;→ problem is DB&lt;/p&gt;
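&lt;p&gt;The same reasoning in code: given per-service durations from a trace, pick the largest one. This is a sketch over a plain dict, not the AWS X-Ray API; &lt;code&gt;slowest_span&lt;/code&gt; is an illustrative name.&lt;/p&gt;

```python
def slowest_span(spans):
    """Return (name, duration_ms, share) for the longest span in a trace."""
    total = sum(spans.values())
    name = max(spans, key=spans.get)
    return name, spans[name], spans[name] / total


# Durations from the example above, in milliseconds.
spans = {"API": 50, "Service": 100, "DB": 2800}
name, duration_ms, share = slowest_span(spans)
print("%s took %d ms (%.0f%% of traced time)" % (name, duration_ms, share * 100))
```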




&lt;h1&gt;
  
  
  3. SLI, SLO, SLA (Deep Understanding)
&lt;/h1&gt;




&lt;h2&gt;
  
  
  SLI (Indicator)
&lt;/h2&gt;

&lt;p&gt;What you measure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;uptime&lt;/li&gt;
&lt;li&gt;latency&lt;/li&gt;
&lt;li&gt;error rate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  SLO (Objective)
&lt;/h2&gt;

&lt;p&gt;Target:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99.9% uptime&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  SLA (Agreement)
&lt;/h2&gt;

&lt;p&gt;Legal commitment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if broken → compensation&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Why SRE uses this
&lt;/h3&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;p&gt;Without SLO → no reliability target&lt;br&gt;
Without SLI → no measurement&lt;/p&gt;
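&lt;p&gt;To make the relationship concrete: the SLI is a number you compute, and the SLO is the target you compare it against. A minimal sketch, assuming a request-based availability SLI and made-up request counts; both function names are illustrative.&lt;/p&gt;

```python
def availability_sli(total_requests, failed_requests):
    """Fraction of requests that succeeded, as a percentage."""
    if total_requests == 0:
        raise ValueError("no requests in the measurement window")
    good = total_requests - failed_requests
    return 100.0 * good / total_requests


def meets_slo(sli_percent, slo_percent=99.9):
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent


# Made-up traffic numbers for one measurement window.
sli = availability_sli(total_requests=1_000_000, failed_requests=700)
print(round(sli, 3), meets_slo(sli))
```

&lt;p&gt;An SLA would sit on top of this as a contractual threshold, usually looser than the internal SLO.&lt;/p&gt;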




&lt;h1&gt;
  
  
  4. Alerting (Real SRE Thinking)
&lt;/h1&gt;

&lt;p&gt;Alerting is where most teams fail&lt;/p&gt;




&lt;h2&gt;
  
  
  Bad alerts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;CPU 80%&lt;/li&gt;
&lt;li&gt;disk usage 70%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These create noise&lt;/p&gt;




&lt;h2&gt;
  
  
  Good alerts
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;user cannot login&lt;/li&gt;
&lt;li&gt;API error rate &amp;gt; 5%&lt;/li&gt;
&lt;li&gt;latency &amp;gt; threshold&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Tool: Prometheus Alertmanager
&lt;/h3&gt;




&lt;h3&gt;
  
  
  How alert works
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;metric collected&lt;/li&gt;
&lt;li&gt;condition evaluated&lt;/li&gt;
&lt;li&gt;alert fired&lt;/li&gt;
&lt;li&gt;notification sent (Slack, email)&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  SRE rule
&lt;/h3&gt;

&lt;p&gt;Alert on &lt;strong&gt;user impact&lt;/strong&gt;, not infrastructure&lt;/p&gt;




&lt;h1&gt;
  
  
  5. Incident Management (Production Flow)
&lt;/h1&gt;

&lt;p&gt;Incident = service disruption&lt;/p&gt;




&lt;h2&gt;
  
  
  Real steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;detection (monitoring)&lt;/li&gt;
&lt;li&gt;alert triggered&lt;/li&gt;
&lt;li&gt;engineer responds&lt;/li&gt;
&lt;li&gt;mitigation (temporary fix)&lt;/li&gt;
&lt;li&gt;resolution (root cause fix)&lt;/li&gt;
&lt;li&gt;postmortem&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Issue:&lt;br&gt;
Website down&lt;/p&gt;

&lt;p&gt;Actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restart service (mitigation)&lt;/li&gt;
&lt;li&gt;fix DB connection (resolution)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  6. Postmortem (Critical SRE Practice)
&lt;/h1&gt;

&lt;p&gt;After incident:&lt;/p&gt;

&lt;p&gt;You must answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what happened?&lt;/li&gt;
&lt;li&gt;why?&lt;/li&gt;
&lt;li&gt;how to prevent?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Rule
&lt;/h3&gt;

&lt;p&gt;No blame&lt;/p&gt;

&lt;p&gt;Focus on system failure, not people&lt;/p&gt;




&lt;h1&gt;
  
  
  7. Error Budget (Advanced Concept)
&lt;/h1&gt;

&lt;p&gt;If SLO = 99.9%&lt;/p&gt;

&lt;p&gt;Allowed downtime:&lt;/p&gt;

&lt;p&gt;≈ 43 minutes/month&lt;/p&gt;
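&lt;p&gt;The 43 minutes come from taking 0.1% of a 30-day month (43,200 minutes). A small sketch of that arithmetic, assuming a 30-day window and treating all downtime as full outage; the function names are illustrative.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of full downtime allowed by an availability SLO."""
    window_minutes = window_days * 24 * 60
    return round(window_minutes * (100 - slo_percent) / 100, 1)


def budget_remaining(slo_percent, downtime_minutes, window_days=30):
    """Minutes of budget left after the downtime already spent."""
    return round(error_budget_minutes(slo_percent, window_days) - downtime_minutes, 1)


print(error_budget_minutes(99.9))     # 30-day month
print(budget_remaining(99.9, 30))     # after a 30-minute outage
```

&lt;p&gt;When the remaining budget gets close to zero, the usual response is to slow down risky releases until the window rolls over.&lt;/p&gt;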




&lt;h3&gt;
  
  
  Why important
&lt;/h3&gt;

&lt;p&gt;Balance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;innovation (deploy fast)&lt;/li&gt;
&lt;li&gt;stability (avoid downtime)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  8. High Availability (HA)
&lt;/h1&gt;

&lt;p&gt;System must survive failure&lt;/p&gt;




&lt;h3&gt;
  
  
  AWS tools
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Elastic Load Balancer&lt;/li&gt;
&lt;li&gt;Multi-AZ deployment&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;If one AZ fails:&lt;/p&gt;

&lt;p&gt;Traffic shifts to another AZ&lt;/p&gt;




&lt;h1&gt;
  
  
  9. Auto Scaling (Reliability + Cost)
&lt;/h1&gt;

&lt;p&gt;Automatically adjust capacity&lt;/p&gt;




&lt;h3&gt;
  
  
  AWS service
&lt;/h3&gt;

&lt;p&gt;Auto Scaling Group&lt;/p&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Traffic spike:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;add EC2 instances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traffic drop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;remove instances&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  10. Health Checks
&lt;/h1&gt;

&lt;p&gt;Check system status&lt;/p&gt;




&lt;h3&gt;
  
  
  In Kubernetes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;readiness probe → ready to serve&lt;/li&gt;
&lt;li&gt;liveness probe → alive&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Tool: Kubernetes
&lt;/h3&gt;




&lt;h3&gt;
  
  
  Why important
&lt;/h3&gt;

&lt;p&gt;Without health checks:&lt;/p&gt;

&lt;p&gt;Load balancer sends traffic to broken app&lt;/p&gt;




&lt;h1&gt;
  
  
  11. Caching (Performance)
&lt;/h1&gt;

&lt;p&gt;Store frequently used data&lt;/p&gt;




&lt;h3&gt;
  
  
  Tool: Redis
&lt;/h3&gt;




&lt;h3&gt;
  
  
  Why use caching
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;reduce DB load&lt;/li&gt;
&lt;li&gt;faster response&lt;/li&gt;
&lt;/ul&gt;
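&lt;p&gt;The usual pattern is cache-aside: read from the cache first and only hit the database on a miss. The sketch below uses an in-process dict with a TTL as a stand-in for Redis, so it runs without a server; &lt;code&gt;TTLCache&lt;/code&gt; and &lt;code&gt;load_user&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import time


class TTLCache:
    """Tiny in-process cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}

    def get_or_load(self, key, loader):
        """Cache-aside read: serve from cache, else load and remember."""
        entry = self.store.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry[0]
        value = loader(key)
        self.store[key] = (value, self.clock() + self.ttl)
        return value


db_calls = []


def load_user(user_id):
    """Hypothetical slow database lookup; records each call."""
    db_calls.append(user_id)
    return {"id": user_id, "name": "user-%s" % user_id}


cache = TTLCache(ttl_seconds=60)
cache.get_or_load(42, load_user)
cache.get_or_load(42, load_user)   # second read is served from the cache
print(len(db_calls))
```

&lt;p&gt;The TTL bounds how stale a cached value can get, which is the trade-off you accept for the lower DB load.&lt;/p&gt;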




&lt;h1&gt;
  
  
  12. Disaster Recovery
&lt;/h1&gt;

&lt;p&gt;Plan for failure&lt;/p&gt;




&lt;h3&gt;
  
  
  Strategies
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;backup and restore&lt;/li&gt;
&lt;li&gt;multi-region&lt;/li&gt;
&lt;li&gt;active-active&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  13. Troubleshooting Mindset (MOST IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;When something breaks:&lt;/p&gt;

&lt;p&gt;DO NOT GUESS&lt;/p&gt;

&lt;p&gt;Follow layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DNS&lt;/li&gt;
&lt;li&gt;Network&lt;/li&gt;
&lt;li&gt;Load balancer&lt;/li&gt;
&lt;li&gt;App&lt;/li&gt;
&lt;li&gt;DB&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;App not working&lt;/p&gt;

&lt;p&gt;Check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNS resolves?&lt;/li&gt;
&lt;li&gt;ALB healthy?&lt;/li&gt;
&lt;li&gt;EC2 running?&lt;/li&gt;
&lt;li&gt;logs show error?&lt;/li&gt;
&lt;li&gt;DB reachable?&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  FINAL SRE INTERVIEW ANSWER
&lt;/h1&gt;

&lt;p&gt;You say:&lt;/p&gt;

&lt;p&gt;As an SRE, I focus on system reliability by implementing observability using metrics, logs, and traces, defining SLOs, setting up meaningful alerts, ensuring high availability with load balancing and auto scaling, and handling incidents with structured troubleshooting and postmortems.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>VPC, subnets, IGW, NAT, routing, firewall, DMZ, private DB, and troubleshooting part #1</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Wed, 29 Apr 2026 01:15:42 +0000</pubDate>
      <link>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-24og</link>
      <guid>https://forem.com/jumptotech/vpc-subnets-igw-nat-routing-firewall-dmz-private-db-and-troubleshooting-24og</guid>
      <description>&lt;h1&gt;
  
  
  🔥 LAB GOAL (PRODUCTION STYLE)
&lt;/h1&gt;

&lt;p&gt;You will build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   ↓
Load Balancer (DMZ / Public)
   ↓
Web Server (Private)
   ↓
Database (Private)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public subnets (DMZ)&lt;/li&gt;
&lt;li&gt;Private subnets (App + DB)&lt;/li&gt;
&lt;li&gt;NAT Gateway&lt;/li&gt;
&lt;li&gt;Security Groups (firewall)&lt;/li&gt;
&lt;li&gt;Route tables (routing)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 0 — WHAT YOU MUST HAVE
&lt;/h1&gt;

&lt;p&gt;Already created:&lt;/p&gt;

&lt;p&gt;✔ VPC&lt;br&gt;
✔ 2 Public subnets&lt;br&gt;
✔ 2 Private subnets&lt;br&gt;
✔ Internet Gateway&lt;br&gt;
✔ NAT Gateway&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 1 — FIX ROUTING (VERY IMPORTANT)
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Public Route Table
&lt;/h2&gt;

&lt;p&gt;Go to VPC → Route Tables → Public RT&lt;/p&gt;

&lt;p&gt;Make sure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Associate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Subnet 1&lt;/li&gt;
&lt;li&gt;Public Subnet 2&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Private Route Table
&lt;/h2&gt;

&lt;p&gt;Make sure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Associate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private Subnet 1&lt;/li&gt;
&lt;li&gt;Private Subnet 2&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Public = internet access&lt;/li&gt;
&lt;li&gt;Private = outbound only&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  🚀 STEP 2 — CREATE SECURITY GROUPS (FIREWALL DESIGN)
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Load Balancer SG (&lt;code&gt;alb-sg&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Allow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 → 0.0.0.0/0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. Web Server SG (&lt;code&gt;web-sg&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Allow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP 80 → alb-sg
SSH 22 → your IP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Database SG (&lt;code&gt;db-sg&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Allow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MySQL 3306 → web-sg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Internet → only ALB&lt;/li&gt;
&lt;li&gt;ALB → Web&lt;/li&gt;
&lt;li&gt;Web → DB&lt;/li&gt;
&lt;li&gt;Users CANNOT access DB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 This is real firewall architecture&lt;/p&gt;
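&lt;p&gt;The chain of security group references can be modeled in a few lines to check who can reach what. This is a simplified sketch, not how AWS evaluates rules internally: it covers only the HTTP and MySQL inbound rules from this step, leaves out SSH, and &lt;code&gt;allowed&lt;/code&gt; is an illustrative helper.&lt;/p&gt;

```python
# Inbound rules from this step: SG name maps to a list of (port, source).
# A source is either a CIDR string or the name of another security group.
RULES = {
    "alb-sg": [(80, "0.0.0.0/0")],
    "web-sg": [(80, "alb-sg")],
    "db-sg": [(3306, "web-sg")],
}


def allowed(source, dest_sg, port, rules=RULES):
    """True if dest_sg has an inbound rule for this port and source.

    source is "internet" or the name of the caller's security group.
    """
    for rule_port, rule_source in rules.get(dest_sg, []):
        if rule_port != port:
            continue
        if rule_source == "0.0.0.0/0" or rule_source == source:
            return True
    return False


print(allowed("internet", "alb-sg", 80))    # users reach the ALB
print(allowed("alb-sg", "web-sg", 80))      # ALB reaches the web tier
print(allowed("web-sg", "db-sg", 3306))     # web tier reaches the DB
print(allowed("internet", "db-sg", 3306))   # users cannot reach the DB
```

&lt;p&gt;Referencing a security group as the source, instead of an IP range, is what lets the web tier scale without any rule changes on the database side.&lt;/p&gt;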




&lt;h1&gt;
  
  
  🚀 STEP 3 — CREATE LOAD BALANCER (DMZ)
&lt;/h1&gt;

&lt;p&gt;Use:&lt;br&gt;
Application Load Balancer&lt;/p&gt;
&lt;h3&gt;
  
  
  Where:
&lt;/h3&gt;

&lt;p&gt;EC2 → Load Balancers → Create&lt;/p&gt;


&lt;h3&gt;
  
  
  Config:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Type: Application LB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scheme: Internet-facing&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Subnets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public Subnet 1&lt;/li&gt;
&lt;li&gt;Public Subnet 2&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Security Group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;alb-sg&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 Entry point for users&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 4 — CREATE WEB SERVERS (PRIVATE)
&lt;/h1&gt;

&lt;p&gt;Launch 2 EC2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Subnet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;private-subnet-1&lt;/li&gt;
&lt;li&gt;private-subnet-2&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Security Group:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;web-sg&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;NO public IP&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Install nginx:
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;nginx &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Customize page:
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Hello from Web Server 1"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /var/www/html/index.html
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 Private app servers running&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 5 — CONNECT ALB → WEB
&lt;/h1&gt;
&lt;h3&gt;
  
  
  Create Target Group:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Type: Instance&lt;/li&gt;
&lt;li&gt;Port: 80&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Add both EC2 instances&lt;/p&gt;



&lt;p&gt;Attach to Load Balancer&lt;/p&gt;


&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 ALB sends traffic to web servers&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 6 — TEST
&lt;/h1&gt;

&lt;p&gt;Open:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://&amp;lt;ALB-DNS&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 You see your web page&lt;/p&gt;

&lt;p&gt;Refresh:&lt;br&gt;
👉 The response alternates between servers (visible only if each server's page has different text)&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 7 — CREATE DATABASE (SIMULATION)
&lt;/h1&gt;

&lt;p&gt;You can use EC2 or:&lt;br&gt;
Amazon RDS&lt;/p&gt;


&lt;h3&gt;
  
  
  For simple lab (EC2 DB):
&lt;/h3&gt;

&lt;p&gt;Launch EC2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subnet: private-subnet-1&lt;/li&gt;
&lt;li&gt;SG: &lt;code&gt;db-sg&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 Private DB server&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 8 — TEST NETWORK SECURITY
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Try:
&lt;/h2&gt;

&lt;p&gt;From your laptop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access DB → ❌ FAIL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From web EC2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect DB → ✔ WORK&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;👉 This proves the firewall (security group) is working&lt;/p&gt;


&lt;h1&gt;
  
  
  🚀 STEP 9 — TEST NAT (VERY IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;SSH into web EC2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping google.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  ✔ Result:
&lt;/h3&gt;

&lt;p&gt;👉 Works → NAT is correct&lt;/p&gt;




&lt;h1&gt;
  
  
  🚀 STEP 10 — BREAK &amp;amp; DEBUG (SRE LEVEL)
&lt;/h1&gt;

&lt;p&gt;Now simulate failures:&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenario 1 — Remove NAT route
&lt;/h2&gt;

&lt;p&gt;👉 Private EC2 cannot reach internet&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
👉 Add NAT route back&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenario 2 — Remove SG rule (web → db)
&lt;/h2&gt;

&lt;p&gt;👉 App cannot reach DB&lt;/p&gt;

&lt;p&gt;Fix:&lt;br&gt;
👉 Add rule back&lt;/p&gt;




&lt;h2&gt;
  
  
  Scenario 3 — Stop one EC2
&lt;/h2&gt;

&lt;p&gt;👉 App still works: the ALB health check marks the stopped instance unhealthy and sends traffic to the other one&lt;/p&gt;




&lt;p&gt;👉 This is &lt;strong&gt;real SRE behavior&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  🔥 WHAT YOU JUST LEARNED
&lt;/h1&gt;

&lt;p&gt;You implemented:&lt;/p&gt;

&lt;p&gt;✔ VPC design&lt;br&gt;
✔ Subnet segmentation (DMZ / Private)&lt;br&gt;
✔ Routing (IGW + NAT)&lt;br&gt;
✔ Firewall (SG)&lt;br&gt;
✔ Load balancing&lt;br&gt;
✔ Secure DB access&lt;br&gt;
✔ Failure testing&lt;/p&gt;




&lt;h1&gt;
  
  
  💬 INTERVIEW ANSWER
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;I built a multi-tier architecture in AWS with public and private subnets, configured routing using Internet Gateway and NAT Gateway, secured communication using security groups, deployed web servers behind an Application Load Balancer, and validated failover and connectivity through testing scenarios.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
    </item>
    <item>
      <title>AWS Networking Full Lecture for DevOps/SRE</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Wed, 29 Apr 2026 01:12:48 +0000</pubDate>
      <link>https://forem.com/jumptotech/aws-networking-full-lecture-for-devopssre-3ce8</link>
      <guid>https://forem.com/jumptotech/aws-networking-full-lecture-for-devopssre-3ce8</guid>
      <description>&lt;p&gt;AWS networking starts with one main idea: we need to build a secure private network where some resources are public, some are private, traffic flows correctly, and we can troubleshoot when something breaks.&lt;/p&gt;

&lt;p&gt;In traditional networking, you use switches, routers, VLANs, firewalls, and a DMZ. In AWS, the same ideas exist, but they are software-defined. Instead of physical routers and switches, we use VPCs, subnets, route tables, internet gateways, NAT gateways, security groups, network ACLs, VPC endpoints, and VPC peering.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. VPC
&lt;/h2&gt;

&lt;p&gt;A VPC, or Virtual Private Cloud, is your private network inside AWS. It is logically isolated from other customers. When you create a VPC, you choose a CIDR block, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



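&lt;p&gt;This means your AWS network uses private IP addresses from &lt;code&gt;10.0.0.0&lt;/code&gt; to &lt;code&gt;10.0.255.255&lt;/code&gt; (the &lt;code&gt;10.0.x.x&lt;/code&gt; range), which is 65,536 addresses.&lt;/p&gt;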
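&lt;p&gt;To see what a CIDR block covers, you can inspect it with Python's standard &lt;code&gt;ipaddress&lt;/code&gt; module. This is only a local sketch for checking the math; it does not talk to AWS, and the two sample IPs are made up for illustration.&lt;/p&gt;

```python
import ipaddress

# The VPC CIDR block used in this lecture.
vpc = ipaddress.ip_network("10.0.0.0/16")

print("First address:", vpc.network_address)    # 10.0.0.0
print("Last address: ", vpc.broadcast_address)  # 10.0.255.255
print("Total addresses:", vpc.num_addresses)    # 65536

# Membership test: is a given private IP inside the VPC range?
# Both sample addresses below are illustrative only.
print(ipaddress.ip_address("10.0.3.10") in vpc)  # True
print(ipaddress.ip_address("10.1.0.5") in vpc)   # False
```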

&lt;p&gt;Think of VPC like your company building. Inside that building, you create rooms. Those rooms are subnets.&lt;/p&gt;

&lt;p&gt;We use VPC because we need control over:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IP ranges
subnets
routing
security
internet access
private communication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In interviews, say:&lt;/p&gt;

&lt;p&gt;A VPC is a logically isolated network in AWS where we define IP ranges, subnets, routing, and security rules.&lt;/p&gt;

&lt;p&gt;AWS route tables control where traffic goes inside the VPC, and each subnet must be associated with a route table. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-route-tables.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Subnet
&lt;/h2&gt;

&lt;p&gt;A subnet is a smaller network inside the VPC.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC: 10.0.0.0/16

Public subnet 1:  10.0.1.0/24
Public subnet 2:  10.0.2.0/24
Private subnet 1: 10.0.3.0/24
Private subnet 2: 10.0.4.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A subnet belongs to one Availability Zone.&lt;/p&gt;

&lt;p&gt;We create multiple subnets because we want separation and high availability.&lt;/p&gt;
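&lt;p&gt;Before creating subnets, it can help to check the plan locally. The sketch below uses the example CIDRs from this section; the subnet names and the helper functions are hypothetical. It also accounts for the fact that AWS reserves 5 addresses in every subnet, so a &lt;code&gt;/24&lt;/code&gt; gives 251 usable addresses, not 256.&lt;/p&gt;

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# Subnet plan from the example above (names are illustrative).
subnets = {
    "public-1": ipaddress.ip_network("10.0.1.0/24"),
    "public-2": ipaddress.ip_network("10.0.2.0/24"),
    "private-1": ipaddress.ip_network("10.0.3.0/24"),
    "private-2": ipaddress.ip_network("10.0.4.0/24"),
}

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_hosts(subnet):
    """Number of addresses you can actually assign in an AWS subnet."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def find_overlaps(networks):
    """Return pairs of subnet names whose CIDR ranges overlap."""
    names = sorted(networks)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if networks[a].overlaps(networks[b])
    ]


for name, net in subnets.items():
    print(name, net, "inside VPC:", net.subnet_of(vpc), "usable:", usable_hosts(net))

print("Overlaps:", find_overlaps(subnets))
```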

&lt;p&gt;A public subnet is for resources that need a direct path to the internet, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application Load Balancer
Bastion host
NAT Gateway
Public web server for testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A private subnet is for resources that should not be directly accessible from the internet, like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application servers
Databases
Internal APIs
Backend services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the AWS version of your Packet Tracer segmentation.&lt;/p&gt;

&lt;p&gt;Packet Tracer VLAN = AWS subnet.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Public Subnet vs Private Subnet
&lt;/h2&gt;

&lt;p&gt;A subnet is not automatically public or private because of its name. It becomes public or private based on its route table.&lt;/p&gt;

&lt;p&gt;A public subnet has this route:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A private subnet does not route directly to the Internet Gateway. Usually it has:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the real difference is routing.&lt;/p&gt;

&lt;p&gt;Public subnet means resources can communicate with the internet if they also have a public IP.&lt;/p&gt;

&lt;p&gt;Private subnet means resources cannot be reached directly from the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Route Table
&lt;/h2&gt;

&lt;p&gt;A route table is like the traffic controller for your VPC. AWS documentation describes it as rules that determine where traffic from your subnet or gateway is directed. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Example public route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → local
0.0.0.0/0  → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;local&lt;/code&gt; route allows resources inside the VPC to communicate with each other.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;0.0.0.0/0&lt;/code&gt; route is the default route: any IPv4 traffic that does not match a more specific route, usually internet traffic, goes to the Internet Gateway.&lt;/p&gt;

&lt;p&gt;Example private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → local
0.0.0.0/0  → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means private servers can talk inside the VPC and can go out to the internet through NAT, but the internet cannot start a connection back to them.&lt;/p&gt;
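&lt;p&gt;The way a route table picks a target can be sketched in a few lines: the most specific matching route (longest prefix) wins. The code below is a simplified local model of that rule, not the AWS implementation, and the target names and the &lt;code&gt;lookup&lt;/code&gt; helper are hypothetical.&lt;/p&gt;

```python
import ipaddress

# Route tables from the examples above, written as (destination, target).
PUBLIC_RT = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "internet-gateway")]
PRIVATE_RT = [("10.0.0.0/16", "local"), ("0.0.0.0/0", "nat-gateway")]


def lookup(route_table, destination_ip):
    """Return the target of the most specific matching route, or None."""
    ip = ipaddress.ip_address(destination_ip)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in route_table
        if ip in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None  # no matching route: the traffic has nowhere to go
    # Longest prefix wins, so the /16 local route beats the /0 default route.
    best = max(matches, key=lambda match: match[0].prefixlen)
    return best[1]


print(lookup(PRIVATE_RT, "10.0.4.25"))  # local
print(lookup(PRIVATE_RT, "8.8.8.8"))    # nat-gateway
print(lookup(PUBLIC_RT, "8.8.8.8"))     # internet-gateway
```

&lt;p&gt;If you delete the &lt;code&gt;0.0.0.0/0&lt;/code&gt; entry from the private table in this model, the lookup for an internet address returns &lt;code&gt;None&lt;/code&gt;, which mirrors a private EC2 losing internet access when the NAT route is removed.&lt;/p&gt;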

&lt;p&gt;Troubleshooting route tables:&lt;/p&gt;

&lt;p&gt;If public EC2 is not accessible, check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Does the subnet route table have 0.0.0.0/0 → Internet Gateway?
Does the EC2 have public IP?
Does the security group allow traffic?
Is the instance running?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If private EC2 cannot reach the internet, check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Does private route table have 0.0.0.0/0 → NAT Gateway?
Is NAT Gateway available?
Is NAT Gateway in public subnet?
Does public subnet have route to Internet Gateway?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Internet Gateway
&lt;/h2&gt;

&lt;p&gt;An Internet Gateway allows communication between your VPC and the internet.&lt;/p&gt;

&lt;p&gt;But just attaching an Internet Gateway is not enough. You must also update the public route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS documentation says public subnet route tables can use Internet Gateway as the target for traffic going to destinations not explicitly known, such as &lt;code&gt;0.0.0.0/0&lt;/code&gt;. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Internet_Gateway.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;We use Internet Gateway when we want public resources, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Load balancer
Public web server
Bastion host
NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correct design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   ↓
Internet Gateway
   ↓
Public Route Table
   ↓
Public Subnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Common issue:&lt;/p&gt;

&lt;p&gt;People create an Internet Gateway but forget to associate the public subnet with the public route table. Then EC2 will not be reachable.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. NAT Gateway
&lt;/h2&gt;

&lt;p&gt;NAT Gateway is used for private subnet resources that need outbound internet access but should not be reachable from the internet.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Your private EC2 needs to run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;nginx
docker pull &amp;lt;image&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It needs internet access. But you do not want the internet to SSH into it.&lt;/p&gt;

&lt;p&gt;That is why we use NAT Gateway.&lt;/p&gt;

&lt;p&gt;AWS describes NAT Gateway as a NAT service that lets instances in private subnets connect to services outside the VPC while external services cannot initiate connections to those private instances. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Correct NAT design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2
   ↓
Private Route Table
   ↓
NAT Gateway in Public Subnet
   ↓
Internet Gateway
   ↓
Internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important rule:&lt;/p&gt;

&lt;p&gt;A NAT Gateway used for internet access (a public NAT Gateway) must be in a public subnet.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because the NAT Gateway itself reaches the internet through the Internet Gateway.&lt;/p&gt;

&lt;p&gt;Private route table should have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Public route table should have:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0.0.0.0/0 → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NAT Gateway is zonal. For production, best practice is one NAT Gateway per Availability Zone. If you have private subnet in AZ-a and private subnet in AZ-b, each private subnet should use NAT in the same AZ for reliability and to avoid cross-AZ dependency.&lt;/p&gt;

&lt;p&gt;Troubleshooting NAT:&lt;/p&gt;

&lt;p&gt;If private EC2 cannot access internet, check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Is NAT Gateway available?
Is NAT Gateway in public subnet?
Does NAT Gateway have Elastic IP?
Does public subnet route to Internet Gateway?
Does private subnet route to NAT Gateway?
Does security group allow outbound traffic?
Does NACL allow inbound/outbound ephemeral ports?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Elastic IP
&lt;/h2&gt;

&lt;p&gt;Elastic IP is a static public IPv4 address in AWS.&lt;/p&gt;

&lt;p&gt;We use Elastic IP when we need a fixed public IP that does not change.&lt;/p&gt;

&lt;p&gt;A public NAT Gateway requires an Elastic IP: it replaces the private source address of outbound traffic with that fixed public address.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2 → NAT Gateway → Internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the internet side, traffic appears to come from the NAT Gateway Elastic IP.&lt;/p&gt;

&lt;p&gt;Use Elastic IP for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAT Gateway
Bastion host
Static public server
Allowlisting with external vendors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not overuse Elastic IP. In production, most public application traffic should go through a Load Balancer, not directly to EC2.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Firewall
&lt;/h2&gt;

&lt;p&gt;A firewall controls traffic based on rules.&lt;/p&gt;

&lt;p&gt;In AWS, firewall behavior mainly comes from:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Security Groups
Network ACLs
AWS Network Firewall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For most normal EC2-level access, you use security groups.&lt;/p&gt;

&lt;p&gt;A firewall answers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Who can connect?
From where?
To which port?
Using which protocol?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Web server security group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
HTTP 80 from 0.0.0.0/0
HTTPS 443 from 0.0.0.0/0
SSH 22 from my IP only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database security group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
PostgreSQL 5432 only from app server security group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is production thinking. Users should never directly access the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Security Group
&lt;/h2&gt;

&lt;p&gt;Security Group is an instance-level firewall. It is attached to EC2, RDS, Load Balancer, and other resources.&lt;/p&gt;

&lt;p&gt;Security Groups are stateful.&lt;/p&gt;

&lt;p&gt;Stateful means if inbound traffic is allowed, the return traffic is automatically allowed, regardless of the outbound rules.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;If a user connects to the web server on port 80, the response is automatically allowed back.&lt;/p&gt;

&lt;p&gt;AWS explains that security groups control inbound and outbound traffic at the instance level. (&lt;a href="https://docs.aws.amazon.com/whitepapers/latest/aws-best-practices-ddos-resiliency/security-groups-and-network-acls-bp5.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Good security group design:&lt;/p&gt;

&lt;p&gt;Load Balancer SG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
80/443 from internet

Outbound:
To web/app servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Web/App Server SG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
App port only from Load Balancer SG
SSH only from bastion SG (no inbound SSH rule needed with SSM Session Manager)

Outbound:
To database or internet through NAT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database SG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Inbound:
DB port only from App Server SG

Outbound:
Default or restricted depending on company policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Very important SRE idea:&lt;/p&gt;

&lt;p&gt;Use security group references instead of IP addresses when possible.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Instead of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Allow 10.0.3.10 on port 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Allow app-server-sg on port 5432
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is better because EC2 IPs can change.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Network ACL
&lt;/h2&gt;

&lt;p&gt;A Network ACL, or NACL, is a subnet-level firewall.&lt;/p&gt;

&lt;p&gt;Security Group protects the instance.&lt;br&gt;
NACL protects the subnet.&lt;/p&gt;

&lt;p&gt;AWS says Network ACLs control traffic in and out of one or more subnets and can be used as an additional layer of security. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;NACL is stateless.&lt;/p&gt;

&lt;p&gt;Stateless means you must allow both inbound and outbound traffic separately.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;If inbound HTTP is allowed but the outbound response traffic (which goes to the client's ephemeral port) is denied, the connection fails.&lt;/p&gt;

&lt;p&gt;NACL rules have numbers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;100 allow HTTP
110 allow HTTPS
120 deny specific IP
* deny all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lower number is evaluated first.&lt;/p&gt;
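&lt;p&gt;Because the first matching rule decides the outcome, the position of a deny rule relative to the allow rules matters. The sketch below is a simplified model of NACL evaluation (source CIDR and port only); the &lt;code&gt;evaluate&lt;/code&gt; helper and the sample IP addresses are hypothetical.&lt;/p&gt;

```python
import ipaddress


def evaluate(rules, source_ip, port):
    """Return the action of the first matching rule, lowest rule number first."""
    ip = ipaddress.ip_address(source_ip)
    for number, action, cidr, rule_port in sorted(rules):
        if ip in ipaddress.ip_network(cidr) and port == rule_port:
            return action
    return "deny"  # the implicit "*" rule at the end of every NACL


BAD_IP = "203.0.113.7"  # illustrative address from a documentation range

# Deny placed AFTER the allow rule: the allow rule matches first.
deny_after_allow = [
    (100, "allow", "0.0.0.0/0", 80),
    (120, "deny", BAD_IP + "/32", 80),
]

# Deny placed BEFORE the allow rule: the deny rule matches first.
deny_before_allow = [
    (90, "deny", BAD_IP + "/32", 80),
    (100, "allow", "0.0.0.0/0", 80),
]

print(evaluate(deny_after_allow, BAD_IP, 80))           # allow
print(evaluate(deny_before_allow, BAD_IP, 80))          # deny
print(evaluate(deny_before_allow, "198.51.100.4", 80))  # allow
print(evaluate(deny_before_allow, "198.51.100.4", 22))  # deny
```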

&lt;p&gt;When to use NACL:&lt;/p&gt;

&lt;p&gt;Use NACL for broad subnet-level guardrails, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Block known malicious IP
Block traffic between subnet groups
Add extra compliance layer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not use NACL for every small application rule. Use Security Groups for that.&lt;/p&gt;

&lt;p&gt;AWS also recommends security groups as the primary network control and NACLs as optional stateless subnet-level guardrails. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/infrastructure-security.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Security Group vs NACL
&lt;/h2&gt;

&lt;p&gt;Security Group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance level
Stateful
Only allow rules
Commonly used
Can reference another security group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NACL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subnet level
Stateless
Allow and deny rules
Rule number order matters
Used for broad subnet control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interview answer:&lt;/p&gt;

&lt;p&gt;Security Groups are stateful firewalls attached to resources, while NACLs are stateless firewalls applied at the subnet level.&lt;/p&gt;

&lt;h2&gt;
  
  
  12. DMZ
&lt;/h2&gt;

&lt;p&gt;DMZ means demilitarized zone.&lt;/p&gt;

&lt;p&gt;In networking, DMZ is the public-facing zone that sits between the internet and the private internal network.&lt;/p&gt;

&lt;p&gt;In AWS, a DMZ is usually your public subnet.&lt;/p&gt;

&lt;p&gt;DMZ contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application Load Balancer
Bastion host
NAT Gateway
Sometimes public web servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Private resources like databases should not be in DMZ.&lt;/p&gt;

&lt;p&gt;Correct design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   ↓
Public Subnet / DMZ
   ↓
Private App Subnet
   ↓
Private DB Subnet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why we use DMZ:&lt;/p&gt;

&lt;p&gt;Because public-facing components need controlled exposure, but internal systems must stay private.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;User can reach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → ALB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ALB can reach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALB → App Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;App server can reach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;App Server → Database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;User cannot reach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is production security.&lt;/p&gt;
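&lt;p&gt;The allowed and blocked paths above follow directly from security group references. As a rough model, assume three security groups where each one only accepts traffic from the layer in front of it. The name &lt;code&gt;alb-sg&lt;/code&gt; and the app port &lt;code&gt;8080&lt;/code&gt; are assumptions for illustration; port &lt;code&gt;5432&lt;/code&gt; is the PostgreSQL port used earlier.&lt;/p&gt;

```python
# Inbound rules per security group: a set of (source, port) pairs.
# A source is either another security group name or the literal "internet".
INBOUND_RULES = {
    "alb-sg": {("internet", 80), ("internet", 443)},
    "app-server-sg": {("alb-sg", 8080)},  # 8080 is an assumed app port
    "db-sg": {("app-server-sg", 5432)},
}


def can_connect(source, target_sg, port):
    """True if target_sg has an inbound rule for this source and port."""
    return (source, port) in INBOUND_RULES.get(target_sg, set())


print(can_connect("internet", "alb-sg", 443))        # True:  User to ALB
print(can_connect("alb-sg", "app-server-sg", 8080))  # True:  ALB to App Server
print(can_connect("app-server-sg", "db-sg", 5432))   # True:  App Server to Database
print(can_connect("internet", "db-sg", 5432))        # False: User to Database
```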

&lt;h2&gt;
  
  
  13. Proxy
&lt;/h2&gt;

&lt;p&gt;A proxy is an intermediate server that forwards traffic.&lt;/p&gt;

&lt;p&gt;There are two common types:&lt;/p&gt;

&lt;p&gt;Forward proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Proxy → Internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used when internal users access the internet through a controlled server.&lt;/p&gt;

&lt;p&gt;Reverse proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → Reverse Proxy → Backend Servers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used when external users access internal applications through a front layer.&lt;/p&gt;

&lt;p&gt;In real DevOps/SRE, common reverse proxies are:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Nginx
HAProxy
Application Load Balancer
API Gateway
Ingress Controller in Kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why use proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hide backend servers
Terminate SSL/TLS
Route traffic
Apply access control
Load balance
Log requests
Protect backend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;User does not directly access app server.&lt;/p&gt;

&lt;p&gt;Instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → ALB/Nginx → App Server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is reverse proxy behavior.&lt;/p&gt;

&lt;p&gt;Firewall vs Proxy:&lt;/p&gt;

&lt;p&gt;Firewall decides allow or deny.&lt;/p&gt;

&lt;p&gt;Proxy receives and forwards application traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  14. VPC Endpoint
&lt;/h2&gt;

&lt;p&gt;The AWS concept closest to “endpoint servers” is the VPC Endpoint.&lt;/p&gt;

&lt;p&gt;A VPC Endpoint allows private resources to access AWS services without going through the public internet.&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Private EC2 needs to access S3.&lt;/p&gt;

&lt;p&gt;Without endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2 → NAT Gateway → Internet path → S3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With VPC endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private EC2 → VPC Endpoint → S3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is more secure and can reduce NAT traffic cost.&lt;/p&gt;

&lt;p&gt;Types:&lt;/p&gt;

&lt;p&gt;Gateway Endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;S3
DynamoDB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interface Endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SSM
CloudWatch
ECR
Secrets Manager
STS
and many other AWS services
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When to use VPC endpoints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Private subnet needs AWS service access
You want to avoid internet path
You want better security
You want to reduce NAT dependency
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example SRE use case:&lt;/p&gt;

&lt;p&gt;Private EC2 has no public IP. You want to connect using SSM Session Manager. Then you may need interface endpoints for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssm
ssmmessages
ec2messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  15. VPC Peering
&lt;/h2&gt;

&lt;p&gt;VPC Peering connects two VPCs privately.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A: 10.0.0.0/16
VPC B: 10.1.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After peering, instances in VPC A can communicate with instances in VPC B using private IPs.&lt;/p&gt;

&lt;p&gt;Use VPC peering when:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Two applications are in different VPCs
Shared services VPC needs to talk to app VPC
Company has dev/prod/shared networks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important rules:&lt;/p&gt;

&lt;p&gt;CIDR blocks cannot overlap.&lt;/p&gt;

&lt;p&gt;If VPC A and VPC B both use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The peering connection cannot be created.&lt;/p&gt;

&lt;p&gt;Also, VPC peering is not transitive.&lt;/p&gt;

&lt;p&gt;If:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC A peers with VPC B
VPC B peers with VPC C
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That does not mean A can talk to C automatically.&lt;/p&gt;
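&lt;p&gt;Both peering rules can be checked with a small local model. This is a sketch under stated assumptions: the CIDR for VPC C and the helper names are made up for illustration, and the model only looks at direct peering connections, which is exactly why A cannot reach C.&lt;/p&gt;

```python
import ipaddress

# VPC A and B use the CIDRs from the example; VPC C's CIDR is assumed.
VPCS = {"A": "10.0.0.0/16", "B": "10.1.0.0/16", "C": "10.2.0.0/16"}

# Existing peering connections (unordered pairs).
PEERINGS = {frozenset(("A", "B")), frozenset(("B", "C"))}


def can_peer(cidr_x, cidr_y):
    """Peering requires non-overlapping CIDR blocks."""
    return not ipaddress.ip_network(cidr_x).overlaps(ipaddress.ip_network(cidr_y))


def can_talk(x, y):
    """Only a direct peering connection counts; nothing is routed through a third VPC."""
    return frozenset((x, y)) in PEERINGS


print(can_peer(VPCS["A"], VPCS["B"]))          # True
print(can_peer("10.0.0.0/16", "10.0.0.0/16"))  # False: identical ranges overlap
print(can_talk("A", "B"))                      # True
print(can_talk("A", "C"))                      # False: peering is not transitive
```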

&lt;p&gt;For many VPCs, companies often use Transit Gateway instead of many peering connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  16. Routing Servers
&lt;/h2&gt;

&lt;p&gt;In AWS, you usually do not manage a routing server like in traditional networking. AWS provides an implicit router inside the VPC. You control routing using route tables.&lt;/p&gt;

&lt;p&gt;AWS documentation says a VPC has an implicit router, and route tables control where traffic is directed. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-route-tables.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;However, sometimes companies deploy routing or network appliances, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;firewall appliance
proxy appliance
NAT instance
VPN router
inspection appliance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But for normal DevOps/SRE learning, focus first on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route tables
Internet Gateway
NAT Gateway
Transit Gateway
VPC Peering
VPC Endpoints
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  17. How to Build Correct AWS Network Architecture
&lt;/h2&gt;

&lt;p&gt;For your lab, build this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC: 10.0.0.0/16

Public Subnet 1: 10.0.1.0/24
Public Subnet 2: 10.0.2.0/24

Private App Subnet 1: 10.0.3.0/24
Private App Subnet 2: 10.0.4.0/24

Private DB Subnet 1: 10.0.5.0/24
Private DB Subnet 2: 10.0.6.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Public route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → local
0.0.0.0/0  → Internet Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → local
0.0.0.0/0  → NAT Gateway
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Database subnet route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;10.0.0.0/16 → local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For stricter security, the database subnet has only the local route, because it does not need internet access.&lt;/p&gt;

&lt;p&gt;Architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   ↓
Internet Gateway
   ↓
Application Load Balancer in Public Subnets
   ↓
App EC2 in Private Subnets
   ↓
RDS Database in Private DB Subnets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is real production design.&lt;/p&gt;

&lt;p&gt;AWS also recommends using multiple Availability Zones for production applications because it improves high availability, fault tolerance, and scalability. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-best-practices.html?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  18. How to Create It in Console
&lt;/h2&gt;

&lt;p&gt;First create VPC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Create VPC
Name: prod-vpc
CIDR: 10.0.0.0/16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create subnets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;public-subnet-1   10.0.1.0/24  AZ-a
public-subnet-2   10.0.2.0/24  AZ-b
private-subnet-1  10.0.3.0/24  AZ-a
private-subnet-2  10.0.4.0/24  AZ-b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
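
&lt;p&gt;Before clicking through the console, it can help to sanity-check the subnet plan. The sketch below is one possible check, not an AWS tool: it verifies that every subnet sits inside the VPC CIDR and that no two subnets overlap. The function names are hypothetical. The usable host count assumes AWS reserves five addresses in every subnet.&lt;/p&gt;

```python
import ipaddress
from itertools import combinations

VPC_CIDR = "10.0.0.0/16"
SUBNETS = {
    "public-subnet-1": "10.0.1.0/24",
    "public-subnet-2": "10.0.2.0/24",
    "private-subnet-1": "10.0.3.0/24",
    "private-subnet-2": "10.0.4.0/24",
}
# Assumption: AWS reserves the first four addresses and the last one per subnet.
AWS_RESERVED_PER_SUBNET = 5


def check_plan(vpc_cidr, subnets):
    """Return a list of problems found in the subnet plan (empty if none)."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} is outside {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} overlaps {name_b}")
    return problems


def usable_hosts(cidr):
    """Addresses left for resources after the reserved ones are removed."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


print(check_plan(VPC_CIDR, SUBNETS))  # []
print(usable_hosts("10.0.1.0/24"))    # 251
```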



&lt;p&gt;Create Internet Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → Internet Gateways → Create
Attach to prod-vpc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create public route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route Tables → Create
Name: public-rt
Route: 0.0.0.0/0 → Internet Gateway
Associate public subnets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create NAT Gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC → NAT Gateways → Create
Subnet: public-subnet-1
Elastic IP: allocate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create private route table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Route Tables → Create
Name: private-rt
Route: 0.0.0.0/0 → NAT Gateway
Associate private subnets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create EC2 in public subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Auto public IP: enabled
Security Group: allow SSH from your IP, HTTP from internet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create EC2 in private subnet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No public IP
Security Group: allow SSH only from public/bastion SG or use SSM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  19. Correct Troubleshooting Method
&lt;/h2&gt;

&lt;p&gt;When something does not work, do not guess. Work through the layers in order.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem: Public EC2 not reachable
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Is EC2 running?
2. Does EC2 have public IP?
3. Is subnet associated with public route table?
4. Does public route table have 0.0.0.0/0 → IGW?
5. Is IGW attached to VPC?
6. Does security group allow inbound port?
7. Does NACL allow traffic?
8. Is OS firewall blocking traffic?
9. Is application running?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem: Private EC2 cannot reach internet
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Is private subnet associated with private route table?
2. Does route table have 0.0.0.0/0 → NAT Gateway?
3. Is NAT Gateway available?
4. Is NAT Gateway in public subnet?
5. Does NAT Gateway have Elastic IP?
6. Does public subnet route to IGW?
7. Does security group allow outbound?
8. Does NACL allow outbound and return traffic?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Problem: App cannot connect to database
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Is DB running?
2. Is DB in private subnet?
3. Does DB security group allow app SG?
4. Is correct DB port open?
5. Are app and DB in same VPC or connected VPCs?
6. Is route table allowing local VPC traffic?
7. Is DNS name correct?
8. Are credentials correct?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
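
&lt;p&gt;Checks 3 and 4 can be tested from the app server with a plain TCP connection attempt. This is a small sketch using only the Python standard library. The helper name &lt;code&gt;port_is_open&lt;/code&gt; is hypothetical, and the host and port you pass in depend on your own database.&lt;/p&gt;

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

&lt;p&gt;A timeout usually points to a security group, NACL, or routing problem, because the packets are silently dropped. An immediate refusal usually means the network path is fine but nothing is listening on that port.&lt;/p&gt;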



&lt;h3&gt;
  
  
  Problem: VPC Peering not working
&lt;/h3&gt;

&lt;p&gt;Check:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Are CIDRs non-overlapping?
2. Is peering connection accepted?
3. Does VPC A route table point to peering connection?
4. Does VPC B route table point back?
5. Do security groups allow traffic?
6. Do NACLs allow traffic?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
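
&lt;p&gt;Checks 1, 3, and 4 can be reasoned about before touching the console. The sketch below is an illustration under stated assumptions, not an AWS API: it rejects overlapping CIDRs and returns the route each side would need. The &lt;code&gt;peering_routes&lt;/code&gt; name and the &lt;code&gt;pcx-EXAMPLE&lt;/code&gt; identifier are placeholders.&lt;/p&gt;

```python
import ipaddress


def peering_routes(cidr_a, cidr_b, peering_id="pcx-EXAMPLE"):
    """Return the route each VPC needs, or raise if the CIDRs overlap."""
    net_a = ipaddress.ip_network(cidr_a)
    net_b = ipaddress.ip_network(cidr_b)
    if net_a.overlaps(net_b):
        raise ValueError(f"{net_a} overlaps {net_b}, so peering cannot work")
    return {
        # VPC A needs a route to B's range, and B needs one back to A.
        "vpc_a_route": (str(net_b), peering_id),
        "vpc_b_route": (str(net_a), peering_id),
    }


print(peering_routes("10.0.0.0/16", "10.1.0.0/16"))
```

&lt;p&gt;A missing return route is a common cause of one-way traffic: the request arrives, but the reply has no path back.&lt;/p&gt;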



&lt;h2&gt;
  
  
  20. SRE Mindset
&lt;/h2&gt;

&lt;p&gt;An SRE does not only create a VPC. An SRE proves that the network is reliable.&lt;/p&gt;

&lt;p&gt;That means you must test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Can public users reach only what they should reach?
Can private servers reach internet through NAT?
Can database stay private?
Can one AZ fail and application still work?
Can logs show denied traffic?
Can we troubleshoot quickly?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For observability, enable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VPC Flow Logs
CloudWatch metrics
ALB access logs
Security group review
NACL review
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AWS VPC Flow Logs capture information about IP traffic going to and from network interfaces and can help diagnose security group and NACL problems. (&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Security.html" rel="noopener noreferrer"&gt;AWS Documentation&lt;/a&gt;)&lt;/p&gt;
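
&lt;p&gt;As a rough sketch of how flow logs answer "Can logs show denied traffic?", the Python below parses records in the default flow log format and keeps the rejected ones. It assumes the default 14-field, space-separated layout. The sample records are made up for illustration, and the function names are hypothetical.&lt;/p&gt;

```python
# Field names of the default flow log record, in order.
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()

# Made-up sample records, for illustration only.
SAMPLE = [
    "2 123456789010 eni-0example 203.0.113.12 10.0.1.25 49152 22 6 1 40 1700000000 1700000060 REJECT OK",
    "2 123456789010 eni-0example 10.0.3.40 10.0.1.25 41000 80 6 10 840 1700000000 1700000060 ACCEPT OK",
]


def parse_record(line):
    """Split one record into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected(lines):
    """Return only the records whose action is REJECT."""
    return [rec for rec in map(parse_record, lines) if rec["action"] == "REJECT"]


for rec in rejected(SAMPLE):
    print(rec["srcaddr"], "to", rec["dstaddr"], "port", rec["dstport"])
# 203.0.113.12 to 10.0.1.25 port 22
```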

&lt;h2&gt;
  
  
  Final Interview Summary
&lt;/h2&gt;

&lt;p&gt;You can say:&lt;/p&gt;

&lt;p&gt;I design AWS networks using VPCs, public and private subnets, route tables, Internet Gateway, NAT Gateway, security groups, and NACLs. Public subnets are used for internet-facing components like load balancers, while private subnets host application and database resources. NAT Gateway allows private resources to access the internet without being exposed. Security Groups protect resources at the instance level, and NACLs provide subnet-level control. For private AWS service access, I use VPC endpoints, and for private communication between VPCs, I use VPC peering or Transit Gateway depending on scale.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>proxy, firewall and DMZ on packet tracer</title>
      <dc:creator>Aisalkyn Aidarova</dc:creator>
      <pubDate>Mon, 27 Apr 2026 13:33:47 +0000</pubDate>
      <link>https://forem.com/jumptotech/proxy-firewall-and-dmz-on-packet-tracer-3b45</link>
      <guid>https://forem.com/jumptotech/proxy-firewall-and-dmz-on-packet-tracer-3b45</guid>
      <description>&lt;h1&gt;
  
  
  PHASE 1 — CHECK EXISTING ARCHITECTURE FIRST
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. ROUTER0 CHECK COMMANDS
&lt;/h2&gt;

&lt;p&gt;Run on &lt;strong&gt;Router0&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check all router interfaces
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;enable
&lt;/span&gt;show ip interface brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;g0/0.10    192.168.10.1    up/up
g0/0.20    192.168.20.1    up/up
g0/0.30    192.168.30.1    up/up
g0/1.50    192.168.50.1    up/up
g0/1.100   200.1.1.1       up/up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which VLAN gateway exists on router
Whether interfaces are working
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check routing table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show ip route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;192.168.10.0/24 connected
192.168.20.0/24 connected
192.168.30.0/24 connected
192.168.50.0/24 connected
200.1.1.0/24 connected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Router knows all VLAN networks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check DHCP pools on router
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show running-config | section dhcp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VLAN10 pool
VLAN20 pool
VLAN30 pool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Router or server DHCP settings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check firewall rules
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show access-lists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected before firewall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;No important ACL or old ACL only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Existing firewall rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check where ACL is applied
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show running-config | include access-group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected before firewall:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty or old access-group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Whether firewall is already applied to interface
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check router full config
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show running-config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Subinterfaces
DHCP
ACL
NAT
Gateway IPs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  2. SWITCH0 CHECK COMMANDS
&lt;/h2&gt;

&lt;p&gt;Switch0 = user computers.&lt;/p&gt;

&lt;p&gt;Run on &lt;strong&gt;Switch0&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check VLAN and ports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;enable
&lt;/span&gt;show vlan brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected from your lab:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VLAN 10 HR       Fa0/1, Fa0/2, Fa0/5
VLAN 20 IT       Fa0/3
VLAN 30 DevOps   Fa0/4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which computer port belongs to which VLAN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check trunk port
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show interfaces trunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fa0/24 trunking
Allowed VLANs: 10,20,30,50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Switch-to-router connection carries VLANs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check MAC address table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show mac address-table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAC addresses learned on PC ports and trunk port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which device is connected to which switch port
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check switch interfaces
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show ip interface brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fa0/1 up
Fa0/2 up
Fa0/3 up
Fa0/4 up
Fa0/5 up
Fa0/24 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This shows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which physical cables are active
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. SWITCH1 CHECK COMMANDS
&lt;/h2&gt;

&lt;p&gt;Switch1 = servers.&lt;/p&gt;

&lt;p&gt;Run on &lt;strong&gt;Switch1&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check VLAN and server ports
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;enable
&lt;/span&gt;show vlan brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VLAN 50 SERVERS/PUBLIC   Fa0/1
VLAN 100 PUBLIC          Fa0/2, Fa0/4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fa0/1 = Server0
Fa0/2 = Server2
Fa0/4 = PC4 or future server
Fa0/3 = trunk to router
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check trunk
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show interfaces trunk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fa0/3 trunking
Allowed VLANs: 50,100
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check MAC address table
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show mac address-table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;VLAN 50 MAC on Fa0/1
VLAN 100 MAC on Fa0/2
VLAN 100 MAC on Fa0/4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Check interface status
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show ip interface brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Fa0/1 up
Fa0/2 up
Fa0/3 up
Fa0/4 up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  4. COMPUTER CHECK COMMANDS
&lt;/h2&gt;

&lt;p&gt;On every PC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Desktop → Command Prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ipconfig
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PC0  = 192.168.10.10 / gateway 192.168.10.1
PC10 = 192.168.10.11 / gateway 192.168.10.1
PC1  = 192.168.10.12 / gateway 192.168.10.1
PC2  = 192.168.20.10 / gateway 192.168.20.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then test gateway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 192.168.10.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or for VLAN20:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 192.168.20.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  5. SERVER CHECKS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server0 — DHCP/DNS server
&lt;/h3&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server0 → Desktop → IP Configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IP: 192.168.50.10
Mask: 255.255.255.0
Gateway: 192.168.50.1
DNS: 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Services → DHCP → ON
Services → DNS → ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  Server2 — Public-facing web/proxy server
&lt;/h3&gt;

&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IP: 200.1.1.2
Mask: 255.255.255.0
Gateway: 200.1.1.1
DNS: 192.168.50.10
DHCP Service: OFF
HTTP Service: ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  PHASE 2 — IMPLEMENT DMZ + FIREWALL + PROXY
&lt;/h1&gt;

&lt;p&gt;Important:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Right now VLAN100 is only a separate network.
It becomes a DMZ after we apply firewall rules.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Production-style zones
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INTERNAL USERS:
VLAN 10,20,30
192.168.10.0/24
192.168.20.0/24
192.168.30.0/24

PRIVATE SERVERS:
VLAN 50
192.168.50.0/24

DMZ / PUBLIC-FACING:
VLAN 100
200.1.1.0/24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  STEP 1 — FIX SERVER2
&lt;/h1&gt;

&lt;p&gt;On Server2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Services → DHCP → OFF
Services → HTTP → ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;IP config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IP: 200.1.1.2
Mask: 255.255.255.0
Gateway: 200.1.1.1
DNS: 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  STEP 2 — CREATE FIREWALL ON ROUTER
&lt;/h1&gt;

&lt;p&gt;This firewall enforces the following rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internal users can access the web/proxy server
DMZ cannot access internal PCs
DMZ cannot access private servers, except the database (added later)
Users cannot directly access the database (added later)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run on &lt;strong&gt;Router0&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;enable
&lt;/span&gt;conf t

no access-list 100
no access-list 101
no access-list 110

ip access-list extended DMZ_FIREWALL
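remark Allow replies from the web server to clients that opened the connection
permit tcp host 200.1.1.2 eq 80 192.168.0.0 0.0.255.255 established
remark Allow ping replies so connectivity tests from the PCs still succeed
permit icmp 200.1.1.0 0.0.0.255 192.168.0.0 0.0.255.255 echo-reply

remark Allow DMZ server to reach Server0 (database later, DNS now)
permit tcp host 200.1.1.2 host 192.168.50.10 eq 80
permit udp host 200.1.1.2 host 192.168.50.10 eq 53
permit icmp host 200.1.1.2 host 192.168.50.10 echo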

remark Block DMZ from reaching internal user VLANs
deny ip 200.1.1.0 0.0.0.255 192.168.10.0 0.0.0.255
deny ip 200.1.1.0 0.0.0.255 192.168.20.0 0.0.0.255
deny ip 200.1.1.0 0.0.0.255 192.168.30.0 0.0.0.255

remark Block DMZ from reaching private server VLAN except database rule above
deny ip 200.1.1.0 0.0.0.255 192.168.50.0 0.0.0.255

remark Allow remaining traffic &lt;span class="k"&gt;for &lt;/span&gt;Packet Tracer lab stability
permit ip any any

interface g0/1.100
ip access-group DMZ_FIREWALL &lt;span class="k"&gt;in

&lt;/span&gt;end
wr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What this means
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;interface g0/1.100 = VLAN100 gateway
ip access-group DMZ_FIREWALL in = check traffic coming FROM DMZ into router
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So when Server2 tries to go inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server2 → Router → Internal network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The router checks each packet against DMZ_FIREWALL from top to bottom and applies the first rule that matches.&lt;/p&gt;




&lt;h1&gt;
  
  
  STEP 3 — INTERNAL USERS FIREWALL
&lt;/h1&gt;

&lt;p&gt;This blocks users from accessing the database directly once it is added later.&lt;/p&gt;

&lt;p&gt;Run on &lt;strong&gt;Router0&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;enable
&lt;/span&gt;conf t

ip access-list extended INTERNAL_USERS
remark Allow &lt;span class="nb"&gt;users &lt;/span&gt;to access DMZ web/proxy server
permit tcp 192.168.0.0 0.0.255.255 host 200.1.1.2 eq 80

remark Allow &lt;span class="nb"&gt;users &lt;/span&gt;to use DNS
permit udp 192.168.0.0 0.0.255.255 host 192.168.50.10 eq 53

remark Block &lt;span class="nb"&gt;users &lt;/span&gt;from accessing private database directly
deny tcp 192.168.0.0 0.0.255.255 host 192.168.50.20 eq 80

remark Allow other traffic &lt;span class="k"&gt;for &lt;/span&gt;lab testing
permit ip any any

interface g0/0.10
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.20
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.30
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;end
wr
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
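
&lt;p&gt;To make the matching logic concrete, here is a small Python sketch of how an extended ACL like INTERNAL_USERS is evaluated: wildcard masks decide which address bits must match, rules are read top to bottom, the first match wins, and anything unmatched hits the implicit deny. This is a simplified model, not Cisco code. It ignores source ports and TCP flags, and the tuple layout and function names are assumptions made for illustration.&lt;/p&gt;

```python
import ipaddress


def to_int(addr):
    return int(ipaddress.IPv4Address(addr))


def wildcard_match(addr, base, wildcard):
    """True if addr equals base on every bit where the wildcard bit is 0."""
    mask = to_int(wildcard)
    # Forcing the "don't care" bits to 1 on both sides leaves only the
    # bits that must match exactly.
    return (to_int(addr) | mask) == (to_int(base) | mask)


def host(addr):
    return (addr, "0.0.0.0")


ANY = ("0.0.0.0", "255.255.255.255")
USERS = ("192.168.0.0", "0.0.255.255")

# (action, protocol, source, destination, destination port)
INTERNAL_USERS = [
    ("permit", "tcp", USERS, host("200.1.1.2"), 80),
    ("permit", "udp", USERS, host("192.168.50.10"), 53),
    ("deny", "tcp", USERS, host("192.168.50.20"), 80),
    ("permit", "ip", ANY, ANY, None),
]


def evaluate(acl, proto, src, dst, dst_port=None):
    """Return the action of the first matching rule, else the implicit deny."""
    for action, rule_proto, (s_base, s_wild), (d_base, d_wild), rule_port in acl:
        if rule_proto not in ("ip", proto):
            continue
        if not wildcard_match(src, s_base, s_wild):
            continue
        if not wildcard_match(dst, d_base, d_wild):
            continue
        if rule_port is not None and rule_port != dst_port:
            continue
        return action
    return "deny"  # every ACL ends with an implicit deny


print(evaluate(INTERNAL_USERS, "tcp", "192.168.10.10", "200.1.1.2", 80))      # permit
print(evaluate(INTERNAL_USERS, "tcp", "192.168.10.10", "192.168.50.20", 80))  # deny
print(evaluate(INTERNAL_USERS, "tcp", "192.168.10.10", "192.168.50.10", 80))  # permit
```

&lt;p&gt;The third result matters: HTTP to 192.168.50.10 falls through to the final permit, so this version of the list does not yet protect a database hosted on Server0.&lt;/p&gt;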



&lt;h1&gt;
  
  
  🔥 PHASE 3 — CREATE INTERNET-FACING WEB APP
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Step 1 — Create Website on Server2 (DMZ)
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server2 → Services → HTTP → ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then edit &lt;strong&gt;index.html&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Replace with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Company Portal&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Welcome to JumpToTech Company&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h2&amp;gt;&lt;/span&gt;Login&lt;span class="nt"&gt;&amp;lt;/h2&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;form&amp;gt;&lt;/span&gt;
Username: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
Password: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"password"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;br&amp;gt;&amp;lt;br&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"submit"&lt;/span&gt; &lt;span class="na"&gt;value=&lt;/span&gt;&lt;span class="s"&gt;"Login"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/form&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2 — TEST WEB SERVER
&lt;/h2&gt;

&lt;p&gt;From any PC (for example, PC0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 200.1.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Success
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now open browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Desktop → Web Browser
http://200.1.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔️ You should see your webpage&lt;/p&gt;




&lt;h1&gt;
  
  
  🔥 PHASE 4 — CREATE PRIVATE DATABASE
&lt;/h1&gt;

&lt;p&gt;We simulate the database using another server.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1 — Use Server0 or New Server as DB
&lt;/h2&gt;

&lt;p&gt;👉 Better: use &lt;strong&gt;Server0 as DB + DNS&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Assign (already done):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IP: 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 2 — Create Database (simulate via HTTP)
&lt;/h2&gt;

&lt;p&gt;Go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server0 → Services → HTTP → ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Edit page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;DATABASE SERVER&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;User Data Stored Here&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;Username: admin&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;Password: secret123&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔥 PHASE 5 — CONNECT WEB → DATABASE
&lt;/h1&gt;

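&lt;p&gt;Now simulate a backend call. The HTTP service in Packet Tracer only serves static pages, so a link on the portal page stands in for the server-side request:&lt;/p&gt;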

&lt;h3&gt;
  
  
  On Server2 (Web server)
&lt;/h3&gt;

&lt;p&gt;Update HTML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Company Portal&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"http://192.168.50.10"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Access Database&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Test flow:
&lt;/h2&gt;

&lt;p&gt;From PC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://200.1.1.2
→ click link
→ should open 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🚨 NOW APPLY FIREWALL RESTRICTION (IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;We now enforce production behavior:&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirement:
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ Users cannot access DB directly
✅ Only Web Server can access DB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step — FIX FIREWALL (Router)
&lt;/h2&gt;

&lt;p&gt;Run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conf t

no ip access-list extended INTERNAL_USERS

ip access-list extended INTERNAL_USERS

remark Allow &lt;span class="nb"&gt;users&lt;/span&gt; → web server only
permit tcp 192.168.0.0 0.0.255.255 host 200.1.1.2 eq 80

remark Block &lt;span class="nb"&gt;users&lt;/span&gt; → database
deny tcp 192.168.0.0 0.0.255.255 host 192.168.50.10 eq 80

permit ip any any

interface g0/0.10
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.20
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.30
ip access-group INTERNAL_USERS &lt;span class="k"&gt;in

&lt;/span&gt;end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Now test:
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From PC:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;❌ SHOULD FAIL&lt;/p&gt;




&lt;h3&gt;
  
  
  From Web Server:
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server2 → Browser
http://192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;✔️ SHOULD WORK&lt;/p&gt;




&lt;h1&gt;
  
  
  🔥 PHASE 6 — ADD PROXY (VERY IMPORTANT)
&lt;/h1&gt;

&lt;p&gt;Now we simulate a proxy:&lt;/p&gt;

&lt;p&gt;👉 Proxy = controls user internet access&lt;/p&gt;




&lt;h2&gt;
  
  
  Step — Make Server2 act as Proxy
&lt;/h2&gt;

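&lt;p&gt;In Packet Tracer this is simplified: there is no proxy service to configure, so an ACL stands in for it.&lt;/p&gt;

&lt;p&gt;We borrow the HTTP filtering idea: users may reach only Server2 on port 80, and their other traffic is blocked.&lt;/p&gt;

&lt;p&gt;Update firewall:&lt;br&gt;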
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conf t

ip access-list extended PROXY_CONTROL

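remark Allow only web server access
permit tcp 192.168.0.0 0.0.255.255 host 200.1.1.2 eq 80

remark Allow DNS lookups to Server0
permit udp 192.168.0.0 0.0.255.255 host 192.168.50.10 eq 53

remark Block all other user traffic, including direct database access
deny ip 192.168.0.0 0.0.255.255 any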

permit ip any any

interface g0/0.10
ip access-group PROXY_CONTROL &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.20
ip access-group PROXY_CONTROL &lt;span class="k"&gt;in

&lt;/span&gt;interface g0/0.30
ip access-group PROXY_CONTROL &lt;span class="k"&gt;in

&lt;/span&gt;end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Result:
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PC → Web Server&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PC → Internet&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PC → DB&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web → DB&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  🔥 FINAL ARCHITECTURE (PRODUCTION STYLE)
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ USERS VLAN 10/20/30 ]
        ↓
     (Firewall)
        ↓
   [ DMZ - Web Server ]
        ↓
     (Firewall)
        ↓
 [ Private DB VLAN50 ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  🔥 PHASE 7 — SRE TROUBLESHOOTING SCENARIOS
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Scenario 1 — Website not opening
&lt;/h2&gt;

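&lt;p&gt;Check from the router (a ping from a user PC fails by design once PROXY_CONTROL is applied, because only TCP port 80 is permitted):&lt;br&gt;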
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 200.1.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it fails, check interface status on the router and VLAN membership on the switch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show ip interface brief
show vlan brief
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 2 — Page loads but DB not working
&lt;/h2&gt;

&lt;p&gt;Check from Server2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it fails, inspect the ACLs on the router:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show access-lists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Scenario 3 — User cannot access web
&lt;/h2&gt;

&lt;p&gt;Check on the router that the ACL entries and their interface bindings are correct:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;show access-lists
show run | include access-group
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
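
&lt;p&gt;If the entries look right but the user is still blocked, one way to confirm that the ACL is the cause is to remove it from a single subinterface, retest, and then re-apply it. A minimal sketch, assuming the affected user sits in VLAN 10 (g0/0.10):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;conf t
interface g0/0.10
! remove the ACL from this subinterface only
no ip access-group PROXY_CONTROL in
end

! retest from the PC, then put the ACL back
conf t
interface g0/0.10
ip access-group PROXY_CONTROL in
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the page opens while the ACL is removed, the problem is in the ACL entries and not in routing or VLANs.&lt;/p&gt;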






&lt;h2&gt;
  
  
  Scenario 4 — DNS issue
&lt;/h2&gt;

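&lt;p&gt;Check that the server responds by IP address. If it does but the hostname still fails, the problem is name resolution:&lt;br&gt;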
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ping 192.168.50.10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then confirm the DNS service is turned on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server0 → DNS → ON
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
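
&lt;p&gt;To test name resolution itself, you can run a lookup from a PC's Command Prompt. This is a sketch: www.example.com is a placeholder, so replace it with a record you actually created on the DNS server:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nslookup www.example.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keep in mind that PROXY_CONTROL, as written, permits only TCP port 80 to 200.1.1.2 from the user subnets. A DNS query from a user PC to a server in another subnet is dropped by the deny line, so a lookup can fail even when the DNS service is ON.&lt;/p&gt;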






&lt;h1&gt;
  
  
  🔥 FINAL RESULT
&lt;/h1&gt;

&lt;p&gt;You built:&lt;/p&gt;

&lt;p&gt;✔️ VLAN segmentation&lt;br&gt;
✔️ Router-on-a-stick&lt;br&gt;
✔️ DMZ architecture&lt;br&gt;
✔️ Firewall (ACL)&lt;br&gt;
✔️ Proxy control&lt;br&gt;
✔️ Web application&lt;br&gt;
✔️ Database separation&lt;br&gt;
✔️ SRE troubleshooting scenarios&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
