<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Data Tech Bridge</title>
    <description>The latest articles on Forem by Data Tech Bridge (@datatechbridge).</description>
    <link>https://forem.com/datatechbridge</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2904813%2F7f82258a-66eb-4a83-a7b9-c42ba24dc0b7.jpeg</url>
      <title>Forem: Data Tech Bridge</title>
      <link>https://forem.com/datatechbridge</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/datatechbridge"/>
    <language>en</language>
    <item>
      <title>I Got a Teams Message: 'Why Do We Need a Semantic Layer?'</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:02:07 +0000</pubDate>
      <link>https://forem.com/datatechbridge/i-got-a-teams-message-why-do-we-need-a-semantic-layer-4ab3</link>
      <guid>https://forem.com/datatechbridge/i-got-a-teams-message-why-do-we-need-a-semantic-layer-4ab3</guid>
      <description>&lt;p&gt;It was Monday morning. Our team was sitting through a technical architecture meeting. I could see Sarah's face — confused, frustrated, checking her watch.&lt;/p&gt;

&lt;p&gt;When the meeting ended, I got a Teams message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Hey, can we chat? Why are we talking about semantic layers with a fancy name? We have a data warehouse. We have SQL. We have Tableau. Why do we need ANOTHER layer? Isn't that just... more complexity?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I smiled. This was the question I get asked every few months. But Sarah's frustration was real — and valid.&lt;/p&gt;

&lt;p&gt;I called her immediately.&lt;/p&gt;

&lt;p&gt;"Let me tell you a story," I said on the Teams call. "Then it'll make sense. And stop you from thinking we're adding complexity just for fun."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story: Revenue Doesn't Match
&lt;/h2&gt;

&lt;p&gt;Six months ago, our CFO called an urgent meeting. Marketing's revenue report showed $2M. Finance's revenue report showed $1.8M. Same period. Same data. Different numbers.&lt;/p&gt;

&lt;p&gt;Marketing's query: &lt;code&gt;SELECT SUM(sale_amount) FROM transactions WHERE date &amp;gt;= '2024-01-01'&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Finance's query: &lt;code&gt;SELECT SUM(sale_amount) FROM transactions WHERE date &amp;gt;= '2024-01-01' AND refunded = false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Finance included refunds. Marketing didn't. For two weeks, we fought over which number was correct.&lt;/p&gt;

&lt;p&gt;"That sounds painful," Sarah said, sipping her coffee.&lt;/p&gt;

&lt;p&gt;"It was. And the worst part? This kept happening. Every month, someone would run a query slightly differently. Our CEO stopped trusting the numbers. Analysts wasted time justifying their queries instead of finding insights."&lt;/p&gt;

&lt;p&gt;Sarah nodded. "So what did you do?"&lt;/p&gt;

&lt;p&gt;"That's when we built our semantic layer."&lt;/p&gt;

&lt;h2&gt;
  
  
  Let Me Explain: The Layers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48rdjnljb7m0iqlx37ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F48rdjnljb7m0iqlx37ys.png" alt=" " width="800" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Okay, let me sketch this out in a shared doc while we talk," I said, opening a Google Doc.&lt;/p&gt;

&lt;p&gt;"Think of data like a building. I'll draw boxes stacked on top of each other."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Ground Floor: Raw Data (Ingestion Layer)"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"This is where data arrives. From APIs, databases, logs — everything is dumped here as-is. No transformation. No cleaning. Just raw bytes."&lt;/p&gt;

&lt;p&gt;Sarah: "So this is like... Kafka? S3? The data lake?"&lt;/p&gt;

&lt;p&gt;"Exactly. Your data is messy here. Duplicates. Inconsistencies. Wrong types. Nobody uses this directly."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Factory Floor: Processing Layer"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"This is where magic happens. We use dbt or Spark to clean, validate, join, and transform data. We remove duplicates. We fix data types. We create business logic."&lt;/p&gt;

&lt;p&gt;I drew arrows showing data flowing up.&lt;/p&gt;

&lt;p&gt;"Here's where we define: &lt;em&gt;Revenue = Sale Amount minus Refunds and Discounts, calculated daily, excluding test transactions.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;Sarah: "Oh! So you solve the problem here?"&lt;/p&gt;

&lt;p&gt;"Partially. But wait..."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Office Floor: Logical Layer (Data Marts)"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"This is where we organize the processed data into business domains. We have a Sales Mart, a Finance Mart, a Marketing Mart. Each one is clean, documented, and domain-specific."&lt;/p&gt;

&lt;p&gt;Sarah: "So now we have one source of truth?"&lt;/p&gt;

&lt;p&gt;"Getting there. But there's still a problem. Different teams organize data slightly differently. A 'customer' in one mart might be different from a 'customer' in another. And every BI tool, every analyst, every new ML model has to learn these quirks."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Executive Suite: Semantic Layer"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is where I drew a special box with a different color.&lt;/p&gt;

&lt;p&gt;"Here's where we translate technical data into business language. In this layer, we have one definition of 'Revenue.' One definition of 'Customer.' One definition of 'Churn.' These are metrics and dimensions that everyone in the company understands."&lt;/p&gt;

&lt;p&gt;Sarah leaned closer. "So the semantic layer is like... a business dictionary?"&lt;/p&gt;

&lt;p&gt;"YES. Exactly that. It's a contract between the technical world and the business world."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Conference Room: Consumption Layer"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Finally, this is where dashboards, BI tools, ML models, and end users live. They don't care about the warehouse schema. They just ask for 'Revenue by Region' and get answers."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters: The Three Problems It Solves
&lt;/h2&gt;

&lt;p&gt;Sarah was still skeptical. "But couldn't you just document all this in a wiki or something?"&lt;/p&gt;

&lt;p&gt;"You could. But then you're relying on people to follow documentation. Let me show you three problems a semantic layer actually solves."&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: The Revenue Conflict
&lt;/h3&gt;

&lt;p&gt;"Remember the Marketing vs Finance revenue conflict? With a semantic layer, we define revenue once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;refunded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_test_transaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every dashboard, every query, every report uses the same definition. Marketing can't accidentally exclude refunds. Finance can't accidentally include test data. One metric. One source of truth."&lt;/p&gt;

&lt;p&gt;Sarah: "So they can't fight anymore because they're using the same calculation?"&lt;/p&gt;

&lt;p&gt;"Exactly."&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: Schema Changes Don't Break Everything
&lt;/h3&gt;

&lt;p&gt;"Let's say next month, our company acquires another startup. Their customer table has a slightly different schema. We merge the data into our warehouse and change a column name from &lt;code&gt;customer_id&lt;/code&gt; to &lt;code&gt;cust_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Without a semantic layer? Every dashboard, every query, every report breaks. You spend a week fixing broken dashboards.&lt;/p&gt;

&lt;p&gt;With a semantic layer? You update the mapping once. Every BI tool, every analyst, every model keeps working because they query the semantic layer, not the raw tables."&lt;/p&gt;

&lt;p&gt;Sarah's eyes widened. "Oh! So it's like... an adapter?"&lt;/p&gt;

&lt;p&gt;"Perfect. It adapts technical changes so the business side doesn't see them."&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: AI and ML Models
&lt;/h3&gt;

&lt;p&gt;"And here's the modern problem: AI models. Let's say we want to train a churn prediction model. The model needs historical data. But which historical data? When was it loaded? Did it include test data? Was it deduplicated?&lt;/p&gt;

&lt;p&gt;Without data lineage and a semantic layer, you have no idea. Your model trains on dirty data. It works great for six months. Then it fails because the data changed.&lt;/p&gt;

&lt;p&gt;With a semantic layer and proper versioning, you know exactly which data the model trained on. You know when that data was collected. You can reproduce it. You can debug when it fails."&lt;/p&gt;

&lt;p&gt;Sarah nodded slowly. "So semantic layers aren't just for dashboards. They're for everything."&lt;/p&gt;

&lt;p&gt;"Yes. BI, reporting, ML models, data science, even GenAI. Anyone who touches data benefits."&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Data Layers: The Full Picture
&lt;/h2&gt;

&lt;p&gt;Sarah asked: "So what are all the layers? You mentioned a few."&lt;/p&gt;

&lt;p&gt;I updated the whiteboard with a complete list.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ingestion Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Raw data arrives. No transformation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Kafka, Fivetran, AWS Glue, Apache NiFi&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; API logs hitting S3 every hour, untouched&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Processing Layer (ETL/ELT)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Clean, validate, transform, join.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; dbt, Apache Spark, Python, SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Merge three data sources, remove duplicates, add business logic&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Logical Layer (Data Marts)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Organized, domain-specific tables.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Snowflake, BigQuery, Postgres (any warehouse)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Sales Mart with cleaned order data, Finance Mart with GL transactions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Semantic Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Business language, metrics, dimensions, one source of truth.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Looker, Tableau semantic models, Atlan, dbt metrics, Cube.js&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; Metric "Revenue" = defined once, used everywhere&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Consumption Layer&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dashboards, reports, BI tools, ML models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools:&lt;/strong&gt; Tableau, Metabase, Superset, Python notebooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Example:&lt;/strong&gt; CFO's revenue dashboard, ML churn model, analyst SQL queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Supporting Layers (Everywhere):
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Metadata Layer:&lt;/strong&gt; Tracks lineage, schema, ownership, documentation&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Who changed this column? Where did this metric come from? What data quality rules apply?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Governance Layer:&lt;/strong&gt; Access control, data quality, compliance&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Who can see this data? Has it been audited? Does it meet compliance requirements?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Lightbulb Moment
&lt;/h2&gt;

&lt;p&gt;Sarah put down her coffee. "So if we skip the semantic layer and go straight from logical to consumption, we get chaos. But if we build the semantic layer, everything downstream gets easier?"&lt;/p&gt;

&lt;p&gt;"Now you're getting it. And here's the kicker..."&lt;/p&gt;

&lt;h2&gt;
  
  
  Why It Matters Even More Now (AI and GenAI)
&lt;/h2&gt;

&lt;p&gt;"There's one more reason this matters today," I said. "AI and generative AI."&lt;/p&gt;

&lt;p&gt;Sarah: "What do you mean?"&lt;/p&gt;

&lt;p&gt;"Imagine we're building an AI system that answers business questions in natural language. You ask: 'What was our revenue last quarter?' The AI needs to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Know what 'revenue' means in our company (not their definition, OUR definition)&lt;/li&gt;
&lt;li&gt;Know which table to query&lt;/li&gt;
&lt;li&gt;Know the data quality rules (exclude test data, exclude refunds)&lt;/li&gt;
&lt;li&gt;Know the refresh cadence (is this data updated daily? Hourly?)&lt;/li&gt;
&lt;li&gt;Know if this data can be trusted&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Without a semantic layer, the AI makes up answers. It hallucinates. It returns confidently incorrect numbers.&lt;/p&gt;

&lt;p&gt;With a semantic layer, the AI knows exactly where to find trustworthy business metrics. It gives correct answers."&lt;/p&gt;

&lt;p&gt;Sarah leaned back. "So building a semantic layer now is like... future-proofing for AI?"&lt;/p&gt;

&lt;p&gt;"Exactly. Companies that build semantic layers now will deploy AI faster and with fewer mistakes."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Takeaway
&lt;/h2&gt;

&lt;p&gt;Sarah unmuted. "So to summarize: semantic layers are basically the glue that makes data reliable, consistent, and AI-ready. They prevent chaos. They make dashboards survive schema changes. They prepare you for AI. And it's not more complexity — it's LESS chaos?"&lt;/p&gt;

&lt;p&gt;"Exactly," I said. "And here's the best part: you don't have to build it all at once. Start with one domain. Maybe Sales. Build the processing layer, the logical layer, and the semantic layer for just Sales data. Measure how much faster analytics becomes. How many fewer support tickets you get.&lt;/p&gt;

&lt;p&gt;Then expand to Finance. Then Marketing. One domain at a time."&lt;/p&gt;

&lt;p&gt;There was a pause.&lt;/p&gt;

&lt;p&gt;"Okay, that actually makes sense now," Sarah said. "I was thinking this was some theoretical thing. But it's just... organization?"&lt;/p&gt;

&lt;p&gt;"It's organization with a contract. A promise that says: 'Revenue always means this. Profit always means this. And it won't change if we change the warehouse tomorrow.'"&lt;/p&gt;

&lt;p&gt;"Got it. I'm going to stop complaining about the semantic layer now," she laughed.&lt;/p&gt;

&lt;p&gt;"Good. And next time someone asks you, just tell them the story about Marketing vs Finance fighting over revenue. They'll get it immediately."&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data layers aren't about complexity — they're about clarity.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ingestion:&lt;/strong&gt; Raw data arrives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing:&lt;/strong&gt; Get cleaned and transformed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logical:&lt;/strong&gt; Get organized by domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic:&lt;/strong&gt; Get translated into business language&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumption:&lt;/strong&gt; Get used by dashboards, models, and people&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The semantic layer is the MVP. It's the one that prevents fights over metrics, survives schema changes, and prepares you for AI.&lt;/p&gt;

&lt;p&gt;Start small. Build one domain. Measure the impact.&lt;/p&gt;

&lt;p&gt;Your team will stop fighting over numbers. Your dashboards will survive changes. Your AI will give correct answers instead of hallucinations.&lt;/p&gt;

&lt;p&gt;That conversation with Sarah? It happens every few months. Now I just point them to the semantic layer.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Do you have a chaos story like the Marketing vs Finance revenue fight? Or are you thinking about where to start with your data layers? Drop a comment — I'd love to hear your story.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>architecture</category>
      <category>data</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Are you running LLM locally using LLM Studio or Ollama? What is your laptop configuration and how it the latency working with Open Source Models? Please share your experience for community to guidance.</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Tue, 17 Mar 2026 20:57:09 +0000</pubDate>
      <link>https://forem.com/datatechbridge/are-you-running-llm-locally-using-llm-studio-or-ollama-what-is-your-laptop-configuration-and-how-3o4p</link>
      <guid>https://forem.com/datatechbridge/are-you-running-llm-locally-using-llm-studio-or-ollama-what-is-your-laptop-configuration-and-how-3o4p</guid>
      <description></description>
    </item>
    <item>
      <title>If you're working with Apache Spark, you need to have this documentation bookmarked!

The official Spark Python (PySpark) API reference is an invaluable resource: [https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html]</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Thu, 19 Feb 2026 05:34:29 +0000</pubDate>
      <link>https://forem.com/datatechbridge/if-youre-working-with-apache-spark-you-need-to-have-this-documentation-bookmarked-the-2n9o</link>
      <guid>https://forem.com/datatechbridge/if-youre-working-with-apache-spark-you-need-to-have-this-documentation-bookmarked-the-2n9o</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/index.html" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;spark.apache.org&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>The Ultimate Databricks Data Engineer Associate Exam Guide for AWS Engineers</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Thu, 19 Feb 2026 03:19:51 +0000</pubDate>
      <link>https://forem.com/datatechbridge/the-ultimate-databricks-data-engineer-associate-exam-guide-for-aws-engineers-4662</link>
      <guid>https://forem.com/datatechbridge/the-ultimate-databricks-data-engineer-associate-exam-guide-for-aws-engineers-4662</guid>
      <description>&lt;h2&gt;
  
  
  A Comprehensive Deep-Dive into Every Exam Topic
&lt;/h2&gt;




&lt;h1&gt;
  
  
  Table of Contents
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;Introduction to Databricks &amp;amp; the Lakehouse Paradigm&lt;/li&gt;
&lt;li&gt;Databricks Architecture&lt;/li&gt;
&lt;li&gt;Compute — Clusters &amp;amp; Runtimes&lt;/li&gt;
&lt;li&gt;Notebooks&lt;/li&gt;
&lt;li&gt;Delta Lake &amp;amp; Data Management&lt;/li&gt;
&lt;li&gt;Tables, Schemas &amp;amp; Databases&lt;/li&gt;
&lt;li&gt;Databricks Utilities (dbutils)&lt;/li&gt;
&lt;li&gt;Git Folders / Repos&lt;/li&gt;
&lt;li&gt;Unity Catalog &amp;amp; Governance&lt;/li&gt;
&lt;li&gt;Job Orchestration &amp;amp; Workflows&lt;/li&gt;
&lt;li&gt;Data Ingestion Patterns&lt;/li&gt;
&lt;li&gt;Delta Live Tables (DLT) &amp;amp; Pipelines&lt;/li&gt;
&lt;li&gt;Apache Spark — High-Level Overview&lt;/li&gt;
&lt;li&gt;AWS Integration Specifics&lt;/li&gt;
&lt;li&gt;Security &amp;amp; Access Control&lt;/li&gt;
&lt;li&gt;Performance &amp;amp; Best Practices&lt;/li&gt;
&lt;li&gt;Exam Tips &amp;amp; Summary&lt;/li&gt;
&lt;/ol&gt;




&lt;h1&gt;
  
  
  Chapter 1 — Introduction to Databricks &amp;amp; the Lakehouse Paradigm
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Is Databricks?
&lt;/h2&gt;

&lt;p&gt;Databricks is a unified data analytics platform built on top of Apache Spark. It was founded by the original creators of Apache Spark and is available as a managed cloud service on AWS, Azure, and GCP. For AWS engineers, Databricks runs inside your AWS account — it provisions and manages Amazon EC2 instances, S3 buckets, IAM roles, VPCs, and other AWS resources on your behalf.&lt;/p&gt;

&lt;p&gt;The platform is designed to bridge the gap between data engineering, data science, and business analytics — providing a single pane of glass for all data workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lakehouse Architecture
&lt;/h2&gt;

&lt;p&gt;Before Databricks, teams operated in one of two paradigms:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Warehouses&lt;/strong&gt; (Redshift, Snowflake) — Structured, fast SQL queries, but expensive for large raw datasets, limited support for ML, and data duplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lakes&lt;/strong&gt; (S3 + Glue + Athena) — Cheap storage of any format, but no ACID transactions, no data quality guarantees, schema enforcement challenges, and slow query performance without significant tuning.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Lakehouse&lt;/strong&gt; combines the best of both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cheap, open-format storage (S3 + Parquet/Delta)&lt;/li&gt;
&lt;li&gt;ACID transactions and data reliability&lt;/li&gt;
&lt;li&gt;Support for SQL, Batch, Streaming, ML, and BI workloads&lt;/li&gt;
&lt;li&gt;A unified governance layer&lt;/li&gt;
&lt;li&gt;Open standards (Delta Lake, Apache Iceberg support)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databricks is the leading commercial implementation of the Lakehouse pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters for the Exam
&lt;/h2&gt;

&lt;p&gt;The exam frequently tests whether you understand which layer (Bronze, Silver, Gold) a given operation belongs to, what Delta Lake provides over raw Parquet, and how Databricks differs from a traditional data warehouse or raw S3 data lake.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 2 — Databricks Architecture
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Control Plane vs. Data Plane
&lt;/h2&gt;

&lt;p&gt;This is one of the most fundamental concepts for both the exam and real-world AWS deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Control Plane (Databricks-managed AWS Account)
&lt;/h3&gt;

&lt;p&gt;The control plane lives in Databricks' own AWS account. It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Web Application&lt;/strong&gt; — the UI you interact with&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Manager&lt;/strong&gt; — orchestrates cluster lifecycle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job Scheduler&lt;/strong&gt; — triggers and monitors workflow runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notebook Storage&lt;/strong&gt; — notebooks are stored in the control plane (unless using Git Folders)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata Services&lt;/strong&gt; — the Hive Metastore (legacy) or Unity Catalog Metastore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The control plane never touches your actual data. It sends instructions to resources in your account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Plane (Your AWS Account)
&lt;/h3&gt;

&lt;p&gt;The data plane lives in &lt;strong&gt;your&lt;/strong&gt; AWS account and contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2 instances&lt;/strong&gt; — actual Spark driver and worker nodes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3 buckets&lt;/strong&gt; — your data storage (the DBFS root bucket and your own buckets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VPC, Subnets, Security Groups&lt;/strong&gt; — network resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Roles &amp;amp; Instance Profiles&lt;/strong&gt; — permissions for cluster nodes to access S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a cluster is launched, the Databricks control plane communicates with your AWS account (via a cross-account IAM role) to spin up EC2 instances inside your VPC.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Split Matters
&lt;/h3&gt;

&lt;p&gt;Security teams love this architecture because &lt;strong&gt;your data never leaves your AWS account&lt;/strong&gt;. Databricks only stores metadata and notebook code. If you use Secrets (more on that later), sensitive values are encrypted in the control plane but never transmitted in plain text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks Workspace
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;workspace&lt;/strong&gt; is the top-level organizational unit in Databricks. Think of it like an AWS account — it has its own:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users and groups&lt;/li&gt;
&lt;li&gt;Clusters&lt;/li&gt;
&lt;li&gt;Notebooks&lt;/li&gt;
&lt;li&gt;Jobs&lt;/li&gt;
&lt;li&gt;Data objects (catalogs, schemas, tables)&lt;/li&gt;
&lt;li&gt;Secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On AWS, a workspace is tied to a specific AWS region and a specific S3 bucket (the "root DBFS bucket"). Large organizations often have multiple workspaces (dev, staging, prod) — this is the recommended pattern.&lt;/p&gt;

&lt;h2&gt;
  
  
  DBFS — Databricks File System
&lt;/h2&gt;

&lt;p&gt;DBFS is a virtual file system layer that &lt;strong&gt;abstracts over your underlying storage&lt;/strong&gt;. It provides a unified path namespace (&lt;code&gt;dbfs:/&lt;/code&gt;) that maps to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An S3 bucket managed by Databricks (the root bucket, e.g., &lt;code&gt;s3://databricks-workspace-XXXXXXX/&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Other mounted S3 paths&lt;/li&gt;
&lt;li&gt;External locations (with Unity Catalog)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Key points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DBFS is &lt;strong&gt;not a separate storage system&lt;/strong&gt; — it's a layer over S3&lt;/li&gt;
&lt;li&gt;The default DBFS root bucket is managed by Databricks but lives in your AWS account&lt;/li&gt;
&lt;li&gt;You can mount external S3 buckets to DBFS paths (legacy approach)&lt;/li&gt;
&lt;li&gt;With Unity Catalog, External Locations replace mounts as the preferred approach&lt;/li&gt;
&lt;li&gt;DBFS paths look like &lt;code&gt;dbfs:/user/hive/warehouse/&lt;/code&gt; or &lt;code&gt;dbfs:/FileStore/&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important DBFS paths to know:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;dbfs:/FileStore/&lt;/code&gt; — files uploaded via the UI (accessible via &lt;code&gt;/FileStore/&lt;/code&gt; URL)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbfs:/user/hive/warehouse/&lt;/code&gt; — default managed table location (Hive Metastore)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbfs:/databricks/&lt;/code&gt; — internal Databricks paths&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbfs:/mnt/&amp;lt;mount-name&amp;gt;/&lt;/code&gt; — mounted external storage (legacy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cluster Architecture
&lt;/h2&gt;

&lt;p&gt;Each Databricks cluster is a standard Apache Spark cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Driver Node (EC2)
  |
  |-- Worker Node 1 (EC2)
  |      |-- Executor 1
  |      |-- Executor 2
  |
  |-- Worker Node 2 (EC2)
         |-- Executor 3
         |-- Executor 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Driver&lt;/strong&gt; — runs the main program, coordinates Spark jobs, hosts the SparkSession&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workers&lt;/strong&gt; — run Spark executors that process data in parallel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Executors&lt;/strong&gt; — JVM processes on workers that execute tasks&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Chapter 3 — Compute: Clusters &amp;amp; Runtimes
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Cluster Types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  All-Purpose Clusters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Created manually (UI, CLI, API) or via the cluster policy&lt;/li&gt;
&lt;li&gt;Designed for &lt;strong&gt;interactive development&lt;/strong&gt; — notebooks, ad-hoc queries&lt;/li&gt;
&lt;li&gt;Can be shared by multiple users simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More expensive&lt;/strong&gt; because they run continuously (until terminated)&lt;/li&gt;
&lt;li&gt;Persist between runs; you can restart them&lt;/li&gt;
&lt;li&gt;Billed per DBU (Databricks Unit) while running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exam tip:&lt;/strong&gt; All-purpose clusters are for development. Never use them exclusively in production workflows without good reason — that's what job clusters are for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Clusters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Created automatically when a &lt;strong&gt;Databricks Job&lt;/strong&gt; runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terminated automatically&lt;/strong&gt; when the job completes&lt;/li&gt;
&lt;li&gt;Cannot be shared interactively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Much cheaper&lt;/strong&gt; — you only pay while the job is running&lt;/li&gt;
&lt;li&gt;Configuration is defined in the job definition&lt;/li&gt;
&lt;li&gt;Each job run gets a fresh, isolated cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exam tip:&lt;/strong&gt; For production workloads, always prefer job clusters for cost efficiency and isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL Warehouses (Databricks SQL)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Specifically designed for &lt;strong&gt;SQL analytics workloads&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Used by Databricks SQL, BI tools (Tableau, Power BI), and dbt&lt;/li&gt;
&lt;li&gt;Serverless (Serverless SQL Warehouses) or Classic (Pro/Serverless)&lt;/li&gt;
&lt;li&gt;Auto-scaling and auto-stop built-in&lt;/li&gt;
&lt;li&gt;Not used for Spark DataFrame API or ML workloads directly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cluster Configuration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Runtime Versions
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Databricks Runtime (DBR)&lt;/strong&gt; is the set of software running on your cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Runtime (Standard)&lt;/strong&gt; — Spark + Delta Lake + standard libraries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Runtime ML&lt;/strong&gt; — adds ML libraries (MLflow, TensorFlow, PyTorch, scikit-learn)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks Runtime for Genomics&lt;/strong&gt; — specialized for genomics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Photon Runtime&lt;/strong&gt; — includes the Photon query engine (C++ vectorized execution for Delta Lake queries — much faster for SQL)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;LTS versions&lt;/strong&gt; (Long Term Support) — recommended for production. They receive bug fixes for a longer period. Example: &lt;code&gt;13.3 LTS&lt;/code&gt;, &lt;code&gt;12.2 LTS&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Always use the latest LTS version unless you have a specific reason to use a different version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Modes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Standard Mode (Single User)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user owns the cluster&lt;/li&gt;
&lt;li&gt;Recommended for most production workloads&lt;/li&gt;
&lt;li&gt;Full Spark features available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shared Mode&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple users share the same cluster&lt;/li&gt;
&lt;li&gt;Isolation via Unity Catalog and fine-grained access control&lt;/li&gt;
&lt;li&gt;Some Spark features restricted for safety&lt;/li&gt;
&lt;li&gt;Requires Unity Catalog&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Single Node Mode&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Driver only — no worker nodes&lt;/li&gt;
&lt;li&gt;Good for small data, ML model training on single machine, development&lt;/li&gt;
&lt;li&gt;Spark still works (local mode)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auto-Scaling
&lt;/h3&gt;

&lt;p&gt;Databricks clusters can &lt;strong&gt;automatically scale&lt;/strong&gt; the number of worker nodes based on load:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set a &lt;strong&gt;minimum&lt;/strong&gt; and &lt;strong&gt;maximum&lt;/strong&gt; worker count&lt;/li&gt;
&lt;li&gt;Databricks monitors the job queue and scales up/down accordingly&lt;/li&gt;
&lt;li&gt;Scale-down has a configurable timeout (default: 120 seconds of idle)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does not work well with Streaming jobs&lt;/strong&gt; — use fixed-size clusters for streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auto-Termination
&lt;/h3&gt;

&lt;p&gt;Set an &lt;strong&gt;idle timeout&lt;/strong&gt; (e.g., 60 minutes) after which the cluster terminates automatically if no notebooks are attached or jobs are running. This is critical for cost control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Instance Types (AWS)
&lt;/h3&gt;

&lt;p&gt;For the exam, you don't need to memorize specific EC2 types, but understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory-optimized&lt;/strong&gt; (r5, r6i) — good for large DataFrames, joins, aggregations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute-optimized&lt;/strong&gt; (c5, c6i) — good for CPU-intensive transforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage-optimized&lt;/strong&gt; (i3) — good for heavy shuffle operations (local NVMe SSDs)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU instances&lt;/strong&gt; (p3, g4dn) — for ML Runtime (deep learning)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Instance Pools
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;instance pool&lt;/strong&gt; maintains a set of &lt;strong&gt;pre-warmed EC2 instances&lt;/strong&gt; ready to be allocated to clusters. This dramatically reduces cluster startup time (from ~5-8 minutes to ~30-60 seconds).&lt;/p&gt;

&lt;p&gt;Key characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pools have a minimum idle instance count and maximum capacity&lt;/li&gt;
&lt;li&gt;Instances in pools are EC2 instances that are running (you pay for them)&lt;/li&gt;
&lt;li&gt;When a cluster uses a pool, it grabs instances from the pool&lt;/li&gt;
&lt;li&gt;When a cluster terminates, instances return to the pool&lt;/li&gt;
&lt;li&gt;Configured with a specific instance type and Databricks Runtime&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exam tip:&lt;/strong&gt; Use instance pools when you need &lt;strong&gt;fast cluster startup&lt;/strong&gt; for jobs or have many short-lived clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cluster Policies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cluster policies&lt;/strong&gt; are JSON configurations that &lt;strong&gt;restrict and/or set default values&lt;/strong&gt; for cluster configuration. They help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control costs (max DBUs per hour, max instance count)&lt;/li&gt;
&lt;li&gt;Enforce standards (specific runtimes, instance types)&lt;/li&gt;
&lt;li&gt;Simplify UX for non-technical users&lt;/li&gt;
&lt;li&gt;Apply governance rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only workspace admins can create policies. Users can only create clusters that conform to the policies assigned to them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Example:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;restrict&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;cluster&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;specific&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;runtime&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spark_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"13.3.x-scala2.12"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"autotermination_minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DBUs — Databricks Units
&lt;/h2&gt;

&lt;p&gt;DBUs (Databricks Units) are the billing unit for Databricks. One DBU is roughly equivalent to one hour of a specific instance type doing compute. The rate varies by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster type (All-Purpose vs. Jobs vs. SQL Warehouse)&lt;/li&gt;
&lt;li&gt;Workload type (Data Engineering, Data Analytics, etc.)&lt;/li&gt;
&lt;li&gt;Whether Photon is enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You pay both EC2 costs (to AWS) AND DBU costs (to Databricks).&lt;/strong&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 4 — Notebooks
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Is a Databricks Notebook?
&lt;/h2&gt;

&lt;p&gt;A notebook is a web-based, interactive document containing &lt;strong&gt;cells&lt;/strong&gt; of code and markdown. It's the primary development interface in Databricks. Notebooks in Databricks have superpowers compared to Jupyter notebooks — they integrate tightly with the platform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notebook Basics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Languages
&lt;/h3&gt;

&lt;p&gt;Each notebook has a &lt;strong&gt;default language&lt;/strong&gt; (Python, SQL, Scala, R). However, you can &lt;strong&gt;mix languages&lt;/strong&gt; within a single notebook using &lt;strong&gt;magic commands&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;%python&lt;/code&gt; — execute a Python cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%sql&lt;/code&gt; — execute a SQL cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%scala&lt;/code&gt; — execute a Scala cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%r&lt;/code&gt; — execute an R cell&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%md&lt;/code&gt; — render Markdown (documentation)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%sh&lt;/code&gt; — execute shell commands on the driver node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%fs&lt;/code&gt; — shortcut for &lt;code&gt;dbutils.fs&lt;/code&gt; commands&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%run&lt;/code&gt; — run another notebook inline (powerful for modularization)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%pip&lt;/code&gt; — install Python packages in the notebook environment&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;%conda&lt;/code&gt; — use conda for package management (less common)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight about &lt;code&gt;%run&lt;/code&gt;:&lt;/strong&gt; When you &lt;code&gt;%run&lt;/code&gt; another notebook, it executes in the &lt;strong&gt;same SparkSession and scope&lt;/strong&gt; as the calling notebook. Variables, functions, and DataFrames defined in the sub-notebook are available in the parent notebook. This is different from a function call — it's more like an &lt;code&gt;include&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight about &lt;code&gt;%sh&lt;/code&gt;:&lt;/strong&gt; Shell commands run on the &lt;strong&gt;driver node only&lt;/strong&gt;, not on worker nodes. Output is printed inline. You can install packages with &lt;code&gt;%sh pip install ...&lt;/code&gt; but &lt;code&gt;%pip&lt;/code&gt; is preferred because it installs on all nodes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Widget Parameters
&lt;/h3&gt;

&lt;p&gt;Notebooks can accept &lt;strong&gt;input parameters&lt;/strong&gt; using widgets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Table Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Widget types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;text&lt;/code&gt; — free-text input&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dropdown&lt;/code&gt; — select from a list&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;combobox&lt;/code&gt; — dropdown with free-text option&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;multiselect&lt;/code&gt; — select multiple values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Widgets are particularly useful when notebooks are run as jobs — you can pass parameter values at runtime.&lt;/p&gt;

&lt;h3&gt;
  
  
  Notebook Results &amp;amp; Visualization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The last expression in a cell is displayed as output&lt;/li&gt;
&lt;li&gt;DataFrames are displayed as interactive tables with &lt;code&gt;display(df)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;display()&lt;/code&gt; supports charts — bar, line, pie, map, etc.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;print()&lt;/code&gt; outputs plain text&lt;/li&gt;
&lt;li&gt;Markdown cells render formatted documentation inline&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Notebook Lifecycle
&lt;/h3&gt;

&lt;p&gt;Notebooks can be in these states:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Detached&lt;/strong&gt; — not connected to a cluster, cannot run code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attached&lt;/strong&gt; — connected to a running or starting cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Running&lt;/strong&gt; — actively executing cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You attach a notebook to a cluster by selecting the cluster from the dropdown in the top-right. Multiple notebooks can be attached to the same all-purpose cluster simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-Save and Versioning
&lt;/h3&gt;

&lt;p&gt;Notebooks auto-save frequently. Databricks maintains a &lt;strong&gt;revision history&lt;/strong&gt; for every notebook — you can see the full history and revert to any previous version. This is separate from Git versioning (covered later).&lt;/p&gt;

&lt;h3&gt;
  
  
  Exporting Notebooks
&lt;/h3&gt;

&lt;p&gt;Notebooks can be exported as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.ipynb&lt;/code&gt; — Jupyter Notebook format&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.py&lt;/code&gt; — Python source file (with &lt;code&gt;# COMMAND ----------&lt;/code&gt; separators)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.scala&lt;/code&gt; — Scala source&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.sql&lt;/code&gt; — SQL source&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.dbc&lt;/code&gt; — Databricks archive (proprietary, includes all cells and results)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.html&lt;/code&gt; — HTML with rendered output (for sharing results)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Notebook Context &amp;amp; SparkSession
&lt;/h3&gt;

&lt;p&gt;In every Databricks notebook, these objects are pre-initialized:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;spark&lt;/code&gt; — the SparkSession&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sc&lt;/code&gt; — the SparkContext&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbutils&lt;/code&gt; — Databricks Utilities&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;display()&lt;/code&gt; — the display function&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;displayHTML()&lt;/code&gt; — render HTML content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You never need to create a SparkSession manually in a notebook.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 5 — Delta Lake &amp;amp; Data Management
&lt;/h1&gt;

&lt;p&gt;This is the &lt;strong&gt;most important chapter&lt;/strong&gt; for the exam. Delta Lake underpins everything in Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Delta Lake?
&lt;/h2&gt;

&lt;p&gt;Delta Lake is an &lt;strong&gt;open-source storage layer&lt;/strong&gt; that brings ACID transactions and reliability to data lakes. It uses standard Parquet files for data storage but adds a &lt;strong&gt;transaction log&lt;/strong&gt; on top that enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ACID transactions&lt;/li&gt;
&lt;li&gt;Schema enforcement and evolution&lt;/li&gt;
&lt;li&gt;Time travel (data versioning)&lt;/li&gt;
&lt;li&gt;Upserts and deletes (DML operations)&lt;/li&gt;
&lt;li&gt;Streaming and batch unification&lt;/li&gt;
&lt;li&gt;Optimized read/write operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Delta Lake is the &lt;strong&gt;default table format&lt;/strong&gt; in Databricks. When you create a table without specifying a format, it's Delta.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Delta Transaction Log
&lt;/h3&gt;

&lt;p&gt;The transaction log (located in the &lt;code&gt;_delta_log/&lt;/code&gt; folder of every Delta table) is the backbone of Delta Lake. It's a folder containing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON files&lt;/strong&gt; (0000000000.json, 0000000001.json, ...) — each represents one transaction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint files&lt;/strong&gt; (.parquet) — every 10 commits, a full checkpoint is created for performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every operation (write, update, delete, schema change) creates a new entry in the transaction log. The log is &lt;strong&gt;append-only&lt;/strong&gt; — you never modify existing log entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reading a Delta table:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark reads the &lt;code&gt;_delta_log/&lt;/code&gt; to determine the current state&lt;/li&gt;
&lt;li&gt;It identifies which Parquet files are "live" (not removed by deletes/updates)&lt;/li&gt;
&lt;li&gt;It reads only those Parquet files&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Writing to a Delta table:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark writes new Parquet files to the table directory&lt;/li&gt;
&lt;li&gt;Atomically commits a new JSON entry to &lt;code&gt;_delta_log/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;The transaction either fully commits or fully fails — no partial writes&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ACID Properties in Delta Lake
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt; — A transaction either fully succeeds or fully fails. If your Spark job fails halfway, no partial data is committed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt; — Schema enforcement ensures data always conforms to the table's schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt; — Concurrent reads and writes don't interfere. Readers always see a consistent snapshot. Writers use optimistic concurrency control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt; — Once committed, data persists to S3.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake Operations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Creating Delta Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL approach (creates a managed table in default schema)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Or from an existing data source&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`/path/to/parquet/`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- CTAS (Create Table As Select)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver_employees&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze_employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Overwrite entire table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Append to table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to a path (external table pattern)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/delta/employees/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read a table
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read from path
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/delta/employees/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM employees WHERE department = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Engineering&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  MERGE (Upsert)
&lt;/h3&gt;

&lt;p&gt;MERGE is one of Delta Lake's most powerful features — it allows you to upsert (update + insert) data efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;target_table&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;source_table&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
  &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
  &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SOURCE&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
  &lt;span class="k"&gt;DELETE&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key MERGE clauses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHEN MATCHED&lt;/code&gt; — row exists in both target and source&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHEN NOT MATCHED&lt;/code&gt; (or &lt;code&gt;WHEN NOT MATCHED BY TARGET&lt;/code&gt;) — row in source but not target (INSERT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHEN NOT MATCHED BY SOURCE&lt;/code&gt; — row in target but not in source (useful for DELETE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can have multiple &lt;code&gt;WHEN MATCHED&lt;/code&gt; clauses with different conditions.&lt;/p&gt;

&lt;h3&gt;
  
  
  DELETE and UPDATE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Delete rows&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Legacy'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Update rows&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Engineering'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are true DML operations — not possible with raw Parquet files!&lt;/p&gt;

&lt;h2&gt;
  
  
  Time Travel
&lt;/h2&gt;

&lt;p&gt;Time travel is the ability to &lt;strong&gt;query historical versions&lt;/strong&gt; of a Delta table. Every commit creates a new version.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query version 5&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;VERSION&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query as of a timestamp&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-15 10:00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- In Python&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"versionAsOf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;\&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"/path/to/table"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Audit and compliance — what did the data look like at a specific time?&lt;/li&gt;
&lt;li&gt;Recovery — accidentally deleted data? Roll back!&lt;/li&gt;
&lt;li&gt;Reproducibility — ML experiments can reference exact data versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How long can you go back?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time travel is limited by the &lt;strong&gt;data retention period&lt;/strong&gt; (default: 30 days). Old versions are removed by the &lt;code&gt;VACUUM&lt;/code&gt; command.&lt;/p&gt;

&lt;h3&gt;
  
  
  RESTORE
&lt;/h3&gt;

&lt;p&gt;You can restore a table to a previous version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;RESTORE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;VERSION&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;RESTORE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-10'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a new commit that restores the state — it doesn't delete history.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schema Enforcement and Evolution
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Schema Enforcement (Schema Validation)
&lt;/h3&gt;

&lt;p&gt;By default, Delta Lake &lt;strong&gt;rejects writes&lt;/strong&gt; that don't match the table's schema. This prevents silent data corruption.&lt;/p&gt;

&lt;p&gt;If you try to write a DataFrame with extra columns or incompatible types to a Delta table, you'll get a &lt;code&gt;AnalysisException&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Schema Evolution
&lt;/h3&gt;

&lt;p&gt;You can allow schema changes using &lt;code&gt;mergeSchema&lt;/code&gt; or &lt;code&gt;overwriteSchema&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add new columns to the table schema automatically
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Completely replace the schema (dangerous!)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;mergeSchema&lt;/strong&gt; — adds new columns from the incoming data to the schema. Existing columns are preserved.&lt;br&gt;
&lt;strong&gt;overwriteSchema&lt;/strong&gt; — replaces the entire schema (requires overwrite mode).&lt;/p&gt;

&lt;p&gt;For SQL, use &lt;code&gt;SET TBLPROPERTIES ('delta.columnMapping.mode' = 'name')&lt;/code&gt; for column renaming and dropping.&lt;/p&gt;
&lt;h2&gt;
  
  
  Delta Lake Table Properties and Optimization
&lt;/h2&gt;
&lt;h3&gt;
  
  
  OPTIMIZE
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;OPTIMIZE&lt;/code&gt; compacts small Parquet files into larger ones. Small files are a common performance problem — each Parquet file write from a Spark task creates a file, and tables with many small files are slow to read.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- With Z-ORDER (more on this below)&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;OPTIMIZE&lt;/code&gt; is idempotent and safe to run at any time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Z-Ordering
&lt;/h3&gt;

&lt;p&gt;Z-ORDER is a multi-dimensional data clustering technique. It co-locates related data (based on specified columns) within Parquet files. This dramatically improves query performance when filtering on those columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After Z-ordering by &lt;code&gt;department&lt;/code&gt;, queries like &lt;code&gt;WHERE department = 'Engineering'&lt;/code&gt; will skip most files (data skipping), reading only files that contain that department's data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Z-ORDER is most effective for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-cardinality columns used in WHERE filters&lt;/li&gt;
&lt;li&gt;Columns used in JOIN conditions&lt;/li&gt;
&lt;li&gt;Up to 3-4 columns (diminishing returns beyond that)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  VACUUM
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;VACUUM&lt;/code&gt; removes old Parquet files that are no longer referenced by the current or recent transaction log entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Remove files older than the default retention period (7 days for VACUUM, 30 days for time travel)&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Specify a retention period&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;RETAIN&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt; &lt;span class="n"&gt;HOURS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- 7 days&lt;/span&gt;

&lt;span class="c1"&gt;-- DANGER: DRY RUN first!&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;DRY&lt;/span&gt; &lt;span class="n"&gt;RUN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The default &lt;code&gt;spark.databricks.delta.retentionDurationCheck.enabled&lt;/code&gt; prevents VACUUM from running with less than 7-day retention. Override carefully.&lt;/p&gt;

&lt;p&gt;After VACUUM, time travel beyond the retention period is impossible.&lt;/p&gt;

&lt;h3&gt;
  
  
  ANALYZE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;COMPUTE&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;COMPUTE&lt;/span&gt; &lt;span class="k"&gt;STATISTICS&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;COLUMNS&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Computes statistics (min, max, null count, distinct count) that the Spark query optimizer uses for better query planning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto Optimize (Databricks-specific Delta feature)
&lt;/h3&gt;

&lt;p&gt;Databricks adds two auto-optimization features:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto Compaction&lt;/strong&gt; — automatically runs a compaction after writes if files are too small&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delta.autoOptimize.autoCompact'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Optimized Writes&lt;/strong&gt; — repartitions data before writing to produce right-sized files&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'delta.autoOptimize.optimizeWrite'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or enable globally in Spark config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.databricks.delta.optimizeWrite.enabled = true
spark.databricks.delta.autoCompact.enabled = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Liquid Clustering (Newer Feature)
&lt;/h3&gt;

&lt;p&gt;Liquid Clustering is a newer alternative to Z-Ordering and partitioning that provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Flexible, incremental clustering&lt;/li&gt;
&lt;li&gt;No need to know clustering columns upfront&lt;/li&gt;
&lt;li&gt;Better performance than Z-ORDER for most workloads&lt;/li&gt;
&lt;li&gt;Can change clustering columns without rewriting the whole table
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Re-cluster as data grows&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delta Table History
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Shows all versions of the table with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version number&lt;/li&gt;
&lt;li&gt;Timestamp&lt;/li&gt;
&lt;li&gt;User who made the change&lt;/li&gt;
&lt;li&gt;Operation (WRITE, DELETE, MERGE, OPTIMIZE, etc.)&lt;/li&gt;
&lt;li&gt;Operation parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Table Partitioning
&lt;/h2&gt;

&lt;p&gt;Partitioning physically separates data into subdirectories based on a column value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Data is stored as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://bucket/sales/region=US/date=2024-01-01/*.parquet
s3://bucket/sales/region=EU/date=2024-01-01/*.parquet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to partition:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tables with billions of rows&lt;/li&gt;
&lt;li&gt;Columns with low cardinality (&amp;lt; 10,000 distinct values)&lt;/li&gt;
&lt;li&gt;Columns consistently used as filters&lt;/li&gt;
&lt;li&gt;Columns used in &lt;code&gt;GROUP BY&lt;/code&gt; at the partition level (e.g., date for daily processing)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When NOT to partition:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small tables (&amp;lt; 1GB)&lt;/li&gt;
&lt;li&gt;High-cardinality columns (creates too many small files)&lt;/li&gt;
&lt;li&gt;Columns rarely used as filters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exam tip:&lt;/strong&gt; Over-partitioning is a common mistake. Liquid Clustering is generally preferred for new tables in Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Change Data Feed (CDF)
&lt;/h2&gt;

&lt;p&gt;Change Data Feed (also called Change Data Capture) captures every row-level change (INSERT, UPDATE, DELETE) to a Delta table and exposes it as a readable stream or batch.&lt;/p&gt;

&lt;p&gt;Enable it on a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enableChangeDataFeed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- OR at creation:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enableChangeDataFeed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Batch - read changes between versions
&lt;/span&gt;&lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readChangeFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming - continuously process changes
&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readChangeFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each change row has additional columns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;_change_type&lt;/code&gt; — &lt;code&gt;insert&lt;/code&gt;, &lt;code&gt;update_preimage&lt;/code&gt;, &lt;code&gt;update_postimage&lt;/code&gt;, &lt;code&gt;delete&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_commit_version&lt;/code&gt; — which version of the table&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_commit_timestamp&lt;/code&gt; — when the change was committed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient Silver → Gold propagation (only process changed rows)&lt;/li&gt;
&lt;li&gt;Building CDC pipelines&lt;/li&gt;
&lt;li&gt;Audit logs&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Chapter 6 — Tables, Schemas &amp;amp; Databases
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Managed vs. External Tables
&lt;/h2&gt;

&lt;p&gt;This is a critical concept for the exam.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managed Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Databricks (via the metastore) controls &lt;strong&gt;both the metadata AND the data files&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data is stored in the &lt;strong&gt;default warehouse location&lt;/strong&gt; (DBFS or Unity Catalog managed storage)&lt;/li&gt;
&lt;li&gt;When you &lt;code&gt;DROP TABLE managed_table&lt;/code&gt;, &lt;strong&gt;both metadata AND data are deleted&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;LOCATION&lt;/code&gt; keyword in the CREATE statement
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;managed_employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- Data stored at: dbfs:/user/hive/warehouse/managed_employees/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  External Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Databricks controls the &lt;strong&gt;metadata only&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data lives at a &lt;strong&gt;user-specified location&lt;/strong&gt; (S3, ADLS, etc.)&lt;/li&gt;
&lt;li&gt;When you &lt;code&gt;DROP TABLE external_table&lt;/code&gt;, &lt;strong&gt;only metadata is deleted; data remains in S3&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Uses &lt;code&gt;LOCATION&lt;/code&gt; keyword
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;external_employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-data-bucket/employees/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use external tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data shared with other tools (Athena, Glue, EMR)&lt;/li&gt;
&lt;li&gt;Data that should outlive the metastore&lt;/li&gt;
&lt;li&gt;Data owned by another team&lt;/li&gt;
&lt;li&gt;Large datasets already at a specific S3 path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use managed tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data fully managed by Databricks&lt;/li&gt;
&lt;li&gt;Simplicity — don't want to manage S3 paths manually&lt;/li&gt;
&lt;li&gt;Unity Catalog managed storage is preferred&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Views
&lt;/h3&gt;

&lt;p&gt;Databricks supports three view types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regular Views&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;engineering_employees&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Engineering'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Stored in the metastore&lt;/li&gt;
&lt;li&gt;Query is re-executed each time&lt;/li&gt;
&lt;li&gt;Can reference tables across schemas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Temporary Views&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;temp_high_earners&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Only visible to the current SparkSession (your notebook session)&lt;/li&gt;
&lt;li&gt;Dropped when the SparkSession ends&lt;/li&gt;
&lt;li&gt;Not visible to other users or notebooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Global Temporary Views&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;GLOBAL&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;global_active&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Access via global_temp schema&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;global_temp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_active&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Shared across SparkSessions &lt;strong&gt;on the same cluster&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Dropped when the cluster terminates&lt;/li&gt;
&lt;li&gt;Rarely used — prefer regular views for persistence&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Schemas (Databases)
&lt;/h2&gt;

&lt;p&gt;In Databricks, &lt;strong&gt;Schema&lt;/strong&gt; and &lt;strong&gt;Database&lt;/strong&gt; are synonymous (SQL standard uses Schema; Databricks/Hive historically uses Database). The two-level namespace is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;database.table
employees_db.employees
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Unity Catalog, it becomes three-level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;catalog.schema.table
main.employees_db.employees
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Creating and Using Schemas
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;employees_db&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Employee data'&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/employees_db/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- optional, for external schema&lt;/span&gt;

&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;employees_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- sets default schema for subsequent queries&lt;/span&gt;

&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;employees_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;employees_db&lt;/span&gt; &lt;span class="k"&gt;CASCADE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- CASCADE drops all tables in it&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DESCRIBE Commands
&lt;/h2&gt;

&lt;p&gt;Critical for exploring metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Basic table info (column names and types)&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Detailed table info (storage location, format, partitions, etc.)&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;DETAIL&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Table history (all versions)&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Extended info (everything + table properties)&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;EXTENDED&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  SHOW Commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                    &lt;span class="c1"&gt;-- tables in current schema&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;employees_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;-- tables in specific schema&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;SCHEMAS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                   &lt;span class="c1"&gt;-- all schemas&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                 &lt;span class="c1"&gt;-- same as SHOW SCHEMAS&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;COLUMNS&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;-- columns in a table&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;PARTITIONS&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;     &lt;span class="c1"&gt;-- partitions&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- table properties&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- show CREATE statement&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Table Cloning
&lt;/h2&gt;

&lt;p&gt;Delta Lake supports two cloning modes:&lt;/p&gt;

&lt;h3&gt;
  
  
  Shallow Clone
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees_backup&lt;/span&gt;
&lt;span class="n"&gt;SHALLOW&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Creates a new table that &lt;strong&gt;references the same Parquet files&lt;/strong&gt; as the original&lt;/li&gt;
&lt;li&gt;Only the metadata (transaction log) is copied — no data movement&lt;/li&gt;
&lt;li&gt;Extremely fast and cheap&lt;/li&gt;
&lt;li&gt;Good for: creating test copies, point-in-time snapshots for experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; If you run VACUUM on the original table, shallow clones may break! The underlying files may be deleted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deep Clone
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees_backup&lt;/span&gt;
&lt;span class="n"&gt;DEEP&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fully copies&lt;/strong&gt; both metadata AND data files&lt;/li&gt;
&lt;li&gt;Independent of the original table&lt;/li&gt;
&lt;li&gt;Good for: migration, backup, production copies&lt;/li&gt;
&lt;li&gt;Can be expensive for large tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also clone incrementally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees_backup&lt;/span&gt;
&lt;span class="n"&gt;DEEP&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this repeatedly and it only copies new/changed files.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 7 — Databricks Utilities (dbutils)
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;dbutils&lt;/code&gt; is a utility library available in all Databricks notebooks. It provides programmatic access to many platform features.&lt;/p&gt;

&lt;h2&gt;
  
  
  dbutils.fs — File System Utilities
&lt;/h2&gt;

&lt;p&gt;Think of this as a programmatic interface to DBFS and cloud storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List files and directories
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/data/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Move/rename
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/source/file.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/dest/file.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Copy
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/source/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/dest/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recurse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/path/to/file.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/path/to/folder/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recurse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create directory
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/my/new/directory/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read file content (small files only!)
&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/path/to/file.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxBytes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;65536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Mount an S3 bucket (legacy - prefer External Locations with UC)
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://my-bucket/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mount_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;extra_configs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fs.s3a.aws.credentials.provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List mounts
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mounts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Unmount
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unmount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/my-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;%fs&lt;/code&gt; magic command is a shorthand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%fs ls /
%fs rm -r /path/to/delete/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  dbutils.secrets — Secrets Management
&lt;/h2&gt;

&lt;p&gt;One of the most security-critical utilities. Secrets allow you to store sensitive values (API keys, passwords, connection strings) &lt;strong&gt;without hardcoding them in notebooks or code&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Secret Scopes
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;secret scope&lt;/strong&gt; is a named collection of secrets. Two types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks-backed scopes&lt;/strong&gt; — secrets stored in the Databricks control plane, encrypted at rest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager-backed scopes&lt;/strong&gt; — secrets stored in AWS Secrets Manager; Databricks fetches them dynamically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a Databricks-backed scope (Databricks CLI)&lt;/span&gt;
databricks secrets create-scope &lt;span class="nt"&gt;--scope&lt;/span&gt; my-scope

&lt;span class="c"&gt;# Put a secret&lt;/span&gt;
databricks secrets put &lt;span class="nt"&gt;--scope&lt;/span&gt; my-scope &lt;span class="nt"&gt;--key&lt;/span&gt; db-password

&lt;span class="c"&gt;# List secrets (only shows keys, never values)&lt;/span&gt;
databricks secrets list &lt;span class="nt"&gt;--scope&lt;/span&gt; my-scope
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Secrets in Notebooks
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Retrieve a secret - returns a string
&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# IMPORTANT: If you print a secret, Databricks redacts it
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Output: [REDACTED]
&lt;/span&gt;
&lt;span class="c1"&gt;# List scopes
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listScopes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# List keys in a scope (doesn't show values)
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical exam point:&lt;/strong&gt; You can &lt;strong&gt;never see the actual value&lt;/strong&gt; of a secret from within a notebook. &lt;code&gt;print()&lt;/code&gt; always shows &lt;code&gt;[REDACTED]&lt;/code&gt;. This is by design for security.&lt;/p&gt;

&lt;h2&gt;
  
  
  dbutils.widgets — Notebook Parameters
&lt;/h2&gt;

&lt;p&gt;Covered in the Notebooks chapter. Key methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Display Label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# retrieve value
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  dbutils.notebook — Notebook Orchestration
&lt;/h2&gt;

&lt;p&gt;This enables calling other notebooks programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run another notebook with a timeout (seconds)
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/notebook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Exit a notebook with a return value
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SUCCESS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Difference between &lt;code&gt;%run&lt;/code&gt; and &lt;code&gt;dbutils.notebook.run()&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;&lt;code&gt;%run&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;dbutils.notebook.run()&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Executes in same session&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (new session)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Variables shared&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can capture return value&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parallel execution&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (multiple calls)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Timeout support&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pass parameters&lt;/td&gt;
&lt;td&gt;No (use widgets)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;%run&lt;/code&gt; is for modular code sharing. &lt;code&gt;dbutils.notebook.run()&lt;/code&gt; is for orchestration.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  dbutils.data — Preview Data
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Preview a small sample of a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summarize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  dbutils.library — Library Management (Deprecated)
&lt;/h2&gt;

&lt;p&gt;In newer runtimes, use &lt;code&gt;%pip&lt;/code&gt; instead. But for completeness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Deprecated, use %pip
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;installPyPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.3.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  dbutils.jobs — Job Context (Within Job Runs)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the current run ID when executing as part of a job
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upstream_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is used in &lt;strong&gt;multi-task jobs&lt;/strong&gt; to pass values between tasks.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 8 — Git Folders / Repos
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Are Git Folders?
&lt;/h2&gt;

&lt;p&gt;Git Folders (formerly called Databricks Repos) is Databricks' &lt;strong&gt;native Git integration&lt;/strong&gt;. It allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect a Databricks workspace folder to a Git repository&lt;/li&gt;
&lt;li&gt;Pull, push, commit, create branches, and merge within the Databricks UI&lt;/li&gt;
&lt;li&gt;Version-control your notebooks alongside application code&lt;/li&gt;
&lt;li&gt;Implement CI/CD workflows for Databricks code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Supported Git Providers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;GitHub (github.com and GitHub Enterprise)&lt;/li&gt;
&lt;li&gt;GitLab&lt;/li&gt;
&lt;li&gt;Bitbucket (Cloud and Server)&lt;/li&gt;
&lt;li&gt;Azure DevOps&lt;/li&gt;
&lt;li&gt;AWS CodeCommit&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting Up Git Integration
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;User Settings → Git Integration&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Select your Git provider&lt;/li&gt;
&lt;li&gt;Enter your &lt;strong&gt;personal access token&lt;/strong&gt; (PAT) or OAuth credentials&lt;/li&gt;
&lt;li&gt;Databricks stores credentials in the control plane (encrypted)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Working with Git Folders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cloning a Repo
&lt;/h3&gt;

&lt;p&gt;In the Databricks UI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;Repos&lt;/strong&gt; section (or &lt;strong&gt;Workspace → Git Folders&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Click "Add Repo"&lt;/li&gt;
&lt;li&gt;Enter the repository URL&lt;/li&gt;
&lt;li&gt;Optionally specify a branch&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This creates a folder in your workspace that mirrors the Git repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Git Operations in the UI
&lt;/h3&gt;

&lt;p&gt;Within a Git Folder, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; — fetch and merge latest changes from remote&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt; — push local commits to remote&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commit&lt;/strong&gt; — stage and commit changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Branch&lt;/strong&gt; — create or switch branches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt; — merge branches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reset&lt;/strong&gt; — revert uncommitted changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What Gets Versioned
&lt;/h3&gt;

&lt;p&gt;Inside a Git Folder:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notebooks&lt;/strong&gt; — stored as &lt;code&gt;.py&lt;/code&gt;, &lt;code&gt;.sql&lt;/code&gt;, &lt;code&gt;.scala&lt;/code&gt;, or &lt;code&gt;.r&lt;/code&gt; files with special Databricks metadata comments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python modules&lt;/strong&gt; — regular &lt;code&gt;.py&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL files&lt;/strong&gt; — &lt;code&gt;.sql&lt;/code&gt; files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Config files&lt;/strong&gt; — &lt;code&gt;requirements.txt&lt;/code&gt;, &lt;code&gt;setup.py&lt;/code&gt;, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Regular workspace notebooks (outside Git Folders) are &lt;strong&gt;not&lt;/strong&gt; in Git. Only notebooks inside a Git Folder are version-controlled.&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD with Git Folders
&lt;/h3&gt;

&lt;p&gt;A typical CI/CD workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer (feature branch)
  → Git Folder in dev workspace
  → Runs tests in dev
  → Creates Pull Request on GitHub
  → PR review and approval
  → Merge to main
  → CI/CD pipeline (GitHub Actions / AWS CodePipeline)
    → Runs automated tests
    → Deploys to staging workspace (via Databricks CLI)
    → Deploys to production workspace (via Databricks CLI)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Databricks CLI for CI/CD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CLI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;databricks-cli

&lt;span class="c"&gt;# Configure&lt;/span&gt;
databricks configure &lt;span class="nt"&gt;--token&lt;/span&gt;

&lt;span class="c"&gt;# Import a notebook&lt;/span&gt;
databricks workspace import local_notebook.py /Remote/Path/notebook &lt;span class="nt"&gt;--language&lt;/span&gt; PYTHON

&lt;span class="c"&gt;# Export a notebook&lt;/span&gt;
databricks workspace &lt;span class="nb"&gt;export&lt;/span&gt; /Remote/Path/notebook local_notebook.py

&lt;span class="c"&gt;# Deploy a job&lt;/span&gt;
databricks &lt;span class="nb"&gt;jobs &lt;/span&gt;create &lt;span class="nt"&gt;--json-file&lt;/span&gt; job_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Databricks Asset Bundles (DABs)&lt;/strong&gt; is the modern approach for CI/CD — a YAML-based framework for deploying Databricks resources (jobs, pipelines, notebooks, etc.) across environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Databricks Asset Bundles (DABs)
&lt;/h3&gt;

&lt;p&gt;DABs use a &lt;code&gt;databricks.yml&lt;/code&gt; configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bundle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-data-pipeline&lt;/span&gt;

&lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://your-workspace.azuredatabricks.net&lt;/span&gt;

&lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;development&lt;/span&gt;
    &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://dev-workspace.databricks.com&lt;/span&gt;

  &lt;span class="na"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
    &lt;span class="na"&gt;workspace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://prod-workspace.databricks.com&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;my_etl_job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My ETL Job&lt;/span&gt;
      &lt;span class="na"&gt;tasks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;task_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ingest&lt;/span&gt;
          &lt;span class="na"&gt;notebook_task&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;notebook_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./notebooks/ingest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deploy with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;databricks bundle deploy &lt;span class="nt"&gt;--target&lt;/span&gt; dev
databricks bundle run my_etl_job &lt;span class="nt"&gt;--target&lt;/span&gt; dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  Chapter 9 — Unity Catalog &amp;amp; Governance
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Is Unity Catalog?
&lt;/h2&gt;

&lt;p&gt;Unity Catalog (UC) is Databricks' &lt;strong&gt;unified governance solution&lt;/strong&gt; for data and AI. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized metastore&lt;/strong&gt; — single source of truth for all metadata across workspaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained access control&lt;/strong&gt; — table, column, and row-level security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lineage&lt;/strong&gt; — automatic tracking of data origin and transformations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auditing&lt;/strong&gt; — full audit log of all data access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data discovery&lt;/strong&gt; — searchable catalog with tagging and documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before Unity Catalog, each Databricks workspace had its own &lt;strong&gt;Hive Metastore&lt;/strong&gt; — workspace-local, no cross-workspace sharing, no fine-grained governance. Unity Catalog replaces this with a centralized, cross-workspace solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three-Level Namespace
&lt;/h2&gt;

&lt;p&gt;Unity Catalog introduces a &lt;strong&gt;three-level namespace&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;catalog.schema.table

main.sales.transactions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to the legacy two-level namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schema.table  (or database.table)
sales.transactions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Catalog&lt;/strong&gt; — the top-level container. You can have multiple catalogs for different environments, business units, or data domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema&lt;/strong&gt; — equivalent to a database in the old model. Contains tables, views, and volumes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table/View/Volume&lt;/strong&gt; — the actual data objects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Navigating the Namespace
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Set the default catalog&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Set the default schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Fully qualified reference&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Partially qualified (using defaults)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Unity Catalog Object Hierarchy
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metastore (one per region)
  └── Catalog
        └── Schema
              ├── Table (Managed or External)
              ├── View
              ├── Volume
              └── Function
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metastore&lt;/strong&gt; — The top-level Unity Catalog object. Created once per Databricks account region. All workspaces in the same account and region share the same metastore (or different metastores if you create multiple).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Catalog&lt;/strong&gt; — First-level namespace. Create catalogs for environments (dev, staging, prod) or business units.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema&lt;/strong&gt; — Second-level namespace. Organize tables by domain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Table&lt;/strong&gt; — Same as before, but now with UC governance features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Volume&lt;/strong&gt; — A governance-enabled storage abstraction for &lt;strong&gt;non-tabular data&lt;/strong&gt; (files, images, audio). Similar to a managed directory with access control.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audio_files&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- Access volume files&lt;/span&gt;
&lt;span class="n"&gt;ls&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Volumes&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;audio_files&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Function&lt;/strong&gt; — User-defined functions stored in the catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Managed vs. External Storage in Unity Catalog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  UC Managed Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data stored in the UC metastore's &lt;strong&gt;managed storage location&lt;/strong&gt; (an S3 bucket you configure)&lt;/li&gt;
&lt;li&gt;When dropped, data is deleted&lt;/li&gt;
&lt;li&gt;UC manages all file organization and optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  External Tables in Unity Catalog
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data lives in an &lt;strong&gt;External Location&lt;/strong&gt; (governed S3 path)&lt;/li&gt;
&lt;li&gt;When dropped, only metadata is removed; data persists&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  External Locations
&lt;/h3&gt;

&lt;p&gt;An &lt;strong&gt;External Location&lt;/strong&gt; is a registered S3 path that UC governs. It requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;strong&gt;Storage Credential&lt;/strong&gt; — an IAM role that Databricks can use to access S3&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;External Location&lt;/strong&gt; — the combination of Storage Credential + S3 path
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- First, an admin creates a storage credential (IAM role ARN)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;STORAGE&lt;/span&gt; &lt;span class="n"&gt;CREDENTIAL&lt;/span&gt; &lt;span class="n"&gt;my_s3_credential&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456789:role/my-databricks-s3-role'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Then create an external location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="n"&gt;my_data_location&lt;/span&gt;
&lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-data-bucket/data/'&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;STORAGE&lt;/span&gt; &lt;span class="n"&gt;CREDENTIAL&lt;/span&gt; &lt;span class="n"&gt;my_s3_credential&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Now create an external table at that location&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-data-bucket/data/transactions/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits over DBFS mounts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access control — only users granted permissions to the External Location can access data&lt;/li&gt;
&lt;li&gt;Audit logging — all accesses are logged&lt;/li&gt;
&lt;li&gt;Cross-workspace — works across all workspaces connected to the metastore&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Access Control in Unity Catalog
&lt;/h2&gt;

&lt;p&gt;UC uses a privilege-based model with &lt;code&gt;GRANT&lt;/code&gt; and &lt;code&gt;REVOKE&lt;/code&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  Privilege Types
&lt;/h3&gt;

&lt;p&gt;On &lt;strong&gt;Catalogs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;USE CATALOG&lt;/code&gt; — allows using the catalog (required for all access)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CREATE SCHEMA&lt;/code&gt; — create schemas within the catalog&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ALL PRIVILEGES&lt;/code&gt; — grants all privileges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On &lt;strong&gt;Schemas:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;USE SCHEMA&lt;/code&gt; — allows using the schema&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CREATE TABLE&lt;/code&gt; — create tables&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CREATE VIEW&lt;/code&gt; — create views&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALL PRIVILEGES&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On &lt;strong&gt;Tables:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SELECT&lt;/code&gt; — read data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MODIFY&lt;/code&gt; — INSERT, UPDATE, DELETE&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ALL PRIVILEGES&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On &lt;strong&gt;External Locations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;READ FILES&lt;/code&gt; — read files from this location&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WRITE FILES&lt;/code&gt; — write files to this location&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CREATE EXTERNAL TABLE&lt;/code&gt; — create external tables at this location&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Granting Privileges
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Grant table access to a user&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`user@example.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant schema access to a group&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_analysts`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant catalog access&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_team`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Revoke&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`user@example.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Show grants&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;GRANTS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Privilege Inheritance
&lt;/h3&gt;

&lt;p&gt;Privileges cascade down the hierarchy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Granting &lt;code&gt;ALL PRIVILEGES&lt;/code&gt; on a catalog implicitly covers all schemas, tables, and views within it&lt;/li&gt;
&lt;li&gt;However, &lt;code&gt;USE CATALOG&lt;/code&gt; doesn't automatically grant &lt;code&gt;USE SCHEMA&lt;/code&gt; — each level must be granted explicitly&lt;/li&gt;
&lt;li&gt;To read a table, a user needs: &lt;code&gt;USE CATALOG&lt;/code&gt; + &lt;code&gt;USE SCHEMA&lt;/code&gt; + &lt;code&gt;SELECT ON TABLE&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Row-Level Security and Column Masking
&lt;/h3&gt;

&lt;p&gt;Unity Catalog supports &lt;strong&gt;fine-grained access control&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Row Filters&lt;/strong&gt; — restrict which rows a user can see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a row filter function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;filter_by_region&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'us_team'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'US'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Apply to table&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="n"&gt;filter_by_region&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Column Masks&lt;/strong&gt; — mask sensitive column values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a mask function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;mask_ssn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IS_ACCOUNT_GROUP_MEMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hr_team'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'***-**-****'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Apply to column&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;hr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;mask_ssn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Lineage
&lt;/h2&gt;

&lt;p&gt;Unity Catalog &lt;strong&gt;automatically tracks data lineage&lt;/strong&gt; — how data flows between tables, views, and notebooks. No configuration required.&lt;/p&gt;

&lt;p&gt;You can view lineage in the Catalog Explorer UI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table lineage&lt;/strong&gt; — upstream and downstream tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column lineage&lt;/strong&gt; — trace a specific column back to its source&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is invaluable for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Impact analysis (what breaks if I change this table?)&lt;/li&gt;
&lt;li&gt;Debugging data quality issues&lt;/li&gt;
&lt;li&gt;Compliance and auditing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Audit Logs
&lt;/h2&gt;

&lt;p&gt;All Unity Catalog operations are logged to a &lt;strong&gt;system table&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query the audit log (system catalog)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'databricksUserEvent'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Tags and Documentation
&lt;/h2&gt;

&lt;p&gt;UC supports tagging objects with metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TAGS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pii'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'domain'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sales'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'owner'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sales-team@company.com'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="s1"&gt;'All sales transaction records from POS systems'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="s1"&gt;'Unique customer identifier, PII'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Delta Sharing
&lt;/h2&gt;

&lt;p&gt;Delta Sharing is an open standard for sharing Delta Lake data across organizations &lt;strong&gt;without copying the data&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create a share&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SHARE&lt;/span&gt; &lt;span class="n"&gt;my_data_share&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Add tables to the share&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SHARE&lt;/span&gt; &lt;span class="n"&gt;my_data_share&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transactions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create a recipient&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;RECIPIENT&lt;/span&gt; &lt;span class="n"&gt;partner_company&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="s1"&gt;'partner-databricks-id'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant access&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SHARE&lt;/span&gt; &lt;span class="n"&gt;my_data_share&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;RECIPIENT&lt;/span&gt; &lt;span class="n"&gt;partner_company&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recipient accesses the data via the Delta Sharing protocol from their own Databricks (or other supported tool) — they never get direct S3 access.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 10 — Job Orchestration &amp;amp; Workflows
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Databricks Workflows
&lt;/h2&gt;

&lt;p&gt;Databricks Workflows (formerly Jobs) is the native &lt;strong&gt;orchestration engine&lt;/strong&gt; for running Databricks workloads. It supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single-task or multi-task jobs&lt;/li&gt;
&lt;li&gt;Scheduling (cron syntax)&lt;/li&gt;
&lt;li&gt;Triggered runs&lt;/li&gt;
&lt;li&gt;Retry policies&lt;/li&gt;
&lt;li&gt;Alerting (email, webhooks, Slack)&lt;/li&gt;
&lt;li&gt;Job clusters (ephemeral) or all-purpose clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Job Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Task
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;task&lt;/strong&gt; is the atomic unit of work in a Workflow. Each task can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Notebook task&lt;/strong&gt; — runs a Databricks notebook&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python script task&lt;/strong&gt; — runs a &lt;code&gt;.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python wheel task&lt;/strong&gt; — runs a function from a packaged Python wheel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark JAR task&lt;/strong&gt; — runs a Java/Scala Spark application&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL task&lt;/strong&gt; — runs SQL (against a SQL Warehouse)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Live Tables task&lt;/strong&gt; — triggers a DLT pipeline update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt task&lt;/strong&gt; — runs a dbt project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline task&lt;/strong&gt; — runs a DLT pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run job task&lt;/strong&gt; — triggers another job (job chaining)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For each task&lt;/strong&gt; — iterates over a list and runs a sub-task for each item&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-Task Jobs (Task Graphs)
&lt;/h3&gt;

&lt;p&gt;Jobs can have &lt;strong&gt;multiple tasks connected by dependencies&lt;/strong&gt;, forming a DAG (Directed Acyclic Graph):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ingest_raw] → [validate_data] → [transform_silver] → [aggregate_gold]
                                                    ↘
                                                      [generate_report]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tasks can run &lt;strong&gt;in parallel&lt;/strong&gt; (if no dependency between them) or &lt;strong&gt;sequentially&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example job configuration structure
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily ETL Pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tasks&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ETL/01_ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;depends_on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ETL/02_transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aggregate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;depends_on&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notebook_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/ETL/03_aggregate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Task Values (Passing Data Between Tasks)
&lt;/h3&gt;

&lt;p&gt;Tasks in a multi-task job can pass data to downstream tasks using task values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In the upstream task (e.g., "ingest")
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_ingested&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/data/silver/2024-01-15/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# In the downstream task (e.g., "transform")
&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_ingested&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Scheduling Jobs
&lt;/h2&gt;

&lt;p&gt;Jobs can be triggered by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manual trigger&lt;/strong&gt; — click "Run Now" in the UI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron schedule&lt;/strong&gt; — standard cron syntax&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File arrival trigger&lt;/strong&gt; — runs when a new file appears in a storage path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous trigger&lt;/strong&gt; — reruns as soon as the previous run completes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API trigger&lt;/strong&gt; — POST to the Jobs API
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cron examples
0 0 * * *       # Daily at midnight UTC
0 6 * * 1       # Every Monday at 6 AM
*/15 * * * *    # Every 15 minutes
0 2 1 * *       # First of every month at 2 AM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cluster Configuration for Jobs
&lt;/h2&gt;

&lt;p&gt;Each task can have its own cluster configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Job cluster&lt;/strong&gt; — created and terminated per-run (recommended for production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All-purpose cluster&lt;/strong&gt; — use an existing running cluster (development only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instance pool&lt;/strong&gt; — use instances from a pool for faster startup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For &lt;strong&gt;multi-task jobs&lt;/strong&gt;, different tasks can use different cluster configurations (e.g., a small cluster for data validation, a large cluster for heavy transformation).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared job clusters&lt;/strong&gt; — multiple tasks in the same job run share a single cluster (cost-efficient when tasks don't need isolation).&lt;/p&gt;

&lt;h2&gt;
  
  
  Retry and Error Handling
&lt;/h2&gt;

&lt;p&gt;Each task supports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"max_retries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"min_retry_interval_millis"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retry_on_timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timeout_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Conditional task execution&lt;/strong&gt; — you can run a task based on the outcome of a previous task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;all_success&lt;/code&gt; (default) — run only if all upstream tasks succeeded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;all_done&lt;/code&gt; — run regardless of upstream status&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;at_least_one_success&lt;/code&gt; — run if at least one upstream succeeded&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;all_failed&lt;/code&gt; — run only if all upstream failed (useful for cleanup/alerting on failure)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;at_least_one_failed&lt;/code&gt; — run if at least one upstream failed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Job Parameters
&lt;/h2&gt;

&lt;p&gt;Jobs accept &lt;strong&gt;parameters&lt;/strong&gt; at two levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Job-level parameters&lt;/strong&gt; — available to all tasks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"run_date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-01-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"environment"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Task-level parameters&lt;/strong&gt; — specific to a task (e.g., widget values for a notebook task).&lt;/p&gt;

&lt;p&gt;Access in notebooks via &lt;code&gt;dbutils.widgets.get()&lt;/code&gt; or as job parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring and Alerting
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Job Run Status:&lt;/strong&gt; Each run shows &lt;code&gt;Pending&lt;/code&gt;, &lt;code&gt;Running&lt;/code&gt;, &lt;code&gt;Succeeded&lt;/code&gt;, &lt;code&gt;Failed&lt;/code&gt;, &lt;code&gt;Timedout&lt;/code&gt;, or &lt;code&gt;Canceled&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Email Notifications:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"email_notifications"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"team@company.com"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_failure"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"oncall@company.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"manager@company.com"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"on_start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"no_alert_for_skipped_runs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Webhook notifications&lt;/strong&gt; — send HTTP callbacks to Slack, PagerDuty, custom endpoints on job events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System Tables for Monitoring (Unity Catalog):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lakeflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;job_run_timeline&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;workspace_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_workspace_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;period_start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;period_start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Repair Runs
&lt;/h2&gt;

&lt;p&gt;If a job fails partway through a multi-task DAG, you can &lt;strong&gt;repair&lt;/strong&gt; the run — re-running only the failed tasks (and their dependents) without re-running successful tasks. This is huge for cost savings in long pipelines.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 11 — Data Ingestion Patterns
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Bronze-Silver-Gold (Medallion Architecture)
&lt;/h2&gt;

&lt;p&gt;This is the standard data organization pattern in Databricks:&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer (Raw Ingestion)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exact copy&lt;/strong&gt; of source data — no transformations&lt;/li&gt;
&lt;li&gt;All formats preserved (JSON, CSV, XML, Avro, Parquet)&lt;/li&gt;
&lt;li&gt;Append-only (never delete from Bronze)&lt;/li&gt;
&lt;li&gt;May store as Delta (preferred) or original format&lt;/li&gt;
&lt;li&gt;Adds ingestion metadata: &lt;code&gt;ingestion_timestamp&lt;/code&gt;, &lt;code&gt;source_file&lt;/code&gt;, &lt;code&gt;batch_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;This is your "system of record" — if anything goes wrong downstream, reprocess from Bronze&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Silver Layer (Cleansed &amp;amp; Enriched)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data validated, cleaned, and normalized&lt;/li&gt;
&lt;li&gt;Schema enforced (Delta)&lt;/li&gt;
&lt;li&gt;Deduplication applied&lt;/li&gt;
&lt;li&gt;Null handling and type casting done&lt;/li&gt;
&lt;li&gt;Joins to reference/lookup tables&lt;/li&gt;
&lt;li&gt;Row-level quality checks&lt;/li&gt;
&lt;li&gt;Still somewhat raw — business logic not fully applied&lt;/li&gt;
&lt;li&gt;Query-friendly but not aggregated&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gold Layer (Business-Level Aggregates)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Aggregated, denormalized, business-specific views&lt;/li&gt;
&lt;li&gt;Optimized for specific use cases (reporting, ML features)&lt;/li&gt;
&lt;li&gt;Often organized by business domain or consumer&lt;/li&gt;
&lt;li&gt;Direct source for BI tools, dashboards, ML models&lt;/li&gt;
&lt;li&gt;Examples: daily sales totals, monthly active users, feature store tables&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Batch Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reading from S3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read CSV from S3
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/employees/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read JSON
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Parquet
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/transactions/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read with explicit schema (preferred - avoids schema inference overhead)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
  &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Auto Loader
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Auto Loader&lt;/strong&gt; is Databricks' incremental data ingestion mechanism. It efficiently processes &lt;strong&gt;new files as they arrive in S3&lt;/strong&gt;, tracking which files have been processed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Auto Loader with cloudFiles format
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/checkpoints/schema/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.inferColumnTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write the stream
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/checkpoints/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableNow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \  &lt;span class="c1"&gt;# Process all available files, then stop
&lt;/span&gt;  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Auto Loader features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema inference and evolution&lt;/strong&gt; — automatically detects and adapts to new columns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File notification mode&lt;/strong&gt; — uses AWS SQS/SNS for efficient, scalable file detection (recommended for large volumes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directory listing mode&lt;/strong&gt; — periodically lists the directory (simpler, lower setup, but less efficient for large buckets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema location&lt;/strong&gt; — stores the inferred schema separately so it persists across runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint&lt;/strong&gt; — tracks exactly which files have been processed (exactly-once guarantee)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;rescuedDataColumn&lt;/strong&gt; — captures data that didn't match the expected schema into a separate column&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auto Loader vs. COPY INTO:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Auto Loader&lt;/th&gt;
&lt;th&gt;COPY INTO&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;Streaming incremental loads&lt;/td&gt;
&lt;td&gt;Batch incremental loads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracking&lt;/td&gt;
&lt;td&gt;Checkpoint-based&lt;/td&gt;
&lt;td&gt;Built-in idempotency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;Millions of files&lt;/td&gt;
&lt;td&gt;Thousands of files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Schema evolution&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;td&gt;readStream&lt;/td&gt;
&lt;td&gt;SQL command&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  COPY INTO
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;COPY INTO&lt;/code&gt; is a SQL command for &lt;strong&gt;idempotent batch data loading&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/raw/events/'&lt;/span&gt;
&lt;span class="n"&gt;FILEFORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt;
&lt;span class="n"&gt;FORMAT_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'inferSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'mergeSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;COPY_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mergeSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent&lt;/strong&gt; — running the same COPY INTO multiple times doesn't duplicate data; it only loads new files&lt;/li&gt;
&lt;li&gt;Tracks loaded files internally&lt;/li&gt;
&lt;li&gt;Good for scheduled batch loading&lt;/li&gt;
&lt;li&gt;Simpler than Auto Loader but less scalable&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Streaming Ingestion
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Structured Streaming
&lt;/h3&gt;

&lt;p&gt;Spark Structured Streaming is the engine underlying streaming in Databricks. Write streaming logic using the same DataFrame API as batch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from Kafka (on MSK - Amazon Managed Streaming for Kafka)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker1:9092,broker2:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingOffsets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Parse the Kafka message value (typically JSON)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([...])&lt;/span&gt;
&lt;span class="n"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Delta
&lt;/span&gt;&lt;span class="n"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoints/kafka_stream/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Trigger Options
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Micro-batch triggered every 30 seconds
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processingTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;30 seconds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Process all available data, then stop (batch-like)
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableNow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Continuous processing (low-latency, experimental)
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;continuous&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 second&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Default: process as fast as possible (micro-batch)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Output Modes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;append&lt;/code&gt;&lt;/strong&gt; — only new rows are written (most common for Delta)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;complete&lt;/code&gt;&lt;/strong&gt; — all rows of the result are written/overwritten (for aggregations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;update&lt;/code&gt;&lt;/strong&gt; — only changed rows are written (requires Delta)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Checkpointing
&lt;/h3&gt;

&lt;p&gt;Checkpoints store the streaming state — which offsets have been processed, aggregate states, etc. &lt;strong&gt;Always set a checkpoint location&lt;/strong&gt; for production streaming jobs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/checkpoints/my-stream/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a streaming job crashes and restarts, it resumes from the checkpoint — exactly-once processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  foreachBatch
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;foreachBatch&lt;/code&gt; lets you apply arbitrary batch operations to each micro-batch of a stream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="c1"&gt;# Complex transformations, MERGE, etc.
&lt;/span&gt;  &lt;span class="n"&gt;batch_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;# Send metrics to monitoring system
&lt;/span&gt;
&lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;foreachBatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process_batch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoints/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is powerful because you can use any batch API (including MERGE) within a stream.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 12 — Delta Live Tables (DLT) &amp;amp; Pipelines
&lt;/h1&gt;

&lt;h2&gt;
  
  
  What Are Delta Live Tables?
&lt;/h2&gt;

&lt;p&gt;Delta Live Tables (DLT) is a &lt;strong&gt;declarative pipeline framework&lt;/strong&gt; for building reliable data pipelines. Instead of writing imperative Spark code and managing the execution yourself, you declare &lt;strong&gt;what&lt;/strong&gt; your tables should look like, and DLT figures out &lt;strong&gt;how&lt;/strong&gt; to build them, run them, and keep them up-to-date.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Philosophy
&lt;/h3&gt;

&lt;p&gt;In traditional ETL:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You write code that runs sequentially&lt;/li&gt;
&lt;li&gt;You manage retries, error handling, checkpointing manually&lt;/li&gt;
&lt;li&gt;You handle schema evolution manually&lt;/li&gt;
&lt;li&gt;You figure out when to run each step and in what order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In DLT:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You declare table definitions&lt;/li&gt;
&lt;li&gt;DLT analyzes dependencies and builds a DAG automatically&lt;/li&gt;
&lt;li&gt;DLT handles retries, checkpointing, error handling&lt;/li&gt;
&lt;li&gt;DLT enforces data quality with expectations&lt;/li&gt;
&lt;li&gt;DLT supports both batch and streaming seamlessly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  DLT Core Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Defining Tables
&lt;/h3&gt;

&lt;p&gt;In DLT notebooks, you use the &lt;code&gt;@dlt.table&lt;/code&gt; decorator (Python) or &lt;code&gt;CREATE OR REFRESH LIVE TABLE&lt;/code&gt; (SQL):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;# Bronze: Raw ingestion with Auto Loader
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw event data from S3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;raw_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoints/raw_events_schema/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL equivalent&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;raw_events&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="nv"&gt;"Raw event data from S3"&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cloud_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'s3://my-bucket/raw/events/'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'cloudFiles.inferColumnTypes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Live Tables vs. Streaming Live Tables
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Live Tables (Materialized Views)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process data in batch&lt;/li&gt;
&lt;li&gt;Fully recomputed each run (by default)&lt;/li&gt;
&lt;li&gt;Best for: aggregations, small tables, tables that need to be completely refreshed
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;daily_sales_summary&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver_transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Streaming Live Tables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process data incrementally using Structured Streaming&lt;/li&gt;
&lt;li&gt;Only process new data each run (huge cost savings)&lt;/li&gt;
&lt;li&gt;Best for: high-volume append-only data (logs, events, IoT)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;silver_events&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key difference:&lt;/strong&gt; Use &lt;code&gt;dlt.read()&lt;/code&gt; for batch (live table) dependencies, &lt;code&gt;dlt.read_stream()&lt;/code&gt; for streaming dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Quality Expectations
&lt;/h3&gt;

&lt;p&gt;Expectations are DLT's &lt;strong&gt;data quality enforcement mechanism&lt;/strong&gt;. They're constraints you define on your tables. DLT tracks violations and can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@dlt.expect&lt;/code&gt;&lt;/strong&gt; — warn on violations (data flows through, violations recorded as metrics)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@dlt.expect_or_drop&lt;/code&gt;&lt;/strong&gt; — drop violating rows (bad data doesn't contaminate downstream)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@dlt.expect_or_fail&lt;/code&gt;&lt;/strong&gt; — fail the pipeline on any violation (strict validation)
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email RLIKE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;^[^@]+@[^@]+&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;.[^@]+$&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_negative_ids&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id &amp;gt;= 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;silver_transactions&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL expectations&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver_transactions&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_customer_id&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;positive_amount&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;no_negative_ids&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="n"&gt;FAIL&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bronze_transactions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Viewing expectation metrics&lt;/strong&gt; in the pipeline UI shows you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many rows passed/failed each constraint&lt;/li&gt;
&lt;li&gt;Trends over time&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pipeline Configuration
&lt;/h3&gt;

&lt;p&gt;A DLT &lt;strong&gt;pipeline&lt;/strong&gt; is the deployment unit. One pipeline can contain multiple notebooks, each defining tables/views.&lt;/p&gt;

&lt;p&gt;Key pipeline settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline mode:&lt;/strong&gt; &lt;code&gt;TRIGGERED&lt;/code&gt; (batch) or &lt;code&gt;CONTINUOUS&lt;/code&gt; (streaming, low latency)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target schema:&lt;/strong&gt; where to store the resulting tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage location:&lt;/strong&gt; where DLT stores internal state, logs, and tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster configuration:&lt;/strong&gt; compute for the pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notifications:&lt;/strong&gt; email on success/failure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Channel:&lt;/strong&gt; &lt;code&gt;CURRENT&lt;/code&gt; or &lt;code&gt;PREVIEW&lt;/code&gt; (preview gets new DLT features first)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Triggered pipelines&lt;/strong&gt; — runs once, processes available data, then stops. Triggered via a schedule or manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous pipelines&lt;/strong&gt; — runs indefinitely, processing data as it arrives. Used for near-real-time requirements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pipeline UI and Monitoring
&lt;/h3&gt;

&lt;p&gt;The DLT UI provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG visualization&lt;/strong&gt; — see all tables and their dependencies as a graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality metrics&lt;/strong&gt; — expectation pass/fail rates per table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table event logs&lt;/strong&gt; — see exactly what happened during each update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row statistics&lt;/strong&gt; — rows read, written, failed per table per run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Event log is also queryable as a Delta table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;`/path/to/pipeline/_delta_log_events/`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'flow_progress'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;timestamp&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DLT with Auto Loader
&lt;/h2&gt;

&lt;p&gt;Auto Loader + DLT is the &lt;strong&gt;canonical pattern for Bronze ingestion&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.reset.allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# prevent accidental full reload
&lt;/span&gt;  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bronze_orders&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
              &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/pipelines/my_pipeline/schema/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.inferColumnTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/raw/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;input_file_name&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_ingestion_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Apply Changes (CDC with DLT)
&lt;/h2&gt;

&lt;p&gt;DLT has native support for &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; via &lt;code&gt;APPLY CHANGES INTO&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define the target table
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Target table definition
&lt;/span&gt;
&lt;span class="c1"&gt;# Apply changes from a source (CDC stream)
&lt;/span&gt;&lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers_cdc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;sequence_by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# determines order of changes
&lt;/span&gt;  &lt;span class="n"&gt;apply_as_deletes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DELETE&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="n"&gt;except_column_list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;stored_as_scd_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Type 1: overwrite, Type 2: keep history
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SCD Type 1&lt;/strong&gt; — just overwrite the record with the latest values&lt;br&gt;
&lt;strong&gt;SCD Type 2&lt;/strong&gt; — keep history by creating new records with &lt;code&gt;__START_AT&lt;/code&gt; and &lt;code&gt;__END_AT&lt;/code&gt; columns&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL equivalent&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;APPLY&lt;/span&gt; &lt;span class="n"&gt;CHANGES&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers_cdc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;KEYS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;APPLY&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;operation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'DELETE'&lt;/span&gt;
&lt;span class="n"&gt;SEQUENCE&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;update_timestamp&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;SCD&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DLT Table Properties and Access
&lt;/h2&gt;

&lt;p&gt;Tables created by DLT can be accessed by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Other DLT pipelines&lt;/li&gt;
&lt;li&gt;Notebooks (after the pipeline has run)&lt;/li&gt;
&lt;li&gt;SQL analytics&lt;/li&gt;
&lt;li&gt;Jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tables are stored in the &lt;strong&gt;target catalog/schema&lt;/strong&gt; you specify in the pipeline configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- After pipeline runs, query the tables normally&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations of DLT
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;DLT tables cannot be modified outside the pipeline (read-only externally)&lt;/li&gt;
&lt;li&gt;You cannot manually write to a DLT-managed table&lt;/li&gt;
&lt;li&gt;Full pipeline resets can be expensive (reprocess all data from scratch)&lt;/li&gt;
&lt;li&gt;Limited to certain compute configurations&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Chapter 13 — Apache Spark: High-Level Overview
&lt;/h1&gt;

&lt;p&gt;Spark is the distributed computing engine underlying everything in Databricks. For the exam, you need a solid conceptual understanding. Here are the key topics at a high level:&lt;/p&gt;

&lt;h2&gt;
  
  
  Core Concepts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SparkContext / SparkSession&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;SparkContext&lt;/code&gt; — the entry point for RDD-based Spark (legacy)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SparkSession&lt;/code&gt; — the unified entry point for DataFrames, SQL, and Streaming (current)&lt;/li&gt;
&lt;li&gt;Pre-created as &lt;code&gt;spark&lt;/code&gt; in Databricks notebooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RDDs (Resilient Distributed Datasets)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-level distributed data structure (the original Spark API)&lt;/li&gt;
&lt;li&gt;Immutable, partitioned collection of records distributed across nodes&lt;/li&gt;
&lt;li&gt;Rarely used directly today — use DataFrames/Datasets instead&lt;/li&gt;
&lt;li&gt;Still underpins the execution engine&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DataFrames&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-level, schema-aware, distributed collection of data organized in named columns&lt;/li&gt;
&lt;li&gt;Think of it as a distributed SQL table&lt;/li&gt;
&lt;li&gt;Lazy evaluation — operations build a logical plan until an action triggers execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Datasets (Scala/Java)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type-safe version of DataFrames (for JVM languages)&lt;/li&gt;
&lt;li&gt;Not directly applicable in Python (PySpark uses DataFrames exclusively)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Transformations vs. Actions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Transformations&lt;/strong&gt; — lazy operations that build a logical plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;select()&lt;/code&gt;, &lt;code&gt;filter()&lt;/code&gt;, &lt;code&gt;groupBy()&lt;/code&gt;, &lt;code&gt;join()&lt;/code&gt;, &lt;code&gt;withColumn()&lt;/code&gt;, &lt;code&gt;orderBy()&lt;/code&gt;, &lt;code&gt;limit()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Do NOT trigger computation immediately&lt;/li&gt;
&lt;li&gt;Can be narrow (don't require data movement between partitions) or wide (require shuffle)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Actions&lt;/strong&gt; — trigger actual computation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;count()&lt;/code&gt;, &lt;code&gt;show()&lt;/code&gt;, &lt;code&gt;collect()&lt;/code&gt;, &lt;code&gt;write()&lt;/code&gt;, &lt;code&gt;save()&lt;/code&gt;, &lt;code&gt;take()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Results are either returned to the driver or written to storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding lazy evaluation is key — Spark optimizes the entire plan before running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Catalyst Optimizer and Tungsten
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Catalyst Optimizer&lt;/strong&gt; — Spark's query optimization engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parses logical plan from DataFrame operations&lt;/li&gt;
&lt;li&gt;Analyzes and resolves column references&lt;/li&gt;
&lt;li&gt;Optimizes the logical plan (predicate pushdown, projection pruning, constant folding)&lt;/li&gt;
&lt;li&gt;Generates multiple physical plans&lt;/li&gt;
&lt;li&gt;Selects the best physical plan using cost-based optimization&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Tungsten&lt;/strong&gt; — Memory management and code generation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Off-heap memory management&lt;/li&gt;
&lt;li&gt;Whole-stage code generation (Java bytecode generation for tight loops)&lt;/li&gt;
&lt;li&gt;Column-based binary format (no Java object overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Photon&lt;/strong&gt; — Databricks' native vectorized execution engine (C++):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dramatically faster than standard Spark for SQL and Delta queries&lt;/li&gt;
&lt;li&gt;Enabled via the Photon Runtime&lt;/li&gt;
&lt;li&gt;Drop-in replacement — same API, faster execution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Partitions and Shuffles
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Partitions&lt;/strong&gt; — how data is distributed across executors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;spark.default.parallelism&lt;/code&gt; — default partition count for RDDs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; — default partition count after a shuffle (default: 200)&lt;/li&gt;
&lt;li&gt;On Databricks, Adaptive Query Execution (AQE) automatically adjusts partition counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt; — data redistribution across partitions/executors (wide transformation):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggered by: &lt;code&gt;groupBy&lt;/code&gt;, &lt;code&gt;join&lt;/code&gt;, &lt;code&gt;orderBy&lt;/code&gt;, &lt;code&gt;distinct&lt;/code&gt;, &lt;code&gt;repartition&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Expensive — involves writing data to disk and network transfer&lt;/li&gt;
&lt;li&gt;Minimize shuffles for performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;repartition()&lt;/code&gt; vs &lt;code&gt;coalesce()&lt;/code&gt;:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;repartition(n)&lt;/code&gt; — full shuffle to exactly n partitions (use when increasing or changing distribution)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;coalesce(n)&lt;/code&gt; — narrow operation to reduce to n partitions (more efficient for reducing partition count, no full shuffle)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key DataFrame Operations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Transformations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;select()&lt;/code&gt; — projection (choose columns)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;filter()&lt;/code&gt; / &lt;code&gt;where()&lt;/code&gt; — row filtering&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withColumn()&lt;/code&gt; — add/modify a column&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;withColumnRenamed()&lt;/code&gt; — rename a column&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;drop()&lt;/code&gt; — remove columns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;groupBy().agg()&lt;/code&gt; — group and aggregate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;join()&lt;/code&gt; — combine two DataFrames&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;union()&lt;/code&gt; / &lt;code&gt;unionByName()&lt;/code&gt; — combine rows&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;orderBy()&lt;/code&gt; / &lt;code&gt;sort()&lt;/code&gt; — sort rows&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;distinct()&lt;/code&gt; — deduplicate&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dropDuplicates()&lt;/code&gt; — deduplicate (can specify subset of columns)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;explode()&lt;/code&gt; — expand array/map column into rows&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pivot()&lt;/code&gt; — pivot rows to columns&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cache()&lt;/code&gt; / &lt;code&gt;persist()&lt;/code&gt; — cache in memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common Functions (pyspark.sql.functions):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;col()&lt;/code&gt;, &lt;code&gt;lit()&lt;/code&gt; — column references and literals&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;when().otherwise()&lt;/code&gt; — conditional logic (like CASE WHEN)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;coalesce()&lt;/code&gt; — first non-null value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;to_date()&lt;/code&gt;, &lt;code&gt;to_timestamp()&lt;/code&gt; — type casting for dates&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;date_format()&lt;/code&gt;, &lt;code&gt;date_add()&lt;/code&gt;, &lt;code&gt;datediff()&lt;/code&gt; — date operations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;split()&lt;/code&gt;, &lt;code&gt;regexp_replace()&lt;/code&gt;, &lt;code&gt;trim()&lt;/code&gt;, &lt;code&gt;upper()&lt;/code&gt; — string operations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sum()&lt;/code&gt;, &lt;code&gt;count()&lt;/code&gt;, &lt;code&gt;avg()&lt;/code&gt;, &lt;code&gt;max()&lt;/code&gt;, &lt;code&gt;min()&lt;/code&gt; — aggregations&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rank()&lt;/code&gt;, &lt;code&gt;dense_rank()&lt;/code&gt;, &lt;code&gt;row_number()&lt;/code&gt; — window functions&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;collect_list()&lt;/code&gt;, &lt;code&gt;collect_set()&lt;/code&gt; — aggregate to arrays&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;from_json()&lt;/code&gt;, &lt;code&gt;to_json()&lt;/code&gt; — JSON parsing&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;struct()&lt;/code&gt;, &lt;code&gt;array()&lt;/code&gt;, &lt;code&gt;map()&lt;/code&gt; — complex type creation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;explode()&lt;/code&gt;, &lt;code&gt;posexplode()&lt;/code&gt; — array/map to rows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Window Functions
&lt;/h2&gt;

&lt;p&gt;Window functions perform calculations across a sliding window of rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;sum&lt;/span&gt;

&lt;span class="c1"&gt;# Define window spec
&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Apply window function
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rowsBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unboundedPreceding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Joins
&lt;/h2&gt;

&lt;p&gt;Join types: &lt;code&gt;inner&lt;/code&gt;, &lt;code&gt;left&lt;/code&gt; (or &lt;code&gt;left_outer&lt;/code&gt;), &lt;code&gt;right&lt;/code&gt; (or &lt;code&gt;right_outer&lt;/code&gt;), &lt;code&gt;full&lt;/code&gt; (or &lt;code&gt;full_outer&lt;/code&gt;), &lt;code&gt;left_semi&lt;/code&gt;, &lt;code&gt;left_anti&lt;/code&gt;, &lt;code&gt;cross&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Broadcast Join&lt;/strong&gt; — when one table is small, broadcast it to all executors to avoid shuffle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;broadcast&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;large_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;small_table&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adaptive Query Execution (AQE)
&lt;/h2&gt;

&lt;p&gt;AQE (enabled by default in Databricks) optimizes query plans at runtime based on actual data statistics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coalescing shuffle partitions&lt;/strong&gt; — reduces the number of post-shuffle partitions dynamically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Converting sort-merge join to broadcast join&lt;/strong&gt; — if a table turns out to be small&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skew join optimization&lt;/strong&gt; — handles data skew by splitting large partitions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Caching
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Cache in memory (equivalent to MEMORY_AND_DISK)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEMORY_AND_DISK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpersist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Remove from cache
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cacheTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uncacheTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cache intermediate results that are reused multiple times in the same job.&lt;/p&gt;




&lt;h1&gt;
  
  
  Chapter 14 — AWS Integration Specifics
&lt;/h1&gt;

&lt;h2&gt;
  
  
  IAM and Authentication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Instance Profiles (Cluster IAM)
&lt;/h3&gt;

&lt;p&gt;EC2 instances in a Databricks cluster can be assigned an &lt;strong&gt;IAM Instance Profile&lt;/strong&gt;. This allows code running on the cluster to access AWS resources using AWS credentials — without hardcoding access keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# With an instance profile attached, this just works:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/data/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Spark automatically uses the EC2 instance profile credentials
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create an IAM role with S3 permissions&lt;/li&gt;
&lt;li&gt;Create an instance profile from that role&lt;/li&gt;
&lt;li&gt;Register the instance profile ARN in Databricks (admin step)&lt;/li&gt;
&lt;li&gt;Users can then select that instance profile when creating a cluster&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Cross-Account IAM Role
&lt;/h3&gt;

&lt;p&gt;The Databricks control plane uses a &lt;strong&gt;cross-account IAM role&lt;/strong&gt; in your AWS account to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create/terminate EC2 instances&lt;/li&gt;
&lt;li&gt;Attach/detach EBS volumes&lt;/li&gt;
&lt;li&gt;Configure security groups&lt;/li&gt;
&lt;li&gt;Tag resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This role has the minimal permissions needed — it cannot read your S3 data.&lt;/p&gt;

&lt;h2&gt;
  
  
  S3 Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Direct S3 Access (Recommended with Unity Catalog)
&lt;/h3&gt;

&lt;p&gt;With Unity Catalog and External Locations, access S3 directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/delta/employees/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  DBFS Mount (Legacy)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3a://my-bucket/data/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;mount_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/my-data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;extra_configs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fs.s3a.aws.credentials.provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
      &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.amazonaws.auth.InstanceProfileCredentialsProvider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Use as:
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/my-data/employees/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  S3 Path Formats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;s3://&lt;/code&gt; — standard AWS S3 URL scheme (used by AWS tools)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s3a://&lt;/code&gt; — Hadoop's S3A connector (used by Spark, more efficient than &lt;code&gt;s3n://&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbfs:/mnt/&amp;lt;mount&amp;gt;/&lt;/code&gt; — via DBFS mount&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  VPC Configuration
&lt;/h2&gt;

&lt;p&gt;Databricks can be deployed in your VPC in two modes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Default VPC&lt;/strong&gt; — Databricks creates and manages a VPC in your account. Simple but less control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Customer-managed VPC (VPC Injection)&lt;/strong&gt; — You provide the VPC, subnets, and security groups. Required for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Existing security/network policies&lt;/li&gt;
&lt;li&gt;PrivateLink (no public internet)&lt;/li&gt;
&lt;li&gt;Custom DNS&lt;/li&gt;
&lt;li&gt;VPC peering with RDS, MSK, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  PrivateLink
&lt;/h2&gt;

&lt;p&gt;For enterprises requiring no public internet traffic, Databricks supports &lt;strong&gt;AWS PrivateLink&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control plane to data plane communication over private AWS network&lt;/li&gt;
&lt;li&gt;No data leaves the AWS backbone&lt;/li&gt;
&lt;li&gt;Requires customer-managed VPC&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  AWS Services Integration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Amazon MSK (Kafka)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b-1.msk-cluster.kafka.us-east-1.amazonaws.com:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.security.protocol&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SASL_SSL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.mechanism&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AWS_MSK_IAM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.sasl.jaas.config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;software.amazon.msk.auth...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Amazon Kinesis
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kinesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;streamName&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;initialPosition&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRIM_HORIZON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS Glue Data Catalog Integration
&lt;/h3&gt;

&lt;p&gt;Databricks can use the &lt;strong&gt;AWS Glue Data Catalog&lt;/strong&gt; as a Hive Metastore replacement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.hadoop.hive.metastore.client.factory.class = 
  com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows Databricks to discover tables registered in Glue that were created by Athena, Glue ETL, or other services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; With Unity Catalog, Glue integration is less necessary — UC provides a superior alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amazon Redshift
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from Redshift
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.databricks.spark.redshift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:redshift://cluster.us-east-1.redshift.amazonaws.com:5439/mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public.sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tempdir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket/temp/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws_iam_role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789:role/RedshiftS3Role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Amazon RDS / Aurora
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# JDBC connection
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:mysql://my-rds.cluster.us-east-1.rds.amazonaws.com:3306/mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds-password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numPartitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;partitionColumn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lowerBound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upperBound&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1000000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  AWS Secrets Manager Backed Secret Scope
&lt;/h2&gt;

&lt;p&gt;For AWS organizations already using AWS Secrets Manager:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a UC-backed scope (links to AWS Secrets Manager)&lt;/span&gt;
databricks secrets create-scope &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope&lt;/span&gt; aws-secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--scope-backend-type&lt;/span&gt; AWS_SECRETS_MANAGER &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use secrets stored in AWS Secrets Manager
&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-secret-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  Chapter 15 — Security &amp;amp; Access Control
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Workspace-Level Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  User Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Users are managed at the &lt;strong&gt;Databricks account level&lt;/strong&gt; (account console)&lt;/li&gt;
&lt;li&gt;SCIM (System for Cross-domain Identity Management) provisioning from identity providers (Okta, Azure AD, AWS SSO)&lt;/li&gt;
&lt;li&gt;Groups simplify permission management — assign users to groups, grant permissions to groups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Permission Levels
&lt;/h3&gt;

&lt;p&gt;For workspace objects (notebooks, clusters, jobs):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Notebooks&lt;/th&gt;
&lt;th&gt;Clusters&lt;/th&gt;
&lt;th&gt;Jobs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CAN READ&lt;/td&gt;
&lt;td&gt;View&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;View results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAN RUN&lt;/td&gt;
&lt;td&gt;Run&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Run job&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAN EDIT&lt;/td&gt;
&lt;td&gt;Edit&lt;/td&gt;
&lt;td&gt;Can restart&lt;/td&gt;
&lt;td&gt;Edit settings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAN MANAGE&lt;/td&gt;
&lt;td&gt;All&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;td&gt;Full control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IS OWNER&lt;/td&gt;
&lt;td&gt;All + delete&lt;/td&gt;
&lt;td&gt;Create/delete&lt;/td&gt;
&lt;td&gt;Create/delete&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Cluster-Level Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster creation permission&lt;/strong&gt; — not everyone can create clusters (controlled via admin settings or cluster policies)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster access permission&lt;/strong&gt; — who can attach notebooks to or terminate a cluster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster policies&lt;/strong&gt; — restrict what configurations users can apply&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Table ACLs (Legacy — Pre-Unity Catalog)
&lt;/h3&gt;

&lt;p&gt;Before Unity Catalog, table-level access control in the Hive Metastore was managed via &lt;code&gt;GRANT&lt;/code&gt; statements on the cluster with table ACLs enabled. This is being replaced by Unity Catalog.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  IP Access Lists
&lt;/h3&gt;

&lt;p&gt;Restrict workspace access to specific IP ranges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Administrators can configure allowed IP ranges&lt;/li&gt;
&lt;li&gt;Users outside the allowed ranges cannot access the workspace&lt;/li&gt;
&lt;li&gt;Configured via Admin Settings → IP Access Lists&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Encryption
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encryption at rest&lt;/strong&gt; — S3 data is encrypted using AWS KMS (AWS-managed or customer-managed keys)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption in transit&lt;/strong&gt; — TLS/SSL for all communication between control plane, data plane, and clients&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer-managed keys&lt;/strong&gt; — for workspaces with strict security requirements, customers can manage their own KMS keys for notebook encryption and cluster EBS volumes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Compliance
&lt;/h2&gt;

&lt;p&gt;Databricks supports various compliance certifications relevant for AWS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SOC 2 Type II&lt;/li&gt;
&lt;li&gt;ISO 27001&lt;/li&gt;
&lt;li&gt;HIPAA (with Business Associate Agreement)&lt;/li&gt;
&lt;li&gt;PCI DSS&lt;/li&gt;
&lt;li&gt;FedRAMP (government use)&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Chapter 16 — Performance &amp;amp; Best Practices
&lt;/h1&gt;

&lt;h2&gt;
  
  
  File Size and Compaction
&lt;/h2&gt;

&lt;p&gt;The "small file problem" is one of the most common performance issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each Spark task writes one output file&lt;/li&gt;
&lt;li&gt;Thousands of small files → slow reads (S3 overhead per file)&lt;/li&gt;
&lt;li&gt;Target file size: &lt;strong&gt;128MB to 1GB&lt;/strong&gt; for Parquet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OPTIMIZE&lt;/code&gt; command (manual)&lt;/li&gt;
&lt;li&gt;Auto Compaction (automatic)&lt;/li&gt;
&lt;li&gt;Optimized Writes (on write)&lt;/li&gt;
&lt;li&gt;Liquid Clustering (ongoing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Skew
&lt;/h2&gt;

&lt;p&gt;Data skew occurs when one or more partitions are much larger than others, causing certain executors to take much longer (stragglers).&lt;/p&gt;

&lt;p&gt;Detection: Look for &lt;code&gt;Tasks&lt;/code&gt; in Spark UI where one task takes 10x longer than others.&lt;/p&gt;

&lt;p&gt;Solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Salting&lt;/strong&gt; — add a random key suffix to distribute skewed keys&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AQE skew optimization&lt;/strong&gt; — automatic in Databricks (handles skew joins)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repartition&lt;/strong&gt; before a join on the skewed key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadcast join&lt;/strong&gt; — if the smaller table is small enough&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Caching Strategy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cache DataFrames that are used multiple times in the same session&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;delta.cache&lt;/code&gt; (disk-based cache for Delta files) at the cluster level for repeatedly read Delta tables&lt;/li&gt;
&lt;li&gt;Don't cache unnecessarily — caching takes memory and initialization time&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Partitioning Strategy
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Partition by date for time-series data (daily/monthly partitions)&lt;/li&gt;
&lt;li&gt;Avoid partitioning by high-cardinality columns&lt;/li&gt;
&lt;li&gt;Use Liquid Clustering as a more flexible alternative&lt;/li&gt;
&lt;li&gt;Keep partition count manageable (&amp;lt; 10,000 partitions per table)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Schema Best Practices
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Always define explicit schemas for production pipelines&lt;/li&gt;
&lt;li&gt;Avoid &lt;code&gt;inferSchema&lt;/code&gt; in production (expensive, can produce wrong types)&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;StringType&lt;/code&gt; initially if unsure, cast downstream&lt;/li&gt;
&lt;li&gt;Enable &lt;code&gt;mergeSchema&lt;/code&gt; only when needed — it can hide bugs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Z-Ordering vs. Partitioning vs. Liquid Clustering
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Partitioning&lt;/th&gt;
&lt;th&gt;Z-Order&lt;/th&gt;
&lt;th&gt;Liquid Clustering&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Data skipping&lt;/td&gt;
&lt;td&gt;Folder-level&lt;/td&gt;
&lt;td&gt;File-level&lt;/td&gt;
&lt;td&gt;File-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works on&lt;/td&gt;
&lt;td&gt;Low-cardinality&lt;/td&gt;
&lt;td&gt;High-cardinality&lt;/td&gt;
&lt;td&gt;Any cardinality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ongoing maintenance&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Manual OPTIMIZE&lt;/td&gt;
&lt;td&gt;Auto (OPTIMIZE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flexibility&lt;/td&gt;
&lt;td&gt;Low (fixed at creation)&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recommended for&lt;/td&gt;
&lt;td&gt;New pipelines: rarely&lt;/td&gt;
&lt;td&gt;Existing tables&lt;/td&gt;
&lt;td&gt;New tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  Chapter 17 — Additional Databricks Concepts
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Databricks SQL (DBSQL)
&lt;/h2&gt;

&lt;p&gt;Databricks SQL is the &lt;strong&gt;analytics-focused layer&lt;/strong&gt; of Databricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Editor&lt;/strong&gt; — web-based SQL query interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL Warehouses&lt;/strong&gt; — dedicated compute for SQL (serverless or classic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt; — build visualizations and dashboards from SQL results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts&lt;/strong&gt; — trigger notifications based on query results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query History&lt;/strong&gt; — see all executed queries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Serverless SQL Warehouses&lt;/strong&gt; — compute is managed by Databricks, instant startup, automatic scaling, no cluster configuration needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks MLflow
&lt;/h2&gt;

&lt;p&gt;MLflow is an open-source ML lifecycle platform built into Databricks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tracking&lt;/strong&gt; — log parameters, metrics, and artifacts from ML experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Models&lt;/strong&gt; — store, version, and deploy ML models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry&lt;/strong&gt; — model versioning and stage management (Staging, Production, Archived)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Projects&lt;/strong&gt; — package ML code for reproducibility
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mlflow.sklearn&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start_run&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="c1"&gt;# Train model
&lt;/span&gt;  &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Log params
&lt;/span&gt;  &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_param&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n_estimators&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_metric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;accuracy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# Log model
&lt;/span&gt;  &lt;span class="n"&gt;mlflow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sklearn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Feature Store
&lt;/h2&gt;

&lt;p&gt;Databricks Feature Store is a centralized repository for ML features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store features as Delta tables&lt;/li&gt;
&lt;li&gt;Reuse features across teams and models&lt;/li&gt;
&lt;li&gt;Automatic feature lineage&lt;/li&gt;
&lt;li&gt;Point-in-time lookups for training data&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Databricks Connect
&lt;/h2&gt;

&lt;p&gt;Databricks Connect allows you to run Spark code from your &lt;strong&gt;local IDE&lt;/strong&gt; (VS Code, IntelliJ, etc.) against a remote Databricks cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write code locally with full IDE support (autocomplete, debugging)&lt;/li&gt;
&lt;li&gt;Execute on the remote cluster&lt;/li&gt;
&lt;li&gt;Great for development — combine local tooling with cloud compute&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Databricks REST API
&lt;/h2&gt;

&lt;p&gt;Almost everything in Databricks can be automated via the REST API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create/manage clusters, jobs, notebooks&lt;/li&gt;
&lt;li&gt;Trigger job runs&lt;/li&gt;
&lt;li&gt;Manage permissions&lt;/li&gt;
&lt;li&gt;List and manage tables&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Databricks CLI wraps the REST API for command-line use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databricks Lakehouse Monitoring
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lakehouse Monitoring&lt;/strong&gt; provides data quality monitoring for Delta tables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profile data distributions over time&lt;/li&gt;
&lt;li&gt;Detect drift (statistical changes in data)&lt;/li&gt;
&lt;li&gt;Generate monitoring metrics as Delta tables&lt;/li&gt;
&lt;li&gt;Works for both batch and streaming tables&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data Contracts and Quality
&lt;/h2&gt;

&lt;p&gt;Beyond DLT expectations, common data quality patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Great Expectations&lt;/strong&gt; — open-source data quality library integrated with Databricks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta constraints&lt;/strong&gt; — SQL CHECK constraints on Delta tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality monitoring&lt;/strong&gt; — Lakehouse Monitoring
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Add a CHECK constraint&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_quantity&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  Chapter 18 — Exam Tips &amp;amp; Topic Summary
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Exam Structure
&lt;/h2&gt;

&lt;p&gt;The Databricks Data Engineer Associate exam (exam code: Databricks-Certified-Data-Engineer-Associate) typically covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~20% Databricks Lakehouse Platform&lt;/strong&gt; — architecture, workspace, clusters, notebooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~30% ELT with Spark and Delta Lake&lt;/strong&gt; — transformations, Delta features, SQL&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~20% Incremental Data Processing&lt;/strong&gt; — Auto Loader, Structured Streaming, DLT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~20% Production Pipelines&lt;/strong&gt; — jobs, workflows, DLT pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~10% Data Governance&lt;/strong&gt; — Unity Catalog, permissions, data access&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  High-Frequency Exam Topics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Managed vs. External Tables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Know the DROP behavior (metadata only vs. data+metadata)&lt;/li&gt;
&lt;li&gt;Know when to use each&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Delta Lake Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Time travel (VERSION AS OF, TIMESTAMP AS OF)&lt;/li&gt;
&lt;li&gt;MERGE syntax (WHEN MATCHED, WHEN NOT MATCHED, WHEN NOT MATCHED BY SOURCE)&lt;/li&gt;
&lt;li&gt;OPTIMIZE and Z-ORDER&lt;/li&gt;
&lt;li&gt;VACUUM and data retention&lt;/li&gt;
&lt;li&gt;Schema enforcement vs. evolution&lt;/li&gt;
&lt;li&gt;CDF (Change Data Feed)&lt;/li&gt;
&lt;li&gt;SHALLOW CLONE vs. DEEP CLONE&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cluster Types
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All-Purpose for development&lt;/li&gt;
&lt;li&gt;Job clusters for production (auto-terminated, cost-efficient)&lt;/li&gt;
&lt;li&gt;SQL Warehouses for SQL analytics&lt;/li&gt;
&lt;li&gt;Instance pools for fast startup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;%run&lt;/code&gt; vs. &lt;code&gt;dbutils.notebook.run()&lt;/code&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;%run&lt;/code&gt; shares scope (variables, SparkSession)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dbutils.notebook.run()&lt;/code&gt; runs independently, can capture return value, can run in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Streaming
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Trigger types (availableNow, processingTime, continuous)&lt;/li&gt;
&lt;li&gt;Checkpoint location importance&lt;/li&gt;
&lt;li&gt;foreachBatch for complex streaming operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DLT
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@dlt.expect&lt;/code&gt; (warn) vs. &lt;code&gt;@dlt.expect_or_drop&lt;/code&gt; (drop row) vs. &lt;code&gt;@dlt.expect_or_fail&lt;/code&gt; (fail pipeline)&lt;/li&gt;
&lt;li&gt;Streaming vs. batch tables in DLT&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;APPLY CHANGES INTO&lt;/code&gt; for CDC&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auto Loader
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cloudFiles&lt;/code&gt; format&lt;/li&gt;
&lt;li&gt;Schema inference and &lt;code&gt;schemaLocation&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;File notification vs. directory listing mode&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Unity Catalog
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Three-level namespace (catalog.schema.table)&lt;/li&gt;
&lt;li&gt;Privilege hierarchy (must grant USE at each level)&lt;/li&gt;
&lt;li&gt;Managed vs. External tables in UC&lt;/li&gt;
&lt;li&gt;Storage Credentials and External Locations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  dbutils.secrets
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Secrets are always &lt;code&gt;[REDACTED]&lt;/code&gt; in output&lt;/li&gt;
&lt;li&gt;Two scope types: Databricks-backed and AWS Secrets Manager-backed&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Common Traps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM&lt;/code&gt; removes time travel capability&lt;/strong&gt; — after VACUUM, you can't travel back to removed versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DROP TABLE&lt;/code&gt; on a managed table deletes data&lt;/strong&gt; — external tables only delete metadata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;%run&lt;/code&gt; in a notebook runs in the same scope&lt;/strong&gt; — not isolation like &lt;code&gt;dbutils.notebook.run()&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-scaling doesn't work well with streaming&lt;/strong&gt; — use fixed clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPTIMIZE doesn't run automatically&lt;/strong&gt; — unless Auto Compaction is enabled&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema inference in production is bad&lt;/strong&gt; — use explicit schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temp views are session-scoped&lt;/strong&gt; — not visible to other notebooks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job clusters are created per-run and terminated after&lt;/strong&gt; — you can't manually start/stop them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DLT tables are read-only outside the pipeline&lt;/strong&gt; — you can't write to them directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Z-ORDER is only effective when combined with OPTIMIZE&lt;/strong&gt; — it's not automatic&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Quick Reference: Key SQL Commands
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Time Travel&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;VERSION&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- History and Metadata&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;HISTORY&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;DETAIL&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="n"&gt;EXTENDED&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Maintenance&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="n"&gt;RETAIN&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt; &lt;span class="n"&gt;HOURS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;RESTORE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;VERSION&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Clone&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t_backup&lt;/span&gt; &lt;span class="n"&gt;SHALLOW&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t_backup&lt;/span&gt; &lt;span class="n"&gt;DEEP&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- MERGE&lt;/span&gt;
&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;

&lt;span class="c1"&gt;-- DML&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- CDC&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;TBLPROPERTIES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;enableChangeDataFeed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;table_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'t'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- UC Permissions&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`user@company.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`user@company.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;GRANTS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Reference: Key Python Patterns
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Auto Loader (Bronze ingestion)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoints/schema/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://bucket/path/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Streaming write
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoints/stream/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableNow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Secrets
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Notebook parameters
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Task values (multi-task jobs)
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;taskValues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;taskKey&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;upstream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run another notebook
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/notebook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Delta MERGE in Python
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;
&lt;span class="n"&gt;dt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target.id = source.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DLT Quick Reference
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;

&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not_null_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_silver_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_bronze_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- SQL DLT&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_id&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;positive_amount&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bronze_orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The Databricks Data Engineer Associate exam tests your ability to design, build, and maintain data pipelines on the Databricks Lakehouse Platform. The most important areas to master are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delta Lake fundamentals&lt;/strong&gt; — This is everywhere. Know ACID, time travel, MERGE, OPTIMIZE/VACUUM/ZORDER inside out.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Medallion Architecture&lt;/strong&gt; — Understand Bronze/Silver/Gold and what belongs in each layer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DLT and streaming&lt;/strong&gt; — Know the declarative pipeline approach, expectations, and Auto Loader.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cluster types and their appropriate use cases&lt;/strong&gt; — All-purpose for dev, Job clusters for prod, SQL Warehouses for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unity Catalog&lt;/strong&gt; — The three-level namespace, privilege model, and the difference between managed and external tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Orchestration&lt;/strong&gt; — Multi-task jobs, task dependencies, retry policies, and task values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;dbutils&lt;/strong&gt; — Know every submodule: &lt;code&gt;fs&lt;/code&gt;, &lt;code&gt;secrets&lt;/code&gt;, &lt;code&gt;widgets&lt;/code&gt;, &lt;code&gt;notebook&lt;/code&gt;, and &lt;code&gt;jobs&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The exam is scenario-based — it will describe a business requirement and ask you to identify the correct Databricks tool or approach. Read each question carefully, eliminate clearly wrong answers, and apply the principles covered in this guide.&lt;/p&gt;

&lt;p&gt;Good luck with your Databricks Data Engineer Associate certification! 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This guide covers the Databricks Data Engineer Associate exam curriculum. Always check the official &lt;a href="https://www.databricks.com/learn/certification/data-engineer-associate" rel="noopener noreferrer"&gt;Databricks certification exam guide&lt;/a&gt; for the latest topic list.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
      <category>learning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Ultimate Guide to Databricks Data Engineer Associate Exam: Everything You Need to Know</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Thu, 19 Feb 2026 01:57:34 +0000</pubDate>
      <link>https://forem.com/datatechbridge/the-ultimate-guide-to-databricks-data-engineer-associate-exam-everything-you-need-to-know-h73</link>
      <guid>https://forem.com/datatechbridge/the-ultimate-guide-to-databricks-data-engineer-associate-exam-everything-you-need-to-know-h73</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Introduction to Databricks and the Lakehouse Platform&lt;/li&gt;
&lt;li&gt;Apache Spark Fundamentals&lt;/li&gt;
&lt;li&gt;Delta Lake Deep Dive&lt;/li&gt;
&lt;li&gt;ELT with Apache Spark and Delta Lake&lt;/li&gt;
&lt;li&gt;Incremental Data Processing&lt;/li&gt;
&lt;li&gt;Production Pipelines with Delta Live Tables (DLT)&lt;/li&gt;
&lt;li&gt;Databricks Workflows and Job Orchestration&lt;/li&gt;
&lt;li&gt;Data Governance with Unity Catalog&lt;/li&gt;
&lt;li&gt;Databricks SQL&lt;/li&gt;
&lt;li&gt;Security and Access Control&lt;/li&gt;
&lt;li&gt;Performance Optimization&lt;/li&gt;
&lt;li&gt;Exam Tips and Practice Questions&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Chapter 1: Introduction to Databricks and the Lakehouse Platform
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 What is Databricks?
&lt;/h3&gt;

&lt;p&gt;Databricks is a unified analytics platform built on top of Apache Spark that combines data engineering, data science, machine learning, and business intelligence into one cohesive ecosystem. Founded in 2013 by the original creators of Apache Spark, Databricks has grown into an industry-leading platform that runs on all major cloud providers — AWS, Azure, and Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;At its core, Databricks provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative notebooks&lt;/strong&gt; for data exploration and development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed Apache Spark clusters&lt;/strong&gt; for distributed computing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; as the underlying storage layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLflow&lt;/strong&gt; for machine learning lifecycle management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks SQL&lt;/strong&gt; for business intelligence and analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Live Tables&lt;/strong&gt; for declarative ETL pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unity Catalog&lt;/strong&gt; for unified data governance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1.2 The Data Lakehouse Architecture
&lt;/h3&gt;

&lt;p&gt;Before Databricks introduced the Lakehouse concept, organizations were forced to choose between two suboptimal architectures:&lt;/p&gt;

&lt;h4&gt;
  
  
  Traditional Data Warehouse
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Structured data only&lt;/li&gt;
&lt;li&gt;High cost for storage&lt;/li&gt;
&lt;li&gt;Limited scalability&lt;/li&gt;
&lt;li&gt;Excellent ACID compliance and performance&lt;/li&gt;
&lt;li&gt;Works great for SQL workloads but struggles with unstructured data&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Traditional Data Lake
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Supports all data types (structured, semi-structured, unstructured)&lt;/li&gt;
&lt;li&gt;Low-cost storage (typically object storage like S3, ADLS, GCS)&lt;/li&gt;
&lt;li&gt;Highly scalable&lt;/li&gt;
&lt;li&gt;Poor ACID compliance&lt;/li&gt;
&lt;li&gt;No support for transactions&lt;/li&gt;
&lt;li&gt;Data quality issues&lt;/li&gt;
&lt;li&gt;Performance challenges&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  The Lakehouse
&lt;/h4&gt;

&lt;p&gt;The Lakehouse paradigm brings the best of both worlds by combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Low-cost cloud storage&lt;/strong&gt; from data lakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; from data warehouses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema enforcement&lt;/strong&gt; and &lt;strong&gt;schema evolution&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support for BI tools&lt;/strong&gt; directly on the lake&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming and batch processing&lt;/strong&gt; in one platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Openness&lt;/strong&gt; — data stored in open formats like Parquet and JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Lakehouse is implemented through &lt;strong&gt;Delta Lake&lt;/strong&gt;, which sits on top of cloud object storage and provides the reliability and performance features typically found only in data warehouses.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.3 Databricks Architecture Components
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Control Plane
&lt;/h4&gt;

&lt;p&gt;The control plane is managed by Databricks and includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Databricks web application&lt;/strong&gt; (UI)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster manager&lt;/strong&gt; — manages the lifecycle of compute clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Job scheduler&lt;/strong&gt; — orchestrates workflow execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notebook server&lt;/strong&gt; — manages collaborative development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata store&lt;/strong&gt; — stores table definitions, ACLs, and configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Data Plane
&lt;/h4&gt;

&lt;p&gt;The data plane runs within the customer's cloud account and includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster compute resources&lt;/strong&gt; (EC2 on AWS, VMs on Azure/GCP)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloud object storage&lt;/strong&gt; (S3, ADLS Gen2, GCS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network resources&lt;/strong&gt; (VPC, subnets, security groups)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation ensures that your data never leaves your cloud account, providing security and compliance guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.4 Databricks Workspace
&lt;/h3&gt;

&lt;p&gt;The Databricks Workspace is the collaborative environment where users interact with the platform. Key components include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notebooks&lt;/strong&gt;&lt;br&gt;
Notebooks are the primary development interface in Databricks. They support multiple languages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python (&lt;code&gt;%python&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Scala (&lt;code&gt;%scala&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;SQL (&lt;code&gt;%sql&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;R (&lt;code&gt;%r&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Markdown (&lt;code&gt;%md&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Shell commands (&lt;code&gt;%sh&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;File system operations (&lt;code&gt;%fs&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can switch languages within a single notebook using magic commands, making it incredibly flexible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repos&lt;/strong&gt;&lt;br&gt;
Databricks Repos provides Git integration, allowing teams to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to GitHub, GitLab, Azure DevOps, or Bitbucket&lt;/li&gt;
&lt;li&gt;Version control notebooks and code&lt;/li&gt;
&lt;li&gt;Implement CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Collaborate across teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Clusters&lt;/strong&gt;&lt;br&gt;
Clusters are the compute resources that execute your code. There are two primary types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;All-Purpose Clusters&lt;/strong&gt;: Used for interactive development and collaboration. They can be shared across multiple users and notebooks. These are more expensive but provide maximum flexibility.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Job Clusters&lt;/strong&gt;: Ephemeral clusters created specifically for a job and terminated when the job completes. These are more cost-effective for production workloads.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Cluster Modes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard Mode&lt;/strong&gt;: Single-user or shared cluster with full Spark capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Concurrency Mode&lt;/strong&gt;: Optimized for concurrent usage with fine-grained resource sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single Node Mode&lt;/strong&gt;: Driver-only cluster for lightweight workloads and local Spark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster Configuration Options:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databricks Runtime (DBR) version&lt;/li&gt;
&lt;li&gt;Node types and sizes&lt;/li&gt;
&lt;li&gt;Autoscaling configuration&lt;/li&gt;
&lt;li&gt;Auto-termination settings&lt;/li&gt;
&lt;li&gt;Spot/Preemptible instance usage&lt;/li&gt;
&lt;li&gt;Custom libraries and init scripts&lt;/li&gt;
&lt;li&gt;Environment variables&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Chapter 2: Apache Spark Fundamentals
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 Understanding Apache Spark
&lt;/h3&gt;

&lt;p&gt;Apache Spark is the distributed computing engine that powers Databricks. Understanding Spark's core concepts is fundamental to the Data Engineer Associate exam.&lt;/p&gt;
&lt;h4&gt;
  
  
  Spark Architecture
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Driver Node&lt;/strong&gt;&lt;br&gt;
The driver is the brain of a Spark application. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs the main() function&lt;/li&gt;
&lt;li&gt;Creates the SparkContext/SparkSession&lt;/li&gt;
&lt;li&gt;Converts user code into execution tasks&lt;/li&gt;
&lt;li&gt;Schedules jobs and stages&lt;/li&gt;
&lt;li&gt;Communicates with the cluster manager&lt;/li&gt;
&lt;li&gt;Keeps track of metadata about distributed data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Executor Nodes&lt;/strong&gt;&lt;br&gt;
Executors are the worker processes that run on cluster nodes. They:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute the tasks assigned by the driver&lt;/li&gt;
&lt;li&gt;Store data in memory or disk (RDD/DataFrame partitions)&lt;/li&gt;
&lt;li&gt;Return results back to the driver&lt;/li&gt;
&lt;li&gt;Each executor has multiple cores, each of which can run one task at a time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cluster Manager&lt;/strong&gt;&lt;br&gt;
In Databricks, the cluster manager is handled automatically. Spark supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standalone&lt;/li&gt;
&lt;li&gt;YARN (Hadoop)&lt;/li&gt;
&lt;li&gt;Mesos&lt;/li&gt;
&lt;li&gt;Kubernetes&lt;/li&gt;
&lt;li&gt;Databricks (proprietary)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Spark Execution Model
&lt;/h4&gt;

&lt;p&gt;When you submit a Spark job, it goes through the following stages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Job&lt;/strong&gt;: Triggered by an action (collect(), show(), write(), etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stage&lt;/strong&gt;: A set of tasks that can be executed in parallel without data shuffling. Stage boundaries occur at shuffle operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task&lt;/strong&gt;: The smallest unit of work, operating on a single partition of data
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application
└── Job (triggered by action)
    └── Stage (divided by shuffles)
        └── Task (one per partition)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  2.2 SparkSession
&lt;/h3&gt;

&lt;p&gt;The SparkSession is the entry point for all Spark functionality in modern Spark (2.0+). In Databricks, it's automatically available as &lt;code&gt;spark&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In Databricks, SparkSession is pre-created as 'spark'
# But you can access it like this:
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MyApplication&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.some.config.option&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;some-value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Check Spark version
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Access SparkContext
&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.3 DataFrames and the DataFrame API
&lt;/h3&gt;

&lt;p&gt;DataFrames are the primary data abstraction in modern Spark. A DataFrame is a distributed collection of data organized into named columns, conceptually similar to a table in a relational database.&lt;/p&gt;

&lt;h4&gt;
  
  
  Creating DataFrames
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# From a Python list
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bob&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Charlie&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From a CSV file
&lt;/span&gt;&lt;span class="n"&gt;df_csv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From a JSON file
&lt;/span&gt;&lt;span class="n"&gt;df_json&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From a Parquet file
&lt;/span&gt;&lt;span class="n"&gt;df_parquet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From a Delta table
&lt;/span&gt;&lt;span class="n"&gt;df_delta&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Or using SQL
&lt;/span&gt;&lt;span class="n"&gt;df_sql&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database.tablename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# From a JDBC source
&lt;/span&gt;&lt;span class="n"&gt;df_jdbc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:postgresql://host:5432/database&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;schema.tablename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;username&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  DataFrame Transformations
&lt;/h4&gt;

&lt;p&gt;Transformations are &lt;strong&gt;lazy&lt;/strong&gt; — they define a computation but don't execute it until an action is called.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;# Select specific columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Filter/Where
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age &amp;gt; 25&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add new columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;birth_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Drop columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unnecessary_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sort/OrderBy
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Distinct values
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distinct&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Limit rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Group By and Aggregations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employee_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Joins
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;right&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left_semi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;left_anti&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cross&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Union
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;union&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# Combines by position
&lt;/span&gt;&lt;span class="n"&gt;df1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unionByName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# Combines by column name
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  DataFrame Actions
&lt;/h4&gt;

&lt;p&gt;Actions trigger computation and return results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Collect all rows to driver (careful with large datasets!)
&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Show first N rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;truncate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Count rows
&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Get first row
&lt;/span&gt;&lt;span class="n"&gt;first_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;first&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;first_5&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Describe statistics
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Write to storage
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database.tablename&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.4 Spark SQL
&lt;/h3&gt;

&lt;p&gt;Spark SQL allows you to run SQL queries against DataFrames and Hive tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Register DataFrame as temporary view
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run SQL query
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT department, 
           COUNT(*) as emp_count,
           AVG(salary) as avg_salary
    FROM employees
    WHERE salary &amp;gt; 50000
    GROUP BY department
    ORDER BY avg_salary DESC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create global temporary view (accessible across sessions)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createOrReplaceGlobalTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global_employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM global_temp.global_employees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.5 Data Types and Schema
&lt;/h3&gt;

&lt;p&gt;Understanding Spark data types is critical for data engineering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;# Define explicit schema
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;DateType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;BooleanType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zip&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ArrayType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;MapType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="n"&gt;nullable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame with explicit schema
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check schema
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt;

&lt;span class="c1"&gt;# Cast column types
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LongType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.6 Working with Complex Types
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Working with Arrays
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SQL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Java&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Scala&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Explode array into rows
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Array functions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;size&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skill_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array_contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;knows_python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array_distinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Working with Structs
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;John&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Doe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person.first_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;person.last_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Working with Maps
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map_keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;keys&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;key1_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.7 Window Functions
&lt;/h3&gt;

&lt;p&gt;Window functions are essential for analytical computations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.window&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;

&lt;span class="c1"&gt;# Define window specification
&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Ranking functions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dense_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dense_rank&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;percent_rank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;percent_rank&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ntile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ntile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Analytic functions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lag_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lead&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_spec&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Aggregate functions with window
&lt;/span&gt;&lt;span class="n"&gt;window_agg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dept_avg_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_agg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dept_total_salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_agg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;# Running totals
&lt;/span&gt;&lt;span class="n"&gt;window_running&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;department&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hire_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rowsBetween&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unboundedPreceding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running_salary_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;window_running&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.8 User-Defined Functions (UDFs)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;

&lt;span class="c1"&gt;# Python UDF (not optimized by Catalyst)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize_salary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Register UDF
&lt;/span&gt;&lt;span class="n"&gt;categorize_udf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;categorize_salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Use UDF
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;categorize_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;

&lt;span class="c1"&gt;# Lambda UDF
&lt;/span&gt;&lt;span class="n"&gt;double_salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Pandas UDF (vectorized - much faster!)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas_udf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="nd"&gt;@pandas_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;categorize_salary_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Series&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;High&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;categorize_salary_pandas&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 3: Delta Lake Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 What is Delta Lake?
&lt;/h3&gt;

&lt;p&gt;Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unified streaming/batch data processing to data lakes. It was created by Databricks and donated to the Linux Foundation.&lt;/p&gt;

&lt;p&gt;Delta Lake stores data as &lt;strong&gt;Parquet files&lt;/strong&gt; along with a &lt;strong&gt;transaction log&lt;/strong&gt; (Delta Log) that tracks all changes to the table.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Delta Lake Architecture
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Delta Log (Transaction Log)
&lt;/h4&gt;

&lt;p&gt;The Delta Log is a directory called &lt;code&gt;_delta_log&lt;/code&gt; located at the root of the Delta table. It contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JSON log files&lt;/strong&gt;: Each transaction creates a new JSON file (00000000000000000000.json, 00000000000000000001.json, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint files&lt;/strong&gt;: Every 10 transactions, Databricks creates a Parquet checkpoint file that consolidates the log for faster reads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each log entry contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; actions: New files added to the table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Remove&lt;/strong&gt; actions: Files removed from the table&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt; actions: Schema changes, table properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protocol&lt;/strong&gt; actions: Delta protocol version upgrades&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CommitInfo&lt;/strong&gt; actions: Information about the operation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  How Delta Lake Achieves ACID
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Atomicity&lt;/strong&gt;: Either all the changes in a transaction are committed or none are. This is achieved through the atomic JSON write to the transaction log.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Schema enforcement ensures that writes conform to the table schema. Constraint checking ensures data integrity rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation&lt;/strong&gt;: Optimistic concurrency control with snapshot isolation. Each transaction reads from a consistent snapshot and conflicts are detected at commit time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Durability&lt;/strong&gt;: Once a transaction is committed to the log, it's permanent. The underlying cloud storage provides durability guarantees.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 Creating Delta Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Method 1: Write DataFrame as Delta
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Method 2: Write with partitioning
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Method 3: Save as table (creates metadata in metastore)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database.table_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Method 4: Create table using SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE IF NOT EXISTS sales (
        id BIGINT,
        customer_id BIGINT,
        product_id BIGINT,
        amount DOUBLE,
        sale_date DATE
    )
    USING DELTA
    PARTITIONED BY (sale_date)
    LOCATION &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/sales&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    COMMENT &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Sales transactions table&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    TBLPROPERTIES (
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.optimizeWrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.autoCompact&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    )
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Method 5: CREATE TABLE AS SELECT (CTAS)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE new_sales
    USING DELTA
    AS SELECT * FROM old_sales WHERE year = 2024
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.4 Reading Delta Tables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read Delta table from path
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read Delta table from catalog
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;database.table_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read a specific version (Time Travel)
&lt;/span&gt;&lt;span class="n"&gt;df_v2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read at a specific timestamp
&lt;/span&gt;&lt;span class="n"&gt;df_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestampAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Using SQL for time travel
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT * FROM sales VERSION AS OF 3
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    SELECT * FROM sales TIMESTAMP AS OF &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2024-01-01 00:00:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.5 Updating Delta Tables
&lt;/h3&gt;

&lt;h4&gt;
  
  
  INSERT Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Append data (default mode)
&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Overwrite entire table
&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert using SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    INSERT INTO sales VALUES (1, 100, 200, 99.99, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2024-01-15&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    INSERT INTO sales SELECT * FROM staging_sales
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert Overwrite (replaces data matching partition)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    INSERT OVERWRITE sales
    SELECT * FROM staging_sales WHERE sale_date = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2024-01-15&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  UPDATE Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;

&lt;span class="c1"&gt;# Load the Delta table
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Update with condition
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Update using SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    UPDATE sales
    SET amount = amount * 1.1
    WHERE customer_id = 100
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  DELETE Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Delete with condition
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sale_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2020-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delete using SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    DELETE FROM sales
    WHERE sale_date &amp;lt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2020-01-01&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.6 MERGE (UPSERT) Operations
&lt;/h3&gt;

&lt;p&gt;MERGE is one of the most powerful features of Delta Lake, enabling complex upsert operations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;

&lt;span class="c1"&gt;# Target table
&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Source DataFrame
&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;staging_updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Perform MERGE
&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target.id = source.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source.amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source.id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source.customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source.amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sale_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source.sale_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedBySourceDelete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- MERGE using SQL&lt;/span&gt;
&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;staging_updates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'UPDATE'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'DELETE'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="k"&gt;DELETE&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt;
    &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.7 Delta Lake Schema Management
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Schema Enforcement
&lt;/h4&gt;

&lt;p&gt;Delta Lake enforces schema by default — if you try to write data with an incompatible schema, it will throw an error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This will FAIL if 'new_column' doesn't exist in the schema
&lt;/span&gt;&lt;span class="n"&gt;new_data_with_extra_column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Error: A schema mismatch detected when writing to the Delta table
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Schema Evolution
&lt;/h4&gt;

&lt;p&gt;You can enable schema evolution to automatically add new columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable schema evolution (mergeSchema)
&lt;/span&gt;&lt;span class="n"&gt;new_data_with_extra_column&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Overwrite and update schema
&lt;/span&gt;&lt;span class="n"&gt;new_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwriteSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  ALTER TABLE Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add column
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales ADD COLUMN discount DOUBLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rename column (requires column mapping)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales RENAME COLUMN amount TO total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Drop column (requires column mapping)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales DROP COLUMN old_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Change column type
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales ALTER COLUMN amount TYPE DECIMAL(10,2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add table comment
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;COMMENT ON TABLE sales IS &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Main sales transactions&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add column comment
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales ALTER COLUMN amount COMMENT &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Transaction amount in USD&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.8 Delta Lake Time Travel
&lt;/h3&gt;

&lt;p&gt;Time Travel is one of Delta Lake's most valuable features, allowing you to query historical versions of your data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query version 0 (initial version)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;versionAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query at timestamp
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestampAsOf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01T00:00:00.000Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# View table history
&lt;/span&gt;&lt;span class="n"&gt;delta_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;delta_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Last 10 operations
&lt;/span&gt;
&lt;span class="c1"&gt;# SQL equivalent
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DESCRIBE HISTORY sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DESCRIBE HISTORY sales LIMIT 5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use Cases for Time Travel:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Auditing&lt;/strong&gt;: Track changes to data over time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rollback&lt;/strong&gt;: Restore data to a previous state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging&lt;/strong&gt;: Reproduce issues by querying historical data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regulatory compliance&lt;/strong&gt;: Access point-in-time data for compliance&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.9 OPTIMIZE and VACUUM
&lt;/h3&gt;

&lt;h4&gt;
  
  
  OPTIMIZE
&lt;/h4&gt;

&lt;p&gt;Over time, Delta tables accumulate many small files due to frequent small writes. OPTIMIZE compacts these small files into larger, more efficient files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimize entire table
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPTIMIZE sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optimize with Z-ORDER (co-locate related data)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPTIMIZE sales ZORDER BY (customer_id, sale_date)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Optimize specific partition
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPTIMIZE sales WHERE sale_date = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2024-01-15&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Z-Ordering&lt;/strong&gt; is a technique that co-locates related information in the same set of files. This improves query performance by reducing the amount of data that needs to be read. It's particularly effective for high-cardinality columns that are frequently used in filters.&lt;/p&gt;

&lt;h4&gt;
  
  
  VACUUM
&lt;/h4&gt;

&lt;p&gt;VACUUM removes files that are no longer referenced by the Delta log and are older than the retention period.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Vacuum with default retention (7 days)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VACUUM sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Vacuum with custom retention (in hours)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VACUUM sales RETAIN 168 HOURS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 7 days
&lt;/span&gt;
&lt;span class="c1"&gt;# DRY RUN - shows files to be deleted without actually deleting
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VACUUM sales DRY RUN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# WARNING: This disables time travel beyond the retention period
# To vacuum with less than 7 days (use carefully!):
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.databricks.delta.retentionDurationCheck.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;VACUUM sales RETAIN 0 HOURS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.databricks.delta.retentionDurationCheck.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important&lt;/strong&gt;: You cannot time travel to versions that have been vacuumed. Default retention is 7 days (168 hours).&lt;/p&gt;

&lt;h3&gt;
  
  
  3.10 Delta Table Properties and Features
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View table properties
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DESCRIBE EXTENDED sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SHOW TBLPROPERTIES sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Set table properties
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE sales 
    SET TBLPROPERTIES (
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.logRetentionDuration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;interval 30 days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.deletedFileRetentionDuration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;interval 7 days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.optimizeWrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.autoCompact&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    )
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable column mapping (required for renaming/dropping columns)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE sales 
    SET TBLPROPERTIES (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.columnMapping.mode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Enable Change Data Feed
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE sales 
    SET TBLPROPERTIES (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.enableChangeDataFeed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.11 Change Data Feed (CDF)
&lt;/h3&gt;

&lt;p&gt;Change Data Feed captures row-level changes (inserts, updates, deletes) made to a Delta table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable CDF on table creation
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    CREATE TABLE sales (
        id BIGINT,
        amount DOUBLE,
        updated_at TIMESTAMP
    )
    USING DELTA
    TBLPROPERTIES (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.enableChangeDataFeed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read changes from a specific version
&lt;/span&gt;&lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readChangeFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read changes from a specific timestamp
&lt;/span&gt;&lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readChangeFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingTimestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CDF adds these special columns:
# _change_type: 'insert', 'update_preimage', 'update_postimage', 'delete'
# _commit_version: The version of the commit
# _commit_timestamp: The timestamp of the commit
&lt;/span&gt;
&lt;span class="n"&gt;changes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_change_type = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;insert&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;changes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_change_type IN (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update_preimage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;update_postimage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.12 Constraints
&lt;/h3&gt;

&lt;p&gt;Delta Lake supports column-level and table-level constraints.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# NOT NULL constraint
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE sales 
    ALTER COLUMN id SET NOT NULL
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# CHECK constraint
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE sales 
    ADD CONSTRAINT valid_amount CHECK (amount &amp;gt; 0)
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# View constraints
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DESCRIBE EXTENDED sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Drop constraint
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ALTER TABLE sales DROP CONSTRAINT valid_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 4: ELT with Apache Spark and Delta Lake
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 ELT vs ETL
&lt;/h3&gt;

&lt;p&gt;Traditional ETL (Extract, Transform, Load):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from source&lt;/li&gt;
&lt;li&gt;Transformed outside the target system&lt;/li&gt;
&lt;li&gt;Loaded into the destination&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern ELT (Extract, Load, Transform):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is extracted from source&lt;/li&gt;
&lt;li&gt;Loaded raw into the data lake&lt;/li&gt;
&lt;li&gt;Transformed within the data platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the Databricks Lakehouse, ELT is preferred because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raw data is preserved for reprocessing&lt;/li&gt;
&lt;li&gt;Transformations leverage distributed compute power&lt;/li&gt;
&lt;li&gt;Single platform reduces complexity&lt;/li&gt;
&lt;li&gt;Cost optimization through separation of storage and compute&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4.2 Medallion Architecture
&lt;/h3&gt;

&lt;p&gt;The Medallion Architecture (Bronze, Silver, Gold) is the recommended approach for organizing data in the Lakehouse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Source Systems → [Bronze] → [Silver] → [Gold] → Analytics/ML
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Bronze Layer (Raw)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Raw data ingested as-is from source systems&lt;/li&gt;
&lt;li&gt;No transformations (maybe some basic typing)&lt;/li&gt;
&lt;li&gt;Schema may not be enforced&lt;/li&gt;
&lt;li&gt;Keeps historical record of all raw data&lt;/li&gt;
&lt;li&gt;Data is append-only
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Ingesting raw JSON data into Bronze
&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiLine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/raw/incoming/orders/*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add ingestion metadata
&lt;/span&gt;&lt;span class="n"&gt;bronze_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
                      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_source_file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;input_file_name&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;bronze_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bronze/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Silver Layer (Cleansed)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Data is cleaned, validated, and standardized&lt;/li&gt;
&lt;li&gt;Joins happen at this layer&lt;/li&gt;
&lt;li&gt;Schema enforcement applied&lt;/li&gt;
&lt;li&gt;Still at roughly the same granularity as Bronze&lt;/li&gt;
&lt;li&gt;Deduplication happens here
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from Bronze
&lt;/span&gt;&lt;span class="n"&gt;bronze_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bronze/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Clean and validate
&lt;/span&gt;&lt;span class="n"&gt;silver_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bronze_df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropDuplicates&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decimal(10,2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Upsert into Silver (handle late-arriving data)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;delta.tables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDeltaTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/silver/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;silver_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DeltaTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forPath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/silver/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;silver_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;silver_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target.order_id = source.order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdateAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;silver_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/silver/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Gold Layer (Business Aggregated)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Business-level aggregations&lt;/li&gt;
&lt;li&gt;Optimized for specific use cases (reporting, ML)&lt;/li&gt;
&lt;li&gt;Pre-aggregated data for performance&lt;/li&gt;
&lt;li&gt;Multiple Gold tables may be derived from same Silver data
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create Gold table: Daily Sales Summary
&lt;/span&gt;&lt;span class="n"&gt;gold_daily_sales&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/silver/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;countDistinct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unique_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;gold_daily_sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/gold/daily_sales_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.3 Common Data Transformations
&lt;/h3&gt;

&lt;h4&gt;
  
  
  String Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# String functions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ltrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rtrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat_ws&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;first_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;regexp_replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^0-9]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;regexp_extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;(@.*)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;like&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;A%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rpad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Date and Time Operations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Date/Time functions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_str&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts_str&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd HH:mm:ss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MM/dd/yyyy&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;year&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;month&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dayofmonth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dayofweek&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dayofyear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hour&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;minute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;second&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;datediff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;months_between&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_months&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;last_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next_day&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monday&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unix_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_unixtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unix_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_utc_timestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp_col&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;America/New_York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Null Handling
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Null handling
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;          &lt;span class="c1"&gt;# Returns col1 if not null, else col2
&lt;/span&gt;    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nvl2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="c1"&gt;# If col1 not null return col2, else col3
&lt;/span&gt;    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nullif&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;       &lt;span class="c1"&gt;# Returns null if col1 == col2
&lt;/span&gt;    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ifnull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;       &lt;span class="c1"&gt;# Same as NVL
&lt;/span&gt;    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nanvl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;col2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# Returns col1 if not NaN, else col2
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Fill null values
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Drop rows with any null
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;all&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Drop rows where all values are null
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;na&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Conditional Logic
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CASE WHEN using when/otherwise
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Junior&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mid&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;when&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;150000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Senior&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
     &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;otherwise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Executive&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Using expr for complex SQL expressions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CASE WHEN age &amp;lt; 18 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;minor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; WHEN age &amp;lt; 65 THEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;adult&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; ELSE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;senior&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; END&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.4 Reading Data from Various Sources
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CSV
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delimiter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quote&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;escape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nullValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dateFormat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestampFormat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd HH:mm:ss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiLine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;encoding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;UTF-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/*.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# JSON
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;multiLine&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;inferSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dateFormat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/*.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Parquet
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/parquet/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Avro
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avro&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/avro/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ORC
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;orc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/orc/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Text files
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/text/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/text/*.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.5 Writing Data to Various Formats
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write modes
# "append" - Append to existing data
# "overwrite" - Overwrite all existing data
# "error"/"errorifexists" - Throw error if data exists (default)
# "ignore" - Silently skip write if data exists
&lt;/span&gt;
&lt;span class="c1"&gt;# Write as Parquet
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/output/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write as CSV
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/output.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write as Delta
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Coalesce before writing (reduce output files)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/single/file.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Repartition for balanced output
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/delta/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 5: Incremental Data Processing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 What is Incremental Processing?
&lt;/h3&gt;

&lt;p&gt;Incremental processing is the practice of processing only new or changed data, rather than reprocessing the entire dataset each time. This is critical for production data pipelines because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces compute costs&lt;/li&gt;
&lt;li&gt;Minimizes processing time&lt;/li&gt;
&lt;li&gt;Enables near-real-time data freshness&lt;/li&gt;
&lt;li&gt;Scales to large datasets efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5.2 Structured Streaming
&lt;/h3&gt;

&lt;p&gt;Structured Streaming is Spark's streaming engine built on top of the DataFrame/Dataset API. It treats a streaming data source as an unbounded table that grows continuously.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Concepts
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Micro-batch Processing&lt;/strong&gt;: Processes data in small batches at regular intervals (default 500ms or when data is available).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Continuous Processing&lt;/strong&gt;: For ultra-low latency (millisecond range), though less common.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trigger Types&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;processingTime="0 seconds"&lt;/code&gt; - Process as fast as possible&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processingTime="1 minute"&lt;/code&gt; - Fixed interval&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;once=True&lt;/code&gt; - Process all available data, then stop (batch mode)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;availableNow=True&lt;/code&gt; - Process all available data across micro-batches, then stop&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Input Sources
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Kafka source
&lt;/span&gt;&lt;span class="n"&gt;kafka_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic1,topic2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingOffsets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxOffsetsPerTrigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Parse Kafka value
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;parsed_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafka_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Auto Loader (cloud file source)
&lt;/span&gt;&lt;span class="n"&gt;autoloader_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.inferColumnTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/schema/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/landing/zone/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Delta table as stream source
&lt;/span&gt;&lt;span class="n"&gt;delta_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;maxFilesPerTrigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Rate source (for testing)
&lt;/span&gt;&lt;span class="n"&gt;test_stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rowsPerSecond&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Output Sinks and Output Modes
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Output Modes&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;append&lt;/strong&gt;: Only new rows added since last trigger are written (default for streaming sources)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;complete&lt;/strong&gt;: The entire result table is written (required for aggregations without watermark)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;update&lt;/strong&gt;: Only rows that were updated since last trigger are written
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Write to Delta (most common)
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/checkpoint/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processingTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 minute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/output/delta/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Delta table by name
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/checkpoint/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver.orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Kafka
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;broker:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output-topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/checkpoint/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Write to memory (for debugging/testing)
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;queryName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temp_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM temp_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Console (for debugging)
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;console&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Manage streaming query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;     &lt;span class="c1"&gt;# Current status
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isActive&lt;/span&gt;   &lt;span class="c1"&gt;# True if running
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;     &lt;span class="c1"&gt;# Stop the query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Block until complete
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Block with timeout
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checkpointing
&lt;/h4&gt;

&lt;p&gt;Checkpointing is critical for fault tolerance in streaming. It stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The current offset/position in the source&lt;/li&gt;
&lt;li&gt;The state of aggregations&lt;/li&gt;
&lt;li&gt;Metadata about the query
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Always specify a unique checkpoint location per query
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoint/orders_to_silver/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/silver/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.3 Auto Loader
&lt;/h3&gt;

&lt;p&gt;Auto Loader is Databricks' optimized solution for incrementally loading files from cloud storage. It's the recommended approach for Bronze layer ingestion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Basic Auto Loader setup
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/schema/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/landing/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Bronze Delta table
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/checkpoint/orders_landing/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;availableNow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Auto Loader Benefits&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic schema detection and evolution&lt;/strong&gt;: Infers schema and evolves it as new files arrive&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File discovery&lt;/strong&gt;: Uses cloud notification (preferred) or directory listing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once processing&lt;/strong&gt;: Uses checkpointing to ensure no file is processed twice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles large number of files&lt;/strong&gt;: More efficient than manual file tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Configuration Options&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/schema/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.inferColumnTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.backfillInterval&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 day&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.maxBytesPerTrigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10g&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.maxFilesPerTrigger&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaEvolutionMode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addNewColumns&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/landing/events/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.4 Watermarking in Streaming
&lt;/h3&gt;

&lt;p&gt;Watermarks handle late-arriving data in streaming aggregations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Define watermark (tolerate up to 10 minutes of late data)
&lt;/span&gt;&lt;span class="n"&gt;windowed_counts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withWatermark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10 minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5 minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Sliding window
&lt;/span&gt;&lt;span class="n"&gt;sliding_window&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withWatermark&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 hour&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;15 minutes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# 1-hour window, 15-min slide
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.5 Incremental Processing with COPY INTO
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;COPY INTO&lt;/code&gt; is a SQL command that incrementally loads files into a Delta table. Files are only loaded once (it tracks what's already been loaded).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Basic COPY INTO&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'/path/to/source/'&lt;/span&gt;
&lt;span class="n"&gt;FILEFORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="n"&gt;FORMAT_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'header'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inferSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;COPY_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mergeSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- COPY INTO with JSON&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'/landing/events/'&lt;/span&gt;
&lt;span class="n"&gt;FILEFORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt;
&lt;span class="n"&gt;FORMAT_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'multiLine'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;COPY_OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'mergeSchema'&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'true'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- COPY INTO with explicit schema&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'yyyy-MM-dd'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;read_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/landing/sales/'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;format&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;COPY INTO vs Auto Loader&lt;/strong&gt;:&lt;br&gt;
| Feature | COPY INTO | Auto Loader |&lt;br&gt;
|---------|-----------|-------------|&lt;br&gt;
| Interface | SQL | Python/Scala |&lt;br&gt;
| Scalability | Millions of files | Billions of files |&lt;br&gt;
| File tracking | Within Delta log | Checkpoint files |&lt;br&gt;
| Schema evolution | Manual | Automatic |&lt;br&gt;
| Streaming support | No | Yes |&lt;br&gt;
| Use case | Small-medium datasets | Large-scale ingestion |&lt;/p&gt;


&lt;h2&gt;
  
  
  Chapter 6: Production Pipelines with Delta Live Tables (DLT)
&lt;/h2&gt;
&lt;h3&gt;
  
  
  6.1 Introduction to Delta Live Tables
&lt;/h3&gt;

&lt;p&gt;Delta Live Tables (DLT) is a declarative framework for building reliable, maintainable, and testable data pipelines on Databricks. Instead of writing imperative code to manage pipeline execution, you declare the data transformations and DLT handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pipeline orchestration and execution ordering&lt;/li&gt;
&lt;li&gt;Automatic error handling and retries&lt;/li&gt;
&lt;li&gt;Data quality enforcement&lt;/li&gt;
&lt;li&gt;Monitoring and observability&lt;/li&gt;
&lt;li&gt;Infrastructure management&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  6.2 DLT Pipeline Components
&lt;/h3&gt;
&lt;h4&gt;
  
  
  Datasets
&lt;/h4&gt;

&lt;p&gt;DLT has three types of datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Tables&lt;/strong&gt;: For append-only streaming sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialized Views&lt;/strong&gt;: For aggregations and transformations that need to be persisted&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Views&lt;/strong&gt;: For temporary transformations (not stored)&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;
  
  
  Pipeline Types
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Triggered&lt;/strong&gt;: Run once and stop (like a scheduled batch job)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous&lt;/strong&gt;: Run continuously to minimize latency&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  6.3 Defining DLT Pipelines
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;

&lt;span class="c1"&gt;# === BRONZE LAYER ===
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a streaming table from Auto Loader
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Raw orders data ingested from landing zone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipelines.reset.allowed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;raw_orders&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.inferColumnTypes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.schemaLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/pipelines/schemas/raw_orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/landing/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# === SILVER LAYER ===
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a streaming table with expectations (data quality)
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Cleaned and validated orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status IN (&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PENDING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PROCESSING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CANCELLED&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_orders&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decimal(10,2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# === GOLD LAYER ===
&lt;/span&gt;
&lt;span class="c1"&gt;# Create a Materialized View for aggregations
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_sales_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;comment&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily sales aggregated by product category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table_properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;daily_sales_summary&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;COMPLETED&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_category&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_order_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  6.4 DLT Expectations (Data Quality)
&lt;/h3&gt;

&lt;p&gt;Expectations define data quality rules in DLT pipelines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# @dlt.expect: Tracks violations but doesn't drop or fail
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email LIKE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%@%.%&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# @dlt.expect_or_drop: Drops rows that violate the constraint
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;non_null_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# @dlt.expect_or_fail: Fails the pipeline if any row violates
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_check&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt;= 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# Multiple expectations
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;positive_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nd"&gt;@dlt.expect_or_fail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_customer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# expect_all: Apply multiple constraints at once
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect_all&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date &amp;gt;= &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2020-01-01&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# expect_all_or_drop
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect_all_or_drop&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;valid_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="c1"&gt;# expect_all_or_fail
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.expect_all_or_fail&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id IS NOT NULL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount &amp;gt; 0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.5 DLT with SQL
&lt;/h3&gt;

&lt;p&gt;DLT also supports SQL syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Bronze: Streaming table from Auto Loader&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;raw_customers&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="nv"&gt;"Raw customer data from source systems"&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;cloud_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"/landing/customers/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"cloudFiles.inferColumnTypes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="c1"&gt;-- Silver: Streaming table with expectations&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;STREAMING&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clean_customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_customer_id&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="n"&gt;FAIL&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_email&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;valid_age&lt;/span&gt; &lt;span class="n"&gt;EXPECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;VIOLATION&lt;/span&gt; &lt;span class="n"&gt;WARN&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="nv"&gt;"Cleaned customer data"&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'yyyy-MM-dd'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;signup_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;_processed_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;STREAM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raw_customers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="c1"&gt;-- Gold: Materialized view&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;customer_summary&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="nv"&gt;"Customer segment summary"&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;CASE&lt;/span&gt; 
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Gen Z'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Millennial'&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Gen X'&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Boomer'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;age_segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;customer_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_age&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;LIVE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clean_customers&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.6 DLT Pipeline Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Orders Pipeline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"storage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/orders/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"target"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"orders_db"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"libraries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"notebook"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/bronze/raw_orders"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"notebook"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/silver/clean_orders"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"notebook"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/gold/order_summaries"&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clusters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"autoscale"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"min_workers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"max_workers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"development"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"photon"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"channel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CURRENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"edition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ADVANCED"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"continuous"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.7 DLT APPLY CHANGES (CDC)
&lt;/h3&gt;

&lt;p&gt;DLT supports Change Data Capture (CDC) through the &lt;code&gt;APPLY CHANGES INTO&lt;/code&gt; syntax.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dlt&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Source CDC stream
&lt;/span&gt;&lt;span class="nd"&gt;@dlt.view&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;customers_cdc&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readStream&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloudFiles.format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/cdc/customers/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply changes to target table
&lt;/span&gt;&lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_streaming_table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers_cdc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;sequence_by&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Sequence field to handle ordering
&lt;/span&gt;    &lt;span class="n"&gt;apply_as_deletes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DELETE&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;apply_as_truncates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TRUNCATE&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;except_column_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;operation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;  &lt;span class="c1"&gt;# Columns to exclude
&lt;/span&gt;    &lt;span class="n"&gt;stored_as_scd_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# SCD Type 1 (overwrite) or 2 (history)
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SCD Type 2 with APPLY CHANGES&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dlt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_changes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers_history&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers_cdc&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;sequence_by&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;stored_as_scd_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Maintain full history
&lt;/span&gt;    &lt;span class="n"&gt;track_history_column_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Only track specific columns
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 7: Databricks Workflows and Job Orchestration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Introduction to Databricks Workflows
&lt;/h3&gt;

&lt;p&gt;Databricks Workflows (formerly Jobs) is the built-in orchestration service for scheduling and running data pipelines. It's used to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Schedule notebooks, Python scripts, JARs, and DLT pipelines&lt;/li&gt;
&lt;li&gt;Create complex multi-task dependencies (DAGs)&lt;/li&gt;
&lt;li&gt;Monitor execution and receive alerts&lt;/li&gt;
&lt;li&gt;Implement retry logic and error handling&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.2 Creating and Configuring Jobs
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Single Task Jobs
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using the Databricks SDK
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WorkspaceClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk.service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkspaceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create a simple notebook job
&lt;/span&gt;&lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Data Pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;task_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingest_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;notebook_task&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NotebookTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;notebook_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/pipelines/ingestion/load_bronze&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;base_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{ds}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;new_cluster&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClusterSpec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;spark_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;13.3.x-scala2.12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;node_type_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i3.xlarge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;num_workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;timeout_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_retries&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;min_retry_interval_millis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;  &lt;span class="c1"&gt;# 1 minute between retries
&lt;/span&gt;        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CronSchedule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;quartz_cron_expression&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 0 8 * * ?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Daily at 8 AM
&lt;/span&gt;        &lt;span class="n"&gt;timezone_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;America/New_York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pause_status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PauseStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UNPAUSED&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;email_notifications&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;JobEmailNotifications&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;on_failure&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;on_success&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-team@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Multi-Task Jobs (DAGs)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ETL Pipeline"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tasks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ingest_bronze"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notebook_task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notebook_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/bronze/ingest"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"new_cluster"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spark_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"13.3.x-scala2.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"node_type_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i3.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"num_workers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transform_silver"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depends_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ingest_bronze"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notebook_task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notebook_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/silver/transform"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"existing_cluster_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{{job_cluster_id}}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aggregate_gold_sales"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depends_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transform_silver"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notebook_task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notebook_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/gold/sales_agg"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aggregate_gold_customers"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depends_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transform_silver"&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notebook_task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notebook_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/gold/customer_agg"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"update_reporting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"depends_on"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aggregate_gold_sales"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"task_key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aggregate_gold_customers"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"notebook_task"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"notebook_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/pipelines/reporting/refresh"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.3 Task Types
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Notebook task
&lt;/span&gt;&lt;span class="n"&gt;notebook_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NotebookTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;notebook_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/notebook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Python Script task
&lt;/span&gt;&lt;span class="n"&gt;python_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparkPythonTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;python_file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/script.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# JAR task
&lt;/span&gt;&lt;span class="n"&gt;jar_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SparkJarTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;main_class_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;com.company.MainClass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;jar_uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbfs:/jars/my_app.jar&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arg2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# DLT Pipeline task
&lt;/span&gt;&lt;span class="n"&gt;dlt_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PipelineTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pipeline_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pipeline-id-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# SQL task
&lt;/span&gt;&lt;span class="n"&gt;sql_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SqlTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SqlTaskQuery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;warehouse_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Python Wheel task
&lt;/span&gt;&lt;span class="n"&gt;wheel_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jobs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;PythonWheelTask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;package_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_package&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;entry_point&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;named_parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.4 Job Parameters and Dynamic Values
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In notebooks, access job parameters using dbutils
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;env&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;run_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Dynamic value syntax in job configuration
# {{run_id}} - Current run ID
# {{job_id}} - Job ID
# {{task_key}} - Task key
# {{start_time}} - Start time
# {{parent_run_id}} - Parent run ID
&lt;/span&gt;
&lt;span class="c1"&gt;# Example: Pass current date
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;base_parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{start_date}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;{{run_id}}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.5 Cluster Policies
&lt;/h3&gt;

&lt;p&gt;Cluster policies allow administrators to control what users can configure when creating clusters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Standard Data Engineering Policy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"definition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spark_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"13.3.x-scala2.12"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"12.2.x-scala2.12"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"node_type_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"allowlist"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"i3.xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i3.2xlarge"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"i3.4xlarge"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"num_workers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"range"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"minValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"maxValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"defaultValue"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"autotermination_minutes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spark_conf.spark.databricks.delta.preview.enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"fixed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.6 dbutils — Databricks Utilities
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;dbutils&lt;/code&gt; is a powerful utility library available in Databricks notebooks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# File System utilities
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/directory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# List files
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdirs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/new/dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;       &lt;span class="c1"&gt;# Create directory
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/source/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/dest/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Copy file
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/source/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/dest/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Move file
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# Remove file
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/dir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;recurse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Remove directory
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# Read first N bytes
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Write content to file
&lt;/span&gt;
&lt;span class="c1"&gt;# Widgets
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;combobox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eu-west&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;multiselect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tables&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Get widget value
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remove&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parameter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Remove widget
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;widgets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Remove all widgets
&lt;/span&gt;
&lt;span class="c1"&gt;# Secrets
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# List secrets in scope
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope-name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secret-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Get secret value
&lt;/span&gt;
&lt;span class="c1"&gt;# Notebook utilities
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/to/notebook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;param&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notebook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Exit with message
&lt;/span&gt;
&lt;span class="c1"&gt;# Library utilities
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;installPyPI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pandas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.5.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;restartPython&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 8: Data Governance with Unity Catalog
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Introduction to Unity Catalog
&lt;/h3&gt;

&lt;p&gt;Unity Catalog is Databricks' unified governance solution that provides centralized access control, auditing, lineage, and data discovery across all workspaces in an account.&lt;/p&gt;

&lt;p&gt;Key capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unified data governance&lt;/strong&gt;: Single control plane for all data assets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained access control&lt;/strong&gt;: Table, column, and row-level security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lineage&lt;/strong&gt;: Automatic lineage tracking at the column level&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt;: Comprehensive audit trail of all data access&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data discovery&lt;/strong&gt;: Built-in search and tagging&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.2 Unity Catalog Object Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Account
└── Metastore (one per region)
    └── Catalog
        └── Schema (Database)
            ├── Tables
            │   ├── Managed Tables
            │   └── External Tables
            ├── Views
            ├── Functions
            ├── Volumes
            └── Models (MLflow)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Three-Level Namespace
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Full path syntax: catalog.schema.table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;my_table&lt;/span&gt;

&lt;span class="c1"&gt;-- Set default catalog and schema&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;my_catalog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;my_schema&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- After setting defaults, use two-level namespace&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.3 Creating and Managing Objects
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create catalog&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Production data catalog'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create schema&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Sales data schema'&lt;/span&gt;
&lt;span class="n"&gt;MANAGED&lt;/span&gt; &lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'abfss://container@storage.dfs.core.windows.net/production/sales/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create managed table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Main orders table'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create external table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;external_data&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'abfss://container@storage.dfs.core.windows.net/external/data/'&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'External data table'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create view&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;active_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ACTIVE'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create function&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;calculate_tax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rate&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.4 Access Control
&lt;/h3&gt;

&lt;p&gt;Unity Catalog uses a hierarchical permission model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Grant permissions on catalog&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_engineers`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant permissions on schema&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_engineers`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant permissions on table&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;MODIFY&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`analysts_group`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`junior_analysts`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Grant all privileges&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`admin_group`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Revoke permissions&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="k"&gt;MODIFY&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`junior_analysts`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Show grants&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;GRANTS&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;GRANTS&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_engineers`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Row-level security using row filters&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;is_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'admin'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;current_user&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; 
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region_filter&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Column-level security using column masks&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_ssn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;is_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hr_team'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'***-**-'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;RIGHT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ssn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt; 
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;ssn&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;MASK&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mask_ssn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.5 Unity Catalog Privileges Reference
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Privilege&lt;/th&gt;
&lt;th&gt;Applies To&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE CATALOG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metastore&lt;/td&gt;
&lt;td&gt;Create new catalogs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USE CATALOG&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;td&gt;Access catalog objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE SCHEMA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Catalog&lt;/td&gt;
&lt;td&gt;Create schemas in catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;USE SCHEMA&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;Access schema objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Schema&lt;/td&gt;
&lt;td&gt;Create tables in schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Table/View&lt;/td&gt;
&lt;td&gt;Read data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MODIFY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Table&lt;/td&gt;
&lt;td&gt;Insert, update, delete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;READ FILES&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Read files from volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WRITE FILES&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Volume&lt;/td&gt;
&lt;td&gt;Write files to volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXECUTE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Function&lt;/td&gt;
&lt;td&gt;Call functions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ALL PRIVILEGES&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;All applicable privileges&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  8.6 Data Lineage
&lt;/h3&gt;

&lt;p&gt;Unity Catalog automatically captures data lineage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Lineage is automatically captured for:
# - Table reads and writes
# - SQL queries
# - DLT pipelines
# - Notebooks
# - Jobs
&lt;/span&gt;
&lt;span class="c1"&gt;# View lineage through the Catalog Explorer UI
# Or query system tables
&lt;/span&gt;&lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_lineage&lt;/span&gt;
&lt;span class="n"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;target_table_full_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production.sales.orders&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.7 Volumes
&lt;/h3&gt;

&lt;p&gt;Volumes are Unity Catalog objects that manage access to non-tabular data in cloud storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create external volume&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;landing_zone&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'abfss://container@storage.dfs.core.windows.net/landing/'&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Landing zone for raw data files'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Create managed volume&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VOLUME&lt;/span&gt; &lt;span class="n"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raw_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;processed_files&lt;/span&gt;
&lt;span class="k"&gt;COMMENT&lt;/span&gt; &lt;span class="s1"&gt;'Processed file storage'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Access volume using /Volumes path&lt;/span&gt;
&lt;span class="c1"&gt;-- /Volumes/&amp;lt;catalog&amp;gt;/&amp;lt;schema&amp;gt;/&amp;lt;volume&amp;gt;/&amp;lt;path&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read from volume
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/production/raw_data/landing_zone/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to volume
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/production/raw_data/processed_files/output/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Using dbutils with volumes
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ls&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/production/raw_data/landing_zone/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  8.8 System Tables
&lt;/h3&gt;

&lt;p&gt;Unity Catalog provides system tables for audit logging and access tracking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Audit logs&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;audit&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;action_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SELECT'&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Table access history&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;access&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_access_history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_full_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'production.sales.orders'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;access_date&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Billing and usage&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;usage&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;current_date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query history&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 9: Databricks SQL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 Overview of Databricks SQL
&lt;/h3&gt;

&lt;p&gt;Databricks SQL is a serverless data warehouse built on top of the Databricks Lakehouse Platform. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SQL Editor&lt;/strong&gt;: Web-based SQL interface&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL Warehouses&lt;/strong&gt;: Scalable compute for SQL workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboards&lt;/strong&gt;: Built-in visualization and reporting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts&lt;/strong&gt;: Notifications based on query results&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query History&lt;/strong&gt;: Track and optimize queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9.2 SQL Warehouses
&lt;/h3&gt;

&lt;p&gt;SQL Warehouses (formerly SQL Endpoints) are the compute resources for Databricks SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Classic&lt;/strong&gt;: Standard SQL warehouse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serverless&lt;/strong&gt;: Databricks-managed infrastructure (fastest startup, pay-per-use)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pro&lt;/strong&gt;: Enhanced features including Unity Catalog support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Auto-scaling&lt;/strong&gt;: SQL warehouses automatically scale to handle concurrent queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  9.3 Advanced SQL Features
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Common Table Expressions (CTEs)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Basic CTE&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;customer_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; 
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'COMPLETED'&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;customer_segments&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;CASE&lt;/span&gt; 
            &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Platinum'&lt;/span&gt;
            &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Gold'&lt;/span&gt;
            &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;total_spent&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Silver'&lt;/span&gt;
            &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Bronze'&lt;/span&gt;
        &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customer_orders&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;segment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_spent&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customer_orders&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customer_segments&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Recursive CTE (for hierarchical data)&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;employee_hierarchy&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;-- Base case&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

    &lt;span class="c1"&gt;-- Recursive case&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;employee_hierarchy&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employee_hierarchy&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Window Functions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Running total&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Moving average&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;_day_avg&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Ranking with ties&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dense_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;row_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;NTILE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;quartile&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENT_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pct_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CUME_DIST&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cum_dist&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;product_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- First/Last value&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FIRST_VALUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; 
        &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;first_order_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;LAST_VALUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; 
        &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
        &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;FOLLOWING&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;last_order_amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  PIVOT and UNPIVOT
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- PIVOT&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;monthly_sales&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PIVOT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Jan'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Feb'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Mar'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Apr'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'May'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Jun'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- UNPIVOT&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;monthly_sales_wide&lt;/span&gt;
&lt;span class="n"&gt;UNPIVOT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jan_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feb_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mar_revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Advanced Aggregations
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- GROUPING SETS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ALL CATEGORIES'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'ALL REGIONS'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;GROUPING&lt;/span&gt; &lt;span class="k"&gt;SETS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- ROLLUP&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;ROLLUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- CUBE&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_sales&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales_data&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;CUBE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;channel&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- FILTER clause&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'COMPLETED'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;completed_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'PENDING'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pending_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;large_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Higher-Order Functions
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- TRANSFORM: Apply function to each element&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;TRANSFORM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;skills&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;UPPER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;upper_skills&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- FILTER: Filter array elements&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;passing_scores&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;students&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- AGGREGATE: Aggregate array elements&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AGGREGATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amounts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders_with_arrays&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- EXISTS: Check if any element matches&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;has_expensive_items&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;shopping_carts&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- FORALL: Check if all elements match&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;FORALL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;all_passing&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;students&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- REDUCE (alias for AGGREGATE)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;REDUCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amounts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;acc&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;arrays_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 10: Security and Access Control
&lt;/h2&gt;

&lt;h3&gt;
  
  
  10.1 Databricks Security Model
&lt;/h3&gt;

&lt;p&gt;Databricks provides multiple layers of security:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Network Security&lt;/strong&gt;: VPC/VNET peering, Private Link, IP Access Lists&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: SSO, SCIM provisioning, Service Principals, PAT tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authorization&lt;/strong&gt;: Unity Catalog, Table ACLs, Workspace ACLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encryption&lt;/strong&gt;: At-rest and in-transit encryption&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit&lt;/strong&gt;: Comprehensive audit logging&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  10.2 Identity and Access Management
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Users and Groups
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Using Databricks SDK
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WorkspaceClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk.service&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;iam&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WorkspaceClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Create service principal
&lt;/span&gt;&lt;span class="n"&gt;sp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;service_principals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;display_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-service-principal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;application_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;app-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add user to group
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;groups&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;group_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Patch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;iam&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PatchOp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ADD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;members&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Service Principals
&lt;/h4&gt;

&lt;p&gt;Service principals are non-human identities used for automated jobs and service-to-service authentication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Generate token for service principal
# Configure in the Databricks CLI
&lt;/span&gt;&lt;span class="n"&gt;databricks&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="n"&gt;https&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;azuredatabricks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;net&lt;/span&gt;

&lt;span class="c1"&gt;# Use service principal in jobs
# Set environment variables
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABRICKS_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://workspace.azuredatabricks.net&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABRICKS_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sp-token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  10.3 Secrets Management
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create secret scope (using CLI)
# databricks secrets create-scope --scope my-scope
&lt;/span&gt;
&lt;span class="c1"&gt;# Create secret (using CLI)
# databricks secrets put --scope my-scope --key my-key
&lt;/span&gt;
&lt;span class="c1"&gt;# Access secrets in notebooks
&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use in configuration (value is never shown in plain text)
&lt;/span&gt;&lt;span class="n"&gt;jdbc_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:postgresql://host:5432/db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;connection_properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;db-password&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;jdbc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;jdbc_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;connection_properties&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# List secrets (names only, not values)
&lt;/span&gt;&lt;span class="n"&gt;dbutils&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-scope&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  10.4 Table ACLs (Legacy - Pre-Unity Catalog)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Grant table permissions (Hive metastore)&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`user@company.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;MODIFY&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`data_engineers`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="k"&gt;PRIVILEGES&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;my_schema&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`schema_owner`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Deny permissions&lt;/span&gt;
&lt;span class="n"&gt;DENY&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensitive_table&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="nv"&gt;`junior_analysts`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Revoke permissions&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`user@company.com`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Show grants&lt;/span&gt;
&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 11: Performance Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11.1 Understanding Spark Performance
&lt;/h3&gt;

&lt;p&gt;Performance optimization in Spark involves understanding and addressing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data Skew&lt;/strong&gt;: Uneven distribution of data across partitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shuffle&lt;/strong&gt;: Expensive data movement across the network&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serialization&lt;/strong&gt;: Cost of converting objects for transfer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Management&lt;/strong&gt;: Efficient use of executor memory&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  11.2 Partitioning Strategies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Check current number of partitions
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rdd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getNumPartitions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Repartition (full shuffle)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;repartition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Hash partition by column
&lt;/span&gt;
&lt;span class="c1"&gt;# Coalesce (no shuffle, can only reduce partitions)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;coalesce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Configure default shuffle partitions
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.shuffle.partitions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Default: 200
&lt;/span&gt;
&lt;span class="c1"&gt;# Adaptive Query Execution (AQE) - automatically optimizes partitions
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.coalescePartitions.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.adaptive.skewJoin.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.3 Broadcast Joins
&lt;/h3&gt;

&lt;p&gt;When one DataFrame is small enough to fit in memory, broadcasting it avoids an expensive shuffle join.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;broadcast&lt;/span&gt;

&lt;span class="c1"&gt;# Manual broadcast hint
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;large_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;broadcast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;small_df&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Configure auto-broadcast threshold
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10mb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Default: 10MB
&lt;/span&gt;
&lt;span class="c1"&gt;# Disable auto-broadcast
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.sql.autoBroadcastJoinThreshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.4 Caching and Persistence
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Cache in memory (serialized)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Persist with storage level
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StorageLevel&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEMORY_AND_DISK&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEMORY_ONLY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DISK_ONLY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;persist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StorageLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OFF_HEAP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Unpersist
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unpersist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Cache a Delta table (result in Spark memory)
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cacheTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uncacheTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clearCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.5 Query Optimization
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Explain Plan
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# View logical and physical execution plans
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="c1"&gt;# Extended plan
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cost&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Cost-based optimizer plan
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;codegen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Generated code
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explain&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formatted&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Formatted output
&lt;/span&gt;
&lt;span class="c1"&gt;# Or in SQL
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EXPLAIN SELECT * FROM orders WHERE status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ACTIVE&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;EXPLAIN EXTENDED SELECT * FROM orders WHERE status = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ACTIVE&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Predicate Pushdown
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Spark automatically pushes filters down to minimize data read
# Write queries so filters can be pushed down to storage
&lt;/span&gt;
&lt;span class="c1"&gt;# Good: Filter happens early
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Bad: Filter after expensive operations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/path/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;expensive_udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2024-01-01&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Column Pruning
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Only read columns you need
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Better than df.select("*")
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.6 Delta Lake Performance Features
&lt;/h3&gt;

&lt;h4&gt;
  
  
  File Statistics and Data Skipping
&lt;/h4&gt;

&lt;p&gt;Delta Lake maintains statistics (min, max, null count) for each file. Query predicates can use these to skip files entirely.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Z-ORDER colocates data for efficient skipping&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;ZORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;-- Check data skipping effectiveness&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_stats&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;table_catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'production'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;table_schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sales'&lt;/span&gt;
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'orders'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Liquid Clustering
&lt;/h4&gt;

&lt;p&gt;Liquid Clustering is the next generation of data layout optimization, replacing OPTIMIZE ZORDER BY.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Create table with liquid clustering&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;DELTA&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Add clustering to existing table&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Run clustering&lt;/span&gt;
&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check clustering information&lt;/span&gt;
&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;EXTENDED&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Auto Optimize
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enable Auto Optimize for continuous optimization
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    ALTER TABLE orders SET TBLPROPERTIES (
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.optimizeWrite&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;,
        &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta.autoOptimize.autoCompact&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    )
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Globally enable
&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.databricks.delta.optimizeWrite.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark.databricks.delta.autoCompact.enabled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.7 Photon Engine
&lt;/h3&gt;

&lt;p&gt;Photon is Databricks' next-generation query engine written in C++. It provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vectorized execution for much faster processing&lt;/li&gt;
&lt;li&gt;Accelerates SQL and DataFrame operations&lt;/li&gt;
&lt;li&gt;Particularly effective for aggregations, joins, and string operations&lt;/li&gt;
&lt;li&gt;Enabled per cluster configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  11.8 Databricks Runtime Optimizations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Use Databricks Runtime (DBR) for built-in optimizations
# DBR includes:
# - Optimized Spark
# - Delta Lake
# - MLflow
# - Performance improvements
&lt;/span&gt;
&lt;span class="c1"&gt;# Check DBR version
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;databricks.sdk.runtime&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;runtime&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbr_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use latest LTS version for stability
# or latest version for newest features
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Chapter 12: Exam Tips and Practice Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  12.1 Exam Overview
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Databricks Certified Data Engineer Associate&lt;/strong&gt; exam tests your knowledge of:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Weight&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Databricks Lakehouse Platform&lt;/td&gt;
&lt;td&gt;~24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ELT with Apache Spark&lt;/td&gt;
&lt;td&gt;~29%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incremental Data Processing&lt;/td&gt;
&lt;td&gt;~22%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production Pipelines&lt;/td&gt;
&lt;td&gt;~16%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Governance&lt;/td&gt;
&lt;td&gt;~9%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Exam Format:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;45 questions (multiple choice)&lt;/li&gt;
&lt;li&gt;90 minutes&lt;/li&gt;
&lt;li&gt;Passing score: ~70%&lt;/li&gt;
&lt;li&gt;Available on Webassessor&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  12.2 Key Concepts to Master
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Delta Lake Must-Knows
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; — Delta Lake provides full ACID compliance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction log&lt;/strong&gt; (&lt;code&gt;_delta_log&lt;/code&gt;) — Every operation is recorded&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time Travel&lt;/strong&gt; — Query historical versions using &lt;code&gt;VERSION AS OF&lt;/code&gt; or &lt;code&gt;TIMESTAMP AS OF&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM&lt;/strong&gt; — Removes old files, default retention is 7 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPTIMIZE + ZORDER&lt;/strong&gt; — Compacts small files and colocates data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema enforcement&lt;/strong&gt; vs &lt;strong&gt;Schema evolution&lt;/strong&gt; — Enforcement is default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MERGE&lt;/strong&gt; — Upsert operation combining INSERT and UPDATE&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDF&lt;/strong&gt; — Captures row-level changes for downstream processing&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Spark Must-Knows
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lazy evaluation&lt;/strong&gt; — Transformations are lazy, actions trigger execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitions&lt;/strong&gt; — Data is distributed across partitions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shuffle&lt;/strong&gt; — Most expensive operation (GROUP BY, JOIN, DISTINCT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Broadcast join&lt;/strong&gt; — Small table broadcast avoids shuffle&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching&lt;/strong&gt; — Use &lt;code&gt;.cache()&lt;/code&gt; for repeatedly accessed DataFrames&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AQE&lt;/strong&gt; — Adaptive Query Execution auto-optimizes plans&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Streaming Must-Knows
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Output modes&lt;/strong&gt; — append, complete, update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt; — Required for fault tolerance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watermarks&lt;/strong&gt; — Handle late-arriving data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Loader&lt;/strong&gt; — Best practice for file ingestion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COPY INTO&lt;/strong&gt; — SQL alternative for incremental file loading&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Triggers&lt;/strong&gt; — Control when streaming query runs&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  DLT Must-Knows
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Tables&lt;/strong&gt; vs &lt;strong&gt;Materialized Views&lt;/strong&gt; — When to use each&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expectations&lt;/strong&gt; — expect, expect_or_drop, expect_or_fail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;APPLY CHANGES INTO&lt;/strong&gt; — CDC handling in DLT&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pipeline modes&lt;/strong&gt; — Triggered vs Continuous&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Development mode&lt;/strong&gt; — Does not enforce expectations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dlt.read()&lt;/code&gt; vs &lt;code&gt;dlt.read_stream()&lt;/code&gt;&lt;/strong&gt; — Batch vs streaming read&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  12.3 Practice Questions
&lt;/h3&gt;




&lt;p&gt;&lt;strong&gt;Question 1:&lt;/strong&gt; A data engineer needs to process a large JSON dataset stored in cloud object storage incrementally. New files arrive daily. Which approach is BEST suited for this use case?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;spark.read.json()&lt;/code&gt; with a daily cron schedule&lt;br&gt;
B) Auto Loader with &lt;code&gt;cloudFiles&lt;/code&gt; format and a streaming checkpoint&lt;br&gt;
C) &lt;code&gt;COPY INTO&lt;/code&gt; command in a daily job&lt;br&gt;
D) &lt;code&gt;spark.read.format("delta").option("readChangeFeed", "true")&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — Auto Loader with cloudFiles format is specifically designed for incremental file ingestion with exactly-once processing guarantees.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 2:&lt;/strong&gt; What is stored in the &lt;code&gt;_delta_log&lt;/code&gt; directory of a Delta table?&lt;/p&gt;

&lt;p&gt;A) The raw Parquet data files&lt;br&gt;
B) JSON and Parquet files tracking all changes to the table&lt;br&gt;
C) Schema definition files only&lt;br&gt;
D) A backup of deleted records&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — The &lt;code&gt;_delta_log&lt;/code&gt; directory contains JSON transaction log files and periodic Parquet checkpoint files that record all changes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 3:&lt;/strong&gt; A Delta table has accumulated many small files over time. Which command BEST addresses this issue?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;VACUUM table_name&lt;/code&gt;&lt;br&gt;
B) &lt;code&gt;OPTIMIZE table_name&lt;/code&gt;&lt;br&gt;
C) &lt;code&gt;ANALYZE table_name&lt;/code&gt;&lt;br&gt;
D) &lt;code&gt;REFRESH table_name&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — &lt;code&gt;OPTIMIZE&lt;/code&gt; compacts small files into larger ones. &lt;code&gt;VACUUM&lt;/code&gt; removes old unreferenced files but doesn't compact.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 4:&lt;/strong&gt; You run &lt;code&gt;VACUUM sales RETAIN 0 HOURS&lt;/code&gt;. What is the consequence?&lt;/p&gt;

&lt;p&gt;A) All data is permanently deleted from the table&lt;br&gt;
B) You can no longer query historical versions of the data using time travel&lt;br&gt;
C) All metadata is removed but data remains accessible&lt;br&gt;
D) Future MERGE operations will fail&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — VACUUM removes the files needed for time travel. You can no longer access historical versions that have been vacuumed.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 5:&lt;/strong&gt; In a DLT pipeline, which expectation type will FAIL the entire pipeline if violated?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;@dlt.expect()&lt;/code&gt;&lt;br&gt;
B) &lt;code&gt;@dlt.expect_or_drop()&lt;/code&gt;&lt;br&gt;
C) &lt;code&gt;@dlt.expect_or_fail()&lt;/code&gt;&lt;br&gt;
D) &lt;code&gt;@dlt.expect_all()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — &lt;code&gt;@dlt.expect_or_fail()&lt;/code&gt; fails the pipeline if any row violates the constraint.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 6:&lt;/strong&gt; Which statement about Spark's lazy evaluation is TRUE?&lt;/p&gt;

&lt;p&gt;A) Both transformations and actions execute immediately&lt;br&gt;
B) Transformations execute immediately but actions are lazy&lt;br&gt;
C) Transformations are lazy and execute only when an action is called&lt;br&gt;
D) All operations are eagerly evaluated in Databricks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — Transformations (like &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;) are lazy and only execute when an action (like &lt;code&gt;collect&lt;/code&gt;, &lt;code&gt;show&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;) is called.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 7:&lt;/strong&gt; A streaming query reads from a Kafka topic and aggregates events by 5-minute windows. Data can arrive up to 10 minutes late. Which feature handles the late data?&lt;/p&gt;

&lt;p&gt;A) Checkpoint&lt;br&gt;
B) Trigger interval&lt;br&gt;
C) Watermark&lt;br&gt;
D) Output mode&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — Watermarks specify how late data can arrive and are used to drop late data gracefully.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 8:&lt;/strong&gt; What is the difference between &lt;code&gt;MERGE&lt;/code&gt; and &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; in Delta Lake?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;MERGE&lt;/code&gt; is only for streaming; &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; is for batch&lt;br&gt;
B) &lt;code&gt;MERGE&lt;/code&gt; can update, insert, and delete; &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; replaces matching partition data&lt;br&gt;
C) &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; is ACID compliant; &lt;code&gt;MERGE&lt;/code&gt; is not&lt;br&gt;
D) There is no difference; they are aliases&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — &lt;code&gt;MERGE&lt;/code&gt; is a sophisticated upsert that can handle updates, inserts, and deletes based on conditions. &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; replaces data in matching partitions.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 9:&lt;/strong&gt; In the Medallion Architecture, which layer contains business-level aggregations optimized for reporting?&lt;/p&gt;

&lt;p&gt;A) Bronze&lt;br&gt;
B) Silver&lt;br&gt;
C) Gold&lt;br&gt;
D) Platinum&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — The Gold layer contains business-level aggregations ready for consumption by BI tools and analysts.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 10:&lt;/strong&gt; Which command shows the transaction history of a Delta table?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;SHOW HISTORY table_name&lt;/code&gt;&lt;br&gt;
B) &lt;code&gt;DESCRIBE HISTORY table_name&lt;/code&gt;&lt;br&gt;
C) &lt;code&gt;SELECT * FROM table_name VERSION AS OF 0&lt;/code&gt;&lt;br&gt;
D) &lt;code&gt;SHOW TRANSACTIONS table_name&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — &lt;code&gt;DESCRIBE HISTORY table_name&lt;/code&gt; shows the full transaction history including operation type, timestamp, user, and version.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 11:&lt;/strong&gt; A streaming query uses &lt;code&gt;.trigger(once=True)&lt;/code&gt;. What does this do?&lt;/p&gt;

&lt;p&gt;A) Processes data once per second&lt;br&gt;
B) Processes all available data in one micro-batch, then stops&lt;br&gt;
C) Runs continuously with a 1-second interval&lt;br&gt;
D) Processes only the first record from the source&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — &lt;code&gt;trigger(once=True)&lt;/code&gt; processes all available data in a single batch and terminates, combining the benefits of batch and streaming.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 12:&lt;/strong&gt; Which Unity Catalog privilege allows a user to read files from a Volume?&lt;/p&gt;

&lt;p&gt;A) &lt;code&gt;SELECT&lt;/code&gt;&lt;br&gt;
B) &lt;code&gt;READ FILES&lt;/code&gt;&lt;br&gt;
C) &lt;code&gt;USE VOLUME&lt;/code&gt;&lt;br&gt;
D) &lt;code&gt;USAGE&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — &lt;code&gt;READ FILES&lt;/code&gt; is the specific privilege for reading files from Unity Catalog Volumes.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 13:&lt;/strong&gt; What is the default behavior when writing a DataFrame to a Delta table without specifying a write mode?&lt;/p&gt;

&lt;p&gt;A) Append to existing data&lt;br&gt;
B) Overwrite existing data&lt;br&gt;&lt;br&gt;
C) Throw an error if data already exists&lt;br&gt;
D) Merge with existing data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — The default write mode is &lt;code&gt;error&lt;/code&gt; (or &lt;code&gt;errorifexists&lt;/code&gt;), which throws an error if the target already contains data.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 14:&lt;/strong&gt; A data engineer wants to capture row-level changes to a Delta table for downstream CDC processing. Which feature should they enable?&lt;/p&gt;

&lt;p&gt;A) Delta Time Travel&lt;br&gt;
B) Change Data Feed (CDF)&lt;br&gt;
C) Auto Optimize&lt;br&gt;
D) Incremental VACUUM&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: B&lt;/strong&gt; — Change Data Feed (CDF) captures row-level changes (insert, update_preimage, update_postimage, delete) for downstream consumption.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Question 15:&lt;/strong&gt; In Structured Streaming, which output mode should be used for writing aggregated results (no watermark)?&lt;/p&gt;

&lt;p&gt;A) append&lt;br&gt;
B) update&lt;br&gt;
C) complete&lt;br&gt;
D) overwrite&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Answer: C&lt;/strong&gt; — For aggregations without watermarks, you must use &lt;code&gt;complete&lt;/code&gt; mode, which writes the entire result table each trigger.&lt;/p&gt;




&lt;h3&gt;
  
  
  12.4 Common Exam Pitfalls
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VACUUM and Time Travel&lt;/strong&gt;: Remember, VACUUM with a short retention period disables time travel to those removed versions. The default is 7 days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Schema Evolution vs Enforcement&lt;/strong&gt;: By default, Delta enforces schema. &lt;code&gt;mergeSchema=true&lt;/code&gt; is needed for schema evolution.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Streaming Output Modes&lt;/strong&gt;: Aggregations without watermarks need &lt;code&gt;complete&lt;/code&gt; mode; simple append streams use &lt;code&gt;append&lt;/code&gt; mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OPTIMIZE vs VACUUM&lt;/strong&gt;: OPTIMIZE compacts files; VACUUM removes unreferenced files. Different purposes!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DLT Expectations&lt;/strong&gt;: Remember the three types and their behaviors (warn, drop, fail).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto Loader vs COPY INTO&lt;/strong&gt;: Auto Loader handles billions of files with streaming; COPY INTO is simpler SQL for smaller datasets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Managed vs External Tables&lt;/strong&gt;: Dropping a managed table deletes both metadata AND data. Dropping an external table only removes metadata.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checkpoint Location&lt;/strong&gt;: Each streaming query needs its own unique checkpoint location.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;broadcast() hint&lt;/strong&gt;: Use for tables smaller than 10MB to avoid shuffle joins.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unity Catalog three-level namespace&lt;/strong&gt;: &lt;code&gt;catalog.schema.table&lt;/code&gt; — don't confuse with two-level &lt;code&gt;database.table&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  12.5 Final Study Checklist
&lt;/h3&gt;

&lt;p&gt;Before taking the exam, ensure you can:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delta Lake:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Explain how the transaction log works&lt;/li&gt;
&lt;li&gt;[ ] Perform MERGE, UPDATE, DELETE operations&lt;/li&gt;
&lt;li&gt;[ ] Use Time Travel with VERSION AS OF and TIMESTAMP AS OF&lt;/li&gt;
&lt;li&gt;[ ] Run OPTIMIZE with ZORDER&lt;/li&gt;
&lt;li&gt;[ ] Configure and run VACUUM&lt;/li&gt;
&lt;li&gt;[ ] Enable and use Change Data Feed&lt;/li&gt;
&lt;li&gt;[ ] Apply schema evolution with mergeSchema&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Spark:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Create and manipulate DataFrames&lt;/li&gt;
&lt;li&gt;[ ] Write complex SQL with CTEs, window functions&lt;/li&gt;
&lt;li&gt;[ ] Handle complex types (arrays, maps, structs)&lt;/li&gt;
&lt;li&gt;[ ] Optimize joins using broadcast&lt;/li&gt;
&lt;li&gt;[ ] Configure partitioning for performance&lt;/li&gt;
&lt;li&gt;[ ] Read from and write to various file formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Streaming:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Set up Auto Loader for file ingestion&lt;/li&gt;
&lt;li&gt;[ ] Configure output modes correctly&lt;/li&gt;
&lt;li&gt;[ ] Implement checkpointing&lt;/li&gt;
&lt;li&gt;[ ] Apply watermarks for late data&lt;/li&gt;
&lt;li&gt;[ ] Use COPY INTO for incremental loading&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DLT:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Create streaming tables and materialized views&lt;/li&gt;
&lt;li&gt;[ ] Apply expectations for data quality&lt;/li&gt;
&lt;li&gt;[ ] Implement CDC with APPLY CHANGES INTO&lt;/li&gt;
&lt;li&gt;[ ] Configure pipeline settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Workflows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Create multi-task jobs with dependencies&lt;/li&gt;
&lt;li&gt;[ ] Configure job clusters vs all-purpose clusters&lt;/li&gt;
&lt;li&gt;[ ] Use dbutils for file operations and secrets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Unity Catalog:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;[ ] Navigate the three-level namespace&lt;/li&gt;
&lt;li&gt;[ ] Grant and revoke privileges at different levels&lt;/li&gt;
&lt;li&gt;[ ] Create and use Volumes&lt;/li&gt;
&lt;li&gt;[ ] Understand data lineage&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Databricks Data Engineer Associate exam is comprehensive, covering everything from low-level Spark mechanics to high-level platform features. The key to success is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Hands-on practice&lt;/strong&gt;: Build real pipelines using the Databricks Community Edition or a trial account&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand the WHY&lt;/strong&gt;: Don't just memorize commands — understand when and why to use each feature&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Focus on Delta Lake&lt;/strong&gt;: A large portion of the exam centers on Delta Lake features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practice with DLT&lt;/strong&gt;: Delta Live Tables is increasingly important&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Review the official documentation&lt;/strong&gt;: Databricks documentation is excellent and aligns closely with exam content&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Lakehouse architecture represents the future of data engineering, combining the best of data lakes and data warehouses. Mastering these concepts not only prepares you for the exam but equips you with skills that are highly valuable in modern data engineering roles.&lt;/p&gt;

&lt;p&gt;Good luck with your exam! 🚀&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This guide was written based on the Databricks Certified Data Engineer Associate exam guide and covers all major topics. Always refer to the official Databricks documentation for the most up-to-date information, as the platform evolves rapidly.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>learning</category>
      <category>sql</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>🎯 RDS Backup Quiz:
❓ What happens to automated backups when instance deleted?
❓ Do hourly snapshots = point-in-time recovery?
❓ Will automated backups cross-region?
❓ Can you restore S3 exports back to RDS?</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Tue, 27 Jan 2026 00:59:25 +0000</pubDate>
      <link>https://forem.com/datatechbridge/rds-backup-quiz-what-happens-to-automated-backups-when-instance-deleted-do-hourly-2a9l</link>
      <guid>https://forem.com/datatechbridge/rds-backup-quiz-what-happens-to-automated-backups-when-instance-deleted-do-hourly-2a9l</guid>
      <description>&lt;div class="ltag__link"&gt;
  &lt;a href="/datatechbridge" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2904813%2F7f82258a-66eb-4a83-a7b9-c42ba24dc0b7.jpeg" alt="datatechbridge"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://dev.to/datatechbridge/rds-backup-vs-snapshot-a-comprehensive-guide-2m36" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;RDS Backup vs Snapshot: A Comprehensive Guide&lt;/h2&gt;
      &lt;h3&gt;Data Tech Bridge ・ Jan 27&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#aws&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#devops&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


</description>
    </item>
    <item>
      <title>RDS Backup vs Snapshot: A Comprehensive Guide</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Tue, 27 Jan 2026 00:44:48 +0000</pubDate>
      <link>https://forem.com/datatechbridge/rds-backup-vs-snapshot-a-comprehensive-guide-2m36</link>
      <guid>https://forem.com/datatechbridge/rds-backup-vs-snapshot-a-comprehensive-guide-2m36</guid>
      <description>&lt;h2&gt;
  
  
  A Conversation Between a Full Stack Developer and an RDS Expert
&lt;/h2&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Part 1: Understanding the Basics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Confusion Begins&lt;/li&gt;
&lt;li&gt;Key Differences Overview&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 2: Deep Dive into Mechanisms
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How Do They Actually Work?&lt;/li&gt;
&lt;li&gt;Automated Backup Process (Incremental)&lt;/li&gt;
&lt;li&gt;Manual Snapshot Process (Full Copy)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 3: Understanding Storage &amp;amp; Incremental Backups
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What Are Incremental Backups?&lt;/li&gt;
&lt;li&gt;Storage Calculations Explained&lt;/li&gt;
&lt;li&gt;Advantages of Incremental Backups&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 4: Transaction Logs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The Transaction Log Mystery&lt;/li&gt;
&lt;li&gt;What Transaction Logs Capture&lt;/li&gt;
&lt;li&gt;How Recovery Uses Transaction Logs&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 5: AWS Console Views
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;What You See in Console&lt;/li&gt;
&lt;li&gt;Console View - Automated Backups&lt;/li&gt;
&lt;li&gt;Console View - Manual Snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 6: Real-World Scenarios
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Scenario 1: Accidental Data Deletion&lt;/li&gt;
&lt;li&gt;Scenario 2: Failed Schema Migration&lt;/li&gt;
&lt;li&gt;Scenario 3: Disaster Recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 7: Decision Making
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Decision Flow Diagram&lt;/li&gt;
&lt;li&gt;When to Use What&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 8: Cost &amp;amp; Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cost Comparison&lt;/li&gt;
&lt;li&gt;Performance Impact&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 9: Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Production Backup Strategy&lt;/li&gt;
&lt;li&gt;Automation Examples&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 10: Avoiding Mistakes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Common Pitfalls&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 11: Quick Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cheat Sheet&lt;/li&gt;
&lt;li&gt;AWS CLI Commands&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 12: Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Monitoring &amp;amp; Alerts Setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 13: Deletion Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Understanding Snapshot and Backup Deletion&lt;/li&gt;
&lt;li&gt;Automated Backup Deletion&lt;/li&gt;
&lt;li&gt;Manual Snapshot Deletion&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part 14: Exporting Snapshots
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Exporting Snapshots to Amazon S3&lt;/li&gt;
&lt;li&gt;What is Snapshot Export to S3?&lt;/li&gt;
&lt;li&gt;When and Why to Export to S3&lt;/li&gt;
&lt;li&gt;How to Export Snapshot to S3&lt;/li&gt;
&lt;li&gt;Using Exported Data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Final Summary&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Appendix
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Additional Resources&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 1: The Confusion Begins
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Hey! I'm working on our production database on AWS RDS, and I'm completely confused. I see "Automated Backups" and "Snapshots" in the console. Aren't they the same thing? Why do we need both?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! This confuses many people. Let me break it down for you. They're actually quite different in purpose and functionality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Think of it this way:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automated Backups&lt;/strong&gt; = Your continuous safety net (like a time machine)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual Snapshots&lt;/strong&gt; = Your bookmarks/checkpoints (like saving your game progress)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let me show you how they differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────┐
│                    AUTOMATED BACKUPS                         │
├─────────────────────────────────────────────────────────────┤
│ • Continuous, automatic                                      │
│ • Point-in-time recovery (PITR)                             │
│ • Includes transaction logs                                  │
│ • Retention: 0-35 days                                      │
│ • Deleted when RDS instance is deleted                      │
│ • Taken during backup window                                │
│ • INCREMENTAL after first full backup ✨                    │
│ • Storage-efficient                                          │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    MANUAL SNAPSHOTS                          │
├─────────────────────────────────────────────────────────────┤
│ • Manual, on-demand                                          │
│ • No point-in-time recovery                                 │
│ • No transaction logs                                        │
│ • Retention: Forever (until you delete)                     │
│ • Persist even after RDS deletion                           │
│ • Taken anytime you want                                    │
│ • FULL COPY (first time), then incremental ✨               │
│ • Uses EBS snapshot technology                              │
└─────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 2: How Do They Actually Work?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Okay, that helps. But HOW do they work under the hood? What's actually happening? And what does "incremental" mean?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Excellent! This is where it gets really interesting. Let me explain the crucial difference between how automated backups and snapshots handle data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Backup Process
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                AUTOMATED BACKUP FLOW (INCREMENTAL)                │
└──────────────────────────────────────────────────────────────────┘

Day 1 - Monday 03:00 AM (Backup Window Starts)
│
├─► FULL BACKUP taken (First time only)
│   ├─► Stored in S3 (encrypted)
│   ├─► Size: 100 GB (complete database)
│   └─► This becomes the baseline
│
├─► Throughout the day...
│   │
│   ├─► 03:15 AM - Transaction Log #1 captured
│   ├─► 03:20 AM - Transaction Log #2 captured
│   ├─► 03:25 AM - Transaction Log #3 captured
│   └─► ... continues every 5 minutes
│
Day 2 - Tuesday 03:00 AM (Next Backup Window)
│
├─► INCREMENTAL BACKUP taken ✨
│   ├─► Only changed blocks since Monday's full backup
│   ├─► Size: Only 3 GB (just the delta/changes)
│   └─► Much faster than full backup
│
├─► Example of changes captured:
│   ├─► Block 1234: Updated customer record
│   ├─► Block 5678: New order inserted
│   ├─► Block 9012: Product price updated
│   └─► Only these changed blocks are backed up
│
Day 3 - Wednesday 03:00 AM
│
├─► INCREMENTAL BACKUP taken ✨
│   ├─► Only changed blocks since Tuesday
│   ├─► Size: 2.5 GB (just today's changes)
│   └─► Continues building on previous backups
│
Day 4-7 - Similar incremental backups
│   └─► Each backup only stores what changed
│
Day 8 - Next Monday 03:00 AM
│
└─► New FULL BACKUP cycle may start (RDS decides)
    └─► Or continue incremental chain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manual Snapshot Process
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              MANUAL SNAPSHOT FLOW (EBS-BASED)                     │
└──────────────────────────────────────────────────────────────────┘

First Snapshot - Monday 2:00 PM
│
├─► You click "Take Snapshot"
│   │
│   ├─► RDS uses EBS snapshot technology
│   ├─► Creates FULL copy of all EBS blocks
│   ├─► Size: 100 GB (complete database state)
│   └─► Takes 10-15 minutes
│
Second Snapshot - Monday 6:00 PM (same day)
│
├─► You click "Take Snapshot" again
│   │
│   ├─► EBS compares with previous snapshot
│   ├─► Only copies CHANGED blocks ✨
│   ├─► Size: Only 2 GB (incremental)
│   ├─► Takes 3-5 minutes
│   └─► References unchanged blocks from first snapshot
│
Third Snapshot - Tuesday 2:00 PM
│
└─► Another snapshot taken
    ├─► Again, only changed blocks since last snapshot
    ├─► Size: 5 GB (one day of changes)
    └─► Chain continues

How EBS Snapshots Work:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Snapshot 1 (Full):     [AAAAAAAAAA] 100 GB
                              ↓
Snapshot 2 (Incr):     [AAAAAAAAAA] → [BB] 2 GB
                       (references)    (new blocks)
                              ↓
Snapshot 3 (Incr):     [AAAAAAAAAA] → [BB] → [CCC] 5 GB
                       (references)    (ref)   (new)

Total Storage Used: 100 + 2 + 5 = 107 GB
(Not 100 + 100 + 100 = 300 GB!) ✨
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 3: Understanding Incremental Backups
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Wait! So both use incremental techniques? What's the actual advantage?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Exactly! Both use incremental approaches after the first backup, but they work differently. Let me explain the massive advantages:&lt;/p&gt;

&lt;h3&gt;
  
  
  What Are Incremental Backups?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              FULL BACKUP vs INCREMENTAL BACKUP                    │
└──────────────────────────────────────────────────────────────────┘

WITHOUT INCREMENTAL (Traditional Way):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Day 1: Full Backup    → 100 GB stored → Takes 30 min
Day 2: Full Backup    → 100 GB stored → Takes 30 min
Day 3: Full Backup    → 100 GB stored → Takes 30 min
Day 4: Full Backup    → 100 GB stored → Takes 30 min
Day 5: Full Backup    → 100 GB stored → Takes 30 min
Day 6: Full Backup    → 100 GB stored → Takes 30 min
Day 7: Full Backup    → 100 GB stored → Takes 30 min

Total Storage: 700 GB for 7 days! 💰💰💰
Total Time: 3.5 hours of backup time per week


WITH INCREMENTAL (RDS Way):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Day 1: Full Backup    → 100 GB stored → Takes 30 min
Day 2: Incremental    → 3 GB stored   → Takes 3 min
Day 3: Incremental    → 2.5 GB stored → Takes 2 min
Day 4: Incremental    → 4 GB stored   → Takes 4 min
Day 5: Incremental    → 3.5 GB stored → Takes 3 min
Day 6: Incremental    → 2 GB stored   → Takes 2 min
Day 7: Incremental    → 3 GB stored   → Takes 3 min

Total Storage: 118 GB for 7 days! ✨
Total Time: 47 minutes of backup time per week

SAVINGS: 83% storage, 77% time! 🎉
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Storage Calculations Explained
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              REAL STORAGE CALCULATION                             │
└──────────────────────────────────────────────────────────────────┘

Scenario: 100 GB Database, 7-day retention

CORRECT CALCULATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Day 1: Full backup        = 100 GB
Day 2: +3 GB changes      = 103 GB total
Day 3: +2.5 GB changes    = 105.5 GB total
Day 4: +4 GB changes      = 109.5 GB total
Day 5: +3.5 GB changes    = 113 GB total
Day 6: +2 GB changes      = 115 GB total
Day 7: +3 GB changes      = 118 GB total

Average daily change rate: ~3 GB (3% of database size)
Total storage for 7 days: ~118 GB

NOT: 7 days × 100 GB = 700 GB ❌


FACTORS AFFECTING STORAGE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Change Rate (Data Volatility):
   ├─ Read-heavy workload: 1-2% daily change → Less storage
   ├─ Balanced workload: 3-5% daily change → Moderate storage
   └─ Write-heavy workload: 10-15% daily change → More storage

   Example (100 GB database, 7 days):
   ├─ Low change (2%/day): 100 + (6 × 2) = 112 GB
   ├─ Medium change (5%/day): 100 + (6 × 5) = 130 GB
   └─ High change (10%/day): 100 + (6 × 10) = 160 GB

2. Retention Period:
   ├─ 7 days retention: Base + (6 × daily_change)
   ├─ 14 days retention: Base + (13 × daily_change)
   └─ 35 days retention: Base + (34 × daily_change)

3. Backup Window Optimization:
   └─ RDS periodically consolidates old incremental backups
      to optimize storage


REAL EXAMPLE CALCULATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

100 GB database with 3% daily change rate:

7-day retention:
├─ Storage needed: 100 GB + (6 days × 3 GB) = 118 GB
├─ Free tier: 100 GB (matches database size)
├─ Chargeable: 118 - 100 = 18 GB
└─ Cost: 18 GB × $0.095 = $1.71/month ✅

14-day retention:
├─ Storage needed: 100 GB + (13 days × 3 GB) = 139 GB
├─ Free tier: 100 GB
├─ Chargeable: 139 - 100 = 39 GB
└─ Cost: 39 GB × $0.095 = $3.71/month ✅

35-day retention (maximum):
├─ Storage needed: 100 GB + (34 days × 3 GB) = 202 GB
├─ Free tier: 100 GB
├─ Chargeable: 202 - 100 = 102 GB
└─ Cost: 102 GB × $0.095 = $9.69/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Advantages of Incremental Backups
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              WHY INCREMENTAL BACKUPS ARE AWESOME                  │
└──────────────────────────────────────────────────────────────────┘

1. MASSIVE STORAGE SAVINGS 💰
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Traditional full backups: 700 GB for 7 days
Incremental backups: 118 GB for 7 days
Savings: 83% reduction in storage costs!

Real cost example:
├─ Full backups: (700 - 100 free) × $0.095 = $57/month
└─ Incremental: (118 - 100 free) × $0.095 = $1.71/month
    Saves: $55.29/month! 🎉


2. FASTER BACKUP WINDOWS ⚡
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Full backup: 30 minutes (copies all 100 GB)
Incremental backup: 2-5 minutes (copies only 2-5 GB changes)

Benefits:
├─ Shorter maintenance windows
├─ Less I/O impact on production
├─ More frequent backup opportunities
└─ Faster to complete within backup window


3. REDUCED I/O OVERHEAD 📊
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Full backup: Reads entire 100 GB database
├─ Millions of I/O operations
├─ Significant CPU usage
└─ Can impact application performance

Incremental backup: Reads only changed blocks
├─ Thousands of I/O operations (not millions)
├─ Minimal CPU usage
└─ Negligible application impact


4. NETWORK BANDWIDTH SAVINGS 🌐
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Data transferred to S3:
├─ Full backup: 100 GB daily = 700 GB/week
└─ Incremental: 100 GB + 18 GB = 118 GB/week
    Saves: 582 GB bandwidth per week!


5. EFFICIENT RESTORE OPERATIONS 🔄
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

RDS optimizes restore by:
├─ Using the base full backup
├─ Applying only necessary incremental changes
└─ Parallel processing of incremental blocks

Result: Restore is almost as fast as from a full backup!


6. BETTER RETENTION OPTIONS 📅
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Because incrementals use less space:
├─ Can afford longer retention periods
├─ 35 days retention becomes economical
└─ More restore points available


COMPARISON TABLE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Metric              Full Backup    Incremental    Improvement
─────────────────────────────────────────────────────────────────
Storage (7 days)    700 GB         118 GB         83% less
Backup time         30 min         2-5 min        90% faster
I/O operations      Millions       Thousands      99% less
Network transfer    700 GB/week    118 GB/week    83% less
Performance impact  High           Minimal        95% better
Cost (7 days)       $57/month      $1.71/month    97% cheaper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Wow! So incremental backups are a game-changer. But how does RDS know which blocks changed?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! Here's how:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              HOW INCREMENTAL DETECTION WORKS                      │
└──────────────────────────────────────────────────────────────────┘

BLOCK-LEVEL CHANGE TRACKING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Database storage is divided into blocks (typically 8KB each)

Example: 100 GB database = ~13 million blocks

RDS tracks each block's state:
│
├─► Block #1234567: Last modified: Jan 15 03:15:22
├─► Block #1234568: Last modified: Jan 14 22:30:45
├─► Block #1234569: Last modified: Jan 15 14:22:10
└─► ... (13 million blocks tracked)

During backup at 03:00 AM:
│
├─► Compare each block's timestamp with last backup
├─► If modified_time &amp;gt; last_backup_time:
│   └─► Include this block in incremental backup
└─► If modified_time &amp;lt;= last_backup_time:
    └─► Skip this block (already backed up)


EXAMPLE SCENARIO:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Last backup: Jan 15, 03:00 AM
Current backup: Jan 16, 03:00 AM

Changes in last 24 hours:
├─ 10,000 INSERT operations → 10,000 new blocks
├─ 5,000 UPDATE operations → 5,000 modified blocks
├─ 2,000 DELETE operations → 2,000 freed blocks
└─ Total: 17,000 blocks changed out of 13 million (0.13%)

Backup captures:
├─ Only these 17,000 changed blocks
├─ Size: 17,000 blocks × 8 KB = 136 MB
└─ Not the entire 100 GB!


TECHNOLOGY BEHIND IT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

RDS uses:
├─ MySQL: Binary log positions + InnoDB change buffer
├─ PostgreSQL: WAL (Write-Ahead Log) + LSN tracking
├─ EBS: Block-level change tracking in storage layer
└─ All coordinated to identify changed blocks efficiently
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 4: The Transaction Log Mystery
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: You keep mentioning "transaction logs." What ARE they, and why do they matter? How do they relate to incremental backups?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! Transaction logs are the SECRET SAUCE that makes Point-in-Time Recovery possible. They work alongside incremental backups. Let me explain:&lt;/p&gt;

&lt;h3&gt;
  
  
  What Transaction Logs Capture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              TRANSACTION LOG EXAMPLE                              │
└──────────────────────────────────────────────────────────────────┘

Monday 03:00 AM - Full Backup Taken
├─── Database State: 1000 users, 5000 orders
│
│    [Backup captures entire database: 100 GB]
│
│
Monday 03:05 AM - Transaction Log #1
├─── INSERT INTO users VALUES (1001, 'John Doe')
├─── UPDATE orders SET status='shipped' WHERE id=4523
├─── DELETE FROM temp_data WHERE created &amp;lt; '2024-01-01'
│
│
Monday 10:15 AM - Transaction Log #2
├─── INSERT INTO orders VALUES (5001, 1001, 'Product X')
├─── UPDATE users SET last_login=NOW() WHERE id=1001
│
│
Monday 14:30 PM - Transaction Log #3
├─── INSERT INTO orders VALUES (5002, 950, 'Product Y')
│
│
Monday 14:35 PM - 💥 DISASTER! Accidental DELETE
├─── DELETE FROM orders WHERE status='pending'  😱
     (Oops! Deleted 500 orders by mistake!)


Tuesday 03:00 AM - Incremental Backup
├─── Captures all block changes since Monday 03:00 AM
├─── Size: 3 GB (only changed data blocks)
└─── Does NOT include individual transaction details


THE KEY DIFFERENCE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Incremental Backups:
├─ Capture: Changed BLOCKS (at backup time)
├─ Frequency: Daily (during backup window)
├─ Granularity: Per-block level
└─ Purpose: Efficient storage of database state

Transaction Logs:
├─ Capture: Individual TRANSACTIONS (continuous)
├─ Frequency: Every 5 minutes, 24/7
├─ Granularity: Per-statement level
└─ Purpose: Enable point-in-time recovery to any second


TOGETHER THEY PROVIDE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Full Backup (Day 1)     → Baseline state
     +
Incremental Backups     → Daily changed blocks
     +
Transaction Logs        → Second-by-second changes

= Point-in-time recovery to ANY second! ✨
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Recovery Uses Transaction Logs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              POINT-IN-TIME RECOVERY PROCESS                       │
└──────────────────────────────────────────────────────────────────┘

You want to restore to Monday 14:34 PM (1 minute before disaster)

Step 1: Start with Full Backup from Monday 03:00 AM
        └─── Restore: 1000 users, 5000 orders [100 GB]

Step 2: Apply Incremental Backup (if available)
        └─── Note: Next incremental is Tuesday 03:00 AM
        └─── So we skip this (not needed yet)

Step 3: Apply Transaction Logs from 03:00 AM to 14:34 PM
        ├─── Log 03:05 AM: + John Doe user
        ├─── Log 03:05 AM: + Update order 4523
        ├─── Log 03:05 AM: + Delete old temp data
        ├─── Log 10:15 AM: + Order 5001
        ├─── Log 10:15 AM: + Update John's login
        └─── Log 14:30 PM: + Order 5002

Step 4: STOP at 14:34 PM ⏰
        └─── DON'T apply the DELETE that happened at 14:35 PM

Result: Database restored to 14:34 PM - before the disaster! ✅


HOW IT WORKS VISUALLY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Timeline:
│
│ Mon 03:00 AM ──────────────────────────────── Tue 03:00 AM
│     │                                              │
│     │                                              │
│   [Full Backup]                          [Incremental Backup]
│   100 GB                                  3 GB changes
│     │                                              │
│     ├──► Transaction Logs ────────────────────────┤
│     │    (captured every 5 min, 24/7)             │
│     │                                              │
│     │         14:34 PM     14:35 PM               │
│     │            ↑           ↑                     │
│     │         (Restore)  (Disaster)               │
│     │          here                                │
│     │                                              │
│     └──────────────────┬───────────────────────────┘
│                        │
│                 Recovery combines:
│                 1. Full backup (03:00 AM)
│                 2. Transaction logs (03:00-14:34)
│                 3. Stops before disaster!


WITHOUT TRANSACTION LOGS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Could only restore to:
├─ Monday 03:00 AM (full backup) ❌ 11.5 hours of data loss
└─ Tuesday 03:00 AM (next backup) ❌ Includes the disaster!

WITH TRANSACTION LOGS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Can restore to ANY second between backups! ✅
├─ Monday 03:00:01 AM ✅
├─ Monday 10:15:30 AM ✅
├─ Monday 14:34:59 PM ✅ (1 second before disaster)
└─ Data loss: Less than 1 minute!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Wow! So transaction logs are like a detailed journal of every change?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Exactly! They're continuously captured every 5 minutes and allow you to restore to ANY point within your backup retention period, down to the second. The combination of incremental backups and transaction logs gives you the best of both worlds: storage efficiency AND precise recovery!&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: What You See in the AWS Console
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Okay, so what will I actually see in the AWS Console?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Let me show you what each section displays:&lt;/p&gt;

&lt;h3&gt;
  
  
  Console View - Automated Backups
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│  AWS RDS Console → Database → Maintenance &amp;amp; backups              │
└──────────────────────────────────────────────────────────────────┘

Automated Backups Section:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✓ Enabled
  Backup retention period: 7 days
  Backup window: 03:00 - 04:00 UTC

  Latest Restore Time: 2024-01-15 14:35:22 UTC
  Earliest Restore Time: 2024-01-08 14:35:22 UTC

  ⏰ You can restore to ANY point between these times!

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Backup Storage Information:
  Total backup storage: 118 GB
  ├─ Free tier (matches DB): 100 GB ✅
  └─ Chargeable storage: 18 GB 💰

  Breakdown:
  ├─ Full backup (7 days ago): 100 GB
  ├─ Incremental backups (6 days): 15 GB
  └─ Transaction logs: 3 GB

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

System Backups (Not directly visible, managed by AWS):
  ├─ 2024-01-15 03:00 - rds:prod-db-2024-01-15-03-00 (Incremental)
  ├─ 2024-01-14 03:00 - rds:prod-db-2024-01-14-03-00 (Incremental)
  ├─ 2024-01-13 03:00 - rds:prod-db-2024-01-13-03-00 (Incremental)
  ├─ 2024-01-12 03:00 - rds:prod-db-2024-01-12-03-00 (Incremental)
  ├─ 2024-01-11 03:00 - rds:prod-db-2024-01-11-03-00 (Incremental)
  ├─ 2024-01-10 03:00 - rds:prod-db-2024-01-10-03-00 (Incremental)
  └─ 2024-01-09 03:00 - rds:prod-db-2024-01-09-03-00 (Full backup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Console View - Manual Snapshots
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│  AWS RDS Console → Snapshots                                     │
└──────────────────────────────────────────────────────────────────┘

Manual Snapshots:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Name                          | Status    | Created              | Size   | Type
───────────────────────────────────────────────────────────────────────────────
before-schema-migration       | Available | 2024-01-15 09:00 UTC | 105 GB | Manual
pre-production-release-v2.1   | Available | 2024-01-10 14:30 UTC | 103 GB | Manual
before-data-cleanup           | Available | 2024-01-05 11:15 UTC | 100 GB | Manual

Note: Size shown is the ACTUAL space consumed:
├─ First snapshot: 100 GB (full copy)
├─ Second snapshot: +3 GB (incremental)
├─ Third snapshot: +5 GB (incremental)
└─ Total: 108 GB (not 300 GB!)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Actions Available:
  • Restore snapshot (creates new instance from this point)
  • Copy snapshot (duplicate to same/different region)
  • Share snapshot (with other AWS accounts)
  • Delete snapshot (free up storage)
  • Migrate snapshot (convert to different engine version)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 6: Real-World Scenarios with Timestamps
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: This is making sense! Can you walk me through some real scenarios with exact timestamps?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Absolutely! Let's go through three common scenarios:&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 1: Accidental Data Deletion (Use Automated Backup)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              SCENARIO: ACCIDENTAL DATA DELETION                   │
└──────────────────────────────────────────────────────────────────┘

TIMELINE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Monday, Jan 15, 2024

03:00 AM → Automated backup runs (daily backup window)
          ├─ Incremental backup taken (only changed blocks)
          ├─ Database size: 100 GB
          ├─ Changed blocks: 3 GB
          └─ Backup completes: 03:05 AM

09:00 AM → Normal business operations
          ├─ 150 new orders placed
          ├─ 45 users registered
          ├─ Transaction logs capturing everything every 5 min
          └─ All changes recorded for PITR

02:30 PM → Developer runs UPDATE query
          ├─ UPDATE products SET price = price * 0.9
          ├─ WHERE category = 'Electronics'
          └─ ✅ This is correct - 10% discount on electronics

02:35 PM → 💥 DISASTER! Developer runs wrong query
          ├─ UPDATE products SET price = price * 0.9
          ├─ ❌ Forgot the WHERE clause!
          └─ All 10,000 products now have wrong prices!

02:37 PM → Panic! Discovery of the mistake
          └─ Need to restore database


RECOVERY PLAN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Option 1: Point-in-Time Restore to 02:34 PM ✅ RECOMMENDED

Why 02:34 PM?
  ├─ After the correct discount (02:30 PM) ✅
  └─ Before the mistake (02:35 PM) ✅

Process:
  1. Select "Restore to point in time"
  2. Choose: January 15, 2024, 02:34:00 PM
  3. AWS will:
     ├─ Take the 03:00 AM incremental backup
     ├─ Apply all transaction logs from 03:00 AM to 02:34 PM
     └─ Create new RDS instance: "prod-db-restored"

  4. Behind the scenes:
     ├─ Start with yesterday's full backup (100 GB)
     ├─ Apply today's incremental (3 GB)
     ├─ Apply transaction logs (covering 11.5 hours)
     └─ Total restore time: 15-20 minutes

  5. Result:
     ├─ 150 orders from morning: ✅ Preserved
     ├─ 45 new users: ✅ Preserved
     ├─ Correct electronics discount: ✅ Applied
     └─ Wrong price update: ❌ Never happened

Time to restore: 15-20 minutes
Data loss: Only 3 minutes (02:34 PM to 02:37 PM)
Cost: Creating new instance during validation


Option 2: Restore from Manual Snapshot ❌ NOT IDEAL

If you had a snapshot from 09:00 AM:
  ├─ Snapshot is a full point-in-time copy
  ├─ No transaction logs attached
  └─ Would lose:
     ├─ 150 orders (09:00 AM to 02:37 PM) ❌
     ├─ 45 users ❌
     ├─ Correct discount ❌
     └─ 5.5 hours of data lost


LESSON LEARNED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Automated backups with PITR saved the day:
✅ Minimal data loss (3 minutes)
✅ All legitimate transactions preserved
✅ Quick recovery (15-20 minutes)
✅ Surgical precision (restore to exact second)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: Schema Migration Gone Wrong (Use Manual Snapshot)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│           SCENARIO: FAILED SCHEMA MIGRATION                       │
└──────────────────────────────────────────────────────────────────┘

TIMELINE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Saturday, Jan 20, 2024 (Maintenance Window)

01:00 AM → Create manual snapshot before migration ✅
          ├─ Name: "before-v3-schema-migration"
          ├─ Database size: 105 GB
          ├─ Type: Full snapshot (first of the day)
          ├─ Status: Creating... (uses EBS snapshot)
          └─ Status: Available (01:10 AM) - takes ~10 minutes

01:15 AM → Begin schema migration
          ├─ ALTER TABLE orders ADD COLUMN tracking_info JSON
          ├─ CREATE INDEX idx_tracking ON orders(tracking_info)
          ├─ Migrate 5 million rows...
          └─ Transaction logs recording all DDL changes

02:45 AM → 💥 DISASTER! Migration script fails
          ├─ Error: Foreign key constraint violation
          ├─ Database in inconsistent state
          ├─ Some rows migrated, some not
          ├─ Schema partially changed
          └─ Application throwing errors

02:50 AM → Decision: Rollback needed


RECOVERY PLAN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Use Manual Snapshot ✅ RECOMMENDED

Why NOT Point-in-Time Restore?
  ├─ Schema changes are DDL (Data Definition Language)
  ├─ Partial migration created inconsistent state
  ├─ Transaction logs include all the failed operations
  ├─ PITR would replay the problematic migration
  └─ Result: Restored DB would still be corrupted! ❌

Why Manual Snapshot?
  ├─ Clean state before ANY migration started
  ├─ Known good schema (verified before snapshot)
  ├─ No DDL operations in this snapshot
  └─ Guaranteed consistency ✅

Process:
  1. Select snapshot "before-v3-schema-migration"
  2. Restore to new instance: "prod-db-rollback"
  3. AWS creates exact copy from 01:00 AM state
     ├─ Restores all EBS blocks as they were
     ├─ Time: 10-15 minutes
     └─ No need to apply transaction logs

  4. Verify restored database:
     ├─ Check schema version ✅
     ├─ Test application connectivity ✅
     ├─ Verify data integrity ✅
     └─ Run read-only tests ✅

  5. Cutover:
     ├─ Update application connection string
     ├─ Switch DNS/Route53 to new instance
     ├─ Monitor for 30 minutes
     └─ Delete failed instance

Time to restore: 10-15 minutes
Data loss: Everything from 01:00 AM to 02:50 AM (1 hour 50 min)
  └─ Acceptable because:
     ├─ Maintenance window (low traffic)
     ├─ Application was down anyway
     └─ Clean state more important than recent data


WHAT AUTOMATED BACKUP WOULD HAVE DONE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

If you tried PITR to 02:49 AM:
  1. Restore last full backup (Sun 03:00 AM)
  2. Apply incremental backup (Sat 03:00 AM) - includes pre-migration state
  3. Apply transaction logs from 01:15 AM to 02:49 AM
     └─ This includes ALL the failed migration steps! ❌
  4. Result: Database restored with partial/corrupt schema ❌
  5. You'd STILL need the manual snapshot ⚠️


LESSON LEARNED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Before major schema changes:
✅ ALWAYS create manual snapshot
✅ Manual snapshot = "Save game" before boss fight
✅ Automated backup would replay the mistakes
✅ Clean rollback point is critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: Disaster Recovery / Region Failure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│           SCENARIO: REGION FAILURE / DISASTER RECOVERY            │
└──────────────────────────────────────────────────────────────────┘

SETUP:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Primary Region: us-east-1
Backup Strategy:
  ├─ Automated backups: 7 days retention ✅
  │  └─ Storage: 118 GB (incremental)
  │
  └─ Daily manual snapshots copied to us-west-2 ✅
     └─ Storage: 100 GB per snapshot (first is full, rest incremental)

Every Day at 04:00 AM (automated Lambda):
  1. Automated backup runs in us-east-1 (incremental)
  2. Lambda function triggers at 04:30 AM
  3. Creates manual snapshot (shares incremental blocks with auto backup)
  4. Copies snapshot to us-west-2
     ├─ First copy: 100 GB transfer
     ├─ Subsequent copies: 3-5 GB transfer (incremental)
     └─ Cost: $0.02/GB for transfer


DISASTER TIMELINE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Tuesday, Jan 23, 2024

03:00 AM → Automated incremental backup in us-east-1 ✅
          └─ 3 GB of changes captured

04:00 AM → Lambda creates manual snapshot
          ├─ Snapshot: prod-db-2024-01-23
          └─ Incremental from yesterday's snapshot

04:30 AM → Snapshot copied to us-west-2 ✅
          ├─ Transfer size: 3.2 GB (only incremental blocks)
          ├─ Cost: 3.2 GB × $0.02 = $0.064
          └─ Copy completes: 04:45 AM

[Normal operations throughout the day...]

02:00 PM → 💥 DISASTER! AWS us-east-1 region outage
          ├─ RDS instance unreachable
          ├─ Automated backups unreachable (same region)
          ├─ Application down
          ├─ Cannot access PITR (transaction logs in us-east-1)
          └─ Need to failover to us-west-2


RECOVERY PLAN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Use Copied Manual Snapshot ✅ ONLY OPTION

Why Can't Use Automated Backup?
  ├─ Automated backups are region-locked
  ├─ Transaction logs stored in us-east-1
  ├─ Cannot access PITR from different region
  └─ Automated backups don't support cross-region ❌

Why Manual Snapshot Works?
  ├─ Snapshot copied to us-west-2
  ├─ Complete point-in-time state at 04:00 AM
  ├─ Self-contained (doesn't need transaction logs)
  └─ Can restore independently ✅

Process:
  1. Go to us-west-2 console
  2. Navigate to Snapshots
  3. Find: prod-db-2024-01-23
     ├─ Size: 105 GB (full database state)
     ├─ Created: Jan 23, 04:00 AM (us-east-1 time)
     └─ Copied: Jan 23, 04:45 AM (us-west-2 time)

  4. Restore to new RDS instance
     ├─ Instance name: prod-db-dr
     ├─ Instance type: db.r6g.2xlarge (same as primary)
     ├─ Multi-AZ: Yes
     └─ Restore time: 25-35 minutes

  5. Update application:
     ├─ Change connection string to us-west-2 endpoint
     ├─ Update Route53 to point to new region
     ├─ Verify application functionality
     └─ Monitor performance

  6. Data loss calculation:
     ├─ Last snapshot: 04:00 AM
     ├─ Outage time: 02:00 PM
     └─ Data loss: 10 hours of transactions ⚠️

Time to restore: 30-45 minutes
Data loss: 10 hours (04:00 AM to 02:00 PM)
RTO (Recovery Time Objective): 45 minutes
RPO (Recovery Point Objective): 10 hours


IMPROVED STRATEGY (Reduce Data Loss):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

For critical applications, use:

1. Read Replica in us-west-2 (Better approach)
   ├─ Continuous replication (lag: 1-5 seconds)
   ├─ Promote to master on primary region failure
   ├─ RPO: 1-5 seconds (vs 10 hours)
   ├─ RTO: 2-5 minutes (vs 45 minutes)
   └─ Cost: Additional instance cost, but worth it

2. More frequent snapshot copies
   ├─ Copy every 4 hours instead of daily
   ├─ 04:00 AM, 08:00 AM, 12:00 PM, 04:00 PM, etc.
   ├─ Incremental copies are fast (2-3 GB each)
   ├─ Reduces RPO to 4 hours
   └─ Additional cost: $0.02/GB × 3 GB × 6 copies/day = $0.36/day


COST COMPARISON:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Daily snapshot copy (current):
├─ Storage in us-west-2: 105 GB × $0.095 = $9.98/month
├─ Transfer: 3 GB × $0.02 × 30 days = $1.80/month
└─ Total: $11.78/month
   RPO: 24 hours (last daily snapshot)

4-hour snapshot copies (improved):
├─ Storage in us-west-2: 125 GB × $0.095 = $11.88/month
│  (5 extra incrementals of 4 GB each)
├─ Transfer: 3 GB × $0.02 × 6 × 30 = $10.80/month
└─ Total: $22.68/month
   RPO: 4 hours (recent snapshot)
   Additional cost: $10.90/month

Read Replica (best):
├─ Instance cost: ~$300-500/month (same as primary)
├─ Cross-region transfer: ~$50-100/month (replication data)
└─ Total: ~$350-600/month additional
   RPO: 1-5 seconds
   RTO: 2-5 minutes
   Best for production systems


LESSON LEARNED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

For disaster recovery:
✅ Cross-region snapshots are minimum requirement
✅ Incremental copies make frequent backups affordable
✅ Read replicas provide best RPO/RTO
✅ Test your DR plan quarterly
✅ Automated backups cannot help in region failure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 7: Decision Flow Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: This is super helpful! Can you give me a flowchart to help me decide which to use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Absolutely! Here's your decision tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              BACKUP vs SNAPSHOT DECISION TREE                     │
└──────────────────────────────────────────────────────────────────┘

                    [Need to Recover Data?]
                            │
                            ├─ YES
                            │
            ┌───────────────┴────────────────┐
            │                                 │
       [What happened?]              [Need cross-region?]
            │                                 │
    ┌───────┴────────┐                      YES
    │                │                        │
  Accidental      Schema/App               [Use Manual
   DELETE/        Change                    Snapshot Copy]
   UPDATE          Failed                      │
    │                │                          ├─► Only option for
    │                │                          │   cross-region DR
    ├─ Need recent  ├─ Need clean              │
       recovery        state before             └─► Restore in
    │                 changes                        new region
    │                │
    ↓                ↓

[Point-in-Time    [Manual Snapshot
 Restore]          Restore]
    │                │
    ├─ Uses:        ├─ Uses:
    │  • Full       │  • Full snapshot
    │    backup     │    only
    │  • Incremental│  • No transaction
    │    backups    │    logs
    │  • Transaction│
    │    logs       │
    │                │
    ├─ Can restore  ├─ Restore to
       to ANY          specific
       second          checkpoint
    │                │
    ├─ Data loss:   ├─ Data loss:
       Minimal         Everything after
       (seconds)       snapshot time
    │                │
    └─ Best for:    └─ Best for:
       • Accidents     • Pre-deployment
       • User errors   • Before migrations
       • Recent        • Known good states
         corruption    • Cross-region DR


            ┌─────────────────────────────┐
            │   WHEN TO CREATE MANUAL     │
            │       SNAPSHOTS             │
            ├─────────────────────────────┤
            │                             │
            │  ✓ Before deployments       │
            │  ✓ Before schema changes    │
            │  ✓ Before major updates     │
            │  ✓ End of month/quarter     │
            │  ✓ Before bulk data ops     │
            │  ✓ Production milestones    │
            │  ✓ Compliance requirements  │
            │  ✓ Cross-region DR          │
            │  ✓ Long-term retention      │
            │                             │
            └─────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 8: Cost and Performance Implications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: What about costs? Are there differences?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! Let's break down the costs with correct calculations:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Comparison
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                    COST BREAKDOWN (CORRECTED)                     │
└──────────────────────────────────────────────────────────────────┘

AUTOMATED BACKUPS (INCREMENTAL):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Storage Cost:
├─ FREE up to 100% of your database storage
│  Example: 100 GB database = 100 GB free backup storage
│
├─ HOW STORAGE ACCUMULATES (Incremental):
│  
│  100 GB Database, 3% daily change rate, 7-day retention:
│  ├─ Day 1: Full backup = 100 GB
│  ├─ Day 2: Incremental = 100 + 3 = 103 GB total
│  ├─ Day 3: Incremental = 103 + 3 = 106 GB total
│  ├─ Day 4: Incremental = 106 + 3 = 109 GB total
│  ├─ Day 5: Incremental = 109 + 3 = 112 GB total
│  ├─ Day 6: Incremental = 112 + 3 = 115 GB total
│  └─ Day 7: Incremental = 115 + 3 = 118 GB total
│
│  Total storage: 118 GB (NOT 700 GB!) ✅
│  Free tier: 100 GB
│  Chargeable: 18 GB
│
├─ Cost calculation:
│  18 GB × $0.095/GB/month = $1.71/month
│
└─ For different retention periods:
   ├─ 7 days: ~118 GB → $1.71/month
   ├─ 14 days: ~139 GB → $3.71/month
   └─ 35 days: ~202 GB → $9.69/month


COMPARISON: FULL vs INCREMENTAL BACKUPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

If RDS used FULL backups (it doesn't):
├─ Day 1: 100 GB
├─ Day 2: 100 GB (full copy again)
├─ Day 3: 100 GB (full copy again)
├─ Day 4: 100 GB (full copy again)
├─ Day 5: 100 GB (full copy again)
├─ Day 6: 100 GB (full copy again)
├─ Day 7: 100 GB (full copy again)
├─ Total: 700 GB
├─ Free tier: 100 GB
├─ Chargeable: 600 GB
└─ Cost: 600 GB × $0.095 = $57/month ❌

Actual RDS incremental backups:
├─ Total: 118 GB
├─ Free tier: 100 GB
├─ Chargeable: 18 GB
└─ Cost: 18 GB × $0.095 = $1.71/month ✅

SAVINGS: $55.29/month (97% reduction)! 🎉


MANUAL SNAPSHOTS (EBS-BASED INCREMENTAL):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Storage Cost:
├─ $0.095/GB/month for ALL snapshot storage
│  No free tier
│
├─ HOW SNAPSHOTS ACCUMULATE (EBS Incremental):
│  
│  Example: 5 weekly snapshots of 100 GB database
│  
│  Week 1: First snapshot (full) = 100 GB
│  Week 2: Snapshot (incremental) = 100 + 3.5 GB (week's changes) = 103.5 GB
│  Week 3: Snapshot (incremental) = 103.5 + 3.5 GB = 107 GB
│  Week 4: Snapshot (incremental) = 107 + 3.5 GB = 110.5 GB
│  Week 5: Snapshot (incremental) = 110.5 + 3.5 GB = 114 GB
│  
│  Total storage: 114 GB (NOT 500 GB!)
│  Cost: 114 GB × $0.095 = $10.83/month ✅
│
└─ Cost continues even after database deletion

If snapshots were NOT incremental:
├─ 5 snapshots × 100 GB = 500 GB
└─ Cost: 500 GB × $0.095 = $47.50/month ❌

With EBS incremental snapshots:
├─ Total: 114 GB
└─ Cost: 114 GB × $0.095 = $10.83/month ✅

SAVINGS: $36.67/month (77% reduction)! 🎉


REALISTIC PRODUCTION EXAMPLE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

100 GB Production Database:

Automated Backups (7-day retention):
├─ Storage: 118 GB (incremental)
├─ Free tier: 100 GB
├─ Chargeable: 18 GB
└─ Cost: $1.71/month

Manual Snapshots (weekly for 3 months):
├─ 12 weekly snapshots
├─ Storage with incremental: ~142 GB
│  (100 GB base + 12 weeks × 3.5 GB avg)
├─ Cost: 142 GB × $0.095 = $13.49/month

Manual Snapshots (monthly for 1 year):
├─ 12 monthly snapshots
├─ Storage with incremental: ~175 GB
│  (100 GB base + 12 months × 6.25 GB avg)
├─ Cost: 175 GB × $0.095 = $16.63/month

Total Backup Cost:
├─ Automated: $1.71/month
├─ Weekly snapshots: $13.49/month
├─ Monthly snapshots: $16.63/month
└─ TOTAL: $31.83/month

vs Non-incremental (theoretical):
├─ Automated (7 full): $57/month
├─ Weekly (12 full): $114/month
├─ Monthly (12 full): $114/month
└─ TOTAL: $285/month ❌

INCREMENTAL BACKUPS SAVE: $253.17/month (89%) 🎉🎉🎉


FACTORS AFFECTING YOUR ACTUAL COSTS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Database Size:
   └─ Larger databases = more storage, but percentages same

2. Change Rate (Most Important):
   ├─ Read-heavy (1-2% daily): Lower costs
   ├─ Balanced (3-5% daily): Moderate costs
   └─ Write-heavy (10-15% daily): Higher costs

   Example 100 GB DB, 7-day retention:
   ├─ 2% daily: 100 + (6×2) = 112 GB → $1.14/month
   ├─ 5% daily: 100 + (6×5) = 130 GB → $2.85/month
   └─ 10% daily: 100 + (6×10) = 160 GB → $5.70/month

3. Retention Period:
   └─ Longer retention = more incremental backups accumulate

4. Number of Manual Snapshots:
   └─ Each incremental snapshot adds ~3-5% of database size


COST OPTIMIZATION TIPS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Leverage automated backup free tier (100% of DB size)
2. Use lifecycle policies to delete old manual snapshots
3. For dev/test environments, reduce retention to 1-3 days
4. Cross-region copies: Only copy critical snapshots
5. Monitor your change rate to predict costs
6. Delete snapshots of deleted databases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Performance Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                  PERFORMANCE IMPACT                               │
└──────────────────────────────────────────────────────────────────┘

AUTOMATED BACKUPS (INCREMENTAL):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

During Backup Window:

Single-AZ deployment:
├─ I/O suspension: Few seconds
├─ First full backup: 20-30 minutes (one-time)
│  └─ Reads entire database
├─ Subsequent incremental backups: 2-5 minutes
│  └─ Only reads changed blocks (90% faster)
├─ Elevated latency: Minimal due to incremental
└─ Schedule during low-traffic period

Multi-AZ deployment: ✅ RECOMMENDED
├─ Backup from standby: Zero impact on primary
├─ Incremental backups: Even faster
├─ No performance degradation
└─ Safe during business hours

Transaction Log Capture:
├─ Frequency: Every 5 minutes, 24/7
├─ Overhead: &amp;lt;1% CPU, minimal I/O
├─ No noticeable impact on production
└─ Required for point-in-time recovery

Performance Benefit of Incremental:
├─ Full backup: 100 GB read, ~30 min, high I/O
└─ Incremental: 3 GB read, ~3 min, low I/O
    IMPROVEMENT: 90% faster, 97% less I/O! ✨


MANUAL SNAPSHOTS (EBS INCREMENTAL):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

During Snapshot Creation:

Single-AZ deployment:
├─ First snapshot:
│  ├─ I/O freeze: 1-2 minutes (longer)
│  ├─ Snapshot duration: 20-30 minutes
│  └─ User queries may queue ⚠️
│
├─ Subsequent snapshots (incremental):
│  ├─ I/O freeze: Few seconds (much shorter)
│  ├─ Snapshot duration: 3-5 minutes
│  └─ Minimal impact ✅
│
└─ Take during maintenance window

Multi-AZ deployment: ✅ RECOMMENDED
├─ Snapshot from standby
├─ Incremental snapshots: Very fast
├─ Minimal impact on primary
└─ Safe during business hours

Snapshot Speed Comparison:
├─ First full snapshot: 20-30 min
└─ Incremental snapshots: 3-5 min
    IMPROVEMENT: 80% faster! ✨


RESTORE PERFORMANCE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Automated Backup (PITR):
├─ Process:
│  ├─ Restore base backup (full or recent incremental)
│  ├─ Apply subsequent incremental backups
│  ├─ Apply transaction logs to target time
│  └─ All optimized by AWS
│
├─ Performance:
│  ├─ 100 GB database: 15-30 minutes
│  ├─ Incremental application: Parallelized
│  └─ Almost as fast as full restore ✨
│
└─ Benefit of incremental:
   Less data to transfer from S3 = faster restore

Manual Snapshot:
├─ Process:
│  ├─ Restore EBS volumes from snapshot
│  ├─ Hydrate blocks on-demand (lazy loading)
│  └─ Full performance gradually available
│
├─ Performance:
│  ├─ 100 GB database: 10-20 minutes
│  ├─ Incremental snapshots: Same speed
│  └─ Instance available faster, but I/O slower initially
│
└─ Benefit of incremental:
   Less snapshot storage = lower S3 costs


DATABASE SIZE IMPACT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

10 GB Database:
├─ Full backup: 5-8 min
├─ Incremental: 1-2 min (80% faster)
└─ Restore: 5-10 min

100 GB Database:
├─ Full backup: 25-35 min
├─ Incremental: 3-5 min (90% faster)
└─ Restore: 15-30 min

1 TB Database:
├─ Full backup: 3-4 hours
├─ Incremental: 20-30 min (95% faster)
└─ Restore: 1-2 hours

The larger the database, the MORE beneficial incremental becomes! ✨
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 9: Best Practices &amp;amp; Recommendations
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Okay, I'm getting it now. What should our backup strategy look like?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Here's a comprehensive strategy that leverages incremental backups:&lt;/p&gt;

&lt;h3&gt;
  
  
  Production Database Backup Strategy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│         RECOMMENDED BACKUP STRATEGY (INCREMENTAL-AWARE)           │
└──────────────────────────────────────────────────────────────────┘

FOR PRODUCTION DATABASES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. AUTOMATED BACKUPS: ✅ Always Enabled (Incremental)

   Configuration:
   ├─ Retention: 7-14 days (affordable due to incremental)
   │  └─ 7 days: ~118 GB → $1.71/month
   │  └─ 14 days: ~139 GB → $3.71/month
   │  └─ 35 days: ~202 GB → $9.69/month (still affordable!)
   │
   ├─ Backup window: Low-traffic period (e.g., 2-4 AM)
   │  └─ Incremental backups complete in 2-5 minutes
   │
   ├─ Multi-AZ: Strongly recommended
   │  └─ Backup from standby = zero production impact
   │
   └─ Encryption: Always enable

   Purpose:
   └─ Day-to-day protection against accidents
      (leverages incremental efficiency)


2. MANUAL SNAPSHOTS: ✅ Event-Based (EBS Incremental)

   Create snapshots BEFORE:
   ├─ Every deployment
   │  ├─ Name: "pre-deploy-v2.3.1-2024-01-15"
   │  ├─ First: 100 GB (10 min)
   │  └─ Subsequent: +3 GB incremental (3 min)
   │
   ├─ Schema migrations
   │  └─ Name: "before-schema-migration-2024-01-15"
   │
   ├─ Bulk data operations
   │  └─ Name: "before-data-cleanup-2024-01-15"
   │
   ├─ Weekly (Sunday midnight)
   │  ├─ Name: "weekly-2024-W03"
   │  └─ Incremental from last week's snapshot
   │
   └─ Monthly (last day of month)
      ├─ Name: "monthly-2024-01"
      └─ Incremental chain continues

   Retention (with incremental storage):
   ├─ Pre-deployment: Keep for 7 days
   │  └─ ~7 snapshots × 3 GB avg = 121 GB total
   │
   ├─ Weekly: Keep for 3 months
   │  └─ 12 snapshots × 3.5 GB avg = 142 GB total
   │
   ├─ Monthly: Keep for 1 year
   │  └─ 12 snapshots × 6 GB avg = 172 GB total
   │
   └─ Total manual snapshot storage: ~250 GB
      Cost: 250 GB × $0.095 = $23.75/month
      (vs 1,260 GB if full backups = $119.70/month) ✅


3. CROSS-REGION COPIES: ✅ For Disaster Recovery

   Strategy (Incremental-optimized):
   ├─ Copy daily snapshot to DR region
   │  ├─ First copy: 100 GB transfer = $2.00
   │  └─ Daily copies: 3-5 GB transfer = $0.06-0.10 each
   │
   ├─ Monthly transfer cost: ~$3-4 (due to incremental)
   │  vs $60 if full backups every time ❌
   │
   └─ Test restore quarterly

   Example configuration:
   ├─ Primary: us-east-1
   ├─ DR: us-west-2
   ├─ Daily automated backup: 3 AM (incremental)
   ├─ Daily manual snapshot: 4 AM (for copying)
   ├─ Copy to us-west-2: 4:30 AM (only changed blocks)
   └─ Retention in DR: 7 days rolling


4. TESTING: ✅ Verify Your Backups

   Monthly:
   ├─ Restore automated backup to test instance
   │  └─ Uses incremental chain (fast restore)
   ├─ Verify data integrity
   ├─ Test application connectivity
   └─ Document restore time

   Quarterly:
   ├─ Full DR drill using cross-region snapshot
   ├─ Restore from incremental snapshot chain
   └─ Measure RTO (Recovery Time Objective)


TOTAL COST BREAKDOWN:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

100 GB Production Database:

Automated Backups (14-day retention):
└─ $3.71/month (incremental)

Manual Snapshots:
├─ Weekly (12 weeks): $10.83/month
└─ Monthly (12 months): $12.92/month
└─ Subtotal: $23.75/month

Cross-Region Copies (us-west-2):
├─ Storage: 107 GB × $0.095 = $10.17/month
└─ Transfer: ~$3/month (incremental)
└─ Subtotal: $13.17/month

TOTAL BACKUP COST: $40.63/month ✅

If NOT incremental (theoretical):
├─ Automated: $57/month
├─ Manual: $119.70/month
├─ Cross-region: $60/month
└─ TOTAL: $236.70/month ❌

INCREMENTAL SAVINGS: $196.07/month (83%) 🎉
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Automation Script Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Lambda function for incremental-aware snapshot management
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="n"&gt;rds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;db_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production-db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# 1. Create weekly snapshot (incremental from previous)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;weekday&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Sunday
&lt;/span&gt;        &lt;span class="n"&gt;snapshot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weekly-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;%Y-W%W&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating weekly snapshot: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Note: This will be incremental from last week&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s snapshot&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_db_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Retention&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;90days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Incremental&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Wait for snapshot to complete
&lt;/span&gt;        &lt;span class="n"&gt;waiter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_waiter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;db_snapshot_completed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;waiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Copy to DR region (only incremental changes transferred)
&lt;/span&gt;        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Copying snapshot to DR region (incremental transfer)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy_db_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;SourceDBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:rds:us-east-1:123456789012:snapshot:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;TargetDBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;SourceRegion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;TargetRegion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;CopyTags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Monitor snapshot size (for cost tracking)
&lt;/span&gt;        &lt;span class="n"&gt;snapshot_info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;
        &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;allocated_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot_info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AllocatedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Custom/RDS/Backups&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;allocated_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotIncremental&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# This is an incremental snapshot
&lt;/span&gt;                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotType&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 2. Cleanup old snapshots based on retention
&lt;/span&gt;    &lt;span class="n"&gt;snapshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SnapshotType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;manual&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate total storage used (incremental accumulation)
&lt;/span&gt;    &lt;span class="n"&gt;total_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;snapshot_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
                          &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;created&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;
        &lt;span class="n"&gt;size_gb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AllocatedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;snapshot_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;size_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;created&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;created&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="n"&gt;total_storage&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;size_gb&lt;/span&gt;

        &lt;span class="c1"&gt;# Delete based on retention policy
&lt;/span&gt;        &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TagList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;

        &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PreDeploy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deleting pre-deploy snapshot (&amp;gt;7 days): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deleting weekly snapshot (&amp;gt;90 days): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Monthly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deleting monthly snapshot (&amp;gt;365 days): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_delete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_db_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# 3. Report storage efficiency
&lt;/span&gt;    &lt;span class="n"&gt;db_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AllocatedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_chain&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;avg_incremental_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_storage&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== Backup Storage Report (Incremental) ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of snapshots: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total storage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average incremental size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_incremental_size&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Storage efficiency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Theoretical full backups: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Savings from incremental: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Publish metrics
&lt;/span&gt;    &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Custom/RDS/Backups&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TotalBackupStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AverageIncrementalSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_incremental_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;StorageEfficiency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Snapshot management completed. Total storage: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_storage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 10: Common Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: What mistakes should I watch out for?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! Here are the common pitfalls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                   COMMON MISTAKES                                 │
└──────────────────────────────────────────────────────────────────┘

❌ MISTAKE #1: Disabling Automated Backups
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Setting retention period to 0 disables automated backups
├─ No transaction logs captured
├─ No point-in-time recovery possible
├─ Automated backups deleted immediately
└─ Lose incremental backup efficiency

Real scenario:
Developer: "Let's disable backups to save costs on dev DB"
[3 days later...]
Developer: "We need to recover yesterday's data!"
Result: Impossible. No backups exist. 😱

Reality check:
├─ 7-day retention cost: ~$1.71/month (with incremental)
└─ Data loss from no backup: Priceless ⚠️

Fix:
└─ Keep minimum 1 day retention even for dev/test


❌ MISTAKE #2: Relying Only on Manual Snapshots
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Snapshots taken once a day
├─ Accidental deletion at 3 PM
└─ Can only restore to midnight (15 hours of data loss)

Real scenario:
Company relies on daily midnight snapshots only
Ransomware attack at 2 PM
Result: Lose entire day's transactions

Fix:
└─ Always enable automated backups + selective manual snapshots
   (Incremental backups make this affordable!)


❌ MISTAKE #3: Not Understanding Incremental Nature
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Developer thinks: "7 days = 7 × 100 GB = 700 GB cost"
├─ Actually: 7 days = ~118 GB (incremental)
├─ Disables backups unnecessarily
└─ Or panics about costs that don't exist

Real scenario:
"We can't afford 35-day retention, it'll cost $300/month!"
Reality: 35-day retention = ~$10/month with incremental ✅

Fix:
└─ Understand incremental nature = affordable long retention


❌ MISTAKE #4: Not Testing Restores
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Backups created but never tested
├─ Disaster strikes
└─ Discover backup is corrupted or incomplete

Real scenario:
"We have backups for 2 years"
[During disaster recovery...]
"The incremental chain is broken!"
Result: Extended downtime, data loss

Fix:
└─ Monthly restore tests
   (Incremental restores are fast, so test often!)


❌ MISTAKE #5: Forgetting About Cross-Region
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ All backups in same region as production
├─ Region-wide outage occurs
└─ Cannot access ANY backups

Real scenario:
All resources in us-east-1
AWS us-east-1 major outage
Result: No access to database or backups for hours

With incremental:
├─ Cross-region copy: Only 3-5 GB/day transfer
├─ Cost: ~$3-4/month (affordable!)
└─ No excuse not to have DR

Fix:
└─ Copy snapshots to secondary region
   (Incremental makes daily copies affordable)


❌ MISTAKE #6: Deleting Snapshots in Chain
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens (EBS snapshots):
├─ Snap1 (full): 100 GB
├─ Snap2 (incremental): +3 GB references Snap1
├─ Snap3 (incremental): +3 GB references Snap2
├─ Delete Snap2
└─ AWS automatically merges Snap2 into Snap3 ✅

Note: AWS handles this gracefully, but:
├─ Deletion takes longer
├─ May incur temporary extra storage
└─ Better to delete oldest or newest, not middle

Fix:
└─ Use lifecycle policies, delete in order


❌ MISTAKE #7: Not Monitoring Change Rate
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Assume 3% daily change rate
├─ Actually experiencing 15% change rate
├─ Storage costs 5x higher than expected
└─ Bill shock at end of month

Real scenario:
Database change rate increases due to new feature
Backup storage grows from 118 GB to 200 GB
Cost increases from $1.71 to $9.50/month

Fix:
└─ Monitor backup storage growth
   Track incremental size trends
   Set CloudWatch alarms


❌ MISTAKE #8: Using Snapshot When PITR Needed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What happens:
├─ Accidental DELETE at 2:35 PM
├─ Have snapshot from 2:00 PM
├─ Restore snapshot
└─ Lose 35 minutes of legitimate transactions ❌

Should have used:
└─ PITR to 2:34 PM (1 minute before mistake)
   Only loses 1 minute, not 35 minutes

Fix:
└─ Use PITR for recent accidents
   Use snapshots for major changes


COMPLETE FIX CHECKLIST:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

□ Automated backups enabled (7-14 days)
□ Understand incremental nature for cost planning
□ Manual snapshots before major changes
□ Monthly restore tests
□ Cross-region copies for DR
□ Monitor backup storage growth
□ Monitor daily change rate
□ Use PITR for accidents, snapshots for milestones
□ Don't delete middle snapshots in chain
□ Document backup/restore procedures
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 11: Quick Reference Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Can you give me a cheat sheet I can refer to quickly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Absolutely! Here's your quick reference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                    QUICK REFERENCE GUIDE                          │
└──────────────────────────────────────────────────────────────────┘

WHEN TO USE WHAT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Situation                           → Use This
─────────────────────────────────────────────────────────────────
Accidental DELETE/UPDATE            → Automated Backup (PITR)
User error (last few hours)         → Automated Backup (PITR)
Data corruption (recent)            → Automated Backup (PITR)
Before deployment                   → Manual Snapshot
Before schema migration             → Manual Snapshot
Before bulk data operation          → Manual Snapshot
Monthly archival                    → Manual Snapshot
Cross-region disaster recovery      → Manual Snapshot (copied)
Need to restore to exact second     → Automated Backup (PITR)
Need clean state before change      → Manual Snapshot
Database deleted accidentally       → Manual Snapshot (if exists)
Compliance/audit requirements       → Manual Snapshot


KEY DIFFERENCES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Feature              Automated Backups    Manual Snapshots
──────────────────────────────────────────────────────────────────
Incremental          ✅ Yes (after 1st)   ✅ Yes (EBS-based)
Transaction Logs     ✅ Yes               ❌ No
PITR                 ✅ Yes               ❌ No
Cross-Region         ❌ No                ✅ Yes
Retention            0-35 days            Forever
Auto-deleted         ✅ With DB           ❌ Manual delete
Free tier            ✅ 100% of DB size   ❌ No free tier
Cost (7 days)        ~$1.71/mo            $9.50/mo (weekly)


STORAGE CALCULATIONS (100 GB DATABASE):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Assuming 3% daily change rate:

Retention    Storage Needed    Chargeable    Cost/Month
─────────────────────────────────────────────────────────
1 day        100 GB            0 GB          $0.00
7 days       118 GB            18 GB         $1.71
14 days      139 GB            39 GB         $3.71
35 days      202 GB            102 GB        $9.69

Manual Snapshots (incremental):
5 weekly     114 GB            114 GB        $10.83
12 monthly   175 GB            175 GB        $16.63


RESTORE TIME ESTIMATE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Database Size    PITR Restore    Snapshot Restore
────────────────────────────────────────────────
10 GB           5-10 min        5-8 min
50 GB           10-20 min       8-15 min
100 GB          15-30 min       10-20 min
500 GB          30-60 min       20-40 min
1 TB+           1-2 hours       40-90 min

Note: Incremental nature doesn't significantly impact restore time
      AWS optimizes restore process automatically


BACKUP WINDOW DURATION (100 GB DB):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

First backup (full):     25-35 minutes
Daily backup (incr):     2-5 minutes ✨ 90% faster!
Snapshot (first):        20-30 minutes
Snapshot (incremental):  3-5 minutes ✨ 85% faster!


CHANGE RATE IMPACT (100 GB DB, 7 days):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Daily Change    Storage     Cost/Month
─────────────────────────────────────
1% (1 GB)      106 GB      $0.57
3% (3 GB)      118 GB      $1.71
5% (5 GB)      130 GB      $2.85
10% (10 GB)    160 GB      $5.70
15% (15 GB)    190 GB      $8.55


RETENTION RECOMMENDATIONS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Environment     Automated        Manual Snapshots
─────────────────────────────────────────────────────────
Production      7-14 days        Weekly (3mo) + Monthly (1yr)
Staging         3-7 days         Pre-deployment only
Development     1-3 days         Optional
Test            1 day            Not needed


COST COMPARISON: INCREMENTAL vs FULL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

100 GB Database, 7-day retention:

With Incremental (RDS actual):
├─ Storage: 118 GB
├─ Free tier: 100 GB
└─ Cost: $1.71/month ✅

Without Incremental (theoretical):
├─ Storage: 700 GB (7 × 100 GB)
├─ Free tier: 100 GB
└─ Cost: $57/month ❌

SAVINGS: 97% reduction! 🎉


CHECKLIST FOR PRODUCTION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

□ Automated backups enabled (7-14 days)
□ Multi-AZ enabled (zero-impact incremental backups)
□ Backup window during low traffic
□ Encryption enabled
□ Manual snapshot before each deployment
□ Weekly automated snapshots
□ Monthly long-term snapshots
□ Cross-region copies configured (incremental transfer)
□ Snapshot lifecycle policy in place
□ Monthly restore tests scheduled
□ Monitor backup storage growth
□ Monitor daily change rate
□ Backup/restore runbook documented
□ Team trained on PITR vs snapshot usage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  AWS CLI Commands
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create manual snapshot (will be incremental if previous exists)&lt;/span&gt;
aws rds create-db-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; mydb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; mydb-snapshot-2024-01-15

&lt;span class="c"&gt;# Restore from PITR (uses incremental backups + transaction logs)&lt;/span&gt;
aws rds restore-db-instance-to-point-in-time &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-db-instance-identifier&lt;/span&gt; mydb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target-db-instance-identifier&lt;/span&gt; mydb-restored &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restore-time&lt;/span&gt; 2024-01-15T14:30:00Z

&lt;span class="c"&gt;# Restore from snapshot (uses incremental snapshot chain)&lt;/span&gt;
aws rds restore-db-instance-from-db-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; mydb-restored &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; mydb-snapshot-2024-01-15

&lt;span class="c"&gt;# List available restore times&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; mydb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].[LatestRestorableTime,EarliestRestorableTime]'&lt;/span&gt;

&lt;span class="c"&gt;# Check backup storage (shows incremental accumulation)&lt;/span&gt;
aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; mydb &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].BackupRetentionPeriodStorageUsed'&lt;/span&gt;

&lt;span class="c"&gt;# Copy snapshot to another region (only changed blocks transferred)&lt;/span&gt;
aws rds copy-db-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-db-snapshot-identifier&lt;/span&gt; arn:aws:rds:us-east-1:123456789012:snapshot:mydb-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target-db-snapshot-identifier&lt;/span&gt; mydb-snapshot-dr &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--source-region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-west-2

&lt;span class="c"&gt;# Delete old snapshots (AWS handles incremental chain automatically)&lt;/span&gt;
aws rds delete-db-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; mydb-snapshot-old

&lt;span class="c"&gt;# Describe snapshot to see actual size (incremental)&lt;/span&gt;
aws rds describe-db-snapshots &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; mydb-snapshot-2024-01-15 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBSnapshots[0].[AllocatedStorage,SnapshotType,PercentProgress]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 12: Monitoring and Alerts
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: How do I monitor my backups and get alerted if something goes wrong?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Excellent question! Monitoring is critical, especially with incremental backups. Here's your setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              MONITORING &amp;amp; ALERTS SETUP                            │
└──────────────────────────────────────────────────────────────────┘

CLOUDWATCH METRICS TO MONITOR:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. BackupRetentionPeriodStorageUsed
   ├─ Shows total automated backup storage (incremental)
   ├─ Expected: DB_size + (retention_days × daily_change)
   ├─ Alert if growing faster than expected
   └─ Alert if approaching limits

2. DailyBackupSize (Custom metric)
   ├─ Track daily incremental backup size
   ├─ Should be 2-5% of database size typically
   ├─ Alert if suddenly much larger (indicates issues)
   └─ Helps predict future costs

3. SnapshotStorageUsed (Custom metric)
   ├─ Total manual snapshot storage
   ├─ Should grow incrementally
   └─ Alert if unexpected spikes

4. ChangeRate (Custom metric)
   ├─ Percentage of database changed daily
   ├─ Affects incremental backup size
   └─ Alert if rate doubles unexpectedly


CLOUDWATCH ALARMS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudFormation for backup monitoring (incremental-aware)&lt;/span&gt;

&lt;span class="na"&gt;BackupStorageAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;AlarmName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RDS-Backup-Storage-High&lt;/span&gt;
    &lt;span class="na"&gt;AlarmDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Alert when backup storage growing unexpectedly&lt;/span&gt;
    &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BackupRetentionPeriodStorageUsed&lt;/span&gt;
    &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS/RDS&lt;/span&gt;
    &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;
        &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;DBInstance&lt;/span&gt;
    &lt;span class="na"&gt;Statistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Average&lt;/span&gt;
    &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3600&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;150&lt;/span&gt;  &lt;span class="c1"&gt;# 150% of database size&lt;/span&gt;
    &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;
    &lt;span class="na"&gt;AlarmActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SNSTopic&lt;/span&gt;
    &lt;span class="na"&gt;TreatMissingData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;notBreaching&lt;/span&gt;

&lt;span class="na"&gt;IncrementalSizeAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;AlarmName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RDS-Large-Incremental-Backup&lt;/span&gt;
    &lt;span class="na"&gt;AlarmDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Alert when daily incremental unusually large&lt;/span&gt;
    &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DailyIncrementalSize&lt;/span&gt;
    &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom/RDS&lt;/span&gt;
    &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;
        &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;DBInstance&lt;/span&gt;
    &lt;span class="na"&gt;Statistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Average&lt;/span&gt;
    &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# 10 GB (adjust based on your DB)&lt;/span&gt;
    &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;
    &lt;span class="na"&gt;AlarmActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SNSTopic&lt;/span&gt;

&lt;span class="na"&gt;ChangeRateAlarm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::CloudWatch::Alarm&lt;/span&gt;
  &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;AlarmName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RDS-High-Change-Rate&lt;/span&gt;
    &lt;span class="na"&gt;AlarmDescription&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Alert when change rate unusually high&lt;/span&gt;
    &lt;span class="na"&gt;MetricName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DailyChangeRate&lt;/span&gt;
    &lt;span class="na"&gt;Namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Custom/RDS&lt;/span&gt;
    &lt;span class="na"&gt;Dimensions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;
        &lt;span class="na"&gt;Value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;DBInstance&lt;/span&gt;
    &lt;span class="na"&gt;Statistic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Average&lt;/span&gt;
    &lt;span class="na"&gt;Period&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
    &lt;span class="na"&gt;EvaluationPeriods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;Threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;  &lt;span class="c1"&gt;# 15% daily change&lt;/span&gt;
    &lt;span class="na"&gt;ComparisonOperator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GreaterThanThreshold&lt;/span&gt;
    &lt;span class="na"&gt;AlarmActions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;!Ref&lt;/span&gt; &lt;span class="s"&gt;SNSTopic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Enhanced Lambda for incremental backup monitoring
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cloudwatch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudwatch&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;db_instance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;production-db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Get database info
&lt;/span&gt;    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_instances&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstances&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;db_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AllocatedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;backup_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BackupRetentionPeriodStorageUsed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;retention_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BackupRetentionPeriod&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Check if backups are disabled
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retention_days&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;send_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🚨 Automated backups are DISABLED!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Backups disabled!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate expected storage (incremental model)
&lt;/span&gt;    &lt;span class="c1"&gt;# Assume 3% daily change rate
&lt;/span&gt;    &lt;span class="n"&gt;expected_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retention_days&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;db_size&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.03&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate actual vs expected
&lt;/span&gt;    &lt;span class="n"&gt;storage_ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;backup_storage&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;expected_storage&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;expected_storage&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="c1"&gt;# Calculate average daily incremental size
&lt;/span&gt;    &lt;span class="n"&gt;avg_incremental&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;backup_storage&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;retention_days&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;daily_change_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_incremental&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

    &lt;span class="c1"&gt;# Publish metrics
&lt;/span&gt;    &lt;span class="n"&gt;metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BackupStorageUsed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;backup_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;BackupStorageRatio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;storage_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DailyIncrementalSize&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;avg_incremental&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DailyChangeRate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;daily_change_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ExpectedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;expected_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;metric&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Custom/RDS/Backups&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;metrics&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Alert conditions
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;storage_ratio&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;send_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ Backup storage &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;storage_ratio&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;x higher than expected!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Current: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;backup_storage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Expected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;expected_storage&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Possible high change rate or retention issues.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;db_instance&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;daily_change_rate&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;send_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ High daily change rate detected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;daily_change_rate&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average incremental: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;avg_incremental&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB/day&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;This will increase backup costs.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;db_instance&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Check manual snapshots
&lt;/span&gt;    &lt;span class="n"&gt;snapshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;SnapshotType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;manual&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;MaxRecords&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# Calculate total snapshot storage (incremental)
&lt;/span&gt;        &lt;span class="n"&gt;total_snapshot_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;snap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AllocatedStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;snap&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Analyze snapshot chain
&lt;/span&gt;        &lt;span class="n"&gt;snapshot_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Calculate incremental efficiency
&lt;/span&gt;        &lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshot_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;theoretical_full&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;
            &lt;span class="n"&gt;actual_storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;total_snapshot_storage&lt;/span&gt;
            &lt;span class="n"&gt;efficiency&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;actual_storage&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;theoretical_full&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;

            &lt;span class="n"&gt;cloudwatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_metric_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;Namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Custom/RDS/Snapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;MetricData&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TotalSnapshotStorage&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_snapshot_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Gigabytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotEfficiency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Percent&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MetricName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;NumberOfSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;num_snapshots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Dimensions&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBInstanceIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                        &lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== Snapshot Efficiency Report ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number of snapshots: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;num_snapshots&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database size: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total storage used: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_snapshot_storage&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Theoretical (full backups): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;theoretical_full&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Storage efficiency: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;efficiency&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Savings from incremental: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;theoretical_full&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;actual_storage&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; GB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for old snapshots
&lt;/span&gt;        &lt;span class="n"&gt;latest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot_list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;age_hours&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; 
                    &lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;total_seconds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;age_hours&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;168&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# 7 days
&lt;/span&gt;            &lt;span class="nf"&gt;send_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;⚠️ No recent manual snapshot!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Latest snapshot: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;latest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Age: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;age_hours&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;db_instance&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate report
&lt;/span&gt;    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;database_size_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;db_size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;backup_storage_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;backup_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;retention_days&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retention_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;expected_storage_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;expected_storage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;storage_ratio&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storage_ratio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;avg_incremental_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_incremental&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;daily_change_rate_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;daily_change_rate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_snapshot_storage_gb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;total_snapshot_storage&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_snapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;efficiency_pct&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;efficiency&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;num_snapshots&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utcnow&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_alert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sns&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;TopicArn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:sns:us-east-1:123456789012:rds-alerts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RDS Backup Alert: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;db_instance&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 13: Understanding Snapshot and Backup Deletion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: I have a bunch of old snapshots piling up. Can I just delete them? Will it break anything? And what happens to automated backups?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Great question! Deletion is often misunderstood and can be risky if done incorrectly. Let me explain how it all works:&lt;/p&gt;

&lt;h3&gt;
  
  
  Automated Backup Deletion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              AUTOMATED BACKUP DELETION                            │
└──────────────────────────────────────────────────────────────────┘

HOW AUTOMATED BACKUPS ARE DELETED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Automated backups are AUTOMATICALLY deleted in these scenarios:

1. ROLLING RETENTION WINDOW (Normal Operation):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Configuration: 7-day retention

Timeline:
Day 1 (Jan 1):  Full backup created ✅
Day 2 (Jan 2):  Incremental backup ✅
Day 3 (Jan 3):  Incremental backup ✅
Day 4 (Jan 4):  Incremental backup ✅
Day 5 (Jan 5):  Incremental backup ✅
Day 6 (Jan 6):  Incremental backup ✅
Day 7 (Jan 7):  Incremental backup ✅
                └─ 7 backups exist, storage: ~118 GB

Day 8 (Jan 8):  New incremental backup ✅
                └─ Jan 1 backup AUTOMATICALLY DELETED ♻️
                └─ Storage: Still ~118 GB (rolling window)

This happens AUTOMATICALLY every day:
├─ New backup created
├─ Oldest backup deleted
├─ Maintains constant storage usage
└─ No manual intervention needed ✅


2. CHANGING RETENTION PERIOD:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario A: REDUCE retention (14 days → 7 days)

Before change:
├─ 14 days of backups exist
└─ Storage: ~139 GB

After change:
├─ Backups older than 7 days IMMEDIATELY DELETED 🗑️
├─ Only last 7 days retained
├─ Storage: ~118 GB
└─ CANNOT be undone! ⚠️

Example:
10:00 AM → Change retention from 14 to 7 days
10:01 AM → Backups from Jan 1-7 deleted immediately
10:02 AM → Only Jan 8-14 backups remain
Result: Lost 7 days of restore points! ⚠️


Scenario B: INCREASE retention (7 days → 14 days)

Before change:
├─ 7 days of backups exist
└─ Storage: ~118 GB

After change:
├─ Existing backups NOT deleted
├─ New backups start accumulating to 14 days
├─ After 14 days: Storage grows to ~139 GB
└─ No data lost ✅


3. DISABLING AUTOMATED BACKUPS (DANGEROUS!):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Action: Set retention period to 0

What happens IMMEDIATELY:
├─ All automated backups DELETED 🗑️
├─ All transaction logs DELETED 🗑️
├─ Point-in-time recovery DISABLED ❌
└─ CANNOT be undone! ⚠️

Timeline:
10:00 AM → Retention period = 7 days
           └─ Can restore to any point in last 7 days ✅

10:05 AM → Change retention to 0 (disable backups)
           └─ Confirmation required in console

10:06 AM → ALL BACKUPS DELETED IMMEDIATELY 🗑️
           └─ Cannot restore to ANY point ❌

10:10 AM → Oh no! Need yesterday's data!
           └─ IMPOSSIBLE - backups gone forever 💀

Real-world disaster:
Developer: "Let me disable backups to save $2/month on dev DB"
[3 days later...]
Developer: "Need to restore yesterday's code testing data"
Result: Data lost forever, 8 hours to rebuild 😱


4. DELETING THE RDS INSTANCE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

When you delete an RDS instance:

Option A: WITHOUT final snapshot
┌─────────────────────────────────────────────┐
│  Delete DB Instance                          │
├─────────────────────────────────────────────┤
│  ⚠️  This will PERMANENTLY delete:          │
│     • The database instance                 │
│     • ALL automated backups 🗑️             │
│     • ALL transaction logs 🗑️              │
│                                              │
│  □ Create final snapshot before deletion    │
│    [Leave unchecked = NO SNAPSHOT]          │
│                                              │
│  Type "delete me" to confirm:               │
│  [____________]                             │
│                                              │
│  [Cancel]  [Delete]                         │
└─────────────────────────────────────────────┘

Result:
├─ Instance: DELETED ✅
├─ Automated backups: DELETED 🗑️
├─ Manual snapshots: PRESERVED ✅
└─ Recovery: ONLY from manual snapshots


Option B: WITH final snapshot (RECOMMENDED)
┌─────────────────────────────────────────────┐
│  Delete DB Instance                          │
├─────────────────────────────────────────────┤
│  ☑️  Create final snapshot                  │
│                                              │
│  Snapshot name:                             │
│  [prod-db-final-2024-01-15] ✅             │
│                                              │
│  This snapshot will be retained even        │
│  after instance deletion                    │
│                                              │
│  [Cancel]  [Delete with Snapshot]           │
└─────────────────────────────────────────────┘

Result:
├─ Instance: DELETED ✅
├─ Automated backups: DELETED 🗑️
├─ Final snapshot: CREATED ✅
├─ Manual snapshots: PRESERVED ✅
└─ Recovery: From final snapshot or old manual snapshots


WHAT GETS DELETED vs PRESERVED:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Action                      Automated Backups    Manual Snapshots
──────────────────────────────────────────────────────────────────
Rolling retention           Oldest deleted       Preserved
Reduce retention period     Old ones deleted     Preserved
Disable backups             ALL deleted 🗑️      Preserved ✅
Delete instance (no snap)   ALL deleted 🗑️      Preserved ✅
Delete instance (w/ snap)   ALL deleted 🗑️      Preserved ✅

Key takeaway:
└─ Manual snapshots are NEVER automatically deleted ✅
   You must delete them manually
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manual Snapshot Deletion
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              MANUAL SNAPSHOT DELETION                             │
└──────────────────────────────────────────────────────────────────┘

MANUAL SNAPSHOTS NEVER AUTO-DELETE:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Manual snapshots persist forever until YOU delete them:

Scenario:
├─ Create snapshot: Jan 1, 2023
├─ Delete RDS instance: Jan 15, 2023
├─ One year later: Jan 1, 2024
└─ Snapshot STILL EXISTS (and costs $10/month!) 💰

You MUST manually delete:
├─ Via AWS Console
├─ Via AWS CLI
├─ Via API/SDK
└─ Or set up automated cleanup (Lambda)


HOW TO DELETE A MANUAL SNAPSHOT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

AWS Console Method:
1. Go to RDS → Snapshots
2. Select snapshot(s) to delete
3. Actions → Delete snapshot
4. Confirm deletion

⚠️  WARNING: No confirmation dialog!
    Once you click delete, it's gone immediately!

CLI Method:
aws rds delete-db-snapshot \
  --db-snapshot-identifier my-snapshot-name

Response:
{
    "DBSnapshot": {
        "DBSnapshotIdentifier": "my-snapshot-name",
        "Status": "deleting",
        ...
    }
}

Status progression:
├─ "available" → "deleting" → deleted
└─ Takes a few seconds to minutes


SNAPSHOT DELETION ISSUES &amp;amp; DEPENDENCIES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

ISSUE #1: Shared Snapshots
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

You shared a snapshot with another AWS account:

Account A (Owner):
├─ Created snapshot: prod-db-snapshot-2024-01-15
├─ Shared with Account B
└─ Tries to delete...

Error:
❌ Cannot delete snapshot while shared with other accounts

Solution:
1. First, un-share from all accounts
   ├─ Go to snapshot → Modify snapshot
   ├─ Remove all shared accounts
   └─ Save changes
2. Then delete the snapshot ✅

AWS CLI:
# Un-share
aws rds modify-db-snapshot-attribute \
  --db-snapshot-identifier prod-db-snapshot-2024-01-15 \
  --attribute-name restore \
  --values-to-remove 123456789012

# Then delete
aws rds delete-db-snapshot \
  --db-snapshot-identifier prod-db-snapshot-2024-01-15


ISSUE #2: Encrypted Snapshots with KMS
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Encrypted snapshot uses customer-managed KMS key:

Scenario:
├─ Snapshot encrypted with KMS key: my-rds-key
├─ Delete KMS key (schedule deletion)
└─ Can still delete snapshot? YES ✅

However:
├─ If KMS key deleted AND snapshot deleted
└─ CANNOT restore from snapshot copies ⚠️

Best practice:
1. Delete all snapshots using that KMS key
2. Ensure no copies exist in other regions
3. Then schedule KMS key deletion (7-30 day wait)


ISSUE #3: Snapshots in Incremental Chain
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Remember: EBS snapshots are incremental!

Snapshot Chain:
├─ Snapshot 1 (Jan 1): 100 GB (full)
├─ Snapshot 2 (Jan 8): +3 GB (incremental, references Snap 1)
├─ Snapshot 3 (Jan 15): +3 GB (incremental, references Snap 2)
└─ Total storage: 106 GB

What happens if you delete Snapshot 2?

AWS Automatically:
1. Merges Snapshot 2 data into Snapshot 3
2. Snapshot 3 now references Snapshot 1 directly
3. No data lost ✅
4. Deletion takes longer (merge operation)

Before deletion:
Snap 1 (100 GB) ← Snap 2 (3 GB) ← Snap 3 (3 GB)

After deleting Snap 2:
Snap 1 (100 GB) ← Snap 3 (6 GB, merged)

Result:
├─ Total storage: Still 106 GB
├─ Can still restore Snap 1: ✅
├─ Can still restore Snap 3: ✅
└─ Snap 3 restore includes Snap 2 data ✅

Time to delete:
├─ Simple snapshot: Seconds
├─ Snapshot in chain: Minutes (depends on size)


ISSUE #4: Cross-Region Snapshot Copies
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original snapshot copied to multiple regions:

Setup:
├─ Original: us-east-1 → snapshot-prod-2024-01-15
├─ Copy 1: us-west-2 → snapshot-prod-2024-01-15
└─ Copy 2: eu-west-1 → snapshot-prod-2024-01-15

Deleting original (us-east-1):
├─ Does NOT delete copies ✅
├─ Copies remain independent
└─ Must delete each copy separately

Cost impact:
├─ Original deleted: Save $10/month in us-east-1
├─ 2 copies remain: Still pay $20/month total
└─ Don't forget to delete copies! 💰


SAFE DELETION PROCESS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Before deleting any snapshot:

✅ CHECKLIST:
□ Is this snapshot shared? → Un-share first
□ Are there copies in other regions? → Delete those too
□ Is this the only backup of important data? → Create new one first
□ Is instance still running? → Verify recent backups exist
□ Is this production? → Get approval/document
□ Have you verified snapshot name? → Double-check!

Safe deletion script:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;SNAPSHOT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"my-snapshot-to-delete"&lt;/span&gt;

&lt;span class="c"&gt;# 1. Check snapshot details&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Checking snapshot: &lt;/span&gt;&lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
aws rds describe-db-snapshots &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; &lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBSnapshots[0].[DBSnapshotIdentifier,SnapshotCreateTime,AllocatedStorage,Status]'&lt;/span&gt;

&lt;span class="c"&gt;# 2. Check if shared&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Checking if snapshot is shared..."&lt;/span&gt;
&lt;span class="nv"&gt;SHARED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws rds describe-db-snapshot-attributes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; &lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBSnapshotAttributesResult.DBSnapshotAttributes[?AttributeName==`restore`].AttributeValues'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SHARED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"⚠️  Snapshot is shared with: &lt;/span&gt;&lt;span class="nv"&gt;$SHARED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Un-share before deletion"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 3. List cross-region copies&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Checking for cross-region copies..."&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;region &lt;span class="k"&gt;in &lt;/span&gt;us-east-1 us-west-2 eu-west-1&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Checking region: &lt;/span&gt;&lt;span class="nv"&gt;$region&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  aws rds describe-db-snapshots &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; &lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$region&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBSnapshots[0].DBSnapshotIdentifier'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--output&lt;/span&gt; text 2&amp;gt;/dev/null
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# 4. Confirm deletion&lt;/span&gt;
&lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Are you sure you want to delete &lt;/span&gt;&lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt;&lt;span class="s2"&gt;? (yes/no): "&lt;/span&gt; confirm
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$confirm&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s2"&gt;"yes"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deletion cancelled"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;0
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# 5. Delete snapshot&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Deleting snapshot..."&lt;/span&gt;
aws rds delete-db-snapshot &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; &lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Snapshot deletion initiated"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AUTOMATED CLEANUP STRATEGY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Lambda function for lifecycle management:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lambda_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rds&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Get all manual snapshots
&lt;/span&gt;    &lt;span class="n"&gt;snapshots&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_snapshots&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;SnapshotType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;manual&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;deleted_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;snapshots&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;snapshot_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;created_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;SnapshotCreateTime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tzinfo&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;created_time&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;

        &lt;span class="c1"&gt;# Parse snapshot tags to determine retention
&lt;/span&gt;        &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tag&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;snapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;TagList&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])}&lt;/span&gt;

        &lt;span class="n"&gt;retention_policy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RetentionPolicy&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Apply retention policies
&lt;/span&gt;        &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;retention_policy&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weekly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weekly snapshot older than 90 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;retention_policy&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;monthly&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monthly snapshot older than 365 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;retention_policy&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Temporary snapshot older than 7 days&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;retention_policy&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;age_days&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;should_delete&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Default retention exceeded (30 days)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;should_delete&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Deleting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="c1"&gt;# Check if shared
&lt;/span&gt;                &lt;span class="n"&gt;attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe_db_snapshot_attributes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;restore_attrs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;attrs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotAttributesResult&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DBSnapshotAttributes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="n"&gt;is_shared&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AttributeName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;restore&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;AttributeValues&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; 
                               &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;restore_attrs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_shared&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Skipping &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: snapshot is shared&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;continue&lt;/span&gt;

                &lt;span class="c1"&gt;# Delete snapshot
&lt;/span&gt;                &lt;span class="n"&gt;rds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete_db_snapshot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;DBSnapshotIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;deleted_count&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error deleting &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;statusCode&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Deleted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;deleted_count&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; snapshots&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Schedule: Run daily via EventBridge
Cost savings: Can save hundreds per month on old snapshots! 💰


DELETION COST &amp;amp; TIMING:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Deletion cost: FREE ✅
├─ No charge to delete snapshots
└─ Stop paying storage cost immediately

Timing:
├─ Simple snapshot: 10-30 seconds
├─ Snapshot in incremental chain: 2-10 minutes
├─ Large snapshot (&amp;gt;500 GB): 5-15 minutes
└─ You're billed until deletion completes

Storage billing:
├─ Snapshot created: Jan 1 at 10:00 AM
├─ Snapshot deleted: Jan 15 at 3:00 PM
├─ Billed for: 14.2 days of storage
└─ Prorated to the hour ✅

Example:
├─ 100 GB snapshot
├─ Existed for 14.2 days
├─ Cost: 100 GB × $0.095/GB/month × (14.2/30) = $4.50
└─ After deletion: $0/month ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Part 14: Exporting Snapshots to Amazon S3
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: I see an option to "Export to S3" in the console. What's that for? Why would I export a snapshot to S3?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Excellent question! This is a powerful but often overlooked feature. Let me explain:&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Snapshot Export to S3?
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              SNAPSHOT EXPORT TO S3 EXPLAINED                      │
└──────────────────────────────────────────────────────────────────┘

WHAT IT DOES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Export to S3 converts RDS snapshot → Apache Parquet files in S3

Normal RDS Snapshot:
├─ Format: RDS proprietary binary format
├─ Can only be restored to: RDS instance
├─ Cannot directly query data
└─ Cannot use with analytics tools

Exported to S3:
├─ Format: Apache Parquet (columnar, compressed)
├─ Can query with: Athena, Redshift Spectrum, EMR, Glue
├─ Can analyze with: QuickSight, Tableau, Python/Pandas
├─ Can archive long-term: S3 Glacier
└─ Can share: Just share S3 objects (not whole snapshot)


VISUAL REPRESENTATION:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Without Export:
┌─────────────┐
│ RDS Snapshot│  ─────────────┐
│  (Binary)   │               │
└─────────────┘               │
                              ├──&amp;gt; Only option: Restore to RDS
                              │    (takes 15-30 minutes)
                              │    (costs $200+/month for instance)
                              │
                              └──&amp;gt; To analyze data:
                                   1. Restore to RDS ❌
                                   2. Connect and query ❌
                                   3. Export results ❌
                                   Time: Hours, Cost: High


With Export to S3:
┌─────────────┐
│ RDS Snapshot│
│  (Binary)   │
└──────┬──────┘
       │
       ├──&amp;gt; Export to S3
       │
       ▼
┌──────────────────────────────────────┐
│  Amazon S3 Bucket                     │
│                                       │
│  s3://my-bucket/exports/              │
│  ├── database_name/                   │
│  │   ├── table1/                     │
│  │   │   ├── part-00001.parquet      │
│  │   │   └── part-00002.parquet      │
│  │   └── table2/                     │
│  │       └── part-00001.parquet      │
│  └── metadata.json                    │
└───────────────┬───────────────────────┘
                │
                ├──&amp;gt; Athena: Query with SQL ✅
                ├──&amp;gt; Redshift Spectrum: Analyze ✅
                ├──&amp;gt; Glue/EMR: Process ✅
                ├──&amp;gt; QuickSight: Visualize ✅
                └──&amp;gt; Archive to Glacier: $1/TB/month ✅


OUTPUT FORMAT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Example S3 structure after export:

s3://my-exports-bucket/prod-db-export-2024-01-15/
│
├── mydb/                          (database name)
│   │
│   ├── users/                     (table name)
│   │   ├── part-00000-*.parquet   (data files)
│   │   ├── part-00001-*.parquet
│   │   └── part-00002-*.parquet
│   │
│   ├── orders/
│   │   ├── part-00000-*.parquet
│   │   └── part-00001-*.parquet
│   │
│   └── products/
│       └── part-00000-*.parquet
│
└── export_info.json               (metadata)

Each Parquet file contains:
├─ Table schema
├─ Column data (compressed)
├─ Optimized for analytics
└─ Can be read by many tools


COMPRESSION EFFICIENCY:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original database: 100 GB
Export to S3 (Parquet): ~30-40 GB ✨

Why smaller?
├─ Parquet columnar compression
├─ No indexes (table data only)
├─ No transaction logs
└─ Optimized storage format

Cost comparison:
├─ Snapshot storage: 100 GB × $0.095 = $9.50/month
├─ S3 Standard: 35 GB × $0.023 = $0.81/month ✅
└─ S3 Glacier Deep: 35 GB × $0.00099 = $0.03/month ✅
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When and Why to Export to S3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              USE CASES FOR EXPORT TO S3                           │
└──────────────────────────────────────────────────────────────────┘

✅ USE CASE #1: Analytics &amp;amp; Business Intelligence
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario:
├─ Production RDS database
├─ Business team wants historical data analysis
├─ Don't want to impact production with heavy queries
└─ Need data from multiple points in time

Traditional approach (BAD):
1. Create read replica ($200/month)
2. Run analytics queries (impacts replica performance)
3. Can only query current data
4. Costs ongoing $200+/month

Export to S3 approach (GOOD):
1. Export weekly snapshot to S3 (30 minutes)
2. Query with Athena (pay per query)
3. Access historical data (multiple exports)
4. Cost: ~$1-5/month storage + minimal query costs ✅

Example Athena query:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Query exported data directly in S3&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;exported_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-exports-bucket/prod-db-export-2024-01-15/mydb/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Now query historical data&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Cost: $5 per TB scanned (typical query: &amp;lt;$0.10)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ USE CASE #2: Long-Term Archival &amp;amp; Compliance
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario:
├─ Must retain data for 7 years (compliance)
├─ RDS snapshots expensive for long-term storage
├─ Rarely need to access old data
└─ Need cost-effective solution

Snapshot retention (EXPENSIVE):
├─ 7 years of snapshots: 84 monthly snapshots
├─ Each snapshot: 100 GB
├─ With incremental: ~500 GB total
├─ Cost: 500 GB × $0.095 = $47.50/month
└─ 7 years: $47.50 × 84 = $3,990 total ❌

Export + Glacier Deep Archive (CHEAP):
├─ Export 84 monthly snapshots: 84 × 35 GB = 2,940 GB
├─ Store in Glacier Deep Archive: 2,940 GB × $0.00099
└─ Cost: $2.91/month × 84 = $244 total ✅

SAVINGS: $3,746 over 7 years (94% reduction)! 🎉

Lifecycle policy:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;S3 Bucket Lifecycle Rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Archive-Old-Exports"&lt;/span&gt;
  &lt;span class="na"&gt;Filter&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rds-exports/"&lt;/span&gt;
  &lt;span class="na"&gt;Transitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
      &lt;span class="na"&gt;StorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;GLACIER&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;Days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;180&lt;/span&gt;
      &lt;span class="na"&gt;StorageClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DEEP_ARCHIVE&lt;/span&gt;
  &lt;span class="na"&gt;Status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Enabled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost progression per export:
├─ Day 0-30: S3 Standard ($0.81/month)
├─ Day 31-180: Glacier ($0.14/month)
└─ Day 180+: Glacier Deep Archive ($0.03/month) ✅


✅ USE CASE #3: Data Migration &amp;amp; ETL
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario:
├─ Migrating from RDS MySQL to Redshift
├─ Need to transform data during migration
├─ Want to validate data before cutting over
└─ Can't afford production downtime

Process:
1. Export RDS snapshot to S3 (30 min)
2. Transform with AWS Glue or EMR
3. Load into Redshift
4. Validate and compare
5. Cutover when ready

Benefits:
├─ No impact on production RDS
├─ Can retry/iterate transformations
├─ Parallel processing of multiple tables
└─ Validate before go-live

Glue ETL example:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;

&lt;span class="c1"&gt;# Read from exported Parquet
&lt;/span&gt;&lt;span class="n"&gt;rds_export&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;format_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{},&lt;/span&gt;
    &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;paths&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-exports-bucket/prod-db-export/mydb/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform data
&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rds_export&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_mapping&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;int&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bigint&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;string&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decimal&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;decimal(18,2)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Write to Redshift
&lt;/span&gt;&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transformed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redshift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jdbc:redshift://...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dbtable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redshiftTmpDir&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://temp-bucket/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ USE CASE #4: Cross-Account/Cross-Organization Sharing
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario:
├─ Need to share data with partner organization
├─ Don't want to share entire database
├─ Only need specific tables
└─ Security/compliance concerns

Snapshot sharing (COMPLICATED):
├─ Share entire snapshot (all tables) ❌
├─ Includes sensitive data ❌
├─ Partner must restore to RDS ($$$) ❌
└─ Hard to control access

Export to S3 (BETTER):
├─ Export only needed tables ✅
├─ Grant S3 bucket access to partner account ✅
├─ Partner queries with Athena (no RDS needed) ✅
├─ Fine-grained access control ✅
└─ Revoke access anytime ✅

S3 bucket policy:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AllowPartnerAccess"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"AWS"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::PARTNER-ACCOUNT:root"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:GetObject"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"s3:ListBucket"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-exports-bucket/shared-data/*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:s3:::my-exports-bucket"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Partner queries your data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Partner's Athena query (no RDS needed!)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;partner_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://your-bucket/shared-data/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;partner_orders&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ USE CASE #5: Disaster Recovery Testing
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Scenario:
├─ Need to verify backup integrity
├─ Don't want to spin up full RDS instance
├─ Want quick validation of data
└─ Regular DR drills

Traditional DR test:
1. Restore snapshot to RDS (20 min)
2. Connect and run validation queries (1 hour)
3. Delete test instance
4. Cost: 2 hours × $0.29 = $0.58 per test
5. Time: 2-3 hours

Export to S3 validation:
1. Export snapshot to S3 (30 min, one-time)
2. Query with Athena (instant)
3. Run validation queries (10 min)
4. Cost: Query costs ~$0.01
5. Time: 10 minutes ✅

Validation queries:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check record counts&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'users'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_users&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_orders&lt;/span&gt;
&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="s1"&gt;'products'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_products&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check data integrity&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;unique_users&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;earliest_order&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;latest_order&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Verify foreign key relationships&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;orphaned_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;exported_orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;exported_users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ DON'T USE EXPORT FOR:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Real-time Data Access
   ├─ Export takes 30+ minutes
   ├─ Data is point-in-time
   └─ Use: Read replica or DMS instead

2. Restoring to RDS
   ├─ Exported data cannot be restored to RDS
   ├─ It's analytics format, not RDS format
   └─ Use: Regular snapshot restore

3. Transactional Queries
   ├─ S3/Athena not optimized for OLTP
   ├─ Higher latency than RDS
   └─ Use: RDS or Aurora for transactions

4. Frequently Updated Data
   ├─ Must re-export for each update
   ├─ Not cost-effective
   └─ Use: Aurora with Athena federation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How to Export Snapshot to S3
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              EXPORT PROCESS STEP-BY-STEP                          │
└──────────────────────────────────────────────────────────────────┘

PREREQUISITES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. S3 Bucket (must exist)
2. IAM Role with permissions
3. KMS Key (if snapshot is encrypted)
4. Snapshot to export

Start Export (AWS Console)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. Go to RDS → Snapshots
2. Select snapshot to export
3. Actions → Export to Amazon S3
4. Fill in details:
   ┌─────────────────────────────────────────────┐
   │  Export DB snapshot to Amazon S3             │
   ├─────────────────────────────────────────────┤
   │  Export identifier:                          │
   │  [prod-db-export-2024-01-15]                │
   │                                              │
   │  Data to be exported:                        │
   │  ○ All (entire snapshot)                    │
   │  ● Partial (select tables/schemas)          │
   │                                              │
   │  Select identifiers: [✓] users              │
   │                      [✓] orders             │
   │                      [ ] logs               │
   │                                              │
   │  S3 bucket: [my-rds-exports-bucket]         │
   │  S3 prefix: [exports/2024-01-15/]           │
   │                                              │
   │  IAM role: [RDSExportRole]                  │
   │                                              │
   │  KMS key: [aws/rds] (if encrypted)          │
   │                                              │
   │  [Cancel]  [Export to S3]                   │
   └─────────────────────────────────────────────┘

5. Click "Export to S3"


EXPORT TIMING &amp;amp; COSTS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Export time estimates:
├─ 10 GB snapshot: 10-15 minutes
├─ 100 GB snapshot: 30-60 minutes
├─ 500 GB snapshot: 2-4 hours
└─ 1 TB snapshot: 4-8 hours

Export costs:
├─ Export process: FREE ✅
├─ S3 storage: $0.023/GB/month (Standard)
├─ S3 PUT requests: $0.005 per 1,000 (negligible)
└─ Data transfer: FREE (same region) ✅

Example cost (100 GB database):
├─ Export: $0 (free)
├─ S3 storage: 35 GB × $0.023 = $0.81/month
├─ Athena queries: ~$0.01-0.10 per query
└─ Total: ~$1-2/month ✅

vs keeping snapshot:
└─ Snapshot: 100 GB × $0.095 = $9.50/month ❌
   Export saves 91%! 🎉
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Exported Data
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│              QUERYING EXPORTED DATA                               │
└──────────────────────────────────────────────────────────────────┘

Amazon Athena (Serverless SQL)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Step 1: Create Glue Database
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 2: Create External Tables
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Automatically detect schema from Parquet&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;last_login&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-rds-exports-bucket/exports/2024-01-15/mydb/users/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_amount&lt;/span&gt; &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-rds-exports-bucket/exports/2024-01-15/mydb/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 3: Query the Data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Simple queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Analytics queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_order_value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2023-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Join queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; 
    &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;lifetime_value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;rds_exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;lifetime_value&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost:
└─ $5 per TB scanned
   Typical query on 100 GB: $0.50


BEST PRACTICES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ DO:
├─ Export only needed tables (saves time and cost)
├─ Use S3 lifecycle policies for cost optimization
├─ Partition exports by date (year/month/day structure)
├─ Tag exports with metadata
├─ Monitor export progress
├─ Test restore/query procedures
└─ Document S3 bucket structure

❌ DON'T:
├─ Export unnecessarily large/sensitive tables
├─ Keep exports forever without lifecycle policy
├─ Export real-time changing data repeatedly
├─ Use for operational queries (use read replica)
└─ Forget to clean up old exports
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Final Summary
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Wow, this has been incredibly helpful! Can you give me one final summary?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Expert&lt;/strong&gt;: Of course! Here's everything in a nutshell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────────────┐
│                      FINAL SUMMARY                                │
└──────────────────────────────────────────────────────────────────┘

THE GOLDEN RULES:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1. AUTOMATED BACKUPS = Your Insurance Policy (INCREMENTAL)
   ├─ Always enabled in production
   ├─ For recovering from recent accidents
   ├─ Point-in-time recovery is your superpower
   └─ Incremental nature makes long retention affordable!

2. MANUAL SNAPSHOTS = Your Bookmarks (INCREMENTAL)
   ├─ Before major changes
   ├─ For long-term retention
   ├─ For disaster recovery across regions
   └─ EBS incremental tech = efficient storage!

3. TRANSACTION LOGS = Your Time Machine
   ├─ Enable precise recovery
   ├─ Captured automatically every 5 minutes
   └─ Only available with automated backups

4. INCREMENTAL BACKUPS = Your Cost Saver ✨
   ├─ 83-97% storage reduction vs full backups
   ├─ 80-90% faster backup windows
   ├─ Makes long retention affordable
   └─ Both automated backups AND snapshots use it!

5. TEST YOUR RESTORES
   └─ Backups are worthless if you can't restore


KEY INSIGHTS ABOUT INCREMENTAL:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Automated backups are incremental after first full backup
✅ Manual snapshots use EBS incremental technology
✅ Only changed blocks are stored, not entire database
✅ 7-day retention ≈ 118 GB, NOT 700 GB
✅ Storage cost: ~$1.71/month, NOT $57/month
✅ Backup windows: 2-5 minutes, NOT 30 minutes
✅ Cross-region transfers: 3-5 GB, NOT 100 GB
✅ Makes frequent backups economically viable


WHEN TO USE WHAT:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Recent Accidents (last few hours):
└─ Use: Automated Backup (PITR)
   Why: Precise second-by-second recovery
   How: Uses incremental backups + transaction logs

Major Changes (deployments, migrations):
└─ Use: Manual Snapshot
   Why: Clean rollback point
   How: Full snapshot first time, then incremental

Disaster Recovery (region failure):
└─ Use: Cross-region Manual Snapshot
   Why: Only option that works across regions
   How: Incremental copies keep cost down


COST REALITY (100 GB Database):
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

WITHOUT incremental (theoretical):
├─ 7-day auto backups: $57/month
├─ 12 weekly snapshots: $114/month
├─ 12 monthly snapshots: $114/month
└─ TOTAL: $285/month ❌

WITH incremental (actual):
├─ 7-day auto backups: $1.71/month ✅
├─ 12 weekly snapshots: $10.83/month ✅
├─ 12 monthly snapshots: $16.63/month ✅
└─ TOTAL: $29.17/month ✅

SAVINGS: $255.83/month (90% reduction)! 🎉


YOUR ACTION ITEMS:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

□ Enable automated backups (7-14 days) - it's affordable!
□ Understand storage = DB_size + (retention × daily_change)
□ Set up snapshot automation for deployments
□ Configure cross-region copies for DR (incremental transfer)
□ Monitor daily change rate for cost prediction
□ Create restore runbook (document PITR vs snapshot usage)
□ Set up monitoring and alerts
□ Schedule monthly restore tests
□ Implement snapshot lifecycle policy
□ Train team on incremental nature and when to use what
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Appendix: Additional Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  AWS Documentation Links
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_CommonTasks.BackupRestore.html" rel="noopener noreferrer"&gt;RDS Backup and Restore&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithAutomatedBackups.html" rel="noopener noreferrer"&gt;Working with Backups&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_CreateSnapshot.html" rel="noopener noreferrer"&gt;Creating DB Snapshots&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSSnapshots.html" rel="noopener noreferrer"&gt;How EBS Snapshots Work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Best Practice Guides
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;AWS Well-Architected Framework - Reliability Pillar&lt;/li&gt;
&lt;li&gt;AWS Disaster Recovery Whitepaper&lt;/li&gt;
&lt;li&gt;RDS Best Practices Guide&lt;/li&gt;
&lt;li&gt;EBS Snapshot Best Practices&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>database</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Guessing if Your AI Works: A Complete Guide to Evaluating and Monitoring on Bedrock</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Fri, 16 Jan 2026 00:59:24 +0000</pubDate>
      <link>https://forem.com/datatechbridge/stop-guessing-if-your-ai-works-a-complete-guide-to-evaluating-and-monitoring-on-bedrock-20b0</link>
      <guid>https://forem.com/datatechbridge/stop-guessing-if-your-ai-works-a-complete-guide-to-evaluating-and-monitoring-on-bedrock-20b0</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; – Three things matter: (1) Know your model works before deploying. (2) Stop it from saying dumb stuff. &lt;br&gt;
(3) Watch what happens in production. Spend 1 hour on this now, save yourself weeks of headaches later.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Model Evaluation
&lt;/h2&gt;

&lt;p&gt;Let me be honest – before you put any AI model into production, you need to know it actually works. Amazon Bedrock makes this easier with built-in evaluation tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which models can you evaluate?&lt;/strong&gt;&lt;br&gt;
Basically any model you use on Bedrock – whether it's the foundation models, your own customized versions, models from the marketplace, or even your fine-tuned versions. You can also evaluate specialized stuff like prompt routers or models with Provisioned Throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you evaluate?&lt;/strong&gt;&lt;br&gt;
You've got a few options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic evaluation&lt;/strong&gt; – Bedrock tests your model against pre-built test sets. Quick and hands-off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human review&lt;/strong&gt; – You or your team manually checks responses for quality. Takes longer but catches nuances automation misses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-as-judge&lt;/strong&gt; – Use another AI model to grade your model's responses. Surprisingly effective for subjective quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAG evaluation&lt;/strong&gt; – If you're using retrieval-augmented generation, this checks both the retrieval part and the generation part separately.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What scores do you get back?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bedrock gives you three main categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Robustness&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Toxicity&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Does it know the right facts? (RWK score)&lt;/td&gt;
&lt;td&gt;Does it stay consistent when things change? (Word error rate, F1 score)&lt;/td&gt;
&lt;td&gt;Does it say bad stuff? (Toxicity score)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is the response semantically similar to the right answer? (BERTScore)&lt;/td&gt;
&lt;td&gt;Can you trust it to work reliably? (Delta metrics)&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;How precise is it overall? (NLP-F1)&lt;/td&gt;
&lt;td&gt;Does it handle edge cases?&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;What tasks can you evaluate?&lt;/strong&gt;&lt;br&gt;
Pick what matches your use case – general responses, summaries, Q&amp;amp;A, or text classification.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After evaluation completes:&lt;/strong&gt;&lt;br&gt;
You get a report showing how your model scored. You can compare different versions and see what needs improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  II. Guardrails: Stop Your Model from Acting Stupid
&lt;/h2&gt;

&lt;p&gt;Think of guardrails as a filter that stops your model from saying things it shouldn't. It catches both bad input (nasty prompts) and bad output (model saying something harmful).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What can guardrails block?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Harmful content&lt;/strong&gt; – Hate speech, insults, sexual stuff, violence. You set how strict you want to be (strict = catch more, but might block okay stuff too).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Jailbreak attempts&lt;/strong&gt; – People trying tricks like "Do Anything Now" to make your model ignore its rules. Guardrails catch these.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sneaky attacks&lt;/strong&gt; – Like "Ignore what I said before and..." or people trying to trick your model into revealing its instructions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Topics you don't want to discuss&lt;/strong&gt; – Say you don't want your model giving investment advice or medical diagnoses. You can block those topics entirely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bad words&lt;/strong&gt; – Profanity or custom words your company doesn't want used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Private information&lt;/strong&gt; – Guardrails can mask or block things like email addresses, phone numbers, SSNs, credit cards.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fake information&lt;/strong&gt; – Check if the model is actually grounded in real facts or just making stuff up. Also verify the answer is relevant to what was asked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hallucinations&lt;/strong&gt; – Catch when your model sounds confident but is completely wrong.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How you use it:&lt;/strong&gt;&lt;br&gt;
Set your policies once, then every request gets checked automatically. No extra work needed.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Want practical examples?&lt;/strong&gt; Check out the AWS blog on &lt;a href="https://aws.amazon.com/blogs/machine-learning/build-responsible-ai-applications-with-amazon-bedrock-guardrails/" rel="noopener noreferrer"&gt;implementing Guardrails&lt;/a&gt; for step-by-step guidance on setting up policies and configurations.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  III. Responsible AI Framework
&lt;/h2&gt;

&lt;p&gt;Responsible AI is basically asking: "Is my AI system trustworthy and doing the right thing?" It's not just about avoiding bad outcomes – it's about building systems people can actually trust.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What does "responsible" mean?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fairness&lt;/strong&gt; – Your model doesn't treat people unfairly based on their background&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explainability&lt;/strong&gt; – People can understand why your model gave a certain answer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy &amp;amp; Security&lt;/strong&gt; – Personal data is protected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety&lt;/strong&gt; – No harmful outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Controllability&lt;/strong&gt; – Humans are still in charge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy&lt;/strong&gt; – The model is actually correct&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Governance&lt;/strong&gt; – Clear rules and accountability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt; – You're honest about what your model can and can't do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How do you actually do this?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bedrock's evaluation tools&lt;/strong&gt; – Test across all these dimensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SageMaker Clarify&lt;/strong&gt; – Check if your model is biased, explain why it made decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SageMaker Model Monitor&lt;/strong&gt; – Watch your model 24/7 and alert you if quality drops&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bring humans in&lt;/strong&gt; – Amazon Augmented AI lets humans review uncertain decisions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document everything&lt;/strong&gt; – Use Model Cards to write down what your model does, what it shouldn't do, and who it's meant for&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control access&lt;/strong&gt; – Use Role Manager to make sure only the right people can use or change your model&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Deep dive into security?&lt;/strong&gt; Read about &lt;a href="https://aws.amazon.com/blogs/machine-learning/safeguard-generative-ai-applications-with-amazon-bedrock-guardrails/" rel="noopener noreferrer"&gt;safeguarding your AI applications&lt;/a&gt; with real-world examples and best practices.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  IV. Model Monitoring
&lt;/h2&gt;

&lt;p&gt;Okay, so you've deployed your model. Now what? You need to watch it. Things break, performance drops, stuff happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 ways to monitor on AWS:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Invocation Logs&lt;/strong&gt; – Every time someone calls your model, log it. Who called it? What did they ask? What did it respond with? Super useful for debugging and compliance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;CloudWatch Metrics&lt;/strong&gt; – Real-time numbers on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many times your model got called&lt;/li&gt;
&lt;li&gt;How long it took to respond&lt;/li&gt;
&lt;li&gt;How many errors happened&lt;/li&gt;
&lt;li&gt;How many requests hit guardrails&lt;/li&gt;
&lt;li&gt;Token usage (helps you track costs)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;CloudTrail&lt;/strong&gt; – The audit log. Shows who accessed what, when they accessed it, and what they changed. Useful for "who broke what?" investigations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;X-Ray&lt;/strong&gt; – Traces requests through your whole system. Shows you where things are slow or where failures happen.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Logging&lt;/strong&gt; – Log whatever matters to your business. Your custom metrics, your business logic, whatever.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key numbers to watch:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Invocations&lt;/strong&gt; – How much is it being used?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; – How fast is it responding? (Slow = users getting frustrated)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client Errors&lt;/strong&gt; – Are people sending bad requests? (Could be a UX problem)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Server Errors&lt;/strong&gt; – Is the model/service broken?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throttles&lt;/strong&gt; – Are you hitting rate limits? (Time to upgrade or optimize)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token counts&lt;/strong&gt; – How much are you spending per request?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt;&lt;br&gt;
Set up alerts with Amazon EventBridge. Like: "Text me if error rate goes above 1%" or "Restart this job if it fails." This way you find problems before customers do.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Want better dashboards?&lt;/strong&gt; Learn how to &lt;a href="https://aws.amazon.com/blogs/machine-learning/improve-visibility-into-amazon-bedrock-usage-and-performance-with-amazon-cloudwatch/" rel="noopener noreferrer"&gt;improve visibility with CloudWatch&lt;/a&gt; and set up comprehensive monitoring from the start.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  V. Tokenizer: Know Your Costs Before They Happen
&lt;/h2&gt;

&lt;p&gt;Bedrock's tokenizer lets you see exactly how many tokens your prompts use before deployment. Why? You pay per token. A prompt you think is 100 tokens might actually be 1,000 – that's 10x the cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you use it for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test prompts to see token count (no surprises on your bill)&lt;/li&gt;
&lt;li&gt;Optimize expensive prompts to save money&lt;/li&gt;
&lt;li&gt;Estimate monthly costs upfront&lt;/li&gt;
&lt;li&gt;Compare which model is cheaper for your use case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How to use it:&lt;/strong&gt;&lt;br&gt;
Go to Bedrock Console → Text Playground. Paste your prompt. See token count instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt;&lt;br&gt;
Use the tokenizer during evaluation. You'll know both quality AND cost before going live.&lt;/p&gt;




&lt;h2&gt;
  
  
  Other Important Monitoring &amp;amp; Evaluation Things
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Cost Tracking:&lt;/strong&gt;&lt;br&gt;
Keep an eye on how many tokens you're using per request and multiply by price. Your CFO will thank you. If costs suddenly spike, something's wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Drift:&lt;/strong&gt;&lt;br&gt;
Your model's performance will slowly get worse over time as the world changes. Compare your current metrics to baseline metrics monthly. If accuracy dropped 5%, something's off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Feedback Loop:&lt;/strong&gt;&lt;br&gt;
Ask users to rate responses or report when the model got something wrong. This real-world data is gold – it tells you what actually matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A/B Testing:&lt;/strong&gt;&lt;br&gt;
Want to test a new model? Don't deploy to everyone. Send 10% of traffic to the new one, 90% to the old one. Compare results. If new one is better, slowly shift more traffic over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Response Time Patterns:&lt;/strong&gt;&lt;br&gt;
Watch for slowdowns. If your model suddenly takes 2x longer to respond, investigate. Could be overload, could be a problem with the backend.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Questions to Guide Your Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Model Evaluation (Pick 3 to start):
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What does your model actually need to be good at? (Is accuracy most important, or robustness, or low toxicity?)&lt;/li&gt;
&lt;li&gt;How good does it need to be before you'll risk putting it in production?&lt;/li&gt;
&lt;li&gt;How often will you re-evaluate? (Before every update? Once a week? Once a month?)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Advanced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you have test data ready, or should you start with Bedrock's built-in test sets?&lt;/li&gt;
&lt;li&gt;Should you have humans double-check the automated evaluation, or do you trust it?&lt;/li&gt;
&lt;li&gt;What metric would make you decide "nope, this model isn't ready yet"?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Guardrails (Pick 3 to start):
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What's the one type of harmful content you're most worried about?&lt;/li&gt;
&lt;li&gt;Are there specific topics your company shouldn't discuss? (Legal advice? Stock tips? Medical stuff?)&lt;/li&gt;
&lt;li&gt;Should guardrails be paranoid (block everything possibly problematic) or relaxed (only block obvious stuff)?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Advanced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you need to track what got blocked for compliance reasons?&lt;/li&gt;
&lt;li&gt;Should your guardrails protect against external jailbreaks, or also internal staff mistakes?&lt;/li&gt;
&lt;li&gt;Do you need to mask PII, or just block requests that contain it?&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Monitoring (Pick 3 to start):
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Critical:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How fast NEEDS your model to respond? If it's slower, is that a problem?&lt;/li&gt;
&lt;li&gt;What error rate is acceptable? 0.1%? 1%? 5%?&lt;/li&gt;
&lt;li&gt;Who should get alerted if something breaks? Your Slack channel? Your on-call engineer?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Advanced:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do you need to see metrics right now (real-time), or is daily/weekly good enough?&lt;/li&gt;
&lt;li&gt;How long do you need to keep logs for legal/compliance reasons?&lt;/li&gt;
&lt;li&gt;What would you actually DO if you got an alert? (Do you have a playbook?)&lt;/li&gt;
&lt;li&gt;Are costs spiraling out of control? Set a budget alert.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Responsible AI (Pick 1-2 to think about):
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Could your model treat some groups of people unfairly? (This is worth thinking about)&lt;/li&gt;
&lt;li&gt;Does your industry have compliance requirements? (Healthcare, finance, etc.?)&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/evaluation.html" rel="noopener noreferrer"&gt;Evaluate Performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html" rel="noopener noreferrer"&gt;Guardrails Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/monitoring.html" rel="noopener noreferrer"&gt;Monitoring Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>aws</category>
      <category>bedrock</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Apache Airflow for Production: Essential Concepts Every Developer Should Know</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Thu, 08 Jan 2026 06:01:57 +0000</pubDate>
      <link>https://forem.com/datatechbridge/apache-airflow-for-production-essential-concepts-every-developer-should-know-2n1j</link>
      <guid>https://forem.com/datatechbridge/apache-airflow-for-production-essential-concepts-every-developer-should-know-2n1j</guid>
      <description>&lt;p&gt;&lt;strong&gt;From local development to enterprise-scale orchestration&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📚 Table of Contents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PART 1: FUNDAMENTALS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 1: Introduction&lt;/li&gt;
&lt;li&gt;Section 2: Architecture&lt;/li&gt;
&lt;li&gt;Section 3: DAG &amp;amp; Building Blocks&lt;/li&gt;
&lt;li&gt;Section 4: TaskFlow API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PART 2: ADVANCED CONCEPTS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 5: Deferrable Operators&lt;/li&gt;
&lt;li&gt;Section 6: Dependencies &amp;amp; Relationships&lt;/li&gt;
&lt;li&gt;Section 7: Scheduling&lt;/li&gt;
&lt;li&gt;Section 8: Failure Handling&lt;/li&gt;
&lt;li&gt;Section 9: SLA Management&lt;/li&gt;
&lt;li&gt;Section 10: Catchup Behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PART 3: OPERATIONS &amp;amp; DEPLOYMENT&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 11: Configuration&lt;/li&gt;
&lt;li&gt;Section 12: Version Comparison&lt;/li&gt;
&lt;li&gt;Section 13: Best Practices&lt;/li&gt;
&lt;li&gt;Section 14: Connections &amp;amp; Secrets&lt;/li&gt;
&lt;li&gt;Summary &amp;amp; Recommendations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  SECTION 1: INTRODUCTION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Apache Airflow?
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code using Python to define Directed Acyclic Graphs (DAGs).&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic workflows&lt;/strong&gt;: Define complex pipelines in Python with full control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable&lt;/strong&gt;: From single-machine setups to distributed Kubernetes deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible&lt;/strong&gt;: Create custom operators, sensors, and hooks without modifying core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich monitoring&lt;/strong&gt;: Web UI with real-time pipeline visibility and control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible scheduling&lt;/strong&gt;: Time-based, event-driven, or custom business logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable&lt;/strong&gt;: Built-in retry logic, SLAs, and comprehensive failure handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community-driven&lt;/strong&gt;: Active ecosystem with hundreds of provider integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Who Uses Airflow?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data engineers building ETL/ELT pipelines&lt;/li&gt;
&lt;li&gt;ML engineers orchestrating training workflows&lt;/li&gt;
&lt;li&gt;DevOps teams managing infrastructure automation&lt;/li&gt;
&lt;li&gt;Analytics teams automating reports&lt;/li&gt;
&lt;li&gt;Any organization with complex scheduled workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 2: ARCHITECTURE
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Web Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides REST API and web UI (port 8080 by default)&lt;/li&gt;
&lt;li&gt;Allows manual DAG triggering, viewing logs, managing connections&lt;/li&gt;
&lt;li&gt;Stateless (multiple instances can be deployed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduler&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitors all DAGs and determines when tasks should execute&lt;/li&gt;
&lt;li&gt;Parses DAG files periodically (every 5 minutes by default)&lt;/li&gt;
&lt;li&gt;Handles task dependency resolution and sequencing&lt;/li&gt;
&lt;li&gt;Single instance per deployment (HA coming in Airflow 3.1+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Executor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determines HOW tasks are physically executed&lt;/li&gt;
&lt;li&gt;Options:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LocalExecutor&lt;/strong&gt;: Single machine (~10-50 parallel tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CeleryExecutor&lt;/strong&gt;: Distributed across worker nodes (100s of tasks)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KubernetesExecutor&lt;/strong&gt;: Each task runs in separate pod (elastic scaling)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metadata Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Central repository for DAG definitions, task states, variables, connections&lt;/li&gt;
&lt;li&gt;PostgreSQL 12+ or MySQL 8+ in production&lt;/li&gt;
&lt;li&gt;SQLite only for development&lt;/li&gt;
&lt;li&gt;Requires daily backups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Message Broker&lt;/strong&gt; (for distributed setups)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis or RabbitMQ&lt;/li&gt;
&lt;li&gt;Queues tasks from Scheduler to Workers&lt;/li&gt;
&lt;li&gt;Required only for CeleryExecutor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Triggerer Service&lt;/strong&gt; (Airflow 2.2+)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages all deferrable/async tasks&lt;/li&gt;
&lt;li&gt;Single lightweight async service per deployment&lt;/li&gt;
&lt;li&gt;Can handle 1000+ concurrent deferred tasks&lt;/li&gt;
&lt;li&gt;Fires triggers when conditions are met&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 3: DAG &amp;amp; BUILDING BLOCKS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a DAG?
&lt;/h3&gt;

&lt;p&gt;A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directed&lt;/strong&gt;: Tasks execute in specific direction (A → B → C)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic&lt;/strong&gt;: No circular dependencies allowed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph&lt;/strong&gt;: Visual representation of tasks and relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key DAG Properties
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dag_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'daily_etl'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;start_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;First execution date&lt;/td&gt;
&lt;td&gt;&lt;code&gt;datetime(2024, 1, 1)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;schedule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When to run&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'@daily'&lt;/code&gt;, &lt;code&gt;CronTimetable&lt;/code&gt;, &lt;code&gt;Dataset&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_runs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent runs allowed&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1-5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;default_args&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task defaults&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;retries&lt;/code&gt;, &lt;code&gt;timeout&lt;/code&gt;, &lt;code&gt;owner&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What DAG does&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"Daily customer ETL"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tags&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Categorization&lt;/td&gt;
&lt;td&gt;&lt;code&gt;['etl', 'daily']&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Operators (Task Types)
&lt;/h3&gt;

&lt;p&gt;Operators are classes that execute actual work. Each is a reusable task type.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PythonOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execute Python function&lt;/td&gt;
&lt;td&gt;Data processing, API calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BashOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Run bash commands&lt;/td&gt;
&lt;td&gt;Scripts, system commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execute SQL query&lt;/td&gt;
&lt;td&gt;Database transformations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EmailOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Send emails&lt;/td&gt;
&lt;td&gt;Notifications, reports&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3Operator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 file operations&lt;/td&gt;
&lt;td&gt;Data movement, management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Make HTTP requests&lt;/td&gt;
&lt;td&gt;External service integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BranchOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conditional branching&lt;/td&gt;
&lt;td&gt;Different execution paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DummyOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Placeholder/no-op&lt;/td&gt;
&lt;td&gt;Flow control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Sensors (Waiting for Conditions)
&lt;/h3&gt;

&lt;p&gt;Sensors are operators that &lt;strong&gt;wait for something to happen&lt;/strong&gt;. They poll a condition until it becomes true, blocking further execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Sensors&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sensor&lt;/th&gt;
&lt;th&gt;Waits For&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FileSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File creation on disk&lt;/td&gt;
&lt;td&gt;Wait for uploaded data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3KeySensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 object existence&lt;/td&gt;
&lt;td&gt;Wait for data in S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL query condition&lt;/td&gt;
&lt;td&gt;Wait for record count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific time of day&lt;/td&gt;
&lt;td&gt;Wait until 2 PM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External DAG task completion&lt;/td&gt;
&lt;td&gt;Cross-DAG dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP endpoint response&lt;/td&gt;
&lt;td&gt;Wait for API ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeDeltaSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time duration elapsed&lt;/td&gt;
&lt;td&gt;Wait 30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sensor Modes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Poke Mode&lt;/strong&gt; (default): Worker thread continuously checks → holds worker slot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reschedule Mode&lt;/strong&gt;: Worker released, task rescheduled when ready&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Mode&lt;/strong&gt;: Batched checking for efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Performance Note&lt;/strong&gt;: Many idle sensors waste resources. Use &lt;strong&gt;deferrable operators&lt;/strong&gt; (with &lt;code&gt;Async&lt;/code&gt; suffix) for waits longer than 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hooks (Connection Abstractions)
&lt;/h3&gt;

&lt;p&gt;Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook&lt;/th&gt;
&lt;th&gt;Connects To&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostgresHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MySqlHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;Database operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3Hook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;File operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST APIs&lt;/td&gt;
&lt;td&gt;HTTP requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SlackHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack&lt;/td&gt;
&lt;td&gt;Send messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSHHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;td&gt;Execute commands&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralize credential management&lt;/li&gt;
&lt;li&gt;Reusable across multiple tasks&lt;/li&gt;
&lt;li&gt;No credentials in DAG code&lt;/li&gt;
&lt;li&gt;Connection pooling support&lt;/li&gt;
&lt;li&gt;Error handling abstraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Tasks
&lt;/h3&gt;

&lt;p&gt;A task is an instance of an operator within a DAG—the smallest unit of work.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;task_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'extract_customers'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retries&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retry attempts on failure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2-3&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_delay&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wait between retries&lt;/td&gt;
&lt;td&gt;&lt;code&gt;timedelta(minutes=5)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum execution time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;timedelta(hours=1)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrency control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'default_pool'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;priority_weight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execution priority&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;1-100&lt;/code&gt; (higher = urgent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trigger_rule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When to execute&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;'all_success'&lt;/code&gt; (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  XCom (Cross-Communication)
&lt;/h3&gt;

&lt;p&gt;XCom allows tasks to pass data to each other (automatically serialized to JSON).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Push
&lt;/span&gt;&lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Pull
&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;task_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xcom_pull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upstream&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: Intended for metadata only, not large datasets (causes database bloat).&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 4: TASKFLOW API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is TaskFlow API?
&lt;/h3&gt;

&lt;p&gt;TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators instead of verbose operator instantiation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison: Traditional vs TaskFlow
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Syntax&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verbose operator creation&lt;/td&gt;
&lt;td&gt;Clean function decorators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Passing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual XCom push/pull&lt;/td&gt;
&lt;td&gt;Implicit via return values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explicit &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; operators&lt;/td&gt;
&lt;td&gt;Implicit via parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Excellent (autocomplete)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Length&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More lines, repetitive&lt;/td&gt;
&lt;td&gt;Fewer lines, concise&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Decorators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a class="mentioned-user" href="https://dev.to/dag"&gt;@dag&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decorates function that contains tasks&lt;/li&gt;
&lt;li&gt;Function body defines workflow&lt;/li&gt;
&lt;li&gt;Returns DAG instance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decorates function that becomes a task&lt;/li&gt;
&lt;li&gt;Function is task logic&lt;/li&gt;
&lt;li&gt;Return value becomes XCom value&lt;/li&gt;
&lt;li&gt;Parameters are upstream dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task_group&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups related tasks together&lt;/li&gt;
&lt;li&gt;Creates visual subgraph in UI&lt;/li&gt;
&lt;li&gt;Simplifies large DAGs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task.expand&lt;/strong&gt; (Dynamic Mapping - Airflow 2.3+)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates task for each item in list&lt;/li&gt;
&lt;li&gt;Reduces repetitive task creation&lt;/li&gt;
&lt;li&gt;Dynamic fan-out pattern&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Use TaskFlow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New DAGs (recommended)&lt;/li&gt;
&lt;li&gt;Complex dependencies&lt;/li&gt;
&lt;li&gt;Python-savvy teams&lt;/li&gt;
&lt;li&gt;Modern deployments (2.0+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legacy Airflow 1.10 requirement&lt;/li&gt;
&lt;li&gt;Using operators not available as tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 5: DEFERRABLE OPERATORS &amp;amp; TRIGGERS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem: Traditional Sensors Waste Resources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traditional Approach (Blocking)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensor holds worker slot while waiting&lt;/li&gt;
&lt;li&gt;Worker CPU: ~5-10% even during idle&lt;/li&gt;
&lt;li&gt;Memory: ~50MB per idle task&lt;/li&gt;
&lt;li&gt;Can handle: ~50-100 concurrent waits&lt;/li&gt;
&lt;li&gt;Issue: Massive waste for long waits (hours/days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: 1000 tasks waiting 1 hour = 1000 worker slots held continuously!&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: Deferrable Operators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deferrable Approach (Non-blocking)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task releases worker slot immediately after deferring&lt;/li&gt;
&lt;li&gt;Minimal CPU usage: ~0.1% during wait&lt;/li&gt;
&lt;li&gt;Memory: ~1MB per deferred task&lt;/li&gt;
&lt;li&gt;Can handle: 1000+ concurrent waits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: ~95% resource reduction&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How Deferrable Operators Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;4-Phase Execution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execution Phase&lt;/strong&gt;: Task runs initial logic, then defers to trigger&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker slot &lt;strong&gt;RELEASED&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deferred Phase&lt;/strong&gt;: Trigger monitors event asynchronously&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Triggerer service handles&lt;/li&gt;
&lt;li&gt;Very lightweight&lt;/li&gt;
&lt;li&gt;No worker slot needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Completion Phase&lt;/strong&gt;: Event occurs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger detects condition met&lt;/li&gt;
&lt;li&gt;Returns event data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resume Phase&lt;/strong&gt;: Task resumes with new worker slot&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task completes with trigger result&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Available Deferrable Operators
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;Defers For&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific time&lt;/td&gt;
&lt;td&gt;Waiting until business hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3KeySensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 object&lt;/td&gt;
&lt;td&gt;Long-running file uploads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External DAG&lt;/td&gt;
&lt;td&gt;Cross-DAG long waits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL condition&lt;/td&gt;
&lt;td&gt;Database state changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP response&lt;/td&gt;
&lt;td&gt;External service availability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  When to Use Deferrable
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait duration &amp;gt; 5 minutes&lt;/li&gt;
&lt;li&gt;Many concurrent waits&lt;/li&gt;
&lt;li&gt;Resource-constrained environment&lt;/li&gt;
&lt;li&gt;Can tolerate event latency (&amp;lt;1 second)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait &amp;lt; 5 minutes (overhead not worth it)&lt;/li&gt;
&lt;li&gt;Real-time detection critical (sub-second)&lt;/li&gt;
&lt;li&gt;Simple local operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance Impact
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1000 Tasks Waiting 1 Hour&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional&lt;/td&gt;
&lt;td&gt;1000 worker slots × 1 hour = 1000 slot-hours wasted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deferrable&lt;/td&gt;
&lt;td&gt;0 worker slots + minimal CPU = 95% savings&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 6: DEPENDENCIES &amp;amp; RELATIONSHIPS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Task Dependencies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Linear Chain&lt;/strong&gt;: A → B → C → D&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential execution&lt;/li&gt;
&lt;li&gt;Best for: ETL pipelines with stages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fan-out / Fan-in&lt;/strong&gt;: One task triggers many, then merge&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      ├─ Process A ─┐
Extract ┤            ├─ Merge
      └─ Process B ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Parallel processing&lt;/li&gt;
&lt;li&gt;Best for: Multiple independent operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Branching&lt;/strong&gt;: Conditional execution&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      ┌─ Success path
Branch ┤
      └─ Error path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Different paths based on condition&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trigger Rules
&lt;/h3&gt;

&lt;p&gt;Determine WHEN downstream tasks execute based on UPSTREAM status.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Executes When&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All upstream succeed&lt;/td&gt;
&lt;td&gt;Default, sequential flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All upstream fail&lt;/td&gt;
&lt;td&gt;Cleanup after failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_done&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All complete (any status)&lt;/td&gt;
&lt;td&gt;Final reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;one_success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At least one succeeds&lt;/td&gt;
&lt;td&gt;Alternative paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;one_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At least one fails&lt;/td&gt;
&lt;td&gt;Error alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No failures (skips OK)&lt;/td&gt;
&lt;td&gt;⭐ Robust pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none_failed_or_skipped&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No failures/skips&lt;/td&gt;
&lt;td&gt;Strict validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Use &lt;code&gt;none_failed&lt;/code&gt; for fault-tolerant pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  DAG-to-DAG Dependencies
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Poll for completion&lt;/td&gt;
&lt;td&gt;Different schedules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TriggerDagRunOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actively trigger&lt;/td&gt;
&lt;td&gt;Orchestrate workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event-driven (2.3+)&lt;/td&gt;
&lt;td&gt;Data pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Datasets (Airflow 2.3+)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producer DAG updates dataset&lt;/li&gt;
&lt;li&gt;Consumer DAG triggered automatically&lt;/li&gt;
&lt;li&gt;Real-time responsiveness: &amp;lt;1 second&lt;/li&gt;
&lt;li&gt;Perfect for data-driven workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL DAG extracts and updates &lt;code&gt;orders_dataset&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Analytics DAG triggered automatically when dataset updates&lt;/li&gt;
&lt;li&gt;No waiting, real-time processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 7: SCHEDULING
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three Scheduling Methods
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-based&lt;/strong&gt;: Specific times/intervals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven&lt;/strong&gt;: When data arrives/updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual&lt;/strong&gt;: User-triggered via UI&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Time-Based: Cron Expressions
&lt;/h3&gt;

&lt;p&gt;Legacy method using cron format: &lt;code&gt;minute hour day month day_of_week&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@hourly&lt;/code&gt;: Every hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@daily&lt;/code&gt;: Every day at midnight&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@weekly&lt;/code&gt;: Every Sunday at midnight&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@monthly&lt;/code&gt;: First day of month&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0 9 * * MON-FRI&lt;/code&gt;: 9 AM on weekdays&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*/15 * * * *&lt;/code&gt;: Every 15 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;: Simple time-only scheduling, no business logic&lt;/p&gt;

&lt;h3&gt;
  
  
  Time-Based: Timetable API (Modern)
&lt;/h3&gt;

&lt;p&gt;Object-oriented scheduling (Airflow 2.2+) with unlimited flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in Timetables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CronTimetable&lt;/code&gt;: Cron-based (replaces schedule_interval)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DailyTimetable&lt;/code&gt;: Daily runs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IntervalTimetable&lt;/code&gt;: Fixed intervals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Timetables Allow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business day scheduling (Mon-Fri)&lt;/li&gt;
&lt;li&gt;Holiday-aware schedules&lt;/li&gt;
&lt;li&gt;Timezone-specific runs&lt;/li&gt;
&lt;li&gt;Complex business logic&lt;/li&gt;
&lt;li&gt;End-of-month runs&lt;/li&gt;
&lt;li&gt;Business hour runs (9 AM - 5 PM)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Event-Driven: Datasets (Airflow 2.3+)
&lt;/h3&gt;

&lt;p&gt;Enable data-driven workflows instead of time-driven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it Works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Producer task completes and updates dataset&lt;/li&gt;
&lt;li&gt;Scheduler detects dataset update&lt;/li&gt;
&lt;li&gt;Consumer DAG triggered immediately&lt;/li&gt;
&lt;li&gt;No scheduled waiting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time responsiveness (&amp;lt;1 second)&lt;/li&gt;
&lt;li&gt;Resource efficient (no idle waits)&lt;/li&gt;
&lt;li&gt;Data-quality focused&lt;/li&gt;
&lt;li&gt;Perfect for data pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hybrid Scheduling
&lt;/h3&gt;

&lt;p&gt;Combine time-based and event-driven:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG runs on schedule &lt;strong&gt;OR&lt;/strong&gt; when data arrives&lt;/li&gt;
&lt;li&gt;Whichever comes first&lt;/li&gt;
&lt;li&gt;Best of both worlds&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduling Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TIME-BASED?
├─ YES → Complex logic?
│        ├─ YES → Custom Timetable
│        └─ NO → Cron Expression
│
└─ NO → Event-driven?
         ├─ YES → Use Datasets
         └─ NO → Manual trigger or reconsider
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 8: FAILURE HANDLING
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retry Mechanisms
&lt;/h3&gt;

&lt;p&gt;Allow tasks to recover from transient failures (network timeout, API rate limit, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;retries&lt;/code&gt;: Number of retry attempts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry_delay&lt;/code&gt;: Wait time between retries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry_exponential_backoff&lt;/code&gt;: Exponentially increasing delays&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_retry_delay&lt;/code&gt;: Cap on maximum wait&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Retry Sequence&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attempt 1: Fail → Wait 1 minute
Attempt 2: Fail → Wait 2 minutes (1 × 2^1)
Attempt 3: Fail → Wait 4 minutes (1 × 2^2)
Attempt 4: Fail → Wait 8 minutes (1 × 2^3)
Attempt 5: Fail → Wait 16 minutes (1 × 2^4)
Attempt 6: Fail → Wait 1 hour max (capped)
Attempt 7: Fail → Task FAILED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: 2-3 retries with exponential backoff for most tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Callbacks
&lt;/h3&gt;

&lt;p&gt;Execute custom logic when task state changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;on_failure_callback&lt;/code&gt;: When task fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_retry_callback&lt;/code&gt;: When task retries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_success_callback&lt;/code&gt;: When task succeeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common Uses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send notifications (Slack, email, PagerDuty)&lt;/li&gt;
&lt;li&gt;Log to external monitoring systems&lt;/li&gt;
&lt;li&gt;Trigger incident management&lt;/li&gt;
&lt;li&gt;Update dashboards&lt;/li&gt;
&lt;li&gt;Cleanup resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Error Handling Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Idempotency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make tasks safe to retry&lt;/li&gt;
&lt;li&gt;Delete-then-insert instead of insert&lt;/li&gt;
&lt;li&gt;Same result regardless of retry count&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: Circuit Breaker&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop retrying after N attempts&lt;/li&gt;
&lt;li&gt;Raise critical exceptions&lt;/li&gt;
&lt;li&gt;Alert operations team&lt;/li&gt;
&lt;li&gt;Prevent cascading failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: Graceful Degradation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continue with partial data&lt;/li&gt;
&lt;li&gt;Log warnings&lt;/li&gt;
&lt;li&gt;Don't fail entire pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 4: Skip on Condition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raise &lt;code&gt;AirflowSkipException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Skip task without marking as failure&lt;/li&gt;
&lt;li&gt;Allow downstream to continue&lt;/li&gt;
&lt;li&gt;Best for conditional workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 9: SLA MANAGEMENT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is SLA?
&lt;/h3&gt;

&lt;p&gt;SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If SLA is missed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task marked as "SLA Missed"&lt;/li&gt;
&lt;li&gt;Notifications triggered&lt;/li&gt;
&lt;li&gt;Callbacks executed&lt;/li&gt;
&lt;li&gt;Logged for auditing/metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting SLAs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At DAG Level&lt;/strong&gt; (all tasks inherit):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;default_args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sla&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;At Task Level&lt;/strong&gt; (specific to individual task):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;critical_task&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sla&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Per Task Overrides&lt;/strong&gt;: Task SLA overrides DAG default&lt;/p&gt;

&lt;h3&gt;
  
  
  SLA Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLA compliance rate (%)&lt;/li&gt;
&lt;li&gt;Average overdue duration&lt;/li&gt;
&lt;li&gt;Most violated tasks&lt;/li&gt;
&lt;li&gt;Impact on downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alerts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send notifications when SLA at risk (80%)&lt;/li&gt;
&lt;li&gt;Escalate on missed SLA&lt;/li&gt;
&lt;li&gt;Create incidents for critical SLAs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track compliance trends&lt;/li&gt;
&lt;li&gt;Identify problematic tasks&lt;/li&gt;
&lt;li&gt;Alert history&lt;/li&gt;
&lt;li&gt;Remediation tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SLA Best Practices
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Practice&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Set realistic SLAs&lt;/td&gt;
&lt;td&gt;Avoid false positives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different tiers&lt;/td&gt;
&lt;td&gt;Critical &amp;lt; Normal &amp;lt; Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review quarterly&lt;/td&gt;
&lt;td&gt;Account for growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document rationale&lt;/td&gt;
&lt;td&gt;Team alignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escalation policy&lt;/td&gt;
&lt;td&gt;Proper incident response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Callbacks&lt;/td&gt;
&lt;td&gt;Automated notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor compliance&lt;/td&gt;
&lt;td&gt;Identify trends&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 10: CATCHUP BEHAVIOR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Catchup?
&lt;/h3&gt;

&lt;p&gt;Catchup is automatic backfill of missed DAG runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start date:     2024-01-01
Today:          2024-05-15
Schedule:       @daily (every day)
Catchup:        True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Catchup Settings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catchup=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Backfill all missed runs&lt;/td&gt;
&lt;td&gt;Historical data loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catchup=False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No backfill, start fresh&lt;/td&gt;
&lt;td&gt;New pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Catchup Risks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Potential Issues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource exhaustion (CPU, memory, database)&lt;/li&gt;
&lt;li&gt;Network bandwidth spike&lt;/li&gt;
&lt;li&gt;Third-party API rate limits exceeded&lt;/li&gt;
&lt;li&gt;Database connection pool exhausted&lt;/li&gt;
&lt;li&gt;Worker node crashes&lt;/li&gt;
&lt;li&gt;Scheduler paralysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Safe Catchup Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Use Separate Backfill DAG&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main DAG: &lt;code&gt;catchup=False&lt;/code&gt; (normal operations)&lt;/li&gt;
&lt;li&gt;Backfill DAG: &lt;code&gt;catchup=True&lt;/code&gt;, &lt;code&gt;end_date&lt;/code&gt; set (historical only)&lt;/li&gt;
&lt;li&gt;Run backfill during maintenance window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Control Parallelism&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_active_runs&lt;/code&gt;: Limit concurrent DAG runs (1-3)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_active_tasks_per_dag&lt;/code&gt;: Limit concurrent tasks (16-32)&lt;/li&gt;
&lt;li&gt;Prevents resource exhaustion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Staged Approach&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test with subset of dates first&lt;/li&gt;
&lt;li&gt;Monitor resource usage&lt;/li&gt;
&lt;li&gt;Gradually increase pace&lt;/li&gt;
&lt;li&gt;Have rollback plan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify data completeness&lt;/li&gt;
&lt;li&gt;Check for duplicates&lt;/li&gt;
&lt;li&gt;Validate transformations&lt;/li&gt;
&lt;li&gt;Compare with expected results&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Catchup Checklist
&lt;/h3&gt;

&lt;p&gt;Before enabling catchup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;☐ Database capacity available&lt;/li&gt;
&lt;li&gt;☐ Worker resources adequate&lt;/li&gt;
&lt;li&gt;☐ Source data exists for all periods&lt;/li&gt;
&lt;li&gt;☐ Data schema hasn't changed&lt;/li&gt;
&lt;li&gt;☐ No API rate limits exceeded&lt;/li&gt;
&lt;li&gt;☐ Network bandwidth available&lt;/li&gt;
&lt;li&gt;☐ Stakeholders informed&lt;/li&gt;
&lt;li&gt;☐ Rollback procedure ready&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 11: CONFIGURATION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuration Hierarchy
&lt;/h3&gt;

&lt;p&gt;Airflow reads configuration in this order (first found wins):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Environment Variables&lt;/strong&gt; (&lt;code&gt;AIRFLOW___&lt;/code&gt; prefix)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;airflow.cfg&lt;/strong&gt; file (main config)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default configuration&lt;/strong&gt; (hardcoded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command-line arguments&lt;/strong&gt; (runtime)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Critical Settings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Production Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sql_alchemy_conn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metadata database&lt;/td&gt;
&lt;td&gt;PostgreSQL connection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task execution&lt;/td&gt;
&lt;td&gt;CeleryExecutor or KubernetesExecutor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;parallelism&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max tasks globally&lt;/td&gt;
&lt;td&gt;32-128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_runs_per_dag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent DAG runs&lt;/td&gt;
&lt;td&gt;1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_tasks_per_dag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent tasks&lt;/td&gt;
&lt;td&gt;16-32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;remote_logging&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Centralized logs&lt;/td&gt;
&lt;td&gt;True (to S3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;email_backend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Email provider&lt;/td&gt;
&lt;td&gt;SMTP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Database Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Production Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL 12+ (recommended) or MySQL 8+&lt;/li&gt;
&lt;li&gt;Dedicated database server (not shared)&lt;/li&gt;
&lt;li&gt;Daily backups automated&lt;/li&gt;
&lt;li&gt;Connection pooling enabled (pool_size=5, max_overflow=10)&lt;/li&gt;
&lt;li&gt;Monitoring for slow queries&lt;/li&gt;
&lt;li&gt;Regular maintenance (VACUUM, ANALYZE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Not SQLite&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not concurrent-access safe&lt;/li&gt;
&lt;li&gt;Locks on writes&lt;/li&gt;
&lt;li&gt;Single machine only&lt;/li&gt;
&lt;li&gt;Not suitable for production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Executor Configuration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Executor&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LocalExecutor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[local_executor]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CeleryExecutor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[celery]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Distributed, high volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KubernetesExecutor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;[kubernetes]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cloud-native&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 12: VERSION COMPARISON
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Version Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.10.x (Deprecated, EOL 2021)
    ↓ Major Redesign
2.0 (March 2021) - TaskFlow API introduced
    ↓
2.1-2.2 (Improvements &amp;amp; Deferrable Operators)
    ↓
2.3 (Nov 2022) - Datasets event-driven scheduling
    ↓
2.4-2.5 (Current stable, production-ready)
    ↓
3.0 (Latest) - Performance focus, async improvements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Feature Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;1.10&lt;/th&gt;
&lt;th&gt;2.x&lt;/th&gt;
&lt;th&gt;3.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TaskFlow API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deferrable Operators&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasets (Event-driven)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;3-5x faster&lt;/td&gt;
&lt;td&gt;5-6x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Async/Await&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Performance Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1.10&lt;/th&gt;
&lt;th&gt;2.5&lt;/th&gt;
&lt;th&gt;3.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DAG parsing (1000 DAGs)&lt;/td&gt;
&lt;td&gt;~120s&lt;/td&gt;
&lt;td&gt;~45s&lt;/td&gt;
&lt;td&gt;~15s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task scheduling latency&lt;/td&gt;
&lt;td&gt;~5-10s&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;&amp;lt;1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory (idle scheduler)&lt;/td&gt;
&lt;td&gt;~500MB&lt;/td&gt;
&lt;td&gt;~300MB&lt;/td&gt;
&lt;td&gt;~150MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web UI response&lt;/td&gt;
&lt;td&gt;~800ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Migration Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From 1.10 to 2.x&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.6+ required&lt;/li&gt;
&lt;li&gt;Import paths changed (reorganized)&lt;/li&gt;
&lt;li&gt;Operators moved to providers package&lt;/li&gt;
&lt;li&gt;Test thoroughly in staging&lt;/li&gt;
&lt;li&gt;Gradual production rollout (1-2 weeks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;From 2.x to 3.0&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most DAGs work without changes&lt;/li&gt;
&lt;li&gt;Drop support for Python &amp;lt; 3.10&lt;/li&gt;
&lt;li&gt;Update async patterns&lt;/li&gt;
&lt;li&gt;Leverage new performance features&lt;/li&gt;
&lt;li&gt;Simpler migration (mostly backward compatible)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Version Selection
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;New projects&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5+ or 3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Production (stable)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.4.x or 2.5.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Enterprise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance-critical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning/Development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.5+ or 3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Not Use 1.10?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;EOL (End of Life) as of 2021&lt;/li&gt;
&lt;li&gt;No deferrable operators&lt;/li&gt;
&lt;li&gt;No TaskFlow API&lt;/li&gt;
&lt;li&gt;No datasets&lt;/li&gt;
&lt;li&gt;Slow DAG parsing&lt;/li&gt;
&lt;li&gt;No modern Python support&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Action&lt;/strong&gt;: If you're on 1.10, plan migration to 2.5+ now.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 13: BEST PRACTICES
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DAG Design DO's ✅
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;TaskFlow API&lt;/strong&gt; for modern code&lt;/li&gt;
&lt;li&gt;Keep tasks &lt;strong&gt;small and focused&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Name tasks &lt;strong&gt;descriptively&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document&lt;/strong&gt; complex logic&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;type hints&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;comprehensive logging&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handle errors&lt;/strong&gt; gracefully&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test DAGs&lt;/strong&gt; before production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DAG Design DON'Ts ❌
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Put &lt;strong&gt;heavy processing&lt;/strong&gt; in DAG definition&lt;/li&gt;
&lt;li&gt;Create &lt;strong&gt;1000s of tasks&lt;/strong&gt; per DAG&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Hardcode credentials&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Ignore data validation&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;DAG-level database&lt;/strong&gt; connections&lt;/li&gt;
&lt;li&gt;Forget about &lt;strong&gt;idempotency&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skip error handling&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Assume tasks &lt;strong&gt;always succeed&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Task Design DO's ✅
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Make tasks &lt;strong&gt;idempotent&lt;/strong&gt; (safe to retry)&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;appropriate timeouts&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;retries with backoff&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;pools&lt;/strong&gt; for resource control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skip tasks&lt;/strong&gt; on errors (when appropriate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log important&lt;/strong&gt; events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clean up&lt;/strong&gt; resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Task Design DON'Ts ❌
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Assume tasks &lt;strong&gt;always succeed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;long operations&lt;/strong&gt; without timeout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hold resources&lt;/strong&gt; unnecessarily&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store large data&lt;/strong&gt; in XCom&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;blocking operations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignore upstream&lt;/strong&gt; failures&lt;/li&gt;
&lt;li&gt;Forget &lt;strong&gt;cleanup&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduling DO's ✅
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Match &lt;strong&gt;frequency&lt;/strong&gt; to data needs&lt;/li&gt;
&lt;li&gt;Account for &lt;strong&gt;upstream dependencies&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Plan for failures&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test schedule&lt;/strong&gt; changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document&lt;/strong&gt; rationale&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitor adherence&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduling DON'Ts ❌
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Schedule &lt;strong&gt;too frequently&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignore resource&lt;/strong&gt; constraints&lt;/li&gt;
&lt;li&gt;Forget about &lt;strong&gt;catchup&lt;/strong&gt; implications&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;ambiguous&lt;/strong&gt; expressions&lt;/li&gt;
&lt;li&gt;Create &lt;strong&gt;tight couplings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Skip monitoring&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment DO's ✅
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;external secrets backend&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Store DAG code in &lt;strong&gt;Git&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Implement &lt;strong&gt;monitoring&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Set up &lt;strong&gt;disaster recovery&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Rotate &lt;strong&gt;credentials&lt;/strong&gt; quarterly&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;separate environments&lt;/strong&gt; (dev, staging, prod)&lt;/li&gt;
&lt;li&gt;Document &lt;strong&gt;runbooks&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Deployment DON'Ts ❌
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Store &lt;strong&gt;credentials&lt;/strong&gt; in DAGs&lt;/li&gt;
&lt;li&gt;Deploy without &lt;strong&gt;monitoring&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Skip &lt;strong&gt;backups&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use same &lt;strong&gt;credentials everywhere&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Forget &lt;strong&gt;version control&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Deploy &lt;strong&gt;without testing&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SECTION 14: CONNECTIONS &amp;amp; SECRETS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are Connections?
&lt;/h3&gt;

&lt;p&gt;Connections store authentication credentials to external systems centrally. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these securely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection ID&lt;/strong&gt;: Unique identifier (e.g., &lt;code&gt;postgres_default&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection Type&lt;/strong&gt;: System type (postgres, aws, http, mysql, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: Server address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Login&lt;/strong&gt;: Username&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: Secret credential&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port&lt;/strong&gt;: Service port&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Database or default path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra&lt;/strong&gt;: JSON for additional parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Use Connections?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credentials not in DAG code (won't be accidentally committed to git)&lt;/li&gt;
&lt;li&gt;Centralized credential management&lt;/li&gt;
&lt;li&gt;Easy credential rotation&lt;/li&gt;
&lt;li&gt;Different credentials per environment (dev, staging, prod)&lt;/li&gt;
&lt;li&gt;Audit trail of access&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Connection Types
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Used For&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;postgres&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL databases&lt;/td&gt;
&lt;td&gt;Analytics warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mysql&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL databases&lt;/td&gt;
&lt;td&gt;Transactional data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake warehouse&lt;/td&gt;
&lt;td&gt;Cloud data platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;aws&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS services (S3, Redshift, Lambda)&lt;/td&gt;
&lt;td&gt;Cloud infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud services&lt;/td&gt;
&lt;td&gt;BigQuery, Cloud Storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;http&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST APIs&lt;/td&gt;
&lt;td&gt;External services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ssh&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remote server access&lt;/td&gt;
&lt;td&gt;Bastion hosts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack messaging&lt;/td&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sftp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SFTP file transfer&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  How to Add Connections
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method 1: Web UI (Easiest)&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Airflow UI → Admin → Connections → Create Connection&lt;/li&gt;
&lt;li&gt;Fill form with credentials&lt;/li&gt;
&lt;li&gt;Click Save&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Method 2: Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_CONN_POSTGRES_DEFAULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:password@localhost:5432/mydb
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_CONN_AWS_DEFAULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aws://ACCESS_KEY:SECRET_KEY@
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Method 3: Airflow CLI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow connections add postgres_prod &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-type&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-host&lt;/span&gt; localhost &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-login&lt;/span&gt; user &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-password&lt;/span&gt; password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Connections in DAGs
&lt;/h3&gt;

&lt;p&gt;Connections are accessed through &lt;strong&gt;Hooks&lt;/strong&gt; (connection abstraction layer):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.postgres_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Hooks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encapsulate connection logic&lt;/li&gt;
&lt;li&gt;Reusable across tasks&lt;/li&gt;
&lt;li&gt;Handle connection pooling&lt;/li&gt;
&lt;li&gt;Error handling built-in&lt;/li&gt;
&lt;li&gt;No credentials in task code&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What are Variables/Secrets?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Variables&lt;/strong&gt; store non-sensitive configuration values.&lt;br&gt;
&lt;strong&gt;Secrets&lt;/strong&gt; store sensitive data (marked as secret in UI).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Variables&lt;/th&gt;
&lt;th&gt;Secrets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration settings&lt;/td&gt;
&lt;td&gt;Passwords, API keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visible in UI&lt;/td&gt;
&lt;td&gt;Masked/hidden in UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metadata database&lt;/td&gt;
&lt;td&gt;External backend (Vault, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;batch_size, email&lt;/td&gt;
&lt;td&gt;api_key, db_password&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  How to Add Variables/Secrets
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Method 1: Web UI&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow UI → Admin → Variables → Create Variable&lt;/li&gt;
&lt;li&gt;Enter Key and Value&lt;/li&gt;
&lt;li&gt;Check "Is Secret" for sensitive data&lt;/li&gt;
&lt;li&gt;Click Save&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Method 2: Environment Variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_VAR_BATCH_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_VAR_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Method 3: Airflow CLI&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow variables &lt;span class="nb"&gt;set &lt;/span&gt;batch_size 1000
airflow variables &lt;span class="nb"&gt;set &lt;/span&gt;api_key secret123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Variables in DAGs
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;

&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;is_production&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# With JSON parsing
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deserialize_json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Secrets Backend: Advanced Security
&lt;/h3&gt;

&lt;p&gt;For production, use external secrets backends instead of storing in Airflow database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Security Level&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata DB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;AWS deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HashiCorp Vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Secret Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;GCP deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Key Vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Azure deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure Airflow to use external backend&lt;/li&gt;
&lt;li&gt;Airflow queries backend when credential needed&lt;/li&gt;
&lt;li&gt;Secrets never stored in database&lt;/li&gt;
&lt;li&gt;Automatic rotation supported&lt;/li&gt;
&lt;li&gt;Audit logging in backend&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security Best Practices
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;DO ✅&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use external secrets backend in production&lt;/li&gt;
&lt;li&gt;Rotate credentials quarterly&lt;/li&gt;
&lt;li&gt;Use IAM roles instead of hardcoded keys&lt;/li&gt;
&lt;li&gt;Encrypt connections in transit (SSL/TLS)&lt;/li&gt;
&lt;li&gt;Encrypt secrets at rest (database encryption)&lt;/li&gt;
&lt;li&gt;Limit who can see connections (RBAC)&lt;/li&gt;
&lt;li&gt;Use short-lived credentials (temporary tokens)&lt;/li&gt;
&lt;li&gt;Audit access to secrets&lt;/li&gt;
&lt;li&gt;Never commit credentials to git&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DON'T ❌&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hardcode credentials in DAG code&lt;/li&gt;
&lt;li&gt;Store passwords in comments or docstrings&lt;/li&gt;
&lt;li&gt;Share credentials via Slack/Email&lt;/li&gt;
&lt;li&gt;Use default passwords&lt;/li&gt;
&lt;li&gt;Store credentials in config files in git&lt;/li&gt;
&lt;li&gt;Forget to revoke old credentials&lt;/li&gt;
&lt;li&gt;Use same credentials everywhere&lt;/li&gt;
&lt;li&gt;Log credential values&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Testing Connections
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Via Web UI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to Admin → Connections&lt;/li&gt;
&lt;li&gt;Click on connection&lt;/li&gt;
&lt;li&gt;Click "Test Connection"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Via CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow connections &lt;span class="nb"&gt;test &lt;/span&gt;postgres_default
airflow connections &lt;span class="nb"&gt;test &lt;/span&gt;aws_default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.postgres_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;

&lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_conn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Will raise error if connection fails
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  SUMMARY &amp;amp; RECOMMENDATIONS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For New Projects Starting Today
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose Airflow 2.5+ or 3.0&lt;/strong&gt; (not 1.10)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use TaskFlow API&lt;/strong&gt; (modern, clean code)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Timetable API&lt;/strong&gt; for scheduling (not cron strings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Deferrable operators&lt;/strong&gt; for long waits (&amp;gt;5 min)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for scale&lt;/strong&gt; from day one (don't start with LocalExecutor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up monitoring&lt;/strong&gt; immediately (not later)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use event-driven scheduling&lt;/strong&gt; (Datasets) where possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store credentials externally&lt;/strong&gt; (Secrets Manager, not hardcoded)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Teams Upgrading from 1.10
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plan 2-3 months&lt;/strong&gt; for migration (not quick)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test thoroughly&lt;/strong&gt; in staging environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep 1.10 backup&lt;/strong&gt; for emergency rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update DAGs incrementally&lt;/strong&gt; (not all at once)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor closely&lt;/strong&gt; first month post-upgrade&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document breaking changes&lt;/strong&gt; for team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate milestones&lt;/strong&gt; (keep team motivated)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Large Enterprise Deployments
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Kubernetes executor&lt;/strong&gt; (scalability, isolation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement comprehensive monitoring&lt;/strong&gt; (critical for troubleshooting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up disaster recovery&lt;/strong&gt; (business continuity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular chaos engineering tests&lt;/strong&gt; (find weak points)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning quarterly&lt;/strong&gt; (predict growth)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization ongoing&lt;/strong&gt; (never "done")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security audits regular&lt;/strong&gt; (compliance, risk)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Data Pipeline Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use event-driven scheduling&lt;/strong&gt; (Datasets, not time-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement data quality checks&lt;/strong&gt; (validate early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor SLAs closely&lt;/strong&gt; (data freshness critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement idempotent tasks&lt;/strong&gt; (safe retries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan catchup carefully&lt;/strong&gt; (historical data loads)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control everything&lt;/strong&gt; (DAGs, configurations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use separate environments&lt;/strong&gt; (dev, staging, prod)&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  KEY TAKEAWAYS
&lt;/h2&gt;

&lt;p&gt;✅ &lt;strong&gt;Fundamentals&lt;/strong&gt;: DAGs, operators, sensors, and TaskFlow API are the foundation&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Scheduling&lt;/strong&gt;: Choose appropriate method (time-based, event-driven, or hybrid)&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Reliability&lt;/strong&gt;: Retries, error handling, SLAs, and monitoring ensure success&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Scale&lt;/strong&gt;: Deferrable operators and proper architecture enable massive workloads&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Security&lt;/strong&gt;: Centralized credential management with external backends&lt;/p&gt;

&lt;p&gt;✅ &lt;strong&gt;Versions&lt;/strong&gt;: Upgrade from 1.10 to 2.5+ now (3.0 for greenfield projects)&lt;/p&gt;




&lt;h2&gt;
  
  
  SUCCESS FORMULA
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Treat DAGs as code&lt;/strong&gt; → Version control, code review, testing&lt;br&gt;
&lt;strong&gt;Monitor continuously&lt;/strong&gt; → You can't fix what you don't see&lt;br&gt;
&lt;strong&gt;Document thoroughly&lt;/strong&gt; → Future you and teammates will thank you&lt;br&gt;
&lt;strong&gt;Automate responses&lt;/strong&gt; → Reduce manual incident response&lt;br&gt;
&lt;strong&gt;Iterate and improve&lt;/strong&gt; → Perfect is enemy of done&lt;/p&gt;




&lt;h2&gt;
  
  
  Airflow is Perfect For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ Scheduled batch processing (daily, hourly, etc.)&lt;/li&gt;
&lt;li&gt;✅ Multi-stage ETL/ELT workflows&lt;/li&gt;
&lt;li&gt;✅ Cross-system orchestration&lt;/li&gt;
&lt;li&gt;✅ Data quality validation&lt;/li&gt;
&lt;li&gt;✅ DevOps automation&lt;/li&gt;
&lt;li&gt;✅ ML pipeline orchestration&lt;/li&gt;
&lt;li&gt;✅ Report generation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Airflow is NOT For
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;❌ Real-time streaming (use Kafka, Spark Streaming)&lt;/li&gt;
&lt;li&gt;❌ Replace data warehouses (use Snowflake, BigQuery)&lt;/li&gt;
&lt;li&gt;❌ Sub-second latency requirements&lt;/li&gt;
&lt;li&gt;❌ Complex transactional systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Wisdom
&lt;/h2&gt;

&lt;p&gt;Success with Airflow depends on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical setup&lt;/strong&gt;: Infrastructure, configuration, security&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational practices&lt;/strong&gt;: Code discipline, documentation, monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team culture&lt;/strong&gt;: Knowledge sharing, continuous learning, collaboration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Airflow's strength lies in its &lt;strong&gt;flexibility&lt;/strong&gt;, &lt;strong&gt;active community&lt;/strong&gt;, and &lt;strong&gt;rich ecosystem&lt;/strong&gt;. Whether you're building simple daily pipelines or complex multi-stage workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;[End of Tutorial]&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>dataengineering</category>
      <category>developer</category>
    </item>
    <item>
      <title>Apache Airflow: Complete Guide for Basic to Advanced Developers</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Thu, 08 Jan 2026 03:28:50 +0000</pubDate>
      <link>https://forem.com/datatechbridge/apache-airflow-complete-guide-for-intermediate-to-advanced-developers-263p</link>
      <guid>https://forem.com/datatechbridge/apache-airflow-complete-guide-for-intermediate-to-advanced-developers-263p</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Beginning of Your Airflow Adventure&lt;/strong&gt; 🚀☸️&lt;br&gt;
&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  📚 Table of Contents
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;PART 1: FUNDAMENTALS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 1: Introduction&lt;/li&gt;
&lt;li&gt;Section 2: Architecture&lt;/li&gt;
&lt;li&gt;Section 3: DAG &amp;amp; Building Blocks&lt;/li&gt;
&lt;li&gt;Section 4: Plugins&lt;/li&gt;
&lt;li&gt;Section 5: TaskFlow API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PART 2: ADVANCED CONCEPTS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 6: Deferrable Operators&lt;/li&gt;
&lt;li&gt;Section 7: Dependencies &amp;amp; Relationships&lt;/li&gt;
&lt;li&gt;Section 8: Scheduling&lt;/li&gt;
&lt;li&gt;Section 9: Failure Handling&lt;/li&gt;
&lt;li&gt;Section 10: SLA Management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;PART 3: OPERATIONS &amp;amp; DEPLOYMENT&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 11: Catchup Behavior&lt;/li&gt;
&lt;li&gt;Section 12: Configuration&lt;/li&gt;
&lt;li&gt;Section 13: AWS Integration&lt;/li&gt;
&lt;li&gt;Section 14: Version Comparison&lt;/li&gt;
&lt;li&gt;Section 15: Best Practices&lt;/li&gt;
&lt;li&gt;Section 16: Monitoring&lt;/li&gt;
&lt;li&gt;Section 17: CONNECTIONS AND SECRETS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;FINAL SECTIONS&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Summary &amp;amp; Checklist&lt;/li&gt;
&lt;li&gt;Recommendations&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  SECTION 1: INTRODUCTION
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is Apache Airflow?
&lt;/h3&gt;

&lt;p&gt;Apache Airflow is a workflow orchestration platform that allows you to define, schedule, and monitor complex data pipelines as code. Instead of using UI-based workflow builders, Airflow uses Python to define Directed Acyclic Graphs (DAGs) representing your workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic workflows&lt;/strong&gt;: Define complex pipelines in Python with full control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalable&lt;/strong&gt;: From single-machine setups to distributed Kubernetes deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible&lt;/strong&gt;: Create custom operators, sensors, and hooks without modifying core&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rich monitoring&lt;/strong&gt;: Web UI with real-time pipeline visibility and control&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible scheduling&lt;/strong&gt;: Time-based, event-driven, or custom business logic schedules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable&lt;/strong&gt;: Built-in retry logic, SLAs, and comprehensive failure handling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community-driven&lt;/strong&gt;: Active ecosystem with hundreds of provider integrations&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why Airflow Over Alternatives?
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Alternative&lt;/th&gt;
&lt;th&gt;Why Airflow is Better&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cron scripts&lt;/td&gt;
&lt;td&gt;Better reliability, monitoring, dependency management, version control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shell scripts&lt;/td&gt;
&lt;td&gt;Testable, debuggable, proper error handling, team collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Commercial tools&lt;/td&gt;
&lt;td&gt;Open source, customizable, no vendor lock-in, self-hosted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud-native tools&lt;/td&gt;
&lt;td&gt;Language agnostic, can self-host, avoid provider lock-in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Who Uses Airflow?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Data engineers building ETL/ELT pipelines&lt;/li&gt;
&lt;li&gt;ML engineers orchestrating training workflows&lt;/li&gt;
&lt;li&gt;DevOps teams managing infrastructure automation&lt;/li&gt;
&lt;li&gt;Analytics teams automating reports&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Any organization with complex scheduled workflows&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  - ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 2: ARCHITECTURE
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Core Components Overview
&lt;/h3&gt;

&lt;p&gt;Apache Airflow consists of several interconnected services working together:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Web Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides REST API and web UI&lt;/li&gt;
&lt;li&gt;Runs on port 8080 by default&lt;/li&gt;
&lt;li&gt;Allows manual DAG triggering, viewing logs, and managing connections&lt;/li&gt;
&lt;li&gt;Multiple instances can be deployed for high availability&lt;/li&gt;
&lt;li&gt;Stateless (can run on different machines)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduler&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitors all DAGs and determines when tasks should execute&lt;/li&gt;
&lt;li&gt;Parses DAG files periodically (every 5 minutes by default)&lt;/li&gt;
&lt;li&gt;Handles task dependency resolution and sequencing&lt;/li&gt;
&lt;li&gt;Manages retry logic and failure callbacks&lt;/li&gt;
&lt;li&gt;Single instance per deployment (HA scheduler coming in Airflow 3.1+)&lt;/li&gt;
&lt;li&gt;Continuously runs in background&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Executor&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Determines HOW tasks are physically executed&lt;/li&gt;
&lt;li&gt;Different executors for different deployment models:

&lt;ul&gt;
&lt;li&gt;LocalExecutor: Single machine, limited parallelism (~10-50 tasks)&lt;/li&gt;
&lt;li&gt;SequentialExecutor: One task at a time (testing only)&lt;/li&gt;
&lt;li&gt;CeleryExecutor: Distributed across worker nodes (100s of tasks)&lt;/li&gt;
&lt;li&gt;KubernetesExecutor: Each task runs in separate pod (elastic scaling)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metadata Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Central repository for all Airflow state&lt;/li&gt;
&lt;li&gt;Stores: DAG definitions, task instance states, variables, connections, logs metadata&lt;/li&gt;
&lt;li&gt;Should be PostgreSQL 12+ or MySQL 8+ in production&lt;/li&gt;
&lt;li&gt;SQLite only for development (not concurrent access)&lt;/li&gt;
&lt;li&gt;Requires daily backups and monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Message Broker&lt;/strong&gt; (for distributed setups)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis or RabbitMQ&lt;/li&gt;
&lt;li&gt;Queues tasks from Scheduler to Workers&lt;/li&gt;
&lt;li&gt;Enables horizontal scaling&lt;/li&gt;
&lt;li&gt;Required only for CeleryExecutor&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Triggerer Service&lt;/strong&gt; (Airflow 2.2+)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages all deferrable/async tasks&lt;/li&gt;
&lt;li&gt;Single lightweight async service per deployment&lt;/li&gt;
&lt;li&gt;Can handle 1000+ concurrent deferred tasks&lt;/li&gt;
&lt;li&gt;Monitors events efficiently&lt;/li&gt;
&lt;li&gt;Fires triggers when conditions met&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Execution Flow Diagram
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. DAG File Loaded
   ↓
2. Scheduler Parses
   ↓
3. Check Task Ready?
   ├─ Dependencies met?
   ├─ Start date passed?
   ├─ Not already running?
   ↓
4. Task State: SCHEDULED
   ↓
5. Queue to Message Broker
   ↓
6. Worker Picks Up Task
   ↓
7. Execution Phase
   ├─ Run operator code
   ├─ OR Defer to Trigger
   ↓
8. Log Output
   ↓
9. Record Result
   ├─ Success → COMPLETED
   ├─ Failure → FAILED
   ├─ Skipped → SKIPPED
   ↓
10. Execute Callback (if configured)
    ↓
11. Mark downstream ready
    ↓
12. Update Dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Architecture Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Single Machine Setup&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One server with all components&lt;/li&gt;
&lt;li&gt;LocalExecutor&lt;/li&gt;
&lt;li&gt;SQLite database (acceptable for development)&lt;/li&gt;
&lt;li&gt;Best for: Development, testing, small workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Distributed Setup&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Separate scheduler machine&lt;/li&gt;
&lt;li&gt;Multiple worker machines&lt;/li&gt;
&lt;li&gt;CeleryExecutor&lt;/li&gt;
&lt;li&gt;PostgreSQL database&lt;/li&gt;
&lt;li&gt;Redis message broker&lt;/li&gt;
&lt;li&gt;Best for: Production, high volume, 100+ daily DAGs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes Native&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduler pod&lt;/li&gt;
&lt;li&gt;Web server pod (multiple replicas)&lt;/li&gt;
&lt;li&gt;KubernetesExecutor&lt;/li&gt;
&lt;li&gt;Managed PostgreSQL (Cloud SQL, RDS)&lt;/li&gt;
&lt;li&gt;Dynamic pod creation per task&lt;/li&gt;
&lt;li&gt;Best for: Cloud-native, elastic scaling, containerized workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 3: DAG &amp;amp; BUILDING BLOCKS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is a DAG?
&lt;/h3&gt;

&lt;p&gt;A DAG (Directed Acyclic Graph) is the fundamental unit in Airflow. It represents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directed&lt;/strong&gt;: Tasks execute in specific direction (flow from A → B → C)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic&lt;/strong&gt;: No circular dependencies allowed (prevents infinite loops)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graph&lt;/strong&gt;: Visual representation of tasks and relationships&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable&lt;/strong&gt;: Start date, schedule, retries, timeouts, SLAs, etc.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Key DAG Properties
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Property&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dag_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier&lt;/td&gt;
&lt;td&gt;'daily_etl'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;start_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;First execution date&lt;/td&gt;
&lt;td&gt;datetime(2024,1,1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;schedule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When to run&lt;/td&gt;
&lt;td&gt;'@daily', CronTimetable, Dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catchup&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Backfill missed runs&lt;/td&gt;
&lt;td&gt;True/False&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_runs&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent runs allowed&lt;/td&gt;
&lt;td&gt;1-5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;default_args&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task defaults&lt;/td&gt;
&lt;td&gt;retries, timeout, owner&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;description&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;What DAG does&lt;/td&gt;
&lt;td&gt;"Daily customer ETL"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tags&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Categorization&lt;/td&gt;
&lt;td&gt;['etl', 'daily']&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Operators (Task Types)
&lt;/h3&gt;

&lt;p&gt;Operators are classes that execute actual work. Each operator is a reusable task that does something specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Built-in Operators&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Common Parameters&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PythonOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes Python function&lt;/td&gt;
&lt;td&gt;python_callable, op_kwargs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BashOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs bash commands&lt;/td&gt;
&lt;td&gt;bash_command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Executes SQL query&lt;/td&gt;
&lt;td&gt;sql, conn_id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EmailOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sends emails&lt;/td&gt;
&lt;td&gt;to, subject, html_content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3Operator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 file operations&lt;/td&gt;
&lt;td&gt;bucket, key, source/dest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Makes HTTP requests&lt;/td&gt;
&lt;td&gt;endpoint, method, data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SlackOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sends Slack messages&lt;/td&gt;
&lt;td&gt;text, channel, token&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BranchOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Conditional branching&lt;/td&gt;
&lt;td&gt;python_callable returning task_id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DummyOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Placeholder/no-op&lt;/td&gt;
&lt;td&gt;Used for flow control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;When to Use Each&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PythonOperator&lt;/strong&gt;: Data processing, API calls, complex logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BashOperator&lt;/strong&gt;: Run scripts, system commands, existing tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SqlOperator&lt;/strong&gt;: Database transformations, queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EmailOperator&lt;/strong&gt;: Notifications, reports&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;S3Operator&lt;/strong&gt;: Data movement, file management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HttpOperator&lt;/strong&gt;: External service integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BranchOperator&lt;/strong&gt;: Conditional workflows&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Sensors (Waiting for Conditions)
&lt;/h3&gt;

&lt;p&gt;Sensors are operators that wait for something to happen. They poll a condition until it becomes true, blocking further execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Sensors&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sensor&lt;/th&gt;
&lt;th&gt;Waits For&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FileSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File creation on disk&lt;/td&gt;
&lt;td&gt;Wait for uploaded data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3KeySensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 object existence&lt;/td&gt;
&lt;td&gt;Wait for data in S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL query condition&lt;/td&gt;
&lt;td&gt;Wait for record count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific time of day&lt;/td&gt;
&lt;td&gt;Wait until 2 PM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External DAG task&lt;/td&gt;
&lt;td&gt;Cross-DAG dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP endpoint response&lt;/td&gt;
&lt;td&gt;Wait for API ready&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeDeltaSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time duration elapsed&lt;/td&gt;
&lt;td&gt;Wait 30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sensor Behavior&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Poke Mode&lt;/strong&gt; (default): Worker thread continuously checks condition, holds worker slot&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reschedule Mode&lt;/strong&gt;: Worker released, task rescheduled when ready&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smart Mode&lt;/strong&gt;: Multiple sensors batched for efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Performance Consideration&lt;/strong&gt;: Many idle sensors waste resources. Use deferrable operators for long waits.&lt;/p&gt;
&lt;h3&gt;
  
  
  Hooks (Connection Abstractions)
&lt;/h3&gt;

&lt;p&gt;Hooks encapsulate connection and authentication logic to external systems. They're the reusable layer between operators and external services.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Hooks&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hook&lt;/th&gt;
&lt;th&gt;Connects To&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PostgresHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL&lt;/td&gt;
&lt;td&gt;Query execution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MySqlHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL&lt;/td&gt;
&lt;td&gt;Database operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3Hook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS S3&lt;/td&gt;
&lt;td&gt;File operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST APIs&lt;/td&gt;
&lt;td&gt;HTTP requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SlackHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack&lt;/td&gt;
&lt;td&gt;Send messages&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SSHHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;td&gt;Execute commands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DockerHook&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Docker daemon&lt;/td&gt;
&lt;td&gt;Container operations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Hooks&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralize credential management (stored in Airflow connections)&lt;/li&gt;
&lt;li&gt;Reusable across multiple tasks&lt;/li&gt;
&lt;li&gt;No credentials in DAG code&lt;/li&gt;
&lt;li&gt;Easier testing (mock connections)&lt;/li&gt;
&lt;li&gt;Connection pooling support&lt;/li&gt;
&lt;li&gt;Error handling abstraction&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Tasks
&lt;/h3&gt;

&lt;p&gt;A task is an instance of an operator within a DAG. It's the smallest unit of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;task_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unique identifier&lt;/td&gt;
&lt;td&gt;'extract_customers'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retries&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retry attempts on failure&lt;/td&gt;
&lt;td&gt;2-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;retry_delay&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Wait between retries&lt;/td&gt;
&lt;td&gt;timedelta(minutes=5)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Maximum execution time&lt;/td&gt;
&lt;td&gt;timedelta(hours=1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pool&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrency control&lt;/td&gt;
&lt;td&gt;'default_pool'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;priority_weight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Execution priority&lt;/td&gt;
&lt;td&gt;1-100 (higher = more urgent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;trigger_rule&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;When to execute&lt;/td&gt;
&lt;td&gt;'all_success' (default)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;depends_on_past&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Depend on previous run&lt;/td&gt;
&lt;td&gt;False (usually)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  XCom (Cross-Communication)
&lt;/h3&gt;

&lt;p&gt;XCom (cross-communication) allows tasks to pass data to each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How XCom Works&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tasks push values: &lt;code&gt;task_instance.xcom_push(key='count', value=1000)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Tasks pull values: &lt;code&gt;task_instance.xcom_pull(task_ids='upstream', key='count')&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Automatically stored in metadata database&lt;/li&gt;
&lt;li&gt;Serialized to JSON by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;XCom Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database-backed (not for large data)&lt;/li&gt;
&lt;li&gt;Intended for metadata, not big data&lt;/li&gt;
&lt;li&gt;Consider external storage for large datasets&lt;/li&gt;
&lt;li&gt;Can cause database bloat if overused&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 4: PLUGINS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What are Plugins?
&lt;/h3&gt;

&lt;p&gt;Plugins allow you to extend Airflow with custom components without modifying core code. They're organized Python modules that register custom functionality.&lt;/p&gt;
&lt;h3&gt;
  
  
  Types of Plugins
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Plugin Type&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;When to Create&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Operators&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New task types&lt;/td&gt;
&lt;td&gt;Business-specific operations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Sensors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New wait conditions&lt;/td&gt;
&lt;td&gt;Proprietary systems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Hooks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New connections&lt;/td&gt;
&lt;td&gt;Proprietary APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Executors&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New execution patterns&lt;/td&gt;
&lt;td&gt;Specialized hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Custom Macros&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New template variables&lt;/td&gt;
&lt;td&gt;Common business calculations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI Extensions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dashboard widgets&lt;/td&gt;
&lt;td&gt;Custom monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Plugin Organization
&lt;/h3&gt;

&lt;p&gt;Plugins live in the &lt;code&gt;plugins/&lt;/code&gt; directory with this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plugins/
├── operators/
│   └── custom_operator.py
├── sensors/
│   └── custom_sensor.py
├── hooks/
│   └── custom_hook.py
├── macros/
│   └── custom_macros.py
└── airflow_plugins.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When to Create Plugins
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CREATE plugins for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Integrating proprietary/internal systems&lt;/li&gt;
&lt;li&gt;Reusable patterns across organization&lt;/li&gt;
&lt;li&gt;Complex business logic encapsulation&lt;/li&gt;
&lt;li&gt;Specialized monitoring/alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DON'T create plugins for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-off tasks (use standard operators)&lt;/li&gt;
&lt;li&gt;Simple data transformations (use PythonOperator)&lt;/li&gt;
&lt;li&gt;Ad-hoc scripts (use BashOperator)&lt;/li&gt;
&lt;li&gt;If built-in operator already exists&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Plugin Development
&lt;/h3&gt;

&lt;p&gt;Custom operators extend BaseOperator and override &lt;code&gt;execute()&lt;/code&gt; method.&lt;br&gt;
Custom sensors extend BaseSensorOperator and override &lt;code&gt;poke()&lt;/code&gt; method.&lt;br&gt;
Custom hooks extend BaseHook with connection logic.&lt;/p&gt;

&lt;p&gt;Plugins are loaded automatically from &lt;code&gt;plugins/&lt;/code&gt; directory on scheduler startup.&lt;/p&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 5: TASKFLOW API
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What is TaskFlow API?
&lt;/h3&gt;

&lt;p&gt;TaskFlow API (Airflow 2.0+) is the modern, Pythonic way to write DAGs using decorators. It simplifies DAG creation by using functions instead of operator instantiation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Comparison: Traditional vs TaskFlow
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Traditional&lt;/th&gt;
&lt;th&gt;TaskFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Syntax&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Verbose operator creation&lt;/td&gt;
&lt;td&gt;Clean function decorators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Passing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Manual XCom push/pull&lt;/td&gt;
&lt;td&gt;Implicit via return values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dependencies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Explicit &amp;gt;&amp;gt; operators&lt;/td&gt;
&lt;td&gt;Implicit via parameters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Code Length&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;More lines, repetitive&lt;/td&gt;
&lt;td&gt;Fewer lines, concise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Type Hints&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;td&gt;Strongly recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning Curve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Easier for Python devs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IDE Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Excellent (autocomplete)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Key Decorators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;a class="mentioned-user" href="https://dev.to/dag"&gt;@dag&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decorates function that contains tasks&lt;/li&gt;
&lt;li&gt;Function body defines workflow&lt;/li&gt;
&lt;li&gt;Returns DAG instance&lt;/li&gt;
&lt;li&gt;Replaces &lt;code&gt;DAG()&lt;/code&gt; class creation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decorates function that becomes a task&lt;/li&gt;
&lt;li&gt;Function is task logic&lt;/li&gt;
&lt;li&gt;Return value becomes XCom value&lt;/li&gt;
&lt;li&gt;Parameters are upstream dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task_group&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Groups related tasks together&lt;/li&gt;
&lt;li&gt;Creates visual subgraph in UI&lt;/li&gt;
&lt;li&gt;Simplifies large DAGs&lt;/li&gt;
&lt;li&gt;Can be nested&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;@task.expand&lt;/strong&gt; (Dynamic Mapping)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates task for each item in list&lt;/li&gt;
&lt;li&gt;Reduces repetitive task creation&lt;/li&gt;
&lt;li&gt;Dynamic fan-out pattern&lt;/li&gt;
&lt;li&gt;Available in Airflow 2.3+&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Advantages of TaskFlow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tasks are just Python functions&lt;/li&gt;
&lt;li&gt;Type hints enable IDE autocomplete&lt;/li&gt;
&lt;li&gt;Natural data flow through function parameters&lt;/li&gt;
&lt;li&gt;Automatic XCom handling (no manual push/pull)&lt;/li&gt;
&lt;li&gt;Cleaner, more Pythonic code&lt;/li&gt;
&lt;li&gt;Easier testing (function-based)&lt;/li&gt;
&lt;li&gt;Less boilerplate&lt;/li&gt;
&lt;li&gt;Better code organization&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  When to Use TaskFlow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New DAGs&lt;/li&gt;
&lt;li&gt;Complex dependencies&lt;/li&gt;
&lt;li&gt;Python-savvy teams&lt;/li&gt;
&lt;li&gt;Modern deployments (2.0+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alternative if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legacy Airflow 1.10 requirement&lt;/li&gt;
&lt;li&gt;Using operators not available as tasks&lt;/li&gt;
&lt;li&gt;Team unfamiliar with decorators&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 6: DEFERRABLE OPERATORS &amp;amp; TRIGGERS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  The Problem: Resource Waste
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Traditional Sensors (Blocking)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sensor holds worker slot while waiting&lt;/li&gt;
&lt;li&gt;Worker CPU ~5-10% even during idle&lt;/li&gt;
&lt;li&gt;Memory ~50MB per idle task&lt;/li&gt;
&lt;li&gt;Can handle ~50-100 concurrent waits&lt;/li&gt;
&lt;li&gt;Resource waste for long waits (hours/days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: 1000 tasks waiting 1 hour = 1000 worker slots held continuously = massive waste&lt;/p&gt;
&lt;h3&gt;
  
  
  The Solution: Deferrable Operators
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Deferrable Approach (Non-blocking)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task releases worker slot immediately after deferring&lt;/li&gt;
&lt;li&gt;Minimal CPU usage (~0.1%) during wait&lt;/li&gt;
&lt;li&gt;Memory ~1MB per deferred task&lt;/li&gt;
&lt;li&gt;Can handle 1000+ concurrent waits&lt;/li&gt;
&lt;li&gt;~95% resource savings compared to traditional&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  How Deferrable Operators Work
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;4-Phase Execution:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execution Phase&lt;/strong&gt;: Task executes initial logic&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setup, validation, prerequisites&lt;/li&gt;
&lt;li&gt;Then: Defer to trigger&lt;/li&gt;
&lt;li&gt;Worker slot RELEASED&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Deferred Phase&lt;/strong&gt;: Trigger monitors event&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Async/event-based monitoring&lt;/li&gt;
&lt;li&gt;Triggerer service handles&lt;/li&gt;
&lt;li&gt;Very lightweight&lt;/li&gt;
&lt;li&gt;No worker slot needed&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Completion Phase&lt;/strong&gt;: Event occurs&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Trigger detects condition met&lt;/li&gt;
&lt;li&gt;Returns event data&lt;/li&gt;
&lt;li&gt;Trigger fires&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Resume Phase&lt;/strong&gt;: Task completes&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker acquires new slot&lt;/li&gt;
&lt;li&gt;Task resumes with trigger result&lt;/li&gt;
&lt;li&gt;Final processing&lt;/li&gt;
&lt;li&gt;Task completes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Available Deferrable Operators
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operator&lt;/th&gt;
&lt;th&gt;Defers For&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Specific time&lt;/td&gt;
&lt;td&gt;Waiting until business hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3KeySensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3 object&lt;/td&gt;
&lt;td&gt;Long-running file uploads&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;External DAG&lt;/td&gt;
&lt;td&gt;Cross-DAG long waits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SqlSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQL condition&lt;/td&gt;
&lt;td&gt;Database state changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HttpSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTP response&lt;/td&gt;
&lt;td&gt;External service availability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TimeDeltaSensorAsync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Duration&lt;/td&gt;
&lt;td&gt;Fixed time wait&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Triggerer Service
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Manages all active triggers&lt;/li&gt;
&lt;li&gt;Single lightweight async service per deployment&lt;/li&gt;
&lt;li&gt;Default capacity: 1000 triggers&lt;/li&gt;
&lt;li&gt;Requires: &lt;code&gt;airflow triggerer&lt;/code&gt; process&lt;/li&gt;
&lt;li&gt;Monitors events, fires when ready&lt;/li&gt;
&lt;li&gt;Stateless (can restart without losing state)&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  When to Use Deferrable
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait duration &amp;gt; 5 minutes&lt;/li&gt;
&lt;li&gt;Many concurrent waits&lt;/li&gt;
&lt;li&gt;Resource-constrained environment&lt;/li&gt;
&lt;li&gt;Can tolerate event latency (&amp;lt;1 second)&lt;/li&gt;
&lt;li&gt;Long-running waits (hours, days)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Don't use when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait &amp;lt; 5 minutes (overhead not worth it)&lt;/li&gt;
&lt;li&gt;Real-time detection critical (sub-second)&lt;/li&gt;
&lt;li&gt;Simple local operations&lt;/li&gt;
&lt;li&gt;Frequent detection required&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Performance Impact
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1000 Tasks Waiting 1 Hour&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Traditional: 1000 worker slots × 1 hour = 1000 slot-hours wasted&lt;br&gt;
Deferrable: 0 worker slots, minimal CPU in trigger service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Savings&lt;/strong&gt;: ~95% for typical scenarios&lt;/p&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 7: DEPENDENCIES &amp;amp; RELATIONSHIPS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Task Dependencies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Linear Chain&lt;/strong&gt;: Task A → Task B → Task C → Task D&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sequential execution&lt;/li&gt;
&lt;li&gt;Each waits for previous&lt;/li&gt;
&lt;li&gt;Easiest to understand&lt;/li&gt;
&lt;li&gt;Best for: ETL pipelines with stages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fan-out / Fan-in&lt;/strong&gt;: One task triggers many, then merge&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      ├─ Process A ─┐
Extract ┤            ├─ Merge
      └─ Process B ─┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Parallel processing&lt;/li&gt;
&lt;li&gt;Best for: Multiple independent operations&lt;/li&gt;
&lt;li&gt;Example: Extract once, transform multiple ways&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Branching&lt;/strong&gt;: Conditional execution&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      ┌─ Success path
Branch ┤
      └─ Error path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Different paths based on condition&lt;/li&gt;
&lt;li&gt;Best for: Alternative workflows&lt;/li&gt;
&lt;li&gt;Example: If data valid, load; else alert&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Trigger Rules
&lt;/h3&gt;

&lt;p&gt;Trigger rules determine WHEN downstream tasks execute based on UPSTREAM status.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Rule&lt;/th&gt;
&lt;th&gt;Executes When&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All upstream succeed&lt;/td&gt;
&lt;td&gt;Default, sequential flow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All upstream fail&lt;/td&gt;
&lt;td&gt;Cleanup after failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;all_done&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;All complete (any status)&lt;/td&gt;
&lt;td&gt;Final reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;one_success&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At least one succeeds&lt;/td&gt;
&lt;td&gt;Alternative paths&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;one_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;At least one fails&lt;/td&gt;
&lt;td&gt;Error alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none_failed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No failures (skips OK)&lt;/td&gt;
&lt;td&gt;Robust pipelines ✓ Recommended&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;none_failed_or_skipped&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No failures/skips&lt;/td&gt;
&lt;td&gt;Strict validation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: Use &lt;code&gt;none_failed&lt;/code&gt; for fault-tolerant pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  DAG-to-DAG Dependencies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Methods to Create&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ExternalTaskSensor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Poll for completion&lt;/td&gt;
&lt;td&gt;Different schedules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TriggerDagRunOperator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Actively trigger&lt;/td&gt;
&lt;td&gt;Orchestrate workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event-driven&lt;/td&gt;
&lt;td&gt;Data pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Datasets (Airflow 2.3+)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Producer DAG updates dataset&lt;/li&gt;
&lt;li&gt;Consumer DAG triggered automatically&lt;/li&gt;
&lt;li&gt;Real-time responsiveness (&amp;lt;1 second)&lt;/li&gt;
&lt;li&gt;Low latency, efficient&lt;/li&gt;
&lt;li&gt;Best for data-driven workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ETL DAG extracts and updates &lt;code&gt;orders_dataset&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Analytics DAG triggered automatically when dataset updates&lt;/li&gt;
&lt;li&gt;No waiting, real-time processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 8: SCHEDULING
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Overview of Scheduling Methods
&lt;/h3&gt;

&lt;p&gt;Three ways to trigger DAG runs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Time-based&lt;/strong&gt;: Specific times/intervals&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven&lt;/strong&gt;: When data arrives/updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual&lt;/strong&gt;: User-triggered via UI&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Time-Based: Cron Expressions (Legacy)
&lt;/h3&gt;

&lt;p&gt;Cron format: &lt;code&gt;minute hour day month day_of_week&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common Patterns&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;@hourly&lt;/code&gt;: Every hour&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@daily&lt;/code&gt;: Every day at midnight&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@weekly&lt;/code&gt;: Every Sunday at midnight&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@monthly&lt;/code&gt;: First day of month&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;0 9 * * MON-FRI&lt;/code&gt;: 9 AM on weekdays&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;*/15 * * * *&lt;/code&gt;: Every 15 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple time-only scheduling&lt;/li&gt;
&lt;li&gt;Limited to time expressions&lt;/li&gt;
&lt;li&gt;No business logic&lt;/li&gt;
&lt;li&gt;Timezone handling tricky&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time-Based: Timetable API (Modern)
&lt;/h3&gt;

&lt;p&gt;Timetables (Airflow 2.2+) provide object-oriented scheduling with unlimited flexibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in Timetables&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CronDataTimetable&lt;/code&gt;: Cron-based (replaces schedule_interval)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DailyTimetable&lt;/code&gt;: Daily runs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IntervalTimetable&lt;/code&gt;: Fixed intervals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom Timetables Allow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business day scheduling (Mon-Fri)&lt;/li&gt;
&lt;li&gt;Holiday-aware schedules (skip holidays)&lt;/li&gt;
&lt;li&gt;Timezone-specific runs&lt;/li&gt;
&lt;li&gt;Complex business logic&lt;/li&gt;
&lt;li&gt;End-of-month runs&lt;/li&gt;
&lt;li&gt;Business hour runs (9 AM - 5 PM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unlimited flexibility&lt;/li&gt;
&lt;li&gt;Timezone support&lt;/li&gt;
&lt;li&gt;Readable code&lt;/li&gt;
&lt;li&gt;Testable&lt;/li&gt;
&lt;li&gt;Documented rationale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Event-Driven: Datasets (Airflow 2.3+)
&lt;/h3&gt;

&lt;p&gt;Datasets enable data-driven workflows instead of time-driven.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it Works&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Producer task completes and updates dataset&lt;/li&gt;
&lt;li&gt;Scheduler detects dataset update&lt;/li&gt;
&lt;li&gt;Consumer DAG triggered immediately&lt;/li&gt;
&lt;li&gt;No scheduled waiting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Benefits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time responsiveness&lt;/li&gt;
&lt;li&gt;Resource efficient (no idle waits)&lt;/li&gt;
&lt;li&gt;Data-quality focused&lt;/li&gt;
&lt;li&gt;Reduced latency (&amp;lt;1 second)&lt;/li&gt;
&lt;li&gt;Perfect for data pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to Use&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-stage ETL (extract → transform → load)&lt;/li&gt;
&lt;li&gt;Event streaming&lt;/li&gt;
&lt;li&gt;Real-time processing&lt;/li&gt;
&lt;li&gt;Data-dependent workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Hybrid Scheduling
&lt;/h3&gt;

&lt;p&gt;Combine time-based and event-driven:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DAG runs on schedule OR when data arrives&lt;/li&gt;
&lt;li&gt;Whichever comes first&lt;/li&gt;
&lt;li&gt;Best of both worlds&lt;/li&gt;
&lt;li&gt;Flexible and responsive&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduling Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TIME-BASED?
├─ YES → Complex logic?
│        ├─ YES → Custom Timetable
│        └─ NO → Cron Expression
│
└─ NO → Event-driven?
         ├─ YES → Use Datasets
         └─ NO → Manual trigger or reconsider design
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 9: FAILURE HANDLING
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Retry Mechanisms
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Allow tasks to recover from transient failures (network timeout, API rate limit, etc.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;retries&lt;/code&gt;: Number of retry attempts&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry_delay&lt;/code&gt;: Wait time between retries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;retry_exponential_backoff&lt;/code&gt;: Exponentially increasing delays&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_retry_delay&lt;/code&gt;: Cap on maximum wait&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Retry Sequence&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attempt 1: Fail → Wait 1 minute
Attempt 2: Fail → Wait 2 minutes (1 × 2^1)
Attempt 3: Fail → Wait 4 minutes (1 × 2^2)
Attempt 4: Fail → Wait 8 minutes (1 × 2^3)
Attempt 5: Fail → Wait 16 minutes (1 × 2^4)
Attempt 6: Fail → Max 1 hour (capped)
Attempt 7: Fail → Task FAILED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Best Practice&lt;/strong&gt;: 2-3 retries with exponential backoff for most tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger Rules (Already Covered in Section 7)
&lt;/h3&gt;

&lt;p&gt;But worth repeating: Use &lt;code&gt;none_failed&lt;/code&gt; for robust pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Callbacks
&lt;/h3&gt;

&lt;p&gt;Execute custom logic when task state changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;on_failure_callback&lt;/code&gt;: When task fails&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_retry_callback&lt;/code&gt;: When task retries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_success_callback&lt;/code&gt;: When task succeeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common Uses&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send notifications (Slack, email, PagerDuty)&lt;/li&gt;
&lt;li&gt;Log to external monitoring systems&lt;/li&gt;
&lt;li&gt;Trigger incident management&lt;/li&gt;
&lt;li&gt;Update dashboards&lt;/li&gt;
&lt;li&gt;Cleanup resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Error Handling Patterns
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Idempotency&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make tasks safe to retry&lt;/li&gt;
&lt;li&gt;Delete-then-insert instead of insert&lt;/li&gt;
&lt;li&gt;Same result regardless of executions&lt;/li&gt;
&lt;li&gt;Safe for retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: Circuit Breaker&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stop retrying after N attempts&lt;/li&gt;
&lt;li&gt;Raise critical exceptions&lt;/li&gt;
&lt;li&gt;Alert operations team&lt;/li&gt;
&lt;li&gt;Prevent cascading failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: Graceful Degradation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continue with partial data&lt;/li&gt;
&lt;li&gt;Log warnings&lt;/li&gt;
&lt;li&gt;Don't fail entire pipeline&lt;/li&gt;
&lt;li&gt;Better UX than all-or-nothing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pattern 4: Skip on Condition&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Raise &lt;code&gt;AirflowSkipException&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Skip task without marking as failure&lt;/li&gt;
&lt;li&gt;Allow downstream to continue&lt;/li&gt;
&lt;li&gt;Best for conditional workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 10: SLA MANAGEMENT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is SLA?
&lt;/h3&gt;

&lt;p&gt;SLA (Service Level Agreement) is a guarantee that a task will complete within a specified time window.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If SLA is missed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task marked as "SLA Missed"&lt;/li&gt;
&lt;li&gt;Notifications triggered&lt;/li&gt;
&lt;li&gt;Callbacks executed&lt;/li&gt;
&lt;li&gt;Logged for auditing/metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting SLAs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;At DAG Level&lt;/strong&gt;: All tasks inherit SLA&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default_args = {
    'sla': timedelta(hours=1)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;At Task Level&lt;/strong&gt;: Specific to individual task&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task = PythonOperator(
    task_id='critical_task',
    sla=timedelta(minutes=30)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Per Task Overrides&lt;/strong&gt;: Task SLA overrides DAG default&lt;/p&gt;

&lt;h3&gt;
  
  
  SLA Monitoring
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SLA compliance rate (%)&lt;/li&gt;
&lt;li&gt;Average overdue duration&lt;/li&gt;
&lt;li&gt;Most violated tasks&lt;/li&gt;
&lt;li&gt;Impact on downstream&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alerts&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send notifications when SLA at risk (80%)&lt;/li&gt;
&lt;li&gt;Escalate on missed SLA&lt;/li&gt;
&lt;li&gt;Create incidents for critical SLAs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track compliance trends&lt;/li&gt;
&lt;li&gt;Identify problematic tasks&lt;/li&gt;
&lt;li&gt;Alert history&lt;/li&gt;
&lt;li&gt;Remediation tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SLA Best Practices
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Practice&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Set realistic SLAs&lt;/td&gt;
&lt;td&gt;Avoid false positives&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Different tiers&lt;/td&gt;
&lt;td&gt;Critical &amp;lt; Normal &amp;lt; Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Review quarterly&lt;/td&gt;
&lt;td&gt;Account for growth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Document rationale&lt;/td&gt;
&lt;td&gt;Team alignment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Escalation policy&lt;/td&gt;
&lt;td&gt;Proper incident response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Callbacks&lt;/td&gt;
&lt;td&gt;Automated notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor compliance&lt;/td&gt;
&lt;td&gt;Identify trends&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 11: CATCHUP BEHAVIOR
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Catchup?
&lt;/h3&gt;

&lt;p&gt;Catchup is automatic backfill of missed DAG runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Scenario&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Start date: 2024-01-01
Today: 2024-05-15
Schedule: @daily (every day)
Catchup: True

RESULT: Airflow creates 135 DAG runs instantly!
(One for each day from Jan 1 to May 14)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Catchup Settings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;th&gt;When to Use&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catchup=True&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Backfill all missed runs&lt;/td&gt;
&lt;td&gt;Historical data loading&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;catchup=False&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No backfill, start fresh&lt;/td&gt;
&lt;td&gt;New pipelines&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Catchup Risks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Potential Issues&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource exhaustion (CPU, memory, database)&lt;/li&gt;
&lt;li&gt;Network bandwidth spike&lt;/li&gt;
&lt;li&gt;Third-party API rate limits exceeded&lt;/li&gt;
&lt;li&gt;Database connection pool exhausted&lt;/li&gt;
&lt;li&gt;Worker node crashes&lt;/li&gt;
&lt;li&gt;Scheduler paralysis (busy processing backfill)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Safe Catchup Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Use Separate Backfill DAG&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Main DAG: catchup=False (normal operations)&lt;/li&gt;
&lt;li&gt;Backfill DAG: catchup=True, end_date set (historical only)&lt;/li&gt;
&lt;li&gt;Run backfill during maintenance window&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Control Parallelism&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;max_active_runs&lt;/code&gt;: Limit concurrent DAG runs (1-3)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_active_tasks_per_dag&lt;/code&gt;: Limit concurrent tasks (16-32)&lt;/li&gt;
&lt;li&gt;Prevents resource exhaustion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Staged Approach&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test with subset of dates first&lt;/li&gt;
&lt;li&gt;Monitor resource usage&lt;/li&gt;
&lt;li&gt;Gradually increase pace&lt;/li&gt;
&lt;li&gt;Have rollback plan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verify data completeness&lt;/li&gt;
&lt;li&gt;Check for duplicates&lt;/li&gt;
&lt;li&gt;Validate transformations&lt;/li&gt;
&lt;li&gt;Compare with expected results&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Catchup Checklist
&lt;/h3&gt;

&lt;p&gt;Before enabling catchup, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Database capacity available&lt;/li&gt;
&lt;li&gt;Worker resources adequate&lt;/li&gt;
&lt;li&gt;Source data exists for all periods&lt;/li&gt;
&lt;li&gt;Data schema hasn't changed&lt;/li&gt;
&lt;li&gt;No API rate limits exceeded&lt;/li&gt;
&lt;li&gt;Network bandwidth available&lt;/li&gt;
&lt;li&gt;Stakeholders informed&lt;/li&gt;
&lt;li&gt;Rollback procedure ready&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 12: CONFIGURATION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Configuration Hierarchy
&lt;/h3&gt;

&lt;p&gt;Airflow reads configuration in this order (first found wins):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Environment Variables&lt;/strong&gt; (AIRFLOW___ prefix)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;airflow.cfg&lt;/strong&gt; file (main config)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default configuration&lt;/strong&gt; (hardcoded)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Command-line arguments&lt;/strong&gt; (runtime)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Critical Settings
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setting&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Production Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;sql_alchemy_conn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Metadata database&lt;/td&gt;
&lt;td&gt;PostgreSQL connection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;executor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Task execution&lt;/td&gt;
&lt;td&gt;CeleryExecutor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;parallelism&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max tasks globally&lt;/td&gt;
&lt;td&gt;32-128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_runs_per_dag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent DAG runs&lt;/td&gt;
&lt;td&gt;1-3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;max_active_tasks_per_dag&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Concurrent tasks&lt;/td&gt;
&lt;td&gt;16-32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;remote_logging&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Centralized logs&lt;/td&gt;
&lt;td&gt;True (to S3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;email_backend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Email provider&lt;/td&gt;
&lt;td&gt;SMTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;base_log_folder&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Log storage&lt;/td&gt;
&lt;td&gt;NFS mount&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Database Configuration
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Production Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL 12+ (recommended) or MySQL 8+&lt;/li&gt;
&lt;li&gt;Dedicated database server (not shared)&lt;/li&gt;
&lt;li&gt;Daily backups automated&lt;/li&gt;
&lt;li&gt;Connection pooling enabled&lt;/li&gt;
&lt;li&gt;Monitoring for slow queries&lt;/li&gt;
&lt;li&gt;Regular maintenance (VACUUM, ANALYZE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Not SQLite&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not concurrent-access safe&lt;/li&gt;
&lt;li&gt;Locks on writes&lt;/li&gt;
&lt;li&gt;Single machine only&lt;/li&gt;
&lt;li&gt;Not for production&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Executor Configuration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Executor&lt;/th&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Local&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;[local_executor]&lt;/td&gt;
&lt;td&gt;Development&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Celery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;[celery]&lt;/td&gt;
&lt;td&gt;Distributed, high volume&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kubernetes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;[kubernetes]&lt;/td&gt;
&lt;td&gt;Cloud-native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sequential&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;(internal)&lt;/td&gt;
&lt;td&gt;Testing only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Performance Tuning
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connection pooling (pool_size=5, max_overflow=10)&lt;/li&gt;
&lt;li&gt;Indexes on frequently queried columns&lt;/li&gt;
&lt;li&gt;Archive old logs/task instances&lt;/li&gt;
&lt;li&gt;Query optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduler&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase DAG parsing frequency if needed&lt;/li&gt;
&lt;li&gt;Optimize DAG file count&lt;/li&gt;
&lt;li&gt;Use pool resources wisely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Web Server&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache results&lt;/li&gt;
&lt;li&gt;Limit search scope&lt;/li&gt;
&lt;li&gt;Monitor response times&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 13: AWS INTEGRATION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Connection Setup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Three Methods&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Airflow UI&lt;/strong&gt;: Admin → Connections → Create&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Environment Variables&lt;/strong&gt;: &lt;code&gt;AIRFLOW_CONN_AWS_DEFAULT&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Programmatic&lt;/strong&gt;: Python script creating Connection objects&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Authentication Types&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access Key / Secret Key (discouraged)&lt;/li&gt;
&lt;li&gt;IAM Roles (recommended for EC2)&lt;/li&gt;
&lt;li&gt;AssumeRole (cross-account access)&lt;/li&gt;
&lt;li&gt;Temporary credentials&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Common AWS Services
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;What It Does&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File storage&lt;/td&gt;
&lt;td&gt;Data lake, logs, backups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Redshift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data warehouse&lt;/td&gt;
&lt;td&gt;Analytics, aggregations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless compute&lt;/td&gt;
&lt;td&gt;Lightweight processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SNS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;td&gt;Alerts, messaging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Message queue&lt;/td&gt;
&lt;td&gt;Task decoupling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ETL service&lt;/td&gt;
&lt;td&gt;Spark/Python jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDS/Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed database&lt;/td&gt;
&lt;td&gt;Transactional data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EMR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Big data&lt;/td&gt;
&lt;td&gt;Spark, Hadoop clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring&lt;/td&gt;
&lt;td&gt;Metrics, logs, alarms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Operators for AWS Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;S3 Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload/download files&lt;/li&gt;
&lt;li&gt;Copy within S3&lt;/li&gt;
&lt;li&gt;List objects&lt;/li&gt;
&lt;li&gt;Delete files&lt;/li&gt;
&lt;li&gt;Check existence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Redshift Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute SQL queries&lt;/li&gt;
&lt;li&gt;Load data from S3&lt;/li&gt;
&lt;li&gt;Create/delete snapshots&lt;/li&gt;
&lt;li&gt;Pause/resume clusters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Lambda Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Invoke functions&lt;/li&gt;
&lt;li&gt;Async execution&lt;/li&gt;
&lt;li&gt;Sync execution with polling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SQS/SNS Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Send messages to queue&lt;/li&gt;
&lt;li&gt;Receive from queue&lt;/li&gt;
&lt;li&gt;Publish to topics&lt;/li&gt;
&lt;li&gt;Batch operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Best Practices
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Practice&lt;/th&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Use IAM roles&lt;/td&gt;
&lt;td&gt;No credentials in code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Secrets Manager&lt;/td&gt;
&lt;td&gt;Centralized credential storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enable CloudTrail&lt;/td&gt;
&lt;td&gt;Audit all API calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use VPC endpoints&lt;/td&gt;
&lt;td&gt;S3 access without internet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monitor costs&lt;/td&gt;
&lt;td&gt;CloudWatch cost alerts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Implement budgets&lt;/td&gt;
&lt;td&gt;Prevent overspending&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regular backups&lt;/td&gt;
&lt;td&gt;Disaster recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Region strategy&lt;/td&gt;
&lt;td&gt;Latency, compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 14: VERSION COMPARISON
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Version Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1.10.x (Deprecated, EOL 2021)
    ↓ Major Redesign
2.0 (March 2021) - TaskFlow API introduced
    ↓
2.1-2.2 (Improvements &amp;amp; Deferrable)
    ↓
2.3 (Nov 2022) - Datasets event-driven scheduling
    ↓
2.4-2.5 (Current stable, production-ready)
    ↓
3.0 (Latest) - Performance focus, async improvements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Feature Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;1.10&lt;/th&gt;
&lt;th&gt;2.x&lt;/th&gt;
&lt;th&gt;3.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TaskFlow API&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deferrable Ops&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Datasets&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;3-5x faster&lt;/td&gt;
&lt;td&gt;5-6x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅ Required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Async/Await&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;✅ Full&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Performance Improvements
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;1.10&lt;/th&gt;
&lt;th&gt;2.5&lt;/th&gt;
&lt;th&gt;3.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DAG parsing (1000 DAGs)&lt;/td&gt;
&lt;td&gt;~120s&lt;/td&gt;
&lt;td&gt;~45s&lt;/td&gt;
&lt;td&gt;~15s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Task scheduling latency&lt;/td&gt;
&lt;td&gt;~5-10s&lt;/td&gt;
&lt;td&gt;~2-3s&lt;/td&gt;
&lt;td&gt;&amp;lt;1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory (idle scheduler)&lt;/td&gt;
&lt;td&gt;~500MB&lt;/td&gt;
&lt;td&gt;~300MB&lt;/td&gt;
&lt;td&gt;~150MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web UI response&lt;/td&gt;
&lt;td&gt;~800ms&lt;/td&gt;
&lt;td&gt;~200ms&lt;/td&gt;
&lt;td&gt;~50ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Migration Path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;From 1.10 to 2.x&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.6+ required&lt;/li&gt;
&lt;li&gt;Import paths changed (reorganized)&lt;/li&gt;
&lt;li&gt;Operators moved to providers&lt;/li&gt;
&lt;li&gt;Test in staging thoroughly&lt;/li&gt;
&lt;li&gt;Gradual production rollout (1-2 weeks)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;From 2.x to 3.0&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most DAGs work without changes&lt;/li&gt;
&lt;li&gt;Drop Python &amp;lt; 3.10&lt;/li&gt;
&lt;li&gt;Update async patterns&lt;/li&gt;
&lt;li&gt;Leverage new performance features&lt;/li&gt;
&lt;li&gt;Simpler migration (backward compatible)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Version Selection
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New projects&lt;/td&gt;
&lt;td&gt;2.5+ or 3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production (stable)&lt;/td&gt;
&lt;td&gt;2.4.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;td&gt;2.5+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance-critical&lt;/td&gt;
&lt;td&gt;3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning&lt;/td&gt;
&lt;td&gt;2.5+ or 3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SECTION 15: BEST PRACTICES
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DAG Design DO's
&lt;/h3&gt;

&lt;p&gt;✓ Use TaskFlow API for modern code&lt;br&gt;
✓ Keep tasks small and focused&lt;br&gt;
✓ Name tasks descriptively&lt;br&gt;
✓ Document complex logic&lt;br&gt;
✓ Use type hints&lt;br&gt;
✓ Add comprehensive logging&lt;br&gt;
✓ Handle errors gracefully&lt;br&gt;
✓ Test DAGs before production&lt;/p&gt;
&lt;h3&gt;
  
  
  DAG Design DON'Ts
&lt;/h3&gt;

&lt;p&gt;✗ Put heavy processing in DAG definition&lt;br&gt;
✗ Create 1000s of tasks per DAG&lt;br&gt;
✗ Hardcode credentials&lt;br&gt;
✗ Ignore data validation&lt;br&gt;
✗ Use DAG-level database connections&lt;br&gt;
✗ Forget about idempotency&lt;br&gt;
✗ Skip error handling&lt;br&gt;
✗ Assume tasks always succeed&lt;/p&gt;
&lt;h3&gt;
  
  
  Task Design DO's
&lt;/h3&gt;

&lt;p&gt;✓ Make tasks idempotent&lt;br&gt;
✓ Set appropriate timeouts&lt;br&gt;
✓ Implement retries with backoff&lt;br&gt;
✓ Use pools for resource control&lt;br&gt;
✓ Skip tasks on errors (when appropriate)&lt;br&gt;
✓ Log important events&lt;br&gt;
✓ Clean up resources&lt;/p&gt;
&lt;h3&gt;
  
  
  Task Design DON'Ts
&lt;/h3&gt;

&lt;p&gt;✗ Assume tasks always succeed&lt;br&gt;
✗ Run long operations without timeout&lt;br&gt;
✗ Hold resources unnecessarily&lt;br&gt;
✗ Store large data in XCom&lt;br&gt;
✗ Use blocking operations&lt;br&gt;
✗ Ignore upstream failures&lt;br&gt;
✗ Forget cleanup&lt;/p&gt;
&lt;h3&gt;
  
  
  Scheduling DO's
&lt;/h3&gt;

&lt;p&gt;✓ Match frequency to data needs&lt;br&gt;
✓ Account for upstream dependencies&lt;br&gt;
✓ Plan for failures&lt;br&gt;
✓ Test schedule changes&lt;br&gt;
✓ Document rationale&lt;br&gt;
✓ Monitor adherence&lt;/p&gt;
&lt;h3&gt;
  
  
  Scheduling DON'Ts
&lt;/h3&gt;

&lt;p&gt;✗ Schedule too frequently&lt;br&gt;
✗ Ignore resource constraints&lt;br&gt;
✗ Forget about catchup&lt;br&gt;
✗ Use ambiguous expressions&lt;br&gt;
✗ Create tight couplings&lt;br&gt;
✗ Skip monitoring&lt;/p&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 16: MONITORING
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Key Metrics
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;What It Shows&lt;/th&gt;
&lt;th&gt;Alert Threshold&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scheduler heartbeat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scheduler running&lt;/td&gt;
&lt;td&gt;&amp;gt;30s = ALERT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task failure rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pipeline health&lt;/td&gt;
&lt;td&gt;&amp;gt;5% = INVESTIGATE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SLA violations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Performance issues&lt;/td&gt;
&lt;td&gt;&amp;gt;10/day = REVIEW&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DAG parsing time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;System load&lt;/td&gt;
&lt;td&gt;&amp;gt;60s = OPTIMIZE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Task queue depth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Worker capacity&lt;/td&gt;
&lt;td&gt;&amp;gt;1000 = ADD WORKERS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DB connections&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connection health&lt;/td&gt;
&lt;td&gt;&amp;gt;80% = INCREASE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Log size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Storage usage&lt;/td&gt;
&lt;td&gt;Monitor growth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Common Issues &amp;amp; Solutions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Likely Cause&lt;/th&gt;
&lt;th&gt;Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tasks stuck&lt;/td&gt;
&lt;td&gt;Worker crashed&lt;/td&gt;
&lt;td&gt;Restart workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scheduler lagging&lt;/td&gt;
&lt;td&gt;High load&lt;/td&gt;
&lt;td&gt;Add resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory leaks&lt;/td&gt;
&lt;td&gt;Bad code&lt;/td&gt;
&lt;td&gt;Profile and fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database slow&lt;/td&gt;
&lt;td&gt;No indexes&lt;/td&gt;
&lt;td&gt;Optimize queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triggers not firing&lt;/td&gt;
&lt;td&gt;Triggerer down&lt;/td&gt;
&lt;td&gt;Check service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLA misses&lt;/td&gt;
&lt;td&gt;Unrealistic goals&lt;/td&gt;
&lt;td&gt;Adjust SLA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Catchup explosion&lt;/td&gt;
&lt;td&gt;Too many runs&lt;/td&gt;
&lt;td&gt;Limit with max_active&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Monitoring Stack Components
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Metrics Collection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus for collection&lt;/li&gt;
&lt;li&gt;Custom StatsD metrics&lt;/li&gt;
&lt;li&gt;CloudWatch for AWS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Visualization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grafana dashboards&lt;/li&gt;
&lt;li&gt;CloudWatch dashboards&lt;/li&gt;
&lt;li&gt;Custom visualizations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alerting&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Threshold-based alerts&lt;/li&gt;
&lt;li&gt;Anomaly detection&lt;/li&gt;
&lt;li&gt;Escalation policies&lt;/li&gt;
&lt;li&gt;Integration with incident systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Logging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralized logs (ELK, Splunk)&lt;/li&gt;
&lt;li&gt;Remote logging to S3&lt;/li&gt;
&lt;li&gt;Log aggregation&lt;/li&gt;
&lt;li&gt;Searchable archives&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  SECTION 17: CONNECTIONS AND SECRETS
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What are Connections?
&lt;/h3&gt;

&lt;p&gt;Connections store authentication credentials to external systems. Instead of hardcoding database passwords or API keys in DAGs, Airflow stores these centrally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Components:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Connection ID&lt;/strong&gt;: Unique identifier (e.g., &lt;code&gt;postgres_default&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connection Type&lt;/strong&gt;: System type (postgres, aws, http, mysql, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host&lt;/strong&gt;: Server address&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Login&lt;/strong&gt;: Username&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Password&lt;/strong&gt;: Secret credential&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port&lt;/strong&gt;: Service port&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema&lt;/strong&gt;: Database or default path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extra&lt;/strong&gt;: JSON for additional parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Use Connections?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credentials not in DAG code (can't accidentally commit to git)&lt;/li&gt;
&lt;li&gt;Centralized credential management&lt;/li&gt;
&lt;li&gt;Easy credential rotation&lt;/li&gt;
&lt;li&gt;Different credentials per environment (dev, staging, prod)&lt;/li&gt;
&lt;li&gt;Audit trail of access&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Connection Types
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Used For&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;postgres&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PostgreSQL databases&lt;/td&gt;
&lt;td&gt;Analytics warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mysql&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MySQL databases&lt;/td&gt;
&lt;td&gt;Transactional data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;snowflake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Snowflake warehouse&lt;/td&gt;
&lt;td&gt;Cloud data platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;aws&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AWS services (S3, Redshift, Lambda)&lt;/td&gt;
&lt;td&gt;Cloud infrastructure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;gcp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Google Cloud services&lt;/td&gt;
&lt;td&gt;BigQuery, Cloud Storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;http&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;REST APIs&lt;/td&gt;
&lt;td&gt;External services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ssh&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Remote server access&lt;/td&gt;
&lt;td&gt;Bastion hosts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;slack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Slack messaging&lt;/td&gt;
&lt;td&gt;Notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sftp&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SFTP file transfer&lt;/td&gt;
&lt;td&gt;Remote servers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h3&gt;
  
  
  What are Secrets/Variables?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Variables&lt;/strong&gt; store non-sensitive configuration values.&lt;br&gt;
&lt;strong&gt;Secrets&lt;/strong&gt; store sensitive data (marked as secret).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Variables&lt;/th&gt;
&lt;th&gt;Secrets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration settings&lt;/td&gt;
&lt;td&gt;Passwords, API keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Visibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visible in UI&lt;/td&gt;
&lt;td&gt;Masked/hidden in UI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metadata database&lt;/td&gt;
&lt;td&gt;External backend (Vault, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Example&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;batch_size, email&lt;/td&gt;
&lt;td&gt;api_key, db_password&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;


&lt;h2&gt;
  
  
  How to Add Connections
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Method 1: Web UI (Easiest)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Airflow UI → Admin → Connections → Create Connection&lt;/li&gt;
&lt;li&gt;Fill form with credentials&lt;/li&gt;
&lt;li&gt;Click Save&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Method 2: Environment Variables
&lt;/h3&gt;

&lt;p&gt;Format: &lt;code&gt;AIRFLOW_CONN_{CONNECTION_ID}={CONN_TYPE}://{USER}:{PASSWORD}@{HOST}:{PORT}/{SCHEMA}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_CONN_POSTGRES_DEFAULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgresql://user:password@localhost:5432/mydb
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_CONN_AWS_DEFAULT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;aws://ACCESS_KEY:SECRET_KEY@
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Method 3: Python Script
&lt;/h3&gt;

&lt;p&gt;Create Connection objects programmatically and add to database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 4: Airflow CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow connections add postgres_prod &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-type&lt;/span&gt; postgres &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-host&lt;/span&gt; localhost &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-login&lt;/span&gt; user &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conn-password&lt;/span&gt; password
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How to Add Variables/Secrets
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Method 1: Web UI
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Airflow UI → Admin → Variables → Create Variable&lt;/li&gt;
&lt;li&gt;Enter Key and Value&lt;/li&gt;
&lt;li&gt;Check "Is Secret" for sensitive data&lt;/li&gt;
&lt;li&gt;Click Save&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Method 2: Environment Variables
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_VAR_BATCH_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1000
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_VAR_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;secret123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Method 3: Python Script
&lt;/h3&gt;

&lt;p&gt;Create Variable objects and add to database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Method 4: Airflow CLI
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow variables &lt;span class="nb"&gt;set &lt;/span&gt;batch_size 1000
airflow variables &lt;span class="nb"&gt;set &lt;/span&gt;api_key secret123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Using Connections in DAGs
&lt;/h2&gt;

&lt;p&gt;Connections are accessed through &lt;strong&gt;Hooks&lt;/strong&gt; (connection abstraction layer).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.postgres_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_records&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits of Hooks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encapsulate connection logic&lt;/li&gt;
&lt;li&gt;Reusable across tasks&lt;/li&gt;
&lt;li&gt;Handle connection pooling&lt;/li&gt;
&lt;li&gt;Error handling built-in&lt;/li&gt;
&lt;li&gt;No credentials in task code&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Using Variables in DAGs
&lt;/h2&gt;

&lt;p&gt;Variables are accessed directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;

&lt;span class="n"&gt;batch_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;batch_size&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;api_key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;is_production&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;is_production&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# With JSON parsing
&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;deserialize_json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Secrets Backend: Advanced Security
&lt;/h2&gt;

&lt;p&gt;For production, use external secrets backends instead of storing in Airflow database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Security Level&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Metadata DB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Development only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AWS Secrets Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;AWS deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HashiCorp Vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Enterprise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Google Secret Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;GCP deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Azure Key Vault&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Azure deployments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configure Airflow to use external backend&lt;/li&gt;
&lt;li&gt;Airflow queries backend when credential needed&lt;/li&gt;
&lt;li&gt;Secrets never stored in database&lt;/li&gt;
&lt;li&gt;Automatic rotation supported&lt;/li&gt;
&lt;li&gt;Audit logging in backend&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DO ✅
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use external secrets backend in production&lt;/li&gt;
&lt;li&gt;Rotate credentials quarterly&lt;/li&gt;
&lt;li&gt;Use IAM roles instead of hardcoded keys&lt;/li&gt;
&lt;li&gt;Encrypt connections in transit (SSL/TLS)&lt;/li&gt;
&lt;li&gt;Encrypt secrets at rest (database encryption)&lt;/li&gt;
&lt;li&gt;Limit who can see connections (RBAC)&lt;/li&gt;
&lt;li&gt;Use short-lived credentials (temporary tokens)&lt;/li&gt;
&lt;li&gt;Audit access to secrets&lt;/li&gt;
&lt;li&gt;Never commit credentials to git&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  DON'T ❌
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hardcode credentials in DAG code&lt;/li&gt;
&lt;li&gt;Store passwords in comments or docstrings&lt;/li&gt;
&lt;li&gt;Share credentials via Slack/Email&lt;/li&gt;
&lt;li&gt;Use default passwords&lt;/li&gt;
&lt;li&gt;Store credentials in config files in git&lt;/li&gt;
&lt;li&gt;Forget to revoke old credentials&lt;/li&gt;
&lt;li&gt;Use same credentials everywhere&lt;/li&gt;
&lt;li&gt;Log credential values&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Common Issues
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Cause&lt;/th&gt;
&lt;th&gt;Fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connection refused&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service not running&lt;/td&gt;
&lt;td&gt;Start the service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Authentication failed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wrong credentials&lt;/td&gt;
&lt;td&gt;Verify in source system&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Host not found&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wrong hostname/DNS&lt;/td&gt;
&lt;td&gt;Check network, DNS resolution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timeout&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network unreachable&lt;/td&gt;
&lt;td&gt;Check firewall, routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Variable not found&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Wrong key name&lt;/td&gt;
&lt;td&gt;Check variable ID spelling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Permission denied&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User lacks access&lt;/td&gt;
&lt;td&gt;Check user RBAC permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Testing Connections
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Via Web UI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to Admin → Connections&lt;/li&gt;
&lt;li&gt;Click on connection&lt;/li&gt;
&lt;li&gt;Click "Test Connection"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Via CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;airflow connections &lt;span class="nb"&gt;test &lt;/span&gt;postgres_default
airflow connections &lt;span class="nb"&gt;test &lt;/span&gt;aws_default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Via Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.hooks.postgres_hook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;

&lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres_default&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_conn&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Will raise error if connection fails
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Production Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Minimum Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ All connections stored centrally (not hardcoded)&lt;br&gt;
✅ Credentials in external secrets backend&lt;br&gt;
✅ RBAC enabled (limit connection visibility)&lt;br&gt;
✅ No credentials in DAG code&lt;br&gt;
✅ No credentials in git history&lt;br&gt;
✅ Quarterly credential rotation&lt;br&gt;
✅ SSL/TLS enabled for external connections&lt;br&gt;
✅ Audit logging for credential access&lt;br&gt;
✅ Monitoring/alerts for failed authentication&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  SUMMARY &amp;amp; CHECKLIST
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Production Deployment Checklist
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;INFRASTRUCTURE&lt;/strong&gt;&lt;br&gt;
□ PostgreSQL 12+ database&lt;br&gt;
□ Separate scheduler machine&lt;br&gt;
□ Multiple web server instances (load balanced)&lt;br&gt;
□ Multiple worker machines (if Celery)&lt;br&gt;
□ Message broker (Redis/RabbitMQ)&lt;br&gt;
□ Triggerer service running&lt;br&gt;
□ NFS for logs (centralized)&lt;br&gt;
□ Backup systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CONFIGURATION&lt;/strong&gt;&lt;br&gt;
□ Connection pooling configured&lt;br&gt;
□ Resource pools created&lt;br&gt;
□ Email/Slack configured&lt;br&gt;
□ Remote logging to S3/cloud&lt;br&gt;
□ Monitoring setup (Prometheus/Grafana)&lt;br&gt;
□ Alert thresholds defined&lt;br&gt;
□ Timezone configured correctly&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SECURITY&lt;/strong&gt;&lt;br&gt;
□ RBAC enabled in UI&lt;br&gt;
□ Credentials in Secrets Manager&lt;br&gt;
□ SSL/TLS enabled&lt;br&gt;
□ Database encryption at rest&lt;br&gt;
□ VPC configured&lt;br&gt;
□ Firewall rules tight&lt;br&gt;
□ Regular security updates&lt;br&gt;
□ No credentials in DAG code&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RELIABILITY&lt;/strong&gt;&lt;br&gt;
□ Metadata backup automated (daily)&lt;br&gt;
□ DAG code in Git repository&lt;br&gt;
□ Disaster recovery plan documented&lt;br&gt;
□ Rollback procedure tested&lt;br&gt;
□ SLA enforcement active&lt;br&gt;
□ Callback notifications configured&lt;br&gt;
□ Data validation implemented&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MONITORING&lt;/strong&gt;&lt;br&gt;
□ Scheduler health check&lt;br&gt;
□ Worker heartbeat monitoring&lt;br&gt;
□ Database performance monitoring&lt;br&gt;
□ Trigger service health&lt;br&gt;
□ DAG/task failure alerts&lt;br&gt;
□ SLA violation alerts&lt;br&gt;
□ Resource usage dashboards&lt;br&gt;
□ Cost tracking (if cloud)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PERFORMANCE&lt;/strong&gt;&lt;br&gt;
□ Database indexes optimized&lt;br&gt;
□ Connection pool sized appropriately&lt;br&gt;
□ Max active runs limited&lt;br&gt;
□ Task priorities set&lt;br&gt;
□ Deferrable operators for long waits&lt;br&gt;
□ SmartSensor enabled&lt;br&gt;
□ DAG parsing optimized&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DOCUMENTATION&lt;/strong&gt;&lt;br&gt;
□ DAG documentation complete&lt;br&gt;
□ SLA rationale documented&lt;br&gt;
□ Runbooks created for issues&lt;br&gt;
□ Team trained on platform&lt;br&gt;
□ Changes logged in git&lt;br&gt;
□ Disaster recovery plan shared&lt;br&gt;
□ Escalation procedures documented&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  RECOMMENDATIONS
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For New Projects Starting Today
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose Airflow 2.5+ or 3.0&lt;/strong&gt; (not 1.10)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use TaskFlow API&lt;/strong&gt; (modern, clean code)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Timetable API&lt;/strong&gt; for scheduling (not cron strings)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Deferrable operators&lt;/strong&gt; for long waits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for scale&lt;/strong&gt; from day one (don't start with LocalExecutor)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up monitoring&lt;/strong&gt; immediately (not later)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use event-driven scheduling&lt;/strong&gt; (Datasets) where possible&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store credentials externally&lt;/strong&gt; (Secrets Manager, not hardcoded)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Teams Upgrading from 1.10
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plan 2-3 months&lt;/strong&gt; for migration (not quick)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test thoroughly&lt;/strong&gt; in staging environment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep 1.10 backup&lt;/strong&gt; for emergency rollback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update DAGs incrementally&lt;/strong&gt; (not all at once)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor closely&lt;/strong&gt; first month post-upgrade&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document breaking changes&lt;/strong&gt; for team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Celebrate milestones&lt;/strong&gt; (keep team motivated)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Large Enterprise Deployments
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use Kubernetes executor&lt;/strong&gt; (scalability, isolation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement comprehensive monitoring&lt;/strong&gt; (critical for troubleshooting)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up disaster recovery&lt;/strong&gt; (business continuity)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular chaos engineering tests&lt;/strong&gt; (find weak points before production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity planning quarterly&lt;/strong&gt; (predict growth)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance optimization ongoing&lt;/strong&gt; (never "done")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security audits regular&lt;/strong&gt; (compliance, risk)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  For Data Pipeline Use Cases
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use event-driven scheduling&lt;/strong&gt; (Datasets, not time-based)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement data quality checks&lt;/strong&gt; (validate early)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor SLAs closely&lt;/strong&gt; (data freshness critical)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement idempotent tasks&lt;/strong&gt; (safe retries)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan catchup carefully&lt;/strong&gt; (historical data loads)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control everything&lt;/strong&gt; (DAGs, configurations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use separate environment&lt;/strong&gt; (dev, staging, prod)&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  CONCLUSION
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Airflow Journey
&lt;/h3&gt;

&lt;p&gt;Apache Airflow has evolved from a simple scheduler into a mature, production-grade workflow orchestration platform. The progression from version 1.10 through modern versions demonstrates significant improvements in usability, performance, flexibility, and reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture &amp;amp; Design&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TaskFlow API provides clean, Pythonic DAG definitions&lt;/li&gt;
&lt;li&gt;Deferrable operators enable massive resource savings for long-running waits&lt;/li&gt;
&lt;li&gt;Plugins allow endless extensibility without modifying core&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scheduling &amp;amp; Flexibility&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose appropriate scheduling method (time, event, or hybrid)&lt;/li&gt;
&lt;li&gt;Event-driven scheduling (Datasets) perfect for data pipelines&lt;/li&gt;
&lt;li&gt;Timetable API enables business logic in scheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Reliability &amp;amp; Operations&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries with exponential backoff handle transient failures&lt;/li&gt;
&lt;li&gt;Proper error handling prevents cascading failures&lt;/li&gt;
&lt;li&gt;SLAs enforce performance agreements&lt;/li&gt;
&lt;li&gt;Comprehensive monitoring catches issues early&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scalability &amp;amp; Performance&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple executor options for different scales&lt;/li&gt;
&lt;li&gt;Deferrable operators handle 1000s of concurrent waits&lt;/li&gt;
&lt;li&gt;Performance improvements: 1.10 → 2.5 → 3.0 shows 5-6x speedup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real-World Success&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Airflow powers ETL/ELT pipelines at hundreds of companies&lt;/li&gt;
&lt;li&gt;Used for data engineering, ML orchestration, DevOps automation&lt;/li&gt;
&lt;li&gt;Solves real problems: dependency management, scheduling, monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Starting Your Airflow Journey
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;First Step&lt;/strong&gt;: Understand architecture and core concepts (DAGs, operators, scheduling)&lt;br&gt;
&lt;strong&gt;Next Step&lt;/strong&gt;: Build simple DAGs with TaskFlow API&lt;br&gt;
&lt;strong&gt;Then&lt;/strong&gt;: Add error handling, retries, monitoring&lt;br&gt;
&lt;strong&gt;Finally&lt;/strong&gt;: Optimize for scale, implement production patterns&lt;/p&gt;

&lt;h3&gt;
  
  
  Airflow is Not**
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A replacement for application code&lt;/li&gt;
&lt;li&gt;Suitable for real-time streaming (use Kafka, Spark Streaming)&lt;/li&gt;
&lt;li&gt;A replacement for data warehouses (use Snowflake, BigQuery)&lt;/li&gt;
&lt;li&gt;Simple enough to learn in an hour (takes weeks/months)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Airflow is Perfect For**
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled batch processing (daily, hourly, etc.)&lt;/li&gt;
&lt;li&gt;Multi-stage ETL/ELT workflows&lt;/li&gt;
&lt;li&gt;Cross-system orchestration&lt;/li&gt;
&lt;li&gt;Data quality validation&lt;/li&gt;
&lt;li&gt;DevOps automation&lt;/li&gt;
&lt;li&gt;ML pipeline orchestration&lt;/li&gt;
&lt;li&gt;Report generation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Final Wisdom
&lt;/h3&gt;

&lt;p&gt;Success with Airflow depends not just on technical setup, but on organizational practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat DAGs as code (version control, code review, testing)&lt;/li&gt;
&lt;li&gt;Monitor continuously (you can't fix what you don't see)&lt;/li&gt;
&lt;li&gt;Document thoroughly (future you and teammates will thank you)&lt;/li&gt;
&lt;li&gt;Automate responses (reduce manual incident response)&lt;/li&gt;
&lt;li&gt;Iterate and improve (perfect is enemy of done)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Airflow's strength lies in its flexibility, active community, and ecosystem. Whether you're building simple daily pipelines or complex multi-stage data workflows orchestrating hundreds of systems, Airflow provides the tools, patterns, and reliability to succeed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start small. Learn thoroughly. Scale thoughtfully. Monitor obsessively.&lt;/strong&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>devops</category>
      <category>python</category>
      <category>airflow</category>
    </item>
    <item>
      <title>Bookmark - AWS Services with Key Features</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Mon, 05 Jan 2026 05:40:29 +0000</pubDate>
      <link>https://forem.com/datatechbridge/bookmark-aws-services-with-key-features-565i</link>
      <guid>https://forem.com/datatechbridge/bookmark-aws-services-with-key-features-565i</guid>
      <description>&lt;h2&gt;
  
  
  📑 Table of Contents
&lt;/h2&gt;

&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Services Covered&lt;/th&gt;
&lt;th&gt;Jump To&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MSK, Kinesis, Glue, Redshift, Athena, EMR, QuickSight, Lake Formation, DataZone, OpenSearch, CloudSearch, Data Pipeline&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Compute&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;EC2, Lambda, Fargate, Elastic Beanstalk, App Runner, Batch, Lightsail, Outposts, Wavelength, Image Builder&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;S3, EBS, EFS, FSx, Glacier, Storage Gateway, Snow Family, DataSync, Backup, Elastic Disaster Recovery&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Database&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RDS, Aurora, DynamoDB, DocumentDB, ElastiCache, MemoryDB, Neptune, QLDB, Timestream, Keyspaces, DMS&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Networking &amp;amp; Content Delivery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VPC, CloudFront, Route 53, API Gateway, Direct Connect, ELB, Global Accelerator, Transit Gateway, App Mesh, VPN&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Security, Identity &amp;amp; Compliance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IAM, Cognito, Secrets Manager, GuardDuty, Inspector, Macie, Security Hub, Shield, WAF, KMS, CloudHSM, Certificate Manager&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Machine Learning / AI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SageMaker, Rekognition, Comprehend, Polly, Transcribe, Translate, Lex, Personalize, Forecast, Textract, Kendra, Bedrock, CodeWhisperer&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Developer Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CodeCommit, CodeBuild, CodeDeploy, CodePipeline, Cloud9, CloudShell, X-Ray, CodeArtifact, CodeGuru, CloudFormation, CDK, CLI&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Management &amp;amp; Governance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CloudWatch, CloudFormation, CloudTrail, Config, Systems Manager, OpsWorks, Service Catalog, Trusted Advisor, Control Tower, Organizations&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Application Integration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SQS, SNS, Step Functions, EventBridge, MQ, AppFlow, AppSync&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Containers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ECS, EKS, ECR, Fargate, App2Container, Copilot&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Migration &amp;amp; Transfer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migration Hub, Application Migration Service, Application Discovery Service, DataSync, Transfer Family, Snow Family, Mainframe Modernization&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Frontend Web &amp;amp; Mobile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Amplify, Device Farm, Location Service, AppSync&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Business Applications&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WorkMail, WorkDocs, SES, Chime, Connect, Pinpoint, Supply Chain, Wickr&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;End User Computing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;WorkSpaces, AppStream 2.0, WorkSpaces Web&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Internet of Things (IoT)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IoT Core, IoT Greengrass, IoT Device Management, IoT Device Defender, IoT Analytics, IoT Events, IoT SiteWise, IoT TwinMaker, IoT FleetWise&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Media Services&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Elastic Transcoder, MediaConvert, MediaLive, MediaConnect, MediaPackage, MediaTailor, MediaStore, Kinesis Video Streams, IVS, Nimble Studio&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Game Development&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GameLift&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Robotics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;RoboMaker&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Blockchain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Blockchain&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Quantum Technologies&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Braket&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Satellite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ground Station&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Customer Engagement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Activate&lt;/td&gt;
&lt;td&gt;View ↓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Quick Reference Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Total Services&lt;/th&gt;
&lt;th&gt;Primary Use Cases&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Analytics&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Data warehousing, ETL, real-time streaming, BI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Virtual servers, serverless, containers, batch processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Object storage, block storage, file systems, backup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Database&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Relational, NoSQL, in-memory, graph, time-series&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Networking&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;VPC, CDN, DNS, load balancing, connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;Identity, encryption, threat detection, compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML/AI&lt;/td&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Model training, computer vision, NLP, generative AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Tools&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;CI/CD, IDE, debugging, IaC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Management&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;Monitoring, configuration, optimization, governance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Integration&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Messaging, workflows, event-driven architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Containers&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Orchestration, registry, serverless containers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Migration&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Server migration, data transfer, modernization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web &amp;amp; Mobile&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Full-stack development, testing, location services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business Apps&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Email, collaboration, contact center, marketing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End User Computing&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Virtual desktops, application streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoT&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Device connectivity, edge computing, analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Media&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Video transcoding, live streaming, content delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Other&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Gaming, robotics, blockchain, quantum, satellite&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;1. ANALYTICS&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Streaming &amp;amp; Real-time Processing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kinesis Data Streams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time data streaming platform&lt;/td&gt;
&lt;td&gt;• Serverless scaling&lt;br&gt;• Sub-second latency&lt;br&gt;• Durable data retention (1-365 days)&lt;br&gt;• Real-time processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kinesis Data Firehose&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Load streaming data to destinations&lt;/td&gt;
&lt;td&gt;• Zero administration&lt;br&gt;• Auto-scaling&lt;br&gt;• Data transformation&lt;br&gt;• Direct S3/Redshift/OpenSearch integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kinesis Data Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Real-time stream analytics&lt;/td&gt;
&lt;td&gt;• SQL/Apache Flink support&lt;br&gt;• Serverless&lt;br&gt;• Pre-built ML algorithms&lt;br&gt;• Real-time dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kinesis Video Streams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Stream video from devices&lt;/td&gt;
&lt;td&gt;• Live/on-demand playback&lt;br&gt;• ML integration&lt;br&gt;• Encrypted storage&lt;br&gt;• Time-indexed storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MSK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Apache Kafka&lt;/td&gt;
&lt;td&gt;• Fully compatible with Apache Kafka&lt;br&gt;• Auto-healing&lt;br&gt;• Multi-AZ deployment&lt;br&gt;• Serverless option available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Data Warehouse &amp;amp; Querying
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Redshift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Petabyte-scale data warehouse&lt;/td&gt;
&lt;td&gt;• Columnar storage&lt;br&gt;• Massively parallel processing&lt;br&gt;• ML integration&lt;br&gt;• Redshift Spectrum (query S3 directly)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Athena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless SQL queries on S3&lt;/td&gt;
&lt;td&gt;• Pay per query&lt;br&gt;• Standard SQL&lt;br&gt;• JDBC/ODBC drivers&lt;br&gt;• No infrastructure management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  ETL &amp;amp; Data Preparation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Glue&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless ETL service&lt;/td&gt;
&lt;td&gt;• Auto schema discovery&lt;br&gt;• Data Catalog&lt;br&gt;• Job scheduling&lt;br&gt;• Python/Scala support&lt;br&gt;• Built-in transforms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data Pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;td&gt;• Data movement automation&lt;br&gt;• Schedule-based execution&lt;br&gt;• Fault-tolerant&lt;br&gt;• On-premises integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EMR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Big data processing framework&lt;/td&gt;
&lt;td&gt;• Hadoop/Spark/Hive/Presto&lt;br&gt;• Auto-scaling&lt;br&gt;• Spot instance support&lt;br&gt;• Notebook integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Data Lake &amp;amp; Governance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lake Formation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build and manage data lakes&lt;/td&gt;
&lt;td&gt;• Centralized permissions&lt;br&gt;• Data ingestion workflows&lt;br&gt;• Column-level security&lt;br&gt;• Blueprint templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataZone&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data governance and cataloging&lt;/td&gt;
&lt;td&gt;• Business data catalog&lt;br&gt;• Access management&lt;br&gt;• Data discovery&lt;br&gt;• Cross-account sharing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Search &amp;amp; Visualization
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenSearch Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Search and log analytics&lt;/td&gt;
&lt;td&gt;• Full-text search&lt;br&gt;• Log analytics&lt;br&gt;• Application monitoring&lt;br&gt;• Kibana dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed search service&lt;/td&gt;
&lt;td&gt;• Auto-scaling&lt;br&gt;• 34 languages support&lt;br&gt;• Faceting and highlighting&lt;br&gt;• Geospatial search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QuickSight&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business intelligence&lt;/td&gt;
&lt;td&gt;• ML-powered insights&lt;br&gt;• Embedded analytics&lt;br&gt;• SPICE in-memory engine&lt;br&gt;• Pay-per-session pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. COMPUTE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Virtual Servers
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EC2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtual servers in cloud&lt;/td&gt;
&lt;td&gt;• 500+ instance types&lt;br&gt;• Multiple pricing models (On-Demand, Reserved, Spot)&lt;br&gt;• Auto Scaling&lt;br&gt;• Elastic IPs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lightsail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simplified VPS&lt;/td&gt;
&lt;td&gt;• Predictable pricing&lt;br&gt;• Pre-configured stacks&lt;br&gt;• Easy DNS management&lt;br&gt;• 1-click applications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Outposts&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-premises AWS infrastructure&lt;/td&gt;
&lt;td&gt;• Local compute/storage&lt;br&gt;• AWS APIs on-premises&lt;br&gt;• Hybrid deployment&lt;br&gt;• Consistent experience&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Serverless Compute
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lambda&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event-driven serverless functions&lt;/td&gt;
&lt;td&gt;• Pay per invocation&lt;br&gt;• Auto-scaling&lt;br&gt;• 15-minute max execution&lt;br&gt;• Multiple runtime support&lt;br&gt;• VPC integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fargate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless containers&lt;/td&gt;
&lt;td&gt;• No server management&lt;br&gt;• Right-sized resources&lt;br&gt;• ECS/EKS compatible&lt;br&gt;• Per-second billing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Container &amp;amp; App Deployment
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic Beanstalk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;PaaS application deployment&lt;/td&gt;
&lt;td&gt;• Multiple platforms (Java, .NET, Node.js, etc.)&lt;br&gt;• Auto-scaling&lt;br&gt;• Monitoring included&lt;br&gt;• Full control of resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;App Runner&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Containerized web apps&lt;/td&gt;
&lt;td&gt;• Source-to-production&lt;br&gt;• Auto-scaling&lt;br&gt;• Load balancing included&lt;br&gt;• CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Batch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batch processing at scale&lt;/td&gt;
&lt;td&gt;• Job scheduling&lt;br&gt;• Dynamic resource provisioning&lt;br&gt;• Spot instance support&lt;br&gt;• Array jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Edge Computing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wavelength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ultra-low latency for 5G&lt;/td&gt;
&lt;td&gt;• 5G edge computing&lt;br&gt;• Single-digit millisecond latency&lt;br&gt;• VPC extension&lt;br&gt;• AWS services at edge&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Image Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EC2 Image Builder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated AMI creation&lt;/td&gt;
&lt;td&gt;• Pipeline automation&lt;br&gt;• Vulnerability testing&lt;br&gt;• Multi-region distribution&lt;br&gt;• Version management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. STORAGE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Object Storage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Object storage&lt;/td&gt;
&lt;td&gt;• 11 9's durability&lt;br&gt;• Storage classes (Standard, IA, Glacier, etc.)&lt;br&gt;• Versioning&lt;br&gt;• Lifecycle policies&lt;br&gt;• Event notifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S3 Glacier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Long-term archival&lt;/td&gt;
&lt;td&gt;• Extremely low cost&lt;br&gt;• 3 retrieval options (Instant, Flexible, Deep)&lt;br&gt;• Vault lock compliance&lt;br&gt;• Encrypted by default&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Block Storage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EBS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;EC2 block storage&lt;/td&gt;
&lt;td&gt;• Multiple volume types (SSD/HDD)&lt;br&gt;• Snapshots&lt;br&gt;• Encryption&lt;br&gt;• Up to 64 TB per volume&lt;br&gt;• Multi-Attach option&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  File Storage
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EFS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Elastic NFS file system&lt;/td&gt;
&lt;td&gt;• Shared file storage&lt;br&gt;• Auto-scaling&lt;br&gt;• Linux workloads&lt;br&gt;• Multiple storage classes&lt;br&gt;• Lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSx for Windows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Windows file server&lt;/td&gt;
&lt;td&gt;• SMB protocol&lt;br&gt;• Active Directory integration&lt;br&gt;• DFS namespaces&lt;br&gt;• SSD/HDD options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSx for Lustre&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High-performance computing&lt;/td&gt;
&lt;td&gt;• Sub-millisecond latency&lt;br&gt;• S3 integration&lt;br&gt;• Parallel file system&lt;br&gt;• ML/HPC optimized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSx for NetApp ONTAP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NetApp file system&lt;/td&gt;
&lt;td&gt;• Multi-protocol (NFS, SMB, iSCSI)&lt;br&gt;• Snapshots &amp;amp; cloning&lt;br&gt;• Data compression&lt;br&gt;• Hybrid cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FSx for OpenZFS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;OpenZFS file system&lt;/td&gt;
&lt;td&gt;• Linux workloads&lt;br&gt;• Point-in-time snapshots&lt;br&gt;• Data compression&lt;br&gt;• Up to 12.5 GB/s throughput&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Hybrid &amp;amp; Migration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hybrid cloud storage&lt;/td&gt;
&lt;td&gt;• File Gateway (NFS/SMB)&lt;br&gt;• Volume Gateway (iSCSI)&lt;br&gt;• Tape Gateway&lt;br&gt;• Local cache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snow Family&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Physical data transfer&lt;/td&gt;
&lt;td&gt;• Snowcone (8 TB)&lt;br&gt;• Snowball Edge (80 TB)&lt;br&gt;• Snowmobile (100 PB)&lt;br&gt;• Edge computing capable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataSync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Automated data transfer&lt;/td&gt;
&lt;td&gt;• 10x faster than open-source&lt;br&gt;• Bandwidth throttling&lt;br&gt;• Data validation&lt;br&gt;• Schedule-based transfers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Backup &amp;amp; Disaster Recovery
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Backup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centralized backup&lt;/td&gt;
&lt;td&gt;• Policy-based backup&lt;br&gt;• Cross-region backup&lt;br&gt;• Compliance reporting&lt;br&gt;• Lifecycle management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic Disaster Recovery&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application recovery&lt;/td&gt;
&lt;td&gt;• Continuous replication&lt;br&gt;• Point-in-time recovery&lt;br&gt;• Non-disruptive testing&lt;br&gt;• RPO of seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;4. DATABASE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Relational Databases
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RDS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed relational DB&lt;/td&gt;
&lt;td&gt;• 6 engines (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Db2)&lt;br&gt;• Automated backups&lt;br&gt;• Multi-AZ deployment&lt;br&gt;• Read replicas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud-native relational DB&lt;/td&gt;
&lt;td&gt;• 5x faster than MySQL&lt;br&gt;• 3x faster than PostgreSQL&lt;br&gt;• Auto-scaling storage (up to 128 TB)&lt;br&gt;• Up to 15 read replicas&lt;br&gt;• Global database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aurora Serverless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;On-demand Aurora&lt;/td&gt;
&lt;td&gt;• Auto-scaling compute&lt;br&gt;• Pay per second&lt;br&gt;• Automatic pause/resume&lt;br&gt;• ACU-based pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  NoSQL Databases
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DynamoDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Key-value &amp;amp; document DB&lt;/td&gt;
&lt;td&gt;• Single-digit millisecond latency&lt;br&gt;• Auto-scaling&lt;br&gt;• Global tables&lt;br&gt;• Point-in-time recovery&lt;br&gt;• DAX (in-memory cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DocumentDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;MongoDB-compatible&lt;/td&gt;
&lt;td&gt;• Document model&lt;br&gt;• Auto-scaling storage&lt;br&gt;• 6 read replicas&lt;br&gt;• Global clusters&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Keyspaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Cassandra&lt;/td&gt;
&lt;td&gt;• CQL API compatible&lt;br&gt;• Serverless&lt;br&gt;• Single-digit millisecond latency&lt;br&gt;• Encryption at rest&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  In-Memory Databases
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ElastiCache&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-memory caching&lt;/td&gt;
&lt;td&gt;• Redis &amp;amp; Memcached support&lt;br&gt;• Microsecond latency&lt;br&gt;• Cluster mode&lt;br&gt;• Backup &amp;amp; restore (Redis)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MemoryDB for Redis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Durable in-memory DB&lt;/td&gt;
&lt;td&gt;• Redis-compatible&lt;br&gt;• Microsecond latency&lt;br&gt;• Durable storage&lt;br&gt;• Multi-AZ transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Specialized Databases
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Neptune&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Graph database&lt;/td&gt;
&lt;td&gt;• Property graph &amp;amp; RDF support&lt;br&gt;• SPARQL &amp;amp; Gremlin queries&lt;br&gt;• 15 read replicas&lt;br&gt;• Point-in-time recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;QLDB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ledger database&lt;/td&gt;
&lt;td&gt;• Immutable journal&lt;br&gt;• Cryptographic verification&lt;br&gt;• SQL-like API&lt;br&gt;• Serverless&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Timestream&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time-series database&lt;/td&gt;
&lt;td&gt;• 1000x faster than relational&lt;br&gt;• 1/10th cost&lt;br&gt;• Auto-scaling&lt;br&gt;• Built-in analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Database Migration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DMS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Database migration&lt;/td&gt;
&lt;td&gt;• Minimal downtime&lt;br&gt;• Homogeneous &amp;amp; heterogeneous migrations&lt;br&gt;• Continuous replication&lt;br&gt;• Schema conversion (SCT)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;5. NETWORKING &amp;amp; CONTENT DELIVERY&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Networking
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VPC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtual private cloud&lt;/td&gt;
&lt;td&gt;• Isolated network&lt;br&gt;• Subnets (public/private)&lt;br&gt;• Route tables&lt;br&gt;• Internet/NAT Gateways&lt;br&gt;• VPC Peering&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transit Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Network hub&lt;/td&gt;
&lt;td&gt;• Connect VPCs &amp;amp; on-premises&lt;br&gt;• Centralized routing&lt;br&gt;• Multi-region support&lt;br&gt;• VPN support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Direct Connect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dedicated network connection&lt;/td&gt;
&lt;td&gt;• 1/10/100 Gbps connections&lt;br&gt;• Consistent bandwidth&lt;br&gt;• Lower egress costs&lt;br&gt;• Private connectivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VPN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encrypted connection&lt;/td&gt;
&lt;td&gt;• Site-to-Site VPN&lt;br&gt;• Client VPN&lt;br&gt;• IPsec encryption&lt;br&gt;• Accelerated VPN option&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Private 5G&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Private mobile network&lt;/td&gt;
&lt;td&gt;• On-premises 5G&lt;br&gt;• Automated deployment&lt;br&gt;• SIM management&lt;br&gt;• Pay-as-you-go&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Load Balancing &amp;amp; Scaling
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ELB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Load balancing&lt;/td&gt;
&lt;td&gt;• Application LB (Layer 7)&lt;br&gt;• Network LB (Layer 4)&lt;br&gt;• Gateway LB (Layer 3)&lt;br&gt;• Auto-scaling integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Global Accelerator&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global traffic management&lt;/td&gt;
&lt;td&gt;• Static anycast IPs&lt;br&gt;• AWS global network&lt;br&gt;• Health checks&lt;br&gt;• DDoS protection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Content Delivery &amp;amp; DNS
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudFront&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CDN service&lt;/td&gt;
&lt;td&gt;• 400+ edge locations&lt;br&gt;• Real-time metrics&lt;br&gt;• Lambda@Edge&lt;br&gt;• Origin shield&lt;br&gt;• WebSocket support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Route 53&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DNS service&lt;/td&gt;
&lt;td&gt;• 100% uptime SLA&lt;br&gt;• Health checks&lt;br&gt;• Multiple routing policies&lt;br&gt;• Domain registration&lt;br&gt;• DNSSEC&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  API &amp;amp; Service Mesh
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API management&lt;/td&gt;
&lt;td&gt;• REST/HTTP/WebSocket APIs&lt;br&gt;• Throttling &amp;amp; caching&lt;br&gt;• API versioning&lt;br&gt;• Custom domains&lt;br&gt;• Authorization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;App Mesh&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service mesh&lt;/td&gt;
&lt;td&gt;• Traffic control&lt;br&gt;• Service discovery&lt;br&gt;• Observability&lt;br&gt;• Envoy proxy-based&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud Map&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Service discovery&lt;/td&gt;
&lt;td&gt;• DNS &amp;amp; API-based discovery&lt;br&gt;• Health checking&lt;br&gt;• Auto registration&lt;br&gt;• Multi-region&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;6. SECURITY, IDENTITY &amp;amp; COMPLIANCE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Identity &amp;amp; Access Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access management&lt;/td&gt;
&lt;td&gt;• Users, groups, roles&lt;br&gt;• Policies (JSON)&lt;br&gt;• MFA support&lt;br&gt;• Federation&lt;br&gt;• Fine-grained permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IAM Identity Center (SSO)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single sign-on&lt;/td&gt;
&lt;td&gt;• Centralized access&lt;br&gt;• Multiple accounts&lt;br&gt;• SAML 2.0&lt;br&gt;• Active Directory integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cognito&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;User authentication&lt;/td&gt;
&lt;td&gt;• User pools&lt;br&gt;• Identity pools&lt;br&gt;• Social login&lt;br&gt;• MFA&lt;br&gt;• SAML/OIDC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Directory Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Active Directory&lt;/td&gt;
&lt;td&gt;• AWS Managed Microsoft AD&lt;br&gt;• AD Connector&lt;br&gt;• Simple AD&lt;br&gt;• Trust relationships&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Encryption &amp;amp; Key Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;KMS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Encryption key management&lt;/td&gt;
&lt;td&gt;• Create/manage keys&lt;br&gt;• Envelope encryption&lt;br&gt;• Audit with CloudTrail&lt;br&gt;• AWS service integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudHSM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardware security module&lt;/td&gt;
&lt;td&gt;• FIPS 140-2 Level 3&lt;br&gt;• Single-tenant&lt;br&gt;• Full control of keys&lt;br&gt;• Cluster support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Secrets Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secret rotation&lt;/td&gt;
&lt;td&gt;• Auto-rotation&lt;br&gt;• RDS integration&lt;br&gt;• Versioning&lt;br&gt;• Cross-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Certificate Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SSL/TLS certificates&lt;/td&gt;
&lt;td&gt;• Free public certificates&lt;br&gt;• Auto-renewal&lt;br&gt;• Private CA&lt;br&gt;• AWS service integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Threat Detection &amp;amp; Protection
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GuardDuty&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Threat detection&lt;/td&gt;
&lt;td&gt;• ML-based anomaly detection&lt;br&gt;• VPC Flow Logs analysis&lt;br&gt;• CloudTrail analysis&lt;br&gt;• DNS logs monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inspector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vulnerability assessment&lt;/td&gt;
&lt;td&gt;• EC2 &amp;amp; container scanning&lt;br&gt;• Network reachability&lt;br&gt;• CVE detection&lt;br&gt;• Continuous scanning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Macie&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data privacy protection&lt;/td&gt;
&lt;td&gt;• Sensitive data discovery&lt;br&gt;• PII detection&lt;br&gt;• ML-powered&lt;br&gt;• S3 bucket scanning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Detective&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security investigation&lt;/td&gt;
&lt;td&gt;• Automatic data collection&lt;br&gt;• Visualizations&lt;br&gt;• Root cause analysis&lt;br&gt;• 1-year data retention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Network Security
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WAF&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Web application firewall&lt;/td&gt;
&lt;td&gt;• SQL injection protection&lt;br&gt;• XSS protection&lt;br&gt;• Rate limiting&lt;br&gt;• Custom rules&lt;br&gt;• Managed rule groups&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Shield&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DDoS protection&lt;/td&gt;
&lt;td&gt;• Shield Standard (free)&lt;br&gt;• Shield Advanced (paid)&lt;br&gt;• 24/7 DDoS Response Team&lt;br&gt;• Cost protection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network Firewall&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VPC network protection&lt;/td&gt;
&lt;td&gt;• Stateful/stateless rules&lt;br&gt;• IPS capabilities&lt;br&gt;• Domain filtering&lt;br&gt;• TLS inspection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Firewall Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Centralized firewall management&lt;/td&gt;
&lt;td&gt;• Multi-account&lt;br&gt;• WAF/Shield/Network Firewall&lt;br&gt;• Policy compliance&lt;br&gt;• Auto-remediation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Compliance &amp;amp; Governance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Security Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security posture management&lt;/td&gt;
&lt;td&gt;• Multi-account aggregation&lt;br&gt;• Compliance checks&lt;br&gt;• Finding prioritization&lt;br&gt;• Integration with 30+ services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Audit compliance&lt;/td&gt;
&lt;td&gt;• Pre-built frameworks&lt;br&gt;• Evidence collection&lt;br&gt;• Assessment reports&lt;br&gt;• Continuous auditing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Artifact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compliance reports&lt;/td&gt;
&lt;td&gt;• On-demand access&lt;br&gt;• SOC reports&lt;br&gt;• PCI reports&lt;br&gt;• ISO certifications&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource sharing&lt;/td&gt;
&lt;td&gt;• Cross-account sharing&lt;br&gt;• No data transfer charges&lt;br&gt;• 30+ resource types&lt;br&gt;• Organization integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verified Permissions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fine-grained authorization&lt;/td&gt;
&lt;td&gt;• Policy-based access control&lt;br&gt;• Cedar policy language&lt;br&gt;• Real-time decisions&lt;br&gt;• Audit logging&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;7. MACHINE LEARNING / AI&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ML Platform &amp;amp; Training
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SageMaker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;ML platform&lt;/td&gt;
&lt;td&gt;• Jupyter notebooks&lt;br&gt;• Built-in algorithms&lt;br&gt;• Auto model tuning&lt;br&gt;• Model deployment&lt;br&gt;• MLOps tools&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SageMaker Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IDE for ML&lt;/td&gt;
&lt;td&gt;• Visual interface&lt;br&gt;• Experiment tracking&lt;br&gt;• Model registry&lt;br&gt;• Feature store&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SageMaker Ground Truth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Data labeling&lt;/td&gt;
&lt;td&gt;• Human labeling&lt;br&gt;• Active learning&lt;br&gt;• Custom workflows&lt;br&gt;• Quality control&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Computer Vision
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rekognition&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Image/video analysis&lt;/td&gt;
&lt;td&gt;• Object detection&lt;br&gt;• Facial analysis&lt;br&gt;• Text extraction&lt;br&gt;• Content moderation&lt;br&gt;• Celebrity recognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lookout for Vision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual defect detection&lt;/td&gt;
&lt;td&gt;• Industrial inspection&lt;br&gt;• Custom models&lt;br&gt;• Edge deployment&lt;br&gt;• Anomaly detection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Natural Language Processing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Comprehend&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text analysis&lt;/td&gt;
&lt;td&gt;• Entity recognition&lt;br&gt;• Sentiment analysis&lt;br&gt;• Language detection&lt;br&gt;• Topic modeling&lt;br&gt;• Custom classification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transcribe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Speech-to-text&lt;/td&gt;
&lt;td&gt;• Automatic punctuation&lt;br&gt;• Speaker identification&lt;br&gt;• Custom vocabulary&lt;br&gt;• Real-time streaming&lt;br&gt;• Medical transcription&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Translate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Language translation&lt;/td&gt;
&lt;td&gt;• 75+ languages&lt;br&gt;• Custom terminology&lt;br&gt;• Real-time &amp;amp; batch&lt;br&gt;• Auto language detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Textract&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Document text extraction&lt;/td&gt;
&lt;td&gt;• OCR&lt;br&gt;• Form extraction&lt;br&gt;• Table extraction&lt;br&gt;• Identity documents&lt;br&gt;• Invoice processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kendra&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Enterprise search&lt;/td&gt;
&lt;td&gt;• Natural language queries&lt;br&gt;• 90+ data sources&lt;br&gt;• ML-powered relevance&lt;br&gt;• Incremental learning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Speech &amp;amp; Conversational AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Polly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Text-to-speech&lt;/td&gt;
&lt;td&gt;• 60+ voices&lt;br&gt;• 30+ languages&lt;br&gt;• Neural TTS&lt;br&gt;• SSML support&lt;br&gt;• Speech marks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lex&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Chatbot building&lt;/td&gt;
&lt;td&gt;• Automatic speech recognition&lt;br&gt;• NLU&lt;br&gt;• Multi-language&lt;br&gt;• Lambda integration&lt;br&gt;• Analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Personalization &amp;amp; Forecasting
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Personalize&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Recommendation engine&lt;/td&gt;
&lt;td&gt;• Real-time personalization&lt;br&gt;• User segmentation&lt;br&gt;• Similar items&lt;br&gt;• Cold start handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Forecast&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Time-series forecasting&lt;/td&gt;
&lt;td&gt;• AutoML&lt;br&gt;• Multiple algorithms&lt;br&gt;• What-if analysis&lt;br&gt;• Explainability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Generative AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bedrock&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Foundation models&lt;/td&gt;
&lt;td&gt;• Claude, Llama, Stable Diffusion&lt;br&gt;• Fine-tuning&lt;br&gt;• RAG support&lt;br&gt;• Knowledge bases&lt;br&gt;• Agents&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeWhisperer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI code assistant&lt;/td&gt;
&lt;td&gt;• Real-time suggestions&lt;br&gt;• Security scanning&lt;br&gt;• Reference tracking&lt;br&gt;• Multiple IDEs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Anomaly Detection &amp;amp; Fraud
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lookout for Metrics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Anomaly detection&lt;/td&gt;
&lt;td&gt;• Auto-anomaly detection&lt;br&gt;• Root cause analysis&lt;br&gt;• Alert configuration&lt;br&gt;• API/SNS integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fraud Detector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fraud detection&lt;/td&gt;
&lt;td&gt;• Pre-built models&lt;br&gt;• Custom rules&lt;br&gt;• Real-time scoring&lt;br&gt;• Account takeover detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Monitron&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Equipment monitoring&lt;/td&gt;
&lt;td&gt;• End-to-end system&lt;br&gt;• Predictive maintenance&lt;br&gt;• Wireless sensors&lt;br&gt;• ML-based analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Healthcare AI
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HealthLake&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Healthcare data management&lt;/td&gt;
&lt;td&gt;• FHIR support&lt;br&gt;• Medical NLP&lt;br&gt;• Patient timeline&lt;br&gt;• HIPAA eligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Comprehend Medical&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medical text analysis&lt;/td&gt;
&lt;td&gt;• Medical entity extraction&lt;br&gt;• PHI detection&lt;br&gt;• ICD-10-CM coding&lt;br&gt;• RxNorm linking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Operations &amp;amp; DevOps ML
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps Guru&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operational insights&lt;/td&gt;
&lt;td&gt;• Anomaly detection&lt;br&gt;• Resource analysis&lt;br&gt;• Recommendations&lt;br&gt;• CloudFormation integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;8. DEVELOPER TOOLS&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Source Control &amp;amp; CI/CD
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeCommit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Git repositories&lt;/td&gt;
&lt;td&gt;• Private Git repos&lt;br&gt;• Unlimited repositories&lt;br&gt;• Encryption at rest/transit&lt;br&gt;• Pull requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeBuild&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Build service&lt;/td&gt;
&lt;td&gt;• Fully managed&lt;br&gt;• Custom build environments&lt;br&gt;• Parallel builds&lt;br&gt;• Build caching&lt;br&gt;• Docker support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeDeploy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deployment automation&lt;/td&gt;
&lt;td&gt;• EC2, Lambda, ECS&lt;br&gt;• Blue/green deployments&lt;br&gt;• Rollback capability&lt;br&gt;• On-premises support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodePipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CI/CD pipeline&lt;/td&gt;
&lt;td&gt;• Visual workflow&lt;br&gt;• Source/build/deploy stages&lt;br&gt;• 3rd party integration&lt;br&gt;• Manual approvals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeStar&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Project management&lt;/td&gt;
&lt;td&gt;• Project templates&lt;br&gt;• Team collaboration&lt;br&gt;• Unified dashboard&lt;br&gt;• JIRA integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Development Environments
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud IDE&lt;/td&gt;
&lt;td&gt;• Browser-based&lt;br&gt;• Collaborative editing&lt;br&gt;• Lambda development&lt;br&gt;• Terminal access&lt;br&gt;• Debugging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudShell&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser CLI&lt;/td&gt;
&lt;td&gt;• Pre-authenticated&lt;br&gt;• 1 GB storage&lt;br&gt;• Pre-installed tools&lt;br&gt;• Session persistence&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Code Quality &amp;amp; Debugging
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeGuru&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code reviews&lt;/td&gt;
&lt;td&gt;• ML-powered reviews&lt;br&gt;• Performance recommendations&lt;br&gt;• Security vulnerability detection&lt;br&gt;• Java/Python support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Distributed tracing&lt;/td&gt;
&lt;td&gt;• Service map&lt;br&gt;• Request tracing&lt;br&gt;• Performance insights&lt;br&gt;• Error analysis&lt;br&gt;• Annotations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Artifact Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CodeArtifact&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Artifact repository&lt;/td&gt;
&lt;td&gt;• Maven/npm/PyPI support&lt;br&gt;• Package versioning&lt;br&gt;• Upstream repositories&lt;br&gt;• Domain structure&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Infrastructure as Code
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudFormation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure as code&lt;/td&gt;
&lt;td&gt;• JSON/YAML templates&lt;br&gt;• Stack management&lt;br&gt;• Change sets&lt;br&gt;• Drift detection&lt;br&gt;• StackSets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CDK&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud Development Kit&lt;/td&gt;
&lt;td&gt;• TypeScript/Python/Java/C#&lt;br&gt;• High-level constructs&lt;br&gt;• AWS construct library&lt;br&gt;• Synthesize to CloudFormation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application Composer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual serverless design&lt;/td&gt;
&lt;td&gt;• Drag-and-drop&lt;br&gt;• IaC generation&lt;br&gt;• Architecture visualization&lt;br&gt;• Local testing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  CLI &amp;amp; SDKs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CLI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Command-line interface&lt;/td&gt;
&lt;td&gt;• Unified tool&lt;br&gt;• Scripting support&lt;br&gt;• Auto-completion&lt;br&gt;• Multiple profiles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SDKs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Programming SDKs&lt;/td&gt;
&lt;td&gt;• 10+ languages&lt;br&gt;• Service-specific APIs&lt;br&gt;• Automatic retries&lt;br&gt;• Credential management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;9. MANAGEMENT &amp;amp; GOVERNANCE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Monitoring &amp;amp; Observability
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudWatch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Monitoring service&lt;/td&gt;
&lt;td&gt;• Metrics&lt;br&gt;• Logs&lt;br&gt;• Alarms&lt;br&gt;• Dashboards&lt;br&gt;• Events&lt;br&gt;• Container insights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudWatch Logs Insights&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Log analysis&lt;/td&gt;
&lt;td&gt;• Query language&lt;br&gt;• Visualizations&lt;br&gt;• Pattern detection&lt;br&gt;• Anomaly detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;X-Ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application tracing&lt;/td&gt;
&lt;td&gt;• Request tracing&lt;br&gt;• Service maps&lt;br&gt;• Performance analysis&lt;br&gt;• Error detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed Grafana&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grafana dashboards&lt;/td&gt;
&lt;td&gt;• Fully managed&lt;br&gt;• Multiple data sources&lt;br&gt;• Team workspaces&lt;br&gt;• SAML/LDAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed Prometheus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container monitoring&lt;/td&gt;
&lt;td&gt;• Prometheus-compatible&lt;br&gt;• Auto-scaling&lt;br&gt;• PromQL support&lt;br&gt;• Long-term storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Configuration &amp;amp; Compliance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Config&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource configuration tracking&lt;/td&gt;
&lt;td&gt;• Configuration history&lt;br&gt;• Compliance rules&lt;br&gt;• Resource relationships&lt;br&gt;• Change notifications&lt;br&gt;• Multi-account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudTrail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;API activity logging&lt;/td&gt;
&lt;td&gt;• Account activity history&lt;br&gt;• Governance &amp;amp; compliance&lt;br&gt;• Security analysis&lt;br&gt;• Event history (90 days)&lt;br&gt;• Organization trails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Systems Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Operational management&lt;/td&gt;
&lt;td&gt;• Run Command&lt;br&gt;• Patch Manager&lt;br&gt;• Session Manager&lt;br&gt;• Parameter Store&lt;br&gt;• Automation&lt;br&gt;• OpsCenter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Infrastructure Provisioning
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CloudFormation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Infrastructure as code&lt;/td&gt;
&lt;td&gt;• Template-based&lt;br&gt;• Stack management&lt;br&gt;• Drift detection&lt;br&gt;• Change sets&lt;br&gt;• StackSets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpsWorks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Configuration management&lt;/td&gt;
&lt;td&gt;• Chef/Puppet&lt;br&gt;• Stack layers&lt;br&gt;• Auto-healing&lt;br&gt;• Lifecycle events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Catalog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IT service catalog&lt;/td&gt;
&lt;td&gt;• Approved products&lt;br&gt;• Self-service&lt;br&gt;• Governance&lt;br&gt;• Portfolio management&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Proton&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container/serverless deployment&lt;/td&gt;
&lt;td&gt;• Service templates&lt;br&gt;• Environment management&lt;br&gt;• CI/CD integration&lt;br&gt;• Infrastructure standards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Multi-Account Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Organizations&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Account management&lt;/td&gt;
&lt;td&gt;• Consolidated billing&lt;br&gt;• Organizational units&lt;br&gt;• Service control policies&lt;br&gt;• Account creation API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control Tower&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Multi-account governance&lt;/td&gt;
&lt;td&gt;• Account factory&lt;br&gt;• Guardrails&lt;br&gt;• Dashboard&lt;br&gt;• Drift detection&lt;br&gt;• Integrated AWS services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resource Groups&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource organization&lt;/td&gt;
&lt;td&gt;• Tag-based groups&lt;br&gt;• Bulk operations&lt;br&gt;• Custom views&lt;br&gt;• Automation targets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Optimization &amp;amp; Recommendations
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Trusted Advisor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best practice checks&lt;/td&gt;
&lt;td&gt;• Cost optimization&lt;br&gt;• Performance&lt;br&gt;• Security&lt;br&gt;• Fault tolerance&lt;br&gt;• Service limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Compute Optimizer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resource recommendations&lt;/td&gt;
&lt;td&gt;• EC2, Lambda, EBS, ECS&lt;br&gt;• ML-based recommendations&lt;br&gt;• Cost savings projection&lt;br&gt;• Performance risk assessment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Well-Architected Tool&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Architecture review&lt;/td&gt;
&lt;td&gt;• 6 pillars&lt;br&gt;• Workload review&lt;br&gt;• Improvement plans&lt;br&gt;• Lens support&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Quotas&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quota management&lt;/td&gt;
&lt;td&gt;• View quotas&lt;br&gt;• Request increases&lt;br&gt;• CloudWatch alarms&lt;br&gt;• Automated increase requests&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Application Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Launch Wizard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application deployment&lt;/td&gt;
&lt;td&gt;• Size/configure/deploy&lt;br&gt;• SAP/SQL Server/Active Directory&lt;br&gt;• Cost estimates&lt;br&gt;• CloudFormation templates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Resilience Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Resilience management&lt;/td&gt;
&lt;td&gt;• Application resilience&lt;br&gt;• RTO/RPO tracking&lt;br&gt;• Recommendations&lt;br&gt;• Compliance reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;License Manager&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;License management&lt;/td&gt;
&lt;td&gt;• License tracking&lt;br&gt;• Rule enforcement&lt;br&gt;• Usage reporting&lt;br&gt;• Hybrid environments&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Health &amp;amp; Status
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Personal Health Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Account health&lt;/td&gt;
&lt;td&gt;• Service event notifications&lt;br&gt;• Scheduled changes&lt;br&gt;• Proactive notifications&lt;br&gt;• Resource-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Service Health Dashboard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Global service status&lt;/td&gt;
&lt;td&gt;• Regional status&lt;br&gt;• Historical data&lt;br&gt;• RSS feeds&lt;br&gt;• Public access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;10. APPLICATION INTEGRATION&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Messaging
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Message queue&lt;/td&gt;
&lt;td&gt;• Standard/FIFO queues&lt;br&gt;• Unlimited throughput&lt;br&gt;• Message retention (14 days)&lt;br&gt;• Dead letter queues&lt;br&gt;• Batching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SNS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pub/sub messaging&lt;/td&gt;
&lt;td&gt;• Fan-out pattern&lt;br&gt;• Multiple protocols (HTTP, email, SMS, Lambda)&lt;br&gt;• FIFO topics&lt;br&gt;• Message filtering&lt;br&gt;• Mobile push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MQ&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed message broker&lt;/td&gt;
&lt;td&gt;• ActiveMQ &amp;amp; RabbitMQ&lt;br&gt;• Protocol support (AMQP, STOMP, MQTT)&lt;br&gt;• Multi-AZ&lt;br&gt;• CloudWatch integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Workflow &amp;amp; Orchestration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Step Functions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;td&gt;• Visual workflow&lt;br&gt;• State machines&lt;br&gt;• Error handling&lt;br&gt;• Express/Standard workflows&lt;br&gt;• AWS service integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EventBridge&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event bus&lt;/td&gt;
&lt;td&gt;• Event routing&lt;br&gt;• Schema registry&lt;br&gt;• Archive/replay&lt;br&gt;• 90+ AWS sources&lt;br&gt;• Custom applications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Application Integration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppFlow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SaaS integration&lt;/td&gt;
&lt;td&gt;• 50+ applications (Salesforce, SAP, etc.)&lt;br&gt;• Bidirectional sync&lt;br&gt;• Data transformation&lt;br&gt;• Encryption&lt;br&gt;• No code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppSync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GraphQL service&lt;/td&gt;
&lt;td&gt;• Real-time data sync&lt;br&gt;• Offline support&lt;br&gt;• Multiple data sources&lt;br&gt;• Subscriptions&lt;br&gt;• Caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;11. CONTAINERS&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Container Orchestration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ECS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container orchestration&lt;/td&gt;
&lt;td&gt;• Serverless (Fargate) option&lt;br&gt;• EC2 launch type&lt;br&gt;• Service discovery&lt;br&gt;• Auto-scaling&lt;br&gt;• AWS integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;EKS&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Managed Kubernetes&lt;/td&gt;
&lt;td&gt;• Certified Kubernetes&lt;br&gt;• Multi-AZ control plane&lt;br&gt;• Fargate support&lt;br&gt;• IAM integration&lt;br&gt;• Add-ons&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Container Registry
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ECR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container registry&lt;/td&gt;
&lt;td&gt;• Docker/OCI images&lt;br&gt;• Image scanning&lt;br&gt;• Lifecycle policies&lt;br&gt;• Encryption&lt;br&gt;• Replication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Container Compute
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fargate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Serverless containers&lt;/td&gt;
&lt;td&gt;• No EC2 management&lt;br&gt;• Right-sized resources&lt;br&gt;• ECS/EKS compatible&lt;br&gt;• Per-second billing&lt;br&gt;• Spot pricing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Container Tools
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;App2Container&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Containerize apps&lt;/td&gt;
&lt;td&gt;• .NET/Java apps&lt;br&gt;• Auto-discovery&lt;br&gt;• ECS/EKS deployment&lt;br&gt;• CI/CD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Copilot&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Container CLI&lt;/td&gt;
&lt;td&gt;• Simple deployment&lt;br&gt;• Environment management&lt;br&gt;• Service mesh&lt;br&gt;• Pipeline creation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;12. MIGRATION &amp;amp; TRANSFER&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Migration Planning &amp;amp; Tracking
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Migration Hub&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Migration tracking&lt;/td&gt;
&lt;td&gt;• Centralized tracking&lt;br&gt;• Application grouping&lt;br&gt;• Strategy recommendations&lt;br&gt;• Progress dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application Discovery Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Discovery &amp;amp; planning&lt;/td&gt;
&lt;td&gt;• Agentless/agent-based&lt;br&gt;• Server inventory&lt;br&gt;• Dependency mapping&lt;br&gt;• TCO estimates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Server Migration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application Migration Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lift-and-shift migration&lt;/td&gt;
&lt;td&gt;• Continuous replication&lt;br&gt;• Minimal downtime&lt;br&gt;• Automated conversion&lt;br&gt;• Non-disruptive testing&lt;br&gt;• Cut-over orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Server Migration Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VM migration&lt;/td&gt;
&lt;td&gt;• Incremental replication&lt;br&gt;• Multi-server migrations&lt;br&gt;• Automated AMI creation&lt;br&gt;• SMS Connector&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Data Transfer
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DataSync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Online data transfer&lt;/td&gt;
&lt;td&gt;• 10x faster&lt;br&gt;• On-premises to AWS&lt;br&gt;• Between AWS services&lt;br&gt;• Bandwidth throttling&lt;br&gt;• Verification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transfer Family&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File transfers&lt;/td&gt;
&lt;td&gt;• SFTP/FTPS/FTP/AS2&lt;br&gt;• S3/EFS storage&lt;br&gt;• Custom identity&lt;br&gt;• Managed workflows&lt;br&gt;• HIPAA eligible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snow Family&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Offline data transfer&lt;/td&gt;
&lt;td&gt;• Snowcone (8 TB)&lt;br&gt;• Snowball Edge (80 TB)&lt;br&gt;• Snowmobile (100 PB)&lt;br&gt;• Edge computing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Mainframe Migration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Mainframe Modernization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mainframe migration&lt;/td&gt;
&lt;td&gt;• Automated refactoring&lt;br&gt;• Replatform option&lt;br&gt;• Runtime environment&lt;br&gt;• Testing tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;13. FRONTEND WEB &amp;amp; MOBILE&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Web &amp;amp; Mobile Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amplify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full-stack development&lt;/td&gt;
&lt;td&gt;• Frontend hosting&lt;br&gt;• Backend provisioning&lt;br&gt;• CI/CD&lt;br&gt;• Authentication&lt;br&gt;• API/Storage libraries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Amplify Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual development&lt;/td&gt;
&lt;td&gt;• Figma integration&lt;br&gt;• Component library&lt;br&gt;• Data modeling&lt;br&gt;• User management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Mobile Testing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Device Farm&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mobile app testing&lt;/td&gt;
&lt;td&gt;• Real devices&lt;br&gt;• Automated testing&lt;br&gt;• Remote access&lt;br&gt;• Performance metrics&lt;br&gt;• iOS/Android&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Location Services
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Location Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Location-based services&lt;/td&gt;
&lt;td&gt;• Maps&lt;br&gt;• Geocoding&lt;br&gt;• Routing&lt;br&gt;• Geofencing&lt;br&gt;• Tracking&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  API Development
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppSync&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GraphQL API&lt;/td&gt;
&lt;td&gt;• Real-time&lt;br&gt;• Offline sync&lt;br&gt;• Conflict resolution&lt;br&gt;• Multiple data sources&lt;br&gt;• Subscriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;14. BUSINESS APPLICATIONS&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Email &amp;amp; Collaboration
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WorkMail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Business email&lt;/td&gt;
&lt;td&gt;• 50 GB mailbox&lt;br&gt;• Calendar&lt;br&gt;• Contacts&lt;br&gt;• Mobile access&lt;br&gt;• AD integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WorkDocs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Document collaboration&lt;/td&gt;
&lt;td&gt;• 1 TB storage/user&lt;br&gt;• Version control&lt;br&gt;• Comments/feedback&lt;br&gt;• Mobile apps&lt;br&gt;• Activity feed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SES&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Email sending/receiving&lt;/td&gt;
&lt;td&gt;• Transactional email&lt;br&gt;• Marketing email&lt;br&gt;• Deliverability dashboard&lt;br&gt;• Reputation metrics&lt;br&gt;• Inbound email&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Communication &amp;amp; Contact Center
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chime&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Online meetings&lt;/td&gt;
&lt;td&gt;• HD video/audio&lt;br&gt;• Screen sharing&lt;br&gt;• Chat&lt;br&gt;• Meeting dial-in&lt;br&gt;• SDK for custom apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Connect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cloud contact center&lt;/td&gt;
&lt;td&gt;• Omnichannel&lt;br&gt;• AI-powered&lt;br&gt;• Pay-per-use&lt;br&gt;• CRM integration&lt;br&gt;• Voice/chat/task routing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Wickr&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Secure communications&lt;/td&gt;
&lt;td&gt;• End-to-end encryption&lt;br&gt;• Compliance&lt;br&gt;• File sharing&lt;br&gt;• Video calling&lt;br&gt;• Retention policies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Marketing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinpoint&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Marketing campaigns&lt;/td&gt;
&lt;td&gt;• Email/SMS/push/voice&lt;br&gt;• Segmentation&lt;br&gt;• A/B testing&lt;br&gt;• Analytics&lt;br&gt;• Journey orchestration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Supply Chain
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Supply Chain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Supply chain visibility&lt;/td&gt;
&lt;td&gt;• Data unification&lt;br&gt;• ML insights&lt;br&gt;• Risk monitoring&lt;br&gt;• Demand planning&lt;br&gt;• Collaboration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;15. END USER COMPUTING&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Virtual Desktops
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WorkSpaces&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Virtual desktops&lt;/td&gt;
&lt;td&gt;• Windows/Linux&lt;br&gt;• Persistent storage&lt;br&gt;• Various bundles&lt;br&gt;• Hourly/monthly billing&lt;br&gt;• BYOL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;WorkSpaces Web&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Browser-based access&lt;/td&gt;
&lt;td&gt;• No VPN needed&lt;br&gt;• Secure browser&lt;br&gt;• SAML 2.0&lt;br&gt;• Session recording&lt;br&gt;• Data loss prevention&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Application Streaming
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AppStream 2.0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Application streaming&lt;/td&gt;
&lt;td&gt;• Any HTML5 browser&lt;br&gt;• Persistent storage&lt;br&gt;• Fleet auto-scaling&lt;br&gt;• User pool/federation&lt;br&gt;• Windows apps&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;16. INTERNET OF THINGS (IoT)&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  IoT Core Services
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Core&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Connect IoT devices&lt;/td&gt;
&lt;td&gt;• Device gateway&lt;br&gt;• Message broker&lt;br&gt;• Rules engine&lt;br&gt;• Device shadows&lt;br&gt;• Registry&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Greengrass&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Edge computing&lt;/td&gt;
&lt;td&gt;• Local compute&lt;br&gt;• ML inference&lt;br&gt;• Offline operation&lt;br&gt;• Lambda at edge&lt;br&gt;• Device sync&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  IoT Management
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Device Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Device lifecycle&lt;/td&gt;
&lt;td&gt;• Bulk registration&lt;br&gt;• Remote access&lt;br&gt;• OTA updates&lt;br&gt;• Fleet indexing&lt;br&gt;• Jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Device Defender&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IoT security&lt;/td&gt;
&lt;td&gt;• Audit policies&lt;br&gt;• Anomaly detection&lt;br&gt;• Alerts&lt;br&gt;• Mitigation actions&lt;br&gt;• ML Detect&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  IoT Analytics &amp;amp; Applications
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Analytics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;IoT data analytics&lt;/td&gt;
&lt;td&gt;• Data collection&lt;br&gt;• Processing pipelines&lt;br&gt;• Time-series store&lt;br&gt;• SQL queries&lt;br&gt;• Jupyter notebooks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT Events&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Event detection&lt;/td&gt;
&lt;td&gt;• Input detectors&lt;br&gt;• State machines&lt;br&gt;• Actions&lt;br&gt;• Alarm models&lt;br&gt;• Complex event processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT SiteWise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Industrial data collection&lt;/td&gt;
&lt;td&gt;• Asset modeling&lt;br&gt;• Edge gateways&lt;br&gt;• Dashboards&lt;br&gt;• Metrics computation&lt;br&gt;• SiteWise Monitor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT TwinMaker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Digital twins&lt;/td&gt;
&lt;td&gt;• 3D visualizations&lt;br&gt;• Entity modeling&lt;br&gt;• Data integration&lt;br&gt;• Scene composer&lt;br&gt;• Real-time sync&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Automotive IoT
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IoT FleetWise&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Vehicle data collection&lt;/td&gt;
&lt;td&gt;• Data modeling&lt;br&gt;• Campaign management&lt;br&gt;• Edge processing&lt;br&gt;• Cloud transfer&lt;br&gt;• Signal decoding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;17. MEDIA SERVICES&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Video Processing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Elastic Transcoder&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Media transcoding&lt;/td&gt;
&lt;td&gt;• Pre-set templates&lt;br&gt;• Multiple formats&lt;br&gt;• Thumbnails&lt;br&gt;• Watermarks&lt;br&gt;• Cost-effective&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaConvert&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;File-based transcoding&lt;/td&gt;
&lt;td&gt;• Broadcast-grade&lt;br&gt;• HDR support&lt;br&gt;• Audio normalization&lt;br&gt;• Automated QC&lt;br&gt;• Job queue&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Live Video
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaLive&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live video encoding&lt;/td&gt;
&lt;td&gt;• Broadcast-grade&lt;br&gt;• Redundant pipelines&lt;br&gt;• Multiple outputs&lt;br&gt;• Ad insertion&lt;br&gt;• Channel scheduling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaConnect&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live video transport&lt;/td&gt;
&lt;td&gt;• Reliable streaming&lt;br&gt;• Encryption&lt;br&gt;• Multi-source failover&lt;br&gt;• Entitlements&lt;br&gt;• VPC integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IVS (Interactive Video Service)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Live streaming&lt;/td&gt;
&lt;td&gt;• Ultra-low latency&lt;br&gt;• Auto-scaling&lt;br&gt;• Time-shifting&lt;br&gt;• Recording&lt;br&gt;• Chat integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Video Packaging &amp;amp; Distribution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaPackage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video packaging&lt;/td&gt;
&lt;td&gt;• Just-in-time packaging&lt;br&gt;• DRM support&lt;br&gt;• Catch-up TV&lt;br&gt;• Live-to-VOD&lt;br&gt;• CDN integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaTailor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ad insertion&lt;/td&gt;
&lt;td&gt;• Server-side ad insertion&lt;br&gt;• Personalized ads&lt;br&gt;• Channel assembly&lt;br&gt;• VAST/VPAID support&lt;br&gt;• CloudWatch metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MediaStore&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video storage&lt;/td&gt;
&lt;td&gt;• Low latency&lt;br&gt;• Predictable performance&lt;br&gt;• Object storage&lt;br&gt;• HTTPS endpoints&lt;br&gt;• Lifecycle policies&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Video Streaming
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Kinesis Video Streams&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Video ingestion&lt;/td&gt;
&lt;td&gt;• Real-time streaming&lt;br&gt;• Playback&lt;br&gt;• ML integration&lt;br&gt;• HLS streaming&lt;br&gt;• Fragment-level access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Content Creation
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Nimble Studio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content creation studio&lt;/td&gt;
&lt;td&gt;• Virtual workstations&lt;br&gt;• Render farm&lt;br&gt;• Shared storage&lt;br&gt;• Active Directory&lt;br&gt;• License management&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;18. GAME DEVELOPMENT&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GameLift&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Game server hosting&lt;/td&gt;
&lt;td&gt;• Auto-scaling&lt;br&gt;• FlexMatch matchmaking&lt;br&gt;• Spot instances&lt;br&gt;• Multi-region fleets&lt;br&gt;• Queue placement&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;19. ROBOTICS&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RoboMaker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Robot development&lt;/td&gt;
&lt;td&gt;• Simulation&lt;br&gt;• Fleet management&lt;br&gt;• ROS support&lt;br&gt;• Gazebo integration&lt;br&gt;• CloudWatch monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;20. BLOCKCHAIN&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Managed Blockchain&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Blockchain networks&lt;/td&gt;
&lt;td&gt;• Hyperledger Fabric&lt;br&gt;• Ethereum&lt;br&gt;• Managed nodes&lt;br&gt;• Voting API&lt;br&gt;• VPC integration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;21. QUANTUM TECHNOLOGIES&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Braket&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quantum computing&lt;/td&gt;
&lt;td&gt;• Multiple quantum hardware&lt;br&gt;• Simulators&lt;br&gt;• Hybrid algorithms&lt;br&gt;• Jupyter notebooks&lt;br&gt;• Pay-per-use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;22. SATELLITE&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ground Station&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Satellite communications&lt;/td&gt;
&lt;td&gt;• Global ground stations&lt;br&gt;• Pay-per-minute&lt;br&gt;• EC2 integration&lt;br&gt;• Data downlink&lt;br&gt;• No upfront investment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;p&gt;&lt;a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;23. CUSTOMER ENGAGEMENT&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Features&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Activate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Startup support&lt;/td&gt;
&lt;td&gt;• Credits&lt;br&gt;• Technical support&lt;br&gt;• Training&lt;br&gt;• Go-to-market&lt;br&gt;• Community&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;🔝 Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  📊 Additional Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Service Count by Category
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total AWS Services Covered&lt;/strong&gt;: 200+&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Largest Categories&lt;/strong&gt;: Machine Learning (22), Security (19), Management &amp;amp; Governance (18)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most Common Use Cases&lt;/strong&gt;: Web applications, data analytics, machine learning, hybrid cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Navigation Tips
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;strong&gt;Ctrl+F&lt;/strong&gt; (Windows) or &lt;strong&gt;Cmd+F&lt;/strong&gt; (Mac) to search for specific services&lt;/li&gt;
&lt;li&gt;Click on &lt;strong&gt;[View ↓]&lt;/strong&gt; links in the Table of Contents to jump to sections&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;[🔝 Back to Table of Contents]&lt;/strong&gt; links to return to the top&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Last Updated&lt;/strong&gt;: 2024&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Document Version&lt;/strong&gt;: 2.1&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>productivity</category>
      <category>software</category>
    </item>
    <item>
      <title>Kubernetes &amp; EKS Deep Dive: From Zero to Production</title>
      <dc:creator>Data Tech Bridge</dc:creator>
      <pubDate>Mon, 05 Jan 2026 04:56:53 +0000</pubDate>
      <link>https://forem.com/datatechbridge/kubernetes-eks-deep-dive-from-zero-to-production-33km</link>
      <guid>https://forem.com/datatechbridge/kubernetes-eks-deep-dive-from-zero-to-production-33km</guid>
      <description>&lt;p&gt;&lt;em&gt;Alex walks out of the coffee shop, laptop bag over shoulder, ready to build something amazing. Sam orders another coffee and opens a laptop – time to help the next person on their Kubernetes journey.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Beginning of Your EKS Adventure&lt;/strong&gt; 🚀☸️&lt;br&gt;
&lt;a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  📚 Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Understanding Kubernetes: The Philosophy&lt;/li&gt;
&lt;li&gt;The Kubernetes Architecture&lt;/li&gt;
&lt;li&gt;Enter Amazon EKS&lt;/li&gt;
&lt;li&gt;
Core Kubernetes Concepts: The Building Blocks

&lt;ul&gt;
&lt;li&gt;Pods: The Atomic Unit&lt;/li&gt;
&lt;li&gt;ReplicaSets: Maintaining Desired Count&lt;/li&gt;
&lt;li&gt;Deployments: The Smart Way to Manage Replicas&lt;/li&gt;
&lt;li&gt;Services: Stable Networking&lt;/li&gt;
&lt;li&gt;ConfigMaps and Secrets: Configuration Management&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Networking in EKS: The Complex Beast

&lt;ul&gt;
&lt;li&gt;Ingress: Advanced Routing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Storage in Kubernetes&lt;/li&gt;
&lt;li&gt;
Security in EKS: Locking It Down

&lt;ul&gt;
&lt;li&gt;IAM Roles for Service Accounts (IRSA)&lt;/li&gt;
&lt;li&gt;Pod Security Standards&lt;/li&gt;
&lt;li&gt;Network Policies&lt;/li&gt;
&lt;li&gt;Secrets Management&lt;/li&gt;
&lt;li&gt;Image Security&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Setting Up Your First EKS Cluster

&lt;ul&gt;
&lt;li&gt;Option 1: eksctl (Easiest)&lt;/li&gt;
&lt;li&gt;Option 2: eksctl with Config File (Better)&lt;/li&gt;
&lt;li&gt;Option 3: Terraform (Most Control)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Essential Add-ons and Tools

&lt;ul&gt;
&lt;li&gt;AWS Load Balancer Controller&lt;/li&gt;
&lt;li&gt;EBS CSI Driver&lt;/li&gt;
&lt;li&gt;Metrics Server&lt;/li&gt;
&lt;li&gt;Cluster Autoscaler&lt;/li&gt;
&lt;li&gt;Karpenter&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Observability: Knowing What's Happening

&lt;ul&gt;
&lt;li&gt;Metrics: CloudWatch Container Insights&lt;/li&gt;
&lt;li&gt;Logs: FluentBit to CloudWatch&lt;/li&gt;
&lt;li&gt;Traces: AWS X-Ray&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Real-World Application Deployment&lt;/li&gt;
&lt;li&gt;
Advanced Patterns and Best Practices

&lt;ul&gt;
&lt;li&gt;Init Containers&lt;/li&gt;
&lt;li&gt;Pod Disruption Budgets&lt;/li&gt;
&lt;li&gt;Resource Quotas&lt;/li&gt;
&lt;li&gt;Affinity and Anti-Affinity&lt;/li&gt;
&lt;li&gt;Jobs and CronJobs&lt;/li&gt;
&lt;li&gt;Service Mesh&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
Troubleshooting Common Issues

&lt;ul&gt;
&lt;li&gt;Pods Not Starting&lt;/li&gt;
&lt;li&gt;Service Not Reachable&lt;/li&gt;
&lt;li&gt;Ingress Not Working&lt;/li&gt;
&lt;li&gt;High Memory/CPU Usage&lt;/li&gt;
&lt;li&gt;PVC Not Binding&lt;/li&gt;
&lt;li&gt;Debugging Toolkit&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;GitOps and CI/CD&lt;/li&gt;
&lt;li&gt;Cost Optimization&lt;/li&gt;
&lt;li&gt;Production Readiness Checklist&lt;/li&gt;
&lt;li&gt;Wrapping Up&lt;/li&gt;
&lt;li&gt;Quick Reference Guide&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Two weeks have passed after the &lt;a href="https://dev.to/datatechbridge/aws-container-services-demystified-a-coffee-shop-conversation-3p0f"&gt;discussion on AWS Container Services&lt;/a&gt;. Alex is back at the coffee shop, laptop open, with a slightly worried expression. Sam walks in with two coffees.&lt;/em&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;sets down coffee&lt;/em&gt; Okay, that text you sent at 2 AM saying "I think I need Kubernetes after all" was concerning. What happened?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;laughs nervously&lt;/em&gt; Don't worry, nothing's on fire. The ECS setup is working great actually! But... our CTO just announced we're acquiring a startup. They're running everything on Kubernetes. And our new lead developer keeps talking about "the K8s ecosystem" and "cloud-native patterns."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Ah, the classic acquisition scenario. So now you need to actually understand Kubernetes, not just know it exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Exactly. And honestly, after working with ECS, I'm curious. What am I missing? Why do people love Kubernetes so much they tattoo the logo on their arms?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;spits out coffee&lt;/em&gt; Someone actually did that?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; I saw it on Twitter!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;shakes head&lt;/em&gt; Okay, well... let's dive deep. Fair warning: this is going to be a longer conversation. Kubernetes isn't something you grasp in ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; I've got all afternoon. Teach me, sensei.&lt;/p&gt;
&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Understanding Kubernetes: The Philosophy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Before we touch EKS, you need to understand Kubernetes itself. Think back to our last conversation – I said Kubernetes is complex. But there's a reason for that complexity. It's not accidental.&lt;/p&gt;

&lt;p&gt;Kubernetes was built to solve Google's problems: running billions of containers across thousands of machines. It's designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declarative configuration&lt;/strong&gt;: You say what you want, K8s makes it happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-healing&lt;/strong&gt;: Things fail, K8s automatically recovers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: You can add functionality without changing core code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API-driven&lt;/strong&gt;: Everything is an API resource&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; You lost me at "declarative configuration."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Right, let me explain with an example. Remember ECS?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In ECS (Imperative thinking):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# You tell ECS what to DO&lt;/span&gt;
aws ecs create-service &lt;span class="nt"&gt;--service-name&lt;/span&gt; my-app &lt;span class="nt"&gt;--desired-count&lt;/span&gt; 3
aws ecs update-service &lt;span class="nt"&gt;--service-name&lt;/span&gt; my-app &lt;span class="nt"&gt;--desired-count&lt;/span&gt; 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In Kubernetes (Declarative thinking):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# You tell K8s what you WANT&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;  &lt;span class="c1"&gt;# I want 5 replicas, K8s makes it happen&lt;/span&gt;
  &lt;span class="c1"&gt;# ... rest of config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You apply this file, and Kubernetes continuously works to maintain that state. If a pod dies, K8s starts a new one. If you change replicas from 5 to 10, K8s smoothly adds 5 more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So it's like... I'm describing my desired state, not giving commands?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Exactly! That's the Kubernetes mindset. Everything is about desired state vs. current state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│        Kubernetes Control Loop          │
│                                         │
│  ┌──────────────┐    ┌──────────────┐  │
│  │   Desired    │    │   Current    │  │
│  │    State     │    │    State     │  │
│  │ (your YAML)  │    │ (reality)    │  │
│  └──────┬───────┘    └───────┬──────┘  │
│         │                    │         │
│         │   ┌────────────┐   │         │
│         └──►│ Controller │◄──┘         │
│             │ (reconcile)│             │
│             └────────────┘             │
│                                         │
│  If desired ≠ current, take action     │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  The Kubernetes Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Okay, I'm buying into the philosophy. Show me how it actually works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;draws on napkin&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes has two main parts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Control Plane&lt;/strong&gt; (the brains)&lt;br&gt;
&lt;strong&gt;2. Worker Nodes&lt;/strong&gt; (the muscle)&lt;/p&gt;

&lt;p&gt;Let me break this down:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────────────────────┐
│               KUBERNETES CLUSTER                    │
│                                                     │
│  ┌──────────────────────────────────────────────┐  │
│  │         CONTROL PLANE (Master)                │  │
│  │                                               │  │
│  │  ┌──────────────┐  ┌──────────────┐         │  │
│  │  │  API Server  │  │    etcd      │         │  │
│  │  │  (kubectl)   │  │  (database)  │         │  │
│  │  └──────┬───────┘  └──────────────┘         │  │
│  │         │                                     │  │
│  │  ┌──────┴───────┐  ┌──────────────┐         │  │
│  │  │  Scheduler   │  │  Controller  │         │  │
│  │  │ (placement)  │  │   Manager    │         │  │
│  │  └──────────────┘  └──────────────┘         │  │
│  └───────────────────────┬────────────────────┬─┘  │
│                          │                    │    │
│  ┌───────────────────────▼─────┐  ┌──────────▼──┐ │
│  │      WORKER NODE 1          │  │ WORKER NODE 2│ │
│  │                             │  │              │ │
│  │  ┌────────┐  ┌────────┐    │  │  ┌────────┐ │ │
│  │  │  Pod   │  │  Pod   │    │  │  │  Pod   │ │ │
│  │  │┌──────┐│  │┌──────┐│    │  │  │┌──────┐│ │ │
│  │  ││ App  ││  ││ App  ││    │  │  ││ App  ││ │ │
│  │  │└──────┘│  │└──────┘│    │  │  │└──────┘│ │ │
│  │  └────────┘  └────────┘    │  │  └────────┘ │ │
│  │                             │  │              │ │
│  │  kubelet | kube-proxy      │  │  kubelet...  │ │
│  └─────────────────────────────┘  └──────────────┘ │
└────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Walk me through what each piece does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Great! Let's follow a request:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. API Server&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The front door to Kubernetes&lt;/li&gt;
&lt;li&gt;When you run &lt;code&gt;kubectl apply&lt;/code&gt;, it hits the API Server&lt;/li&gt;
&lt;li&gt;Validates requests, authenticates users&lt;/li&gt;
&lt;li&gt;Everything goes through here – it's the only component that talks to etcd&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. etcd&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distributed key-value database&lt;/li&gt;
&lt;li&gt;Stores ALL cluster state&lt;/li&gt;
&lt;li&gt;If etcd is lost, the cluster doesn't know what it should be running&lt;/li&gt;
&lt;li&gt;This is why backups are critical!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Scheduler&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches for new pods that don't have a node assigned&lt;/li&gt;
&lt;li&gt;Decides which node should run the pod based on:

&lt;ul&gt;
&lt;li&gt;Resource requirements (CPU, memory)&lt;/li&gt;
&lt;li&gt;Node constraints (labels, taints, tolerations)&lt;/li&gt;
&lt;li&gt;Affinity/anti-affinity rules&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Controller Manager&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs multiple controllers (Deployment controller, ReplicaSet controller, etc.)&lt;/li&gt;
&lt;li&gt;Each controller watches for changes and acts on them&lt;/li&gt;
&lt;li&gt;Example: Deployment controller sees replicas: 5, counts current pods, creates more if needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. kubelet&lt;/strong&gt; (on each worker node):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures containers are running in pods&lt;/li&gt;
&lt;li&gt;Talks to container runtime (Docker, containerd)&lt;/li&gt;
&lt;li&gt;Reports node and pod status back to API Server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;6. kube-proxy&lt;/strong&gt; (on each worker node):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles networking&lt;/li&gt;
&lt;li&gt;Implements Service abstraction&lt;/li&gt;
&lt;li&gt;Routes traffic to appropriate pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So when I run &lt;code&gt;kubectl apply -f deployment.yaml&lt;/code&gt;, what happens?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Perfect question! Let's trace it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step-by-Step: kubectl apply -f deployment.yaml

1. kubectl → API Server
   "Here's a Deployment resource"

2. API Server → etcd
   "Store this Deployment spec"

3. Deployment Controller (watching API Server)
   "New Deployment! I need to create a ReplicaSet"

4. Deployment Controller → API Server
   "Create this ReplicaSet"

5. ReplicaSet Controller (watching)
   "New ReplicaSet! I need to create Pods"

6. ReplicaSet Controller → API Server
   "Create 3 Pods please"

7. Scheduler (watching for unassigned Pods)
   "These Pods need homes... Node 2 has space!"

8. Scheduler → API Server
   "Assign these Pods to Node 2"

9. kubelet on Node 2 (watching API Server)
   "I have new Pods to run! Starting containers..."

10. kubelet → Container Runtime
    "Pull image, create container, start it"

11. kubelet → API Server (continuous)
    "Status update: Pod is Running"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Wow. That's... a lot of moving parts for just starting a container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; It is! But that complexity gives you power. Each piece can be extended, replaced, or customized. That's why Kubernetes can do things ECS can't.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  Enter Amazon EKS
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Okay, so where does EKS fit into all this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;strong&gt;Amazon EKS&lt;/strong&gt; (Elastic Kubernetes Service) is AWS's managed Kubernetes. Here's what AWS handles for you:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Manages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control Plane (API Server, etcd, scheduler, controller manager)&lt;/li&gt;
&lt;li&gt;Control plane HA across multiple AZs&lt;/li&gt;
&lt;li&gt;Control plane upgrades&lt;/li&gt;
&lt;li&gt;Control plane scaling&lt;/li&gt;
&lt;li&gt;etcd backups&lt;/li&gt;
&lt;li&gt;API Server endpoint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You Manage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker nodes (or use Fargate)&lt;/li&gt;
&lt;li&gt;Applications&lt;/li&gt;
&lt;li&gt;Add-ons (though some are managed now)&lt;/li&gt;
&lt;li&gt;Networking configuration&lt;/li&gt;
&lt;li&gt;Security policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So it's like... they run the control plane, I run the workers?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Exactly! Here's the EKS architecture:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────┐
│                   AMAZON EKS                         │
│                                                      │
│  ┌────────────────────────────────────────────────┐ │
│  │    AWS-MANAGED CONTROL PLANE                   │ │
│  │    (Runs in AWS VPC, multi-AZ)                │ │
│  │                                                │ │
│  │  ┌─────┐  ┌─────┐  ┌──────┐  ┌──────────┐   │ │
│  │  │ API │  │etcd │  │Sched │  │Controller│   │ │
│  │  │     │  │     │  │     │  │ Manager  │   │ │
│  │  └─────┘  └─────┘  └──────┘  └──────────┘   │ │
│  │                                                │ │
│  │  Exposed via AWS-managed Load Balancer        │ │
│  └────────────────────┬───────────────────────────┘ │
│                       │                             │
│  ┌────────────────────▼───────────────────────────┐ │
│  │         YOUR VPC                               │ │
│  │                                                │ │
│  │  ┌──────────────────┐    ┌──────────────────┐ │ │
│  │  │  Subnet 1 (AZ-a) │    │ Subnet 2 (AZ-b) │ │ │
│  │  │                  │    │                  │ │ │
│  │  │  ┌────────────┐  │    │  ┌────────────┐ │ │ │
│  │  │  │   Node 1   │  │    │  │   Node 2   │ │ │ │
│  │  │  │ (EC2 or    │  │    │  │ (EC2 or    │ │ │ │
│  │  │  │  Fargate)  │  │    │  │  Fargate)  │ │ │ │
│  │  │  │            │  │    │  │            │ │ │ │
│  │  │  │ ┌────────┐ │  │    │  │ ┌────────┐ │ │ │
│  │  │  │ │  Pods  │ │  │    │  │ │  Pods  │ │ │ │
│  │  │  │ └────────┘ │  │    │  │ └────────┘ │ │ │
│  │  │  └────────────┘  │    │  └────────────┘ │ │ │
│  │  └──────────────────┘    └──────────────────┘ │ │
│  └────────────────────────────────────────────────┘ │
│                                                      │
│  AWS Services: ECR, IAM, CloudWatch, VPC, ELB, etc. │
└──────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So I never SSH into the control plane?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Nope! You can't even see those machines. AWS handles all that. You interact with the cluster through kubectl, which talks to the API Server endpoint AWS provides.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  Core Kubernetes Concepts: The Building Blocks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Alright, I need to understand the actual resources. You mentioned Pods, Deployments, Services... break them all down for me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;cracks knuckles&lt;/em&gt; Here we go. This is where Kubernetes gets rich but also complex.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pods: The Atomic Unit
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; A &lt;strong&gt;Pod&lt;/strong&gt; is the smallest deployable unit in Kubernetes. Think of it as a wrapper around one or more containers that need to run together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-pod&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-container&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789.dkr.ecr.us-east-1.amazonaws.com/my-app:v1.0&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres://db:5432"&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250m"&lt;/span&gt;
      &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Why would I put multiple containers in one pod?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Great question! Common patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sidecar Pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────┐
│         Pod             │
│                         │
│  ┌─────────┐           │
│  │  Main   │           │
│  │  App    │           │
│  └────┬────┘           │
│       │ shares         │
│  ┌────▼────┐  storage  │
│  │ Logging │           │
│  │ Sidecar │           │
│  └─────────┘           │
└─────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example: Main app writes logs to a shared volume, sidecar container ships logs to CloudWatch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ambassador Pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────┐
│         Pod             │
│                         │
│  ┌─────────┐           │
│  │  Main   │───────┐   │
│  │  App    │       │   │
│  └─────────┘       │   │
│                    │   │
│  ┌─────────┐      │   │
│  │ Proxy/  │◄─────┘   │
│  │ Cache   │           │
│  └────┬────┘           │
└───────┼─────────────────┘
        │
      Network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example: Main app talks to local proxy, proxy handles connection pooling to database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; But usually it's just one container per pod?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Usually, yes! One container per pod is the most common pattern.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  ReplicaSets: Maintaining Desired Count
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;strong&gt;ReplicaSets&lt;/strong&gt; ensure a specified number of pod replicas are running at all times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ReplicaSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-rs&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a pod dies, ReplicaSet creates a new one. If you manually create extra pods with matching labels, ReplicaSet deletes them to maintain exactly 3.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So ReplicaSets are like the ECS Service desired count?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Yes! But here's the thing – you almost never create ReplicaSets directly. You use...&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Deployments: The Smart Way to Manage Replicas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;strong&gt;Deployments&lt;/strong&gt; are what you actually use. They manage ReplicaSets for you and provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rolling updates&lt;/li&gt;
&lt;li&gt;Rollbacks&lt;/li&gt;
&lt;li&gt;Version history&lt;/li&gt;
&lt;li&gt;Pause/resume capabilities
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RollingUpdate&lt;/span&gt;
    &lt;span class="na"&gt;rollingUpdate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;maxSurge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;        &lt;span class="c1"&gt;# Max 1 extra pod during update&lt;/span&gt;
      &lt;span class="na"&gt;maxUnavailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Max 1 pod down during update&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1.0&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What happens when I update the image?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Magic! Watch this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current State (v1.0):
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (5 pods v1.0) │
│      ├─ Pod 1                    │
│      ├─ Pod 2                    │
│      ├─ Pod 3                    │
│      ├─ Pod 4                    │
│      └─ Pod 5                    │
└──────────────────────────────────┘

Update to v1.1:
$ kubectl set image deployment/my-app my-app=my-app:v1.1

During Rolling Update:
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (3 pods v1.0) │
│  │   ├─ Pod 1                    │
│  │   ├─ Pod 2                    │
│  │   └─ Pod 3                    │
│  └─ ReplicaSet-xyz (2 pods v1.1) │
│      ├─ Pod 6 (new!)             │
│      └─ Pod 7 (new!)             │
└──────────────────────────────────┘

Final State:
┌──────────────────────────────────┐
│  Deployment: my-app              │
│  ├─ ReplicaSet-abc (0 pods)      │
│  └─ ReplicaSet-xyz (5 pods v1.1) │
│      ├─ Pod 6                    │
│      ├─ Pod 7                    │
│      ├─ Pod 8                    │
│      ├─ Pod 9                    │
│      └─ Pod 10                   │
└──────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubernetes gradually terminates old pods and starts new ones. If something goes wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl rollout undo deployment/my-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And you're back to v1.0!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; That's actually really cool. ECS can do blue/green, but this feels more gradual.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Services: Stable Networking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Here's a problem: Pods are ephemeral. They get IPs when they start, but those IPs change if pods restart. How do you connect to them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Um... DNS?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Kind of! That's where &lt;strong&gt;Services&lt;/strong&gt; come in. A Service provides a stable IP and DNS name for a set of pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;  &lt;span class="c1"&gt;# Only accessible within cluster&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
    &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;        &lt;span class="c1"&gt;# Service port&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;  &lt;span class="c1"&gt;# Container port&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now any pod can reach your app at &lt;code&gt;my-app-service:80&lt;/code&gt; or &lt;code&gt;my-app-service.default.svc.cluster.local:80&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service Types:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. ClusterIP&lt;/strong&gt; (default):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internal only, not accessible from outside
┌──────────────────────────┐
│      Cluster             │
│  ┌────────┐             │
│  │ Pod A  │──────┐      │
│  └────────┘      │      │
│                  ▼      │
│  ┌────────────────────┐ │
│  │  Service ClusterIP │ │
│  │  10.100.200.50     │ │
│  └────────┬───────────┘ │
│           │             │
│      ┌────▼───┐         │
│      │ Pod B  │         │
│      └────────┘         │
└──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. NodePort&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Accessible on each node's IP at a static port
┌──────────────────────────┐
│      Cluster             │
│  ┌─────────────────────┐ │
│  │ Node 1              │ │
│  │ IP: 10.0.1.50       │ │
│  │ NodePort: 30080 ────┼─┼──► External traffic
│  └─────────┬───────────┘ │
│            │             │
│  ┌─────────▼───────────┐ │
│  │  Service            │ │
│  └─────────┬───────────┘ │
│            │             │
│       ┌────▼───┐         │
│       │  Pods  │         │
│       └────────┘         │
└──────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. LoadBalancer&lt;/strong&gt; (the one you'll use most in EKS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service.beta.kubernetes.io/aws-load-balancer-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nlb"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This automatically provisions an AWS Load Balancer!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   │
   ▼
┌─────────────────────┐
│ AWS Load Balancer   │
│ (NLB or ALB)        │
└────────┬────────────┘
         │
   ┌─────▼──────┐
   │ Service LB │
   └─────┬──────┘
         │
    ┌────▼────┐
    │  Pods   │
    └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Wait, so Kubernetes creates an actual AWS Load Balancer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Yes! Through the &lt;strong&gt;AWS Load Balancer Controller&lt;/strong&gt;. We'll get to that. But this is where EKS's AWS integration shines.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  ConfigMaps and Secrets: Configuration Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Applications need configuration. In Kubernetes, you use &lt;strong&gt;ConfigMaps&lt;/strong&gt; for non-sensitive data and &lt;strong&gt;Secrets&lt;/strong&gt; for sensitive data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ConfigMap example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;database_host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres.example.com"&lt;/span&gt;
  &lt;span class="na"&gt;log_level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;info"&lt;/span&gt;
  &lt;span class="na"&gt;app.properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;feature.new=true&lt;/span&gt;
    &lt;span class="s"&gt;timeout=30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using it in a Pod:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
    &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# As environment variables&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_HOST&lt;/span&gt;
      &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
          &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database_host&lt;/span&gt;
    &lt;span class="c1"&gt;# Or mount as files&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/etc/config&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;config-volume&lt;/span&gt;
    &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Secrets (similar but encoded):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-secret&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;YWRtaW4=&lt;/span&gt;  &lt;span class="c1"&gt;# base64 encoded&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cGFzc3dvcmQ=&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Base64 isn't encryption though...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Correct! Secrets in Kubernetes are just base64 encoded by default. In EKS, you should use &lt;strong&gt;AWS Secrets Manager&lt;/strong&gt; or &lt;strong&gt;Parameter Store&lt;/strong&gt; integration for real security. We'll cover that.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  Networking in EKS: The Complex Beast
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Okay, I need to understand networking because I heard it's complicated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;deep breath&lt;/em&gt; You heard right. Kubernetes networking is... special. Let me break it down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes Networking Model has three rules:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Every pod gets its own IP address&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All pods can communicate with all other pods without NAT&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;All nodes can communicate with all pods without NAT&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; How does EKS implement this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Through the &lt;strong&gt;Amazon VPC CNI (Container Network Interface) plugin&lt;/strong&gt;. This is unique to EKS and actually pretty clever.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Traditional Kubernetes:
┌────────────────────────────┐
│  Node (EC2 Instance)       │
│  IP: 10.0.1.50             │
│                            │
│  ┌──────────────────────┐  │
│  │ Pod 1: 172.16.1.5    │  │
│  └──────────────────────┘  │
│  ┌──────────────────────┐  │
│  │ Pod 2: 172.16.1.6    │  │
│  └──────────────────────┘  │
│                            │
│  Pod IPs are virtual       │
│  (overlay network)         │
└────────────────────────────┘

EKS with VPC CNI:
┌────────────────────────────┐
│  Node (EC2 Instance)       │
│  Primary IP: 10.0.1.50     │
│                            │
│  ┌──────────────────────┐  │
│  │ Pod 1: 10.0.1.100    │  │ ← Real VPC IP!
│  └──────────────────────┘  │
│  ┌──────────────────────┐  │
│  │ Pod 2: 10.0.1.101    │  │ ← Real VPC IP!
│  └──────────────────────┘  │
│                            │
│  Pods get real ENI IPs     │
│  from your VPC!            │
└────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So pods get actual VPC IP addresses?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Yes! This means:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pods can directly communicate with VPC resources (RDS, ElastiCache, etc.)&lt;/li&gt;
&lt;li&gt;VPC Security Groups can apply to pods&lt;/li&gt;
&lt;li&gt;No overlay network performance penalty&lt;/li&gt;
&lt;li&gt;VPC Flow Logs work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You consume IP addresses quickly&lt;/li&gt;
&lt;li&gt;Need to plan VPC CIDR blocks carefully&lt;/li&gt;
&lt;li&gt;ENI limits per instance type matter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; How many pods can I run per node?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; It depends on the EC2 instance type and its ENI limits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t3.small:  3 ENIs × 4 IPs = 12 IP addresses
  - 1 for node
  - 11 for pods
  Maximum pods: 11

m5.large:  3 ENIs × 10 IPs = 30 IP addresses
  - 1 for node
  - 29 for pods
  Maximum pods: 29

m5.xlarge: 4 ENIs × 15 IPs = 60 IP addresses
  Maximum pods: 58
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What if I run out of IPs?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; You have options:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Use larger CIDR blocks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad:  10.0.1.0/28  (16 IPs - ouch!)
Better: 10.0.1.0/24  (256 IPs)
Best: 10.0.0.0/16    (65,536 IPs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Use secondary CIDR blocks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add a secondary CIDR to your VPC&lt;/span&gt;
10.0.0.0/16   &lt;span class="o"&gt;(&lt;/span&gt;primary&lt;span class="o"&gt;)&lt;/span&gt;
100.64.0.0/16 &lt;span class="o"&gt;(&lt;/span&gt;secondary &lt;span class="k"&gt;for &lt;/span&gt;pods&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use Custom Networking:&lt;/strong&gt;&lt;br&gt;
Pods use different subnet than nodes&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Use Fargate:&lt;/strong&gt;&lt;br&gt;
No node IP limits!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Use IPv6:&lt;/strong&gt;&lt;br&gt;
Basically infinite addresses&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;
&lt;h3&gt;
  
  
  Ingress: Advanced Routing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Services get you basic load balancing, but what if you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Path-based routing (&lt;code&gt;/api/users&lt;/code&gt; → User Service)&lt;/li&gt;
&lt;li&gt;Host-based routing (&lt;code&gt;api.example.com&lt;/code&gt; → API Service)&lt;/li&gt;
&lt;li&gt;SSL termination&lt;/li&gt;
&lt;li&gt;Advanced traffic management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's where &lt;strong&gt;Ingress&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Is this like ALB path-based routing?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Exactly! But Kubernetes Ingress is a standard API, and different controllers implement it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In EKS, you use the AWS Load Balancer Controller:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internet-facing&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ip&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/certificate-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:acm:...&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/users&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;user-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/orders&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;order-service&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This creates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   │
   ▼
┌─────────────────────────────────┐
│ Application Load Balancer (ALB) │
│ api.example.com                 │
└────┬─────────────────────┬──────┘
     │                     │
     │ /users              │ /orders
     ▼                     ▼
┌──────────┐         ┌──────────┐
│  User    │         │  Order   │
│  Service │         │  Service │
└────┬─────┘         └────┬─────┘
     │                     │
  ┌──▼──┐               ┌──▼──┐
  │Pods │               │Pods │
  └─────┘               └─────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; That's cleaner than managing ALB rules manually!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Absolutely! And the Ingress controller handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating the ALB&lt;/li&gt;
&lt;li&gt;Configuring listeners&lt;/li&gt;
&lt;li&gt;Updating target groups&lt;/li&gt;
&lt;li&gt;Health checks&lt;/li&gt;
&lt;li&gt;SSL certificates from ACM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  Storage in Kubernetes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What about data that needs to persist? Databases, file uploads, etc.?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Kubernetes storage is built on these concepts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Volumes&lt;/strong&gt;: Storage attached to a pod's lifecycle&lt;br&gt;
&lt;strong&gt;2. Persistent Volumes (PV)&lt;/strong&gt;: Storage that exists independently&lt;br&gt;
&lt;strong&gt;3. Persistent Volume Claims (PVC)&lt;/strong&gt;: Request for storage&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────┐
│         Storage Architecture            │
│                                         │
│  ┌───────────────┐                     │
│  │     Pod       │                     │
│  │  ┌─────────┐  │                     │
│  │  │Container│  │                     │
│  │  └────┬────┘  │                     │
│  │       │mount  │                     │
│  │  ┌────▼──────────────┐              │
│  │  │ PVC: my-app-storage│             │
│  │  │ Size: 10Gi         │             │
│  │  └────┬───────────────┘             │
│  └───────┼───────────────┘             │
│          │ bound to                    │
│  ┌───────▼──────────────┐              │
│  │ PV: pv-12345         │              │
│  │ Size: 10Gi           │              │
│  │ Type: EBS            │              │
│  └───────┬──────────────┘              │
│          │                             │
│  ┌───────▼──────────────┐              │
│  │ AWS EBS Volume       │              │
│  │ vol-abcdef123        │              │
│  └──────────────────────┘              │
└─────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Creating a PVC:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-pvc&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;  &lt;span class="c1"&gt;# Can be mounted by one node at a time&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Using it in a Pod:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-storage&lt;/span&gt;
      &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/postgresql/data&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-storage&lt;/span&gt;
    &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-pvc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Storage Classes in EKS:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;storage.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StorageClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fast-ssd&lt;/span&gt;
&lt;span class="na"&gt;provisioner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ebs.csi.aws.com&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;iops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;3000"&lt;/span&gt;
  &lt;span class="na"&gt;throughput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;125"&lt;/span&gt;
&lt;span class="na"&gt;allowVolumeExpansion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;volumeBindingMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitForFirstConsumer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What storage should I use for different scenarios?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Good question! Here's my guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EBS (via EBS CSI Driver):
✓ Databases (PostgreSQL, MySQL)
✓ Single-writer applications
✗ Multi-node read/write
✗ Cross-AZ access

EFS (via EFS CSI Driver):
✓ Shared file storage
✓ Multi-pod read/write
✓ Cross-AZ access
✗ High-performance databases
✗ Block-level operations

FSx for Lustre:
✓ High-performance computing
✓ Machine learning training
✗ General purpose

S3 (via CSI drivers):
✓ Object storage
✓ Static assets
✗ POSIX filesystem operations
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;




&lt;h2&gt;
  
  
  Security in EKS: Locking It Down
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;nervously&lt;/em&gt; Okay, security. I know this is important but overwhelming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Deep breath. Let's tackle this systematically. Security in EKS has multiple layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. IAM Roles for Service Accounts (IRSA)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; This is CRUCIAL in EKS. It lets pods assume IAM roles without needing node-level permissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Old Way (bad):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────┐
│  EC2 Instance        │
│  IAM Role: SuperPower│  ← All pods get this!
│  ┌────┐  ┌────┐     │
│  │Pod1│  │Pod2│     │
│  └────┘  └────┘     │
└──────────────────────┘

Problem: Pod1 needs S3, Pod2 needs DynamoDB
But both get full access!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The EKS Way (good - IRSA):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────┐
│  EC2 Instance                    │
│  IAM Role: Minimal               │
│                                  │
│  ┌────────────┐  ┌────────────┐ │
│  │   Pod1     │  │   Pod2     │ │
│  │ IAM Role:  │  │ IAM Role:  │ │
│  │ S3Access   │  │ DynamoAccess│ │
│  └────────────┘  └────────────┘ │
└──────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Setting it up:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Create IAM role&lt;/span&gt;
eksctl create iamserviceaccount &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-app-sa &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; default &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cluster&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--attach-policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--approve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 2. Use it in your deployment&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-sa&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/my-app-role&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-sa&lt;/span&gt;  &lt;span class="c1"&gt;# Pod gets this role!&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So the pod can just call AWS APIs now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Exactly! The AWS SDKs automatically pick up the credentials from the service account.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Pod Security Standards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; You need to prevent pods from running with dangerous settings:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/enforce&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/audit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
    &lt;span class="na"&gt;pod-security.kubernetes.io/warn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;restricted&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This prevents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running as root&lt;/li&gt;
&lt;li&gt;Privileged containers&lt;/li&gt;
&lt;li&gt;Host network access&lt;/li&gt;
&lt;li&gt;Unsafe volume types&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Network Policies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Control which pods can talk to which:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-network-policy&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;egress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This means:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────┐
│ Frontend │
│   Pod    │───────┐
└──────────┘       │
                   ▼ ✓ Allowed
              ┌──────────┐
              │   API    │
              │   Pod    │
              └────┬─────┘
                   │
                   ▼ ✓ Allowed
              ┌──────────┐
              │ Database │
              │   Pod    │
              └──────────┘

┌──────────┐
│ Random   │
│   Pod    │───X──► API  (Blocked!)
└──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Secrets Management
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Don't use default Kubernetes secrets for production! Use AWS Secrets Manager:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Secrets Manager CSI Driver:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets-store.csi.x-k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SecretProviderClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-secrets&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws&lt;/span&gt;
  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;objects&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;- objectName: "prod/database/password"&lt;/span&gt;
        &lt;span class="s"&gt;objectType: "secretsmanager"&lt;/span&gt;
&lt;span class="s"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app-sa&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/mnt/secrets"&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets&lt;/span&gt;
        &lt;span class="na"&gt;csi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;driver&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;secrets-store.csi.k8s.io&lt;/span&gt;
          &lt;span class="na"&gt;readOnly&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;volumeAttributes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;secretProviderClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aws-secrets"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your database password is securely fetched from AWS Secrets Manager and mounted as a file!&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Image Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Scan images for vulnerabilities:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In your CI/CD pipeline&lt;/span&gt;
&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build image&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t my-app:v1.0 .&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan with ECR&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;aws ecr start-image-scan \&lt;/span&gt;
        &lt;span class="s"&gt;--repository-name my-app \&lt;/span&gt;
        &lt;span class="s"&gt;--image-id imageTag=v1.0&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check scan results&lt;/span&gt;
    &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;aws ecr describe-image-scan-findings \&lt;/span&gt;
        &lt;span class="s"&gt;--repository-name my-app \&lt;/span&gt;
        &lt;span class="s"&gt;--image-id imageTag=v1.0&lt;/span&gt;
      &lt;span class="s"&gt;# Fail if critical vulnerabilities found&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; This is a lot...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; I know. But security is layers. You don't need everything day one, but you should have a plan to implement them.&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Setting Up Your First EKS Cluster
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Alright, enough theory. How do I actually create an EKS cluster?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Great! Let's do this properly. You have a few options:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: eksctl (Easiest)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install eksctl&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;eksctl  &lt;span class="c"&gt;# Mac&lt;/span&gt;
&lt;span class="c"&gt;# or download from github.com/weaveworks/eksctl&lt;/span&gt;

&lt;span class="c"&gt;# Create cluster&lt;/span&gt;
eksctl create cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodegroup-name&lt;/span&gt; standard-workers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-type&lt;/span&gt; t3.medium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes&lt;/span&gt; 3 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes-min&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes-max&lt;/span&gt; 4 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--managed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This creates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;EKS control plane&lt;/li&gt;
&lt;li&gt;VPC with public/private subnets&lt;/li&gt;
&lt;li&gt;Managed node group&lt;/li&gt;
&lt;li&gt;All necessary IAM roles&lt;/li&gt;
&lt;li&gt;VPC CNI, kube-proxy, CoreDNS add-ons&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; That's it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; That's it! In about 15 minutes, you have a production-ready cluster.&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: eksctl with Config File (Better)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# cluster.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-cluster&lt;/span&gt;
  &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.28"&lt;/span&gt;

&lt;span class="c1"&gt;# Use existing VPC (recommended)&lt;/span&gt;
&lt;span class="na"&gt;vpc&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vpc-12345"&lt;/span&gt;
  &lt;span class="na"&gt;subnets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;private&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;us-east-1a&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;subnet-aaaa&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;us-east-1b&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;subnet-bbbb&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
      &lt;span class="na"&gt;us-east-1c&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;subnet-cccc&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# IAM OIDC provider for IRSA&lt;/span&gt;
&lt;span class="na"&gt;iam&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;withOIDC&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="c1"&gt;# Managed node groups&lt;/span&gt;
&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
    &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;t3.medium&lt;/span&gt;
    &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;6&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;volumeSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
    &lt;span class="na"&gt;privateNetworking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;workload-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
      &lt;span class="na"&gt;Team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory-optimized&lt;/span&gt;
    &lt;span class="na"&gt;instanceType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;r5.large&lt;/span&gt;
    &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;desiredCapacity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;privateNetworking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;workload-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory-intensive&lt;/span&gt;
    &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;workload-type&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory-intensive&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;

&lt;span class="c1"&gt;# Add-ons&lt;/span&gt;
&lt;span class="na"&gt;addons&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vpc-cni&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;coredns&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-proxy&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-ebs-csi-driver&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;latest&lt;/span&gt;

&lt;span class="c1"&gt;# CloudWatch logging&lt;/span&gt;
&lt;span class="na"&gt;cloudWatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;clusterLogging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enableTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;audit&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;authenticator&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;controllerManager&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;scheduler&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;eksctl create cluster &lt;span class="nt"&gt;-f&lt;/span&gt; cluster.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 3: Terraform (Most Control)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.tf&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"eks"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-aws-modules/eks/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 19.0"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-cluster"&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1.28"&lt;/span&gt;

  &lt;span class="nx"&gt;cluster_endpoint_public_access&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_endpoint_private_access&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnets&lt;/span&gt;

  &lt;span class="c1"&gt;# EKS Managed Node Group(s)&lt;/span&gt;
  &lt;span class="nx"&gt;eks_managed_node_groups&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;general&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;min_size&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
      &lt;span class="nx"&gt;max_size&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
      &lt;span class="nx"&gt;desired_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;

      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;capacity_type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ON_DEMAND"&lt;/span&gt;

      &lt;span class="nx"&gt;labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
        &lt;span class="nx"&gt;Workload&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"general"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Team&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"platform"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;spot&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;min_size&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="nx"&gt;max_size&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
      &lt;span class="nx"&gt;desired_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

      &lt;span class="nx"&gt;instance_types&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"t3a.medium"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;capacity_type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"SPOT"&lt;/span&gt;

      &lt;span class="nx"&gt;labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;Workload&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"batch"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;taints&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="nx"&gt;key&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spot"&lt;/span&gt;
        &lt;span class="nx"&gt;value&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
        &lt;span class="nx"&gt;effect&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"NoSchedule"&lt;/span&gt;
      &lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Enable IRSA&lt;/span&gt;
  &lt;span class="nx"&gt;enable_irsa&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="c1"&gt;# Add-ons&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_addons&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;vpc-cni&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;kube-proxy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;coredns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;aws-ebs-csi-driver&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;most_recent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
    &lt;span class="nx"&gt;Terraform&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Which should I use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;eksctl&lt;/strong&gt;: Quick start, learning, small teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eksctl + config file&lt;/strong&gt;: Most common, good balance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform&lt;/strong&gt;: Enterprise, multi-cluster, infrastructure as code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'd start with eksctl config file. It's declarative, version-controllable, and easy to understand.&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Essential Add-ons and Tools
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; You mentioned add-ons. What else do I need?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Great question! A basic EKS cluster needs several additional components:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AWS Load Balancer Controller
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Creates ALBs and NLBs from Ingress and Service resources&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install via Helm&lt;/span&gt;
helm repo add eks https://aws.github.io/eks-charts
helm &lt;span class="nb"&gt;install &lt;/span&gt;aws-load-balancer-controller eks/aws-load-balancer-controller &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;clusterName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.create&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; serviceAccount.name&lt;span class="o"&gt;=&lt;/span&gt;aws-load-balancer-controller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Now you can create ALBs:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-ingress&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internet-facing&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
  &lt;span class="c1"&gt;# ... rest of config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  2. EBS CSI Driver
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Persistent volumes using EBS&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Already installed if you used addons in eksctl&lt;/span&gt;
&lt;span class="c"&gt;# Or install manually&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="s2"&gt;"github.com/kubernetes-sigs/aws-ebs-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.25"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Metrics Server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; For &lt;code&gt;kubectl top&lt;/code&gt; and HorizontalPodAutoscaler&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Cluster Autoscaler
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Automatically scale node groups&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cluster-autoscaler&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cluster-autoscaler&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./cluster-autoscaler&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--cloud-provider=aws&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--namespace=kube-system&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--balance-similar-node-groups&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;--skip-nodes-with-system-pods=false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scenario: Need more pods than nodes can handle

1. Pods are pending (not enough resources)
2. Cluster Autoscaler sees pending pods
3. Checks node group can scale up
4. Triggers ASG to add nodes
5. New node joins cluster
6. Pods schedule on new node

Scenario: Nodes are underutilized

1. Node has &amp;lt;50% utilization for 10 minutes
2. All pods can fit elsewhere
3. Cluster Autoscaler cordons node
4. Drains pods (moves them)
5. Terminates node
6. ASG size decreases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Karpenter (Alternative to Cluster Autoscaler)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Karpenter is newer and smarter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Provisioner&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spot"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on-demand"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
      &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
      &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.medium"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.large"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;t3.xlarge"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
  &lt;span class="na"&gt;providerRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;ttlSecondsAfterEmpty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Karpenter vs Cluster Autoscaler:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cluster Autoscaler:
- Works with node groups
- Slower (waits for ASG)
- Less flexible instance types

Karpenter:
- Directly provisions EC2
- Faster (no ASG)
- Intelligently picks instance types
- Better bin-packing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Which should I use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Start with Cluster Autoscaler (simpler, well-tested). Move to Karpenter when you need faster scaling or better cost optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Observability: Knowing What's Happening
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; How do I monitor all this?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Observability in K8s has three pillars: Metrics, Logs, and Traces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metrics: CloudWatch Container Insights
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install CloudWatch Agent&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/quickstart/cwagent-fluentd-quickstart.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This gives you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster metrics (CPU, memory, disk, network)&lt;/li&gt;
&lt;li&gt;Node metrics&lt;/li&gt;
&lt;li&gt;Pod metrics&lt;/li&gt;
&lt;li&gt;Namespace metrics&lt;/li&gt;
&lt;li&gt;Service metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Custom metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/scrape&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8080"&lt;/span&gt;
    &lt;span class="na"&gt;prometheus.io/path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/metrics"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
    &lt;span class="c1"&gt;# App exposes metrics at :8080/metrics&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Logs: FluentBit to CloudWatch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fluent-bit-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amazon-cloudwatch&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;fluent-bit.conf&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;[INPUT]&lt;/span&gt;
        &lt;span class="s"&gt;Name                tail&lt;/span&gt;
        &lt;span class="s"&gt;Path                /var/log/containers/*.log&lt;/span&gt;
        &lt;span class="s"&gt;Parser              docker&lt;/span&gt;
        &lt;span class="s"&gt;Tag                 kube.*&lt;/span&gt;

    &lt;span class="s"&gt;[FILTER]&lt;/span&gt;
        &lt;span class="s"&gt;Name                kubernetes&lt;/span&gt;
        &lt;span class="s"&gt;Match               kube.*&lt;/span&gt;
        &lt;span class="s"&gt;Kube_URL            https://kubernetes.default.svc:443&lt;/span&gt;

    &lt;span class="s"&gt;[OUTPUT]&lt;/span&gt;
        &lt;span class="s"&gt;Name                cloudwatch_logs&lt;/span&gt;
        &lt;span class="s"&gt;Match               *&lt;/span&gt;
        &lt;span class="s"&gt;region              us-east-1&lt;/span&gt;
        &lt;span class="s"&gt;log_group_name      /aws/eks/my-cluster/containers&lt;/span&gt;
        &lt;span class="s"&gt;auto_create_group   true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Can I see logs from kubectl?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Absolutely!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# View pod logs&lt;/span&gt;
kubectl logs my-pod

&lt;span class="c"&gt;# Follow logs (like tail -f)&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-f&lt;/span&gt; my-pod

&lt;span class="c"&gt;# Logs from specific container in pod&lt;/span&gt;
kubectl logs my-pod &lt;span class="nt"&gt;-c&lt;/span&gt; sidecar-container

&lt;span class="c"&gt;# Previous container logs (if crashed)&lt;/span&gt;
kubectl logs my-pod &lt;span class="nt"&gt;--previous&lt;/span&gt;

&lt;span class="c"&gt;# Logs from all pods with label&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my-app

&lt;span class="c"&gt;# Logs from last hour&lt;/span&gt;
kubectl logs my-pod &lt;span class="nt"&gt;--since&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Traces: AWS X-Ray
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-app:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_XRAY_DAEMON_ADDRESS&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xray-service.default:2000&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;xray-daemon&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amazon/aws-xray-daemon:latest&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2000&lt;/span&gt;
          &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UDP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The full observability stack:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────────────────────────────────┐
│         Your Application           │
│  - Emits metrics (Prometheus)      │
│  - Writes logs (stdout/stderr)     │
│  - Sends traces (X-Ray)            │
└──────┬──────────────┬──────────┬───┘
       │              │          │
       │ Metrics      │ Logs     │ Traces
       ▼              ▼          ▼
┌──────────────┐ ┌──────────┐ ┌──────────┐
│ CloudWatch   │ │FluentBit │ │ X-Ray    │
│ Container    │ │          │ │ Daemon   │
│ Insights     │ └────┬─────┘ └────┬─────┘
└──────┬───────┘      │            │
       │              ▼            ▼
       │         ┌─────────────────────┐
       └────────►│   CloudWatch &amp;amp;      │
                 │     X-Ray           │
                 └──────────┬──────────┘
                            │
                   ┌────────▼──────────┐
                   │   CloudWatch      │
                   │   Dashboards      │
                   └───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Real-World Application Deployment
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Okay, let's put this all together. Walk me through deploying a real application.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Perfect! Let's deploy a three-tier application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React frontend&lt;/li&gt;
&lt;li&gt;Node.js API&lt;/li&gt;
&lt;li&gt;PostgreSQL database&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Namespace and RBAC
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# namespace.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Service account for the app&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-sa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;eks.amazonaws.com/role-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::ACCOUNT:role/myapp-role&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Database (PostgreSQL)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# postgres.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-secret&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cG9zdGdyZXM=&lt;/span&gt;  &lt;span class="c1"&gt;# base64 encoded&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-pvc&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres:14&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_DB&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_USER&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POSTGRES_PASSWORD&lt;/span&gt;
          &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-secret&lt;/span&gt;
              &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
        &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-storage&lt;/span&gt;
          &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/postgresql/data&lt;/span&gt;
          &lt;span class="na"&gt;subPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;250m"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;500m"&lt;/span&gt;
      &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-storage&lt;/span&gt;
        &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-pvc&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
  &lt;span class="na"&gt;clusterIP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;None&lt;/span&gt;  &lt;span class="c1"&gt;# Headless service for StatefulSet&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Backend API
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# api.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-config&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;DATABASE_HOST&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres.myapp.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;DATABASE_PORT&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5432"&lt;/span&gt;
  &lt;span class="na"&gt;DATABASE_NAME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;info&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
        &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-sa&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-api:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_HOST&lt;/span&gt;
          &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-config&lt;/span&gt;
              &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_HOST&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_PASSWORD&lt;/span&gt;
          &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;secretKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-secret&lt;/span&gt;
              &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
        &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
          &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
          &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;256Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;200m"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling/v2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HorizontalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-hpa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleTargetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;minReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;maxReplicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cpu&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;70&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Resource&lt;/span&gt;
    &lt;span class="na"&gt;resource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;memory&lt;/span&gt;
      &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Utilization&lt;/span&gt;
        &lt;span class="na"&gt;averageUtilization&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Frontend
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# frontend.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789.dkr.ecr.us-east-1.amazonaws.com/myapp-frontend:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;64Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50m"&lt;/span&gt;
          &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;100m"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Ingress (ALB)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# ingress.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-ingress&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/scheme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;internet-facing&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/target-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ip&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/certificate-arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:acm:us-east-1:ACCOUNT:certificate/abc123&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/listen-ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[{"HTTP":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;80},&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{"HTTPS":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;443}]'&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/ssl-redirect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;443'&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/healthcheck-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
    &lt;span class="na"&gt;alb.ingress.kubernetes.io/success-codes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200'&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;alb&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploy Everything
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create namespace&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; namespace.yaml

&lt;span class="c"&gt;# Deploy database&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; postgres.yaml

&lt;span class="c"&gt;# Wait for database to be ready&lt;/span&gt;
kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ready pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;postgres &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s

&lt;span class="c"&gt;# Deploy API&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; api.yaml

&lt;span class="c"&gt;# Wait for API&lt;/span&gt;
kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ready pod &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;api &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s

&lt;span class="c"&gt;# Deploy frontend&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; frontend.yaml

&lt;span class="c"&gt;# Create ingress&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; ingress.yaml

&lt;span class="c"&gt;# Check everything&lt;/span&gt;
kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The final architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Internet
   │
   ▼
┌─────────────────────────────────────┐
│  Application Load Balancer (ALB)   │
│  myapp.example.com                  │
└──────┬────────────────────┬─────────┘
       │                    │
       │ /api               │ /
       ▼                    ▼
┌──────────────┐     ┌──────────────┐
│ API Service  │     │   Frontend   │
│ (ClusterIP)  │     │   Service    │
└──────┬───────┘     └──────┬───────┘
       │                    │
┌──────▼────────┐    ┌──────▼───────┐
│  API Pods x3  │    │Frontend Pods │
│  (Deployment) │    │      x2      │
└──────┬────────┘    └──────────────┘
       │
       │ connects to
       ▼
┌──────────────────┐
│ Postgres Service │
│   (Headless)     │
└──────┬───────────┘
       │
┌──────▼────────────┐
│ Postgres Pod      │
│ (StatefulSet)     │
└──────┬────────────┘
       │
┌──────▼────────────┐
│  EBS Volume       │
│  (20Gi gp3)       │
└───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;impressed&lt;/em&gt; That's a full application!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; And it has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High availability (multi-replica)&lt;/li&gt;
&lt;li&gt;Auto-scaling (HPA)&lt;/li&gt;
&lt;li&gt;Persistent storage (EBS)&lt;/li&gt;
&lt;li&gt;TLS/HTTPS (ACM certificate)&lt;/li&gt;
&lt;li&gt;Health checks&lt;/li&gt;
&lt;li&gt;Resource limits&lt;/li&gt;
&lt;li&gt;Proper networking&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Advanced Patterns and Best Practices
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What are some patterns I should know about as I get more advanced?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Let me share patterns I use in production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Init Containers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; App needs database migrations before starting&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;initContainers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migration&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-api:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;npm'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;migrate'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DATABASE_URL&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres://...&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-api:v1.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Flow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Pod created
2. Init container runs migrations
3. If migrations succeed, app container starts
4. If migrations fail, pod fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Pod Disruption Budgets
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Want to ensure minimum availability during updates&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;This prevents:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Draining too many nodes at once&lt;/li&gt;
&lt;li&gt;Updating too many pods simultaneously&lt;/li&gt;
&lt;li&gt;Cluster maintenance killing all pods&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Resource Quotas
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Teams overconsuming resources&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceQuota&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-quota&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-a&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;10"&lt;/span&gt;
    &lt;span class="na"&gt;requests.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;20Gi&lt;/span&gt;
    &lt;span class="na"&gt;limits.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;20"&lt;/span&gt;
    &lt;span class="na"&gt;limits.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;40Gi&lt;/span&gt;
    &lt;span class="na"&gt;persistentvolumeclaims&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5"&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Affinity and Anti-Affinity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Want to spread pods across nodes&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;affinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Prefer different nodes&lt;/span&gt;
        &lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
            &lt;span class="na"&gt;podAffinityTerm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
              &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/hostname&lt;/span&gt;
        &lt;span class="c1"&gt;# Require different availability zones&lt;/span&gt;
        &lt;span class="na"&gt;podAntiAffinity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
            &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┬──────────────┬──────────────┐
│   AZ-1a      │    AZ-1b     │    AZ-1c     │
├──────────────┼──────────────┼──────────────┤
│ Node 1       │ Node 3       │ Node 5       │
│  - API Pod 1 │  - API Pod 3 │  - API Pod 5 │
│              │              │              │
│ Node 2       │ Node 4       │              │
│  - API Pod 2 │  - API Pod 4 │              │
└──────────────┴──────────────┴──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Jobs and CronJobs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Need to run batch tasks&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# One-time job&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data-import&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;import&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-importer:v1.0&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;import.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;backoffLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Scheduled job&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;daily-report&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;  &lt;span class="c1"&gt;# 2 AM daily&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;report&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-reporter:v1.0&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;report.py'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Service Mesh (AWS App Mesh)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Need advanced traffic management, retries, circuit breaking&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;appmesh.k8s.aws/v1beta2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualNode&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-v1&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
      &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;portMapping&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
  &lt;span class="na"&gt;serviceDiscovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;hostname&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api.myapp.svc.cluster.local&lt;/span&gt;
  &lt;span class="na"&gt;backends&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;virtualService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;virtualServiceRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;appmesh.k8s.aws/v1beta2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VirtualRouter&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-router&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listeners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;portMapping&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
        &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
  &lt;span class="na"&gt;routes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-route&lt;/span&gt;
      &lt;span class="na"&gt;httpRoute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;match&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;prefix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
        &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;weightedTargets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;virtualNodeRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-v1&lt;/span&gt;
              &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;90&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;virtualNodeRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-v2&lt;/span&gt;
              &lt;span class="na"&gt;weight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;  &lt;span class="c1"&gt;# Canary deployment!&lt;/span&gt;
        &lt;span class="na"&gt;retryPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;maxRetries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
          &lt;span class="na"&gt;perRetryTimeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;unit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s&lt;/span&gt;
            &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; What about when things go wrong? How do I debug?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;pulls out troubleshooting playbook&lt;/em&gt; This is essential! Let me walk you through common issues:&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 1: Pods Not Starting
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check pod status&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Output shows:&lt;/span&gt;
NAME                  READY   STATUS              RESTARTS   AGE
api-7d4b6c9f8-abc123  0/1     ImagePullBackOff    0          5m

&lt;span class="c"&gt;# Get detailed info&lt;/span&gt;
kubectl describe pod api-7d4b6c9f8-abc123 &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Look for:&lt;/span&gt;
&lt;span class="c"&gt;# - Image pull errors (wrong image name, permissions)&lt;/span&gt;
&lt;span class="c"&gt;# - Resource constraints (not enough CPU/memory)&lt;/span&gt;
&lt;span class="c"&gt;# - Volume mount errors&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common causes and fixes:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Problem: ImagePullBackOff&lt;/span&gt;
&lt;span class="c1"&gt;# Fix 1: Check image name&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:v1.0&lt;/span&gt;  &lt;span class="c1"&gt;# Correct region?&lt;/span&gt;

&lt;span class="c1"&gt;# Fix 2: Check IAM permissions for pulling from ECR&lt;/span&gt;
&lt;span class="c1"&gt;# Service account needs ecr:GetAuthorizationToken, ecr:BatchCheckLayerAvailability&lt;/span&gt;

&lt;span class="c1"&gt;# Problem: CrashLoopBackOff&lt;/span&gt;
&lt;span class="c1"&gt;# Check logs&lt;/span&gt;
&lt;span class="s"&gt;kubectl logs api-7d4b6c9f8-abc123 -n myapp --previous&lt;/span&gt;

&lt;span class="c1"&gt;# Problem: Pending (not scheduling)&lt;/span&gt;
&lt;span class="c1"&gt;# Describe pod to see why&lt;/span&gt;
&lt;span class="na"&gt;Events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;Type     Reason            Message&lt;/span&gt;
  &lt;span class="s"&gt;----     ------            -------&lt;/span&gt;
  &lt;span class="s"&gt;Warning  FailedScheduling  0/3 nodes available&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insufficient memory&lt;/span&gt;

&lt;span class="c1"&gt;# Fix: Either reduce pod requests or add more nodes&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 2: Service Not Reachable
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if service exists&lt;/span&gt;
kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Check if endpoints exist (pods backing the service)&lt;/span&gt;
kubectl get endpoints &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

NAME   ENDPOINTS                           AGE
api    10.0.1.50:3000,10.0.1.51:3000       10m

&lt;span class="c"&gt;# If no endpoints, pods might not match service selector&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--show-labels&lt;/span&gt;

&lt;span class="c"&gt;# Test connectivity from another pod&lt;/span&gt;
kubectl run &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;busybox &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; wget &lt;span class="nt"&gt;-O-&lt;/span&gt; api.myapp.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 3: Ingress Not Working
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check if ingress exists&lt;/span&gt;
kubectl get ingress &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Describe to see ALB creation status&lt;/span&gt;
kubectl describe ingress myapp-ingress &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Check AWS Load Balancer Controller logs&lt;/span&gt;
kubectl logs &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system deployment/aws-load-balancer-controller

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - IAM permissions for controller&lt;/span&gt;
&lt;span class="c"&gt;# - Subnet tags missing (kubernetes.io/role/elb=1)&lt;/span&gt;
&lt;span class="c"&gt;# - Security groups blocking traffic&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 4: High Memory/CPU Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current usage&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

NAME                  CPU&lt;span class="o"&gt;(&lt;/span&gt;cores&lt;span class="o"&gt;)&lt;/span&gt;   MEMORY&lt;span class="o"&gt;(&lt;/span&gt;bytes&lt;span class="o"&gt;)&lt;/span&gt;
api-7d4b6c9f8-abc123  450m         890Mi  &lt;span class="c"&gt;# Approaching limits!&lt;/span&gt;

&lt;span class="c"&gt;# Check if pods are being OOMKilled&lt;/span&gt;
kubectl describe pod api-7d4b6c9f8-abc123 &lt;span class="nt"&gt;-n&lt;/span&gt; myapp | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 &lt;span class="s2"&gt;"Last State"&lt;/span&gt;

&lt;span class="c"&gt;# Fix: Adjust resources&lt;/span&gt;
resources:
  requests:
    memory: &lt;span class="s2"&gt;"256Mi"&lt;/span&gt;
    cpu: &lt;span class="s2"&gt;"250m"&lt;/span&gt;
  limits:
    memory: &lt;span class="s2"&gt;"1Gi"&lt;/span&gt;      &lt;span class="c"&gt;# Increased&lt;/span&gt;
    cpu: &lt;span class="s2"&gt;"500m"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Issue 5: PVC Not Binding
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check PVC status&lt;/span&gt;
kubectl get pvc &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

NAME           STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS
postgres-pvc   Pending                                      gp3

&lt;span class="c"&gt;# Describe for details&lt;/span&gt;
kubectl describe pvc postgres-pvc &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Common issues:&lt;/span&gt;
&lt;span class="c"&gt;# - Storage class doesn't exist&lt;/span&gt;
&lt;span class="c"&gt;# - No available volumes&lt;/span&gt;
&lt;span class="c"&gt;# - Zone mismatch (PVC in us-east-1a, nodes in us-east-1b)&lt;/span&gt;

&lt;span class="c"&gt;# Fix: Use WaitForFirstConsumer binding mode&lt;/span&gt;
volumeBindingMode: WaitForFirstConsumer  &lt;span class="c"&gt;# Wait until pod is scheduled&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging Toolkit
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get everything in namespace&lt;/span&gt;
kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# Watch resources live&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;-w&lt;/span&gt;

&lt;span class="c"&gt;# Exec into running container&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; api-7d4b6c9f8-abc123 &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--&lt;/span&gt; /bin/sh

&lt;span class="c"&gt;# Port-forward for local testing&lt;/span&gt;
kubectl port-forward svc/api 8080:80 &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
&lt;span class="c"&gt;# Now access at localhost:8080&lt;/span&gt;

&lt;span class="c"&gt;# Check events&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt;

&lt;span class="c"&gt;# Check resource usage&lt;/span&gt;
kubectl top nodes
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp

&lt;span class="c"&gt;# View all resources with labels&lt;/span&gt;
kubectl get all &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--show-labels&lt;/span&gt;

&lt;span class="c"&gt;# Drain node for maintenance&lt;/span&gt;
kubectl drain node-name &lt;span class="nt"&gt;--ignore-daemonsets&lt;/span&gt; &lt;span class="nt"&gt;--delete-emptydir-data&lt;/span&gt;

&lt;span class="c"&gt;# Cordon node (stop scheduling new pods)&lt;/span&gt;
kubectl cordon node-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  GitOps and CI/CD
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; How do I actually deploy updates? Keep editing YAML files and applying them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; No way! That's where GitOps comes in. Let me show you a proper CI/CD pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  The GitOps Workflow with Flux or ArgoCD
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Developer Flow:

1. Developer → Git Commit → Push to main branch
              │
              ▼
2. GitHub Actions (CI)
   - Runs tests
   - Builds Docker image
   - Tags image (git sha)
   - Pushes to ECR
   - Updates Kubernetes manifest with new image tag
   - Commits manifest to GitOps repo
              │
              ▼
3. Flux/ArgoCD (watching GitOps repo)
   - Detects change
   - Pulls new manifests
   - Applies to cluster
              │
              ▼
4. Kubernetes
   - Rolling update deployment
   - New pods come up
   - Old pods terminate
              │
              ▼
5. Developer sees app updated!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example GitHub Actions workflow:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/deploy.yaml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and Deploy&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ECR_REPOSITORY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp-api&lt;/span&gt;
  &lt;span class="na"&gt;EKS_CLUSTER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-cluster&lt;/span&gt;
  &lt;span class="na"&gt;AWS_REGION&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v3&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v2&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ROLE_ARN }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ env.AWS_REGION }}&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to Amazon ECR&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;login-ecr&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push image&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ECR_REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.login-ecr.outputs.registry }}&lt;/span&gt;
          &lt;span class="na"&gt;IMAGE_TAG&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.sha }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;docker build -t $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG .&lt;/span&gt;
          &lt;span class="s"&gt;docker push $ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG&lt;/span&gt;
          &lt;span class="s"&gt;echo "image=$ECR_REGISTRY/$ECR_REPOSITORY:$IMAGE_TAG" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update Kubernetes manifest&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;IMAGE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.build-and-push.outputs.image }}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Update image in deployment YAML&lt;/span&gt;
          &lt;span class="s"&gt;sed -i "s|image:.*|image: $IMAGE|g" k8s/deployment.yaml&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Commit and push to GitOps repo&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;git config user.name "GitHub Actions"&lt;/span&gt;
          &lt;span class="s"&gt;git config user.email "actions@github.com"&lt;/span&gt;
          &lt;span class="s"&gt;git add k8s/deployment.yaml&lt;/span&gt;
          &lt;span class="s"&gt;git commit -m "Update image to ${{ github.sha }}"&lt;/span&gt;
          &lt;span class="s"&gt;git push&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ArgoCD Application:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/company/gitops-repo&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;      &lt;span class="c1"&gt;# Delete resources not in git&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# Fix manual changes&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; So ArgoCD watches git and keeps the cluster in sync?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Exactly! If someone manually changes something in the cluster, ArgoCD changes it back. Git is the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Cost Optimization
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; This is great, but I'm worried about costs. Any tips?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Absolutely! EKS costs can add up. Here's my cost optimization playbook:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Right-Size Your Pods
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install VPA (Vertical Pod Autoscaler)&lt;/span&gt;
git clone https://github.com/kubernetes/autoscaler.git
&lt;span class="nb"&gt;cd &lt;/span&gt;autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

&lt;span class="c"&gt;# Create VPA in recommendation mode&lt;/span&gt;
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  updateMode: &lt;span class="s2"&gt;"Off"&lt;/span&gt;  &lt;span class="c"&gt;# Just recommend, don't auto-update&lt;/span&gt;

&lt;span class="c"&gt;# Check recommendations&lt;/span&gt;
kubectl describe vpa api-vpa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Use Spot Instances
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In eksctl config&lt;/span&gt;
&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot-workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t3.medium&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t3a.medium&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t2.medium&lt;/span&gt;
    &lt;span class="na"&gt;spot&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
    &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
    &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;

&lt;span class="c1"&gt;# Pods that can tolerate spot&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch-processor&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tolerations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
        &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Equal&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
        &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;
      &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Use Fargate for Bursty Workloads
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Fargate profile&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eksctl.io/v1alpha5&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterConfig&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-cluster&lt;/span&gt;
&lt;span class="na"&gt;fargateProfiles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch-jobs&lt;/span&gt;
    &lt;span class="na"&gt;selectors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;workload-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Implement Pod Disruption Budgets for Safe Downscaling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# Keep at least one running during disruptions&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Use Cluster Autoscaler or Karpenter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Cost savings:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scales down unused nodes&lt;/li&gt;
&lt;li&gt;Consolidates workloads&lt;/li&gt;
&lt;li&gt;Uses cheapest instance types&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Monitor and Alert
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch alarm for high costs&lt;/span&gt;
&lt;span class="s"&gt;aws cloudwatch put-metric-alarm \&lt;/span&gt;
  &lt;span class="s"&gt;--alarm-name eks-high-cost \&lt;/span&gt;
  &lt;span class="s"&gt;--alarm-description "Alert if EKS costs exceed threshold" \&lt;/span&gt;
  &lt;span class="s"&gt;--metric-name EstimatedCharges \&lt;/span&gt;
  &lt;span class="s"&gt;--namespace AWS/Billing \&lt;/span&gt;
  &lt;span class="s"&gt;--statistic Maximum \&lt;/span&gt;
  &lt;span class="s"&gt;--period 86400 \&lt;/span&gt;
  &lt;span class="s"&gt;--evaluation-periods 1 \&lt;/span&gt;
  &lt;span class="s"&gt;--threshold 1000 \&lt;/span&gt;
  &lt;span class="s"&gt;--comparison-operator GreaterThanThreshold&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Use Graviton (ARM) Instances
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;managedNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;graviton-workers&lt;/span&gt;
    &lt;span class="na"&gt;instanceTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;t4g.medium&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;t4g.large&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Graviton-based&lt;/span&gt;
    &lt;span class="c1"&gt;# 20% cheaper than x86!&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Cost breakdown example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Monthly EKS Costs:

Control Plane:              $73/month  (fixed)

3x t3.medium (x86):
  - On-Demand: $90/month
  - Spot: ~$27/month         (70% savings!)

3x t4g.medium (ARM):
  - On-Demand: $72/month     (20% savings vs x86!)
  - Spot: ~$22/month         (75% savings!)

Best Practice Mix:
  - 2x t4g.medium on-demand  (baseline)    $48
  - 3x t4g.medium spot       (burst)       $22
  - Control plane                          $73
  ----------------------------------------
  Total: ~$143/month for 5-node cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Production Readiness Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Okay, before we wrap up – give me a checklist. What do I need before going to production?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;pulls out final napkin&lt;/em&gt; Here's my production readiness checklist:&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Multi-AZ deployment&lt;/strong&gt; (at least 2 AZs)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Private subnets&lt;/strong&gt; for worker nodes&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;NAT Gateways&lt;/strong&gt; for outbound internet&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;VPC Flow Logs&lt;/strong&gt; enabled&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Control plane logging&lt;/strong&gt; enabled&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Cluster autoscaling&lt;/strong&gt; configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Pod autoscaling&lt;/strong&gt; (HPA) for services&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Proper instance types&lt;/strong&gt; selected (right-sized)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;IRSA configured&lt;/strong&gt; for pod IAM roles&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Pod Security Standards&lt;/strong&gt; enforced&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Network Policies&lt;/strong&gt; defined&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Secrets in AWS Secrets Manager&lt;/strong&gt; (not K8s secrets)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;ECR image scanning&lt;/strong&gt; enabled&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Security groups&lt;/strong&gt; properly configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;IAM least privilege&lt;/strong&gt; for all roles&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Encryption at rest&lt;/strong&gt; for EBS volumes&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Encryption in transit&lt;/strong&gt; (TLS everywhere)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;CloudWatch Container Insights&lt;/strong&gt; installed&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Logging to CloudWatch&lt;/strong&gt; configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Metrics collection&lt;/strong&gt; working&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Distributed tracing&lt;/strong&gt; (X-Ray) configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Dashboards created&lt;/strong&gt; for key metrics&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Alerts configured&lt;/strong&gt; for critical issues&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;On-call rotation&lt;/strong&gt; defined&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  High Availability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Multiple replicas&lt;/strong&gt; for all services&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Pod Disruption Budgets&lt;/strong&gt; defined&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Liveness/readiness probes&lt;/strong&gt; on all pods&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Resource requests/limits&lt;/strong&gt; set&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Anti-affinity rules&lt;/strong&gt; for spreading pods&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Health checks&lt;/strong&gt; on load balancers&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Graceful shutdown&lt;/strong&gt; configured&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Disaster Recovery
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;etcd backups&lt;/strong&gt; automated (handled by EKS)&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;PV snapshots&lt;/strong&gt; scheduled&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Gitops repo&lt;/strong&gt; backed up&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Disaster recovery plan&lt;/strong&gt; documented&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Recovery tested&lt;/strong&gt; at least once&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;RTO/RPO defined&lt;/strong&gt; and achievable&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;GitOps workflow&lt;/strong&gt; implemented&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;CI/CD pipeline&lt;/strong&gt; automated&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Rollback procedure&lt;/strong&gt; tested&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Runbooks&lt;/strong&gt; for common issues&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Access control&lt;/strong&gt; (RBAC) configured&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Audit logging&lt;/strong&gt; enabled&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Change management&lt;/strong&gt; process defined&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Management
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;[ ] &lt;strong&gt;Resource quotas&lt;/strong&gt; per namespace&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Cost tracking&lt;/strong&gt; by team/project&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Spot instances&lt;/strong&gt; for applicable workloads&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Rightsizing&lt;/strong&gt; reviewed monthly&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Unused resources&lt;/strong&gt; cleaned up&lt;/li&gt;
&lt;li&gt;[ ] &lt;strong&gt;Budget alerts&lt;/strong&gt; configured&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;overwhelmed&lt;/em&gt; That's a lot...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; It is! But you don't need everything day one. Prioritize:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1: Get it running&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic cluster&lt;/li&gt;
&lt;li&gt;Deploy apps&lt;/li&gt;
&lt;li&gt;Basic monitoring&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Week 2-4: Make it reliable&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HA configuration&lt;/li&gt;
&lt;li&gt;Proper health checks&lt;/li&gt;
&lt;li&gt;Autoscaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 2-3: Make it secure&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IRSA&lt;/li&gt;
&lt;li&gt;Network policies&lt;/li&gt;
&lt;li&gt;Secrets management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Month 3+: Optimize&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost optimization&lt;/li&gt;
&lt;li&gt;Performance tuning&lt;/li&gt;
&lt;li&gt;Advanced features&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;closes laptop&lt;/em&gt; Wow. That was... comprehensive. My brain is full.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;laughs&lt;/em&gt; I know it's a lot. Kubernetes and EKS are powerful but complex. Let me leave you with key takeaways:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Start Simple&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Day 1: Basic EKS cluster with eksctl
Day 2: Deploy first app
Week 1: Add monitoring
Week 2: Add autoscaling
Month 1: Implement GitOps
Month 2: Advanced features
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Use Managed Services&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let AWS manage the control plane&lt;/li&gt;
&lt;li&gt;Use managed node groups&lt;/li&gt;
&lt;li&gt;Leverage AWS integrations (IAM, ALB, EBS)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Embrace Declarative Configuration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Everything in YAML/Git&lt;/li&gt;
&lt;li&gt;Let Kubernetes reconcile&lt;/li&gt;
&lt;li&gt;Don't fight the system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Focus on Observability&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can't fix what you can't see&lt;/li&gt;
&lt;li&gt;Logs, metrics, traces&lt;/li&gt;
&lt;li&gt;Alert on what matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Security is a Journey&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with basics (IRSA, network policies)&lt;/li&gt;
&lt;li&gt;Add layers over time&lt;/li&gt;
&lt;li&gt;Never stop improving&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; And if I had to explain EKS to my manager in 30 seconds?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; "EKS is AWS's managed Kubernetes service. AWS handles the complex control plane, we focus on running our applications. It gives us industry-standard container orchestration, portability, and a massive ecosystem of tools. It's more complex than ECS but more powerful and portable. Perfect for our growing needs."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; Perfect. One last question – when should I absolutely NOT use EKS?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; Great question!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use EKS if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a single, simple application → Use App Runner or ECS&lt;/li&gt;
&lt;li&gt;Your team has zero container experience → Start with ECS, migrate later&lt;/li&gt;
&lt;li&gt;You need something running TODAY → Kubernetes has a learning curve&lt;/li&gt;
&lt;li&gt;Budget is extremely tight → Control plane costs $73/month minimum&lt;/li&gt;
&lt;li&gt;You don't need Kubernetes features → Simpler tools exist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DO use EKS if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need Kubernetes for multi-cloud&lt;/li&gt;
&lt;li&gt;You have Kubernetes expertise&lt;/li&gt;
&lt;li&gt;You need advanced orchestration features&lt;/li&gt;
&lt;li&gt;You're building a platform for multiple teams&lt;/li&gt;
&lt;li&gt;You want the K8s ecosystem tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;stands up&lt;/em&gt; Alright, I'm ready. Time to create my first EKS cluster!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; That's the spirit! Remember:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Your first cluster&lt;/span&gt;
eksctl create cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; my-first-cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodegroup-name&lt;/span&gt; workers &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--node-type&lt;/span&gt; t3.medium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--nodes&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--managed&lt;/span&gt;

&lt;span class="c"&gt;# Deploy something&lt;/span&gt;
kubectl create deployment nginx &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nginx
kubectl expose deployment nginx &lt;span class="nt"&gt;--port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;80 &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;LoadBalancer

&lt;span class="c"&gt;# Check it&lt;/span&gt;
kubectl get all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; And when it inevitably breaks?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;hands over card&lt;/em&gt; Text me. Or check the logs, describe the pods, and check events. 90% of issues are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Wrong image name&lt;/li&gt;
&lt;li&gt;Missing permissions&lt;/li&gt;
&lt;li&gt;Resource constraints&lt;/li&gt;
&lt;li&gt;Configuration errors&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Alex:&lt;/strong&gt; &lt;em&gt;shakes hand&lt;/em&gt; Thanks, Sam. Same time next week to discuss service meshes?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sam:&lt;/strong&gt; &lt;em&gt;grins&lt;/em&gt; Let's start with getting this working first!&lt;/p&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Quick Reference Guide
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Essential Commands:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Cluster management&lt;/span&gt;
eksctl create cluster &lt;span class="nt"&gt;-f&lt;/span&gt; cluster.yaml
eksctl get cluster
eksctl delete cluster &lt;span class="nt"&gt;--name&lt;/span&gt; my-cluster

&lt;span class="c"&gt;# Context management&lt;/span&gt;
kubectl config get-contexts
kubectl config use-context my-cluster

&lt;span class="c"&gt;# Resource management&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
kubectl describe pod my-pod &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
kubectl logs my-pod &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;-f&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--&lt;/span&gt; /bin/sh

&lt;span class="c"&gt;# Apply configurations&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; deployment.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="c"&gt;# Apply all YAML in directory&lt;/span&gt;

&lt;span class="c"&gt;# Scaling&lt;/span&gt;
kubectl scale deployment/my-app &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5
kubectl autoscale deployment/my-app &lt;span class="nt"&gt;--min&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nt"&gt;--max&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 &lt;span class="nt"&gt;--cpu-percent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;70

&lt;span class="c"&gt;# Updates&lt;/span&gt;
kubectl &lt;span class="nb"&gt;set &lt;/span&gt;image deployment/my-app &lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;myapp:v2.0
kubectl rollout status deployment/my-app
kubectl rollout undo deployment/my-app

&lt;span class="c"&gt;# Debugging&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;-n&lt;/span&gt; myapp &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'.lastTimestamp'&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;-n&lt;/span&gt; myapp
kubectl top nodes

&lt;span class="c"&gt;# Port forwarding&lt;/span&gt;
kubectl port-forward svc/my-app 8080:80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resource Hierarchy:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cluster
  └─ Namespace
      ├─ Deployment
      │   └─ ReplicaSet
      │       └─ Pod
      │           └─ Container
      ├─ Service
      ├─ ConfigMap
      ├─ Secret
      ├─ PersistentVolumeClaim
      │   └─ PersistentVolume
      └─ Ingress
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Useful Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://eksworkshop.com" rel="noopener noreferrer"&gt;EKS Workshop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/" rel="noopener noreferrer"&gt;Kubernetes Docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.github.io/aws-eks-best-practices/" rel="noopener noreferrer"&gt;EKS Best Practices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/reference/kubectl/cheatsheet/" rel="noopener noreferrer"&gt;kubectl Cheat Sheet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ↑ Back to Table of Contents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Alex walks out of the coffee shop, laptop bag over shoulder, ready to build something amazing. Sam orders another coffee and opens a laptop – time to help the next person on their Kubernetes journey.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Beginning of Your EKS Adventure&lt;/strong&gt; 🚀☸️&lt;/p&gt;

&lt;p&gt;↑ Back to Table of Contents&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
