<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Gowtham Potureddi</title>
    <description>The latest articles on Forem by Gowtham Potureddi (@gowthampotureddi).</description>
    <link>https://forem.com/gowthampotureddi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874592%2Fb901f929-0a60-4dd2-9dac-22ce22291bdc.png</url>
      <title>Forem: Gowtham Potureddi</title>
      <link>https://forem.com/gowthampotureddi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/gowthampotureddi"/>
    <language>en</language>
    <item>
      <title>Senior SQL: Advanced Joins, Window Analytics, Plans, Indexing &amp; Production Mindset</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Wed, 13 May 2026 06:00:41 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/senior-sql-advanced-joins-window-analytics-plans-indexing-production-mindset-gfp</link>
      <guid>https://forem.com/gowthampotureddi/senior-sql-advanced-joins-window-analytics-plans-indexing-production-mindset-gfp</guid>
      <description>&lt;p&gt;&lt;strong&gt;Senior SQL&lt;/strong&gt; is not a longer &lt;code&gt;SELECT&lt;/code&gt; — it is &lt;strong&gt;scale-aware relational engineering&lt;/strong&gt;: you can state &lt;strong&gt;grain&lt;/strong&gt;, predict &lt;strong&gt;cardinality&lt;/strong&gt;, read a &lt;strong&gt;planner&lt;/strong&gt;, choose &lt;strong&gt;indexes and partitions&lt;/strong&gt;, and reason about &lt;strong&gt;correctness under concurrency&lt;/strong&gt; while keeping SQL &lt;strong&gt;maintainable&lt;/strong&gt; for the next teammate. Hiring loops for &lt;strong&gt;senior data engineers&lt;/strong&gt;, &lt;strong&gt;analytics engineers&lt;/strong&gt;, and &lt;strong&gt;backend&lt;/strong&gt; owners increasingly assume that &lt;strong&gt;PostgreSQL&lt;/strong&gt;, &lt;strong&gt;SQL Server&lt;/strong&gt;, &lt;strong&gt;Snowflake&lt;/strong&gt;, &lt;strong&gt;BigQuery&lt;/strong&gt;, or &lt;strong&gt;Redshift&lt;/strong&gt; are all “just dialects” around the same invariants.&lt;/p&gt;

&lt;p&gt;The shift from junior to senior is the shift from &lt;em&gt;“make this dataset”&lt;/em&gt; to &lt;em&gt;“how does this behave at &lt;strong&gt;tens or hundreds of millions&lt;/strong&gt; of rows, under &lt;strong&gt;real isolation&lt;/strong&gt;, with &lt;strong&gt;observable&lt;/strong&gt; plans?”&lt;/em&gt; Below the hero, the fastest lever is still keyboard time on &lt;strong&gt;joins&lt;/strong&gt;, &lt;strong&gt;windows&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt;-driven refactors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qi2sxjl1pya0ufat7ck.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3qi2sxjl1pya0ufat7ck.jpeg" alt="PipeCode blog header for senior SQL — bold white headline 'Senior SQL' with subtitle 'Plans · windows · scale' and minimal database performance gauge motif on dark gradient with pipecode.ai attribution." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/explore/practice"&gt;Browse practice hub →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL language practice →&lt;/a&gt;, sharpen &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;joins →&lt;/a&gt;, deepen &lt;a href="https://dev.to/explore/practice/topic/window-functions/sql"&gt;window functions →&lt;/a&gt;, and reinforce &lt;a href="https://dev.to/explore/practice/topic/cte/sql"&gt;CTEs →&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Junior vs senior — the mindset and the bar&lt;/li&gt;
&lt;li&gt;Join mastery — cardinality, order, and physical strategies&lt;/li&gt;
&lt;li&gt;Window analytics — partitions, orders, and frames&lt;/li&gt;
&lt;li&gt;Recursive CTEs — hierarchies and graph-shaped data&lt;/li&gt;
&lt;li&gt;Plans, indexes, and partitions — observability meets physics&lt;/li&gt;
&lt;li&gt;Isolation, transactions, and locking — correctness under concurrency&lt;/li&gt;
&lt;li&gt;Modeling, ETL SQL, quality checks, and anti-patterns&lt;/li&gt;
&lt;li&gt;Tips to stay senior under interview clocks&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Junior vs senior — the mindset and the bar
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From syntax fluency to production responsibility
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;Junior SQL&lt;/strong&gt; answers “what row shape?” — &lt;strong&gt;senior SQL&lt;/strong&gt; answers “what row shape, what cost, and what failure modes?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A junior-ready script filters and aggregates correctly on sample data. &lt;strong&gt;Senior SQL&lt;/strong&gt; implies you can defend &lt;strong&gt;index use&lt;/strong&gt;, spot &lt;strong&gt;join fan-out&lt;/strong&gt;, choose &lt;strong&gt;window frames&lt;/strong&gt; deliberately, and articulate &lt;strong&gt;transaction&lt;/strong&gt; trade-offs — the stack companies run on &lt;strong&gt;Snowflake&lt;/strong&gt;, &lt;strong&gt;BigQuery&lt;/strong&gt;, &lt;strong&gt;Redshift&lt;/strong&gt;, &lt;strong&gt;Postgres&lt;/strong&gt;, or &lt;strong&gt;SQL Server&lt;/strong&gt; rewards that depth with stable night jobs and non-deadlocking noon dashboards.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When a principal asks &lt;em&gt;“what would you check first?”&lt;/em&gt; for a slow query, answer with &lt;strong&gt;grain + predicates + join graph + plan diff + stats freshness&lt;/strong&gt; before mentioning “add an index.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What junior coverage usually stops at
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Baseline competency is &lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt; / &lt;code&gt;WHERE&lt;/code&gt; / &lt;code&gt;GROUP BY&lt;/code&gt; / &lt;code&gt;ORDER BY&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;basic &lt;code&gt;JOIN&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt;&lt;/strong&gt; hygiene. That is enough to be productive on small tables and tutorials — insufficient when &lt;strong&gt;one-to-many&lt;/strong&gt; edges multiply rows explosively or when &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; semantics invalidate &lt;code&gt;NOT IN&lt;/code&gt; patterns across production feeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;signal&lt;/th&gt;
&lt;th&gt;junior-heavy answer&lt;/th&gt;
&lt;th&gt;senior-shaped answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;slow report&lt;/td&gt;
&lt;td&gt;“add DISTINCT”&lt;/td&gt;
&lt;td&gt;“measure join width; maybe semijoin or pre-aggregate”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  What senior coverage adds
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Seniors lean on &lt;strong&gt;advanced joins&lt;/strong&gt; with explicit &lt;strong&gt;cardinality stories&lt;/strong&gt;, &lt;strong&gt;window analytics&lt;/strong&gt; with &lt;strong&gt;correct frames&lt;/strong&gt;, &lt;strong&gt;recursive CTEs&lt;/strong&gt; for org/dependency graphs, &lt;strong&gt;execution plans&lt;/strong&gt; (&lt;code&gt;EXPLAIN&lt;/code&gt;, &lt;strong&gt;&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt;&lt;/strong&gt; where available), &lt;strong&gt;index strategy&lt;/strong&gt; (composite, covering, selective partials), &lt;strong&gt;partition pruning&lt;/strong&gt;, &lt;strong&gt;isolation levels&lt;/strong&gt;, &lt;strong&gt;locking/deadlock&lt;/strong&gt; narratives, &lt;strong&gt;star/snowflake&lt;/strong&gt; modeling literacy, &lt;strong&gt;staged CTE ETL&lt;/strong&gt; readability, and &lt;strong&gt;data-quality probes&lt;/strong&gt; your pipeline can run daily.&lt;/p&gt;

&lt;h4&gt;
  
  
  How seniors decompose a “suddenly slow” query
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Production regressions are rarely random: a &lt;strong&gt;stats&lt;/strong&gt; refresh, a &lt;strong&gt;code&lt;/strong&gt; deploy that widens predicates, a &lt;strong&gt;fan-out&lt;/strong&gt; join introduced in a refactor, or a &lt;strong&gt;warehouse&lt;/strong&gt; reschedule that starves slots all show up as plan or wall-clock shifts. Seniors time-box triage into &lt;strong&gt;repro → grain → predicates → join graph → plan diff → data skew&lt;/strong&gt; so each hypothesis is falsifiable in minutes, not days.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stakeholder question&lt;/th&gt;
&lt;th&gt;junior-heavy reflex&lt;/th&gt;
&lt;th&gt;senior-shaped triage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;“Dashboard blew up”&lt;/td&gt;
&lt;td&gt;guess one new index&lt;/td&gt;
&lt;td&gt;compare &lt;strong&gt;yesterday vs today&lt;/strong&gt; plan; check &lt;strong&gt;partition&lt;/strong&gt; predicates; confirm &lt;strong&gt;foreign-key&lt;/strong&gt; join did not become &lt;strong&gt;M:N&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Observable signals worth naming in interviews
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; You do not need perfect telemetry to sound senior — you need &lt;strong&gt;explicit&lt;/strong&gt; observables: &lt;strong&gt;buffer/cache hit&lt;/strong&gt; patterns, &lt;strong&gt;spill to disk&lt;/strong&gt; in sort/hash nodes, &lt;strong&gt;rows out&lt;/strong&gt; vs &lt;strong&gt;rows in&lt;/strong&gt; at each join, &lt;strong&gt;remote&lt;/strong&gt; vs &lt;strong&gt;local&lt;/strong&gt; bytes in warehouses, and whether &lt;strong&gt;late-arriving&lt;/strong&gt; facts changed &lt;strong&gt;window&lt;/strong&gt; cohort sizes. Pair those nouns with &lt;strong&gt;what you would change&lt;/strong&gt; (predicate, index leading key, pre-aggregation, or isolation boundary) and you map execution reality to engineering action.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner traps
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Myth: senior = more nested subqueries&lt;/strong&gt; — &lt;strong&gt;flatter CTEs + clearer grain&lt;/strong&gt; often beat clever, tortured SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treating DISTINCT as deodorant&lt;/strong&gt; — masks join explosions instead of fixing keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring dialect session settings&lt;/strong&gt; — the same text runs different plans with different &lt;strong&gt;work_mem&lt;/strong&gt; / &lt;strong&gt;warehouse slots&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Join mastery — cardinality, order, and physical strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Joins are algebra &lt;em&gt;and&lt;/em&gt; physics
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;every join multiplies or filters row sets predictably&lt;/strong&gt; — seniors narrate &lt;strong&gt;one-to-one&lt;/strong&gt;, &lt;strong&gt;one-to-many&lt;/strong&gt;, and &lt;strong&gt;many-to-many&lt;/strong&gt; edges before typing &lt;code&gt;JOIN&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Textbook joins look symmetric; optimizers treat them as &lt;strong&gt;physical operators&lt;/strong&gt;: &lt;strong&gt;nested loop&lt;/strong&gt; (probe), &lt;strong&gt;hash&lt;/strong&gt; (build + probe), &lt;strong&gt;merge&lt;/strong&gt; (sorted streams). Cardinality estimates, &lt;strong&gt;predicate selectivity&lt;/strong&gt;, &lt;strong&gt;index alignment&lt;/strong&gt;, and &lt;strong&gt;memory budgets&lt;/strong&gt; determine which operator wins. Interview credibility comes from linking &lt;strong&gt;schema diagram&lt;/strong&gt; → &lt;strong&gt;join graph&lt;/strong&gt; → &lt;strong&gt;expected operator family&lt;/strong&gt;, not reciting definitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08vjnst33qom9lcuoqe.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08vjnst33qom9lcuoqe.jpeg" alt="Infographic of SQL join strategies — nested loop, hash join, merge join — three labeled panels with icons and when-the-planner-picks-them notes on a PipeCode light card." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Cardinality consciousness
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Before aggregating, ask: &lt;em&gt;if I join &lt;code&gt;customers&lt;/code&gt; to &lt;code&gt;orders&lt;/code&gt;, how many rows per customer appear?&lt;/em&gt; If the business question is &lt;strong&gt;per customer&lt;/strong&gt; but your join returns &lt;strong&gt;per order line&lt;/strong&gt;, downstream &lt;code&gt;SUM&lt;/code&gt; scans the wrong multiset. Seniors stabilize with &lt;strong&gt;pre-aggregation&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt; semijoins&lt;/strong&gt;, or &lt;strong&gt;deduping keys&lt;/strong&gt; before attaching wide fact tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;relationship&lt;/th&gt;
&lt;th&gt;join result width&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 customer : N orders&lt;/td&gt;
&lt;td&gt;N rows per customer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;accidental M:N bridge&lt;/td&gt;
&lt;td&gt;explosion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
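
&lt;p&gt;A minimal sketch of that pre-aggregation guard; the &lt;code&gt;customers&lt;/code&gt;/&lt;code&gt;orders&lt;/code&gt; schema and column names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Collapse orders to customer grain BEFORE attaching customer attributes
WITH order_totals AS (
    SELECT customer_id,
           SUM(amount) AS total_amount,
           COUNT(*)    AS order_count
    FROM orders
    GROUP BY customer_id        -- one row per customer: grain is now safe
)
SELECT c.id, c.region, t.total_amount, t.order_count
FROM customers c
LEFT JOIN order_totals t ON t.customer_id = c.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;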

&lt;h4&gt;
  
  
  Physical strategies (how interviewers phrase it)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Nested loop&lt;/strong&gt; shines with &lt;strong&gt;tiny outer&lt;/strong&gt; sides or selective &lt;strong&gt;index nested loops&lt;/strong&gt;. &lt;strong&gt;Hash join&lt;/strong&gt; often wins for &lt;strong&gt;large equi-joins&lt;/strong&gt; without helpful sort orders. &lt;strong&gt;Merge join&lt;/strong&gt; needs &lt;strong&gt;sorted inputs&lt;/strong&gt; — cheap when indexes provide order, expensive when sorts spill. Saying &lt;em&gt;when&lt;/em&gt; each appears beats naming them alone.&lt;/p&gt;

&lt;h4&gt;
  
  
  Semijoins, antijoins, and row multiplication
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;IN&lt;/code&gt; (semi-correlated)&lt;/strong&gt; patterns answer &lt;strong&gt;membership&lt;/strong&gt; without duplicating the right-hand side — when you only need “is there a matching order?” you should not &lt;strong&gt;inner join&lt;/strong&gt; orders and then &lt;strong&gt;&lt;code&gt;DISTINCT&lt;/code&gt;&lt;/strong&gt; your way back to customer grain. &lt;strong&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt;&lt;/strong&gt; expresses &lt;strong&gt;antijoin&lt;/strong&gt; with sane &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; semantics where &lt;code&gt;NOT IN&lt;/code&gt; over nullable columns becomes a footgun. Interviewers listen for that distinction because it separates “I can write joins” from “I can guard cardinality.”&lt;/p&gt;
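
&lt;p&gt;A hedged sketch of both membership shapes, over the same illustrative schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Semijoin: membership only; no fan-out, no DISTINCT repair
SELECT c.id, c.region
FROM customers c
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

-- Antijoin: customers with no orders; NULL-safe where NOT IN is not
SELECT c.id, c.region
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;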

&lt;h4&gt;
  
  
  Outer joins and predicates: where the filter lives
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Predicates on the &lt;strong&gt;nullable&lt;/strong&gt; side of a &lt;code&gt;LEFT JOIN&lt;/code&gt; behave differently in the &lt;strong&gt;&lt;code&gt;ON&lt;/code&gt;&lt;/strong&gt; clause vs the &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; clause: in &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt;, the predicate rejects the &lt;strong&gt;NULL-extended&lt;/strong&gt; preserved rows and accidentally converts the &lt;strong&gt;left join&lt;/strong&gt; into an &lt;strong&gt;inner&lt;/strong&gt; join; in &lt;strong&gt;&lt;code&gt;ON&lt;/code&gt;&lt;/strong&gt;, you shape the match before preservation. Seniors say aloud which semantics the business question needs (&lt;strong&gt;include non-matching parents&lt;/strong&gt; vs &lt;strong&gt;only parents with qualifying children&lt;/strong&gt;), then place predicates deliberately.&lt;/p&gt;
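
&lt;p&gt;A minimal sketch of the two placements, assuming an illustrative &lt;code&gt;orders.order_date&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keep ALL customers; the date predicate shapes the match (stays a LEFT JOIN)
SELECT c.id, o.id AS order_id
FROM customers c
LEFT JOIN orders o
       ON o.customer_id = c.id
      AND o.order_date &amp;gt;= DATE '2025-01-01';

-- The same predicate in WHERE rejects the NULL-extended preserved rows,
-- silently behaving like an INNER JOIN
SELECT c.id, o.id AS order_id
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.order_date &amp;gt;= DATE '2025-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;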

&lt;h4&gt;
  
  
  Numeric fan-out: why “only ten accounts” still explodes
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Three &lt;strong&gt;1:N&lt;/strong&gt; joins in a row multiply: 10 accounts × 200 orders × 5 line items is &lt;strong&gt;10,000&lt;/strong&gt; fact-shaped rows before a single &lt;code&gt;SUM&lt;/code&gt;. If the dashboard question is &lt;strong&gt;account grain&lt;/strong&gt;, that join order without pre-aggregation is &lt;strong&gt;wrong&lt;/strong&gt;, not just &lt;strong&gt;slow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;relationship&lt;/th&gt;
&lt;th&gt;rows per surviving account (illustrative)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;accounts → orders&lt;/td&gt;
&lt;td&gt;1:N&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;orders → lines&lt;/td&gt;
&lt;td&gt;1:N&lt;/td&gt;
&lt;td&gt;×5 → 1,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;accidental tag bridge&lt;/td&gt;
&lt;td&gt;N:M&lt;/td&gt;
&lt;td&gt;×k → thousands+&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt; projections that widen fact grain&lt;/strong&gt; before &lt;code&gt;GROUP BY&lt;/code&gt; — expensive and ambiguous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer joins with predicates on the nullable side in the wrong clause&lt;/strong&gt; — accidentally turning them into inner joins or duplicating rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming the optimizer will “figure it out”&lt;/strong&gt; without verifying &lt;strong&gt;stats&lt;/strong&gt;, &lt;strong&gt;histograms&lt;/strong&gt;, or &lt;strong&gt;session limits&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Window analytics — partitions, orders, and frames
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Keep row grain while computing comparative metrics
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; collapses&lt;/strong&gt;; &lt;strong&gt;&lt;code&gt;OVER()&lt;/code&gt; decorates&lt;/strong&gt; — seniors pick the right one before typing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Ranking&lt;/strong&gt; (&lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;), &lt;strong&gt;offsets&lt;/strong&gt; (&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;), &lt;strong&gt;running totals&lt;/strong&gt;, and &lt;strong&gt;moving averages&lt;/strong&gt; are standard in analytics pipelines. Senior mastery is &lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt; discipline (correct cohort boundaries), &lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/strong&gt; inside windows (ties handled deliberately), and &lt;strong&gt;frame clauses&lt;/strong&gt; (&lt;code&gt;ROWS&lt;/code&gt; vs &lt;code&gt;RANGE&lt;/code&gt; vs &lt;code&gt;GROUPS&lt;/code&gt;) that match business time semantics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sx4yg88nl16ka1e8fh3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sx4yg88nl16ka1e8fh3.jpeg" alt="Diagram of SQL window function PARTITION BY and ROWS BETWEEN frame over a time-ordered sales series — running total visualization on PipeCode infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Ranking patterns
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; breaks ties arbitrarily unless you add &lt;strong&gt;tie-break columns&lt;/strong&gt; — great for &lt;strong&gt;dedup keep-one&lt;/strong&gt;. &lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; leaves gaps after ties; &lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; does not. Interview prompts often hide &lt;strong&gt;tie-break&lt;/strong&gt; requirements; name them aloud.&lt;/p&gt;

&lt;h4&gt;
  
  
  Frames — where running metrics go wrong
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Default frames differ by function; &lt;strong&gt;aggregates&lt;/strong&gt; over &lt;code&gt;ORDER BY&lt;/code&gt; windows often accumulate &lt;strong&gt;from partition start through current row&lt;/strong&gt;, while &lt;strong&gt;&lt;code&gt;LAG&lt;/code&gt;&lt;/strong&gt; ignores frames. For 7-day moving averages you usually want an explicit &lt;strong&gt;&lt;code&gt;ROWS BETWEEN 6 PRECEDING AND CURRENT ROW&lt;/code&gt;&lt;/strong&gt; (or calendar-aware &lt;strong&gt;&lt;code&gt;RANGE&lt;/code&gt;&lt;/strong&gt; in warehouses that support it well).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — frame semantics in one line of business logic.&lt;/strong&gt; Suppose events share a &lt;strong&gt;&lt;code&gt;user_id&lt;/code&gt;&lt;/strong&gt; partition and an &lt;strong&gt;&lt;code&gt;event_ts&lt;/code&gt;&lt;/strong&gt; order. A &lt;strong&gt;7-row&lt;/strong&gt; moving click count uses &lt;strong&gt;&lt;code&gt;ROWS&lt;/code&gt;&lt;/strong&gt; when “seven events” is the contract; use &lt;strong&gt;&lt;code&gt;RANGE INTERVAL '7 day' PRECEDING&lt;/code&gt;&lt;/strong&gt; when “seven calendar days of irregular events” is the contract — mixing these quietly changes &lt;strong&gt;cohort&lt;/strong&gt; sizes and &lt;strong&gt;downstream&lt;/strong&gt; KPIs.&lt;/p&gt;
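
&lt;p&gt;A sketch of both contracts against &lt;code&gt;user_events&lt;/code&gt;; the &lt;code&gt;RANGE&lt;/code&gt; variant assumes an engine that supports interval offsets (recent PostgreSQL, for example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- "Last seven events" contract: physical rows
SELECT user_id, event_ts,
       COUNT(*) OVER (
           PARTITION BY user_id ORDER BY event_ts
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS clicks_7_events
FROM user_events;

-- "Trailing seven calendar days" contract: business time
SELECT user_id, event_ts,
       COUNT(*) OVER (
           PARTITION BY user_id ORDER BY event_ts
           RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
       ) AS clicks_7_days
FROM user_events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;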

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt; / &lt;code&gt;LEAD&lt;/code&gt; and session boundaries
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Offsets&lt;/strong&gt; compare each row to its neighbors inside the same &lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt;. That is how seniors build &lt;strong&gt;sessions&lt;/strong&gt; (“gap &amp;gt; 30 minutes starts a new session”), &lt;strong&gt;previous-value deltas&lt;/strong&gt;, and &lt;strong&gt;trip completion&lt;/strong&gt; flags without correlated subqueries. The footgun is &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; on the partition’s first row — decide whether &lt;strong&gt;&lt;code&gt;IGNORE NULLS&lt;/code&gt;&lt;/strong&gt; (where supported) or a &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; story matches the spec.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Reveal gaps between consecutive events per user (sessionization primitive)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;event_ts&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;user_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule of thumb: if the problem statement says &lt;strong&gt;“compared to the previous row in some ordering,”&lt;/strong&gt; reach for &lt;strong&gt;&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; before&lt;/strong&gt; self-joins — fewer duplicate sorts, clearer intent.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt; — “global” ranks across unrelated cohorts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-wide SELECT projections inside CTEs feeding windows&lt;/strong&gt; — unnecessary width inflates sort and spill costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using windows where &lt;code&gt;GROUP BY&lt;/code&gt; already expresses the same collapse&lt;/strong&gt; — doubles work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on top three salaries per department
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tables:&lt;/strong&gt; &lt;code&gt;employees(id, name, department_id, salary)&lt;/code&gt; — ties possible. &lt;strong&gt;Prompt:&lt;/strong&gt; Return &lt;strong&gt;at most three employees per department&lt;/strong&gt; by &lt;strong&gt;salary descending&lt;/strong&gt;, breaking ties by &lt;strong&gt;lower &lt;code&gt;id&lt;/code&gt; first&lt;/strong&gt;. Emit &lt;code&gt;department_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;salary&lt;/code&gt;, &lt;code&gt;rn&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;ROW_NUMBER&lt;/code&gt; with tie-break &lt;code&gt;ORDER BY&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;department_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Ada&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Chi&lt;/td&gt;
&lt;td&gt;85000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Partition by &lt;code&gt;department_id&lt;/code&gt;&lt;/strong&gt; — dept 10 and dept 20 rank independently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC, id ASC&lt;/code&gt;&lt;/strong&gt; — among salary ties, smaller &lt;code&gt;id&lt;/code&gt; wins &lt;strong&gt;rn = 1&lt;/strong&gt; (Ada before Bob).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; assigns &lt;strong&gt;1…4&lt;/strong&gt; within dept 10; outer filter keeps &lt;strong&gt;rn ≤ 3&lt;/strong&gt; → Ada, Bob, Chi.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Ada&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Chi&lt;/td&gt;
&lt;td&gt;85000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition boundary&lt;/strong&gt; — window resets per &lt;code&gt;department_id&lt;/code&gt;, mirroring “top within group” specs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic ordering&lt;/strong&gt; — the &lt;code&gt;id&lt;/code&gt; tie-break prevents &lt;strong&gt;non-deterministic&lt;/strong&gt; rank picks across engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROW_NUMBER vs RANK&lt;/strong&gt; — we need &lt;strong&gt;at most N rows&lt;/strong&gt; even with ties; &lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;DENSE_RANK&lt;/code&gt; can emit more than three rows when ties span the cutoff.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predicate after window&lt;/strong&gt; — compute the rank once, filter on &lt;strong&gt;rn&lt;/strong&gt;; avoids correlated subquery patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — the window sort is &lt;strong&gt;O(n log n)&lt;/strong&gt; per partition in typical implementations; an &lt;strong&gt;index on (department_id, salary DESC, id)&lt;/strong&gt; helps engines avoid full resorts when data is clustered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window SQL drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Join-heavy SQL drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Recursive CTEs — hierarchies and graph-shaped data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Trees, org charts, and bill-of-materials patterns
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;recursive CTEs walk a graph defined by a base case + inductive join&lt;/strong&gt; — seniors prove &lt;strong&gt;cycle avoidance&lt;/strong&gt; or accept &lt;strong&gt;termination&lt;/strong&gt; rules.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Classic pattern: seed roots (&lt;code&gt;manager_id IS NULL&lt;/code&gt;), iteratively attach children by joining the working set to the base table. &lt;strong&gt;BOM explosions&lt;/strong&gt; and &lt;strong&gt;dependency queues&lt;/strong&gt; reuse the same skeleton. Interviews probe &lt;strong&gt;depth limits&lt;/strong&gt;, &lt;strong&gt;cycle detection&lt;/strong&gt;, and whether you should push heavy graph work to &lt;strong&gt;graph engines&lt;/strong&gt; instead of SQL when edges explode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — verbal shape.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;leg&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;anchor&lt;/td&gt;
&lt;td&gt;pick root rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;recursive&lt;/td&gt;
&lt;td&gt;join children to frontier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;guard&lt;/td&gt;
&lt;td&gt;optional &lt;code&gt;WHERE depth &amp;lt; 50&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Concrete org-chart skeleton (ANSI shape)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Most dialects compile &lt;strong&gt;anchor &lt;code&gt;UNION ALL&lt;/code&gt; recursive member&lt;/strong&gt; into iterative operators; you should still think in &lt;strong&gt;rounds&lt;/strong&gt;: each recursive leg extends the frontier one hop. Keep the &lt;strong&gt;recursive member&lt;/strong&gt; join purely &lt;strong&gt;structural&lt;/strong&gt; (parent id to child &lt;strong&gt;&lt;code&gt;manager_id&lt;/code&gt;&lt;/strong&gt;) and push &lt;strong&gt;business filters&lt;/strong&gt; either into the anchor or into a final &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; so you do not accidentally starve valid branches.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Depth-limited reporting tree: dialect-specific RECURSIVE keyword may be required&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;subordinates&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;depth&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;           &lt;span class="c1"&gt;-- anchor: executives; swap for :boss_id in interviews&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;subordinates&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;depth&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;                  &lt;span class="c1"&gt;-- guardrail against runaway depth&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;subordinates&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cycles, uniqueness, and when SQL is the wrong tool
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Undirected or &lt;strong&gt;cyclic&lt;/strong&gt; graphs need &lt;strong&gt;visit tracking&lt;/strong&gt;: maintain a &lt;strong&gt;path string&lt;/strong&gt;, &lt;strong&gt;array of ids&lt;/strong&gt;, or a &lt;strong&gt;&lt;code&gt;visited&lt;/code&gt;&lt;/strong&gt; bitmap column in the recursive leg and &lt;strong&gt;abort&lt;/strong&gt; when you would revisit a node. Without that, a single back-edge can recurse until the engine stops you. Even with guards, &lt;strong&gt;very deep hierarchies&lt;/strong&gt; on hot OLTP paths may be the wrong layer — &lt;strong&gt;materialized paths&lt;/strong&gt;, &lt;strong&gt;closure tables&lt;/strong&gt;, or &lt;strong&gt;graph services&lt;/strong&gt; exist because repeated recursion is &lt;strong&gt;CPU&lt;/strong&gt;- and &lt;strong&gt;lock&lt;/strong&gt;-heavy.&lt;/p&gt;
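
&lt;p&gt;One hedged sketch of visit tracking in PostgreSQL array syntax, over an illustrative &lt;code&gt;edges(src, dst)&lt;/code&gt; table; other engines substitute path strings:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Walk edges(src, dst), refusing to revisit a node on the current path
WITH RECURSIVE walk AS (
    SELECT e.src, e.dst, ARRAY[e.src, e.dst] AS path
    FROM edges e
    WHERE e.src = 1                       -- illustrative start node
    UNION ALL
    SELECT e.src, e.dst, w.path || e.dst
    FROM edges e
    JOIN walk w ON e.src = w.dst
    WHERE NOT e.dst = ANY(w.path)         -- cycle guard: abort on revisit
)
SELECT * FROM walk;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;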

&lt;h4&gt;
  
  
  Bill-of-materials and explosion factors
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;BOM&lt;/strong&gt; joins are recursive in business language: each part may &lt;strong&gt;decompose&lt;/strong&gt; into sub-parts with &lt;strong&gt;quantities&lt;/strong&gt;. Seniors track &lt;strong&gt;multiplicative quantities&lt;/strong&gt; through levels (parent qty × child qty) and watch for &lt;strong&gt;diamond&lt;/strong&gt; structures where the same sub-assembly appears twice — dedupe keys or &lt;strong&gt;DAG&lt;/strong&gt; modeling prevents double-counting &lt;strong&gt;rollups&lt;/strong&gt;.&lt;/p&gt;
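
&lt;p&gt;A minimal BOM sketch, assuming an illustrative &lt;code&gt;bom(parent_part, child_part, qty)&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Explode a bill of materials, multiplying quantities through levels
WITH RECURSIVE exploded AS (
    SELECT child_part, qty, 1 AS level
    FROM bom
    WHERE parent_part = 'WIDGET-1'        -- illustrative root assembly
    UNION ALL
    SELECT b.child_part,
           e.qty * b.qty,                 -- parent qty * child qty
           e.level + 1
    FROM bom b
    JOIN exploded e ON b.parent_part = e.child_part
)
SELECT child_part, SUM(qty) AS total_qty  -- roll up diamond paths once
FROM exploded
GROUP BY child_part;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;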

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing uniqueness&lt;/strong&gt; — duplicate edges cause &lt;strong&gt;exponential&lt;/strong&gt; blowups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cycle guard&lt;/strong&gt; on adjacency with back-edges — recursion runs away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deep graphs in OLTP hot paths&lt;/strong&gt; — offload or &lt;strong&gt;materialize&lt;/strong&gt; paths offline.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Plans, indexes, and partitions — observability meets physics
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;EXPLAIN&lt;/code&gt; is the senior’s debugger
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;plans make I/O and CPU explicit&lt;/strong&gt; — seniors diff &lt;strong&gt;estimated vs actual&lt;/strong&gt; rows and watch for &lt;strong&gt;seq scans&lt;/strong&gt;, &lt;strong&gt;spills&lt;/strong&gt;, &lt;strong&gt;bad nested loops&lt;/strong&gt;, and &lt;strong&gt;stale stats&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A junior sees “slow query.” A senior checks &lt;strong&gt;selectivity&lt;/strong&gt;, &lt;strong&gt;projection width&lt;/strong&gt;, &lt;strong&gt;join order&lt;/strong&gt;, and whether &lt;strong&gt;indexes&lt;/strong&gt; match &lt;strong&gt;predicate leading columns&lt;/strong&gt;. On warehouses, translate the same instinct to &lt;strong&gt;partition pruning&lt;/strong&gt;, &lt;strong&gt;cluster keys&lt;/strong&gt;, and &lt;strong&gt;slot contention&lt;/strong&gt; — different nouns, same skepticism about &lt;strong&gt;full reads&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmlvvp7rtijss74yb6f6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmlvvp7rtijss74yb6f6.jpeg" alt="Stylized SQL execution plan tree — Seq Scan vs Index Scan branches, cost labels, estimated rows — on PipeCode technical diagram background." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Index strategy (B-tree mental model)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Composite indexes&lt;/strong&gt; follow &lt;strong&gt;left-prefix&lt;/strong&gt; use: &lt;code&gt;(customer_id, order_date)&lt;/code&gt; helps &lt;code&gt;WHERE customer_id = ?&lt;/code&gt; and &lt;strong&gt;range&lt;/strong&gt; &lt;code&gt;order_date&lt;/code&gt; &lt;strong&gt;within&lt;/strong&gt; that customer — not arbitrary &lt;code&gt;order_date&lt;/code&gt; alone. &lt;strong&gt;Covering&lt;/strong&gt; indexes include projections to avoid &lt;strong&gt;heap lookups&lt;/strong&gt; when MVCC engines make that worthwhile. Always weigh &lt;strong&gt;write amplification&lt;/strong&gt; on hot ingestion tables.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DDL&lt;/th&gt;
&lt;th&gt;intent&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE INDEX ON orders(customer_id, order_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;seek customer timeline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
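
&lt;p&gt;A quick sketch of left-prefix behavior against that index; &lt;code&gt;order_id&lt;/code&gt; and &lt;code&gt;total&lt;/code&gt; are illustrative columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Assumed index: CREATE INDEX ON orders(customer_id, order_date)
-- Served by the left prefix: equality on customer_id, range on order_date
SELECT order_id, order_date, total
FROM orders
WHERE customer_id = 42
  AND order_date &amp;gt;= DATE '2025-01-01'
  AND order_date &amp;lt;  DATE '2025-04-01';

-- NOT served by that index alone: no leading-column predicate,
-- so expect a scan unless a separate order_date index exists
SELECT order_id
FROM orders
WHERE order_date &amp;gt;= DATE '2025-01-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;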

&lt;h4&gt;
  
  
  Partitioning — prune, don’t pray
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Range&lt;/strong&gt; partitions on &lt;strong&gt;&lt;code&gt;order_date&lt;/code&gt;&lt;/strong&gt; let engines &lt;strong&gt;skip&lt;/strong&gt; cold files or table segments. Seniors write predicates that &lt;strong&gt;align&lt;/strong&gt; to partition keys (&lt;strong&gt;half-open&lt;/strong&gt; ranges help). Anti-pattern: &lt;strong&gt;function-wrapped partition columns&lt;/strong&gt; that defeat pruning (&lt;code&gt;WHERE YEAR(dt)=2025&lt;/code&gt; instead of a range on &lt;code&gt;dt&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate on &lt;code&gt;event_date&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;partition prune?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_date &amp;gt;= DATE '2025-04-01' AND event_date &amp;lt; DATE '2025-05-01'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes — engine can eliminate irrelevant segments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXTRACT(YEAR FROM event_date) = 2025&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;often &lt;strong&gt;no&lt;/strong&gt; — function masks the column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;join on &lt;strong&gt;surrogate&lt;/strong&gt; only, filter on &lt;strong&gt;dimension&lt;/strong&gt; date later&lt;/td&gt;
&lt;td&gt;risky — fact scans may widen before filter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
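
&lt;p&gt;A hedged DDL sketch in PostgreSQL declarative-partitioning syntax (names illustrative; warehouse syntax differs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Declarative range partitioning
CREATE TABLE events (
    event_id   BIGINT,
    event_date DATE,
    payload    TEXT
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2025_04 PARTITION OF events
    FOR VALUES FROM ('2025-04-01') TO ('2025-05-01');

-- Half-open predicate on the raw partition column lets pruning fire
SELECT COUNT(*)
FROM events
WHERE event_date &amp;gt;= DATE '2025-04-01'
  AND event_date &amp;lt;  DATE '2025-05-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;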

&lt;h4&gt;
  
  
  Reading a plan like a diff
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Treat &lt;strong&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;&lt;/strong&gt; (Postgres) or vendor equivalents as a &lt;strong&gt;before/after diff&lt;/strong&gt;: did &lt;strong&gt;estimated rows&lt;/strong&gt; diverge from &lt;strong&gt;actual&lt;/strong&gt; by 10× (hinting &lt;strong&gt;stale stats&lt;/strong&gt; or &lt;strong&gt;correlated&lt;/strong&gt; predicates)? Did a &lt;strong&gt;hash join&lt;/strong&gt; &lt;strong&gt;spill&lt;/strong&gt;? Did a &lt;strong&gt;nested loop&lt;/strong&gt; suddenly execute &lt;strong&gt;billions&lt;/strong&gt; of inner probes? Those questions map to &lt;strong&gt;histogram refresh&lt;/strong&gt;, &lt;strong&gt;predicate rewrite&lt;/strong&gt;, &lt;strong&gt;index leading key&lt;/strong&gt;, or &lt;strong&gt;join order&lt;/strong&gt; hints — pick one lever per iteration.&lt;/p&gt;
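
&lt;p&gt;A minimal Postgres-flavored starting point; the comments list what to diff, since real plan output varies by engine and data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Postgres-flavored; vendor equivalents differ. Diff these signals:
--   * estimated vs actual rows per node (10x gaps hint at stale stats)
--   * Sort/Hash nodes spilling to disk instead of staying in memory
--   * nested-loop inner-side executions (the "loops" count)
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.region, SUM(o.amount)
FROM customers c
JOIN orders o ON o.customer_id = c.id
WHERE o.order_date &amp;gt;= DATE '2025-04-01'
GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;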

&lt;h4&gt;
  
  
  Selective partial indexes and write amplification
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Partial&lt;/strong&gt; indexes (&lt;code&gt;WHERE status = 'OPEN'&lt;/code&gt;) shrink index size on &lt;strong&gt;skewed&lt;/strong&gt; status columns and speed hot paths that always filter the same slice — at the cost of &lt;strong&gt;planner&lt;/strong&gt; surprises if ORMs omit the same predicate. &lt;strong&gt;Covering&lt;/strong&gt; indexes add include-columns to satisfy &lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt; lists in &lt;strong&gt;index-only&lt;/strong&gt; scans but increase &lt;strong&gt;VACUUM&lt;/strong&gt;/maintenance surface area on write-heavy tables.&lt;/p&gt;
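
&lt;p&gt;A hedged sketch of both index flavors in PostgreSQL syntax (index and column names illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partial index: small and hot, but only used when queries repeat the predicate
CREATE INDEX idx_orders_open
    ON orders (customer_id)
    WHERE status = 'OPEN';

-- Covering index (PostgreSQL INCLUDE): enables index-only scans for a known
-- SELECT list, at the price of extra write and maintenance cost
CREATE INDEX idx_orders_cust_date
    ON orders (customer_id, order_date)
    INCLUDE (amount);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;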

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Indexing every column&lt;/strong&gt; — harms writes; weak selectivity on indexed columns hurts planner choices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blaming the planner&lt;/strong&gt; before checking &lt;strong&gt;vacuum/analyze&lt;/strong&gt;, &lt;strong&gt;AUTO STATS&lt;/strong&gt;, or &lt;strong&gt;histogram freshness&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Micro-benchmarking on empty tables&lt;/strong&gt; — plans change radically at scale.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6. Isolation, transactions, and locking — correctness under concurrency
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Isolation is a contract, not a vibe
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;isolation levels trade anomalies for throughput&lt;/strong&gt; — seniors pick with eyes open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Know the four textbook anomalies — &lt;strong&gt;dirty reads&lt;/strong&gt;, &lt;strong&gt;non-repeatable reads&lt;/strong&gt;, &lt;strong&gt;phantoms&lt;/strong&gt;, &lt;strong&gt;serialization anomalies&lt;/strong&gt; — and which levels suppress which on your engine defaults (&lt;strong&gt;read committed&lt;/strong&gt; vs &lt;strong&gt;repeatable read&lt;/strong&gt; vs &lt;strong&gt;serializable&lt;/strong&gt; / &lt;strong&gt;snapshot&lt;/strong&gt;). &lt;strong&gt;Locks&lt;/strong&gt; (&lt;code&gt;row&lt;/code&gt; and &lt;code&gt;predicate&lt;/code&gt;) and the &lt;code&gt;deadlocks&lt;/code&gt; they can produce are how databases enforce those stories under write contention.&lt;/p&gt;

&lt;h4&gt;
  
  
  Locking &amp;amp; deadlocks
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Deadlocks&lt;/strong&gt; arise from &lt;strong&gt;opposite lock order&lt;/strong&gt; on two resources — mitigation is &lt;strong&gt;consistent lock acquisition order&lt;/strong&gt;, &lt;strong&gt;smaller transactions&lt;/strong&gt;, and &lt;strong&gt;retries&lt;/strong&gt; on &lt;code&gt;40001&lt;/code&gt;-class errors where supported. Seniors &lt;strong&gt;capture deadlock graphs&lt;/strong&gt; instead of guessing.&lt;/p&gt;
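
&lt;p&gt;A minimal sketch of ordered acquisition plus retry intent, assuming an illustrative &lt;code&gt;accounts&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Lock rows in a consistent (ascending id) order in every writer;
-- opposite orders are the classic deadlock recipe
BEGIN;

SELECT id, balance
FROM accounts
WHERE id IN (7, 9)
ORDER BY id                -- consistent acquisition order
FOR UPDATE;

UPDATE accounts SET balance = balance - 100 WHERE id = 7;
UPDATE accounts SET balance = balance + 100 WHERE id = 9;

COMMIT;
-- On SQLSTATE 40001 (serialization failure), retry the whole
-- transaction from the application with idempotent logic.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;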

&lt;h4&gt;
  
  
  Isolation levels vs anomalies (memory aid)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Different engines implement &lt;strong&gt;snapshot&lt;/strong&gt;, &lt;strong&gt;MVCC&lt;/strong&gt;, and &lt;strong&gt;predicate locks&lt;/strong&gt; differently, but interviewers still expect you to &lt;strong&gt;name&lt;/strong&gt; anomalies and which &lt;strong&gt;level&lt;/strong&gt; tolerates them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;isolation level (typical names)&lt;/th&gt;
&lt;th&gt;dirty read&lt;/th&gt;
&lt;th&gt;non-repeatable read&lt;/th&gt;
&lt;th&gt;phantom read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read uncommitted&lt;/td&gt;
&lt;td&gt;allowed&lt;/td&gt;
&lt;td&gt;allowed&lt;/td&gt;
&lt;td&gt;allowed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read committed&lt;/td&gt;
&lt;td&gt;blocked&lt;/td&gt;
&lt;td&gt;possible&lt;/td&gt;
&lt;td&gt;possible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeatable read / snapshot&lt;/td&gt;
&lt;td&gt;blocked&lt;/td&gt;
&lt;td&gt;blocked&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;engine-dependent&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Serializable&lt;/td&gt;
&lt;td&gt;blocked&lt;/td&gt;
&lt;td&gt;blocked&lt;/td&gt;
&lt;td&gt;blocked (often at &lt;strong&gt;throughput&lt;/strong&gt; cost)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  A pragmatic concurrency playbook
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Default to &lt;strong&gt;short&lt;/strong&gt; transactions, &lt;strong&gt;ordered&lt;/strong&gt; lock acquisition on shared resources, &lt;strong&gt;&lt;code&gt;SELECT … FOR UPDATE&lt;/code&gt;&lt;/strong&gt; only when you mean it, and &lt;strong&gt;idempotent&lt;/strong&gt; retry logic for &lt;strong&gt;serialization&lt;/strong&gt; failures. For &lt;strong&gt;analytics&lt;/strong&gt;, &lt;strong&gt;read-only&lt;/strong&gt; replicas or &lt;strong&gt;warehouse&lt;/strong&gt; sessions isolate heavy scans from OLTP &lt;strong&gt;lock&lt;/strong&gt; pressure — another form of &lt;strong&gt;isolation&lt;/strong&gt;, just at the architecture layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Long transactions holding locks&lt;/strong&gt; while calling HTTP services — stalls the whole store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit &lt;code&gt;READ UNCOMMITTED&lt;/code&gt;&lt;/strong&gt; “for speed” — surprises downstream with dirty reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Assuming ORMs manage transaction boundaries&lt;/strong&gt; — you still own batch boundaries.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Modeling, ETL SQL, quality checks, and anti-patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Readable pipelines and honest schemas
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;modeling decides which SQL is even possible&lt;/strong&gt; — stars/snowflakes, surrogate keys, &lt;strong&gt;SCD&lt;/strong&gt; strategies, and &lt;strong&gt;facts at grain&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Seniors design &lt;strong&gt;fact tables&lt;/strong&gt; at &lt;strong&gt;immutable event grain&lt;/strong&gt; and &lt;strong&gt;dimensions&lt;/strong&gt; for attributes that change slowly. &lt;strong&gt;CTEs&lt;/strong&gt; stage &lt;strong&gt;raw → cleaned → conformed → aggregated&lt;/strong&gt; layers so diffs read like dataflow, not wall-of-text SQL. &lt;strong&gt;DQ checks&lt;/strong&gt; (&lt;code&gt;GROUP BY&lt;/code&gt; duplicate detectors, &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt; rate&lt;/strong&gt; scans) belong beside transforms, not after CFO escalations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl9f50mt2r2dj0tl4p98.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyl9f50mt2r2dj0tl4p98.jpeg" alt="Simplified star schema diagram — fact table center with foreign keys to dimension tables — PipeCode data modeling infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  ETL SQL readability
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Prefer &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt; chains&lt;/strong&gt; with &lt;strong&gt;named intents&lt;/strong&gt; (&lt;code&gt;cleaned_events&lt;/code&gt;, &lt;code&gt;daily_revenue&lt;/code&gt;) over nested opaque subqueries. Warehouse runners still care — maintainers &lt;strong&gt;git blame&lt;/strong&gt; your CTE names at 2 AM.&lt;/p&gt;
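
&lt;p&gt;A minimal sketch of that staged shape; CTE and table names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Staged dataflow: each CTE name states its intent
WITH cleaned_events AS (
    SELECT user_id,
           CAST(event_ts AS DATE) AS event_date,
           amount
    FROM raw_events
    WHERE amount IS NOT NULL
),
daily_revenue AS (
    SELECT event_date, SUM(amount) AS revenue
    FROM cleaned_events
    GROUP BY event_date
)
SELECT event_date, revenue
FROM daily_revenue
ORDER BY event_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;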

&lt;h4&gt;
  
  
  Keys, grain, and slowly changing dimensions
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Natural keys&lt;/strong&gt; (email, SKU) feel convenient until merges, typos, or vendor changes arrive — &lt;strong&gt;surrogate keys&lt;/strong&gt; stabilize joins but require &lt;strong&gt;disciplined&lt;/strong&gt; ETL to preserve &lt;strong&gt;history&lt;/strong&gt;. &lt;strong&gt;SCD Type 1&lt;/strong&gt; overwrites attributes (&lt;strong&gt;easy, history lost&lt;/strong&gt;); &lt;strong&gt;Type 2&lt;/strong&gt; versions rows with &lt;strong&gt;&lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;truthful, joins heavier&lt;/strong&gt;); &lt;strong&gt;Type 3&lt;/strong&gt; keeps &lt;strong&gt;limited prior&lt;/strong&gt; columns (&lt;strong&gt;rare, simplified&lt;/strong&gt;). Seniors pick per attribute: addresses often &lt;strong&gt;Type 2&lt;/strong&gt;, corrected typos sometimes &lt;strong&gt;Type 1&lt;/strong&gt; with audit trails elsewhere.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SCD flavor&lt;/th&gt;
&lt;th&gt;when seniors choose it&lt;/th&gt;
&lt;th&gt;SQL consequence&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Type 1&lt;/td&gt;
&lt;td&gt;truth today only&lt;/td&gt;
&lt;td&gt;simple dimension join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 2&lt;/td&gt;
&lt;td&gt;legal/finance needs history&lt;/td&gt;
&lt;td&gt;join on &lt;strong&gt;as-of&lt;/strong&gt; or &lt;strong&gt;current&lt;/strong&gt; flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Type 3&lt;/td&gt;
&lt;td&gt;“last previous region” reporting&lt;/td&gt;
&lt;td&gt;extra columns, lighter than full Type 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
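
&lt;p&gt;To make the &lt;strong&gt;Type 2&lt;/strong&gt; consequence concrete, a hedged as-of join sketch; it assumes a versioned &lt;code&gt;dim_customer&lt;/code&gt; with half-open &lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; and an &lt;code&gt;orders&lt;/code&gt; fact (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- As-of join: pick the dimension version valid when the order happened.
SELECT o.order_id,
       o.order_date,
       c.address                        -- the address as of order_date
FROM orders o
JOIN dim_customer c
  ON c.customer_id = o.customer_id
 AND o.order_date &amp;gt;= c.valid_from
 AND o.order_date &amp;lt;  c.valid_to;    -- current rows carry a far-future valid_to
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;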

&lt;h4&gt;
  
  
  DQ probes beside transforms, not after escalations
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Lightweight checks catch &lt;strong&gt;contract&lt;/strong&gt; breaks early: &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;&lt;/strong&gt; natural key &lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; finds duplicates; a &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt; rate&lt;/strong&gt; like &lt;code&gt;SUM(CASE WHEN col IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*)&lt;/code&gt; on critical columns flags ingestion drift (the &lt;code&gt;ELSE 0&lt;/code&gt; and &lt;code&gt;* 1.0&lt;/code&gt; guard against a &lt;code&gt;NULL&lt;/code&gt; sum and integer division); &lt;strong&gt;referential&lt;/strong&gt; probes (&lt;code&gt;LEFT JOIN&lt;/code&gt; the dimension, keep fact keys &lt;strong&gt;not matched&lt;/strong&gt;) catch orphan facts before CFO reviews. All three are sketched below.&lt;/p&gt;
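
&lt;p&gt;A hedged sketch of the three probes (table and key names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Duplicate detector on the natural key.
SELECT natural_key, COUNT(*) AS n
FROM cleaned_events
GROUP BY natural_key
HAVING COUNT(*) &amp;gt; 1;

-- NULL-rate probe on a critical column.
SELECT SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) * 1.0 / COUNT(*) AS null_rate
FROM cleaned_events;

-- Referential probe: fact keys with no matching dimension row.
SELECT f.customer_id, COUNT(*) AS orphan_rows
FROM fact_orders f
LEFT JOIN dim_customer d ON d.customer_id = f.customer_id
WHERE d.customer_id IS NULL
GROUP BY f.customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;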

&lt;h4&gt;
  
  
  Anti-patterns seniors refuse
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;SELECT-star&lt;/strong&gt; in hot paths widens IO. &lt;strong&gt;Functions on indexed columns&lt;/strong&gt; (&lt;code&gt;LOWER(email)&lt;/code&gt;) can &lt;strong&gt;suppress&lt;/strong&gt; index use — prefer &lt;strong&gt;computed / persisted&lt;/strong&gt; columns, &lt;strong&gt;expression indexes&lt;/strong&gt;, or &lt;strong&gt;case-folded&lt;/strong&gt; canonical fields. Correlated subqueries &lt;strong&gt;can&lt;/strong&gt; be fine — or catastrophic — &lt;strong&gt;verify plans&lt;/strong&gt;.&lt;/p&gt;
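
&lt;p&gt;One way to keep case-insensitive lookups index-friendly, sketched as a PostgreSQL expression index (names are illustrative; SQL Server reaches a similar place with a persisted computed column):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Index the folded expression so the predicate below stays sargable.
CREATE INDEX idx_users_email_lower ON users (LOWER(email));

-- This predicate matches the indexed expression, so the planner can seek.
SELECT user_id
FROM users
WHERE LOWER(email) = LOWER('Ada@Example.com');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;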

&lt;h4&gt;
  
  
  Materialized views / rollups
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Materialized views&lt;/strong&gt; (where supported) precompute heavy aggregates for stable dashboards — trade &lt;strong&gt;staleness&lt;/strong&gt; for &lt;strong&gt;latency&lt;/strong&gt;. Document &lt;strong&gt;refresh&lt;/strong&gt; semantics; seniors don’t hide &lt;strong&gt;hourly lag&lt;/strong&gt; behind a “live” button label.&lt;/p&gt;
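
&lt;p&gt;A PostgreSQL-flavored sketch; the rollup query and refresh cadence are illustrative, and refresh semantics differ by engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Precompute the heavy rollup once.
CREATE MATERIALIZED VIEW mv_daily_revenue AS
SELECT date_trunc('day', sale_date)::date AS sale_day,
       SUM(revenue) AS total_revenue
FROM sales
GROUP BY 1;

-- A unique index lets CONCURRENTLY refresh without blocking readers.
CREATE UNIQUE INDEX idx_mv_daily_revenue ON mv_daily_revenue (sale_day);

-- Run on the documented schedule; this is the staleness you disclose.
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_revenue;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;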

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Denormalizing “because warehouses love joins”&lt;/strong&gt; without &lt;strong&gt;SCD&lt;/strong&gt; strategy — creates retroactive lies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DQ as an afterthought&lt;/strong&gt; — duplicates discovered monthly should be detected daily.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Tips to stay senior under interview clocks
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Start every join question with cardinality&lt;/strong&gt; — who is 1, who is N, what is the output grain?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default to half-open time windows&lt;/strong&gt; for reporting — fewer off-by-one month bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Say “I’d diff the plan”&lt;/strong&gt; — then list &lt;strong&gt;stats&lt;/strong&gt;, &lt;strong&gt;indexes&lt;/strong&gt;, &lt;strong&gt;data skew&lt;/strong&gt;, &lt;strong&gt;predicate bake-in&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know when SQL stops&lt;/strong&gt; — deep cyclic graphs may belong in &lt;strong&gt;graph tools&lt;/strong&gt;, not a 90-line CTE battle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Name frame units aloud&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;ROWS&lt;/code&gt;&lt;/strong&gt; (fixed neighbors) vs &lt;strong&gt;&lt;code&gt;RANGE&lt;/code&gt;&lt;/strong&gt; (business time) prevents silent KPI drift; see the frame sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where to practice on PipeCode&lt;/strong&gt; — chain &lt;a href="https://dev.to/explore/practice/topic/window-functions/sql"&gt;window functions →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;joins →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/topic/cte/sql"&gt;CTEs →&lt;/a&gt;, and &lt;a href="https://dev.to/explore/practice/topic/aggregation/sql"&gt;aggregation →&lt;/a&gt; until window + join stories feel automatic.&lt;/li&gt;
&lt;/ul&gt;
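
&lt;p&gt;To rehearse the frame-unit tip, a small sketch (PostgreSQL 11+ syntax; &lt;code&gt;daily_revenue&lt;/code&gt; is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- ROWS: the 6 preceding physical rows plus the current one.
-- RANGE: everything within 6 days of business time, gaps and ties included.
SELECT sale_day,
       revenue,
       AVG(revenue) OVER (ORDER BY sale_day
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7_rows,
       AVG(revenue) OVER (ORDER BY sale_day
                          RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW) AS avg_7_days
FROM daily_revenue;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;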




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is “senior SQL” in hiring terms?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Senior SQL&lt;/strong&gt; means you can ship &lt;strong&gt;correct&lt;/strong&gt;, &lt;strong&gt;efficient&lt;/strong&gt;, &lt;strong&gt;maintainable&lt;/strong&gt; relational workloads — reading &lt;strong&gt;plans&lt;/strong&gt;, designing &lt;strong&gt;indexes/partitions&lt;/strong&gt;, mastering &lt;strong&gt;windows&lt;/strong&gt; and &lt;strong&gt;recursive&lt;/strong&gt; patterns, and debugging &lt;strong&gt;concurrency&lt;/strong&gt; issues — not only writing syntactically valid queries on toy tables. Interviewers listen for &lt;strong&gt;explicit cardinality stories&lt;/strong&gt;, &lt;strong&gt;failure modes&lt;/strong&gt; (locks, skew, bad stats), and &lt;strong&gt;refactor&lt;/strong&gt; discipline: can you improve a query &lt;strong&gt;without&lt;/strong&gt; hiding problems behind &lt;code&gt;DISTINCT&lt;/code&gt;?&lt;/p&gt;

&lt;h3&gt;
  
  
  How is senior SQL different from knowing a specific warehouse?
&lt;/h3&gt;

&lt;p&gt;Dialects differ (&lt;strong&gt;BigQuery&lt;/strong&gt; vs &lt;strong&gt;Snowflake&lt;/strong&gt; vs &lt;strong&gt;Postgres&lt;/strong&gt;), but &lt;strong&gt;grain&lt;/strong&gt;, &lt;strong&gt;join cardinality&lt;/strong&gt;, &lt;strong&gt;frames&lt;/strong&gt;, &lt;strong&gt;pruning&lt;/strong&gt;, and &lt;strong&gt;isolation&lt;/strong&gt; transfer. Seniors learn &lt;strong&gt;local plan vocabulary&lt;/strong&gt; fast because the &lt;strong&gt;invariants&lt;/strong&gt; repeat. The differentiator is not memorizing &lt;strong&gt;&lt;code&gt;QUALIFY&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;CLUSTER BY&lt;/code&gt;&lt;/strong&gt; alone — it is mapping each feature back to &lt;strong&gt;fewer bytes read&lt;/strong&gt;, &lt;strong&gt;fewer shuffles&lt;/strong&gt;, or &lt;strong&gt;clearer&lt;/strong&gt; semantics.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I prefer &lt;code&gt;RANK&lt;/code&gt; over &lt;code&gt;ROW_NUMBER&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;Use &lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; when &lt;strong&gt;tie groups&lt;/strong&gt; must share standing (e.g., “top quartile bands”). Use &lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; when you need &lt;strong&gt;deterministic dedup&lt;/strong&gt; or &lt;strong&gt;exactly N rows&lt;/strong&gt; with explicit tie-break columns. If the prompt says “top three salaries” but &lt;strong&gt;ties&lt;/strong&gt; may exceed three people, &lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; can overshoot row count — say that aloud and clarify requirements before coding.&lt;/p&gt;
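
&lt;p&gt;A side-by-side sketch (table and tie-break column are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH ranked AS (
    SELECT employee_id,
           salary,
           RANK()       OVER (ORDER BY salary DESC)              AS rnk,
           ROW_NUMBER() OVER (ORDER BY salary DESC, employee_id) AS rn
    FROM employees
)
SELECT employee_id, salary, rnk, rn
FROM ranked
WHERE rn &amp;lt;= 3;   -- exactly three rows; swap to rnk &amp;lt;= 3 to keep whole tie groups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;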

&lt;h3&gt;
  
  
  Do I always need an index for fast queries?
&lt;/h3&gt;

&lt;p&gt;No — tiny tables or &lt;strong&gt;analytical scans&lt;/strong&gt; may be cheaper &lt;strong&gt;sequential&lt;/strong&gt;; &lt;strong&gt;write-heavy&lt;/strong&gt; tables pay &lt;strong&gt;index maintenance&lt;/strong&gt;. Seniors choose based on &lt;strong&gt;selectivity&lt;/strong&gt;, &lt;strong&gt;predicate shape&lt;/strong&gt;, and &lt;strong&gt;observed&lt;/strong&gt; plans — not folklore. Sometimes the winning move is &lt;strong&gt;narrower projections&lt;/strong&gt;, &lt;strong&gt;pre-aggregation&lt;/strong&gt;, or &lt;strong&gt;better stats&lt;/strong&gt; rather than a new B-tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the biggest modeling mistake in analytics SQL?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Accidental grain shift&lt;/strong&gt; — joining dimensions or facts so &lt;strong&gt;one event becomes many rows&lt;/strong&gt;, then aggregating as if grain were still one row per event. Fix the &lt;strong&gt;join graph&lt;/strong&gt;, not the &lt;code&gt;DISTINCT&lt;/code&gt;. The durable fix is usually &lt;strong&gt;staging&lt;/strong&gt; at the correct &lt;strong&gt;grain&lt;/strong&gt; (e.g., per &lt;strong&gt;user-day&lt;/strong&gt;) before attaching wide dimensions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I practice senior patterns safely?
&lt;/h3&gt;

&lt;p&gt;Work on &lt;strong&gt;larger&lt;/strong&gt; realistic slices — &lt;strong&gt;partitioned&lt;/strong&gt; time series, &lt;strong&gt;skewed&lt;/strong&gt; keys, and &lt;strong&gt;multi-step&lt;/strong&gt; CTE pipelines — and always &lt;strong&gt;inspect plans&lt;/strong&gt; after rewrites. Supplement reading with timed reps on &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;SQL topics →&lt;/a&gt; you're weakest at. Rotate &lt;a href="https://dev.to/explore/practice/topic/filtering/sql"&gt;filtering →&lt;/a&gt; and &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;join →&lt;/a&gt; sets when questions hide &lt;strong&gt;fan-out&lt;/strong&gt; in plain English.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; interview-grade problems spanning &lt;strong&gt;joins&lt;/strong&gt;, &lt;strong&gt;aggregation&lt;/strong&gt;, &lt;strong&gt;window&lt;/strong&gt; analytics, &lt;strong&gt;CTEs&lt;/strong&gt;, and &lt;strong&gt;filtering&lt;/strong&gt; in SQL. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, narrow to &lt;a href="https://dev.to/explore/practice/language/sql"&gt;language SQL →&lt;/a&gt;, and drill harder sets on &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;SQL topic hub →&lt;/a&gt;. &lt;a href="https://dev.to/subscribe"&gt;Unlock plans →&lt;/a&gt; when you want unrestricted runs.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Reporting Services in SQL (SSRS): Architecture, Report Types, RDL &amp; Interview Notes</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Wed, 13 May 2026 05:55:50 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/reporting-services-in-sql-ssrs-architecture-report-types-rdl-interview-notes-4fnc</link>
      <guid>https://forem.com/gowthampotureddi/reporting-services-in-sql-ssrs-architecture-report-types-rdl-interview-notes-4fnc</guid>
      <description>&lt;p&gt;&lt;strong&gt;Reporting services in SQL&lt;/strong&gt; are the products and platforms that turn raw query results into governed &lt;strong&gt;business reports&lt;/strong&gt; — charts, paginated PDFs, scheduled email attachments, and portal folders with permissions. The ecosystem runs from &lt;strong&gt;open SQL&lt;/strong&gt; against transactional or warehouse databases to presentation layers your stakeholders actually open; on Windows-centric stacks &lt;strong&gt;SQL Server Reporting Services (SSRS)&lt;/strong&gt; remains the classic teaching example because it couples &lt;strong&gt;SQL datasets&lt;/strong&gt; to &lt;strong&gt;RDL&lt;/strong&gt; definitions and a centralized &lt;strong&gt;report server&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The mental model never changes at the center: &lt;strong&gt;SQL establishes grain and facts&lt;/strong&gt;, the reporting layer &lt;strong&gt;binds parameters&lt;/strong&gt; and &lt;strong&gt;lays out banded sections&lt;/strong&gt;, and &lt;strong&gt;subscriptions&lt;/strong&gt; push artifacts on a calendar. After the hero image, you can jump straight into interview prep reps that strengthen the same predicates and aggregates your datasets rely on:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xgzul2bf8lfm5trgqn4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xgzul2bf8lfm5trgqn4.jpeg" alt="PipeCode blog header for SQL reporting services — bold white headline 'Reporting Services in SQL' with subtitle 'SSRS · RDL · governed reports' and a minimal report-server diagram on dark gradient with pipecode.ai attribution." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/explore/practice"&gt;Browse practice hub →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL language practice →&lt;/a&gt;, tighten &lt;a href="https://dev.to/explore/practice/topic/aggregation/sql"&gt;aggregation →&lt;/a&gt;, sharpen &lt;a href="https://dev.to/explore/practice/topic/filtering/sql"&gt;filters →&lt;/a&gt;, and revisit &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;joins →&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why reporting services sit between SQL and the business&lt;/li&gt;
&lt;li&gt;SSRS architecture — four components and the request path&lt;/li&gt;
&lt;li&gt;Report types you should be able to explain cold&lt;/li&gt;
&lt;li&gt;Datasets, data sources, RDL, and expressions&lt;/li&gt;
&lt;li&gt;Parameters, subscriptions, exports, and security&lt;/li&gt;
&lt;li&gt;SSRS versus Power BI — how to frame trade-offs&lt;/li&gt;
&lt;li&gt;From SQL snippet to scheduled PDF — rehearsal workflow&lt;/li&gt;
&lt;li&gt;Tips for reporting-aware SQL interviews&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why reporting services sit between SQL and the business
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Raw tables are honest; stakeholders need narrative artifacts
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;reporting services do not replace SQL&lt;/strong&gt; — they &lt;strong&gt;repeat&lt;/strong&gt; vetted statements under &lt;strong&gt;access control&lt;/strong&gt;, &lt;strong&gt;versioned templates&lt;/strong&gt;, and &lt;strong&gt;distribution&lt;/strong&gt; semantics that ad-hoc query tools skip.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; An analyst can &lt;code&gt;SELECT month, SUM(revenue)&lt;/code&gt; perfectly once, but finance still demands &lt;strong&gt;the same definition&lt;/strong&gt; every Monday morning as a &lt;strong&gt;PDF&lt;/strong&gt; with &lt;strong&gt;headers&lt;/strong&gt;, &lt;strong&gt;page breaks&lt;/strong&gt;, and &lt;strong&gt;drill paths&lt;/strong&gt;. Reporting servers cache execution logs, route credentials through shared data sources, and let operators &lt;strong&gt;schedule&lt;/strong&gt; rendering — responsibilities beyond a bare JDBC session.&lt;/p&gt;

&lt;p&gt;The gap you are filling is not “prettier grids.” It is &lt;strong&gt;operational trust&lt;/strong&gt;: a report is a &lt;strong&gt;contract&lt;/strong&gt; that names &lt;em&gt;which&lt;/em&gt; database, &lt;em&gt;which&lt;/em&gt; filter rules, &lt;em&gt;which&lt;/em&gt; grain, &lt;em&gt;which&lt;/em&gt; owner, and &lt;em&gt;how&lt;/em&gt; refreshed — then proves it ran the same way &lt;strong&gt;last Tuesday&lt;/strong&gt; as &lt;strong&gt;today&lt;/strong&gt;. Spreadsheets and one-off SQL notebooks rarely preserve that lineage at enterprise scale.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Interview answers improve when you name &lt;strong&gt;three layers&lt;/strong&gt; aloud — &lt;strong&gt;data source (connection)&lt;/strong&gt;, &lt;strong&gt;dataset (query + parameters)&lt;/strong&gt;, &lt;strong&gt;layout (bands + expressions)&lt;/strong&gt; — before mentioning chart types.&lt;/p&gt;

&lt;p&gt;Reporting is &lt;strong&gt;broadcast&lt;/strong&gt;: many readers, one definition. Ad-hoc analytics is &lt;strong&gt;narrowcast&lt;/strong&gt;: one analyst, evolving logic. Mixing the two without a catalog is how “two official KPIs” happen.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What “services” means in this context
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;owns&lt;/th&gt;
&lt;th&gt;failure mode if ignored&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;database&lt;/td&gt;
&lt;td&gt;correctness, keys, SLAs&lt;/td&gt;
&lt;td&gt;pretty charts lie gracefully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;reporting server&lt;/td&gt;
&lt;td&gt;authZ, caching, schedule&lt;/td&gt;
&lt;td&gt;leaked rows, duplicate deliveries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;presentation&lt;/td&gt;
&lt;td&gt;layout, exports&lt;/td&gt;
&lt;td&gt;unreadable pixel soup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Why plain query tools stop short of “reporting”
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A SQL client can run the same saved script, but it typically does not (by itself) &lt;strong&gt;version&lt;/strong&gt; the template as an org asset, &lt;strong&gt;route&lt;/strong&gt; Windows/SSO identities into &lt;strong&gt;folder ACLs&lt;/strong&gt;, &lt;strong&gt;render&lt;/strong&gt; pixel-stable PDFs for regulators, or &lt;strong&gt;email&lt;/strong&gt; an attachment when a window closes in &lt;strong&gt;Chicago time&lt;/strong&gt;. That orchestration &lt;em&gt;is&lt;/em&gt; the service. Data engineering interviews often probe whether you can separate &lt;strong&gt;“I can write the query”&lt;/strong&gt; from &lt;strong&gt;“I can operate the artifact.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;capability&lt;/th&gt;
&lt;th&gt;SQL worksheet&lt;/th&gt;
&lt;th&gt;reporting server&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;parameter UX&lt;/td&gt;
&lt;td&gt;paste dates manually&lt;/td&gt;
&lt;td&gt;pickers + defaults + validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;audit “who ran what”&lt;/td&gt;
&lt;td&gt;maybe local history&lt;/td&gt;
&lt;td&gt;execution log in catalog&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;deliver to exec inbox&lt;/td&gt;
&lt;td&gt;copy-paste&lt;/td&gt;
&lt;td&gt;subscription + attachment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Grain, filters, and one definition of the metric
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Every report question resolves to &lt;strong&gt;grain&lt;/strong&gt;: &lt;em&gt;one row equals one ______&lt;/em&gt;. Revenue “by order” is not revenue “by shipment line” is not revenue “by invoice” — joins and allocation rules change totals. &lt;strong&gt;Reporting services don’t fix wrong grain&lt;/strong&gt;; they &lt;strong&gt;freeze&lt;/strong&gt; a definition long enough to argue about it productively. When someone says “the dashboard is wrong,” your first instinct should be &lt;strong&gt;compare grain and predicates&lt;/strong&gt;, not “re-render.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — same English, two grains.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;implied grain&lt;/th&gt;
&lt;th&gt;SQL shape (conceptual)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;revenue by &lt;strong&gt;order&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;one row per &lt;code&gt;order_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM(order_total)&lt;/code&gt; grouped by order&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;revenue by &lt;strong&gt;line item&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;one row per &lt;code&gt;order_line_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM(line_amount)&lt;/code&gt; grouped by line&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If you aggregate line items that belong to the same order twice because of a bad join, &lt;strong&gt;both&lt;/strong&gt; a raw &lt;code&gt;SELECT&lt;/code&gt; and an SSRS table will happily display the inflated number — only disciplined modeling fixes that.&lt;/p&gt;
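
&lt;p&gt;A hedged sketch of the modeling fix: pre-aggregate lines to order grain before attaching anything wide (names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Stage at the correct grain first: one row per order.
WITH order_totals AS (
    SELECT order_id,
           SUM(line_amount) AS order_revenue
    FROM order_lines
    GROUP BY order_id
)
SELECT o.order_id,
       o.customer_id,
       t.order_revenue        -- later dimension joins can no longer multiply lines
FROM orders o
JOIN order_totals t ON t.order_id = o.order_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;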

&lt;h4&gt;
  
  
  Common beginner traps
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Assuming “BI” owns semantics&lt;/strong&gt; — without &lt;strong&gt;documented grain&lt;/strong&gt;, two teams ship conflicting “official revenue.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding credentials in every &lt;code&gt;.rdl&lt;/code&gt;&lt;/strong&gt; — &lt;strong&gt;shared data sources&lt;/strong&gt; stay auditable and rot slower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skipping half-open ranges&lt;/strong&gt; on time filters — boundary bugs (&lt;code&gt;BETWEEN&lt;/code&gt; inclusivity) skew period comparisons.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Treating the report as the source of truth&lt;/strong&gt; — the &lt;strong&gt;relational model + curated views&lt;/strong&gt; are truth; the report is a &lt;strong&gt;read projection&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. SSRS architecture — four components and the request path
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Server-side rendering with a catalog database behind it
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL Server Reporting Services (SSRS)&lt;/strong&gt; is Microsoft’s &lt;strong&gt;server-based&lt;/strong&gt; reporting platform: designers author &lt;code&gt;.rdl&lt;/code&gt; artifacts, publish them to a &lt;strong&gt;report server&lt;/strong&gt;, users open them through a &lt;strong&gt;web portal&lt;/strong&gt;, and metadata (items, roles, schedules) lands in &lt;strong&gt;report server databases&lt;/strong&gt; backed by &lt;strong&gt;SQL Server&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; When a user clicks &lt;strong&gt;Run&lt;/strong&gt;, the server resolves &lt;strong&gt;data sources&lt;/strong&gt;, executes &lt;strong&gt;dataset queries&lt;/strong&gt; with &lt;strong&gt;parameter values&lt;/strong&gt;, hydrates report definitions, &lt;strong&gt;renders&lt;/strong&gt; into HTML/PDF/Excel, and optionally &lt;strong&gt;logs&lt;/strong&gt; execution metrics for operators. Treat the report server as an &lt;strong&gt;orchestration tier&lt;/strong&gt; between HTTP clients and your databases — not a substitute for ETL.&lt;/p&gt;

&lt;p&gt;From a &lt;strong&gt;data engineering&lt;/strong&gt; perspective, SSRS is two different dependencies: (1) &lt;strong&gt;operational data stores&lt;/strong&gt; you query for facts, and (2) the &lt;strong&gt;report server catalog&lt;/strong&gt; that stores definitions, permissions, schedules, and history. Performance tuning split-brains when teams optimize (1) but never look at execution logs in (2).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf1gc91l3wfesrfg5uu8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf1gc91l3wfesrfg5uu8.jpeg" alt="Flowchart of SSRS architecture — browser user to web portal, report server executing SQL against database, RDL definition and rendering to PDF on a light PipeCode editorial card." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Whiteboard the path &lt;strong&gt;Portal → Report Processor → Data Extension → Database → Renderer → Export&lt;/strong&gt; once; many “SSRS is slow” tickets are really &lt;strong&gt;dataset SQL&lt;/strong&gt; or &lt;strong&gt;network hop&lt;/strong&gt; problems wearing a portal costume.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Report Builder / SSDT (design time)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Authors build &lt;strong&gt;tablixes&lt;/strong&gt; (flexible table/matrix regions), &lt;strong&gt;charts&lt;/strong&gt;, &lt;strong&gt;parameters&lt;/strong&gt;, and &lt;strong&gt;expressions&lt;/strong&gt; in &lt;strong&gt;Report Builder&lt;/strong&gt; (lighter, report-focused) or &lt;strong&gt;SQL Server Data Tools (SSDT)&lt;/strong&gt; inside Visual Studio (heavier, solution-oriented). Both emit &lt;strong&gt;Report Definition Language&lt;/strong&gt; files: &lt;strong&gt;&lt;code&gt;.rdl&lt;/code&gt;&lt;/strong&gt; XML you can diff in Git like any other code artifact. Mature teams &lt;strong&gt;review &lt;code&gt;.rdl&lt;/code&gt; changes&lt;/strong&gt; for accidental &lt;code&gt;SELECT&lt;/code&gt; scope expansions the same way they review migration scripts.&lt;/p&gt;

&lt;p&gt;Published artifacts use the &lt;code&gt;.rdl&lt;/code&gt; extension and store XML describing data wiring, layout bands, and rendering hints — not a compiled binary blob you “can’t inspect.”&lt;/p&gt;

&lt;h4&gt;
  
  
  Report Server (run time)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; The &lt;strong&gt;report server&lt;/strong&gt; is the service that &lt;strong&gt;accepts&lt;/strong&gt; execution requests (sync or async), &lt;strong&gt;authenticates&lt;/strong&gt; the caller, &lt;strong&gt;authorizes&lt;/strong&gt; against catalog security, &lt;strong&gt;binds parameters&lt;/strong&gt;, &lt;strong&gt;executes datasets&lt;/strong&gt; through configured providers, &lt;strong&gt;renders&lt;/strong&gt; output using a &lt;strong&gt;rendering extension&lt;/strong&gt; (HTML, PDF, Excel layouts differ), and &lt;strong&gt;records&lt;/strong&gt; execution. It is also where &lt;strong&gt;cached report snapshots&lt;/strong&gt; and &lt;strong&gt;shared schedules&lt;/strong&gt; live — features that trade &lt;strong&gt;freshness&lt;/strong&gt; for &lt;strong&gt;predictable&lt;/strong&gt; render time and &lt;strong&gt;fewer&lt;/strong&gt; database hits during Monday morning peaks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Report server database (catalog)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; SSRS persists &lt;strong&gt;folder hierarchies&lt;/strong&gt;, &lt;code&gt;.rdl&lt;/code&gt; and &lt;code&gt;.rsds&lt;/code&gt; (shared data source) items, &lt;strong&gt;role assignments&lt;/strong&gt;, &lt;strong&gt;subscriptions&lt;/strong&gt;, &lt;strong&gt;snapshots&lt;/strong&gt;, and &lt;strong&gt;execution / trace data&lt;/strong&gt; in &lt;strong&gt;report server databases&lt;/strong&gt; (traditionally a pair: primary catalog + optional &lt;strong&gt;TempDB-style&lt;/strong&gt; workload — check your version docs for the exact layout you run). Think of this as &lt;strong&gt;metadata engineering&lt;/strong&gt;: if the catalog is offline, &lt;strong&gt;no published definition runs&lt;/strong&gt;, even when your sales warehouse is healthy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Web portal (consumption)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; The &lt;strong&gt;web portal&lt;/strong&gt; is the &lt;strong&gt;HTTP front door&lt;/strong&gt; for searching folders, opening reports, managing subscriptions, and downloading exports. It is &lt;em&gt;not&lt;/em&gt; the same thing as the database tier; it is a UX + routing layer. Interviewers sometimes ask how you would &lt;strong&gt;harden&lt;/strong&gt; this surface — answers touch &lt;strong&gt;TLS&lt;/strong&gt;, &lt;strong&gt;integrated auth&lt;/strong&gt;, &lt;strong&gt;least-privilege&lt;/strong&gt; folder roles, and &lt;strong&gt;content managers&lt;/strong&gt; vs &lt;strong&gt;browser-only&lt;/strong&gt; personas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — verbal trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;note&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Portal authenticates user&lt;/td&gt;
&lt;td&gt;identity flows to server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Server loads &lt;code&gt;.rdl&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;latest published version&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Datasets hit SQL with parameters&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;this is your DE hot path&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Renderer emits chosen format&lt;/td&gt;
&lt;td&gt;PDF/Excel ≠ “just another HTML”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Execution logged&lt;/td&gt;
&lt;td&gt;troubleshooting + compliance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Snapshots, caching, and “why did yesterday match but today doesn’t?”
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Snapshot&lt;/strong&gt; or &lt;strong&gt;cached report&lt;/strong&gt; executions intentionally &lt;strong&gt;freeze&lt;/strong&gt; data at a point in time. That is a feature for regulated statements; it is a foot-gun when analysts expect &lt;strong&gt;live&lt;/strong&gt; warehouse freshness. When debugging discrepancies, always ask: &lt;strong&gt;live query&lt;/strong&gt;, &lt;strong&gt;snapshot&lt;/strong&gt;, or &lt;strong&gt;report-level cache&lt;/strong&gt; — three different answers to “what numbers are we looking at?”&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;“SSRS is just drag-and-drop”&lt;/strong&gt; — edge cases (&lt;strong&gt;multi-value parameters&lt;/strong&gt;, &lt;strong&gt;dynamic SQL&lt;/strong&gt;, &lt;strong&gt;double headers&lt;/strong&gt;) still trace back to &lt;strong&gt;grain&lt;/strong&gt; and &lt;strong&gt;joins&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confusing snapshots with live queries&lt;/strong&gt; — historical snapshots trade freshness for stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring execution logs&lt;/strong&gt; when debugging &lt;strong&gt;timeouts&lt;/strong&gt; — the database tier may be healthy while badly parameterized SQL scans explode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Publishing “optimized” queries blind&lt;/strong&gt; — folding &lt;code&gt;TOP&lt;/code&gt; into charts without a deterministic &lt;strong&gt;ORDER BY&lt;/strong&gt; can reorder “top N” between runs under concurrency.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Report types you should be able to explain cold
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tabular, matrix, charts, drill paths, and parameterized slices
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;choose report types by consumption pattern&lt;/strong&gt;, not whichever default template opened first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Tabular&lt;/strong&gt; reports list detail rows — invoices, ledgers. &lt;strong&gt;Matrix&lt;/strong&gt; reports pivot dynamic columns (e.g., months across the top). &lt;strong&gt;Charts&lt;/strong&gt; communicate distribution or trend. &lt;strong&gt;Drill-down&lt;/strong&gt; expands hierarchy within one layout; &lt;strong&gt;drill-through&lt;/strong&gt; jumps to another report with context keys. &lt;strong&gt;Parameterized&lt;/strong&gt; reports bind user input to &lt;strong&gt;SQL predicates&lt;/strong&gt; (&lt;code&gt;WHERE region = @Region&lt;/code&gt; on SQL Server).&lt;/p&gt;

&lt;p&gt;Each type couples to &lt;strong&gt;SQL shape&lt;/strong&gt; differently: tabular often maps cleanly to &lt;code&gt;ORDER BY&lt;/code&gt; + detail grain; matrix implies &lt;strong&gt;pivot-like&lt;/strong&gt; grouping (think &lt;code&gt;PIVOT&lt;/code&gt; or conditional aggregates in SQL, even when SSRS does the pivot visually); charts aggregate &lt;strong&gt;pre-bucketed&lt;/strong&gt; series; drill-through requires &lt;strong&gt;portable keys&lt;/strong&gt; (IDs), not vague labels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fempnqhp53g1r468pj2ik.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fempnqhp53g1r468pj2ik.jpeg" alt="Grid infographic of SSRS report types — tabular, matrix, chart, drill-down, drill-through, parameterized — icons and short labels on a PipeCode light diagram background." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Tabular (list) reports — audit-friendly detail
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Tabular layouts list &lt;strong&gt;one record per row&lt;/strong&gt; at a chosen grain — customer transactions, HR actions, GL lines. They pair with &lt;strong&gt;simple SQL&lt;/strong&gt;: &lt;code&gt;SELECT ... FROM ... WHERE ... ORDER BY&lt;/code&gt;. Interview wins come from naming &lt;strong&gt;sort keys&lt;/strong&gt; for stable pagination (&lt;code&gt;ORDER BY event_time, id&lt;/code&gt;) and &lt;strong&gt;visibility rules&lt;/strong&gt; (suppress salary columns for non-HR roles).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — stakeholder read.&lt;/strong&gt; Finance wants &lt;strong&gt;every invoice line&lt;/strong&gt; for Q1 — grain is &lt;strong&gt;&lt;code&gt;invoice_line_id&lt;/code&gt;&lt;/strong&gt;, not invoice header.&lt;/p&gt;

&lt;h4&gt;
  
  
  Matrix (pivot) reports — dynamic columns
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A &lt;strong&gt;matrix&lt;/strong&gt; repeats groups on rows &lt;em&gt;and&lt;/em&gt; columns — e.g., &lt;strong&gt;product family&lt;/strong&gt; down the side, &lt;strong&gt;calendar month&lt;/strong&gt; across the top, &lt;strong&gt;&lt;code&gt;SUM(revenue)&lt;/code&gt;&lt;/strong&gt; in cells. In SQL terms you are either &lt;strong&gt;pivoting&lt;/strong&gt; in the dataset or letting SSRS aggregate from a &lt;strong&gt;long&lt;/strong&gt; dataset (&lt;code&gt;month, family, revenue&lt;/code&gt;). The failure mode is &lt;strong&gt;sparse cubes&lt;/strong&gt; (thousands of empty cells) or &lt;strong&gt;too many dynamic columns&lt;/strong&gt; for PDF pagination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;family&lt;/th&gt;
&lt;th&gt;revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-01&lt;/td&gt;
&lt;td&gt;shoes&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-01&lt;/td&gt;
&lt;td&gt;hats&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02&lt;/td&gt;
&lt;td&gt;shoes&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Matrix consumes &lt;strong&gt;long&lt;/strong&gt; input; the renderer widens months into columns visually.&lt;/p&gt;
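
&lt;p&gt;If you pivot in the dataset instead, conditional aggregates are the portable sketch (months hard-coded for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Long input (month, family, revenue) widened with conditional aggregates.
SELECT family,
       SUM(CASE WHEN month = '2025-01' THEN revenue ELSE 0 END) AS revenue_2025_01,
       SUM(CASE WHEN month = '2025-02' THEN revenue ELSE 0 END) AS revenue_2025_02
FROM monthly_family_revenue
GROUP BY family;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;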

&lt;h4&gt;
  
  
  Chart reports — encoding choices matter
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Charts are &lt;strong&gt;aggregated&lt;/strong&gt; presentation — bars for categorical comparisons, lines for &lt;strong&gt;ordered&lt;/strong&gt; time series, small multiples when categories explode. Interviewers care that you avoid &lt;strong&gt;double-encoding&lt;/strong&gt; the same metric (bar &lt;em&gt;and&lt;/em&gt; label &lt;em&gt;and&lt;/em&gt; redundant legend swatches).&lt;/p&gt;

&lt;h4&gt;
  
  
  Drill-down — hierarchy inside one &lt;code&gt;.rdl&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Drill-down toggles &lt;strong&gt;visibility&lt;/strong&gt; of group footers or nested groups: &lt;strong&gt;year → quarter → month&lt;/strong&gt;. SQL-wise you often &lt;strong&gt;fetch detail rows once&lt;/strong&gt; and let group hierarchies roll up — &lt;em&gt;not&lt;/em&gt; three round trips per click.&lt;/p&gt;
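
&lt;p&gt;When the hierarchy does move into SQL, &lt;code&gt;ROLLUP&lt;/code&gt; mirrors the drill-down levels in one scan (columns are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One scan yields month detail, quarter and year subtotals, and a grand total.
SELECT year, quarter, month, SUM(revenue) AS revenue
FROM sales_by_month
GROUP BY ROLLUP (year, quarter, month)
ORDER BY year, quarter, month;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;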

&lt;h4&gt;
  
  
  Drill-through — context jump between reports
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Drill-through navigates to &lt;strong&gt;another&lt;/strong&gt; report and passes &lt;strong&gt;parameters&lt;/strong&gt; (&lt;code&gt;CustomerId&lt;/code&gt;, &lt;code&gt;FiscalMonth&lt;/code&gt;) so the detail query stays indexed and small. The anti-pattern is passing &lt;strong&gt;display names&lt;/strong&gt; without keys when two customers share a cleaned label.&lt;/p&gt;

&lt;h4&gt;
  
  
  Parameterized slices — where SQL and UX meet
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Parameters surface as pickers; behind the scenes they become &lt;strong&gt;predicate bind variables&lt;/strong&gt;. On &lt;strong&gt;SQL Server&lt;/strong&gt;, &lt;code&gt;@StartDate&lt;/code&gt; / &lt;code&gt;@Region&lt;/code&gt; are typical. The job of the dataset author is to &lt;strong&gt;never&lt;/strong&gt; concatenate parameters into strings as raw text — use provider bindings so plans stay cacheable and &lt;strong&gt;injection&lt;/strong&gt; stays impossible.&lt;/p&gt;
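
&lt;p&gt;A T-SQL-shaped sketch of the dataset query behind such pickers (parameter and table names are illustrative; the report binds values, never concatenates them):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- @Region / @StartDate / @EndDate arrive as bound parameters, not pasted text.
SELECT sale_date, region, SUM(revenue) AS revenue
FROM dbo.sales
WHERE region = @Region
  AND sale_date &amp;gt;= @StartDate
  AND sale_date &amp;lt;  @EndDate    -- half-open end pairs with calendar pickers
GROUP BY sale_date, region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;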

&lt;h4&gt;
  
  
  Drill-down versus drill-through (favorite tripping question)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;th&gt;interaction&lt;/th&gt;
&lt;th&gt;SQL implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;drill-down&lt;/td&gt;
&lt;td&gt;expand nested groups inside same &lt;code&gt;.rdl&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;grouping aligns with &lt;code&gt;ROLLUP&lt;/code&gt; / nested &lt;code&gt;GROUP BY&lt;/code&gt; mental models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;drill-through&lt;/td&gt;
&lt;td&gt;open separate detail report&lt;/td&gt;
&lt;td&gt;pass &lt;strong&gt;surrogate keys&lt;/strong&gt; as parameters; detail SQL uses selective seeks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Matrix without sparse handling&lt;/strong&gt; — exploding column cardinality hurts readability and performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pie charts for tiny deltas&lt;/strong&gt; — interviewers notice chart literacy, not decoration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drill-through without key contracts&lt;/strong&gt; — ambiguous keys duplicate detail rows downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Detail SQL that returns mega-rows “for charts”&lt;/strong&gt; — aggregate &lt;strong&gt;in-database&lt;/strong&gt; when possible; pull only the series you plot.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  SQL Interview Question on parameterized monthly revenue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tables:&lt;/strong&gt; &lt;code&gt;sales(sale_id, sale_date, product_id, region, revenue DECIMAL)&lt;/code&gt; with daily grain. &lt;strong&gt;Prompt:&lt;/strong&gt; Build a &lt;strong&gt;month-level revenue trend&lt;/strong&gt; for an analyst-selected window that is &lt;strong&gt;inclusive of the start date&lt;/strong&gt; and &lt;strong&gt;exclusive of the end bound&lt;/strong&gt; (half-open end). Return &lt;code&gt;month&lt;/code&gt; and &lt;code&gt;total_revenue&lt;/code&gt; ascending by month.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;date_trunc&lt;/code&gt;, half-open range, and &lt;code&gt;SUM&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The sample uses &lt;strong&gt;PostgreSQL-style&lt;/strong&gt; &lt;code&gt;date_trunc&lt;/code&gt; because many data teams prototype monthly buckets that way; in &lt;strong&gt;SSRS on SQL Server&lt;/strong&gt;, you would typically bind &lt;strong&gt;&lt;code&gt;@StartDate&lt;/code&gt; / &lt;code&gt;@EndDate&lt;/code&gt;&lt;/strong&gt; to report parameters and use &lt;code&gt;DATEFROMPARTS&lt;/code&gt; / &lt;code&gt;EOMONTH&lt;/code&gt; patterns or calendar tables your warehouse already trusts — the &lt;strong&gt;invariant&lt;/strong&gt; (half-open end, monthly grain) stays the same even when function names change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2025-01-15'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2025-04-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Input slice (abbreviated daily facts — only rows influencing January–March 2025):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_date&lt;/th&gt;
&lt;th&gt;revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-20&lt;/td&gt;
&lt;td&gt;400.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-05&lt;/td&gt;
&lt;td&gt;250.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-20&lt;/td&gt;
&lt;td&gt;null&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-10&lt;/td&gt;
&lt;td&gt;150.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-04-01&lt;/td&gt;
&lt;td&gt;999.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE sale_date &amp;gt;= '2025-01-15' AND sale_date &amp;lt; '2025-04-01'&lt;/code&gt;&lt;/strong&gt; removes April rows and anything before Jan 15.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;date_trunc('month', sale_date)&lt;/code&gt;&lt;/strong&gt; buckets survivors into &lt;strong&gt;2025-01-01&lt;/strong&gt;, &lt;strong&gt;2025-02-01&lt;/strong&gt;, &lt;strong&gt;2025-03-01&lt;/strong&gt; midnight timestamps; cast to &lt;code&gt;date&lt;/code&gt; for clean labels.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(revenue)&lt;/code&gt;&lt;/strong&gt; folds each month; &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt; revenue&lt;/strong&gt; contributes nothing to the sum (SQL aggregate default).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;total_revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;400.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-02-01&lt;/td&gt;
&lt;td&gt;250.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025-03-01&lt;/td&gt;
&lt;td&gt;150.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Half-open window&lt;/strong&gt; — &lt;code&gt;&amp;lt; DATE '2025-04-01'&lt;/code&gt; includes all of March without swallowing April 1; pairs cleanly with &lt;strong&gt;SSRS&lt;/strong&gt; calendar parameters that map to &lt;strong&gt;start/end&lt;/strong&gt; fields.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Month bucketing&lt;/strong&gt; — &lt;code&gt;date_trunc&lt;/code&gt; matches how &lt;strong&gt;operational-month&lt;/strong&gt; reports think even when daily facts are irregular.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Null safety&lt;/strong&gt; — reporting datasets still inherit &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt; fact&lt;/strong&gt; holes; aggregates remain correct if you intend “ignore unknowns,” otherwise guard with &lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; upstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — single-pass scan with &lt;strong&gt;hash aggregate&lt;/strong&gt; typically &lt;strong&gt;O(n)&lt;/strong&gt; in row volume after selective predicates; protect with &lt;strong&gt;partition pruning&lt;/strong&gt; / &lt;strong&gt;indexes on &lt;code&gt;sale_date&lt;/code&gt;&lt;/strong&gt; at warehouse scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Aggregation problems (SQL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — filtering&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Filtering &amp;amp; predicates (SQL)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Datasets, data sources, RDL, and expressions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Connection objects versus query results feeding the canvas
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;data sources answer “where”&lt;/strong&gt;; &lt;strong&gt;datasets answer “what rows + columns now.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A &lt;strong&gt;shared data source&lt;/strong&gt; (&lt;code&gt;.rsds&lt;/code&gt; or published shared item) centralizes &lt;strong&gt;provider&lt;/strong&gt;, &lt;strong&gt;server&lt;/strong&gt;, &lt;strong&gt;database&lt;/strong&gt;, and &lt;strong&gt;impersonation&lt;/strong&gt; — Windows integrated, stored SQL credential, or execution-context accounts depending on org policy. &lt;strong&gt;Dataset definitions&lt;/strong&gt; store &lt;strong&gt;command text&lt;/strong&gt; (often SQL, sometimes stored procedures), &lt;strong&gt;parameter mappings&lt;/strong&gt;, and &lt;strong&gt;field metadata&lt;/strong&gt; (&lt;code&gt;FieldName&lt;/code&gt; → type) that report regions consume. &lt;strong&gt;RDL&lt;/strong&gt; (Report Definition Language) is XML that packages &lt;strong&gt;both&lt;/strong&gt;; mature teams &lt;strong&gt;diff &lt;code&gt;.rdl&lt;/code&gt;&lt;/strong&gt; in pull requests because “tiny layout tweaks” often smuggle &lt;strong&gt;new joins&lt;/strong&gt; or &lt;strong&gt;removed filters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The mental model for data engineers: &lt;strong&gt;data source = connection factory&lt;/strong&gt;, &lt;strong&gt;dataset = bounded query unit&lt;/strong&gt;. Every dataset execution should name &lt;strong&gt;maximum row expectation&lt;/strong&gt; and &lt;strong&gt;required indexes&lt;/strong&gt; in the same breath as “pretty chart.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjgi7p6lzds26cemxnna.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjjgi7p6lzds26cemxnna.jpeg" alt="Diagram contrasting SSRS data source connection string box versus dataset as SQL query result feeding the report layout, on a PipeCode infographic card." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Embedded vs shared data sources (ops trade-off)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Embedded&lt;/strong&gt; connections travel &lt;em&gt;inside&lt;/em&gt; each &lt;code&gt;.rdl&lt;/code&gt; — fast for prototypes, painful for password rotation. &lt;strong&gt;Shared data sources&lt;/strong&gt; let DBAs &lt;strong&gt;rotate secrets once&lt;/strong&gt; and let authors &lt;strong&gt;re-point&lt;/strong&gt; dozens of reports by updating a single catalog item. In interviews, favor &lt;strong&gt;shared&lt;/strong&gt; when discussing &lt;strong&gt;SOC2&lt;/strong&gt;-style access reviews.&lt;/p&gt;

&lt;h4&gt;
  
  
  Stored procedures vs ad hoc SQL in datasets
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Teams sometimes ban raw SQL in reports and require &lt;strong&gt;&lt;code&gt;EXEC dbo.Report_MonthlyRevenue @Start, @End&lt;/code&gt;&lt;/strong&gt; instead. Procedures &lt;strong&gt;stabilize plans&lt;/strong&gt;, centralize &lt;strong&gt;review&lt;/strong&gt;, and stop &lt;code&gt;SELECT&lt;/code&gt; sprawl — at the cost of slower iteration for one-off investigations. Saying &lt;em&gt;when&lt;/em&gt; you prefer each is senior signal.&lt;/p&gt;
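
&lt;p&gt;A hedged T-SQL sketch of the procedure-per-report convention (name and body are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE PROCEDURE dbo.Report_MonthlyRevenue
    @Start date,
    @End   date   -- exclusive: half-open window
AS
BEGIN
    SET NOCOUNT ON;

    SELECT DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1) AS [month],
           SUM(revenue) AS total_revenue
    FROM dbo.sales
    WHERE sale_date &amp;gt;= @Start
      AND sale_date &amp;lt;  @End
    GROUP BY DATEFROMPARTS(YEAR(sale_date), MONTH(sale_date), 1)
    ORDER BY [month];
END;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;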

&lt;h4&gt;
  
  
  Field list discipline (&lt;code&gt;SELECT&lt;/code&gt; projections)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Layout expressions reference &lt;code&gt;Fields!Column.Value&lt;/code&gt;. If your dataset projection is unstable (&lt;strong&gt;column rename&lt;/strong&gt; in a view), every downstream expression breaks. &lt;strong&gt;Explicit column lists&lt;/strong&gt; and &lt;strong&gt;semantic layer views&lt;/strong&gt; (&lt;code&gt;vw_reporting_sales_daily&lt;/code&gt;) isolate churn — the report binds to &lt;strong&gt;stable&lt;/strong&gt; field names even when physical tables evolve.&lt;/p&gt;
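
&lt;p&gt;A sketch of the semantic-layer view described above (view and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Reports bind to this stable projection; physical tables evolve behind it.
CREATE VIEW vw_reporting_sales_daily AS
SELECT s.sale_date,
       s.region,
       p.product_family,
       s.revenue
FROM sales s
JOIN products p ON p.product_id = s.product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;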

&lt;h4&gt;
  
  
  Expressions — layout math vs SQL responsibility
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; SSRS expressions (~VB.NET-flavored in many shops) handle &lt;strong&gt;row-level formatting&lt;/strong&gt;, &lt;strong&gt;running sums in footers&lt;/strong&gt;, &lt;strong&gt;visibility toggles&lt;/strong&gt;, and &lt;strong&gt;conditional palette&lt;/strong&gt;. They are &lt;strong&gt;not&lt;/strong&gt; a second SQL engine. Rule of thumb: &lt;strong&gt;aggregations that define business KPIs belong in SQL or modeled views&lt;/strong&gt;; expressions format and annotate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — when to push down.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;need&lt;/th&gt;
&lt;th&gt;do in SQL / model&lt;/th&gt;
&lt;th&gt;do in expressions&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;official net revenue&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM&lt;/code&gt; with tax rules&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;red text if variance &amp;gt; 10%&lt;/td&gt;
&lt;td&gt;precompute variance column optional&lt;/td&gt;
&lt;td&gt;&lt;code&gt;IIF(Fields!VarPct.Value &amp;gt; 0.1, "Red", "Black")&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Lookup datasets (dimension labels)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A &lt;strong&gt;primary dataset&lt;/strong&gt; returns fact rows with &lt;code&gt;product_id&lt;/code&gt;; a &lt;strong&gt;secondary lookup dataset&lt;/strong&gt; maps &lt;code&gt;product_id → display_name&lt;/code&gt;. SSRS &lt;code&gt;Lookup()&lt;/code&gt; functions can replace verbose SQL joins &lt;strong&gt;when&lt;/strong&gt; lookup cardinality is small and caching behaves — but abusing lookups duplicates work SQL could do once with a &lt;strong&gt;single join&lt;/strong&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple datasets with conflicting grain&lt;/strong&gt; joined only in the layout — produces silent Cartesian risks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic SQL strings&lt;/strong&gt; built by concatenating user input — &lt;strong&gt;parameterize&lt;/strong&gt; or bleed &lt;strong&gt;SQL injection&lt;/strong&gt; into the reporting tier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SELECT-star dataset queries&lt;/strong&gt; — breaking when schemas drift; explicit columns stabilize &lt;strong&gt;consumers&lt;/strong&gt; and &lt;strong&gt;caching&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hiding bad joins behind expressions&lt;/strong&gt; — if SQL emits duplicated rows, expression totals &lt;strong&gt;lie confidently&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Parameters, subscriptions, exports, and security
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Interactivity plus operational delivery
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Report parameters&lt;/strong&gt; are the handshake between &lt;strong&gt;human intent&lt;/strong&gt; and &lt;strong&gt;SQL predicates&lt;/strong&gt;. They appear as text boxes, drop-downs, multi-selects, or &lt;strong&gt;cascading&lt;/strong&gt; lists (region then city). Behind the UI, parameters bind to &lt;strong&gt;query parameters&lt;/strong&gt; (&lt;code&gt;@p&lt;/code&gt;) or &lt;strong&gt;shared dataset&lt;/strong&gt; inputs. &lt;strong&gt;Subscriptions&lt;/strong&gt; schedule &lt;strong&gt;render + deliver&lt;/strong&gt; (email, share, archive) without a human clicking &lt;strong&gt;Run&lt;/strong&gt; each morning. &lt;strong&gt;Role-based security&lt;/strong&gt; on folders and items maps org structure (Finance vs Store Ops) to &lt;strong&gt;catalog ACLs&lt;/strong&gt; — distinct from database roles but equally capable of leaking sensitive PDFs if mis-set.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;concern&lt;/th&gt;
&lt;th&gt;what to mention in interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;authN/authZ&lt;/td&gt;
&lt;td&gt;integrated security, custom roles, item-level inheritance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;delivery&lt;/td&gt;
&lt;td&gt;standard vs data-driven subscriptions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;exports&lt;/td&gt;
&lt;td&gt;pixel-perfect PDF vs Excel data layout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Single-value vs multi-value parameters (SQL shape)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Single values bind cleanly (&lt;code&gt;WHERE region = @Region&lt;/code&gt;). &lt;strong&gt;Multi-select&lt;/strong&gt; lists explode into &lt;strong&gt;&lt;code&gt;IN&lt;/code&gt;&lt;/strong&gt; semantics. On &lt;strong&gt;SQL Server&lt;/strong&gt;, teams use &lt;strong&gt;table-valued parameters&lt;/strong&gt; or &lt;strong&gt;split string functions&lt;/strong&gt; (legacy) — the key interview point is &lt;strong&gt;never&lt;/strong&gt; pasting raw comma-text into dynamic SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example — conceptual SQL Server predicate.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Conceptual: @RegionList is bound as a TVP or handled by SSRS multi-value expansion&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;RegionList&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cascading parameters and dataset round trips
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; A &lt;strong&gt;country&lt;/strong&gt; dropdown rebinds &lt;strong&gt;state&lt;/strong&gt; choices; each cascade can fire &lt;strong&gt;another dataset query&lt;/strong&gt;. That is fine at low cardinality and deadly on cold caches when every manager opens the report at 9:00 AM. Mitigations: &lt;strong&gt;cached reference datasets&lt;/strong&gt;, &lt;strong&gt;indexed lookup tables&lt;/strong&gt;, or &lt;strong&gt;denormalized&lt;/strong&gt; picker sources.&lt;/p&gt;

&lt;h4&gt;
  
  
  Standard vs data-driven subscriptions
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;Standard&lt;/strong&gt; subscriptions attach &lt;strong&gt;one&lt;/strong&gt; schedule + &lt;strong&gt;one&lt;/strong&gt; recipient set. &lt;strong&gt;Data-driven&lt;/strong&gt; subscriptions read a &lt;strong&gt;recipient table&lt;/strong&gt; (“email, parameter tuple per row”) so ops can blast &lt;strong&gt;personalized&lt;/strong&gt; PDFs without cloning reports — powerful and easy to misuse without &lt;strong&gt;row-level security&lt;/strong&gt; discipline in the driving query.&lt;/p&gt;
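
&lt;p&gt;A sketch of a driving query, assuming a hypothetical &lt;code&gt;report_recipients&lt;/code&gt; table; each returned row becomes one personalized render-and-deliver:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per delivery: recipient plus the parameter tuple to render with
SELECT recipient_email, region_code, export_format
FROM report_recipients
WHERE is_active = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;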

&lt;h4&gt;
  
  
  Export formats are not interchangeable
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;PDF&lt;/strong&gt; prioritizes &lt;strong&gt;pagination&lt;/strong&gt; and &lt;strong&gt;print fidelity&lt;/strong&gt;. &lt;strong&gt;Excel&lt;/strong&gt; exports sometimes favor &lt;strong&gt;editability&lt;/strong&gt; over strict layout. &lt;strong&gt;CSV&lt;/strong&gt; is often &lt;strong&gt;lossy&lt;/strong&gt; for merged cells and subtotals. Interview answers that name &lt;strong&gt;which export&lt;/strong&gt; fits &lt;strong&gt;which regulatory use case&lt;/strong&gt; read as practitioner-level, not tutorial-level.&lt;/p&gt;

&lt;h4&gt;
  
  
  Security: folders, items, and “who can subscribe?”
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Catalog security layers &lt;strong&gt;roles&lt;/strong&gt; (Browser, Content Manager, etc. — exact names vary by version/edition) on &lt;strong&gt;folders&lt;/strong&gt; and &lt;strong&gt;items&lt;/strong&gt;. &lt;strong&gt;Least privilege&lt;/strong&gt; means most users are &lt;strong&gt;browse/run&lt;/strong&gt;, not &lt;strong&gt;publish&lt;/strong&gt;. Data engineers should care because &lt;strong&gt;subscriptions&lt;/strong&gt; can &lt;strong&gt;exfiltrate&lt;/strong&gt; data to mailboxes outside the database audit trail unless DLP/mail policies catch attachments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-select parameters&lt;/strong&gt; without clean &lt;strong&gt;&lt;code&gt;IN&lt;/code&gt;&lt;/strong&gt; ergonomics — know your dialect’s &lt;strong&gt;table-valued parameter&lt;/strong&gt; story on SQL Server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timezone-naive schedules&lt;/strong&gt; — 8 AM in &lt;strong&gt;which&lt;/strong&gt; zone? daylight edges matter for global retail.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Over-sharing subscription outputs&lt;/strong&gt; — the attachment leaves the controlled portal surface.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nullable parameters&lt;/strong&gt; — forgetting &lt;strong&gt;“All”&lt;/strong&gt; semantics can accidentally filter to &lt;code&gt;NULL&lt;/code&gt; only or exclude &lt;code&gt;NULL&lt;/code&gt; rows unintentionally; a conceptual pattern follows this list.&lt;/li&gt;
&lt;/ul&gt;
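
&lt;p&gt;A conceptual "All" pattern for nullable parameters, where &lt;code&gt;NULL&lt;/code&gt; means "do not filter" rather than "match NULL rows":&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A NULL parameter disables the filter instead of matching NULL regions
SELECT *
FROM sales
WHERE (@Region IS NULL OR region = @Region);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;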




&lt;h2&gt;
  
  
  6. SSRS versus Power BI — how to frame trade-offs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Paginated operational reporting versus exploratory analytics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; &lt;strong&gt;SSRS&lt;/strong&gt; remains the pragmatic choice when the business still &lt;strong&gt;prints&lt;/strong&gt;, &lt;strong&gt;archives PDFs&lt;/strong&gt;, or demands &lt;strong&gt;pixel-stable&lt;/strong&gt; layouts that survive legal discovery. &lt;strong&gt;Power BI&lt;/strong&gt; wins &lt;strong&gt;exploration&lt;/strong&gt;: slicers, cross-highlighting, natural-language-adjacent visuals for analysts, and &lt;strong&gt;mashups&lt;/strong&gt; across SaaS connectors. Many enterprises &lt;strong&gt;intentionally keep both&lt;/strong&gt;: SSRS ships the &lt;em&gt;statement&lt;/em&gt;, Power BI investigates &lt;em&gt;why&lt;/em&gt; the statement moved.&lt;/p&gt;

&lt;p&gt;The nuance interviewers listen for: &lt;strong&gt;tool choice is workload choice&lt;/strong&gt;, not “old vs cool.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcyot4r3c8gsk5c6cxbq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcyot4r3c8gsk5c6cxbq.jpeg" alt="Comparison panels for SSRS versus Power BI — pixel-perfect paginated reports and scheduling versus interactive self-service dashboards — PipeCode infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;SSRS&lt;/th&gt;
&lt;th&gt;Power BI (typical framing)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;paginated PDFs&lt;/td&gt;
&lt;td&gt;strong&lt;/td&gt;
&lt;td&gt;workable but not primary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;self-service visuals&lt;/td&gt;
&lt;td&gt;limited&lt;/td&gt;
&lt;td&gt;strong&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;subscriptions &amp;amp; blast email&lt;/td&gt;
&lt;td&gt;mature&lt;/td&gt;
&lt;td&gt;varies by SKU / automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operational “print the month”&lt;/td&gt;
&lt;td&gt;excellent&lt;/td&gt;
&lt;td&gt;sometimes awkward&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;licensing / org motion&lt;/td&gt;
&lt;td&gt;bundled legacy story&lt;/td&gt;
&lt;td&gt;capacity + workspace governance&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  When SSRS is still the correct default
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Pick SSRS when stakeholders &lt;strong&gt;sign&lt;/strong&gt; outputs, &lt;strong&gt;file&lt;/strong&gt; them with regulators, or &lt;strong&gt;mail&lt;/strong&gt; immutable month-end packs. Pick Power BI when teams need &lt;strong&gt;interactive slicing&lt;/strong&gt; on &lt;strong&gt;certified datasets&lt;/strong&gt; and accept softer pagination semantics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Migration reality: don’t promise a button-for-button lift
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Migrating hundreds of &lt;strong&gt;paginated&lt;/strong&gt; &lt;code&gt;.rdl&lt;/code&gt; assets to another stack is rarely “export → import.” Layout engines differ; &lt;strong&gt;subreport&lt;/strong&gt; boundaries, &lt;strong&gt;custom code&lt;/strong&gt;, and &lt;strong&gt;expressions&lt;/strong&gt; may need rewrite. Budget for &lt;strong&gt;visual parity testing&lt;/strong&gt; and &lt;strong&gt;parallel-run&lt;/strong&gt; quarters.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Declaring “SSRS is dead”&lt;/strong&gt; — regulated workflows still pay per &lt;strong&gt;paginated&lt;/strong&gt; artifact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring governance&lt;/strong&gt; — whichever tool wins, &lt;strong&gt;certified datasets&lt;/strong&gt; still beat rogue Excel extracts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Letting two tools define the same KPI differently&lt;/strong&gt; — align on &lt;strong&gt;semantic models&lt;/strong&gt; or accept eternal reconciliation meetings.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. From SQL snippet to scheduled PDF — rehearsal workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  An end-to-end story you can whiteboard in three minutes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Production reporting is a &lt;strong&gt;pipeline&lt;/strong&gt; wearing a GUI: prototype SQL → peer review → embed in dataset → bind parameters → layout → publish to a &lt;strong&gt;folder with ACLs&lt;/strong&gt; → validate exports → schedule with &lt;strong&gt;monitoring&lt;/strong&gt; on failures. Data engineering maturity shows up in &lt;strong&gt;how you test&lt;/strong&gt; before the COO sees the PDF.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;artifact&lt;/th&gt;
&lt;th&gt;checkpoint&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;model&lt;/td&gt;
&lt;td&gt;vetted SQL&lt;/td&gt;
&lt;td&gt;grain spelled out; &lt;code&gt;EXPLAIN&lt;/code&gt; / plan sane&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bind&lt;/td&gt;
&lt;td&gt;parameters&lt;/td&gt;
&lt;td&gt;half-open dates; multi-select semantics defined&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;layout&lt;/td&gt;
&lt;td&gt;&lt;code&gt;.rdl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chart encodings reviewed; no accidental totals&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;publish&lt;/td&gt;
&lt;td&gt;catalog item&lt;/td&gt;
&lt;td&gt;correct folder + inherited roles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;validate&lt;/td&gt;
&lt;td&gt;PDF + Excel&lt;/td&gt;
&lt;td&gt;footers match regulator template&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operate&lt;/td&gt;
&lt;td&gt;subscription&lt;/td&gt;
&lt;td&gt;failure alert + owner on-call&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Validation checklist (what to say in interviews)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Before certifying a report, explicitly verify: &lt;strong&gt;row-level security&lt;/strong&gt; still holds after joins; &lt;strong&gt;parameters&lt;/strong&gt; cannot bypass filters via &lt;code&gt;NULL&lt;/code&gt; tricks; &lt;strong&gt;execution time&lt;/strong&gt; is bounded under peak concurrency; &lt;strong&gt;exports&lt;/strong&gt; match on-screen totals (rounding rules aligned); &lt;strong&gt;subscriptions&lt;/strong&gt; only reach expected domains.&lt;/p&gt;

&lt;h4&gt;
  
  
  Failure modes you should anticipate
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Report outages cluster into a few buckets: &lt;strong&gt;database timeouts&lt;/strong&gt; (missing index on filter columns), &lt;strong&gt;credential rotation&lt;/strong&gt; (shared data source stale), &lt;strong&gt;schema drift&lt;/strong&gt; (view rename broke field list), &lt;strong&gt;clock skew&lt;/strong&gt; on scheduled windows, and &lt;strong&gt;email gateway&lt;/strong&gt; throttling. Naming these buckets is often enough to pass system-design flavored BI questions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips for reporting-aware SQL interviews
&lt;/h2&gt;

&lt;p&gt;Anchor storytelling on &lt;strong&gt;grain&lt;/strong&gt;, &lt;strong&gt;parameters&lt;/strong&gt;, and &lt;strong&gt;delivery&lt;/strong&gt; — hiring loops still ask how you partner with finance once &lt;strong&gt;SQL&lt;/strong&gt; is proven.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Re-read every reporting &lt;code&gt;SELECT&lt;/code&gt; as a dataset contract&lt;/strong&gt; — column names become field handles; ambiguous aliases surface late. If you cannot explain &lt;strong&gt;one output row&lt;/strong&gt;, you are not ready to publish.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rehearse half-open &lt;code&gt;[start, end)&lt;/code&gt; predicates&lt;/strong&gt; aloud; they match how calendars map to &lt;strong&gt;SSRS&lt;/strong&gt; and prevent off-by-one month bugs that only appear on leap years or fiscal calendars (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pair OLTP replicas or warehouse roles&lt;/strong&gt; mentally — reporting workloads should not casually hammer transactional primaries; name &lt;strong&gt;read routing&lt;/strong&gt; and &lt;strong&gt;timeouts&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know drill-down vs drill-through&lt;/strong&gt; with one sentence each, then be ready to sketch &lt;strong&gt;which keys&lt;/strong&gt; cross the boundary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Be fluent in “where it broke”&lt;/strong&gt; — browser, catalog, dataset SQL, warehouse, mail — troubleshooting stories beat reciting feature lists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Where to practice on PipeCode&lt;/strong&gt; — combine &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;SQL drills →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/topic/subqueries/sql"&gt;subqueries →&lt;/a&gt;, and &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;joins →&lt;/a&gt; so dataset SQL stays automatic under time pressure.&lt;/li&gt;
&lt;/ul&gt;
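
&lt;p&gt;A half-open window sketch, assuming a hypothetical &lt;code&gt;sales&lt;/code&gt; table with &lt;code&gt;sold_at&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; columns bound to SSRS-style date parameters:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- [@StartDate, @EndDate): all of the period, none of the next one
SELECT SUM(amount) AS period_total
FROM sales
WHERE sold_at &amp;gt;= @StartDate
  AND sold_at &amp;lt;  @EndDate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;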




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is SSRS in one sentence?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SQL Server Reporting Services&lt;/strong&gt; is Microsoft’s &lt;strong&gt;server platform&lt;/strong&gt; for designing, securing, publishing, and delivering &lt;strong&gt;SQL-backed&lt;/strong&gt; reports — especially &lt;strong&gt;paginated&lt;/strong&gt; exports and &lt;strong&gt;subscriptions&lt;/strong&gt; tied to &lt;strong&gt;RDL&lt;/strong&gt; definitions. It sits between your databases and &lt;strong&gt;authenticated&lt;/strong&gt; consumers so execution is &lt;strong&gt;repeatable&lt;/strong&gt;, &lt;strong&gt;auditable&lt;/strong&gt;, and &lt;strong&gt;permissioned&lt;/strong&gt; rather than ad hoc.&lt;/p&gt;

&lt;h3&gt;
  
  
  What lives inside an &lt;code&gt;.rdl&lt;/code&gt; file?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RDL&lt;/strong&gt; is &lt;strong&gt;XML&lt;/strong&gt; describing &lt;strong&gt;data sources&lt;/strong&gt;, &lt;strong&gt;datasets&lt;/strong&gt; (SQL or other commands), &lt;strong&gt;parameters&lt;/strong&gt;, layout bands, charts, and &lt;strong&gt;expressions&lt;/strong&gt; — effectively the compiled blueprint the &lt;strong&gt;report server&lt;/strong&gt; renders. Practically, treat it like &lt;strong&gt;infrastructure-as-code for visuals&lt;/strong&gt;: you can peer review it, search for risky joins, and rollback versions when a deploy misbehaves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset versus data source — what is the difference?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;data source&lt;/strong&gt; is the &lt;strong&gt;connection&lt;/strong&gt; metadata; a &lt;strong&gt;dataset&lt;/strong&gt; is the &lt;strong&gt;query result shape&lt;/strong&gt; (fields, parameters) produced through that connection and consumed by report controls. Mixing them up in conversation sounds like confusing &lt;strong&gt;JDBC URL&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;ResultSet&lt;/code&gt; schema&lt;/strong&gt; — both matter, but at different layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does drill-down differ from drill-through?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Drill-down&lt;/strong&gt; expands &lt;strong&gt;grouped hierarchy&lt;/strong&gt; inside the &lt;strong&gt;same report&lt;/strong&gt;; &lt;strong&gt;drill-through&lt;/strong&gt; navigates to a &lt;strong&gt;different report&lt;/strong&gt;, passing keys as &lt;strong&gt;parameters&lt;/strong&gt; to show richer detail. The first optimizes &lt;strong&gt;one dataset fetch&lt;/strong&gt; with nested visuals; the second optimizes &lt;strong&gt;selective detail SQL&lt;/strong&gt; for deep inspection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do teams still run SSRS next to Power BI?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Paginated&lt;/strong&gt;, &lt;strong&gt;print-perfect&lt;/strong&gt; documents, entrenched &lt;strong&gt;subscriptions&lt;/strong&gt;, and &lt;strong&gt;operational&lt;/strong&gt; PDF workflows often remain on &lt;strong&gt;SSRS&lt;/strong&gt; while &lt;strong&gt;exploratory&lt;/strong&gt; analytics sits in &lt;strong&gt;Power BI&lt;/strong&gt; — complementary rather than strictly replacement. The coexistence story is common in &lt;strong&gt;regulated&lt;/strong&gt; or &lt;strong&gt;franchise&lt;/strong&gt; businesses that still &lt;strong&gt;mail&lt;/strong&gt; monthly packs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should a data engineer verify before certifying a dataset?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Grain&lt;/strong&gt;, &lt;strong&gt;predicate safety&lt;/strong&gt; (parameterized, no string-built SQL), &lt;strong&gt;null handling&lt;/strong&gt;, &lt;strong&gt;indexes&lt;/strong&gt; for date filters, and &lt;strong&gt;access paths&lt;/strong&gt; (who can schedule exports) — reporting amplifies small SQL mistakes into &lt;strong&gt;company-wide&lt;/strong&gt; artifacts. Add &lt;strong&gt;execution time targets&lt;/strong&gt; and &lt;strong&gt;snapshot vs live&lt;/strong&gt; semantics so finance never disputes a frozen PDF under the assumption that it reflected live warehouse data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; interview-grade problems spanning &lt;strong&gt;SQL&lt;/strong&gt; skills that mirror reporting datasets — &lt;strong&gt;aggregation&lt;/strong&gt;, &lt;strong&gt;filtering&lt;/strong&gt;, &lt;strong&gt;joins&lt;/strong&gt;, and &lt;strong&gt;subqueries&lt;/strong&gt;. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, narrow to &lt;a href="https://dev.to/explore/practice/language/sql"&gt;language SQL →&lt;/a&gt;, and level up parameter-friendly SQL on &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;topic SQL →&lt;/a&gt;. &lt;a href="https://dev.to/subscribe"&gt;Unlock plans →&lt;/a&gt; when you want unrestricted runs.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>SQL for Developers: Relational Foundations, Safe CRUD, Joins, Aggregates &amp; Performance Muscle Memory</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Wed, 13 May 2026 05:18:48 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/sql-for-developers-relational-foundations-safe-crud-joins-aggregates-performance-muscle-memory-54hn</link>
      <guid>https://forem.com/gowthampotureddi/sql-for-developers-relational-foundations-safe-crud-joins-aggregates-performance-muscle-memory-54hn</guid>
      <description>&lt;p&gt;&lt;strong&gt;SQL for developers&lt;/strong&gt; is how you read and write the systems you already ship — user accounts, orders, feature flags, observability tables. Backend engineers, full-stack builders, and &lt;strong&gt;data engineers&lt;/strong&gt; share the same primitives: relational &lt;strong&gt;tables&lt;/strong&gt;, stable &lt;strong&gt;keys&lt;/strong&gt;, honest &lt;strong&gt;JOIN&lt;/strong&gt; semantics around &lt;strong&gt;NULL&lt;/strong&gt;, explicit &lt;strong&gt;grain&lt;/strong&gt; for aggregates, and &lt;strong&gt;ACID&lt;/strong&gt; discipline when concurrency hits.&lt;/p&gt;

&lt;p&gt;What follows mirrors how teams onboard ICs — schema literacy, guarded CRUD, predicate hygiene, joins without silent row multiplication, &lt;strong&gt;GROUP BY / HAVING&lt;/strong&gt; versus &lt;strong&gt;window&lt;/strong&gt; analytics, then &lt;strong&gt;indexes&lt;/strong&gt;, &lt;strong&gt;transactions&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt; as your debugging lingua franca — every numbered skill block ends like &lt;strong&gt;sql interview questions with answers&lt;/strong&gt;: runnable Postgres SQL, traced execution, and a terse &lt;strong&gt;why&lt;/strong&gt;. After the hero art, dive straight into reps when you crave keyboard time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbewe5e769ugoz1hbs14b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbewe5e769ugoz1hbs14b.jpeg" alt="PipeCode blog header for SQL for developers — bold white headline 'SQL for Developers' with subtitle 'Joins · transactions · pragmatic queries' and a minimal Postgres-style terminal icon diagram on dark gradient with pipecode.ai attribution." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/explore/practice"&gt;Browse practice hub →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL practice →&lt;/a&gt;, deepen &lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;joins →&lt;/a&gt;, reinforce &lt;a href="https://dev.to/explore/practice/topic/filtering/sql"&gt;filters →&lt;/a&gt;, or widen with &lt;a href="https://dev.to/explore/practice/topic/database"&gt;database fundamentals →&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why SQL matters for developers and data engineers&lt;/li&gt;
&lt;li&gt;Tables, keys, and the shape of relational data&lt;/li&gt;
&lt;li&gt;Reading and writing rows — SELECT, INSERT, UPDATE, DELETE safely&lt;/li&gt;
&lt;li&gt;Filtering, NULLs, sorting, and LIMIT&lt;/li&gt;
&lt;li&gt;Joins — INNER, LEFT, and when rows multiply&lt;/li&gt;
&lt;li&gt;GROUP BY, HAVING, and analytics-style windows&lt;/li&gt;
&lt;li&gt;Indexes, transactions, ACID, and EXPLAIN-aware debugging&lt;/li&gt;
&lt;li&gt;Choosing SQL skills for your stack (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why SQL matters for developers and data engineers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The relational contract hiding behind every HTTP handler
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;SQL is the persisted half of almost every SaaS workload&lt;/strong&gt; — signup rows, entitlement tables, payout ledgers. &lt;strong&gt;SQL for developers&lt;/strong&gt; fluency separates engineers who prototype quickly from teammates who confidently answer &lt;em&gt;"why did this counter disagree with finance?"&lt;/em&gt; without escalating.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Application tiers think in verbs (&lt;code&gt;POST&lt;/code&gt;, &lt;code&gt;PATCH&lt;/code&gt;, enqueue) while relational engines expose &lt;strong&gt;predicates&lt;/strong&gt;, &lt;strong&gt;constraints&lt;/strong&gt;, &lt;strong&gt;joins&lt;/strong&gt;, and &lt;strong&gt;transactions&lt;/strong&gt;. When those layers disagree, outages look like phantom bugs but read as inconsistent reads, missing indexes, duplicate grain, or unscoped transactions underneath.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Name &lt;strong&gt;grain aloud&lt;/strong&gt; — &lt;em&gt;exactly one row means one _______&lt;/em&gt; — &lt;strong&gt;before&lt;/strong&gt; you accept a &lt;code&gt;JOIN&lt;/code&gt; plan or BI metric.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Why backends, analytics, and SRE converge on SQL
&lt;/h4&gt;

&lt;p&gt;Everybody eventually asks Postgres the same kinds of questions: correlations, aggregates, cardinality checks. Showing up fluent collapses Slack threads into scripts you can rerun, diff, commit to a &lt;code&gt;sql/&lt;/code&gt; directory, and instrument in CI smoke tests — even if warehouses later absorb the OLAP side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;persona&lt;/th&gt;
&lt;th&gt;recurrent question type&lt;/th&gt;
&lt;th&gt;payoff from SQL literacy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;backend&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UPDATE&lt;/code&gt; fan-out / orphaned FK rows&lt;/td&gt;
&lt;td&gt;deterministic migrations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DS / analytics&lt;/td&gt;
&lt;td&gt;reproducible cohort filters&lt;/td&gt;
&lt;td&gt;parameterized queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SRE&lt;/td&gt;
&lt;td&gt;blast-radius queries during incidents&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN&lt;/code&gt;-aware rollbacks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Incident alert references revenue drift — replicate via &lt;code&gt;JOIN&lt;/code&gt; spanning &lt;code&gt;payments&lt;/code&gt; ⇄ &lt;code&gt;refunds&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Count rows twice — once naive, once with explicit &lt;code&gt;DISTINCT grain_key&lt;/code&gt; assertions (sketched after this list).&lt;/li&gt;
&lt;li&gt;Validate indexes cover &lt;code&gt;WHERE&lt;/code&gt; + &lt;code&gt;JOIN&lt;/code&gt; predicates in staging before prod deploy.&lt;/li&gt;
&lt;li&gt;Document literal-free SQL snippets for analytics parity.&lt;/li&gt;
&lt;/ol&gt;
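
&lt;p&gt;A minimal grain assertion for step 2, assuming hypothetical &lt;code&gt;payments&lt;/code&gt; and &lt;code&gt;refunds&lt;/code&gt; tables keyed by &lt;code&gt;payment_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- If the two counts diverge, the join fanned out and downstream sums inflate
SELECT COUNT(*)                      AS joined_rows,
       COUNT(DISTINCT p.payment_id) AS distinct_payments
FROM payments p
LEFT JOIN refunds r ON r.payment_id = p.payment_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;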

&lt;h4&gt;
  
  
  OLTP versus analytic workloads
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;workload&lt;/th&gt;
&lt;th&gt;emblematic SQL&lt;/th&gt;
&lt;th&gt;optimise for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OLTP (&lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt; critical rows)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UPDATE … WHERE id = $1 RETURNING …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;short latch-friendly txn windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warehouse / OLAP&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM(x) GROUP BY day&lt;/code&gt; (+ optional windows)&lt;/td&gt;
&lt;td&gt;columnar parallelism, partition pruning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Replica analytics bridging both&lt;/td&gt;
&lt;td&gt;parameterized slice queries&lt;/td&gt;
&lt;td&gt;reproducible predicates + timeouts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt; Mixed-mode Postgres instances still differentiate by &lt;strong&gt;predicate selectivity&lt;/strong&gt;, &lt;strong&gt;txn duration&lt;/strong&gt;, and &lt;strong&gt;hardware headroom&lt;/strong&gt;. Front-line CRUD hates long reads holding locks; heavyweight BI tolerates eventual freshness but demands honest scan plans.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner traps
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mirroring spreadsheets&lt;/strong&gt; — unstructured columns creep into DDL without reviewers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trusting dashboards over source tables&lt;/strong&gt; — BI layers often coerce grain silently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ORM-only troubleshooting&lt;/strong&gt; — you still need emitted SQL logs for hidden &lt;code&gt;N+1&lt;/code&gt; joins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring isolation trade-offs&lt;/strong&gt; — &lt;code&gt;READ COMMITTED&lt;/code&gt; anomalies appear only under concurrency realism.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Tables, keys, and the shape of relational data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tables are unordered multisets guarded by declarative constraints
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;a row-oriented table expresses one record type; keys pin identity while foreign keys articulate relationships declaratively&lt;/strong&gt; rather than scattering pointer logic exclusively in Ruby/Java services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uqpwp5c9vleh52nbmg0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6uqpwp5c9vleh52nbmg0.jpeg" alt="Relational schema diagram showing users and orders tables with primary keys, columns, and a foreign-key arrow from orders.user_id to users.id on a light PipeCode card." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Primary versus natural identifiers
&lt;/h4&gt;

&lt;p&gt;Natural keys mirror business artefacts (&lt;code&gt;ISO country code&lt;/code&gt;). Surrogate IDs (&lt;code&gt;BIGSERIAL&lt;/code&gt;) stay stable across refactors yet carry no domain meaning alone. Postgres schemas typically combine both: surrogate PK for joins, &lt;strong&gt;&lt;code&gt;UNIQUE(email)&lt;/code&gt;&lt;/strong&gt; to enforce human-facing identity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;advantage&lt;/th&gt;
&lt;th&gt;caveat&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;BIGSERIAL PRIMARY KEY&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;immutable join edges&lt;/td&gt;
&lt;td&gt;meaningless for humans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UNIQUE(email)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;legible dashboards&lt;/td&gt;
&lt;td&gt;brittle if mergers rename emails&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example DDL commentary.&lt;/strong&gt; &lt;code&gt;BIGSERIAL&lt;/code&gt;’s auto-sequence yields monotonic surrogates with cheap index inserts. &lt;strong&gt;&lt;code&gt;NOT NULL&lt;/code&gt;&lt;/strong&gt; on &lt;code&gt;email&lt;/code&gt; encodes onboarding invariants that Postgres enforces deterministically, unlike optional app-layer validation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;       &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;  &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;RESTRICT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_usd&lt;/span&gt;   &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;orders_user_idx&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Foreign-key delete semantics matter in production churn
&lt;/h4&gt;

&lt;p&gt;The choice between &lt;strong&gt;&lt;code&gt;ON DELETE CASCADE&lt;/code&gt;&lt;/strong&gt; (propagates deletes to child rows) and &lt;strong&gt;&lt;code&gt;RESTRICT&lt;/code&gt;&lt;/strong&gt; (blocks deletes while child rows still reference the parent) dictates safe admin tooling workflows. Prefer explicit policies over implicit defaults guessed during incidents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Example: disallow deleting buyers with unpaid orders lingering&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;orders_user_id_fkey&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;CONSTRAINT&lt;/span&gt; &lt;span class="n"&gt;orders_user_id_fkey&lt;/span&gt;
    &lt;span class="k"&gt;FOREIGN&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DELETE&lt;/span&gt; &lt;span class="k"&gt;RESTRICT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Model parent (&lt;code&gt;users&lt;/code&gt;), child (&lt;code&gt;orders&lt;/code&gt;) cardinality first.&lt;/li&gt;
&lt;li&gt;Select delete policy matching business law — finance rarely cascades blindly.&lt;/li&gt;
&lt;li&gt;Index child FK columns (&lt;strong&gt;&lt;code&gt;CREATE INDEX&lt;/code&gt;&lt;/strong&gt; on &lt;code&gt;orders.user_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;CHECK&lt;/code&gt; constraints early — cheaper than patching corrupt rows later.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deferring FK creation&lt;/strong&gt; → silent orphan rows creep under concurrent writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Currency as floats&lt;/strong&gt; — use &lt;strong&gt;&lt;code&gt;NUMERIC(p,s)&lt;/code&gt;&lt;/strong&gt; for monetary columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp without zone&lt;/strong&gt; interpreted as UTC — prefer &lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; + explicit TZ policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overloading JSONB&lt;/strong&gt; exclusively — structured columns keep optimiser hints honest.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Quick integrity checklist
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;check&lt;/th&gt;
&lt;th&gt;Postgres hook&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;uniqueness&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UNIQUE&lt;/code&gt;, partial unique indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;domain logic&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CHECK&lt;/code&gt;, domain types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;referential coupling&lt;/td&gt;
&lt;td&gt;declarative FK + chosen &lt;code&gt;ON DELETE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;auditing&lt;/td&gt;
&lt;td&gt;triggers or append-only ledger tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if two services disagree on cardinality, converge on DDL truth before arguing in Slack.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Reading and writing rows — SELECT, INSERT, UPDATE, DELETE safely
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Every writer path needs identity — explicit filters and returning projections
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;&lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;&lt;/strong&gt; must &lt;strong&gt;name which rows mutate&lt;/strong&gt; (&lt;code&gt;WHERE&lt;/code&gt;), &lt;strong&gt;prefer parameters&lt;/strong&gt; (&lt;code&gt;$1&lt;/code&gt;) over interpolated strings, and &lt;strong&gt;return affected rows (&lt;code&gt;RETURNING&lt;/code&gt;)&lt;/strong&gt; when callers need confirmations without another round-trip.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transactions wrap multi-leg business truths
&lt;/h4&gt;

&lt;p&gt;Transfers, swaps, and entitlement downgrades rarely satisfy business rules by touching a single isolated row. Wrap the coordinated statements in &lt;code&gt;BEGIN … COMMIT&lt;/code&gt; so partial states never linger.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;accounts&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;balance&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BEGIN&lt;/code&gt;&lt;/strong&gt; opens the snapshot / locking context for the chosen isolation level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two ordered &lt;code&gt;UPDATE&lt;/code&gt;s&lt;/strong&gt; express money conservation — auditors expect the two legs to commit atomically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COMMIT&lt;/code&gt;&lt;/strong&gt; publishes both deltas together; &lt;strong&gt;&lt;code&gt;ROLLBACK;&lt;/code&gt;&lt;/strong&gt; rewinds catastrophes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — transactional overhead is negligible versus the fallout of financial inconsistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Parameterized predicates prevent injection and cache reuse
&lt;/h4&gt;

&lt;p&gt;Never splice user strings manually — placeholders keep plans cacheable and thwart SQL injection.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;DELETE FROM sessions
WHERE user_id = $1
  AND expires_at &amp;lt; NOW();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  INSERT patterns developers lean on daily
&lt;/h4&gt;

&lt;p&gt;Bulk ingest + upsert choreography appears constantly — understand both single-row ergonomics (&lt;code&gt;RETURNING&lt;/code&gt;) and batch paths.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO users (email, city)
VALUES ('dev@corp.com', 'Chennai')
RETURNING id, created_at;

INSERT INTO audit_log(event, payload)
VALUES
    ('password_reset', '{"user_id": 42}'::jsonb),
    ('mfa_challenge', '{"user_id": 42}'::jsonb);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked example — idempotent onboarding upserts (&lt;code&gt;ON CONFLICT&lt;/code&gt;).&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO user_profiles (user_id, display_name)
VALUES ($1, $2)
ON CONFLICT (user_id)
DO UPDATE SET display_name = EXCLUDED.display_name,
              updated_at   = NOW();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conflicting rows recycle through &lt;strong&gt;&lt;code&gt;EXCLUDED&lt;/code&gt;&lt;/strong&gt; pseudo-table exposing proposed values.&lt;/li&gt;
&lt;li&gt;Pair with partial unique indexes for conditional uniqueness scenarios (invite tokens, soft deletes); see the sketch below.&lt;/li&gt;
&lt;/ul&gt;
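
&lt;p&gt;A sketch of that conditional-uniqueness pattern, assuming a hypothetical &lt;code&gt;invites&lt;/code&gt; table with soft deletes via &lt;code&gt;revoked_at&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Only live invites must be unique per email; revoked rows may repeat
CREATE UNIQUE INDEX invites_live_email_uq
    ON invites (email)
    WHERE revoked_at IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
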
&lt;h4&gt;
  
  
  Safe UPDATE hygiene checklist
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dry-run SELECT clone&lt;/td&gt;
&lt;td&gt;verifies row cardinality before mutation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;LIMIT&lt;/code&gt; + key filter&lt;/td&gt;
&lt;td&gt;avoids blanket table rewrite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;RETURNING&lt;/code&gt; auditing&lt;/td&gt;
&lt;td&gt;emits affected rows programmatically (before-images need a CTE on older Postgres; newer versions support &lt;code&gt;OLD.*&lt;/code&gt; in &lt;code&gt;RETURNING&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if deleting “temp junk,” still scope by timestamp + TTL — surprise full-table wipes bankrupt trust faster than slow queries.&lt;/p&gt;
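
&lt;p&gt;A minimal sketch of that rule, assuming a hypothetical &lt;code&gt;temp_exports&lt;/code&gt; table with a &lt;code&gt;created_at&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Timestamp scope turns "delete temp junk" into a bounded, reviewable mutation
DELETE FROM temp_exports
WHERE created_at &amp;lt; NOW() - INTERVAL '7 days'
RETURNING id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;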


&lt;h2&gt;
  
  
  4. Filtering, NULLs, sorting, and LIMIT
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Logical evaluation order is not top-to-bottom textual order
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe44b1kh0s43gycjd78yt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe44b1kh0s43gycjd78yt.jpeg" alt="SQL logical processing pipeline diagram — FROM JOIN WHERE GROUP BY HAVING SELECT ORDER BY LIMIT — staged left-to-right with PipeCode arrows on light background." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Invariant: &lt;strong&gt;Engines conceptually apply clauses in this order&lt;/strong&gt; — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;JOIN&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → windowing → &lt;code&gt;SELECT&lt;/code&gt; expressions → &lt;code&gt;DISTINCT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT/OFFSET&lt;/code&gt;. Textual SQL order differs deliberately; misplacing expectations causes “phantom” aggregates or illegal references.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predicates in &lt;code&gt;WHERE&lt;/code&gt; see &lt;strong&gt;raw row grain&lt;/strong&gt; before collapsing via &lt;code&gt;GROUP BY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SELECT&lt;/code&gt; aliases rarely appear inside &lt;code&gt;WHERE&lt;/code&gt; (Postgres exceptions exist for subqueries, not shortcuts).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY&lt;/code&gt; runs &lt;strong&gt;after&lt;/strong&gt; projection — you may sort computed expressions.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  NULL is tri-valued logic (&lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, &lt;code&gt;UNKNOWN&lt;/code&gt;)
&lt;/h4&gt;

&lt;p&gt;Comparisons involving &lt;code&gt;UNKNOWN&lt;/code&gt; ripple through compound predicates unpredictably unless you memorize &lt;strong&gt;De Morgan&lt;/strong&gt; interactions with &lt;strong&gt;&lt;code&gt;AND/OR&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT * FROM users WHERE city IS NULL;        -- correct
SELECT * FROM users WHERE city = NULL;         -- always UNKNOWN → zero rows (pitfall)

SELECT *
FROM experiments
WHERE status IS DISTINCT FROM outcome; -- NULL-aware inequality
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;evaluates when &lt;code&gt;city&lt;/code&gt; NULL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;city = 'Chennai'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UNKNOWN&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;city IS NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Worked-example pattern — COALESCE bridging optional columns.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT COALESCE(city, 'unknown') AS city_label
FROM users;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Combine with &lt;strong&gt;&lt;code&gt;NULLIF&lt;/code&gt;&lt;/strong&gt; to coerce sentinel blanks into canonical NULL semantics.&lt;/p&gt;
&lt;h4&gt;
  
  
  Composing predicates responsibly
&lt;/h4&gt;

&lt;p&gt;Prefer explicit parentheses when mixing &lt;strong&gt;&lt;code&gt;AND/OR&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT id, email
FROM users
WHERE (city IN ('Hyderabad', 'Chennai') OR vip IS TRUE)
  AND suspended IS FALSE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;ORDER BY&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;LIMIT&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT id, email
FROM users
WHERE city IN ('Hyderabad', 'Chennai')
ORDER BY email ASC
LIMIT 25 OFFSET 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large OFFSET&lt;/strong&gt; scans skipped rows wastefully — keyset pagination (&lt;code&gt;WHERE id &amp;gt; $cursor ORDER BY id LIMIT&lt;/code&gt;) is often cheaper at scale even if ergonomically heavier; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable sorts&lt;/strong&gt; duplicate tie-break columns (&lt;code&gt;ORDER BY created_at DESC, id DESC&lt;/code&gt;) to defeat nondeterministic pages.&lt;/li&gt;
&lt;/ul&gt;
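
&lt;p&gt;A keyset-pagination sketch against the same &lt;code&gt;users&lt;/code&gt; table; the client passes the last &lt;code&gt;id&lt;/code&gt; it saw as &lt;code&gt;$1&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- The next page starts after the last id seen; no wasted OFFSET scan
SELECT id, email
FROM users
WHERE id &amp;gt; $1
ORDER BY id
LIMIT 25;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
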
&lt;h4&gt;
  
  
  Common beginner traps
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NOT IN (...)&lt;/code&gt; collapses unexpectedly when inner list harbors &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; (entire predicate UNKNOWN).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BETWEEN&lt;/code&gt; inclusive endpoints surprise folks expecting half-open intervals.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ILIKE '%foo%'&lt;/code&gt; cannot exploit plain B-tree indexes without &lt;code&gt;pg_trgm&lt;/code&gt; / expression indexes.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Optional pattern — existential filters with EXISTS
&lt;/h4&gt;

&lt;p&gt;Prefer semijoins when probing presence without caring about multiplicity:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT u.id
FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when debugging filters, &lt;strong&gt;&lt;code&gt;SELECT COUNT(*)&lt;/code&gt;&lt;/strong&gt; before and after layering predicates — divergence isolates offending clause fast.&lt;/p&gt;
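
&lt;p&gt;A minimal sketch of that rule of thumb against the running &lt;code&gt;users&lt;/code&gt; schema, assuming the &lt;code&gt;suspended&lt;/code&gt; flag from the earlier filter example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Layer predicates one at a time; the first large drop isolates the culprit
SELECT COUNT(*) FROM users;                          -- baseline
SELECT COUNT(*) FROM users
WHERE suspended IS FALSE;                            -- after predicate 1
SELECT COUNT(*) FROM users
WHERE suspended IS FALSE
  AND city IN ('Hyderabad', 'Chennai');              -- after predicate 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;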


&lt;h2&gt;
  
  
  5. Joins — INNER, LEFT, and when rows multiply
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Joins reshape cardinality — tame fan-out before aggregates
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;&lt;code&gt;JOIN&lt;/code&gt; combines row sets via predicates&lt;/strong&gt; (typically equality on keys). When either side is &lt;strong&gt;one-to-many&lt;/strong&gt;, the result repeats left rows once per matching right row unless you stabilize grain first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl34xnvnznlfc8prt8fv8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl34xnvnznlfc8prt8fv8.jpeg" alt="Two-circle Venn-style diagram illustrating INNER JOIN intersection vs LEFT JOIN left circle plus orphans, with captions 'matching pairs only' and 'keep every left row' on PipeCode infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt;&lt;/strong&gt; emits only pairs that satisfy the &lt;code&gt;ON&lt;/code&gt; clause — unmatched rows on either side disappear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT [OUTER] JOIN&lt;/code&gt;&lt;/strong&gt; keeps &lt;strong&gt;every row from the left&lt;/strong&gt; spine; unmatched right columns become &lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; (sentinel absence, not “empty string”).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL OUTER&lt;/code&gt;&lt;/strong&gt; is rarer in app code but useful when reconciling two feeds where either side might be orphaned.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Buyer activity: users who ordered at least once (one row per matching order pair)
SELECT u.name, o.order_id
FROM users u
JOIN orders o ON o.user_id = u.id;

-- Anti-join — users present on the LEFT with NO matching orders on the RIGHT
SELECT u.name
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.order_id IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  INNER vs LEFT in one rehearsal table
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;join&lt;/th&gt;
&lt;th&gt;survives without match on opposite side&lt;/th&gt;
&lt;th&gt;read it as&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INNER&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;“pairs only — inner intersection”&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LEFT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes (left survives)&lt;/td&gt;
&lt;td&gt;“keep cohort A, optionally attach B”&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h4&gt;
  
  
  Fan-out rehearsal (why &lt;code&gt;COUNT(*)&lt;/code&gt; lied)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;spine&lt;/th&gt;
&lt;th&gt;facts&lt;/th&gt;
&lt;th&gt;INNER join rows if 3 matching orders&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 Alice&lt;/td&gt;
&lt;td&gt;3 orders for Alice&lt;/td&gt;
&lt;td&gt;3 rows named Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(amount)&lt;/code&gt; downstream now count &lt;strong&gt;result rows&lt;/strong&gt;, not necessarily &lt;strong&gt;distinct users&lt;/strong&gt;. Fix upstream with &lt;strong&gt;distinct keys&lt;/strong&gt;, &lt;strong&gt;sub-aggregates&lt;/strong&gt;, or &lt;strong&gt;semi-joins&lt;/strong&gt; (&lt;code&gt;EXISTS&lt;/code&gt; / &lt;code&gt;IN&lt;/code&gt;) depending on semantics.&lt;/p&gt;
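
&lt;p&gt;One hedged fix: pre-aggregate the many side before joining so the spine grain survives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One output row per user regardless of how many orders they placed
SELECT u.id, u.name, COALESCE(o.order_count, 0) AS order_count
FROM users u
LEFT JOIN (
    SELECT user_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY user_id
) o ON o.user_id = u.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
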
&lt;h3&gt;
  
  
  Developer SQL interview question — users who never ordered
&lt;/h3&gt;

&lt;p&gt;Tables: &lt;code&gt;users(id, name)&lt;/code&gt; and &lt;code&gt;orders(order_id, user_id)&lt;/code&gt; with referential integrity. &lt;strong&gt;List dormant accounts (users who never placed an order), ordered deterministically.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution Using LEFT JOIN anti-join
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT u.id, u.name
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.order_id IS NULL
ORDER BY u.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;planner story&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Build left spine — one output row candidate per user.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;For each user, seek/probe matching orders on &lt;code&gt;orders.user_id&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;If no row qualifies, &lt;strong&gt;&lt;code&gt;o.*&lt;/code&gt; becomes NULL&lt;/strong&gt;, including surrogate &lt;code&gt;order_id&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt; keeps only unmatched left rows (this filter would be unsafe if &lt;code&gt;order_id&lt;/code&gt; itself could be NULL; the primary key and FK integrity prevent that here).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;404&lt;/td&gt;
&lt;td&gt;dormant&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Companion pattern — &lt;code&gt;NOT EXISTS&lt;/code&gt; (duplicate-safe semi-join).&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT u.id, u.name
FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.user_id = u.id
)
ORDER BY u.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LEFT preservation&lt;/strong&gt; — every user survives until filtered; the dormant cohort remains visible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NULL sentinel&lt;/strong&gt; — with a real surrogate key on orders, &lt;strong&gt;&lt;code&gt;order_id&lt;/code&gt; NULL&lt;/strong&gt; reliably means &lt;strong&gt;no attachment&lt;/strong&gt;, not an ambiguous data void.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOT EXISTS&lt;/strong&gt; — logically ignores duplicate orders per user (&lt;strong&gt;semijoin&lt;/strong&gt;); no accidental multiplier from exploding the right side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index leverage&lt;/strong&gt; — &lt;strong&gt;B-tree on &lt;code&gt;orders(user_id)&lt;/code&gt;&lt;/strong&gt; turns probes into seeks instead of scans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — hash or merge-family joins tend toward &lt;strong&gt;Θ(n + m)&lt;/strong&gt; with healthy selectivity, versus nested-loop blowups on accidental cross products.&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Join-heavy SQL drills&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL language library&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  6. GROUP BY, HAVING, and analytics-style windows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Collapse grain or stay row-wise — aggregates vs windows split the job
&lt;/h3&gt;

&lt;p&gt;Invariant: &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; collapses rows into buckets&lt;/strong&gt;, producing &lt;strong&gt;one output row per group&lt;/strong&gt; after aggregation. &lt;strong&gt;&lt;code&gt;OVER()&lt;/code&gt; keeps input grain&lt;/strong&gt; — every source row survives while you &lt;strong&gt;decorate&lt;/strong&gt; it with comparative metrics (rank, running sum).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogab4iqbgknwjwl65g8.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogab4iqbgknwjwl65g8.jpeg" alt="Contrast diagram grouping rows into buckets labelled GROUP BY aggregates versus per-row analytic functions with OVER() partitions for running totals — PipeCode infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;After &lt;code&gt;JOIN&lt;/code&gt;/&lt;code&gt;WHERE&lt;/code&gt;, the relational engine optionally &lt;strong&gt;groups&lt;/strong&gt; by the listed expressions. &lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt; may reference &lt;strong&gt;either&lt;/strong&gt; grouping keys &lt;strong&gt;or&lt;/strong&gt; aggregate functions evaluated &lt;strong&gt;inside each bucket&lt;/strong&gt; — stray base columns violate the contract unless functionally dependent (Postgres raises an error early; beware lenient dialects permitting hidden ambiguity).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; filters &lt;strong&gt;post-aggregation&lt;/strong&gt; predicates (&lt;code&gt;COUNT(*) &amp;gt; 10&lt;/code&gt;), while &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; trims rows &lt;strong&gt;before&lt;/strong&gt; bucketing (&lt;strong&gt;cannot&lt;/strong&gt; cite aliases from &lt;code&gt;SELECT&lt;/code&gt; aggregates — repeat the aggregate or nest a subquery).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window frames&lt;/strong&gt; (&lt;code&gt;ROWS&lt;/code&gt; / &lt;code&gt;RANGE&lt;/code&gt; / &lt;code&gt;GROUPS&lt;/code&gt;) default to a &lt;code&gt;RANGE&lt;/code&gt; frame up to the current row, peers included, once &lt;code&gt;ORDER BY&lt;/code&gt; appears; analytic ranking (&lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;) ignores the frame — it only needs &lt;strong&gt;&lt;code&gt;PARTITION BY&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/strong&gt; (a frame sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
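
&lt;p&gt;A minimal frame sketch, assuming a toy &lt;code&gt;payments(user_id, paid_at, amount)&lt;/code&gt; table (the names are illustrative, not part of the schema above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Running total per user; the frame is spelled out instead of relying on
-- the default RANGE frame, which lumps peer rows with equal ORDER BY keys
SELECT user_id,
       paid_at,
       SUM(amount) OVER (
           PARTITION BY user_id
           ORDER BY paid_at
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM payments;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;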

&lt;h4&gt;
  
  
  Stock patterns side by side
&lt;/h4&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Collapse city grain — one row per city after aggregation
SELECT city, COUNT(*) AS n
FROM users
GROUP BY city
HAVING COUNT(*) &amp;gt; 5;

-- Decorate employee grain — ranking within department without collapsing rows
SELECT emp_id,
       dept_id,
       salary,
       ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
FROM employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Dedup companion — deterministic keeper rows
&lt;/h4&gt;

&lt;p&gt;Grouped counts diagnose violators; &lt;strong&gt;windows&lt;/strong&gt; remediate duplicates while picking one canonical row (&lt;strong&gt;partition by the dedupe key, ORDER BY audit columns, keep &lt;code&gt;rn = 1&lt;/code&gt; as the survivor, target &lt;code&gt;rn &amp;gt; 1&lt;/code&gt; for cleanup&lt;/strong&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY email
               ORDER BY updated_at DESC NULLS LAST, id ASC
           ) AS rn
    FROM users
)
SELECT *
FROM ranked
WHERE rn &amp;gt; 1;  -- offenders to DELETE or archive in a batched migration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Developer SQL interview question — duplicate emails awaiting cleanup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;List normalized emails appearing more than once with violation counts.&lt;/strong&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Solution Using grouped counts
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT email, COUNT(*) AS dup_count
FROM users
GROUP BY email
HAVING COUNT(*) &amp;gt; 1
ORDER BY dup_count DESC, email;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;computation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Normalize upstream if needed (&lt;code&gt;LOWER(TRIM(email))&lt;/code&gt; in ingestion or projection).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GROUP BY email&lt;/code&gt; emits one accumulator per distinct key.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; drops singleton buckets — only violators survive.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;th&gt;dup_count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dev@corp&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GROUP BY bucket&lt;/strong&gt; — each email becomes its own multiset; &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; measures multiplicity &lt;strong&gt;after&lt;/strong&gt; predicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Having vs where&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; cannot express &lt;strong&gt;&lt;code&gt;COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; without a subquery; &lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; runs on aggregated state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational pairing&lt;/strong&gt; — follow with &lt;strong&gt;&lt;code&gt;ROW_NUMBER()&lt;/code&gt;&lt;/strong&gt; partitioning on the same dedupe key to choose survivors without arbitrary ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost profile&lt;/strong&gt; — hash aggregation is typically &lt;strong&gt;linear&lt;/strong&gt; in input row volume; sorts that spill to disk add I/O as &lt;strong&gt;&lt;code&gt;work_mem&lt;/code&gt;&lt;/strong&gt; pressure grows.&lt;/li&gt;
&lt;/ul&gt;
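
&lt;p&gt;Step 1 of the trace assumes normalization happened upstream; when it has not, a hedged variant folds it into the grouping key (a sketch; it assumes &lt;code&gt;email&lt;/code&gt; is raw text):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Group on a normalized key so 'Dev@Corp ' and 'dev@corp' collapse together
SELECT LOWER(TRIM(email)) AS email_norm,
       COUNT(*) AS dup_count
FROM users
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) &amp;gt; 1
ORDER BY dup_count DESC, email_norm;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;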



&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Aggregation drills&lt;/strong&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window SQL drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — group-by&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;GROUP BY drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/group-by/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Indexes, transactions, ACID, and EXPLAIN-aware debugging
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Indexes + planners + transactional contracts separate “queries” from “systems”
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Default &lt;strong&gt;B-tree&lt;/strong&gt; indexes accelerate &lt;strong&gt;equality / range predicates&lt;/strong&gt; (&lt;code&gt;WHERE city = 'Hyderabad'&lt;/code&gt; or &lt;code&gt;WHERE created_at BETWEEN …&lt;/code&gt;) via seeks instead of sequential scans once selectivity warrants them.&lt;/li&gt;
&lt;li&gt;Indexes are &lt;strong&gt;derived state&lt;/strong&gt; — every insert/update/delete touches matching index entries (&lt;strong&gt;write amplification&lt;/strong&gt;); partial indexes prune maintenance when predicates cover a cohort (&lt;code&gt;WHERE churned IS FALSE&lt;/code&gt;; sketched after the next code block).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt;&lt;/strong&gt; avoids long &lt;strong&gt;ACCESS EXCLUSIVE&lt;/strong&gt; locks on Postgres hot tables during build — trade-off is longer DDL and retry bookkeeping if build fails midway.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX CONCURRENTLY idx_users_city ON users (city);

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM users
WHERE city = 'Hyderabad';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
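
&lt;p&gt;The partial-index point from the list above, as a sketch (the &lt;code&gt;churned&lt;/code&gt; predicate is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Maintain index entries only for the active cohort that hot queries touch
CREATE INDEX CONCURRENTLY idx_users_active_city
    ON users (city)
    WHERE churned IS FALSE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;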

&lt;p&gt;&lt;strong&gt;Interview-grade &lt;code&gt;EXPLAIN&lt;/code&gt; reading hints.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prefer &lt;strong&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;&lt;/strong&gt; for truth about &lt;strong&gt;buffers hit vs read&lt;/strong&gt;, &lt;strong&gt;loops&lt;/strong&gt;, actual row counts — textual plans alone omit runtime skew.&lt;/li&gt;
&lt;li&gt;Watch &lt;strong&gt;estimated vs actual&lt;/strong&gt; row mismatches (&lt;strong&gt;bad stats&lt;/strong&gt; ⇒ wrong join order, surprise nested loops on big inner sides).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Seq Scan&lt;/code&gt;&lt;/strong&gt; is acceptable on small tables or low-selectivity predicates; it turns punitive when millions of rows are scanned only for the filter to discard most of them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ACID anchors
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;letter&lt;/th&gt;
&lt;th&gt;mnemonic&lt;/th&gt;
&lt;th&gt;what to say in-panel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Atomicity&lt;/td&gt;
&lt;td&gt;all statements commit together or &lt;strong&gt;&lt;code&gt;ROLLBACK&lt;/code&gt;&lt;/strong&gt; rewinds observable effects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Consistency&lt;/td&gt;
&lt;td&gt;constraints + declarative checks hold on commit boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;I&lt;/td&gt;
&lt;td&gt;Isolation&lt;/td&gt;
&lt;td&gt;levels trade phantom reads vs concurrency — &lt;strong&gt;&lt;code&gt;READ COMMITTED&lt;/code&gt;&lt;/strong&gt; default on Postgres exposes committed deltas each statement; &lt;strong&gt;&lt;code&gt;REPEATABLE READ&lt;/code&gt; / MVCC snapshots&lt;/strong&gt; tame drift for longer read transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Durability&lt;/td&gt;
&lt;td&gt;WAL persists committed work across crashes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Practical transaction hygiene
&lt;/h4&gt;

&lt;p&gt;Declare explicit boundaries (&lt;code&gt;BEGIN&lt;/code&gt; / &lt;strong&gt;&lt;code&gt;COMMIT&lt;/code&gt;&lt;/strong&gt;) when orchestrating multi-table invariants — ORMs emitting autocommit per statement fracture money-movement narratives. Serialization failures (&lt;code&gt;SQLSTATE 40001&lt;/code&gt;) under &lt;strong&gt;&lt;code&gt;SERIALIZABLE&lt;/code&gt;&lt;/strong&gt; / snapshot conflicts signal &lt;strong&gt;retry&lt;/strong&gt; opportunities, not “random database bugs.” Pair schema changes with &lt;strong&gt;&lt;code&gt;CONCURRENTLY&lt;/code&gt;&lt;/strong&gt; index creation and phased backfills whenever traffic tolerates eventual-consistency stages.&lt;/p&gt;
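
&lt;p&gt;A minimal boundary sketch (the &lt;code&gt;accounts&lt;/code&gt; table and amounts are illustrative; the retry loop itself belongs in application code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Under SERIALIZABLE, a conflicting peer can force SQLSTATE 40001 here;
-- the correct response is ROLLBACK plus a retry of the whole unit
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;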




&lt;h2&gt;
  
  
  Choosing SQL skills for your stack (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;horizon&lt;/th&gt;
&lt;th&gt;competency&lt;/th&gt;
&lt;th&gt;Practice lane&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Week 1&lt;/td&gt;
&lt;td&gt;DDL + FK stories&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/explore/practice/topic/database"&gt;/explore/practice/topic/database →&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 2&lt;/td&gt;
&lt;td&gt;predicates + paging&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/explore/practice/topic/filtering/sql"&gt;/explore/practice/topic/filtering/sql →&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 3&lt;/td&gt;
&lt;td&gt;joins + cardinality&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/explore/practice/topic/joins/sql"&gt;/explore/practice/topic/joins/sql →&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Week 4&lt;/td&gt;
&lt;td&gt;aggregates/windows&lt;/td&gt;
&lt;td&gt;&lt;a href="https://dev.to/explore/practice/topic/aggregation/sql"&gt;/explore/practice/topic/aggregation/sql →&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Which dialect first?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;PostgreSQL&lt;/strong&gt; is the pragmatic default — rich standard SQL surface, expressive &lt;code&gt;JSONB&lt;/code&gt;, strong isolation story, ubiquitous in modern stacks, and interview panels often cite it explicitly. Treat &lt;strong&gt;SQLite&lt;/strong&gt; as excellent for correctness drills and &lt;strong&gt;&lt;code&gt;EXPLAIN QUERY PLAN&lt;/code&gt;&lt;/strong&gt; intuition, &lt;strong&gt;MySQL&lt;/strong&gt; where legacy hiring signals demand it — but defer dialect trivia until relational mechanics feel automatic.&lt;/p&gt;

&lt;h3&gt;
  
  
  LEFT JOIN … IS NULL versus NOT EXISTS?
&lt;/h3&gt;

&lt;p&gt;Both express &lt;strong&gt;relational difference&lt;/strong&gt; (&lt;strong&gt;anti-semijoins&lt;/strong&gt;). &lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; + NULL filter&lt;/strong&gt; is readable when &lt;strong&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/strong&gt; is a dependable surrogate and you want symmetry with exploratory outer joins. &lt;strong&gt;&lt;code&gt;NOT EXISTS&lt;/code&gt;&lt;/strong&gt; scales mentally when duplicates on the right explode intermediate row counts — multiplicity does not falsify existential absence. Mention both in reviews; pick based on cardinality and planner friendliness verified with &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  When do aggregates beat windows?
&lt;/h3&gt;

&lt;p&gt;Collapse to metric tables or cohort summaries ⇒ &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; + aggregates + &lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt;. Preserve row grain while ranking/deduping/running deltas ⇒ &lt;strong&gt;&lt;code&gt;OVER()&lt;/code&gt; partitions&lt;/strong&gt;. Mixed patterns stack CTE layers: aggregate subtotals upstream, &lt;strong&gt;&lt;code&gt;JOIN&lt;/code&gt;&lt;/strong&gt; keyed aggregates back, then window across enriched rows once uniqueness returns.&lt;/p&gt;
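
&lt;p&gt;A hedged sketch of that stacking, assuming a toy &lt;code&gt;orders(order_id, city, amount)&lt;/code&gt; relation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Stage 1 collapses to city grain; stage 2 reattaches and windows at row grain
WITH city_totals AS (
    SELECT city, SUM(amount) AS city_rev
    FROM orders
    GROUP BY city
)
SELECT o.order_id,
       o.city,
       o.amount,
       t.city_rev,
       RANK() OVER (PARTITION BY o.city ORDER BY o.amount DESC) AS rnk
FROM orders o
JOIN city_totals t USING (city);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;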

&lt;h3&gt;
  
  
  Are indexes unequivocally good?
&lt;/h3&gt;

&lt;p&gt;No — each index consumes storage, lengthens mutation paths (insert/update hotspots), and complicates migrations. Prefer &lt;strong&gt;narrow partial indexes&lt;/strong&gt;, &lt;strong&gt;covering composites&lt;/strong&gt; aligning with &lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt;&lt;/strong&gt; when read savings dominate, but &lt;strong&gt;baseline with metrics&lt;/strong&gt; (&lt;strong&gt;p95 latency&lt;/strong&gt;, churn on write-heavy queues) rather than speculative indexing folklore.&lt;/p&gt;
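
&lt;p&gt;One covering-composite sketch (PostgreSQL 11+ &lt;code&gt;INCLUDE&lt;/code&gt; syntax; the table and columns are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Key columns match the query's WHERE + ORDER BY; INCLUDE carries payload
-- columns so qualifying reads can stay index-only
CREATE INDEX idx_orders_user_recent
    ON orders (user_id, created_at DESC)
    INCLUDE (total_usd);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;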

&lt;h3&gt;
  
  
  What proves seniority fastest?
&lt;/h3&gt;

&lt;p&gt;Fluent &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt; storytelling&lt;/strong&gt; (&lt;strong&gt;buffer churn&lt;/strong&gt;, mis-estimates), articulating isolation anomalies you have actually chased (&lt;strong&gt;lost updates&lt;/strong&gt;, phantom reads), and disciplined schema evolution (&lt;strong&gt;locking&lt;/strong&gt;, backfills, concurrency-safe DDL). Read alone plateaus — annotate past incidents with reproduction SQL and planner diffs during debrief loops.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do ORMs replace raw SQL literacy?
&lt;/h3&gt;

&lt;p&gt;ORMs scaffold CRUD ergonomics quickly, yet &lt;strong&gt;&lt;code&gt;N+1&lt;/code&gt;&lt;/strong&gt;, implicit transaction scopes, brittle migrations, deadlock graphs, and &lt;strong&gt;hot-index regressions&lt;/strong&gt; still surface raw SQL realities. Seniors jump between declarative mappings and handwritten SQL confidently because production incidents rarely respect abstraction boundaries exclusively.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; interview-grade problems spanning &lt;strong&gt;SQL joins&lt;/strong&gt;, aggregates, transactional logic, analytics windows, &lt;strong&gt;subqueries&lt;/strong&gt;, and pragmatic schema debugging. Anchor on &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, escalate through &lt;a href="https://dev.to/explore/practice/language/sql"&gt;language SQL →&lt;/a&gt;, grab sets like &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;topic SQL →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/subqueries/sql"&gt;subqueries →&lt;/a&gt;, and &lt;a href="https://dev.to/subscribe"&gt;unlock plans →&lt;/a&gt; whenever you want unrestricted runs.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>CTE in SQL for Data Engineering Interviews: WITH Clauses, Recursive CTEs, and Window SQL Patterns</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Wed, 13 May 2026 05:07:40 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/cte-in-sql-for-data-engineering-interviews-with-clauses-recursive-ctes-and-window-sql-patterns-2978</link>
      <guid>https://forem.com/gowthampotureddi/cte-in-sql-for-data-engineering-interviews-with-clauses-recursive-ctes-and-window-sql-patterns-2978</guid>
      <description>&lt;p&gt;&lt;strong&gt;CTE in SQL&lt;/strong&gt; — a &lt;em&gt;Common Table Expression&lt;/em&gt; introduced with &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt; — is how you turn a brittle wall of nested subqueries into a readable, debuggable pipeline. In data engineering interviews — the same lanes as &lt;strong&gt;basic sql interview questions&lt;/strong&gt;, &lt;strong&gt;joins in sql interview questions&lt;/strong&gt;, and &lt;strong&gt;sql interview questions with answers&lt;/strong&gt; rounds — reviewers use CTEs as a readability signal: if you can name intermediate results and chain them cleanly, you will survive the live whiteboard refactor.&lt;/p&gt;

&lt;p&gt;You will build the pattern from scratch: single-CTE anatomy, chained CTE pipelines, &lt;strong&gt;CTE + sql window functions&lt;/strong&gt; for rank-then-filter (top‑N per group), &lt;strong&gt;&lt;code&gt;WITH RECURSIVE&lt;/code&gt;&lt;/strong&gt; org-chart and counting sequences, and the classic board question triad (&lt;strong&gt;CTE vs subquery vs temporary table&lt;/strong&gt;) with blunt trade-offs. Every interview-style beat ends as &lt;strong&gt;sql interview questions with answers&lt;/strong&gt;: runnable Postgres-flavoured SQL, a traced execution, printed output tables, and a concept-by-concept &lt;strong&gt;why this works&lt;/strong&gt; map — without repeating the boilerplate headline as keyword stuffing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F922outataalkvkx3hvak.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F922outataalkvkx3hvak.jpeg" alt="PipeCode blog header for a CTE in SQL interview guide — bold white headline 'CTE in SQL' with subtitle 'WITH · recursion · window SQL' and a minimal diagram showing a WITH clause feeding a SELECT on a dark gradient with purple, green, and orange accents and a small pipecode.ai attribution." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you want &lt;strong&gt;hands-on reps&lt;/strong&gt; immediately after reading, browse &lt;a href="https://dev.to/explore/practice/topic/cte/sql"&gt;CTE (SQL) practice →&lt;/a&gt;, dive &lt;a href="https://dev.to/explore/practice/language/sql"&gt;practice SQL hub →&lt;/a&gt;, sharpen &lt;a href="https://dev.to/explore/practice/topic/window-functions"&gt;window function SQL →&lt;/a&gt;, rehearse &lt;a href="https://dev.to/explore/practice/topic/joins"&gt;join interview SQL →&lt;/a&gt;, or widen coverage on the general &lt;a href="https://dev.to/explore/practice/topic/cte"&gt;CTE topic →&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why CTEs matter in interviews and pipelines&lt;/li&gt;
&lt;li&gt;Single CTE — anatomy of &lt;code&gt;WITH … AS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Multiple chained CTEs — read top-down like dbt staging&lt;/li&gt;
&lt;li&gt;CTE with joins and aggregations&lt;/li&gt;
&lt;li&gt;CTE + sql window functions — rank, then filter&lt;/li&gt;
&lt;li&gt;Recursive CTE — hierarchies without leaving SQL&lt;/li&gt;
&lt;li&gt;CTE vs subquery vs temp table — interview trade-offs&lt;/li&gt;
&lt;li&gt;Choosing CTE usage (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why CTEs matter in interviews and pipelines
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The readability invariant interviewers optimise for
&lt;/h3&gt;

&lt;p&gt;The core invariant you should state aloud: &lt;strong&gt;a CTE binds a disposable, named relational expression to one outer &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, or &lt;code&gt;DELETE&lt;/code&gt;; it disappears when the enclosing statement completes—no DDL, no session lease—and it reads cleaner than tortured nested parentheses&lt;/strong&gt;. That wording alone answers half of the "&lt;strong&gt;what is CTE&lt;/strong&gt;" prompts you will hear in screening loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bound to one statement&lt;/strong&gt; — every &lt;code&gt;WITH&lt;/code&gt; chain is part of a single top-level statement. Nothing "lives" after &lt;code&gt;COMMIT&lt;/code&gt;/&lt;code&gt;ROLLBACK&lt;/code&gt; of that unit the way a temp table does; you cannot &lt;code&gt;SELECT&lt;/code&gt; a CTE alias from a &lt;em&gt;different&lt;/em&gt; query in the same session without copy-pasting the definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Algebraic, not physical&lt;/strong&gt; — the engine may &lt;strong&gt;inline&lt;/strong&gt; a CTE into the outer query, &lt;strong&gt;merge&lt;/strong&gt; predicates, or &lt;strong&gt;hoist&lt;/strong&gt; joins. You name relations for &lt;strong&gt;humans and maintainers&lt;/strong&gt;; the planner still rewrites. Senior answers separate &lt;strong&gt;authoring ergonomics&lt;/strong&gt; from &lt;strong&gt;execution guarantees&lt;/strong&gt; (see §7 for materialization nuance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same relational type as a subquery&lt;/strong&gt; — a CTE produces a relation (bag or set depending on &lt;code&gt;UNION&lt;/code&gt; vs &lt;code&gt;UNION ALL&lt;/code&gt;); every row has a schema fixed by the inner &lt;code&gt;SELECT&lt;/code&gt; list. That is why you can stack CTEs like typed pipeline stages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When an interviewer asks why you reached for &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt; instead of a derived table, cite three forces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Nameability&lt;/strong&gt; — &lt;code&gt;avg_salary&lt;/code&gt; reads better than "&lt;code&gt;subselect&lt;/code&gt; #3" and signals &lt;em&gt;intent&lt;/em&gt; in code review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Refactor‑friendliness&lt;/strong&gt; — comment out a CTE while debugging without rewiring parentheses; reorder stages when the story changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composition&lt;/strong&gt; — later CTEs may reference earlier ones in the same chain; that mirrors dbt / SQL transformation layers without leaving the warehouse editor.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When someone says "walk me through this query," start at the &lt;em&gt;last&lt;/em&gt; CTE in the chain and narrate backward to data sources—mirrors how many optimisers inline, and shows you control the algebraic story end-to-end.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating CTE results as persisted tables—they are &lt;strong&gt;logical scopes&lt;/strong&gt;, not caches (mention &lt;strong&gt;materialized CTE hints&lt;/strong&gt; only when you genuinely know your engine exposes them).&lt;/li&gt;
&lt;li&gt;Inlining seventeen anonymous subqueries to "save lines" — interviewers downgrade readability instantly.&lt;/li&gt;
&lt;li&gt;Hiding mutating statements inside so-called readability layers — keep each CTE a pure relational expression unless the prompt explicitly mixes DML patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusing a CTE name&lt;/strong&gt; as if it were a view registered in the catalog—only the enclosing statement can see it; cross-statement reuse needs a real view, temp table, or ORM layer.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Single CTE — anatomy of &lt;code&gt;WITH … AS&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3uus9rnrqnnd1i6xjv4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3uus9rnrqnnd1i6xjv4.jpeg" alt="Diagram of a single Common Table Expression showing a rounded box WITH high_earners AS (SELECT … FROM employees WHERE salary &amp;gt; threshold) feeding a main SELECT statement, with labels 'named result' and 'one statement scope' on a light PipeCode card." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Name the intermediate relation, then consume it once
&lt;/h3&gt;

&lt;p&gt;Every non-recursive &lt;strong&gt;CTE&lt;/strong&gt; has the same silhouette:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH alias AS (
    SELECT ...
)
SELECT ... FROM alias;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Warehouses disagree on microscopic optimisation trivia; they agree on this grammar.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WITH&lt;/code&gt; binds scope, not storage
&lt;/h4&gt;

&lt;p&gt;Think of &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt; as &lt;em&gt;lexical&lt;/em&gt;, not physical: engines may inline the CTE, merge filters, reorder joins—the author's job is to express intent cleanly so downstream planners &lt;strong&gt;can&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column list optional&lt;/strong&gt; — you may write &lt;code&gt;WITH cte (a, b) AS (SELECT …)&lt;/code&gt; to rename projected columns explicitly; handy when outer queries should not depend on brittle inner aliases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple references&lt;/strong&gt; — the outer statement may reference the same CTE &lt;strong&gt;twice&lt;/strong&gt; (&lt;code&gt;FROM cte c1 JOIN cte c2 …&lt;/code&gt;). That is legal but can surprise optimizers into scanning twice unless merged; mention that aloud if the interviewer asks about cost. A sketch of this and the column-list detail follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mutual exclusion with some clauses&lt;/strong&gt; — dialect details vary, but in interviews assume the CTE's inner &lt;code&gt;SELECT&lt;/code&gt; follows normal rules (no bare aggregates without &lt;code&gt;GROUP BY&lt;/code&gt;, no illegal forward references to outer query columns—correlated subqueries still need explicit correlation).&lt;/li&gt;
&lt;/ul&gt;
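
&lt;p&gt;A short sketch of those two details, assuming a toy &lt;code&gt;events(event_date)&lt;/code&gt; relation (illustrative, not from the worked examples):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- The column list (d, n) renames the projection; the outer query then
-- references the same CTE twice for a day-over-day comparison
WITH daily (d, n) AS (
    SELECT event_date, COUNT(*)
    FROM events
    GROUP BY event_date
)
SELECT today.d, today.n, yesterday.n AS prev_n
FROM daily today
JOIN daily yesterday ON yesterday.d = today.d - 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;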

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Filter highly paid rows before projecting columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Concept&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alias&lt;/td&gt;
&lt;td&gt;reusable handle &lt;code&gt;big_earn&lt;/code&gt; inside the outer query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scope&lt;/td&gt;
&lt;td&gt;disappears after outer &lt;code&gt;SELECT&lt;/code&gt; finishes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contrast vs inline&lt;/td&gt;
&lt;td&gt;same relational idea, sharper whiteboard narration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Define &lt;code&gt;big_earn&lt;/code&gt; as &lt;code&gt;SELECT emp_id, name, salary FROM employees WHERE salary &amp;gt; 50000&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Outer query selects &lt;code&gt;SELECT name, salary FROM big_earn ORDER BY salary DESC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;LIMIT 3&lt;/code&gt; knowing the filtration already happened upstream logically — &lt;strong&gt;note:&lt;/strong&gt; &lt;code&gt;ORDER BY&lt;/code&gt; + &lt;code&gt;LIMIT&lt;/code&gt; apply to the &lt;strong&gt;outer&lt;/strong&gt; grain after the CTE is bound, which matches how you narrate debugging ("sort the named relation, then truncate").&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;big_earn&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;big_earn&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Common beginner traps
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting that &lt;strong&gt;outer &lt;code&gt;WHERE&lt;/code&gt; cannot "reach into"&lt;/strong&gt; the CTE's hidden predicates unless you expose columns—push stable filters &lt;strong&gt;into&lt;/strong&gt; the CTE when they shrink working sets early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the subquery has a &lt;em&gt;business meaning&lt;/em&gt; ("high earners," "valid orders"), &lt;strong&gt;name it&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Multiple chained CTEs — read top-down like dbt staging
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3xd6eir7mc064agjoik.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc3xd6eir7mc064agjoik.jpeg" alt="Three-step SQL CTE pipeline diagram: raw_orders → daily_revenue → top_days as stacked rounded cards connected by downward arrows, emphasizing read-top-to-bottom debugging on a light PipeCode-branded infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Each new CTE may reference any prior CTE in the chain
&lt;/h3&gt;

&lt;p&gt;The multi-CTE invariant: &lt;strong&gt;order matters only for name resolution—&lt;code&gt;cte_n&lt;/code&gt; may reference &lt;code&gt;cte_{1..n-1}&lt;/code&gt; but never forward-declare aliases you have not defined yet&lt;/strong&gt;. This mirrors layered SQL transformations: staging → intermediates → marts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Acyclic name graph&lt;/strong&gt; — imagine an edge from &lt;code&gt;daily_totals&lt;/code&gt; → &lt;code&gt;raw_orders&lt;/code&gt; because the former reads the latter. The textual order you write is a &lt;strong&gt;valid topological sort&lt;/strong&gt; of that dependency DAG; swap two CTEs blindly and names break.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grain discipline&lt;/strong&gt; — each stage should have a one-sentence grain statement ("one row per order," "one row per calendar day per store"). When the grain shifts, &lt;strong&gt;rename&lt;/strong&gt; the CTE so reviewers see the dimensional contract.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging workflow&lt;/strong&gt; — during a live interview, you can stub later CTEs as &lt;code&gt;SELECT * FROM prior_cte LIMIT 10&lt;/code&gt; to validate shapes before adding aggregates—chaining rewards that incremental reveal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three-step revenue SLA filter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;raw_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;normalise ingest projections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;daily_totals&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;aggregate at calendar-day grain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;flagged_days&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;predicate on revenue thresholds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;raw_orders&lt;/code&gt; projects &lt;code&gt;event_date&lt;/code&gt;, &lt;code&gt;usd_revenue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;daily_totals&lt;/code&gt; aggregates &lt;code&gt;SUM(revenue)&lt;/code&gt; grouped by calendar date — &lt;strong&gt;watch for fan-out:&lt;/strong&gt; if &lt;code&gt;raw_orders&lt;/code&gt; accidentally joins to a dimension before this step, your &lt;code&gt;SUM&lt;/code&gt; multiplies; keep joins that change cardinality either explicit in a dedicated CTE or after aggregation when semantics demand it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;flagged_days&lt;/code&gt; keeps SLA-busting days only — pure filter on an already-collapsed daily relation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_ts&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;revenue_usd&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ingest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shopify_orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;refunded&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;daily_totals&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue_usd&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rev&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;flagged_days&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;daily_totals&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rev&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;25000&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;flagged_days&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Common chaining pitfalls
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hidden Cartesian products&lt;/strong&gt; — joining two wide CTEs without keys duplicates rows; verbalize keys whenever you fuse layers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leaky filters&lt;/strong&gt; — applying &lt;code&gt;WHERE&lt;/code&gt; only in the outermost &lt;code&gt;SELECT&lt;/code&gt; while earlier CTEs still scan massive history; push time windows and partitioning predicates &lt;strong&gt;as early as the schema allows&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; rename each layer after the &lt;em&gt;grain&lt;/em&gt; (&lt;code&gt;per_order&lt;/code&gt;, &lt;code&gt;per_customer_day&lt;/code&gt;) so graders see dimensional thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. CTE with joins and aggregations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Join inside the CTE when the join &lt;em&gt;is&lt;/em&gt; the reusable story
&lt;/h3&gt;

&lt;p&gt;Pulling &lt;strong&gt;joins in sql interview questions&lt;/strong&gt; into a CTE tells the room you know how to isolate dimension wiring before applying business filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freeze aggregates once&lt;/strong&gt; — the pattern &lt;code&gt;WITH cohort_metrics AS (… GROUP BY key …)&lt;/code&gt; computes each group's summary &lt;strong&gt;exactly once&lt;/strong&gt; in the narrative. The outer query then behaves like a &lt;strong&gt;probe&lt;/strong&gt;: attach metrics to detail rows and filter on those scalars.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cardinality contract&lt;/strong&gt; — after &lt;code&gt;dept_avg&lt;/code&gt;, you expect &lt;strong&gt;one row per &lt;code&gt;dept_id&lt;/code&gt;&lt;/strong&gt; (assuming &lt;code&gt;dept_id&lt;/code&gt; is a key in &lt;code&gt;departments&lt;/code&gt;). Joining that CTE back to &lt;code&gt;employees&lt;/code&gt; on the same key should &lt;strong&gt;not&lt;/strong&gt; duplicate employees unless the aggregate CTE itself was built from a many-to-many path—if it was, split "explode" joins out of the aggregate stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interview narration&lt;/strong&gt; — say aloud: "first I collapse to department grain, then I join the employee spine at employee grain, then I filter." That three-beat story maps directly to the SQL.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Department averages with a reusable join CTE
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Employees enriched with departmental averages prior to outperform filters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Aggregate salaries by department ID.&lt;/li&gt;
&lt;li&gt;Join employees back to those aggregates.&lt;/li&gt;
&lt;li&gt;Filter rows beating their department averages.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;dept_avg&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_avg_salary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;spread&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dept_avg&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Omitting &lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt; stability&lt;/strong&gt; rules while referencing raw columns beside aggregates inside the same projection.&lt;/li&gt;
&lt;li&gt;Accidentally multiplying row counts via join cardinalities prior to aggregates—solve with guarded sub-aggregates in earlier CTEs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Alternative sketch — correlate without a JOIN in the aggregate CTE
&lt;/h4&gt;

&lt;p&gt;You can compute &lt;code&gt;AVG(...) OVER (PARTITION BY dept_id)&lt;/code&gt; in a dedicated CTE instead of grouping first; grouped CTE + join is easier to reason about aloud when &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt; readability is graded.&lt;/p&gt;
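
&lt;p&gt;A hedged version of that alternative, on the same assumed tables (note the window alias must be filtered in a &lt;em&gt;later&lt;/em&gt; scope, hence the CTE):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH salaried AS (
    SELECT emp_id,
           name,
           salary,
           AVG(salary) OVER (PARTITION BY dept_id) AS dept_avg_salary
    FROM employees
)
SELECT emp_id, name, salary, dept_avg_salary
FROM salaried
WHERE salary &amp;gt; dept_avg_salary;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;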

&lt;h3&gt;
  
  
  SQL interview question — employees beating their department average
&lt;/h3&gt;

&lt;p&gt;Assume &lt;code&gt;employees(emp_id, name, dept_id, salary)&lt;/code&gt; plus &lt;code&gt;departments(dept_id, dept_name)&lt;/code&gt;. &lt;strong&gt;Return every teammate whose salary exceeds that department's average&lt;/strong&gt;, including the departmental average beside each teammate row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a join-friendly aggregate CTE
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;dept_avg&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;numeric&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dept_avg&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;departments&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_avg_salary&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dept_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;relation&lt;/th&gt;
&lt;th&gt;outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Scan &lt;code&gt;employees&lt;/code&gt; inside &lt;code&gt;dept_avg&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;input multiset for averaging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;GROUP BY dept_id&lt;/code&gt; + &lt;code&gt;AVG(salary)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;department-grain&lt;/strong&gt; relation: one &lt;strong&gt;&lt;code&gt;dept_avg_salary&lt;/code&gt;&lt;/strong&gt; scalar per dept&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;JOIN employees … USING (dept_id)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;employee-grain&lt;/strong&gt; relation again; each worker row carries parent dept's average unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE e.salary &amp;gt; a.dept_avg_salary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;anti-regression filter — removes rows at or below cohort mean&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;dept_name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;dept_avg_salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Ava&lt;/td&gt;
&lt;td&gt;Retail&lt;/td&gt;
&lt;td&gt;92,500&lt;/td&gt;
&lt;td&gt;78,320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Mei&lt;/td&gt;
&lt;td&gt;Insights&lt;/td&gt;
&lt;td&gt;130,400&lt;/td&gt;
&lt;td&gt;110,980&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;(Demonstrative salaries — graders care about algebraic shape.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTE dept_avg&lt;/strong&gt; — freezes &lt;strong&gt;department-grain aggregates&lt;/strong&gt; exactly once per cohort; avoids repeating &lt;code&gt;AVG&lt;/code&gt; subqueries inline on every employee row in the outer text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JOIN USING (dept_id)&lt;/strong&gt; — stitches scalar averages back without exploding beyond employee cardinality when &lt;code&gt;dept_avg&lt;/code&gt; stayed at true dept key grain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter after aggregation&lt;/strong&gt; — separates &lt;em&gt;compute departmental truth&lt;/em&gt; vs &lt;em&gt;evaluate individuals&lt;/em&gt;; interviewers listen for that separation as a signal you understand &lt;strong&gt;two different grains&lt;/strong&gt; in one question.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numeric cast&lt;/strong&gt; — &lt;code&gt;::numeric(12,2)&lt;/code&gt; (or equivalent) prevents float drift in panel walkthroughs when money-like fields appear.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — hash aggregate typically &lt;strong&gt;Θ(n)&lt;/strong&gt; on the employee spine for the grouped leg, plus &lt;strong&gt;Θ(n)&lt;/strong&gt; equi-join cost to reattach; dominated by scans unless indexed paths short-circuit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE (SQL)&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;CTE‑focused SQL problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Join interview patterns&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Aggregation drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  5. CTE + sql window functions — rank, then filter
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77g1belh5t52exzh1fw7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F77g1belh5t52exzh1fw7.jpeg" alt="Two-stage diagram: left CTE ranks rows with ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC), right CTE filters rn &amp;lt;= 3, illustrating sql window functions with CTE for top-N per group." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Separate ranking logic from predicates—two CTEs beat one clever monster
&lt;/h3&gt;

&lt;p&gt;Most &lt;strong&gt;sql window functions&lt;/strong&gt; arcs follow &lt;strong&gt;rank-within partitions → predicate on ranks&lt;/strong&gt;; folding both layers into one derived table hides mistakes under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logical processing order&lt;/strong&gt; — window functions attach to rows &lt;strong&gt;after&lt;/strong&gt; &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;WHERE&lt;/code&gt;/&lt;code&gt;GROUP BY&lt;/code&gt;/&lt;code&gt;HAVING&lt;/code&gt; in the relational pipeline for that inner &lt;code&gt;SELECT&lt;/code&gt;; you typically &lt;strong&gt;cannot&lt;/strong&gt; reference a window alias in the same &lt;code&gt;SELECT&lt;/code&gt;'s &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; in standard SQL—you &lt;strong&gt;project&lt;/strong&gt; ranks in one CTE, &lt;strong&gt;filter&lt;/strong&gt; in the next (or nest a subquery). That restriction is precisely why graders like &lt;strong&gt;CTE + window&lt;/strong&gt; combos: the shape matches the semantics (a minimal sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition = rivalry scope&lt;/strong&gt; — &lt;code&gt;PARTITION BY dept_id&lt;/code&gt; means "within this department bucket only, reorder rows"; rows in other departments never compete for &lt;strong&gt;&lt;code&gt;rn = 1&lt;/code&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY&lt;/code&gt; inside &lt;code&gt;OVER&lt;/code&gt;&lt;/strong&gt; — resolves ties; &lt;strong&gt;&lt;code&gt;emp_id&lt;/code&gt;&lt;/strong&gt; as trailing sort key buys &lt;strong&gt;determinism&lt;/strong&gt; for &lt;code&gt;ROW_NUMBER&lt;/code&gt; when salaries collide.&lt;/li&gt;
&lt;/ul&gt;
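
&lt;p&gt;The first statement below fails in standard SQL; the second is the CTE-shaped fix (same &lt;code&gt;employees&lt;/code&gt; table as above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fails: rn is projected in this same SELECT, so WHERE cannot see it yet
SELECT emp_id,
       ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
FROM employees
WHERE rn &amp;lt;= 3;        -- ERROR: column "rn" does not exist

-- Works: decorate in a CTE, filter one stage downstream
WITH r AS (
    SELECT emp_id,
           ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM employees
)
SELECT emp_id
FROM r
WHERE rn &amp;lt;= 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;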

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER()&lt;/code&gt; vs cousins (pick aloud in-panel)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;function&lt;/th&gt;
&lt;th&gt;ties at same ORDER BY keys&lt;/th&gt;
&lt;th&gt;skips rank values after ties?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ROW_NUMBER()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;assigns distinct numbers to tied rows (arbitrary unless sort keys are exhaustive)&lt;/td&gt;
&lt;td&gt;never — strictly 1..N per partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RANK()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;equal keys share rank&lt;/td&gt;
&lt;td&gt;yes — gaps after a tie (&lt;code&gt;1, 1, 3&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DENSE_RANK()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;equal keys share rank&lt;/td&gt;
&lt;td&gt;no gaps (&lt;code&gt;1, 1, 2&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Demand &lt;strong&gt;deterministic slicing&lt;/strong&gt; ⇒ &lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; + exhaustive &lt;code&gt;ORDER BY&lt;/code&gt;&lt;/strong&gt;. Reward &lt;strong&gt;parity for equal salaries&lt;/strong&gt; ⇒ &lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; per business rule.&lt;/p&gt;
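
&lt;p&gt;To see all three side by side, a tiny self-contained probe (inline &lt;code&gt;VALUES&lt;/code&gt;, Postgres-style, toy rows):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tied 90,000 salaries expose the three numbering families
SELECT name, salary,
       ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,  -- 1, 2, 3, 4 (B/C order arbitrary)
       RANK()       OVER (ORDER BY salary DESC) AS rnk,      -- 1, 2, 2, 4 (gap after the tie)
       DENSE_RANK() OVER (ORDER BY salary DESC) AS drnk      -- 1, 2, 2, 3 (no gap)
FROM (VALUES ('A', 95000), ('B', 90000), ('C', 90000), ('D', 88000))
     AS t(name, salary);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;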

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER()&lt;/code&gt; is deterministic for mechanical top‑N slicing
&lt;/h4&gt;

&lt;p&gt;Different from &lt;strong&gt;&lt;code&gt;RANK()&lt;/code&gt;&lt;/strong&gt;—choose deliberately when duplicates must share honours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Produce top‑3 salaries per department with deterministic ties.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Decorate employees with &lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC, emp_id ASC)&lt;/code&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Filter &lt;code&gt;rn &amp;lt;= 3&lt;/code&gt; in downstream scope—the second CTE (or outer &lt;code&gt;SELECT&lt;/code&gt;) is your &lt;strong&gt;predicate stage&lt;/strong&gt; isolated from analytic decoration.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  SQL interview question — top two salaries whenever a department staffs at least two people
&lt;/h3&gt;

&lt;p&gt;Assume &lt;code&gt;employees(dept_id, emp_id, name, salary)&lt;/code&gt;. &lt;strong&gt;Return rows only for departments with population ≥ 2&lt;/strong&gt;, showing the highest two salaries in each (&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; ordering, break ties on &lt;code&gt;emp_id&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution using stacked CTEs for readability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dept_pop&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;top_two&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dept_pop&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;top_two&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;dept_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ranked&lt;/code&gt; decorates rows with deterministic &lt;code&gt;rn&lt;/code&gt;; &lt;strong&gt;grain unchanged&lt;/strong&gt; vs base &lt;code&gt;employees&lt;/code&gt; — one output row still per &lt;code&gt;(dept_id, emp_id)&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COUNT(*) OVER (PARTITION BY dept_id)&lt;/code&gt; is a &lt;strong&gt;pure scalar broadcast&lt;/strong&gt; inside each dept — computes headcount &lt;strong&gt;without self-joining&lt;/strong&gt; aggregates back.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;top_two&lt;/code&gt; enforces BOTH “big enough dept” plus podium depth simultaneously — logically equivalent to &lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; on dept size after hypothetical &lt;code&gt;GROUP BY&lt;/code&gt;, but keeps &lt;strong&gt;employee-level projection&lt;/strong&gt; intact.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dept_id&lt;/th&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Ava&lt;/td&gt;
&lt;td&gt;98,400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Omar&lt;/td&gt;
&lt;td&gt;95,050&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Zoe&lt;/td&gt;
&lt;td&gt;120,010&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Kai&lt;/td&gt;
&lt;td&gt;116,300&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Window PARTITION BY&lt;/strong&gt; — constrains rivalry to departmental peers exclusively; cross-department ordering is irrelevant noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROW_NUMBER semantics&lt;/strong&gt; — hands you &lt;strong&gt;distinct&lt;/strong&gt; ranks per partition for deterministic slicing when tie-break columns are explicit (&lt;code&gt;emp_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COUNT(*) window&lt;/strong&gt; — encodes interviewer guardrails (“only multi-person teams”) &lt;strong&gt;without collapsing&lt;/strong&gt; rows pre-rank (&lt;code&gt;GROUP BY&lt;/code&gt; would destroy per-employee &lt;code&gt;rn&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stacked CTEs&lt;/strong&gt; — separate &lt;strong&gt;decorate&lt;/strong&gt; (&lt;code&gt;ranked&lt;/code&gt;) from &lt;strong&gt;predicate&lt;/strong&gt; (&lt;code&gt;top_two&lt;/code&gt;), mirroring dialect rules about filtering window outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost intuition&lt;/strong&gt; — the window sort typically costs &lt;strong&gt;Θ(n log n)&lt;/strong&gt; over the fact spine; mention &lt;strong&gt;index-friendly &lt;code&gt;ORDER BY&lt;/code&gt; keys&lt;/strong&gt; (&lt;code&gt;dept_id, salary DESC&lt;/code&gt;) as an optimization avenue if indexes exist.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;CTE topic lane&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window-function SQL problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTEs&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Broader CTE practice set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/ctes" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Recursive CTE — hierarchies without leaving SQL
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k7qjyka0nwcol0psldk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7k7qjyka0nwcol0psldk.jpeg" alt="Recursive CTE diagram showing base SELECT anchors UNION ALL recursive SELECT for employee-manager hierarchy — labels 'base case' and 'inductive step' on a light infographic." width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;WITH RECURSIVE&lt;/code&gt; stitches an anchor SELECT to an inductive UNION ALL spine
&lt;/h3&gt;

&lt;p&gt;Recursive patterns answer organisation charts, bill-of-material explosions, and controlled sequence generation; for acyclic hierarchies, &lt;strong&gt;&lt;code&gt;WITH RECURSIVE&lt;/code&gt;&lt;/strong&gt; lands squarely inside &lt;strong&gt;CTE in SQL&lt;/strong&gt; interviewer expectations across Postgres-first DE loops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On each evaluation round, SQL engines conceptually compute:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anchor&lt;/strong&gt; (non-recursive &lt;code&gt;SELECT&lt;/code&gt;) — initial working set (&lt;strong&gt;frontier&lt;/strong&gt;, generation &lt;strong&gt;0&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive member&lt;/strong&gt; — &lt;code&gt;SELECT … JOIN recursive_cte_alias …&lt;/code&gt;; the join references the &lt;strong&gt;&lt;code&gt;WITH RECURSIVE&lt;/code&gt; alias&lt;/strong&gt; (&lt;code&gt;tree&lt;/code&gt;, &lt;code&gt;seq&lt;/code&gt;, …). Results are &lt;strong&gt;&lt;code&gt;UNION ALL&lt;/code&gt;&lt;/strong&gt;-appended unless you explicitly demand dedupe with &lt;strong&gt;&lt;code&gt;UNION&lt;/code&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed point&lt;/strong&gt; — iteration stops when the recursive member contributes &lt;strong&gt;no new rows&lt;/strong&gt; under the engine's recursive evaluation rules (graphs with &lt;strong&gt;cycles&lt;/strong&gt; need explicit guarding — anchors + cycle columns or rewriting — otherwise recursion becomes pathological).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Invariant checklist:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Anchor SELECT&lt;/strong&gt; gathers starting frontier rows (often roots where parent keys are NULL).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursive SELECT&lt;/strong&gt; joins the prior frontier back to driving base rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNION ALL&lt;/code&gt;&lt;/strong&gt; appends successive generations (&lt;strong&gt;&lt;code&gt;UNION&lt;/code&gt;&lt;/strong&gt; when dedupe is mandated explicitly and you accept DISTINCT cost).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cycles blow up naive trees&lt;/strong&gt; — state how you would detect or prevent them (&lt;code&gt;CYCLE&lt;/code&gt;, path arrays, &lt;strong&gt;&lt;code&gt;UNION&lt;/code&gt;&lt;/strong&gt; dedupe semantics, or procedural escape hatches).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This toy emits &lt;code&gt;{1…5}&lt;/code&gt; — it isolates recursion mechanics before you attach &lt;strong&gt;&lt;code&gt;employees&lt;/code&gt;&lt;/strong&gt; edges.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL interview question — enumerate every descendant under VP &lt;code&gt;vp_id&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;employees(emp_id, name, manager_id)&lt;/code&gt; with an &lt;strong&gt;acyclic&lt;/strong&gt; tree, &lt;strong&gt;&lt;code&gt;vp_id&lt;/code&gt; acts as subtree root.&lt;/strong&gt; Emit every reachable employee with hierarchical &lt;strong&gt;level&lt;/strong&gt; numbering starting at &lt;strong&gt;&lt;code&gt;0&lt;/code&gt;&lt;/strong&gt; for the VP herself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution using Postgres &lt;code&gt;WITH RECURSIVE&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="k"&gt;RECURSIVE&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;vp_id&lt;/span&gt;
    &lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emp_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;REPEAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'  '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;indent_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;emp_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;frontier&lt;/th&gt;
&lt;th&gt;expands by&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;depth 0&lt;/td&gt;
&lt;td&gt;VP seed row from anchor — subtree root anchored at interviewer parameter&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;depth k ≥ 1&lt;/td&gt;
&lt;td&gt;every employee whose &lt;strong&gt;&lt;code&gt;manager_id&lt;/code&gt;&lt;/strong&gt; references someone already resident in &lt;strong&gt;&lt;code&gt;tree&lt;/code&gt;&lt;/strong&gt; (inductive join)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Expansion halts when the recursive &lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt; contributes an empty frontier: &lt;strong&gt;&lt;code&gt;UNION ALL&lt;/code&gt;&lt;/strong&gt; performs no deduplication, so termination relies on the &lt;strong&gt;cycle-free&lt;/strong&gt; tree running out of new hires once leaf employees manage nobody. &lt;strong&gt;Worst-case work&lt;/strong&gt; is proportional to the &lt;strong&gt;&lt;code&gt;|subtree|&lt;/code&gt;&lt;/strong&gt; edges traversed, modulo engine batching semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;level&lt;/th&gt;
&lt;th&gt;indent_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Quinn&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Quinn&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;td&gt;Ravi&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;··Ravi&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anchor picks root identity&lt;/strong&gt; — pins recursion context to interviewer-supplied &lt;strong&gt;&lt;code&gt;vp_id&lt;/code&gt;&lt;/strong&gt;; switching roots is a literal parameter tweak, no structural rewrite.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JOIN aligns generations&lt;/strong&gt; — children attach &lt;strong&gt;only&lt;/strong&gt; to parents already enumerated, matching org-tree edge direction (&lt;code&gt;employee.manager_id → parent.emp_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level accumulator&lt;/strong&gt; — &lt;code&gt;0&lt;/code&gt; at VP, increments per hop—communicates depth for indented displays, &lt;strong&gt;maximum-depth caps&lt;/strong&gt; (&lt;code&gt;WHERE level &amp;lt;= …&lt;/code&gt;), or post-filtering slices in outer queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UNION ALL composition&lt;/strong&gt; — appends frontier rows without collapsing duplicates that a deduplicating &lt;strong&gt;&lt;code&gt;UNION&lt;/code&gt;&lt;/strong&gt; might hide prematurely on messy data (distinctness still comes from &lt;strong&gt;&lt;code&gt;emp_id&lt;/code&gt;&lt;/strong&gt; uniqueness in a clean HR dimension).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complexity intuition&lt;/strong&gt; — visits each node in the reached subtree &lt;strong&gt;once&lt;/strong&gt; along each valid parent link in acyclic settings; cycles break this story—call that out explicitly in senior panels, and guard as sketched below.&lt;/li&gt;
&lt;/ul&gt;
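
&lt;p&gt;For the cycle guard flagged in the last bullet, one common Postgres pattern carries a visited-path array; a hedged sketch, not the only dialect idiom (some engines expose a &lt;code&gt;CYCLE&lt;/code&gt; clause instead):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cycle-safe variant: track the path walked and refuse to revisit nodes
WITH RECURSIVE tree AS (
    SELECT emp_id, name, manager_id,
           0 AS level,
           ARRAY[emp_id] AS path               -- visited nodes, in hop order
    FROM employees
    WHERE emp_id = :vp_id
    UNION ALL
    SELECT e.emp_id, e.name, e.manager_id,
           tree.level + 1,
           tree.path || e.emp_id
    FROM employees e
    JOIN tree ON e.manager_id = tree.emp_id
    WHERE NOT e.emp_id = ANY(tree.path)        -- guard: break cycles
      AND tree.level &amp;lt; 50                      -- belt-and-braces depth cap
)
SELECT emp_id, name, level FROM tree;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;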

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE/SQL hub&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Deep CTE‑SQL drills&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Python&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — recursion&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Recursion practice (adjacent muscle)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/recursion/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  7. CTE vs subquery vs temp table — interview trade-offs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Know which tool implies which lifecycle story
&lt;/h3&gt;

&lt;p&gt;Three-way grading grid interviewers memorize:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Lifetime&lt;/th&gt;
&lt;th&gt;Typical debugging&lt;/th&gt;
&lt;th&gt;Replay story&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CTE (&lt;code&gt;WITH&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;enclosing statement only&lt;/td&gt;
&lt;td&gt;annotate pipeline layers mentally&lt;/td&gt;
&lt;td&gt;rerun full statement blob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inline derived table&lt;/td&gt;
&lt;td&gt;same statement span&lt;/td&gt;
&lt;td&gt;cramped syntax&lt;/td&gt;
&lt;td&gt;rerun entire mega-expression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CREATE TEMP TABLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;session-bound&lt;/td&gt;
&lt;td&gt;query the materialised rows directly&lt;/td&gt;
&lt;td&gt;rerun arbitrary downstream slices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Detailed explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inlining vs forcing materialization&lt;/strong&gt; — on &lt;strong&gt;PostgreSQL 12+&lt;/strong&gt;, &lt;code&gt;WITH foo AS (...) SELECT …&lt;/code&gt; is usually &lt;strong&gt;inlined&lt;/strong&gt; and optimised like a derived table; write &lt;strong&gt;&lt;code&gt;AS MATERIALIZED&lt;/code&gt;&lt;/strong&gt; (or &lt;strong&gt;&lt;code&gt;NOT MATERIALIZED&lt;/code&gt;&lt;/strong&gt;) once you deliberately shape &lt;strong&gt;evaluation order&lt;/strong&gt;, &lt;strong&gt;control duplicate work&lt;/strong&gt;, or tame &lt;strong&gt;estimated-rows&lt;/strong&gt; quirks the optimizer misses (sketch after this list). Naming the dialect matters: &lt;strong&gt;Snowflake / BigQuery / other&lt;/strong&gt; warehouses each implement hints differently — never claim portability without hedging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subquery multiplicity&lt;/strong&gt; — identical inline subqueries in one statement &lt;em&gt;may&lt;/em&gt; execute more than once unless the planner deduplicates &lt;strong&gt;identical fragments&lt;/strong&gt; — CTE naming makes &lt;strong&gt;reuse intent&lt;/strong&gt; obvious to collaborators even when plans merge.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temp tables redeem exploration&lt;/strong&gt; — you can &lt;strong&gt;&lt;code&gt;CREATE INDEX ON tmp_…&lt;/code&gt;&lt;/strong&gt; for repeated joins inside a detective session; that is awkward for ephemeral CTEs. Temps also interoperate across ORM/console round-trips in one session when you slice-and-dice predicates interactively.&lt;/li&gt;
&lt;/ul&gt;
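
&lt;p&gt;A minimal PostgreSQL 12+ sketch of that materialisation fence (reusing the toy &lt;code&gt;staging.orders&lt;/code&gt; from the example below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Postgres 12+: plain CTEs may be inlined; pin the evaluation explicitly
WITH expensive AS MATERIALIZED (   -- evaluate once, reuse the frozen result
    SELECT customer_id, SUM(revenue_usd) AS rev
    FROM staging.orders
    GROUP BY customer_id
)
SELECT customer_id, rev
FROM expensive
WHERE rev &amp;gt; 10000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;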

&lt;p&gt;Interview sound bite: "&lt;strong&gt;CTEs optimise communication during one deterministic statement; &lt;code&gt;TEMP&lt;/code&gt; tables optimise iterative exploration where you rerun predicates interactively.&lt;/strong&gt;"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Exploration inside psql → TEMP wins iteration ergonomics:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;tmp_high_value_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;revenue_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_rev&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;tmp_high_value_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- vs production SQL model favouring readability:&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;high_value_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;revenue_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;avg_rev&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;high_value_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
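
&lt;p&gt;When the temp table feeds repeated joins inside the session, index it on the spot (index name and column choice hypothetical):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Session-scoped index: impossible on an ephemeral CTE, trivial on a temp
CREATE INDEX idx_tmp_hvo_customer
    ON tmp_high_value_orders (customer_id);
ANALYZE tmp_high_value_orders;   -- refresh stats so the planner notices
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;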



&lt;h4&gt;
  
  
  Choosing in one breath
&lt;/h4&gt;

&lt;p&gt;Use &lt;strong&gt;CTEs&lt;/strong&gt; for &lt;strong&gt;published&lt;/strong&gt; pipelines and for pairing windows with filters; reserve &lt;strong&gt;TEMP&lt;/strong&gt; for &lt;strong&gt;sandbox&lt;/strong&gt; hypotheses, brute-force cardinality introspection, and &lt;strong&gt;multi-query&lt;/strong&gt; rehearsals; keep &lt;strong&gt;opaque inline&lt;/strong&gt; subqueries for &lt;strong&gt;one-shot&lt;/strong&gt; predicates where naming hurts more than it helps.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; articulate &lt;strong&gt;lifetime + collaborator audience&lt;/strong&gt; plainly—students often chant "plans identical" without naming engine nuances or session economics.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — subqueries&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Subquery practice lane&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/subqueries" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing CTE usage (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Situation&lt;/th&gt;
&lt;th&gt;Reach for …&lt;/th&gt;
&lt;th&gt;Avoid …&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Multi-phase authored SQL destined for repos / reviews&lt;/td&gt;
&lt;td&gt;chained CTEs&lt;/td&gt;
&lt;td&gt;one mega SELECT nobody dares refactor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Window ranking + nuanced filters&lt;/td&gt;
&lt;td&gt;sequential CTEs&lt;/td&gt;
&lt;td&gt;deeply nested analytic soup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Acyclic hierarchies entirely inside warehouse dialect&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;WITH RECURSIVE&lt;/code&gt;&lt;/strong&gt; pathways&lt;/td&gt;
&lt;td&gt;bouncing back to procedural-only apps prematurely&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ad-hoc investigation with iterative reruns&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TEMP TABLE&lt;/code&gt; + helpful indexes&lt;/td&gt;
&lt;td&gt;retyping massive CTE trees each tweak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What are typical &lt;strong&gt;cte in sql interview questions&lt;/strong&gt;?
&lt;/h3&gt;

&lt;p&gt;Expect &lt;strong&gt;definition&lt;/strong&gt; ("what token opens a CTE?" → &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt;), &lt;strong&gt;lifecycle&lt;/strong&gt; (statement scope vs session objects), &lt;strong&gt;readability refactors&lt;/strong&gt; (nested subquery → named pipeline), &lt;strong&gt;window + CTE combos&lt;/strong&gt; rank-then-filter, and &lt;strong&gt;trees&lt;/strong&gt; via &lt;strong&gt;&lt;code&gt;WITH RECURSIVE&lt;/code&gt;&lt;/strong&gt;. Panels often wedge &lt;strong&gt;one optimisation empathy question&lt;/strong&gt; ("would you force materialisation? why?"). Answer each with &lt;strong&gt;two crisp sentences&lt;/strong&gt;: mechanism first, dialect caveat second—not acronym dumps.&lt;/p&gt;

&lt;h3&gt;
  
  
  How is a CTE different from a temporary table?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;CTE&lt;/strong&gt; is &lt;strong&gt;logical sugar&lt;/strong&gt; glued to &lt;strong&gt;one enclosing statement&lt;/strong&gt;: no catalog object, disappears when the batch finishes unless you wrap it differently. &lt;strong&gt;&lt;code&gt;CREATE TEMP TABLE …&lt;/code&gt;&lt;/strong&gt; allocates &lt;strong&gt;session-scoped physical storage&lt;/strong&gt;, survives until disconnect or &lt;code&gt;DROP&lt;/code&gt;, and supports &lt;strong&gt;indexes&lt;/strong&gt; / repeated downstream queries comfortably. Reach for whichever matches &lt;strong&gt;lifetime&lt;/strong&gt; and whether collaborators need &lt;strong&gt;iterative rerun&lt;/strong&gt; ergonomics—not whichever "feels fancier".&lt;/p&gt;

&lt;h3&gt;
  
  
  When should &lt;code&gt;WITH RECURSIVE&lt;/code&gt; surrender to procedural graph code?
&lt;/h3&gt;

&lt;p&gt;Stay in SQL while the graph stays &lt;strong&gt;moderate&lt;/strong&gt;, &lt;strong&gt;acyclic&lt;/strong&gt; (or you have explicit &lt;strong&gt;&lt;code&gt;CYCLE&lt;/code&gt; / path&lt;/strong&gt; handling patterns for your dialect), and results feed &lt;strong&gt;relational&lt;/strong&gt; BI directly. Pivot to procedural / graph libs when cycles are common without clean detection, branching explodes runtimes, validations span systems outside the warehouse contract, or you need &lt;strong&gt;fine-grained backtracking/search&lt;/strong&gt; primitives SQL does not express cleanly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do CTEs automatically materialise intermediate results?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;No blanket guarantee.&lt;/strong&gt; Many optimisers &lt;strong&gt;inline / fold&lt;/strong&gt; ordinary CTEs like named subqueries. &lt;strong&gt;PostgreSQL&lt;/strong&gt; exposes &lt;strong&gt;&lt;code&gt;MATERIALIZED&lt;/code&gt; / &lt;code&gt;NOT MATERIALIZED&lt;/code&gt;&lt;/strong&gt; to steer evaluation; cite those &lt;strong&gt;only alongside the engine name&lt;/strong&gt;. &lt;strong&gt;Recursive&lt;/strong&gt; CTEs follow &lt;strong&gt;different&lt;/strong&gt; planner stories—mention &lt;strong&gt;termination&lt;/strong&gt; semantics if the interviewer probes performance cliffs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How should &lt;strong&gt;sql interview questions with answers&lt;/strong&gt; narratives flow?
&lt;/h3&gt;

&lt;p&gt;Lead with &lt;strong&gt;legible layering&lt;/strong&gt; (&lt;strong&gt;stage name → grain&lt;/strong&gt;), paste &lt;strong&gt;minimal runnable SQL&lt;/strong&gt;, then narrate &lt;strong&gt;&lt;code&gt;Step-by-step trace&lt;/code&gt;&lt;/strong&gt; from base relations outward, freeze an &lt;strong&gt;&lt;code&gt;Output&lt;/code&gt;&lt;/strong&gt; table—even toy numbers—and finish with &lt;strong&gt;&lt;code&gt;Why this works&lt;/code&gt;&lt;/strong&gt; tying &lt;strong&gt;grain, join cardinality, aggregation vs window semantics&lt;/strong&gt;, and coarse &lt;strong&gt;cost&lt;/strong&gt; intuition. Mirrors how this article stitches panel answers together.&lt;/p&gt;

&lt;h3&gt;
  
  
  What one-line summary should stick in recall?
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;"&lt;code&gt;WITH&lt;/code&gt; names intermediate relations so collaborators reason about algebraic stages the same way dbt exposes staging → mart layers—but never confuse naming with persisted storage unless you pinned it there yourself."&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; data-engineering interview problems—including &lt;strong&gt;PostgreSQL-first SQL practice&lt;/strong&gt; keyed to &lt;strong&gt;&lt;code&gt;WITH&lt;/code&gt;&lt;/strong&gt; chains, &lt;strong&gt;joins&lt;/strong&gt;, &lt;strong&gt;aggregation&lt;/strong&gt;, &lt;strong&gt;CTE + sql window functions&lt;/strong&gt;, and branching recursive reasoning.&lt;/p&gt;

&lt;p&gt;Kick off via &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;; drill the dedicated &lt;a href="https://dev.to/explore/practice/topic/cte/sql"&gt;CTE(SQL) lane →&lt;/a&gt;; fan out across &lt;a href="https://dev.to/explore/practice/topic/cte"&gt;CTE topic →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/ctes"&gt;CTE(s) bucket →&lt;/a&gt;; deepen &lt;a href="https://dev.to/explore/practice/topic/window-functions"&gt;window-functions SQL practice →&lt;/a&gt;; rehearse &lt;a href="https://dev.to/explore/practice/topic/joins"&gt;join SQL drills →&lt;/a&gt;; reinforce &lt;a href="https://dev.to/explore/practice/topic/aggregation"&gt;aggregation practice →&lt;/a&gt; whenever grouped metrics underpin your predicates.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Warehouse Design for Data Engineering Interviews: A Beginner's Guide to Fact Tables, Star Schemas, and Grain</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Tue, 12 May 2026 04:40:27 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/data-warehouse-design-for-data-engineering-interviews-a-beginners-guide-to-fact-tables-star-35ie</link>
      <guid>https://forem.com/gowthampotureddi/data-warehouse-design-for-data-engineering-interviews-a-beginners-guide-to-fact-tables-star-35ie</guid>
<description>&lt;p&gt;&lt;strong&gt;Data warehouse design&lt;/strong&gt; is the discipline of laying out tables so analytical questions are &lt;em&gt;fast, correct, and easy to ask&lt;/em&gt;. A well-designed enterprise data warehouse turns "what was revenue by region last quarter?" into a sub-second query; a badly designed one turns the same question into a 30-minute, multi-join slog whose numbers disagree with finance. For data-engineering interviews, the same three or four concepts — fact tables, dimension tables, grain, star schema, SCD — show up in every loop and every system-design round.&lt;/p&gt;

&lt;p&gt;This guide is a beginner-friendly walk through &lt;strong&gt;data warehouse design&lt;/strong&gt; from first principles. We start with OLTP vs OLAP and why the two need fundamentally different schemas, then build out the &lt;strong&gt;Kimball data warehouse&lt;/strong&gt; mental model — fact tables, dimensions, the &lt;strong&gt;star schema vs snowflake schema&lt;/strong&gt; trade-off, grain, surrogate keys, slowly changing dimensions, partitioning, and the six-step design process — with worked examples and an interview-style problem in each section. We also place the warehouse next to its neighbours — &lt;strong&gt;data warehouse vs data lake&lt;/strong&gt;, &lt;strong&gt;data warehouse vs data mart&lt;/strong&gt;, &lt;strong&gt;data lakehouse vs data warehouse&lt;/strong&gt; — so you can defend the design choice in a round, not just memorise the diagram.&lt;/p&gt;

&lt;p&gt;If you want &lt;strong&gt;hands-on reps&lt;/strong&gt; after you read, &lt;a href="https://dev.to/explore/practice"&gt;explore practice →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/language/sql"&gt;drill SQL problems →&lt;/a&gt;, browse &lt;a href="https://dev.to/explore/practice/topic/etl"&gt;ETL practice →&lt;/a&gt;, or open &lt;a href="https://dev.to/explore/courses/etl-system-design-for-data-engineering-interviews"&gt;ETL System Design for Data Engineering Interviews →&lt;/a&gt; for a structured path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs3h31alutf0iyzyrsof.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs3h31alutf0iyzyrsof.jpeg" alt="PipeCode blog header for a data warehouse design beginner's guide — bold title 'Data Warehouse Design' with subtitle 'Facts, dimensions, star schema, grain' and a stylized star-schema diagram with a central fact table and four orbiting dimensions in purple, green, and orange on a dark gradient background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why data warehouse design matters&lt;/li&gt;
&lt;li&gt;Fact tables — measurable business events&lt;/li&gt;
&lt;li&gt;Dimension tables — descriptive context&lt;/li&gt;
&lt;li&gt;Star schema vs snowflake schema&lt;/li&gt;
&lt;li&gt;Grain, keys, and surrogate keys&lt;/li&gt;
&lt;li&gt;Slowly Changing Dimensions (SCD)&lt;/li&gt;
&lt;li&gt;Partitioning, ETL/ELT, and the design process&lt;/li&gt;
&lt;li&gt;Choosing a schema (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why data warehouse design matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  OLTP vs OLAP, and why the warehouse needs its own shape
&lt;/h3&gt;

&lt;p&gt;The single most important sentence in &lt;strong&gt;data warehouse design&lt;/strong&gt;: &lt;em&gt;the OLTP database that runs your application is shaped wrong for analytics&lt;/em&gt;. Operational databases (PostgreSQL, MySQL) are normalised, row-stored, and tuned for single-row writes; warehouses (Snowflake, Amazon Redshift, Google BigQuery) are denormalised, columnar, and tuned for full-table scans. A data engineer's first job is recognising which shape a workload needs and building the &lt;strong&gt;data warehouse architecture&lt;/strong&gt; accordingly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; In a system-design round, your first sentence about any analytical request is &lt;em&gt;"this is an OLAP workload, so I'd model it as a fact table at this grain with these dimensions, and run it on a columnar warehouse like Snowflake or BigQuery."&lt;/em&gt; That sentence packs grain, schema choice, and warehouse selection into one beat — interviewers love it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  OLTP design — normalised, transactional, single-row optimised
&lt;/h4&gt;

&lt;p&gt;The OLTP invariant: &lt;strong&gt;operational databases are heavily normalised (3NF) to prevent update anomalies; rows are stored together so single-row reads and writes are fast; the workload is many small transactions per second&lt;/strong&gt;. PostgreSQL and MySQL are the canonical examples. They are the right tool for the &lt;em&gt;write&lt;/em&gt; side of the world — the user clicking "Buy" — and the &lt;em&gt;wrong&lt;/em&gt; tool for the analytical question that follows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalised&lt;/strong&gt; — each fact lives in exactly one place; &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;addresses&lt;/code&gt; are separate tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row-stored&lt;/strong&gt; — fetching one row of 30 columns is one disk seek.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High write throughput&lt;/strong&gt; — millisecond &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexes for point lookups&lt;/strong&gt; — find customer &lt;code&gt;42&lt;/code&gt; in O(log N).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; — money cannot disappear between debit and credit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An OLTP order schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;rows per record&lt;/th&gt;
&lt;th&gt;typical operation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1 per customer&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UPDATE … SET address = …&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1 per order&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT … VALUES (…)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_items&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1 per order line&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT … VALUES (…)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payments&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1 per payment&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UPDATE … SET status = 'paid'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user clicks "Place order"; the app opens a transaction.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INSERT INTO orders&lt;/code&gt; writes the order header; &lt;code&gt;INSERT INTO order_items&lt;/code&gt; writes the line items.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UPDATE inventory SET qty = qty - 1&lt;/code&gt; decrements stock.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;INSERT INTO payments&lt;/code&gt; records the charge attempt.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COMMIT&lt;/code&gt; makes everything visible atomically; the whole transaction takes ~10–30 ms.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; OLTP table for orders (Postgres):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;       &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
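
&lt;p&gt;And the five steps above as one atomic unit; a hedged sketch in which the &lt;code&gt;order_items&lt;/code&gt;, &lt;code&gt;inventory&lt;/code&gt;, and &lt;code&gt;payments&lt;/code&gt; columns are assumed for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

INSERT INTO orders (customer_id, total)
VALUES (42, 59.98);

-- currval() reads the order_id the BIGSERIAL sequence just generated
INSERT INTO order_items (order_id, product_id, qty)
VALUES (currval('orders_order_id_seq'), 7, 1);

UPDATE inventory
SET qty = qty - 1
WHERE product_id = 7;

INSERT INTO payments (order_id, status)
VALUES (currval('orders_order_id_seq'), 'pending');

COMMIT;  -- all four writes become visible atomically
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;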



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the workload is "millisecond writes for a live application," it is OLTP — normalise it, index it, and stop. Analytics goes somewhere else.&lt;/p&gt;

&lt;h4&gt;
  
  
  OLAP design — denormalised, columnar, scan-optimised
&lt;/h4&gt;

&lt;p&gt;The OLAP invariant: &lt;strong&gt;analytical workloads scan many rows and few columns; the right shape is &lt;em&gt;columnar storage&lt;/em&gt; with &lt;em&gt;denormalised&lt;/em&gt; fact tables and pre-joined dimensions, so a single SELECT can answer a business question without locking the OLTP database&lt;/strong&gt;. Snowflake, BigQuery, and Redshift store each column as its own compressed file — a 100 M-row aggregation reads ~5% of the bytes that an OLTP row scan would read.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Denormalised&lt;/strong&gt; — fact tables carry foreign keys to dimensions; dimensions carry pre-joined descriptive context.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columnar storage&lt;/strong&gt; — each column is its own file; analytical scans skip irrelevant columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Few transactions&lt;/strong&gt; — batch ELT loads commit thousands of rows at once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No row-level locks&lt;/strong&gt; — long-running analytical queries don't block writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation-friendly&lt;/strong&gt; — &lt;code&gt;GROUP BY&lt;/code&gt; over millions of rows runs in seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An OLAP fact + dimension schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;grain&lt;/th&gt;
&lt;th&gt;typical query&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SUM(revenue) GROUP BY month&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per customer (history)&lt;/td&gt;
&lt;td&gt;join for &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;segment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per product&lt;/td&gt;
&lt;td&gt;join for &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;brand&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per calendar day&lt;/td&gt;
&lt;td&gt;join for &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;quarter&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The analytical question is "revenue by category by month for the last quarter."&lt;/li&gt;
&lt;li&gt;The query selects &lt;code&gt;category&lt;/code&gt; (from &lt;code&gt;dim_product&lt;/code&gt;), &lt;code&gt;month&lt;/code&gt; (from &lt;code&gt;dim_date&lt;/code&gt;), and &lt;code&gt;SUM(revenue)&lt;/code&gt; (from &lt;code&gt;fact_orders&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The warehouse reads only the three columns it needs; everything else is skipped.&lt;/li&gt;
&lt;li&gt;Partition pruning on &lt;code&gt;date_id&lt;/code&gt; skips ~95% of fact rows.&lt;/li&gt;
&lt;li&gt;The full aggregation returns in 2–5 seconds over a 100 M-row fact.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; OLAP star-shaped query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2026&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the workload is "scan many rows, return an aggregate, run on a schedule for humans to read," it is OLAP — denormalise it, build it as a fact + dim schema, and put it in a columnar warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Where the warehouse fits — vs database, data lake, data mart, lakehouse
&lt;/h4&gt;

&lt;p&gt;The placement invariant: &lt;strong&gt;a database holds the live application state (OLTP); a data warehouse holds modelled analytical history (OLAP, star schemas); a data lake holds raw files (sometimes pre-warehouse); a data mart is a subject-area subset of a warehouse; a data lakehouse merges lake storage with warehouse-style ACID tables on top&lt;/strong&gt;. Picking the right placement is half the design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database (OLTP)&lt;/strong&gt; — Postgres / MySQL; live application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data warehouse (OLAP)&lt;/strong&gt; — Snowflake / Redshift / BigQuery; star/snowflake schemas for analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lake&lt;/strong&gt; — S3 / GCS / ADLS holding raw Parquet / JSON / CSV; cheaper but unstructured.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data mart&lt;/strong&gt; — subject-area subset (e.g., &lt;code&gt;mart_marketing&lt;/code&gt;); business-team-owned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data lakehouse&lt;/strong&gt; — Iceberg / Delta / Hudi on top of object storage; ACID + warehouse semantics on lake files.&lt;/li&gt;
&lt;/ul&gt;
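&lt;p&gt;To make the lake → warehouse hop concrete, here is a minimal load sketch in Snowflake-flavoured SQL; the stage name &lt;code&gt;@lake_stage&lt;/code&gt; and the &lt;code&gt;stage.orders&lt;/code&gt; landing table are illustrative assumptions, not fixed names:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- hypothetical: @lake_stage points at the raw Parquet files in object storage
COPY INTO stage.orders
FROM @lake_stage/orders/
FILE_FORMAT = (TYPE = PARQUET)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map Parquet columns to table columns by name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;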

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A modern company's layered stack:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;tier&lt;/th&gt;
&lt;th&gt;system&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;live orders, users, payments&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lake (raw)&lt;/td&gt;
&lt;td&gt;S3 + Parquet&lt;/td&gt;
&lt;td&gt;event firehose, schema-flexible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warehouse (modelled)&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;star schemas for finance/BI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mart&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;MART_FINANCE&lt;/code&gt; schema in Snowflake&lt;/td&gt;
&lt;td&gt;finance-team-only view&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app writes to Postgres; transactional reads stay there.&lt;/li&gt;
&lt;li&gt;A CDC pipeline streams Postgres changes into the S3 data lake as raw Parquet.&lt;/li&gt;
&lt;li&gt;Daily ELT (dbt or Spark) models the raw lake data into star-shaped fact/dim tables in Snowflake.&lt;/li&gt;
&lt;li&gt;Finance reads from &lt;code&gt;MART_FINANCE&lt;/code&gt; (a curated subset); marketing reads from &lt;code&gt;MART_MARKETING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The warehouse is the &lt;em&gt;modelled&lt;/em&gt; truth; the lake is the &lt;em&gt;raw&lt;/em&gt; archive; the mart is the &lt;em&gt;consumer-facing slice&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A subject-area data mart on top of a warehouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;mart_finance&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mart_finance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_revenue&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when asked "&lt;strong&gt;data warehouse vs data lake vs data mart&lt;/strong&gt;" in an interview, sketch the three boxes in a line — lake (raw) → warehouse (modelled) → mart (consumer slice) — and name a tool for each.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Running analytical queries against the OLTP database — long scans compete with the live application for CPU, I/O, and locks, and the report is still slow.&lt;/li&gt;
&lt;li&gt;Treating the data lake as a warehouse — raw files can be queried but have no grain, schema, or referential integrity until you model them.&lt;/li&gt;
&lt;li&gt;Skipping the dimensional model — putting everything in one wide table (OBT) works until two analysts disagree on &lt;code&gt;customer_segment&lt;/code&gt; because it was hard-coded twice.&lt;/li&gt;
&lt;li&gt;Building a single "warehouse" without subject-area marts — every team has to learn every table.&lt;/li&gt;
&lt;li&gt;Conflating Kimball (bottom-up, star-schema marts) and Inmon (top-down, normalised EDW) — both work; pick one and be consistent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on When to Build a Warehouse vs Query Postgres
&lt;/h3&gt;

&lt;p&gt;A growing startup has 50 M orders in Postgres. The CFO wants a monthly revenue report joining orders, customers, products, and regions. The current report runs on Postgres and takes 4 hours. &lt;strong&gt;Decide whether to (a) optimise Postgres, (b) build a data warehouse, or (c) build a data lake, and defend the choice.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Kimball Star Schema on a Cloud Warehouse with Daily ELT
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Snowflake (or Redshift / BigQuery) — modelled warehouse&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_region&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- daily ELT runs in the warehouse&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TO_CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'YYYYMMDD'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
       &lt;span class="n"&gt;region_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qty&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;load_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
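&lt;p&gt;One production caveat: the plain &lt;code&gt;INSERT&lt;/code&gt; above double-loads rows if the job is re-run. A hedged, Snowflake-flavoured &lt;code&gt;MERGE&lt;/code&gt; sketch over the same assumed staging table makes the daily load idempotent:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- re-runnable daily load: only order lines not already in the fact get inserted
MERGE INTO fact_orders f
USING (
    SELECT order_id, customer_id, product_id,
           TO_NUMBER(TO_CHAR(placed_at, 'YYYYMMDD')) AS date_id,
           region_id, total, qty
    FROM stage.orders
    WHERE load_date = CURRENT_DATE
) s
ON f.order_id = s.order_id AND f.product_id = s.product_id  -- order-line natural key
WHEN NOT MATCHED THEN INSERT
    (order_id, customer_id, product_id, date_id, region_id, revenue, quantity)
    VALUES (s.order_id, s.customer_id, s.product_id, s.date_id, s.region_id, s.total, s.qty);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;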



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;choice&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;option (a) optimise Postgres&lt;/td&gt;
&lt;td&gt;indexes help but the workload conflicts with the live app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;option (c) data lake only&lt;/td&gt;
&lt;td&gt;raw files; no grain; analysts re-implement joins every report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;option (b) build a warehouse + star schema&lt;/td&gt;
&lt;td&gt;one modelled source of truth; sub-second BI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;daily ELT lands new orders&lt;/td&gt;
&lt;td&gt;freshness = T-1 day, which is fine for monthly CFO report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;the 4-hour report becomes a 3-second BI query&lt;/td&gt;
&lt;td&gt;finance happy; OLTP unaffected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the monthly report drops from 4 hours to 3 seconds; the OLTP Postgres is no longer fighting the analyst; the warehouse becomes the source of truth for every downstream BI / ML / finance use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Separation of OLTP and OLAP&lt;/strong&gt; — live app stays fast; analytics moves to a columnar engine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star schema&lt;/strong&gt; — &lt;code&gt;fact_orders&lt;/code&gt; at the centre, &lt;code&gt;dim_customer&lt;/code&gt; / &lt;code&gt;dim_product&lt;/code&gt; / &lt;code&gt;dim_date&lt;/code&gt; / &lt;code&gt;dim_region&lt;/code&gt; around it; queries are simple joins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily ELT&lt;/strong&gt; — extract from Postgres, load to warehouse, transform with SQL inside the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLUSTER BY (date_id)&lt;/code&gt;&lt;/strong&gt; — co-locates partitions by date so monthly filters prune ~95% of the fact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate keys (&lt;code&gt;customer_id&lt;/code&gt; numeric)&lt;/strong&gt; — stable identifiers that survive business-key changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(rows in last month)&lt;/code&gt; on a clustered scan; an OLTP scan would be &lt;code&gt;O(rows in fact_orders)&lt;/code&gt; with row-level locks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;SQL aggregation topic&lt;/a&gt; for grain-correct rollups.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. Fact tables — measurable business events
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The numeric heart of the warehouse — what happened, when, and how much
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;fact table&lt;/strong&gt; stores measurable business events. Every row is an event — an order placed, a click recorded, a payment processed — and every column is either a &lt;em&gt;measure&lt;/em&gt; (numeric quantity: revenue, units, duration) or a &lt;em&gt;foreign key&lt;/em&gt; to a dimension that gives the event business context (which customer, which product, which day). Fact tables are usually the largest tables in a warehouse — millions to billions of rows — and they are the focus of every analytical query.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvyxgjftcyfoe5bqyqho.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvyxgjftcyfoe5bqyqho.jpeg" alt="OLTP vs OLAP comparison diagram showing a normalised transactional database on the left with row-stored tables and short single-row queries, versus an OLAP star-schema warehouse on the right with a central fact table connected to four denormalised dimensions and a large SUM-by-GROUP-BY query — connected by an ELT arrow labelled 'load + model' on a light PipeCode-branded card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you walk through a fact-table design in an interview, say the &lt;em&gt;grain&lt;/em&gt; in the first sentence and name the &lt;em&gt;measures&lt;/em&gt; and &lt;em&gt;foreign keys&lt;/em&gt; in the next two. "One row per order line. Measures: revenue, quantity, discount. FKs: customer, product, date, region." That structure signals you actually know what you're doing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Transaction fact tables — one row per business event
&lt;/h4&gt;

&lt;p&gt;The transaction-fact invariant: &lt;strong&gt;a transaction fact table stores one row per atomic business event at its natural grain; the row records the measures of that event and foreign keys to every dimension that gave it context; this is the most common and most interview-asked fact type&lt;/strong&gt;. Order lines, payments, clicks, ad impressions — all transaction facts.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per event&lt;/strong&gt; — never aggregate; the warehouse can always roll up later, never roll down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numeric measures&lt;/strong&gt; — &lt;code&gt;revenue&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;discount&lt;/code&gt;, &lt;code&gt;tax&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreign keys&lt;/strong&gt; — &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;date_id&lt;/code&gt;, &lt;code&gt;region_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degenerate dimensions&lt;/strong&gt; — operational IDs (&lt;code&gt;order_number&lt;/code&gt;, &lt;code&gt;transaction_id&lt;/code&gt;) stored on the fact row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Append-mostly&lt;/strong&gt; — new events arrive; old events rarely change.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A sales transaction fact with 5 sample rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;revenue&lt;/th&gt;
&lt;th&gt;quantity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;200.00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;350.00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;1003&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;20260511&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;20260511&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each row is one order line; grain is "one row per (order, product line)."&lt;/li&gt;
&lt;li&gt;Measures &lt;code&gt;revenue&lt;/code&gt; and &lt;code&gt;quantity&lt;/code&gt; are numeric, additive, and aggregate cleanly with &lt;code&gt;SUM&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;FKs &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;date_id&lt;/code&gt; link to dimensions that describe &lt;em&gt;who&lt;/em&gt;, &lt;em&gt;what&lt;/em&gt;, &lt;em&gt;when&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;GROUP BY date_id, customer_id&lt;/code&gt; rolls up to per-day-per-customer revenue.&lt;/li&gt;
&lt;li&gt;The same fact answers "revenue by customer," "revenue by product," "revenue by day" — different &lt;code&gt;GROUP BY&lt;/code&gt; clauses.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A transaction fact DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
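&lt;p&gt;The DDL fixes the grain; every rollup is then just a query. A short sketch of two rollups from the step-by-step above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- per-day, per-customer revenue
SELECT date_id, customer_id, SUM(revenue) AS revenue
FROM fact_sales
GROUP BY date_id, customer_id;

-- same fact, different question: units sold by product
SELECT product_id, SUM(quantity) AS units
FROM fact_sales
GROUP BY product_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;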



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you are tempted to write the fact at a coarser grain than the event, think twice — the right grain is almost always the finest one, and any rollup is just a SQL query.&lt;/p&gt;

&lt;h4&gt;
  
  
  Periodic snapshot fact tables — state at fixed intervals
&lt;/h4&gt;

&lt;p&gt;The snapshot-fact invariant: &lt;strong&gt;a periodic snapshot fact stores the state of a process at fixed time intervals (end of day, end of month); each row records the &lt;em&gt;level&lt;/em&gt; of measures (inventory on hand, account balance) at that snapshot moment; useful when the process is continuous and you want a series of point-in-time photos&lt;/strong&gt;. Inventory levels, account balances, headcount.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per (snapshot date, entity)&lt;/strong&gt; — e.g., one row per (day, product) for inventory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-additive measures&lt;/strong&gt; — balances &lt;em&gt;don't&lt;/em&gt; add across time (you can't sum yesterday's + today's inventory to get a meaningful number), but they aggregate across other dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fixed cadence&lt;/strong&gt; — daily, weekly, monthly snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;History as time series&lt;/strong&gt; — easy to query "balance over time."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily inventory snapshot:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;on_hand_units&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260511&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;118&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260511&lt;/td&gt;
&lt;td&gt;51&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260512&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;115&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every night at midnight, an ETL job snapshots the current inventory for every product.&lt;/li&gt;
&lt;li&gt;Each row is one (date, product) combination with the on-hand count at snapshot time.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM&lt;/code&gt; across products is meaningful ("total units across catalogue today"); &lt;code&gt;SUM&lt;/code&gt; across days is not (yesterday's units + today's units is meaningless).&lt;/li&gt;
&lt;li&gt;The fact answers "inventory trend for product 50 over time" via a single-column scan.&lt;/li&gt;
&lt;li&gt;Snapshot growth is bounded — one row per (day, product) — so 5 years of daily snapshots × 10 k products ≈ 18 M rows, manageable.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Inventory snapshot DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_inventory_snapshot&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;        &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;on_hand_units&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
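&lt;p&gt;A small sketch of the semi-additive rule against this table: summing across products within one snapshot day is meaningful; summing the same product across days is not, so trend questions read the series instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- meaningful: total units across the catalogue on one snapshot day
SELECT SUM(on_hand_units) AS total_units
FROM fact_inventory_snapshot
WHERE date_id = 20260511;

-- meaningful: inventory trend for product 50 over time
SELECT date_id, on_hand_units
FROM fact_inventory_snapshot
WHERE product_id = 50
ORDER BY date_id;

-- NOT meaningful: SUM(on_hand_units) across date_id values double-counts state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;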



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; snapshot facts answer "what was the state on day X"; transaction facts answer "what happened on day X." Pick the right shape for the question.&lt;/p&gt;

&lt;h4&gt;
  
  
  Accumulating snapshot fact tables — process lifecycle in one row
&lt;/h4&gt;

&lt;p&gt;The accumulating-snapshot invariant: &lt;strong&gt;an accumulating snapshot fact stores one row per &lt;em&gt;process instance&lt;/em&gt; (one order, one application, one shipment) and updates that row as the process moves through its lifecycle; ideal when the process has a finite, well-defined sequence of milestones&lt;/strong&gt;. Order fulfilment (ordered → packed → shipped → delivered), loan application (submitted → reviewed → approved → funded).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per process instance&lt;/strong&gt; — one order's entire lifecycle in a single row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple date columns&lt;/strong&gt; — &lt;code&gt;ordered_date_id&lt;/code&gt;, &lt;code&gt;packed_date_id&lt;/code&gt;, &lt;code&gt;shipped_date_id&lt;/code&gt;, &lt;code&gt;delivered_date_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple status columns&lt;/strong&gt; — boolean flags for each milestone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row updates over time&lt;/strong&gt; — same row, different fields filled in as the process advances.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trend analysis on durations&lt;/strong&gt; — &lt;code&gt;delivered_date - ordered_date = fulfilment lead time&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An order-lifecycle fact:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;ordered_date&lt;/th&gt;
&lt;th&gt;packed_date&lt;/th&gt;
&lt;th&gt;shipped_date&lt;/th&gt;
&lt;th&gt;delivered_date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1001&lt;/td&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;2026-05-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1002&lt;/td&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1003&lt;/td&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When an order is placed, a new fact row is inserted with &lt;code&gt;ordered_date&lt;/code&gt; set and the rest &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;When the warehouse packs the order, the same row is updated with &lt;code&gt;packed_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;When the courier picks it up, &lt;code&gt;shipped_date&lt;/code&gt; is filled.&lt;/li&gt;
&lt;li&gt;When the customer signs for delivery, &lt;code&gt;delivered_date&lt;/code&gt; is filled.&lt;/li&gt;
&lt;li&gt;Analysts can now ask "average days from order to delivery" with one simple subtraction.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Accumulating snapshot DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_order_fulfilment&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;        &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ordered_date&lt;/span&gt;    &lt;span class="nb"&gt;DATE&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;packed_date&lt;/span&gt;     &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;shipped_date&lt;/span&gt;    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;delivered_date&lt;/span&gt;  &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
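&lt;p&gt;And the payoff query: average fulfilment lead time in one pass. (&lt;code&gt;DATEDIFF&lt;/code&gt; is the Snowflake spelling; in Postgres the two dates can be subtracted directly.)&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- in-flight orders have NULL delivered_date and are excluded by the filter
SELECT AVG(DATEDIFF('day', ordered_date, delivered_date)) AS avg_days_to_deliver
FROM fact_order_fulfilment
WHERE delivered_date IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;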



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; accumulating snapshots fit a &lt;em&gt;finite, well-known&lt;/em&gt; lifecycle; for open-ended workflows (support tickets, leads), prefer transaction facts at the state-change grain.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Mixing grains in one fact table — a row that's sometimes per-order, sometimes per-line, sometimes per-day silently breaks every aggregate.&lt;/li&gt;
&lt;li&gt;Storing aggregated measures and re-aggregating ("sum of average") — answers diverge from the row-level truth.&lt;/li&gt;
&lt;li&gt;Adding &lt;code&gt;customer_name&lt;/code&gt; as a fact column — that belongs in the dimension; if it changes, every fact row drifts.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;date_id&lt;/code&gt; — the most-asked filter in every analytical query.&lt;/li&gt;
&lt;li&gt;Treating snapshot facts as additive — summing balances across time is almost always wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Picking the Right Fact-Table Shape
&lt;/h3&gt;

&lt;p&gt;A team is building a warehouse for an online learning platform. They need to answer (a) "how many lessons were completed per day per course?" (b) "what is the current number of active subscribers per course?" and (c) "what is the average days-to-completion per learner per course?" &lt;strong&gt;Propose three fact tables — one per question — and pick the right type for each.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Transaction + Periodic Snapshot + Accumulating Snapshot
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- (a) transaction fact — one row per lesson completion event&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_lesson_completion&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;completion_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;learner_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;course_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lesson_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;       &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;duration_sec&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- (b) periodic snapshot — one row per (day, course) with active subscribers&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_course_subscribers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;course_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;active_subs&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;course_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- (c) accumulating snapshot — one row per (learner, course) lifecycle&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_course_completion&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;learner_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;course_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;started_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;midway_date&lt;/span&gt;    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;finished_date&lt;/span&gt;  &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;learner_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;course_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;fact type&lt;/th&gt;
&lt;th&gt;why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;(a) lessons completed per day per course&lt;/td&gt;
&lt;td&gt;transaction&lt;/td&gt;
&lt;td&gt;one row per event; &lt;code&gt;GROUP BY date, course&lt;/code&gt; rolls up&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(b) active subscribers per course right now&lt;/td&gt;
&lt;td&gt;periodic snapshot&lt;/td&gt;
&lt;td&gt;one row per (day, course); semi-additive count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(c) average days-to-completion&lt;/td&gt;
&lt;td&gt;accumulating snapshot&lt;/td&gt;
&lt;td&gt;one row per learner-course lifecycle&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;each fact at its natural grain&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;rollups are SQL, never re-modelling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dimensions shared&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim_learner&lt;/code&gt;, &lt;code&gt;dim_course&lt;/code&gt;, &lt;code&gt;dim_date&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;conformed across all three facts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the three analytical questions become three small SQL queries against three correctly shaped facts, each with its own grain. The conformed dimensions mean a join from any fact to any dimension gives the same answer to "what is Course 42?"&lt;/p&gt;
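&lt;p&gt;A sketch of those three queries against the DDL above (&lt;code&gt;DATEDIFF&lt;/code&gt; again assumed Snowflake-style):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- (a) transaction fact: lessons completed per day per course
SELECT date_id, course_id, COUNT(*) AS completions
FROM fact_lesson_completion
GROUP BY date_id, course_id;

-- (b) periodic snapshot: active subscribers per course on the latest snapshot day
SELECT course_id, active_subs
FROM fact_course_subscribers
WHERE date_id = (SELECT MAX(date_id) FROM fact_course_subscribers);

-- (c) accumulating snapshot: average days-to-completion per course
SELECT course_id,
       AVG(DATEDIFF('day', started_date, finished_date)) AS avg_days
FROM fact_course_completion
WHERE finished_date IS NOT NULL
GROUP BY course_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;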

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One fact type per business question&lt;/strong&gt; — picking the wrong shape costs you a re-model; picking the right one costs nothing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transaction fact at the event grain&lt;/strong&gt; — never aggregate at write time; rollups are SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Periodic snapshot for state&lt;/strong&gt; — balance / count / level metrics need a fixed-cadence row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accumulating snapshot for finite lifecycles&lt;/strong&gt; — durations and milestone counts in one row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed dimensions&lt;/strong&gt; — same &lt;code&gt;dim_learner&lt;/code&gt; joins to all three facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(events)&lt;/code&gt; for the transaction fact, &lt;code&gt;O(days × courses)&lt;/code&gt; for the snapshot, &lt;code&gt;O(learner-course pairs)&lt;/code&gt; for the accumulating; all bounded and queryable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; sharpen fact-shape choice on the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation practice topic&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL topic&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Dimension tables — descriptive context
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The "who, what, where, when" that gives facts business meaning
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;dimension table&lt;/strong&gt; stores the descriptive attributes that put facts into business context. If &lt;code&gt;fact_sales&lt;/code&gt; says "100 units of product 50 sold on 2026-05-10," the dimension tables tell you that product 50 is a &lt;code&gt;"Wireless Mouse"&lt;/code&gt; in category &lt;code&gt;"Accessories"&lt;/code&gt;, that the sale was on a &lt;code&gt;Monday in May&lt;/code&gt;, and that the customer is in the &lt;code&gt;"Premium"&lt;/code&gt; segment and based in &lt;code&gt;"Bangalore"&lt;/code&gt;. Dimensions are smaller than facts but heavily joined — every analytical query touches one or more.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Every dimension answers a "by" question — &lt;em&gt;revenue by category&lt;/em&gt;, &lt;em&gt;clicks by region&lt;/em&gt;, &lt;em&gt;sign-ups by referral source&lt;/em&gt;. When you sketch a star schema, label each dimension with the "by" it enables. That single habit catches missing dimensions before you write a line of SQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Conformed dimensions — same definition shared across facts
&lt;/h4&gt;

&lt;p&gt;The conformed-dimension invariant: &lt;strong&gt;a conformed dimension is one dimension table joined to multiple fact tables with identical column definitions; "Customer 42" means the same thing whether queried from &lt;code&gt;fact_orders&lt;/code&gt; or &lt;code&gt;fact_support_tickets&lt;/code&gt;&lt;/strong&gt;. Conformed dimensions are what turn a collection of subject-area marts into an enterprise data warehouse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One dim_customer&lt;/strong&gt; — same &lt;code&gt;customer_id&lt;/code&gt; and same attributes across the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One dim_date&lt;/strong&gt; — every fact joins to it; one source of truth for "month," "quarter," "fiscal year."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-mart consistency&lt;/strong&gt; — finance and marketing see the same customer name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No re-modelling per mart&lt;/strong&gt; — analysts never re-derive "what is Customer 42?".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-fact analytics&lt;/strong&gt; — same customer's orders and tickets can be joined safely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A conformed &lt;code&gt;dim_customer&lt;/code&gt; shared by three facts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;fact&lt;/th&gt;
&lt;th&gt;join key&lt;/th&gt;
&lt;th&gt;what the dim adds&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;name, segment, city&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_support_tickets&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;same name, segment, city&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_app_sessions&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;same name, segment, city&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Marketing wants "revenue by city" and joins &lt;code&gt;fact_orders&lt;/code&gt; to &lt;code&gt;dim_customer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Support wants "ticket count by segment" and joins &lt;code&gt;fact_support_tickets&lt;/code&gt; to &lt;code&gt;dim_customer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Product wants "active sessions by city" and joins &lt;code&gt;fact_app_sessions&lt;/code&gt; to &lt;code&gt;dim_customer&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;All three teams use the &lt;em&gt;same&lt;/em&gt; dimension; the answers about "Customer 42 lives in Bangalore" are identical.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;Customer 42&lt;/code&gt; moves to Hyderabad, one SCD2 update in &lt;code&gt;dim_customer&lt;/code&gt; keeps all three facts honest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Conformed dimension DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;segment&lt;/span&gt;       &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country&lt;/span&gt;       &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sign_up_date&lt;/span&gt;  &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
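
&lt;p&gt;The cross-fact payoff in one query. A minimal sketch, assuming the fact tables above each expose &lt;code&gt;customer_id&lt;/code&gt;: pre-aggregating each fact before joining keeps the two facts from fanning out against each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- orders and tickets per segment, resolved through the one conformed dim;
-- aggregate each fact first so the two facts never multiply each other's rows
WITH orders AS (
    SELECT customer_id, COUNT(*) AS order_count
    FROM fact_orders
    GROUP BY customer_id
), tickets AS (
    SELECT customer_id, COUNT(*) AS ticket_count
    FROM fact_support_tickets
    GROUP BY customer_id
)
SELECT c.segment,
       SUM(COALESCE(o.order_count, 0))  AS orders,
       SUM(COALESCE(t.ticket_count, 0)) AS tickets
FROM dim_customer c
LEFT JOIN orders  o ON o.customer_id = c.customer_id
LEFT JOIN tickets t ON t.customer_id = c.customer_id
GROUP BY c.segment;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;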



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if two analysts give different answers for the same "customer," check that they're joining the same dimension. Conformed dimensions are how you stop that argument.&lt;/p&gt;

&lt;h4&gt;
  
  
  Slowly Changing Dimensions (preview) — handling attribute change
&lt;/h4&gt;

&lt;p&gt;The SCD preview invariant: &lt;strong&gt;dimension attributes change over time (a customer's city, a product's category); SCD types are the canonical patterns for handling that change; SCD2 is the interview favourite&lt;/strong&gt;. Full treatment is in Section 6 — for now, know that dimensions are &lt;em&gt;not&lt;/em&gt; purely static.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 1&lt;/strong&gt; — overwrite; lose history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 2&lt;/strong&gt; — add new row with &lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt;; keep history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 3&lt;/strong&gt; — add a &lt;code&gt;previous_city&lt;/code&gt; column; keep one prior value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Most common in production&lt;/strong&gt; — Type 2 for important attributes, Type 1 for unimportant ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate key&lt;/strong&gt; — required for SCD2 since the business key isn't unique anymore.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Customer 42 moves cities:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_sk&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer 42 originally lives in Hyderabad; one row with &lt;code&gt;is_current = TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;On 2026-03-15, the customer moves; the old row is &lt;em&gt;closed&lt;/em&gt; (&lt;code&gt;valid_to&lt;/code&gt; set, &lt;code&gt;is_current = FALSE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;A new row is inserted for Bangalore with &lt;code&gt;valid_from = 2026-03-15&lt;/code&gt; and &lt;code&gt;is_current = TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Historical fact joins use &lt;code&gt;WHERE sale_date BETWEEN valid_from AND COALESCE(valid_to, '9999-12-31')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Current-state queries use &lt;code&gt;WHERE is_current = TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; SCD2 dimension with surrogate key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;          &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_from&lt;/span&gt;    &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_to&lt;/span&gt;      &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt;    &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
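
&lt;p&gt;A minimal sketch of the close-and-insert move from the trace, plus the point-in-time join from step 4. The surrogate-key value and dates mirror the example table; a &lt;code&gt;fact_sales&lt;/code&gt; carrying a raw &lt;code&gt;sale_date&lt;/code&gt; column is assumed for the join.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1) close the current row (move effective 2026-03-15, as in the table above)
UPDATE dim_customer
   SET valid_to   = DATE '2026-03-14',
       is_current = FALSE
 WHERE customer_id = 42
   AND is_current  = TRUE;

-- 2) insert the new current row under a fresh surrogate key
--    (hard-coded 2 to match the example; use a sequence in production)
INSERT INTO dim_customer (customer_sk, customer_id, customer_name, city,
                          valid_from, valid_to, is_current)
SELECT 2, customer_id, customer_name, 'Bangalore', DATE '2026-03-15', NULL, TRUE
FROM dim_customer
WHERE customer_sk = 1;   -- copy the unchanged attributes from the closed row

-- 3) point-in-time join: each fact row picks the dim row valid on its date
SELECT f.sale_id, c.city
FROM fact_sales f
JOIN dim_customer c
  ON c.customer_id = f.customer_id
 AND f.sale_date BETWEEN c.valid_from AND COALESCE(c.valid_to, DATE '9999-12-31');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;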



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if an attribute is queried historically, SCD2 it; if it's only ever shown as "current," SCD1 is fine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Date dimensions — the most-joined dim in every warehouse
&lt;/h4&gt;

&lt;p&gt;The date-dim invariant: &lt;strong&gt;&lt;code&gt;dim_date&lt;/code&gt; has one row per calendar date with pre-computed columns for day, week, month, quarter, year, fiscal year, is_weekend, is_holiday; every fact has a &lt;code&gt;date_id&lt;/code&gt; FK; analysts never compute date math at query time&lt;/strong&gt;. It is the single most reused dimension in the warehouse.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per calendar day&lt;/strong&gt; — 5 years × 365 = 1,825 rows; trivially small.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-computed columns&lt;/strong&gt; — &lt;code&gt;day_of_week&lt;/code&gt;, &lt;code&gt;week_of_year&lt;/code&gt;, &lt;code&gt;month_name&lt;/code&gt;, &lt;code&gt;quarter&lt;/code&gt;, &lt;code&gt;fiscal_year&lt;/code&gt;, &lt;code&gt;is_weekend&lt;/code&gt;, &lt;code&gt;is_business_day&lt;/code&gt;, &lt;code&gt;is_holiday&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;date_id&lt;/code&gt; as integer YYYYMMDD&lt;/strong&gt; — sortable, partition-friendly, indexable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reusable across every fact&lt;/strong&gt; — orders, clicks, payments, sessions all join here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Always populate the full range upfront&lt;/strong&gt; — no gaps in the calendar.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A small slice of &lt;code&gt;dim_date&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;day_name&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;quarter&lt;/th&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;th&gt;is_weekend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;20260510&lt;/td&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;Sunday&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260511&lt;/td&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;Monday&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20260512&lt;/td&gt;
&lt;td&gt;2026-05-12&lt;/td&gt;
&lt;td&gt;Tuesday&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A monthly revenue report joins &lt;code&gt;fact_orders&lt;/code&gt; to &lt;code&gt;dim_date&lt;/code&gt; on &lt;code&gt;date_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY dim_date.month, dim_date.year&lt;/code&gt; returns one row per (year, month).&lt;/li&gt;
&lt;li&gt;A "weekend-only" filter is &lt;code&gt;WHERE dim_date.is_weekend = TRUE&lt;/code&gt; — no &lt;code&gt;EXTRACT(DOW …)&lt;/code&gt; needed.&lt;/li&gt;
&lt;li&gt;A fiscal-year report uses &lt;code&gt;GROUP BY dim_date.fiscal_year&lt;/code&gt; — analysts never have to remember fiscal-month logic.&lt;/li&gt;
&lt;li&gt;The whole dim is small enough to broadcast — every join is essentially free.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Date-dimension generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- generate 5 years of dates (Snowflake / BigQuery / Postgres variants exist)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;day_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TO_CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'YYYYMMDD'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;                                         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TO_CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Day'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CEIL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;MONTH&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                      &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DOW&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;             &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_weekend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;generate_series&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2030-12-31'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'1 day'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;g&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
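
&lt;p&gt;What the pre-computed columns buy at query time. A sketch assuming a &lt;code&gt;fact_sales&lt;/code&gt; with &lt;code&gt;date_id&lt;/code&gt; and &lt;code&gt;revenue&lt;/code&gt; (the star-schema shape used later in this post): the weekend filter is a flag lookup, not date math.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- weekend revenue by month, straight off the pre-computed flags
SELECT d.year, d.month, SUM(f.revenue) AS weekend_revenue
FROM fact_sales f
JOIN dim_date d ON d.date_id = f.date_id
WHERE d.is_weekend = TRUE
GROUP BY d.year, d.month
ORDER BY d.year, d.month;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;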



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every warehouse you build should have a &lt;code&gt;dim_date&lt;/code&gt; on day one — even before the first fact table. Generating it later is busywork.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing descriptive columns directly on the fact table — &lt;code&gt;fact_orders.customer_name&lt;/code&gt; works until the name changes and yesterday's revenue drifts.&lt;/li&gt;
&lt;li&gt;Skipping conformed dimensions — every team builds their own &lt;code&gt;customer&lt;/code&gt; table; analyst answers diverge.&lt;/li&gt;
&lt;li&gt;Building one giant "junk" dimension — combining unrelated flags into one row instead of two clear dimensions.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;dim_date&lt;/code&gt; — analysts write &lt;code&gt;EXTRACT(MONTH FROM date_col)&lt;/code&gt; everywhere; partition pruning suffers.&lt;/li&gt;
&lt;li&gt;Treating dimensions as immutable — they change; pick an SCD type before the first row lands.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Conformed Dimensions Across Two Marts
&lt;/h3&gt;

&lt;p&gt;The marketing mart and the finance mart each have their own &lt;code&gt;customer&lt;/code&gt; table. Marketing's &lt;code&gt;customer.segment&lt;/code&gt; says &lt;code&gt;"Premium"&lt;/code&gt; for customer 42; finance's says &lt;code&gt;"Tier 1"&lt;/code&gt;. The CEO asks "how many premium customers paid in April?" and gets two different answers. &lt;strong&gt;Propose a fix.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Single Conformed &lt;code&gt;dim_customer&lt;/code&gt; with Both Attributes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- One enterprise-wide dim, joined by both marts&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;marketing_seg&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- "Premium" / "Standard"&lt;/span&gt;
    &lt;span class="n"&gt;finance_tier&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- "Tier 1" / "Tier 2"&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;           &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sign_up_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Marketing mart joins for segment&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;marketing_seg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Finance mart joins for tier on the same dim&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;finance_tier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_payments&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;mkt and fin each have their own &lt;code&gt;customer&lt;/code&gt; table&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;mkt.customer.segment ≠ fin.customer.tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;CEO asks one question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;conform to one &lt;code&gt;dim_customer&lt;/code&gt; with both columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;both marts join the same dim; labels match across the board&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the CEO's question returns one answer regardless of which mart the analyst queries. Future cross-mart questions ("are our Tier-1 finance customers also Premium in marketing?") collapse to a single query against one dimension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One conformed dim&lt;/strong&gt; — every team joins the same &lt;code&gt;dim_customer&lt;/code&gt;; no parallel truths.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both attributes side-by-side&lt;/strong&gt; — marketing keeps its segment, finance keeps its tier, both visible on the same row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-mart analytics&lt;/strong&gt; — "Tier 1 + Premium" customers are now one &lt;code&gt;WHERE&lt;/code&gt; clause away (sketched below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single update path&lt;/strong&gt; — when customer 42's segment changes, you update one place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Faster reviews&lt;/strong&gt; — the CEO never sees diverging numbers for the "same" filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Cost&lt;/code&gt;&lt;/strong&gt; — one dim, one join per query; the duplicated table cost disappears.&lt;/li&gt;
&lt;/ul&gt;
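
&lt;p&gt;The sketch promised above: with both attributes on one row of the conformed dim, the cross-mart filter needs no join at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Premium (marketing) AND Tier 1 (finance), answered from one dimension
SELECT COUNT(*) AS premium_tier1_customers
FROM dim_customer
WHERE marketing_seg = 'Premium'
  AND finance_tier  = 'Tier 1';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;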

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill cross-table modelling on the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation topic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL join problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregations&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — ETL System Design&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL System Design for DE Interviews&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Star schema vs snowflake schema
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The canonical model choice — flat dimensions or normalised hierarchy
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;star schema vs snowflake schema&lt;/strong&gt; decision is the single most-tested data-modelling question in interviews. A &lt;strong&gt;star schema&lt;/strong&gt; keeps every dimension flat — one table per business entity, with all hierarchical attributes denormalised onto the row. A &lt;strong&gt;snowflake schema&lt;/strong&gt; (the modelling pattern, &lt;em&gt;not&lt;/em&gt; the cloud warehouse) normalises dimensions into sub-dimensions, saving space at the cost of more joins. Most modern warehouses prefer star — the query simplicity and performance almost always outweigh the storage savings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fayb55qh0b4qdcmpqfz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0fayb55qh0b4qdcmpqfz.jpeg" alt="Star-schema diagram showing a central rounded fact table 'fact_sales' with measures revenue and quantity, connected by purple lines to four dimension tables 'dim_customer', 'dim_product', 'dim_date', and 'dim_store' arranged like points of a star, on a light PipeCode-branded card with bold navy labels and green / orange accent dots." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "&lt;strong&gt;star schema vs snowflake schema&lt;/strong&gt;," answer in one sentence: &lt;em&gt;"Star for query speed and simplicity, snowflake for storage savings on huge dimensions — and 90% of the time, star wins."&lt;/em&gt; Then offer a one-clause justification per side and stop talking.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Star schema — flat dimensions, simple joins, fast queries
&lt;/h4&gt;

&lt;p&gt;The star invariant: &lt;strong&gt;a star schema has one fact table at the centre joined to N denormalised dimension tables; each dimension carries every attribute it needs as a column on a single row; queries are one-hop joins from fact to dim; the shape looks like a star with the fact at the centre and dimensions as the points&lt;/strong&gt;. It is the default Kimball recommendation and the default modern-warehouse shape.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One fact, N dimensions&lt;/strong&gt; — typical warehouse has 1 fact and 4–10 dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat dimensions&lt;/strong&gt; — &lt;code&gt;dim_product&lt;/code&gt; carries &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;subcategory&lt;/code&gt;, &lt;code&gt;brand&lt;/code&gt;, &lt;code&gt;supplier&lt;/code&gt; all on one row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-hop joins&lt;/strong&gt; — fact → dim, never dim → sub-dim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query simplicity&lt;/strong&gt; — joins are obvious; analysts write SQL without help.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — columnar warehouses optimise star joins natively.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A retail star schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              dim_customer
                    |
dim_product — fact_sales — dim_date
                    |
                dim_store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;columns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_sales&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;sale_id&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;date_id&lt;/code&gt;, &lt;code&gt;store_id&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;segment&lt;/code&gt;, &lt;code&gt;country&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;, &lt;code&gt;subcategory&lt;/code&gt;, &lt;code&gt;brand&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;date_id&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;quarter&lt;/code&gt;, &lt;code&gt;year&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_store&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;store_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;, &lt;code&gt;format&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The central &lt;code&gt;fact_sales&lt;/code&gt; carries the four FKs and two measures.&lt;/li&gt;
&lt;li&gt;Each dimension is &lt;em&gt;flat&lt;/em&gt; — &lt;code&gt;dim_product&lt;/code&gt; has &lt;code&gt;category&lt;/code&gt; and &lt;code&gt;brand&lt;/code&gt; directly on the row, not in a separate &lt;code&gt;dim_category&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;"Revenue by category by year" is one SELECT with two joins.&lt;/li&gt;
&lt;li&gt;The shape is symmetric — every dimension is reachable in one join from the fact.&lt;/li&gt;
&lt;li&gt;Columnar engines see one fact + N dim joins and execute them in parallel.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A canonical star-schema query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you can't justify why a particular dimension &lt;em&gt;must&lt;/em&gt; be normalised, leave it flat. Star is the default for a reason.&lt;/p&gt;

&lt;h4&gt;
  
  
  Snowflake schema — normalised dimensions, more joins, more storage discipline
&lt;/h4&gt;

&lt;p&gt;The snowflake invariant: &lt;strong&gt;a snowflake schema (modelling pattern) normalises dimensions into sub-dimensions; &lt;code&gt;dim_product.category_id&lt;/code&gt; references &lt;code&gt;dim_category&lt;/code&gt;; queries need one more join per normalised level; useful when a hierarchical attribute has very high cardinality and changes independently&lt;/strong&gt;. Reserve it for the rare cases when storage or update frequency genuinely matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Normalised dimensions&lt;/strong&gt; — &lt;code&gt;dim_product&lt;/code&gt; references &lt;code&gt;dim_category&lt;/code&gt; which references &lt;code&gt;dim_department&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More joins&lt;/strong&gt; — &lt;code&gt;fact_sales&lt;/code&gt; → &lt;code&gt;dim_product&lt;/code&gt; → &lt;code&gt;dim_category&lt;/code&gt; → &lt;code&gt;dim_department&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less redundancy&lt;/strong&gt; — a category change updates one row in &lt;code&gt;dim_category&lt;/code&gt;, not every row in &lt;code&gt;dim_product&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More complex SQL&lt;/strong&gt; — analysts have to remember the join path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slower queries&lt;/strong&gt; — extra joins compound at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same retail, snowflaked dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  dim_customer
                       |
dim_brand → dim_product — fact_sales — dim_date
                       |                    |
                  dim_category         dim_quarter → dim_year
                       |
                dim_department
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;star joins&lt;/th&gt;
&lt;th&gt;snowflake joins&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;revenue by category by year&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;revenue by department by quarter&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;top brands by city&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The same &lt;code&gt;fact_sales&lt;/code&gt; is now wrapped by &lt;em&gt;normalised&lt;/em&gt; dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_product&lt;/code&gt; has &lt;code&gt;category_id&lt;/code&gt;, not &lt;code&gt;category&lt;/code&gt; — to get the category name you join &lt;code&gt;dim_category&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;"Revenue by category by year" becomes a four-table join instead of three.&lt;/li&gt;
&lt;li&gt;The schema saves space — each category name is stored once in &lt;code&gt;dim_category&lt;/code&gt; instead of being repeated on every product row.&lt;/li&gt;
&lt;li&gt;For most warehouses the storage savings are negligible and the join cost is real.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Snowflaked dim DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_department&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_category&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_department&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_category&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
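
&lt;p&gt;The same "revenue by category by year" question against the snowflaked product hierarchy, for contrast with the star version above. A sketch that assumes &lt;code&gt;year&lt;/code&gt; stayed on &lt;code&gt;dim_date&lt;/code&gt;; snowflaking the date hierarchy too would add further hops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- four tables instead of three: the category name now lives one hop away
SELECT c.name AS category,
       d.year,
       SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_product  p ON p.product_id  = f.product_id
JOIN dim_category c ON c.category_id = p.category_id   -- the extra hop
JOIN dim_date     d ON d.date_id     = f.date_id
GROUP BY c.name, d.year;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;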



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; normalise a dimension only when the hierarchical attribute (a) is gigantic, (b) changes independently of the parent, or (c) is shared across multiple dimensions.&lt;/p&gt;

&lt;h4&gt;
  
  
  When to pick which — a one-line decision per dimension
&lt;/h4&gt;

&lt;p&gt;The decision invariant: &lt;strong&gt;for each dimension, ask "does this attribute change independently and at significant volume?" — if yes, snowflake it; if no, star it&lt;/strong&gt;. Most attributes fail that test; most dimensions stay flat.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star, default&lt;/strong&gt; — flat, denormalised, fast queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake, exception&lt;/strong&gt; — only when storage or independent update wins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed (galaxy) schemas&lt;/strong&gt; — multiple facts sharing conformed dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-big-table (OBT)&lt;/strong&gt; — extreme denormalisation, one row per event with every attribute inline; used by some Looker / Power BI shops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid&lt;/strong&gt; — star for most dimensions, snowflake one or two large hierarchical ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Per-dimension choice for a retail warehouse:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;choice&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;star (flat)&lt;/td&gt;
&lt;td&gt;denormalised attributes change together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;brand / category small, change with product&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;static, small, joined heavily&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_geography&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;snowflake&lt;/td&gt;
&lt;td&gt;city → state → country shared, very large, infrequent change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_employee&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;hierarchy small, joined infrequently&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Walk each dimension and ask the question.&lt;/li&gt;
&lt;li&gt;For most retail dimensions, the answer is "keep it flat."&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_geography&lt;/code&gt; is the exception — country/state hierarchies repeat across millions of customer / store rows; normalising saves real space.&lt;/li&gt;
&lt;li&gt;Pick consistently and document the choice.&lt;/li&gt;
&lt;li&gt;The resulting schema is mostly star with one normalised dimension — a hybrid that maximises performance with controlled redundancy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Hybrid schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- star dimensions (flat)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;geography_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- snowflaked geography (the one exception)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_country&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_state&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_country&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_geography&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;geography_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
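
&lt;p&gt;And the one query shape that pays the snowflake toll: walking the geography hierarchy. A sketch assuming the &lt;code&gt;fact_sales&lt;/code&gt; shape from the star section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- revenue by country: three hops through the normalised geography
SELECT co.name AS country, SUM(f.revenue) AS revenue
FROM fact_sales f
JOIN dim_customer  cu ON cu.customer_id = f.customer_id
JOIN dim_geography g  ON g.geography_id = cu.geography_id
JOIN dim_state     s  ON s.state_id     = g.state_id
JOIN dim_country   co ON co.country_id  = s.country_id
GROUP BY co.name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;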



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you can't articulate the win from normalising, the default is star.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting to snowflake "for normalisation" — modern warehouses don't reward it.&lt;/li&gt;
&lt;li&gt;Normalising &lt;code&gt;dim_date&lt;/code&gt; — one of the cheapest, smallest, most-joined dimensions; flat is always right.&lt;/li&gt;
&lt;li&gt;Mixing schema styles within one warehouse without documentation — analysts lose track of the join path.&lt;/li&gt;
&lt;li&gt;Treating snowflake schema (the model) and Snowflake (the cloud warehouse) as the same thing — they are unrelated; the schema pattern pre-dates the company by decades.&lt;/li&gt;
&lt;li&gt;Picking OBT (one-big-table) for a warehouse with many subject areas — works for narrow dashboards, kills cross-team analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Star vs Snowflake for a Retail Warehouse
&lt;/h3&gt;

&lt;p&gt;A retailer has 50 million &lt;code&gt;fact_sales&lt;/code&gt; rows, 10 dimensions ranging from &lt;code&gt;dim_customer&lt;/code&gt; (5 M rows, mostly flat) to &lt;code&gt;dim_geography&lt;/code&gt; (50 k rows, country/state/city hierarchy shared across customers and stores). &lt;strong&gt;Pick the schema shape per dimension and defend the overall choice.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Hybrid — Star for Most, Snowflake for Geography Only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 9 flat star dimensions + 1 snowflaked dim_geography (city → state → country)&lt;/span&gt;

&lt;span class="c1"&gt;-- star, flat&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sign_up_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;geography_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subcategory&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- the one snowflaked dim — saves space because the hierarchy is shared&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_country&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_state&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_geography&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;geography_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;state_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- fact joins normally&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;choice&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;dim_customer&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;flat; attributes change together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;dim_product&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;flat; category cheap to denormalise&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;dim_date&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;static, tiny, joined everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;dim_geography&lt;/td&gt;
&lt;td&gt;snowflake&lt;/td&gt;
&lt;td&gt;hierarchy shared, large, independent change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;dim_store, dim_promo, dim_payment&lt;/td&gt;
&lt;td&gt;star&lt;/td&gt;
&lt;td&gt;flat, small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;overall shape&lt;/td&gt;
&lt;td&gt;hybrid (mostly star + one snowflake)&lt;/td&gt;
&lt;td&gt;balances perf and storage&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the warehouse runs star-schema-fast for 95% of queries; the one snowflaked dimension saves disk on city/state/country redundancy without hurting most lookups; the schema documentation reads "star except for &lt;code&gt;dim_geography&lt;/code&gt;."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star for most dimensions&lt;/strong&gt; — query simplicity and parallel join performance win.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake &lt;code&gt;dim_geography&lt;/code&gt; only&lt;/strong&gt; — hierarchical, shared, large; normalisation pays off here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed dimensions across the warehouse&lt;/strong&gt; — &lt;code&gt;dim_customer&lt;/code&gt; joins to every fact identically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;fact_sales&lt;/code&gt; clustered by &lt;code&gt;date_id&lt;/code&gt;&lt;/strong&gt; — every monthly / quarterly query prunes hard (one statement, sketched below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate keys on every dim&lt;/strong&gt; — stable identifiers; SCD2-friendly going forward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Cost&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;O(N)&lt;/code&gt; to scan the central fact plus aggregation; an extra &lt;code&gt;O(K)&lt;/code&gt; join hop only for geography queries.&lt;/li&gt;
&lt;/ul&gt;
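
&lt;p&gt;The clustering claim above is one statement in Snowflake's dialect (an assumption here, since the solution DDL doesn't declare it; Postgres or BigQuery would partition instead).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Snowflake-flavoured: cluster the fact so date-range predicates prune hard
ALTER TABLE fact_sales CLUSTER BY (date_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;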

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill star-schema joins on the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation topic&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  5. Grain, keys, and surrogate keys
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The three foundations every fact and dimension stands on
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Grain&lt;/strong&gt; is "what does one row mean?", &lt;strong&gt;keys&lt;/strong&gt; are how rows are uniquely identified, and &lt;strong&gt;surrogate keys&lt;/strong&gt; are stable, system-generated identifiers that survive business-key changes. These three concepts are the foundation of every well-designed warehouse — and three of the most-asked topics in data-engineering interview loops. Get them right and every downstream choice falls out; get them wrong and the schema is unfixable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; In every system-design round, the first sentence of your fact-table answer is &lt;em&gt;"the grain is one row per X."&lt;/em&gt; The second sentence names the FK columns. The third names the measures. If you can't say grain in one phrase, the design isn't ready.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Grain — what one row represents
&lt;/h4&gt;

&lt;p&gt;The grain invariant: &lt;strong&gt;the grain of a fact table is the answer to "what is the meaning of one row?" — it must be stated explicitly, in one phrase, before any column is chosen; mixing grains in one table is the most common modelling mistake and the source of every double-counting bug&lt;/strong&gt;. Pick the &lt;em&gt;finest&lt;/em&gt; grain that the source data supports — rollups are SQL, but you can never re-derive detail from a summary.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;State it in one phrase&lt;/strong&gt; — "one row per (order, product line)."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick the finest grain available&lt;/strong&gt; — coarser views are aggregates; coarser data is irreversible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document the grain inline&lt;/strong&gt; — table comment, dbt YAML, or schema notebook (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Never mix grains&lt;/strong&gt; — a table with sometimes-order, sometimes-line rows is broken.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grain drives partition key&lt;/strong&gt; — usually the date column at the row's natural grain.&lt;/li&gt;
&lt;/ul&gt;
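&lt;p&gt;A minimal sketch of the "document it inline" habit — assuming a PostgreSQL-style warehouse with &lt;code&gt;COMMENT ON TABLE&lt;/code&gt; (Snowflake spells the same thing as a &lt;code&gt;COMMENT&lt;/code&gt; clause on the DDL; dbt users put the sentence in the model's YAML description):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Pin the grain to the table itself, so no one has to guess it later.
COMMENT ON TABLE fact_sales IS
  'Grain: one row per (order, product line). Measures: quantity, unit_price, revenue.';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;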

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three grain choices for a sales fact:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;grain&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;what each row means&lt;/th&gt;
&lt;th&gt;rollups possible&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;one row per item sold&lt;/td&gt;
&lt;td&gt;50 M / month&lt;/td&gt;
&lt;td&gt;finest; one product unit per row&lt;/td&gt;
&lt;td&gt;per order, per day, per category&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;td&gt;10 M / month&lt;/td&gt;
&lt;td&gt;aggregated to (order, product)&lt;/td&gt;
&lt;td&gt;per order, per day, per category&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;one row per order&lt;/td&gt;
&lt;td&gt;2 M / month&lt;/td&gt;
&lt;td&gt;aggregated by order&lt;/td&gt;
&lt;td&gt;per day, per customer; &lt;strong&gt;not&lt;/strong&gt; by product line&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The source data has 50 M individual item-sale events per month.&lt;/li&gt;
&lt;li&gt;Option 1 (one row per item) preserves every detail; analysts can roll up however they want.&lt;/li&gt;
&lt;li&gt;Option 2 (one row per order line) groups items by (order, product) — slightly smaller, but you lose per-unit detail.&lt;/li&gt;
&lt;li&gt;Option 3 (one row per order) is too coarse — you cannot reconstruct "revenue by product" from it.&lt;/li&gt;
&lt;li&gt;Pick the finest grain (option 1 or 2) and write rollups as SQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Stating grain explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- grain: one row per order line (one product per row, multiple rows per order)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- degenerate dimension&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;      &lt;span class="c1"&gt;-- quantity * unit_price&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
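&lt;p&gt;Because the table is kept at the finest grain, coarser views are a single &lt;code&gt;GROUP BY&lt;/code&gt; away — a sketch of an order-grain rollup over the table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Order-grain rollup derived on demand; the line-grain detail stays intact.
SELECT
    order_id,
    SUM(quantity) AS total_units,
    SUM(revenue)  AS order_revenue
FROM fact_sales
GROUP BY order_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;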



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if two analysts disagree on a number, check that they're aggregating to the same grain. Half the time the bug is exactly that.&lt;/p&gt;
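&lt;p&gt;The check itself is one query — a sketch, assuming the stated grain of one row per &lt;code&gt;(order_id, product_id)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- If the grain really is one row per order line, this returns zero rows.
SELECT order_id, product_id, COUNT(*) AS duplicate_rows
FROM fact_sales
GROUP BY order_id, product_id
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;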

&lt;h4&gt;
  
  
  Primary, foreign, and natural keys — the basics
&lt;/h4&gt;

&lt;p&gt;The key-basics invariant: &lt;strong&gt;a &lt;em&gt;primary key&lt;/em&gt; uniquely identifies a row, a &lt;em&gt;foreign key&lt;/em&gt; links to a primary key in another table, a &lt;em&gt;natural key&lt;/em&gt; is the business identifier (&lt;code&gt;customer_email&lt;/code&gt;, &lt;code&gt;order_number&lt;/code&gt;), and a &lt;em&gt;surrogate key&lt;/em&gt; is a system-generated stable identifier&lt;/strong&gt;. Warehouses use surrogate keys for stability; OLTP systems often use natural keys directly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary key (PK)&lt;/strong&gt; — one row, one identifier; uniqueness enforced.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreign key (FK)&lt;/strong&gt; — references another table's PK; integrity check.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural key (NK)&lt;/strong&gt; — business identifier (&lt;code&gt;customer_email&lt;/code&gt;); can change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Composite key&lt;/strong&gt; — PK of multiple columns (e.g., &lt;code&gt;(date_id, store_id)&lt;/code&gt; for daily-store snapshot).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degenerate dimension&lt;/strong&gt; — operational ID stored on the fact (&lt;code&gt;order_number&lt;/code&gt;); no dim table needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A retail warehouse's key structure:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;PK&lt;/th&gt;
&lt;th&gt;FK to&lt;/th&gt;
&lt;th&gt;natural key&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_id&lt;/code&gt; (surrogate)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customer_email&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_id&lt;/code&gt; (surrogate)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sku&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_sales&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sale_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;date_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_number&lt;/code&gt; (degenerate)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;dim_customer&lt;/code&gt; has a surrogate &lt;code&gt;customer_id&lt;/code&gt; as PK and a natural &lt;code&gt;customer_email&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The customer's email might change ("&lt;a href="mailto:alice@old.com"&gt;alice@old.com&lt;/a&gt;" → "&lt;a href="mailto:alice@new.com"&gt;alice@new.com&lt;/a&gt;"); the surrogate ID doesn't.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fact_sales&lt;/code&gt; joins to &lt;code&gt;dim_customer&lt;/code&gt; on the surrogate, so historical sales remain attached to the same person.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_product.sku&lt;/code&gt; is the natural key; &lt;code&gt;product_id&lt;/code&gt; is the surrogate; same logic.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;fact_sales.order_number&lt;/code&gt; is a degenerate dimension — preserved on the fact for traceability but with no dim table because there are no useful attributes about an order beyond its line items.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Key declarations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- surrogate&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                   &lt;span class="c1"&gt;-- natural key&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_number&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                 &lt;span class="c1"&gt;-- degenerate dimension&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a column is used to join and &lt;em&gt;changes over time&lt;/em&gt;, you want a surrogate key. If it changes only theoretically, the natural key may be fine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Surrogate keys — stable, system-generated, SCD-ready
&lt;/h4&gt;

&lt;p&gt;The surrogate-key invariant: &lt;strong&gt;a surrogate key is a system-generated, stable identifier (typically a &lt;code&gt;BIGINT&lt;/code&gt; sequence) attached to every dimension row; it is what fact tables join to; it survives business-key changes and is the only practical way to implement SCD2 without breaking referential integrity&lt;/strong&gt;. &lt;strong&gt;Surrogate key in SQL&lt;/strong&gt; is one of the most reliably asked data-warehouse interview questions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;System-generated&lt;/strong&gt; — &lt;code&gt;GENERATED ALWAYS AS IDENTITY&lt;/code&gt; or &lt;code&gt;BIGSERIAL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable&lt;/strong&gt; — never changes for the life of the row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fact join target&lt;/strong&gt; — &lt;code&gt;fact.customer_id&lt;/code&gt; references &lt;code&gt;dim_customer.customer_id&lt;/code&gt; (the surrogate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD2 enabler&lt;/strong&gt; — multiple rows for the same person, each with a different surrogate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — small fixed-width integer; B-tree-friendly joins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; SCD2 dimension with surrogate keys:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_sk&lt;/th&gt;
&lt;th&gt;customer_id (natural)&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer 42 (natural key) has &lt;em&gt;two&lt;/em&gt; surrogate keys: &lt;code&gt;1&lt;/code&gt; for the Hyderabad period, &lt;code&gt;2&lt;/code&gt; for the Bangalore period.&lt;/li&gt;
&lt;li&gt;Historical sales reference &lt;code&gt;customer_sk = 1&lt;/code&gt;; new sales reference &lt;code&gt;customer_sk = 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;"Revenue by city last quarter" joins on &lt;code&gt;customer_sk&lt;/code&gt; and naturally splits the customer's revenue between the two cities by date.&lt;/li&gt;
&lt;li&gt;The natural key &lt;code&gt;customer_id = 42&lt;/code&gt; is preserved on the dim row for traceability.&lt;/li&gt;
&lt;li&gt;Without the surrogate, you'd be stuck either overwriting history (Type 1) or breaking the FK.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Surrogate-key SCD2 dim:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- surrogate&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                  &lt;span class="c1"&gt;-- natural / business key&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_from&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_to&lt;/span&gt;     &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;-- joins to surrogate&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every dimension gets a surrogate. The business may give you a natural key; the warehouse always generates its own.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stating grain after picking columns — the grain &lt;em&gt;drives&lt;/em&gt; the columns, not vice versa.&lt;/li&gt;
&lt;li&gt;Using a natural key (email, SKU) as a join key in a fact — when the natural key changes, the fact silently drifts (a hypothetical drift check follows this list).&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;customer_sk&lt;/code&gt; as the same thing — they are not; one is business-stable, the other is warehouse-stable.&lt;/li&gt;
&lt;li&gt;Forgetting the degenerate dimension on the fact — operational IDs (&lt;code&gt;order_number&lt;/code&gt;) get lost without it.&lt;/li&gt;
&lt;li&gt;Building a composite key where a surrogate would do — joins get harder, indexes get bigger.&lt;/li&gt;
&lt;/ul&gt;
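&lt;p&gt;To make the natural-key drift concrete, here is a hypothetical check against a mis-designed fact that joins on &lt;code&gt;customer_email&lt;/code&gt; instead of a surrogate (the email column is illustrative, not from the schema above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fact rows whose natural-key join no longer resolves after an email change:
-- each one is revenue that silently fell out of every customer report.
SELECT f.sale_id, f.customer_email
FROM fact_sales f
LEFT JOIN dim_customer c ON c.email = f.customer_email
WHERE c.email IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;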

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Grain and Keys for an E-Commerce Order Fact
&lt;/h3&gt;

&lt;p&gt;The team is modelling an e-commerce orders fact. Source data has 200 orders/day with an average of 3 items per order; product prices change daily, and customer addresses change occasionally. &lt;strong&gt;Pick the grain, name the keys (PK, FKs, natural, surrogate, degenerate), and defend each choice.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using One Row per Order Line + Surrogate Keys + a Degenerate &lt;code&gt;order_number&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Grain: one row per (order, product line)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_order_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;line_sk&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;-- surrogate PK&lt;/span&gt;
    &lt;span class="n"&gt;order_number&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                   &lt;span class="c1"&gt;-- degenerate dim&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;-- SCD2-aware FK&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;unit_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;fact_order_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design decision&lt;/th&gt;
&lt;th&gt;choice&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;grain&lt;/td&gt;
&lt;td&gt;one row per (order, product line)&lt;/td&gt;
&lt;td&gt;finest available; rollups are SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PK&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;line_sk&lt;/code&gt; (surrogate)&lt;/td&gt;
&lt;td&gt;stable, integer, indexable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;customer FK&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_sk&lt;/code&gt; (surrogate to SCD2 dim)&lt;/td&gt;
&lt;td&gt;customer city changes; surrogate captures history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;product FK&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_sk&lt;/code&gt; (surrogate)&lt;/td&gt;
&lt;td&gt;price changes; surrogate keeps history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;date FK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;date_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;conformed across every fact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;degenerate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;order_number&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;preserves operational ID without a dim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;measure&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;revenue&lt;/code&gt; generated from &lt;code&gt;quantity * unit_price&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;one source of truth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the fact answers "revenue by product by day," "revenue by customer city by month" (using SCD2), and "average order size" — all from one well-shaped table. Historical accuracy is preserved because customer and product attributes are SCD2-tracked via the surrogate dimensions.&lt;/p&gt;
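&lt;p&gt;For instance, "revenue by product by day" is a direct aggregate over the fact — a sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One grain, many rollups: daily revenue per product from the line-grain fact.
SELECT date_id, product_sk, SUM(revenue) AS revenue
FROM fact_order_lines
GROUP BY date_id, product_sk;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;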

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Grain stated explicitly&lt;/strong&gt; — "one row per order line"; never violated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate PK &lt;code&gt;line_sk&lt;/code&gt;&lt;/strong&gt; — small integer, stable across every join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD2-aware FKs&lt;/strong&gt; — historical city / price are attached to the correct dimension row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Degenerate &lt;code&gt;order_number&lt;/code&gt;&lt;/strong&gt; — operational lookups still work without a &lt;code&gt;dim_order&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated &lt;code&gt;revenue&lt;/code&gt;&lt;/strong&gt; — eliminates the "ETL computed &lt;code&gt;qty * price&lt;/code&gt; but application computed something different" class of bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(rows)&lt;/code&gt; for the central fact; surrogate joins are &lt;code&gt;O(log N)&lt;/code&gt; per dim with B-tree indexes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for end-to-end fact-and-dim design, see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  6. Slowly Changing Dimensions (SCD)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Types 1, 2, and 3 — how dimensions handle attribute change
&lt;/h3&gt;

&lt;p&gt;Dimensions change. A customer moves cities, a product gets re-categorised, an employee changes departments. &lt;strong&gt;Slowly Changing Dimensions (SCD)&lt;/strong&gt; are the canonical patterns for handling that change in a warehouse — Type 1 (overwrite, lose history), Type 2 (new row, keep full history), Type 3 (extra column, keep one prior value). Type 2 is the most-asked in interviews because it preserves historical accuracy at the cost of more rows and a surrogate key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favj5vote41ekhtrl0lfm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favj5vote41ekhtrl0lfm.jpeg" alt="SCD comparison diagram showing three side-by-side cards labeled 'SCD Type 1', 'SCD Type 2', and 'SCD Type 3', each with a small before/after table showing how a customer's city change from Hyderabad to Bangalore is handled — Type 1 overwrites, Type 2 inserts a new row with valid_from / valid_to / is_current columns, Type 3 adds previous_city / current_city columns — on a light PipeCode-branded card with purple headers and green / orange accent dots." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "which SCD type do I use?", say: &lt;em&gt;"Type 1 for attributes I never want to look at historically, Type 2 for anything that affects a report, Type 3 for the rare 'just show me the previous value' case."&lt;/em&gt; That answer covers 99% of real-world choices.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  SCD Type 1 — overwrite in place, lose history
&lt;/h4&gt;

&lt;p&gt;The Type-1 invariant: &lt;strong&gt;SCD Type 1 simply overwrites the dimension row when an attribute changes; the old value is lost; no history; cheapest and simplest to implement; the right choice for attributes you never query historically (typos, formatting normalisation)&lt;/strong&gt;. Use it sparingly and explicitly — every Type 1 attribute is a piece of history you're choosing to discard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per business key&lt;/strong&gt; — &lt;code&gt;customer_id = 42&lt;/code&gt; is exactly one row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overwrite on change&lt;/strong&gt; — old value replaced; no audit trail in the dim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplest ETL&lt;/strong&gt; — &lt;code&gt;UPDATE … SET …&lt;/code&gt; and you're done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right for&lt;/strong&gt; — corrections, name-formatting fixes, low-value attributes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong for&lt;/strong&gt; — anything that affects historical reports.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Customer 42's name corrected from &lt;code&gt;"Alce"&lt;/code&gt; to &lt;code&gt;"Alice"&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;before&lt;/th&gt;
&lt;th&gt;after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customer_id=42, name="Alce"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;customer_id=42, name="Alice"&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CSV import accidentally created &lt;code&gt;customer_id=42, name="Alce"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The data team notices the typo and runs an &lt;code&gt;UPDATE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The dim row is overwritten; future queries see &lt;code&gt;Alice&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Historical sales joined to this customer now show &lt;code&gt;Alice&lt;/code&gt; too — which is what we want for a typo fix.&lt;/li&gt;
&lt;li&gt;No new row; no history kept; no surrogate key needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Type 1 update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; Type 1 is correct when historical reports should retroactively reflect the corrected value. Otherwise it's wrong.&lt;/p&gt;
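&lt;p&gt;In production ETL, Type 1 is usually an idempotent upsert rather than a hand-written &lt;code&gt;UPDATE&lt;/code&gt; — a sketch using standard &lt;code&gt;MERGE&lt;/code&gt; (available in Snowflake, BigQuery, SQL Server, and PostgreSQL 15+), assuming a hypothetical &lt;code&gt;staging_customer&lt;/code&gt; load table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Type 1 load: overwrite attributes for existing keys, insert brand-new keys.
MERGE INTO dim_customer AS d
USING staging_customer AS s
   ON d.customer_id = s.customer_id
WHEN MATCHED THEN
    UPDATE SET name = s.name
WHEN NOT MATCHED THEN
    INSERT (customer_id, name) VALUES (s.customer_id, s.name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;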

&lt;h4&gt;
  
  
  SCD Type 2 — new row, keep full history
&lt;/h4&gt;

&lt;p&gt;The Type-2 invariant: &lt;strong&gt;SCD Type 2 inserts a new dimension row when an attribute changes, closes the old row with &lt;code&gt;valid_to&lt;/code&gt; and &lt;code&gt;is_current = FALSE&lt;/code&gt;, and points future facts at the new row's surrogate key; full history is preserved&lt;/strong&gt;. This is the most common SCD type in production and the most-asked in interviews.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple rows per business key&lt;/strong&gt; — each row covers one &lt;em&gt;period&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; columns&lt;/strong&gt; — date range during which the row was current.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is_current BOOLEAN&lt;/code&gt;&lt;/strong&gt; — shortcut for "give me the current row" (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New surrogate key per change&lt;/strong&gt; — facts joined by surrogate stay attached to the correct period.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical accuracy&lt;/strong&gt; — last year's revenue still rolls up to last year's city.&lt;/li&gt;
&lt;/ul&gt;
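&lt;p&gt;A sketch of the &lt;code&gt;is_current&lt;/code&gt; shortcut in action — current-state lookups skip the date-range predicate entirely:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- "Where does customer 42 live right now?" — exactly one row by construction.
SELECT city
FROM dim_customer
WHERE customer_id = 42
  AND is_current = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;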

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Customer 42 moves from Hyderabad to Bangalore on 2026-03-15:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_sk&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;city&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Customer 42 originally has one row: &lt;code&gt;customer_sk=1, city=Hyderabad, is_current=TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;On 2026-03-15 the customer moves; the ETL detects the change.&lt;/li&gt;
&lt;li&gt;The old row is &lt;em&gt;closed&lt;/em&gt;: &lt;code&gt;valid_to = 2026-03-14&lt;/code&gt;, &lt;code&gt;is_current = FALSE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A new row is inserted: &lt;code&gt;customer_sk=2, city=Bangalore, valid_from = 2026-03-15, valid_to = NULL, is_current = TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Future fact rows reference &lt;code&gt;customer_sk = 2&lt;/code&gt;; historical facts reference &lt;code&gt;customer_sk = 1&lt;/code&gt; — each fact gets the right city for &lt;em&gt;its&lt;/em&gt; time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; SCD2 update pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- close the old row&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-14'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- insert the new current row&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bangalore'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the attribute affects a historical report, it must be Type 2. The classic test: "would last year's revenue be wrong if I overwrote this?"&lt;/p&gt;

&lt;h4&gt;
  
  
  SCD Type 3 — extra column, one prior value
&lt;/h4&gt;

&lt;p&gt;The Type-3 invariant: &lt;strong&gt;SCD Type 3 adds a &lt;code&gt;previous_*&lt;/code&gt; column alongside the &lt;code&gt;current_*&lt;/code&gt; column on the same row; one prior value is kept, no more; cheaper than Type 2 but loses everything beyond the most recent change&lt;/strong&gt;. Used in special cases — e.g., territory reassignments where you want "current and last quarter's region" available without a join.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;One row per business key&lt;/strong&gt; — no row growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Both columns on the row&lt;/strong&gt; — &lt;code&gt;current_city&lt;/code&gt; + &lt;code&gt;previous_city&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loses older history&lt;/strong&gt; — third change overwrites the previous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right for&lt;/strong&gt; — "current vs immediately prior" comparison patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wrong for&lt;/strong&gt; — anything that needs more than one period of history.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Customer 42 moves once:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;current_city&lt;/th&gt;
&lt;th&gt;previous_city&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;42&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Bangalore&lt;/td&gt;
&lt;td&gt;Hyderabad&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The dim originally has &lt;code&gt;current_city = Hyderabad&lt;/code&gt; and &lt;code&gt;previous_city = NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The customer moves; ETL detects.&lt;/li&gt;
&lt;li&gt;Single &lt;code&gt;UPDATE&lt;/code&gt;: &lt;code&gt;previous_city = current_city&lt;/code&gt;, &lt;code&gt;current_city = "Bangalore"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the customer moves &lt;em&gt;again&lt;/em&gt; to Chennai, &lt;code&gt;previous_city&lt;/code&gt; becomes "Bangalore" — Hyderabad is lost forever.&lt;/li&gt;
&lt;li&gt;Reports can answer "compared to where they used to live" but not "where they lived three moves ago."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Type 3 update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;previous_city&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;current_city&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Bangalore'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; Type 3 is rare. Use it only when the business explicitly says "I want current vs previous side-by-side" and never asks for deeper history.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting to Type 1 because "it's simple" — overwriting historically meaningful attributes silently rewrites past reports.&lt;/li&gt;
&lt;li&gt;Implementing Type 2 without a surrogate key — joins break the moment the natural key has multiple rows.&lt;/li&gt;
&lt;li&gt;Forgetting to close the old row in Type 2 — both rows look "current"; queries return duplicates (see the guardrail query after this list).&lt;/li&gt;
&lt;li&gt;Mixing SCD types within one dimension without documentation — analysts cannot predict whether history is preserved.&lt;/li&gt;
&lt;li&gt;Using Type 3 for an attribute that changes many times — you keep "current and one prior," lose the rest, and miss the original analysis intent.&lt;/li&gt;
&lt;/ul&gt;
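&lt;p&gt;Several of these mistakes are cheap to guard against — a sketch of the "forgot to close the old row" check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Any business key with two current rows is a broken SCD2 dimension;
-- run this after every load and alert on a non-empty result.
SELECT customer_id, COUNT(*) AS current_rows
FROM dim_customer
WHERE is_current = TRUE
GROUP BY customer_id
HAVING COUNT(*) &amp;gt; 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;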

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Handling an Address Change Correctly
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;dim_customer&lt;/code&gt; dimension has &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;city&lt;/code&gt;, &lt;code&gt;email&lt;/code&gt;. Customers move cities occasionally; the marketing team wants quarterly revenue reports that attribute each sale to the city where the customer lived &lt;em&gt;at the time of the sale&lt;/em&gt;. &lt;strong&gt;Pick the SCD type, write the update logic, and explain how the fact-side join works.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using SCD Type 2 + Surrogate Key + a Date-Range Join
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;         &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;        &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_from&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;valid_to&lt;/span&gt;     &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt;   &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- detect change, close old, insert new&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-14'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;42&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Bangalore'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'alice@x.com'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- quarterly revenue by city — uses date-range join&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_from&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'9999-12-31'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;detect city change on 2026-03-15&lt;/td&gt;
&lt;td&gt;row count for customer 42 changes from 1 to 2 in dim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;close old Hyderabad row&lt;/td&gt;
&lt;td&gt;&lt;code&gt;valid_to = 2026-03-14, is_current = FALSE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;insert new Bangalore row&lt;/td&gt;
&lt;td&gt;&lt;code&gt;valid_from = 2026-03-15, valid_to = NULL, is_current = TRUE&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;run quarterly report&lt;/td&gt;
&lt;td&gt;each sale joins to the dim row that was current on its sale date&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;revenue split correctly between Hyderabad and Bangalore&lt;/td&gt;
&lt;td&gt;historical accuracy preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; Q1 2026 revenue is split correctly — sales before March 15 attribute to Hyderabad, sales on or after attribute to Bangalore. The CEO's "revenue by city" report stays accurate even as customers move.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 2&lt;/strong&gt; — full history; old rows live alongside new rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate key &lt;code&gt;customer_sk&lt;/code&gt;&lt;/strong&gt; — uniquely identifies each (customer, period); facts join to the right surrogate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; date range&lt;/strong&gt; — defines which dim row was current at any sale date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(valid_to, '9999-12-31')&lt;/code&gt;&lt;/strong&gt; — handles the open-ended current row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is_current = TRUE&lt;/code&gt; for "current state" queries&lt;/strong&gt; — shortcut for dashboards that always want the latest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — modest dim growth (one extra row per change); fact-side join cost identical.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill SCD2 patterns and dim modelling on the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  7. Partitioning, ETL/ELT, and the design process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How the warehouse actually gets built — and how data flows in
&lt;/h3&gt;

&lt;p&gt;A warehouse is more than a schema — it is &lt;strong&gt;partitioned tables, ETL/ELT pipelines, and a repeatable design process&lt;/strong&gt;. Partitioning (usually by date) is what turns multi-billion-row facts from "slow" into "sub-second." ETL/ELT is how source data gets &lt;em&gt;into&lt;/em&gt; the schema you designed. And the design process — the Kimball six-step method — is how you make the schema choices in the first place. This section closes the loop from "I have a great schema in my head" to "the warehouse is in production."&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When someone asks "design a warehouse for X," walk through the six Kimball steps in order — business process, grain, dimensions, facts, schema, optimisation. That ordering catches missing grain or missing dimensions before you write a line of DDL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Partitioning — split big facts by date for prune-friendly queries
&lt;/h4&gt;

&lt;p&gt;The partitioning invariant: &lt;strong&gt;partitioning splits a large fact table into smaller chunks (usually one per day or month) so that a query with a date predicate reads only the relevant partitions; this is how 5 B-row facts return in seconds&lt;/strong&gt;. Every cloud warehouse (Snowflake, BigQuery, Redshift) supports partitioning natively.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Partition key&lt;/strong&gt; — almost always the date column at the fact's natural grain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily or monthly&lt;/strong&gt; — daily for high-volume facts, monthly for low.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition pruning&lt;/strong&gt; — the planner skips partitions whose stats prove they cannot match.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loadable partition-by-partition&lt;/strong&gt; — daily ETL can &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;MERGE&lt;/code&gt; only today's partition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition-friendly predicates&lt;/strong&gt; — &lt;code&gt;WHERE date_col = '2026-05-10'&lt;/code&gt; prunes; &lt;code&gt;WHERE DATE(ts) = '2026-05-10'&lt;/code&gt; may not.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 5 B-row &lt;code&gt;fact_sales&lt;/code&gt; partitioned by &lt;code&gt;date_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;partitions scanned&lt;/th&gt;
&lt;th&gt;latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date_id = 20260510&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1 of 1,825&lt;/td&gt;
&lt;td&gt;~200 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date_id BETWEEN 20260501 AND 20260531&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;31 of 1,825&lt;/td&gt;
&lt;td&gt;~1 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE date_id &amp;gt;= 20260101&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;130 of 1,825&lt;/td&gt;
&lt;td&gt;~4 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;no date predicate (full scan)&lt;/td&gt;
&lt;td&gt;1,825 of 1,825&lt;/td&gt;
&lt;td&gt;~60 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The fact is partitioned daily by &lt;code&gt;date_id&lt;/code&gt;; one micro-partition (or table partition) per day.&lt;/li&gt;
&lt;li&gt;A query with &lt;code&gt;WHERE date_id = X&lt;/code&gt; scans exactly one partition — ~0.05% of the fact.&lt;/li&gt;
&lt;li&gt;A monthly query scans 31 partitions — ~1.7% of the fact.&lt;/li&gt;
&lt;li&gt;Without a date predicate, the warehouse must scan everything; that's almost always the wrong query.&lt;/li&gt;
&lt;li&gt;Partition pruning is automatic but requires that the predicate sit on the raw partition column, not wrapped in a function — see the sketch below.&lt;/li&gt;
&lt;/ol&gt;
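
&lt;p&gt;A hedged illustration of that pruning rule — the same aggregate with two predicates; exact pruning behaviour varies by engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Prunes: the predicate sits on the raw partition column
SELECT SUM(revenue)
FROM fact_sales
WHERE date_id = 20260510;

-- May not prune: the partition column is wrapped in a function,
-- so the planner cannot compare partition metadata to the predicate
SELECT SUM(revenue)
FROM fact_sales
WHERE TO_DATE(CAST(date_id AS VARCHAR), 'YYYYMMDD') = DATE '2026-05-10';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;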

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Partitioning a fact — Snowflake micro-partitions automatically and uses &lt;code&gt;CLUSTER BY&lt;/code&gt; to co-locate rows for pruning; BigQuery partitions explicitly via &lt;code&gt;PARTITION BY&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Snowflake&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sale_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_sk&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- BigQuery&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sale_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_sk&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every fact with more than ~100 M rows must be partitioned. Skip it and every analytical query degrades.&lt;/p&gt;

&lt;h4&gt;
  
  
  ETL vs ELT — transform outside or inside the warehouse
&lt;/h4&gt;

&lt;p&gt;The ETL/ELT invariant: &lt;strong&gt;ETL transforms data before loading (the older pattern, typically running Spark or Python externally); ELT loads raw data first and transforms with SQL inside the warehouse (the modern pattern, typically driven by dbt); modern columnar warehouses make ELT the better default in most cases&lt;/strong&gt;. Both fit dimensional modelling — they differ only in &lt;em&gt;where&lt;/em&gt; the transform happens.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL&lt;/strong&gt; — Extract, Transform, Load; transform pre-warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT&lt;/strong&gt; — Extract, Load, Transform; transform in-warehouse with SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt&lt;/strong&gt; — the de-facto SQL transformation framework for ELT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modern cloud warehouses&lt;/strong&gt; — fast enough that ELT outperforms ETL for most workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ETL tools&lt;/strong&gt; — Informatica, Talend, Spark; legacy stronghold for highly-custom transforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same daily orders load, ETL vs ELT:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;ETL flavour&lt;/th&gt;
&lt;th&gt;ELT flavour&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 extract&lt;/td&gt;
&lt;td&gt;pull Postgres rows into Spark&lt;/td&gt;
&lt;td&gt;dump Postgres rows to S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 transform&lt;/td&gt;
&lt;td&gt;Spark dedup, type-cast, enrich&lt;/td&gt;
&lt;td&gt;(later)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 load&lt;/td&gt;
&lt;td&gt;write transformed rows to warehouse&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COPY INTO&lt;/code&gt; raw rows to staging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 transform&lt;/td&gt;
&lt;td&gt;(done)&lt;/td&gt;
&lt;td&gt;dbt SQL builds star schema from staging&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 publish&lt;/td&gt;
&lt;td&gt;warehouse star schema ready&lt;/td&gt;
&lt;td&gt;warehouse star schema ready&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ETL: heavy work happens in Spark or Python before warehouse touches the data.&lt;/li&gt;
&lt;li&gt;ELT: raw rows land in the warehouse first (see the &lt;code&gt;COPY INTO&lt;/code&gt; sketch below); SQL transforms produce the model.&lt;/li&gt;
&lt;li&gt;ELT keeps the raw layer addressable — you can always re-derive the model.&lt;/li&gt;
&lt;li&gt;ELT uses the warehouse's compute (and bills you for it) instead of an external cluster.&lt;/li&gt;
&lt;li&gt;For most teams the simplification — "everything is SQL in one place" — outweighs the compute cost.&lt;/li&gt;
&lt;/ol&gt;
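
&lt;p&gt;For the raw-load half of the ELT flavour, a minimal Snowflake-style sketch — the stage name and file-format options here are illustrative assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- ELT load step: land raw rows in staging before any transform runs
-- (@landing_stage and the CSV format options are illustrative)
COPY INTO staging.orders_raw
FROM @landing_stage/orders/
FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;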

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; dbt-style ELT model (SQL only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/fact_orders.sql&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_raw&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;load_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TO_CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'YYYYMMDD'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; default to ELT unless you have a specific reason (massive transform, regulatory pre-processing, latency-sensitive streaming) to do ETL.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Kimball six-step design process
&lt;/h4&gt;

&lt;p&gt;The design-process invariant: &lt;strong&gt;the canonical Kimball method walks any new subject area through six numbered steps — business process → grain → dimensions → facts → schema → optimisation — in that order; doing them out of order produces broken designs&lt;/strong&gt;. Memorise the order; it works for every analytical domain.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1 — Business process&lt;/strong&gt; — name the operational activity ("sales", "support tickets").&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2 — Grain&lt;/strong&gt; — say "one row per X" in one phrase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 — Dimensions&lt;/strong&gt; — list the "by" axes: customer, product, date, region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 — Facts&lt;/strong&gt; — list the numeric measures: revenue, quantity, duration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5 — Schema&lt;/strong&gt; — draw the star (or hybrid); name the conformed dims.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 6 — Optimisation&lt;/strong&gt; — partition, cluster, index, materialise.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Designing an e-commerce orders warehouse:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 business process&lt;/td&gt;
&lt;td&gt;"online order placement and fulfilment"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 grain&lt;/td&gt;
&lt;td&gt;"one row per order line"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 dimensions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim_customer&lt;/code&gt;, &lt;code&gt;dim_product&lt;/code&gt;, &lt;code&gt;dim_date&lt;/code&gt;, &lt;code&gt;dim_region&lt;/code&gt;, &lt;code&gt;dim_payment&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 facts&lt;/td&gt;
&lt;td&gt;revenue, quantity, discount, tax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 schema&lt;/td&gt;
&lt;td&gt;star with 5 dims, 1 fact, surrogate keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 optimisation&lt;/td&gt;
&lt;td&gt;partition by &lt;code&gt;date_id&lt;/code&gt;, cluster by &lt;code&gt;customer_sk&lt;/code&gt;, SCD2 on customer + product&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The business process is "order placement and fulfilment"; that frames every choice that follows.&lt;/li&gt;
&lt;li&gt;Grain: one row per (order, product line) is the finest the source supports.&lt;/li&gt;
&lt;li&gt;Dimensions: who (customer), what (product), when (date), where (region), how (payment).&lt;/li&gt;
&lt;li&gt;Facts: revenue, quantity, discount, tax — additive numeric measures.&lt;/li&gt;
&lt;li&gt;Schema: star with surrogate keys; one conformed &lt;code&gt;dim_customer&lt;/code&gt; shared with other facts.&lt;/li&gt;
&lt;li&gt;Optimisation: partition on &lt;code&gt;date_id&lt;/code&gt;; cluster on &lt;code&gt;customer_sk&lt;/code&gt; for customer-by-customer rollups.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; End-to-end design output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- minimal six-step output&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_order_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;line_sk&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_number&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_sk&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tax&lt;/span&gt;          &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_sk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every design conversation starts with step 1 and walks forward. If someone hands you DDL without a grain statement, your first question is "what does one row mean?"&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Designing the schema before stating the grain — every column choice becomes a guess.&lt;/li&gt;
&lt;li&gt;Building ETL when ELT would work — extra cluster, extra tool, extra ops cost.&lt;/li&gt;
&lt;li&gt;Skipping partitioning on big facts — every query slows linearly with row count.&lt;/li&gt;
&lt;li&gt;Picking partition keys that don't match the most common predicate — pruning never engages.&lt;/li&gt;
&lt;li&gt;Treating the design as one-shot — every warehouse evolves; document the choices so the next iteration is informed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Warehouse Interview Question on Designing an Online-Shopping Warehouse from Scratch
&lt;/h3&gt;

&lt;p&gt;You are asked to design a warehouse for an online shopping app. The business wants daily revenue dashboards, monthly customer-segment reports, and real-time top-N best-selling products. &lt;strong&gt;Walk through the six-step Kimball process and produce the resulting schema.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using the Six-Step Kimball Process with a Star Schema
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1 (business process): online order placement&lt;/span&gt;
&lt;span class="c1"&gt;-- Step 2 (grain):           one row per order line&lt;/span&gt;
&lt;span class="c1"&gt;-- Step 3 (dimensions):      customer, product, date, region, payment&lt;/span&gt;
&lt;span class="c1"&gt;-- Step 4 (facts):           revenue, quantity, discount, tax&lt;/span&gt;
&lt;span class="c1"&gt;-- Step 5 (schema):          star with 5 dims + 1 fact&lt;/span&gt;
&lt;span class="c1"&gt;-- Step 6 (optimisation):    partition by date_id, cluster by customer_sk&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_sk&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;segment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_from&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;brand&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quarter&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;year&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_region&lt;/span&gt;   &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;region_sk&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_payment&lt;/span&gt;  &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment_sk&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;method&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_order_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;line_sk&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_number&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_sk&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;date_id&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_sk&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payment_sk&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_payment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;revenue&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;discount&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;tax&lt;/span&gt;          &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_sk&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;choice&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 process&lt;/td&gt;
&lt;td&gt;online order placement &amp;amp; fulfilment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 grain&lt;/td&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 dimensions&lt;/td&gt;
&lt;td&gt;customer (SCD2), product, date, region, payment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 facts&lt;/td&gt;
&lt;td&gt;revenue, quantity, discount, tax&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 schema&lt;/td&gt;
&lt;td&gt;star with surrogate keys on every dim&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6 optimisation&lt;/td&gt;
&lt;td&gt;partition by date_id, cluster by customer_sk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the resulting schema answers all three business questions — daily revenue (&lt;code&gt;GROUP BY date_id&lt;/code&gt;), monthly customer-segment (&lt;code&gt;GROUP BY month, segment&lt;/code&gt; joining &lt;code&gt;dim_customer&lt;/code&gt;), and top-N best-sellers (&lt;code&gt;ORDER BY SUM(revenue) DESC LIMIT N&lt;/code&gt; joining &lt;code&gt;dim_product&lt;/code&gt;). Each query is a simple star-shaped join with date-aware partition pruning.&lt;/p&gt;
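
&lt;p&gt;One plausible rendering of those three queries against this schema — the literal dates and the &lt;code&gt;LIMIT&lt;/code&gt; value are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1. Daily revenue (prunes to the requested dates)
SELECT date_id, SUM(revenue) AS revenue
FROM fact_order_lines
WHERE date_id BETWEEN 20260501 AND 20260531
GROUP BY date_id;

-- 2. Monthly revenue by customer segment
SELECT dd.year, dd.month, dc.segment, SUM(f.revenue) AS revenue
FROM fact_order_lines f
JOIN dim_date dd     ON dd.date_id     = f.date_id
JOIN dim_customer dc ON dc.customer_sk = f.customer_sk
GROUP BY dd.year, dd.month, dc.segment;

-- 3. Top-N best-selling products for one day
SELECT dp.name, SUM(f.revenue) AS revenue
FROM fact_order_lines f
JOIN dim_product dp ON dp.product_sk = f.product_sk
WHERE f.date_id = 20260510
GROUP BY dp.name
ORDER BY revenue DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;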

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1: business process&lt;/strong&gt; — frames every choice; "order placement" not "orders table."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2: explicit grain&lt;/strong&gt; — "one row per order line" prevents double-counting bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3: conformed dimensions&lt;/strong&gt; — same &lt;code&gt;dim_customer&lt;/code&gt; reused by future facts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4: additive measures&lt;/strong&gt; — &lt;code&gt;revenue&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;discount&lt;/code&gt;, &lt;code&gt;tax&lt;/code&gt; all &lt;code&gt;SUM&lt;/code&gt;-able.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5: star schema&lt;/strong&gt; — simple, fast, columnar-friendly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 6: partition + cluster&lt;/strong&gt; — daily reports prune by date; customer rollups prune by customer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — each business question runs in seconds because the schema and partitioning anticipate the question pattern.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; the full Kimball-to-warehouse design syllabus is in &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  Choosing a schema (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you are designing…&lt;/th&gt;
&lt;th&gt;Pick…&lt;/th&gt;
&lt;th&gt;Watch out for…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A new analytical subject area&lt;/td&gt;
&lt;td&gt;Kimball star schema&lt;/td&gt;
&lt;td&gt;Skipping the grain statement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A fact table with finite lifecycle (order, application)&lt;/td&gt;
&lt;td&gt;Accumulating snapshot&lt;/td&gt;
&lt;td&gt;Open-ended workflows that never "complete"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A balance/level metric over time&lt;/td&gt;
&lt;td&gt;Periodic snapshot&lt;/td&gt;
&lt;td&gt;Summing balances across days&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A dimension whose attributes change&lt;/td&gt;
&lt;td&gt;SCD Type 2 + surrogate key&lt;/td&gt;
&lt;td&gt;Forgetting to close the old row&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A correction or typo fix&lt;/td&gt;
&lt;td&gt;SCD Type 1&lt;/td&gt;
&lt;td&gt;Overwriting historically-important attributes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A very large hierarchical dimension&lt;/td&gt;
&lt;td&gt;Snowflake (only this one)&lt;/td&gt;
&lt;td&gt;Snowflaking every dimension&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A 1 B+ row fact&lt;/td&gt;
&lt;td&gt;Partition by date, cluster by access pattern&lt;/td&gt;
&lt;td&gt;Predicates that wrap the partition column&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Reach for &lt;strong&gt;Kimball data warehouse&lt;/strong&gt; principles by default. Inmon's normalised EDW pattern works for some enterprise contexts, but most modern teams ship faster with subject-area marts joined by conformed dimensions.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a fact table?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;fact table&lt;/strong&gt; stores measurable business events — orders, clicks, payments — with one row per event and numeric measures plus foreign keys to dimensions. Fact tables are usually the largest tables in a warehouse and the focus of every analytical query.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a dimension table?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;dimension table&lt;/strong&gt; stores descriptive attributes that put facts into business context — customer name and city, product category, calendar date. Dimensions answer the "by" questions ("revenue &lt;em&gt;by&lt;/em&gt; category") and are joined to facts by foreign keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a star schema?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;star schema&lt;/strong&gt; has one fact table at the centre joined to N denormalised dimension tables; the shape looks like a star. It is the default analytical schema because joins are simple and columnar warehouses optimise it natively. The &lt;strong&gt;star schema vs snowflake schema&lt;/strong&gt; trade-off favours star in nearly every modern warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is grain in data warehouse design?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Grain&lt;/strong&gt; is the meaning of one row in a fact table — "one row per order line," "one row per (day, product)," "one row per session." It must be stated explicitly before columns are chosen, and mixing grains in a single fact is the most common modelling bug.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a surrogate key?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;surrogate key&lt;/strong&gt; is a system-generated stable identifier (typically a &lt;code&gt;BIGINT&lt;/code&gt; sequence) attached to every dimension row. Facts join on the surrogate; the natural business key (&lt;code&gt;customer_email&lt;/code&gt;) lives on the dim for traceability. Surrogate keys are required for SCD Type 2 because the natural key isn't unique anymore.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is SCD Type 2?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SCD Type 2&lt;/strong&gt; inserts a new dimension row whenever an attribute changes — the old row is closed with &lt;code&gt;valid_to&lt;/code&gt; and &lt;code&gt;is_current = FALSE&lt;/code&gt;; the new row gets a fresh surrogate key. Historical accuracy is preserved: last year's revenue rolls up to last year's city, not today's.&lt;/p&gt;
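
&lt;p&gt;Mechanically, one change is two statements — a sketch using this article's column names; the literal surrogate key stands in for a sequence or identity column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1) Close the current row for the changed customer
UPDATE dim_customer
SET valid_to = DATE '2026-03-15', is_current = FALSE
WHERE customer_id = 42 AND is_current = TRUE;

-- 2) Insert the new version with a fresh surrogate key
--    (1001 is illustrative; production uses a sequence or identity column)
INSERT INTO dim_customer
    (customer_sk, customer_id, name, segment, city, valid_from, valid_to, is_current)
VALUES
    (1001, 42, 'Ram', 'consumer', 'Bangalore', DATE '2026-03-15', NULL, TRUE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;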

&lt;h3&gt;
  
  
  What's the difference between a data warehouse, a data lake, a data mart, and a data lakehouse?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; holds modelled analytical data (star schemas, conformed dimensions). A &lt;strong&gt;data lake&lt;/strong&gt; holds raw files (Parquet / JSON / CSV) on object storage without modelled schemas. A &lt;strong&gt;data mart&lt;/strong&gt; is a subject-area subset of a warehouse (e.g., &lt;code&gt;mart_finance&lt;/code&gt;). A &lt;strong&gt;data lakehouse&lt;/strong&gt; layers ACID table formats (Iceberg, Delta) on top of lake storage to give warehouse-style semantics on raw files. Pick by the workload and the team's needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; data engineering practice problems — &lt;strong&gt;SQL&lt;/strong&gt; uses the &lt;strong&gt;PostgreSQL&lt;/strong&gt; dialect, with editorials and topics aligned to the same patterns warehouse interviewers ask. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL practice →&lt;/a&gt;, filter by &lt;a href="https://dev.to/explore/practice/topic/etl"&gt;ETL →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/aggregations"&gt;aggregations →&lt;/a&gt;, and &lt;a href="https://dev.to/subscribe"&gt;see plans →&lt;/a&gt; when you want the full library.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>ETL Pipeline for Data Engineering: A Beginner's Guide to Extract, Transform, and Load</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Tue, 12 May 2026 04:37:35 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/etl-pipeline-for-data-engineering-a-beginners-guide-to-extract-transform-and-load-4i1f</link>
      <guid>https://forem.com/gowthampotureddi/etl-pipeline-for-data-engineering-a-beginners-guide-to-extract-transform-and-load-4i1f</guid>
      <description>&lt;p&gt;An &lt;strong&gt;ETL pipeline&lt;/strong&gt; is the core data-engineering workflow that turns scattered raw payloads — database rows, API responses, log files, SaaS exports — into clean, trusted data inside a warehouse where analysts and BI tools can use it. &lt;strong&gt;ETL stands for Extract, Transform, Load&lt;/strong&gt;: pull raw data from many source systems, reshape and clean it into a consistent schema, then write it into a destination like Amazon Redshift, Snowflake, or a data lake. Every fresher data-engineering interview probes the same three letters — and the candidate who can name the failure modes per stage wins the round.&lt;/p&gt;

&lt;p&gt;Think of this as a beginner-friendly &lt;strong&gt;ETL pipeline tutorial&lt;/strong&gt; for data engineers — a first-principles walk through the Extract → Transform → Load loop, the orchestration tools that automate it (Airflow, dbt, Spark, AWS Glue), the ETL-vs-ELT trade-off that defines modern cloud warehouses, and a runnable Python &lt;code&gt;pandas&lt;/code&gt; example you can adapt to your own pipeline. Every section ships worked examples and an &lt;strong&gt;ETL interview questions&lt;/strong&gt;-style problem with a full traced solution, in the same shape PipeCode practice problems use.&lt;/p&gt;

&lt;p&gt;If you want &lt;strong&gt;hands-on reps&lt;/strong&gt; after you read, &lt;a href="https://dev.to/explore/practice"&gt;explore practice →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/language/sql"&gt;drill SQL problems →&lt;/a&gt;, browse &lt;a href="https://dev.to/explore/practice/topic/etl"&gt;ETL practice →&lt;/a&gt;, or open &lt;a href="https://dev.to/explore/courses/etl-system-design-for-data-engineering-interviews"&gt;ETL System Design for Data Engineering Interviews →&lt;/a&gt; for a structured path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0w10zf7a8rkcgfibir3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu0w10zf7a8rkcgfibir3.jpeg" alt="PipeCode blog header for a beginner-friendly ETL pipeline data engineering guide — bold title 'ETL Pipeline for Data Engineering' with subtitle 'Extract · Transform · Load' and a three-stage E→T→L flow icon in purple, green, and orange on a dark gradient background with pipecode.ai attribution." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why ETL pipelines matter&lt;/li&gt;
&lt;li&gt;Extract — pulling raw data from sources&lt;/li&gt;
&lt;li&gt;Transform — cleaning, dedup, standardization, aggregation&lt;/li&gt;
&lt;li&gt;Load — destinations from warehouses to BI tools&lt;/li&gt;
&lt;li&gt;ETL vs ELT — transform before or after loading&lt;/li&gt;
&lt;li&gt;ETL orchestration tools — Airflow, dbt, Spark, AWS Glue&lt;/li&gt;
&lt;li&gt;Building a Python pandas ETL pipeline&lt;/li&gt;
&lt;li&gt;Choosing your ETL stack (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why ETL pipelines matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Clean, trusted data is the foundation of every analytics decision
&lt;/h3&gt;

&lt;p&gt;So, &lt;strong&gt;why ETL?&lt;/strong&gt; Because raw source data is messy — duplicates, nulls, mixed formats, inconsistent customer IDs across systems — and dashboards can't tolerate that mess. An ETL pipeline is the &lt;strong&gt;automated cleaning step&lt;/strong&gt; between the noisy source-of-truth (operational databases, third-party APIs, file dumps) and the curated layer (data warehouse, lakehouse, BI tool). Without it, analytics teams answer the same question three different ways and trust in the data evaporates.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When an interviewer asks "what's an ETL pipeline?", lead with &lt;strong&gt;the &lt;em&gt;contract&lt;/em&gt; it provides&lt;/strong&gt;, not the steps. The contract is: "given any source payload, downstream consumers see clean, deduplicated, type-coerced, time-aligned rows on a known schema with a known freshness SLA." The three letters (E, T, L) are just how you keep that contract.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Raw data is noisy — duplicates, nulls, mixed formats
&lt;/h4&gt;

&lt;p&gt;The noise invariant: &lt;strong&gt;source systems were built for their own workload, not for analytics; they ship duplicates from CDC retries, nulls where the user skipped a field, three different date formats from three different teams, and inconsistent capitalisation that breaks joins&lt;/strong&gt;. Every one of those defects either becomes a bug in the dashboard or gets cleaned out by an ETL stage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duplicates&lt;/strong&gt; — same customer recorded multiple times (CDC retries, late events).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nulls&lt;/strong&gt; — missing amounts, missing emails, optional fields left blank.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed formats&lt;/strong&gt; — &lt;code&gt;2026-05-11&lt;/code&gt;, &lt;code&gt;11/05/26&lt;/code&gt;, &lt;code&gt;May 11&lt;/code&gt; all mean the same date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent identifiers&lt;/strong&gt; — &lt;code&gt;Ram&lt;/code&gt;, &lt;code&gt;RAM&lt;/code&gt;, &lt;code&gt;Ram&lt;/code&gt; (trailing space) all refer to one customer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A raw orders extract straight from the CRM contains three flavours of "Ram" and a null amount:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Ram&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Ram&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SELECT COUNT(DISTINCT customer_name) FROM orders&lt;/code&gt; returns &lt;strong&gt;2&lt;/strong&gt; (&lt;code&gt;Ram&lt;/code&gt; and &lt;code&gt;RAM&lt;/code&gt;) when the business wants &lt;strong&gt;1&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SELECT AVG(amount) FROM orders&lt;/code&gt; returns &lt;code&gt;750&lt;/code&gt; instead of the correct &lt;code&gt;500&lt;/code&gt; because &lt;code&gt;NULL&lt;/code&gt; is excluded from &lt;code&gt;AVG&lt;/code&gt;, but the consumer expected it to be treated as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;A dashboard built directly on this raw table publishes wrong numbers to the CFO.&lt;/li&gt;
&lt;li&gt;The fix isn't a smarter query — it's an ETL pipeline that normalises the casing, deduplicates customers on a canonical key, and resolves null amounts using a business rule.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A minimal Transform step in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;             &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
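
&lt;p&gt;Step 4 also calls for deduplication; a sketch of that half, keeping the latest record per &lt;code&gt;order_id&lt;/code&gt; — the &lt;code&gt;source_ts&lt;/code&gt; column is an illustrative assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keep the most recent record per order_id (CDC retries land as duplicates)
WITH ranked AS (
    SELECT
        order_id,
        LOWER(TRIM(customer_name)) AS customer_key,
        COALESCE(amount, 0)        AS amount,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) AS rn
    FROM raw_orders
)
SELECT order_id, customer_key, amount
FROM ranked
WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;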



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the deeper you get into a pipeline, the more expensive the fix; clean the raw payload as close to ingest as possible.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multiple sources, one consistent schema
&lt;/h4&gt;

&lt;p&gt;The unification invariant: &lt;strong&gt;every analytical query joins data from at least two systems — the e-commerce orders table, the payment provider's ledger, the CRM customer record, the marketing platform's campaign IDs — and they don't agree on schema, primary keys, or freshness; the ETL pipeline is what produces a single conformed schema with a shared customer key, a shared product key, and a shared time grain&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Common sources&lt;/strong&gt; — SQL databases (Postgres, MySQL), APIs (Stripe, Salesforce), CSV / Excel dumps, log files, cloud storage (S3, GCS), SaaS tools (HubSpot, Segment).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed dimensions&lt;/strong&gt; — &lt;code&gt;dim_customer&lt;/code&gt;, &lt;code&gt;dim_product&lt;/code&gt;, &lt;code&gt;dim_date&lt;/code&gt; shared across every fact table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Surrogate keys&lt;/strong&gt; — never join on a source-system natural key; map it to an internal &lt;code&gt;BIGINT&lt;/code&gt; key in the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared time grain&lt;/strong&gt; — every fact aligns to a common date grain (day, hour, minute) so cross-source joins work.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three sources, three different customer-identifier columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;identifier column&lt;/th&gt;
&lt;th&gt;example value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;website (Postgres)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;users.id&lt;/code&gt; (BIGINT)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;payments (Stripe API)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer.id&lt;/code&gt; (string)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cus_abc123&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRM (HubSpot export)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;contact.email&lt;/code&gt; (string)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ram@example.com&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The website ships orders keyed by Postgres &lt;code&gt;users.id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The payment provider returns transaction rows keyed by Stripe &lt;code&gt;cus_abc123&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The CRM exports customer contacts keyed by email address.&lt;/li&gt;
&lt;li&gt;A direct three-way join is impossible — there is no shared key.&lt;/li&gt;
&lt;li&gt;The ETL pipeline builds a &lt;code&gt;dim_customer&lt;/code&gt; table that carries all three identifiers as columns plus a single internal &lt;code&gt;customer_key BIGINT&lt;/code&gt; that every downstream fact uses. After that, the three-way join is one line of SQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A bridge dimension that unifies all three identifiers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_key&lt;/span&gt;   &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;website_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;       &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;stripe_id&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;         &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;crm_email&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;         &lt;span class="k"&gt;UNIQUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;           &lt;span class="nb"&gt;TEXT&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt;     &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
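
&lt;p&gt;With the bridge in place, the "impossible" three-way join from step 4 becomes routine — the source table and column names below are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Each source joins through its own identifier column on dim_customer
-- (website_orders / stripe_charges and their columns are illustrative)
SELECT
    c.customer_key,
    o.order_id,
    ch.amount
FROM dim_customer c
JOIN website_orders o  ON o.user_id      = c.website_id
JOIN stripe_charges ch ON ch.customer_id = c.stripe_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;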



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you find yourself joining on a string from a third party, the join key belongs in a dimension table, not in your fact-table predicates.&lt;/p&gt;

&lt;h4&gt;
  
  
  Automation, repeatability, and observability
&lt;/h4&gt;

&lt;p&gt;The automation invariant: &lt;strong&gt;an ETL pipeline runs on a schedule (or in response to an event), produces the same output on every rerun (idempotent), and emits enough metadata (row counts, hashes, error logs) for an on-call engineer to debug a failure at 3 a.m.&lt;/strong&gt;. A "pipeline" without these properties is a one-off script, and one-off scripts always rot.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schedule&lt;/strong&gt; — cron (&lt;code&gt;0 2 * * *&lt;/code&gt; for 2 a.m. daily), event-driven (S3 ObjectCreated), or continuous (Kafka stream).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent&lt;/strong&gt; — rerunning the same job produces the same output; achieved with &lt;code&gt;MERGE&lt;/code&gt; (sketch after this list), &lt;code&gt;INSERT OVERWRITE PARTITION&lt;/code&gt;, or &lt;code&gt;DELETE + INSERT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observable&lt;/strong&gt; — row counts logged per stage, schema drift alerts, success / failure notifications to Slack or PagerDuty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible&lt;/strong&gt; — the pipeline definition lives in git; deploys are versioned; rollbacks are one PR away.&lt;/li&gt;
&lt;/ul&gt;
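
&lt;p&gt;A minimal sketch of the &lt;code&gt;MERGE&lt;/code&gt; route to idempotency — a rerun updates matched rows instead of duplicating them; the staging table and column names are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Upsert by business key: reruns update in place, never duplicate
MERGE INTO silver.orders t
USING staging.orders_today s
    ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET
    amount    = s.amount,
    source_ts = s.source_ts
WHEN NOT MATCHED THEN INSERT (order_id, customer_id, amount, source_ts)
    VALUES (s.order_id, s.customer_id, s.amount, s.source_ts);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;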

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An idempotent daily load that overwrites a single date partition.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;run&lt;/th&gt;
&lt;th&gt;rows in target&lt;/th&gt;
&lt;th&gt;resulting count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;original&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;12,835 (first load for 2026-05-11)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;retry after partial failure&lt;/td&gt;
&lt;td&gt;6,420 (partial write)&lt;/td&gt;
&lt;td&gt;12,835 (overwrite produces clean state)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;accidental rerun next morning&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835 (same data, no duplicates)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Day-1 run: load lands 12,835 rows for partition &lt;code&gt;ingest_date='2026-05-11'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The pipeline crashes mid-write, leaving 6,420 partial rows.&lt;/li&gt;
&lt;li&gt;Retry runs the same job; the &lt;code&gt;INSERT OVERWRITE PARTITION&lt;/code&gt; semantics drop the partial rows first, then write the full 12,835 — net state is correct.&lt;/li&gt;
&lt;li&gt;An accidental rerun a day later does the same thing — overwrite the partition, end at 12,835. No duplicates.&lt;/li&gt;
&lt;li&gt;The key property is that the &lt;strong&gt;final state depends on the input, not on how many times the job ran&lt;/strong&gt;. That's idempotency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A partition-overwrite load (Hive/Spark-style &lt;code&gt;INSERT OVERWRITE ... PARTITION&lt;/code&gt; syntax; Snowflake offers &lt;code&gt;INSERT OVERWRITE INTO&lt;/code&gt;, and in PostgreSQL you would emulate it with &lt;code&gt;DELETE&lt;/code&gt; plus &lt;code&gt;INSERT&lt;/code&gt; in one transaction):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="n"&gt;OVERWRITE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingest_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your job's output depends on &lt;code&gt;NOW()&lt;/code&gt; or on previous state in the target, it is not idempotent — restructure.&lt;/p&gt;
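
&lt;p&gt;For engines without &lt;code&gt;INSERT OVERWRITE&lt;/code&gt; (plain PostgreSQL, for example), the same idempotency comes from the &lt;code&gt;DELETE + INSERT&lt;/code&gt; pattern listed above, wrapped in one transaction. A minimal sketch; the connection string and column list are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

PARTITION_DATE = "2026-05-11"

# hypothetical DSN; point at the warehouse, not the OLTP primary
conn = psycopg2.connect("dbname=warehouse")
with conn:  # one transaction: commit on success, rollback on error
    with conn.cursor() as cur:
        # drop whatever a previous (possibly partial) run left in the partition
        cur.execute(
            "DELETE FROM silver.orders WHERE ingest_date = %s",
            (PARTITION_DATE,),
        )
        # rewrite the partition from bronze; every rerun lands on the same state
        cur.execute(
            """
            INSERT INTO silver.orders (order_id, customer_id, amount, source_ts, ingest_date)
            SELECT order_id, customer_id, amount::NUMERIC(14, 2), source_ts, ingest_date
            FROM bronze.orders
            WHERE ingest_date = %s
            """,
            (PARTITION_DATE,),
        )
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the &lt;code&gt;DELETE&lt;/code&gt; and &lt;code&gt;INSERT&lt;/code&gt; commit atomically, a crash mid-run leaves the previous partition intact instead of half-written.&lt;/p&gt;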

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Conflating an ETL pipeline with a one-off SQL script — pipelines are scheduled, versioned, and observable.&lt;/li&gt;
&lt;li&gt;Skipping the deduplication step in Transform — assuming source data is "clean enough" and shipping doubled metrics.&lt;/li&gt;
&lt;li&gt;Hand-mapping identifiers in every dashboard instead of building a conformed &lt;code&gt;dim_customer&lt;/code&gt; once.&lt;/li&gt;
&lt;li&gt;Writing non-idempotent loads (&lt;code&gt;INSERT INTO ... SELECT ...&lt;/code&gt;) and discovering duplicates only after the second run.&lt;/li&gt;
&lt;li&gt;Treating ETL as code-only without observability — silent failures rot trust faster than loud ones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ETL Interview Question on Designing a First-Pass Pipeline
&lt;/h3&gt;

&lt;p&gt;A retail company has order data spread across three systems — the Postgres-backed e-commerce site, a Stripe payments account, and a HubSpot CRM. The CFO wants a daily revenue dashboard sliced by customer segment and product category. &lt;strong&gt;Design the simplest end-to-end ETL pipeline that gives the CFO a trustworthy answer by tomorrow morning.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Daily ETL Pipeline with Bronze / Silver / Gold Layers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXTRACT
   ├── Postgres orders   → s3://lake/bronze/orders/ingest_date=…/         (Debezium CDC or daily snapshot)
   ├── Stripe charges    → s3://lake/bronze/charges/ingest_date=…/        (REST API → JSON to S3)
   └── HubSpot contacts  → s3://lake/bronze/contacts/ingest_date=…/       (nightly CSV export)

TRANSFORM   (dbt or Spark SQL)
   ├── silver.orders     ← dedup + type coercion + customer_key surrogate
   ├── silver.charges    ← join orders ↔ charges by stripe transaction_id
   ├── silver.contacts   ← deduped contact rows keyed by email
   └── silver.dim_customer ← unify all three identifiers in one dimension

LOAD
   └── gold.fact_revenue ← grain: one row per (date, customer_key, product_key)
                            partitioned by date_key
                            joined to dim_customer + dim_product

ORCHESTRATION
   └── Airflow DAG, daily at 02:00 UTC, with reconciliation gate before gold promotion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the daily pipeline:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time (UTC)&lt;/th&gt;
&lt;th&gt;what runs&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;02:00&lt;/td&gt;
&lt;td&gt;Airflow triggers DAG&lt;/td&gt;
&lt;td&gt;start&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:05&lt;/td&gt;
&lt;td&gt;Extract: Postgres snapshot → S3 bronze&lt;/td&gt;
&lt;td&gt;12,835 raw orders&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:08&lt;/td&gt;
&lt;td&gt;Extract: Stripe API pulls yesterday's charges → S3&lt;/td&gt;
&lt;td&gt;12,712 charges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:10&lt;/td&gt;
&lt;td&gt;Extract: HubSpot nightly CSV → S3&lt;/td&gt;
&lt;td&gt;8,210 contacts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:15&lt;/td&gt;
&lt;td&gt;Transform: dbt builds silver layer (dedup, type coercion, surrogate keys)&lt;/td&gt;
&lt;td&gt;12,835 silver orders&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:25&lt;/td&gt;
&lt;td&gt;Transform: dim_customer unified across all three sources&lt;/td&gt;
&lt;td&gt;8,210 dim rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:30&lt;/td&gt;
&lt;td&gt;Reconciliation gate: silver.orders count vs Postgres source&lt;/td&gt;
&lt;td&gt;drift &amp;lt; 0.1% — PASS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:32&lt;/td&gt;
&lt;td&gt;Load: gold.fact_revenue partition for 2026-05-11&lt;/td&gt;
&lt;td&gt;12,835 fact rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;02:35&lt;/td&gt;
&lt;td&gt;DAG complete; CFO dashboard refreshes at 02:40&lt;/td&gt;
&lt;td&gt;clean numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; a single &lt;code&gt;gold.fact_revenue&lt;/code&gt; table the BI tool reads. Every row has a &lt;code&gt;customer_key&lt;/code&gt;, a &lt;code&gt;product_key&lt;/code&gt;, a &lt;code&gt;date_key&lt;/code&gt;, an exact &lt;code&gt;revenue&lt;/code&gt; decimal, and a &lt;code&gt;pipeline_version&lt;/code&gt; lineage column. The CFO opens Tableau, picks a date range, and sees correct numbers segmented by customer cohort.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bronze (append-only) → Silver (conformed) → Gold (star schema)&lt;/strong&gt; — the medallion layering keeps source drift contained to the bronze layer; consumers never see raw payloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily partition overwrite&lt;/strong&gt; — every load is idempotent; reruns and backfills don't produce duplicates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;dim_customer&lt;/code&gt; bridging three identifiers&lt;/strong&gt; — joins are one line in the gold layer; the warehouse query plan uses a single surrogate key everywhere.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reconciliation gate before gold promotion&lt;/strong&gt; — drift &amp;gt; tolerance pages on-call; the BI tool never sees a bad load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airflow orchestration&lt;/strong&gt; — the DAG definition is versioned in git; retries, alerts, and SLAs are first-class (see the sketch below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one daily run scales &lt;code&gt;O(|daily delta|)&lt;/code&gt;, not &lt;code&gt;O(|all-time data|)&lt;/code&gt;; backfill is &lt;code&gt;O(|range|)&lt;/code&gt;; clear observability everywhere.&lt;/li&gt;
&lt;/ul&gt;
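
&lt;p&gt;A minimal sketch of that Airflow DAG. Task bodies are stubbed, names are hypothetical, and the &lt;code&gt;schedule&lt;/code&gt; argument assumes Airflow 2.4+ (older versions spell it &lt;code&gt;schedule_interval&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def reconcile(**context):
    # hypothetical gate: compare the silver.orders count against the Postgres source
    drift = 0.0004  # placeholder; in reality, query both sides and compute the ratio
    if drift &gt; 0.001:  # 0.1% tolerance, as in the trace above
        raise ValueError(f"reconciliation drift {drift:.2%} exceeds tolerance")

with DAG(
    dag_id="daily_revenue",
    schedule="0 2 * * *",  # daily at 02:00 UTC
    start_date=datetime(2026, 5, 1),
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_bronze", python_callable=lambda: None)
    transform = PythonOperator(task_id="build_silver", python_callable=lambda: None)
    gate = PythonOperator(task_id="reconcile_gate", python_callable=reconcile)
    load = PythonOperator(task_id="load_gold", python_callable=lambda: None)

    # the gate sits between silver and gold, so a bad load never reaches the BI tool
    extract &gt;&gt; transform &gt;&gt; gate &gt;&gt; load
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;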

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for end-to-end pipeline shapes and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for transformation-style queries.&lt;/p&gt;


&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — ETL System Design&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL System Design for Data Engineering Interviews&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Extract — pulling raw data from sources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How to read from databases, APIs, files, and SaaS without breaking the source
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Extract&lt;/strong&gt; stage is the gateway between source systems and the rest of the pipeline — and the choices you make here cascade into everything downstream. Different sources have different protocols (SQL, REST, file drops, CDC streams), different freshness expectations (sub-second to weekly), and different failure modes (rate limits, schema drift, network flakiness). The right Extract strategy is the one that pulls &lt;strong&gt;complete, ordered, replayable&lt;/strong&gt; raw data without disrupting the source system's own workload.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u7ooma70qv0uhkqv9sj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0u7ooma70qv0uhkqv9sj.jpeg" alt="Extract stage diagram showing four common source systems — a Postgres / MySQL relational database, a REST API, a CSV / log file feed, and a SaaS tool — flowing through extract jobs (snapshot, CDC, REST poll, file drop) into a raw landing zone on S3, with PipeCode-branded purple and orange accents on a light card." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the most common Extract failure mode in production isn't "data is wrong" — it's "data is silently missing because the source paginated and we didn't follow the cursor." Always log the source-system pagination cursor / last-modified marker for every batch so you can replay from exactly where you stopped.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  SQL databases — snapshots, CDC, and the read-replica rule
&lt;/h4&gt;

&lt;p&gt;The relational-source invariant: &lt;strong&gt;never run extract queries against the OLTP primary database — that's the production transactional workload; extract from a read replica, a CDC stream, or a periodic snapshot to S3 / GCS&lt;/strong&gt;. Within those three patterns, CDC is the modern default for high-freshness pipelines because it has near-zero impact on the source.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snapshot extract&lt;/strong&gt; — &lt;code&gt;SELECT * FROM orders WHERE updated_at &amp;gt; $cursor&lt;/code&gt; against a read replica.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC (Change Data Capture)&lt;/strong&gt; — Debezium reads the Postgres WAL or MySQL binlog and emits change events to Kafka.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read replica&lt;/strong&gt; — point extracts at a follower, not the primary, so the OLTP workload is unaffected.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cursor / watermark&lt;/strong&gt; — persist the last successfully-extracted timestamp or LSN so the next run resumes correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A daily extract that pulls only the previous day's orders.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;source impact&lt;/th&gt;
&lt;th&gt;freshness&lt;/th&gt;
&lt;th&gt;complexity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full table dump&lt;/td&gt;
&lt;td&gt;high (lock + IO)&lt;/td&gt;
&lt;td&gt;24 h&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor-based incremental&lt;/td&gt;
&lt;td&gt;medium (one indexed scan)&lt;/td&gt;
&lt;td&gt;24 h&lt;/td&gt;
&lt;td&gt;medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDC (Debezium + Kafka)&lt;/td&gt;
&lt;td&gt;near-zero&lt;/td&gt;
&lt;td&gt;~1 min&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The naïve approach &lt;code&gt;SELECT * FROM orders&lt;/code&gt; against the primary scans the entire table — millions of rows of IO that competes with the live website.&lt;/li&gt;
&lt;li&gt;The cursor-based approach &lt;code&gt;WHERE updated_at &amp;gt;= '2026-05-10' AND updated_at &amp;lt; '2026-05-11'&lt;/code&gt; against a read replica reads only ~1 day of data — much cheaper.&lt;/li&gt;
&lt;li&gt;CDC reads the database's own write-ahead log, so the extract has &lt;strong&gt;zero query cost&lt;/strong&gt; on the source — only the disk-tail read of the WAL.&lt;/li&gt;
&lt;li&gt;The right choice depends on freshness needs: daily dashboards → cursor; sub-minute analytics → CDC; one-off backfill → full snapshot to S3.&lt;/li&gt;
&lt;li&gt;In every case, log the &lt;strong&gt;cursor&lt;/strong&gt; (or LSN) per batch — that's how you replay after a failure.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A cursor-based daily extract against a Postgres read replica (note: vanilla &lt;code&gt;COPY ... TO&lt;/code&gt; writes to a server-local path, so the S3 target assumes an export wrapper such as the RDS / Aurora &lt;code&gt;aws_s3&lt;/code&gt; extension):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Run nightly; bind $cursor to the previous run's max_updated_at&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;timestamptz&lt;/span&gt;
      &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;  &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;timestamptz&lt;/span&gt;
    &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;updated_at&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="s1"&gt;'s3://lake/bronze/orders/ingest_date=2026-05-10/orders.csv'&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HEADER&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every extract you write should be &lt;strong&gt;resumable from a cursor&lt;/strong&gt;. Without one, a network blip becomes a full re-extract.&lt;/p&gt;
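
&lt;p&gt;One way to make that cursor durable: keep the watermark in a small state table so the next run resumes exactly where the last one committed. A minimal sketch; &lt;code&gt;etl.watermarks&lt;/code&gt; and the DSN are hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import psycopg2

conn = psycopg2.connect("dbname=warehouse")  # hypothetical DSN

def load_watermark(cur, source):
    """Last successfully extracted updated_at for this source; epoch on first run."""
    cur.execute("SELECT last_updated_at FROM etl.watermarks WHERE source = %s", (source,))
    row = cur.fetchone()
    return row[0] if row else "1970-01-01"

def save_watermark(cur, source, value):
    # assumes a UNIQUE constraint on etl.watermarks(source)
    cur.execute(
        """
        INSERT INTO etl.watermarks (source, last_updated_at)
        VALUES (%s, %s)
        ON CONFLICT (source) DO UPDATE SET last_updated_at = EXCLUDED.last_updated_at
        """,
        (source, value),
    )

with conn:
    with conn.cursor() as cur:
        since = load_watermark(cur, "orders")
        cur.execute(
            "SELECT order_id, customer_id, amount, status, updated_at "
            "FROM orders WHERE updated_at &gt; %s ORDER BY updated_at",
            (since,),
        )
        rows = cur.fetchall()
        # ... write rows to the bronze landing zone here ...
        if rows:
            save_watermark(cur, "orders", rows[-1][-1])  # max updated_at in the batch
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the write to bronze fails, the exception rolls back the watermark update too, so the next run re-extracts the same window instead of skipping it.&lt;/p&gt;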

&lt;h4&gt;
  
  
  APIs — pagination, rate limits, and idempotent paging
&lt;/h4&gt;

&lt;p&gt;The API-source invariant: &lt;strong&gt;most REST APIs return a paginated response; the extract job must follow the cursor / next-link until all pages are consumed; rate limits force exponential backoff; the same call run twice should produce the same data unless the upstream changed&lt;/strong&gt;. Idempotency on the &lt;em&gt;extract&lt;/em&gt; side protects the rest of the pipeline.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pagination&lt;/strong&gt; — cursor-based (&lt;code&gt;next_cursor&lt;/code&gt; token), offset / limit, or &lt;code&gt;since&lt;/code&gt; timestamps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limits&lt;/strong&gt; — read the &lt;code&gt;X-RateLimit-Remaining&lt;/code&gt; header; back off when it hits 0.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency&lt;/strong&gt; — repeating the same API call returns the same rows (modulo true new data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth&lt;/strong&gt; — OAuth refresh tokens, API keys via secrets manager (never in code).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Stripe &lt;code&gt;charges&lt;/code&gt; extract that pages through ~10,000 records per day.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;request&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/charges?limit=100&amp;amp;created[gte]=…&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100 charges + &lt;code&gt;has_more=true&lt;/code&gt; + last record id &lt;code&gt;cha_99&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GET /v1/charges?limit=100&amp;amp;starting_after=cha_99&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100 more + &lt;code&gt;has_more=true&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-99&lt;/td&gt;
&lt;td&gt;repeat with new &lt;code&gt;starting_after&lt;/code&gt; cursor&lt;/td&gt;
&lt;td&gt;100 each&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;last page&lt;/td&gt;
&lt;td&gt;12 charges + &lt;code&gt;has_more=false&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total&lt;/td&gt;
&lt;td&gt;99 × 100 + 12 = 9,912&lt;/td&gt;
&lt;td&gt;written to S3 as one JSON Lines file&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The first call grabs the first 100 charges with &lt;code&gt;limit=100&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Each response contains &lt;code&gt;has_more=true&lt;/code&gt; plus the ID of the last record (&lt;code&gt;cha_99&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;The next call uses &lt;code&gt;starting_after=cha_99&lt;/code&gt; to fetch the next page — that's cursor-based pagination.&lt;/li&gt;
&lt;li&gt;Loop until &lt;code&gt;has_more=false&lt;/code&gt;; concatenate all pages into a single JSON Lines file.&lt;/li&gt;
&lt;li&gt;Persist the &lt;strong&gt;final cursor ID&lt;/strong&gt; so a retry knows exactly where to resume; without that, you re-extract from the start every time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A simple Python pagination loop with rate-limit handling:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;created[gte]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1715040000&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;starting_after&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.stripe.com/v1/charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;STRIPE_SECRET_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                &lt;span class="c1"&gt;# rate-limited
&lt;/span&gt;        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;has_more&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/charges.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rec&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; assume every API call can fail; build retries, backoff, and cursor-resume in from day one.&lt;/p&gt;
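
&lt;p&gt;The loop above sleeps on an explicit &lt;code&gt;429&lt;/code&gt;; transient network errors deserve the same treatment. A sketch of the usual generalization, capped exponential backoff with jitter (the retry budget and delays are arbitrary choices, not API requirements):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random
import time

import requests

def get_with_backoff(session, url, params, max_attempts=5):
    """GET with capped exponential backoff plus jitter on 429s and connection errors."""
    for attempt in range(max_attempts):
        try:
            r = session.get(url, params=params, timeout=30)
            if r.status_code == 429:
                # honor the server's hint when present, else back off exponentially
                delay = int(r.headers.get("Retry-After", 2 ** attempt))
            else:
                r.raise_for_status()  # non-429 errors still fail loudly
                return r
        except requests.ConnectionError:
            delay = 2 ** attempt
        # full jitter keeps many retrying clients from synchronizing their bursts
        time.sleep(min(delay, 60) * random.random())
    raise RuntimeError(f"{url}: giving up after {max_attempts} attempts")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;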

&lt;h4&gt;
  
  
  Files and SaaS — schema drift and the contract problem
&lt;/h4&gt;

&lt;p&gt;The file-source invariant: &lt;strong&gt;CSV / Excel / JSON dumps from third parties are the highest-drift source in any pipeline — a column rename, a date-format change, or a quote-character switch silently breaks the load; defend with strict schema validation at ingest and explicit alerts on drift&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CSV&lt;/strong&gt; — &lt;code&gt;csv&lt;/code&gt; module or &lt;code&gt;pandas.read_csv&lt;/code&gt; with explicit &lt;code&gt;dtype&lt;/code&gt; map and &lt;code&gt;parse_dates&lt;/code&gt; list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Excel&lt;/strong&gt; — &lt;code&gt;openpyxl&lt;/code&gt; for &lt;code&gt;.xlsx&lt;/code&gt;, but really pressure the source to send CSV / Parquet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON / JSONL&lt;/strong&gt; — line-delimited JSON for streaming-friendly reads; flatten nested objects in Transform, not Extract (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation&lt;/strong&gt; — assert column names, types, and required-not-null status at ingest; fail loudly on drift.&lt;/li&gt;
&lt;/ul&gt;
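
&lt;p&gt;When the Transform moment for that JSONL flattening arrives, &lt;code&gt;pandas.json_normalize&lt;/code&gt; does the heavy lifting. A minimal sketch; the bronze path is hypothetical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import pandas as pd

# bronze keeps the nested payload verbatim; Transform flattens it
with open("/tmp/contacts.jsonl") as fh:
    records = [json.loads(line) for line in fh]

# one column per nested field: {"address": {"city": ...}} becomes column "address.city"
flat = pd.json_normalize(records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;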

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A nightly CSV export from a CRM that quietly renamed &lt;code&gt;Email&lt;/code&gt; → &lt;code&gt;email_address&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;columns extracted&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-09&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Name, Email, Phone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;parses fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Name, email_address, Phone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;column &lt;code&gt;Email&lt;/code&gt; not found → loud error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-10 (no validation)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Name, email_address, Phone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;silently writes NULL into &lt;code&gt;email&lt;/code&gt; — dashboard breaks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CRM team renamed &lt;code&gt;Email&lt;/code&gt; to &lt;code&gt;email_address&lt;/code&gt; in a release note nobody read.&lt;/li&gt;
&lt;li&gt;The extract script asks for &lt;code&gt;df["Email"]&lt;/code&gt;; pandas raises &lt;code&gt;KeyError&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;With strict validation, the script alerts on-call and halts before writing bad data.&lt;/li&gt;
&lt;li&gt;Without validation, the script writes &lt;code&gt;NULL&lt;/code&gt; for every email; the downstream dashboard's "users with no email" panel jumps from 0% to 100% overnight.&lt;/li&gt;
&lt;li&gt;The fix is to assert column presence and types at ingest — and to publish that schema to the source team as a contract.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A defensive CSV ingest with schema assertion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;EXPECTED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;object&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crm_contacts.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;EXPECTED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;EXPECTED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CRM CSV missing columns: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;missing&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# only after validation, write to bronze
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/contacts.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every file source has a contract — declare it in code and fail loudly when the source breaks it.&lt;/p&gt;
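
&lt;p&gt;Column presence is only half the contract. A sketch of the fuller check from step 5, types plus required-not-null, layered on the same &lt;code&gt;EXPECTED&lt;/code&gt; map (the 1% null tolerance is an assumed business rule):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

EXPECTED = {"Name": "object", "Email": "object", "Phone": "object"}
REQUIRED_NOT_NULL = ["Email"]  # assumed rule: every contact needs an email

df = pd.read_csv("crm_contacts.csv", dtype=EXPECTED)

missing = set(EXPECTED) - set(df.columns)
if missing:
    raise RuntimeError(f"CRM CSV missing columns: {missing}")

wrong = {c: str(df[c].dtype) for c in EXPECTED if str(df[c].dtype) != EXPECTED[c]}
if wrong:
    raise RuntimeError(f"CRM CSV type drift: {wrong}")

for col in REQUIRED_NOT_NULL:
    null_rate = df[col].isna().mean()
    if null_rate &gt; 0.01:  # fail loudly instead of silently writing NULLs downstream
        raise RuntimeError(f"{col}: {null_rate:.1%} nulls exceeds the 1% tolerance")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;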

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Pointing extract queries at the OLTP primary instead of a read replica or CDC stream.&lt;/li&gt;
&lt;li&gt;Skipping pagination — pulling page 1 of 100 and assuming the API gave you everything.&lt;/li&gt;
&lt;li&gt;Storing API secrets in code instead of a secrets manager / environment variable.&lt;/li&gt;
&lt;li&gt;Trusting the source schema without validation — silent drift becomes silent data loss.&lt;/li&gt;
&lt;li&gt;Forgetting to persist the cursor / watermark — failures force a full re-extract.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ETL Interview Question on Extracting from a Rate-Limited API
&lt;/h3&gt;

&lt;p&gt;A pipeline needs to pull &lt;code&gt;users&lt;/code&gt; data from a third-party API daily. The API returns 200 users per page, is paginated by &lt;code&gt;next_cursor&lt;/code&gt;, and rate-limits at 60 requests / minute. The user base is ~50,000. &lt;strong&gt;Walk through the extract design that finishes in under 30 minutes without hitting rate-limit errors.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Cursor-Based Pagination + Rate-Limit-Aware Backoff
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;API&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.example.com/v1/users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;TOKEN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;PAGE&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Session&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TOKEN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page_size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PAGE&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_cursor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;

    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;429&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retry-After&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
        &lt;span class="k"&gt;continue&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;cursor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next_cursor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="c1"&gt;# Stay well under 60 req/min — sleep 1.1s between requests
&lt;/span&gt;    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/users.jsonl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;fh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the 50k-user extract:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pages required&lt;/td&gt;
&lt;td&gt;⌈50,000 / 200⌉ = &lt;strong&gt;250&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time per request (with 1.1s sleep)&lt;/td&gt;
&lt;td&gt;~1.4 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total wall-clock time&lt;/td&gt;
&lt;td&gt;250 × 1.4 s ≈ &lt;strong&gt;6 min&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rate-limit headroom&lt;/td&gt;
&lt;td&gt;~43 req / min (60 s ÷ 1.4 s; under the 60 cap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;failure mode handled&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;429&lt;/code&gt; → backoff per &lt;code&gt;Retry-After&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;final output&lt;/td&gt;
&lt;td&gt;one JSONL file with 50,000 user rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; a single JSONL file (&lt;code&gt;users.jsonl&lt;/code&gt;) with one row per user. The cursor design means a retry resumes from the failure point, not the beginning. Total wall-clock: ~6 min, well under the 30-min budget. Zero rate-limit errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cursor pagination + &lt;code&gt;next_cursor&lt;/code&gt; follow&lt;/strong&gt; — extracts every user without skipping or duplicating pages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1.1-second sleep between calls&lt;/strong&gt; — keeps the request rate at ~43 req / min, comfortably under the 60-req / min limit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;429&lt;/code&gt; → &lt;code&gt;Retry-After&lt;/code&gt; backoff&lt;/strong&gt; — handles bursty rate-limit events without crashing the pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSONL output&lt;/strong&gt; — streaming-friendly; downstream Transform can read line-by-line without loading the whole file into memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persisted cursor between runs&lt;/strong&gt; — extend the script to write &lt;code&gt;cursor&lt;/code&gt; to S3 / a database so the next day resumes from there (sketched below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(|users| / page_size)&lt;/code&gt; requests; bounded by the rate limit; failure recovery is &lt;code&gt;O(1)&lt;/code&gt; cursor reload.&lt;/li&gt;
&lt;/ul&gt;
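
&lt;p&gt;The persisted-cursor bullet above, sketched with S3 as the state store. The bucket and key are hypothetical, and &lt;code&gt;boto3&lt;/code&gt; is assumed to be configured with credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-lake-state", "extract/users/cursor.txt"  # hypothetical names

def load_cursor():
    """Return the last persisted cursor, or None on the very first run."""
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        return obj["Body"].read().decode("utf-8") or None
    except s3.exceptions.NoSuchKey:
        return None

def save_cursor(cursor):
    # call after each page is safely written, so a retry resumes mid-extract
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=cursor.encode("utf-8"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Seed the loop with &lt;code&gt;cursor = load_cursor()&lt;/code&gt; and call &lt;code&gt;save_cursor(cursor)&lt;/code&gt; at the bottom of each iteration; everything else in the solution stays the same.&lt;/p&gt;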

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; more &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice problems&lt;/a&gt; for API-extraction loops and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for full pipeline shapes.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Transform — cleaning, dedup, standardization, aggregation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where raw data becomes useful — the meat of every ETL pipeline
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Transform&lt;/strong&gt; stage is where 80% of an ETL pipeline's value is created. Raw data lands as-is, and Transform applies the cleaning, deduplication, type coercion, joining, and business-rule logic that turn it into something consumers can trust. In modern pipelines, Transform is usually SQL (dbt, Spark SQL) on top of a staged copy of the raw data — easy to test, easy to version, easy to backfill.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu4dxguzfsgo3mverz7y.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdu4dxguzfsgo3mverz7y.jpeg" alt="Transformation stages diagram showing four sequential boxes — Cleaning (remove duplicates and bad rows), Standardization (uniform date and casing), Aggregation (daily / monthly totals), and Business rules (CASE WHEN High vs Normal value) — connected by arrows, with sample before / after mini-tables under each stage and PipeCode brand purple and orange accents on a light card." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Transform logic should be &lt;strong&gt;idempotent and unit-testable&lt;/strong&gt;. A transformation is "pure" when its output depends only on its input — no &lt;code&gt;NOW()&lt;/code&gt;, no random IDs, no calls to external services. That property is what lets you backfill, rerun, and refactor without fear.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Cleaning and deduplication
&lt;/h4&gt;

&lt;p&gt;The cleaning invariant: &lt;strong&gt;bronze data contains duplicates from CDC replays, network retries, and late-arriving records; silver must collapse them to one canonical row per business key, choosing the latest version when conflicts exist&lt;/strong&gt;. The standard pattern is &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Duplicates&lt;/strong&gt; — same business key with multiple physical rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; dedup&lt;/strong&gt; — partition by business key, order by latest &lt;code&gt;source_ts&lt;/code&gt;, keep &lt;code&gt;rn = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bad rows&lt;/strong&gt; — invalid types, broken refs, business-rule violations; quarantine to a separate &lt;code&gt;reject&lt;/code&gt; table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE&lt;/code&gt;&lt;/strong&gt; — replace nulls with a known default at the boundary, not inside business logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Bronze &lt;code&gt;orders&lt;/code&gt; arrives with two rows for &lt;code&gt;order_id = 448&lt;/code&gt; due to a CDC retry.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;source_ts&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-05-11 09:30:00&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-05-11 09:30:15&lt;/td&gt;
&lt;td&gt;520 (corrected)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;449&lt;/td&gt;
&lt;td&gt;2026-05-11 10:00:00&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze has two rows for order 448 — the first one (500) is the original CDC event, the second (520) is the post-correction CDC event.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)&lt;/code&gt; numbers them: the 09:30:15 row gets &lt;code&gt;rn = 1&lt;/code&gt;, the 09:30:00 row gets &lt;code&gt;rn = 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For order 449, only one row → &lt;code&gt;rn = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE rn = 1&lt;/code&gt; keeps the latest version of order 448 (the corrected $520) and the only version of order 449.&lt;/li&gt;
&lt;li&gt;Silver now has one deterministic row per &lt;code&gt;order_id&lt;/code&gt; — joins and aggregates produce the right numbers.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Dedup in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; dedup belongs in the silver layer and the silver layer alone; if downstream needs to dedup again, your silver contract is leaking.&lt;/p&gt;
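
&lt;p&gt;The same keep-the-latest dedup for teams whose silver build runs in pandas rather than SQL. A sketch, assuming the bronze partition is already on disk as Parquet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

bronze = pd.read_parquet("/tmp/bronze_orders.parquet")  # hypothetical path

# equivalent of ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
silver = (
    bronze.sort_values("source_ts")  # oldest first ...
    .drop_duplicates(subset="order_id", keep="last")  # ... so "last" keeps the latest
    .reset_index(drop=True)
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;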

&lt;h4&gt;
  
  
  Standardization — types, casing, dates, units
&lt;/h4&gt;

&lt;p&gt;The standardization invariant: &lt;strong&gt;source systems disagree on date format, capitalization, currency, and units; Transform converts everything to one canonical representation so downstream queries can compare them without &lt;code&gt;LOWER()&lt;/code&gt; calls in every &lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dates&lt;/strong&gt; — &lt;code&gt;'11/05/26'&lt;/code&gt;, &lt;code&gt;'2026-05-11'&lt;/code&gt;, &lt;code&gt;'May 11'&lt;/code&gt; → all become &lt;code&gt;DATE '2026-05-11'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Casing&lt;/strong&gt; — &lt;code&gt;'Ram'&lt;/code&gt;, &lt;code&gt;'RAM'&lt;/code&gt;, &lt;code&gt;'ram'&lt;/code&gt; → &lt;code&gt;LOWER(TRIM(name))&lt;/code&gt; → &lt;code&gt;'ram'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Currency&lt;/strong&gt; — multiple feeds in USD, EUR, INR → convert to one reporting currency in silver.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Units&lt;/strong&gt; — distance in miles vs km, weight in lb vs kg — canonicalize once at ingest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows from three sources with three date formats:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;raw date&lt;/th&gt;
&lt;th&gt;canonical date&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Postgres app&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;2026-05-11&lt;/code&gt; (ISO)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-11&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CRM CSV&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;11/05/26&lt;/code&gt; (DD/MM/YY)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-11&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API JSON&lt;/td&gt;
&lt;td&gt;&lt;code&gt;"May 11, 2026"&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-05-11&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Postgres uses ISO 8601 (&lt;code&gt;YYYY-MM-DD&lt;/code&gt;) natively — no transformation needed.&lt;/li&gt;
&lt;li&gt;The CRM exports DD/MM/YY — needs &lt;code&gt;TO_DATE(raw_date, 'DD/MM/YY')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The API returns prose-style &lt;code&gt;"May 11, 2026"&lt;/code&gt; — needs &lt;code&gt;TO_DATE(raw_date, 'Mon DD, YYYY')&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After Transform, every row has a single &lt;code&gt;DATE&lt;/code&gt; value in the canonical column.&lt;/li&gt;
&lt;li&gt;Downstream SQL is then trivial: &lt;code&gt;WHERE order_date = '2026-05-11'&lt;/code&gt; works across all three sources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Standardize dates and casing in one pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'^&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{4}-&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}-&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}$'&lt;/span&gt;   &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'^&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}/&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}/&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}$'&lt;/span&gt;   &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;TO_DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'DD/MM/YY'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;raw_date&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'^[A-Z][a-z]{2} &lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{1,2}, &lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{4}$'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;TO_DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Mon DD, YYYY'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;                               &lt;span class="c1"&gt;-- send to reject table&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;                                     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_raw&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; standardize at the bronze → silver boundary; never let two different formats coexist past that line.&lt;/p&gt;

&lt;h4&gt;
  
  
  Aggregation and business rules
&lt;/h4&gt;

&lt;p&gt;The aggregation invariant: &lt;strong&gt;silver carries one row per source event; gold often carries pre-aggregated metrics for fast dashboard reads — daily revenue, monthly active users, average order value per cohort; the aggregation logic is a SQL &lt;code&gt;GROUP BY&lt;/code&gt; plus business-rule &lt;code&gt;CASE WHEN&lt;/code&gt; expressions&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily totals&lt;/strong&gt; — &lt;code&gt;SELECT order_date, SUM(amount) FROM silver.orders GROUP BY order_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CASE WHEN&lt;/code&gt;&lt;/strong&gt; — classify rows into business buckets at aggregation time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window aggregates&lt;/strong&gt; — running totals, rolling averages, MoM deltas (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cohort metrics&lt;/strong&gt; — &lt;code&gt;GROUP BY signup_month, days_since_signup&lt;/code&gt; for retention curves.&lt;/li&gt;
&lt;/ul&gt;
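
&lt;p&gt;A rolling-window sketch for the window-aggregate bullet. It reads the &lt;code&gt;gold.daily_revenue&lt;/code&gt; mart built later in this section and approximates the MoM delta as a 30-row lag; treat the column names as assumptions, not a fixed schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 7-day moving average and an approximate month-over-month delta per day.
SELECT
    order_date,
    total_revenue,
    AVG(total_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    )                                                     AS revenue_7d_avg,
    total_revenue
      - LAG(total_revenue, 30) OVER (ORDER BY order_date) AS revenue_mom_delta
FROM gold.daily_revenue;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;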

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A daily revenue rollup with a &lt;code&gt;High Value&lt;/code&gt; business rule.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;order_count&lt;/th&gt;
&lt;th&gt;total_revenue&lt;/th&gt;
&lt;th&gt;high_value_count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-09&lt;/td&gt;
&lt;td&gt;4,210&lt;/td&gt;
&lt;td&gt;1,051,200&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;4,832&lt;/td&gt;
&lt;td&gt;1,224,500&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;5,118&lt;/td&gt;
&lt;td&gt;1,387,420&lt;/td&gt;
&lt;td&gt;52&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Silver &lt;code&gt;orders&lt;/code&gt; has one row per order with &lt;code&gt;order_date&lt;/code&gt;, &lt;code&gt;amount&lt;/code&gt;, and &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The aggregate &lt;code&gt;GROUP BY order_date&lt;/code&gt; collapses to one row per date.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount)&lt;/code&gt; produces the daily total revenue.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COUNT(*) FILTER (WHERE amount &amp;gt; 10000)&lt;/code&gt; produces the count of &lt;code&gt;High Value&lt;/code&gt; orders for that day.&lt;/li&gt;
&lt;li&gt;The result lands in &lt;code&gt;gold.daily_revenue&lt;/code&gt; for the dashboard to read in milliseconds — no full-table scan per page load.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Daily rollup with business rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_revenue&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;high_value_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;order_count&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_revenue&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;high_value_count&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;high_value_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never recompute aggregates inside a BI tool when the warehouse can pre-compute them at load time.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping &lt;code&gt;ROW_NUMBER&lt;/code&gt; dedup and assuming "the source won't send duplicates" — every CDC pipeline eventually retries (a minimal dedup sketch follows this list).&lt;/li&gt;
&lt;li&gt;Mixing canonical and source-system date formats — every downstream query needs &lt;code&gt;WHERE&lt;/code&gt; casts.&lt;/li&gt;
&lt;li&gt;Doing aggregation inside the BI tool instead of in the warehouse — slow dashboards, no reuse.&lt;/li&gt;
&lt;li&gt;Putting business rules in Transform code instead of declarative SQL — harder to test, harder to version.&lt;/li&gt;
&lt;li&gt;Forgetting to write rejected rows to a &lt;code&gt;reject&lt;/code&gt; table — quietly losing data on cleanup.&lt;/li&gt;
&lt;/ul&gt;
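
&lt;p&gt;A minimal keep-latest dedup sketch for the first mistake above, assuming a &lt;code&gt;source_ts&lt;/code&gt; column carries the event timestamp:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per order_id survives; the newest source_ts wins.
SELECT *
FROM (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY source_ts DESC
        ) AS rn
    FROM bronze.orders_raw o
) ranked
WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;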

&lt;h3&gt;
  
  
  ETL Interview Question on Cleaning Drifted Source Data
&lt;/h3&gt;

&lt;p&gt;A CRM dumps a daily CSV with &lt;code&gt;customer_name&lt;/code&gt; values like &lt;code&gt;Ram&lt;/code&gt;, &lt;code&gt;RAM&lt;/code&gt;, &lt;code&gt;Ram&lt;/code&gt; (trailing space), and &lt;code&gt;Ram@&lt;/code&gt; (corrupted). The downstream dashboard counts distinct customers and currently reports 4 unique names when the truth is 1. &lt;strong&gt;Write the Transform step that produces a single canonical &lt;code&gt;customer_key&lt;/code&gt; per real-world customer and quarantines the corrupted row.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LOWER + TRIM + REGEXP&lt;/code&gt; + Reject-Table Quarantine
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 1) Quarantine corrupted rows&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_reject&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;raw_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'invalid_char'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rejected_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="o"&gt;!~&lt;/span&gt; &lt;span class="s1"&gt;'^[A-Za-z][A-Za-z .-]+$'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 2) Standardise the good rows into silver&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;raw_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;INITCAP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;full_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt; &lt;span class="s1"&gt;'^[A-Za-z][A-Za-z .-]+$'&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;customer_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;full_name&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;full_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 3) Verify&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;unique_customers&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the cleanup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;raw row&lt;/th&gt;
&lt;th&gt;regex pass?&lt;/th&gt;
&lt;th&gt;customer_key&lt;/th&gt;
&lt;th&gt;landed in&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;silver.customers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;RAM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;silver.customers (same key as above)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ram&lt;/code&gt; (trailing space)&lt;/td&gt;
&lt;td&gt;✓ (&lt;code&gt;TRIM&lt;/code&gt; strips the space)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ram&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;silver.customers (same key)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Ram@&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗ (&lt;code&gt;@&lt;/code&gt; not allowed)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;silver.customer_reject&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;code&gt;silver.customers&lt;/code&gt; has 3 rows mapped to a single &lt;code&gt;customer_key = 'ram'&lt;/code&gt;; the dashboard now correctly reports 1 unique customer. The corrupted row sits in &lt;code&gt;silver.customer_reject&lt;/code&gt; for the data steward to investigate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quarantine before standardize&lt;/strong&gt; — corrupted rows go to a separate table; clean rows enter silver; you never silently drop data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOWER(TRIM(...))&lt;/code&gt; as the canonical key&lt;/strong&gt; — collapses case + whitespace variants into one bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INITCAP(TRIM(...))&lt;/code&gt; for display name&lt;/strong&gt; — produces a clean human-readable version while keeping the join key normalized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regex gate &lt;code&gt;^[A-Za-z][A-Za-z .-]+$&lt;/code&gt;&lt;/strong&gt; — explicit allow-list of valid name characters; everything else routes to reject.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotent &lt;code&gt;INSERT ... ON CONFLICT&lt;/code&gt;&lt;/strong&gt; — rerunning produces the same final state; backfills are safe.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — single linear scan over bronze; reject volume is observable as a metric (alert when &amp;gt;1% of rows reject).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for cleanup queries and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for staged-transform patterns.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Load — destinations from warehouses to BI tools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Where clean data lands — warehouse, lake, lakehouse, or directly into a dashboard
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Load&lt;/strong&gt; stage writes the curated data to one or more destinations: a cloud data warehouse (Snowflake, Redshift, BigQuery), a data lake (S3, GCS), a lakehouse (Iceberg / Delta on object storage), or directly into a BI tool's cache. The right destination depends on the access pattern — interactive analyst SQL favours warehouses; ML feature stores favour lakes; cross-team reuse favours a shared lakehouse.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Loads should be &lt;strong&gt;partitioned and idempotent&lt;/strong&gt;. Partition by &lt;code&gt;ingest_date&lt;/code&gt; so backfills touch only the affected day, and use &lt;code&gt;INSERT OVERWRITE PARTITION&lt;/code&gt; or &lt;code&gt;MERGE&lt;/code&gt; so reruns don't duplicate rows.&lt;/p&gt;
&lt;/blockquote&gt;
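
&lt;p&gt;An idempotency sketch for the pro tip: an ANSI-style &lt;code&gt;MERGE&lt;/code&gt; (Snowflake / BigQuery / SQL Server flavor) against a hypothetical per-day staging table, so a rerun for the same day replaces rather than appends:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- staging.orders_2026_05_11 is an assumed staging table for one ingest_date.
MERGE INTO silver.orders AS tgt
USING staging.orders_2026_05_11 AS src
   ON tgt.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET
    order_date = src.order_date,
    amount     = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, order_date, amount)
    VALUES (src.order_id, src.order_date, src.amount);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;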

&lt;h4&gt;
  
  
  Data warehouses — Redshift, Snowflake, BigQuery
&lt;/h4&gt;

&lt;p&gt;The warehouse-load invariant: &lt;strong&gt;cloud warehouses (Redshift, Snowflake, BigQuery) prefer bulk loads from object storage; the canonical commands are &lt;code&gt;COPY INTO&lt;/code&gt; (Snowflake), &lt;code&gt;COPY&lt;/code&gt; (Redshift), and &lt;code&gt;LOAD DATA&lt;/code&gt; / &lt;code&gt;bq load&lt;/code&gt; (BigQuery); never use single-row &lt;code&gt;INSERT INTO ... VALUES&lt;/code&gt; for production loads — it's 10-100× slower and defeats columnar storage&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COPY INTO&lt;/code&gt; (Snowflake) / &lt;code&gt;COPY&lt;/code&gt; (Redshift)&lt;/strong&gt; — bulk-load Parquet / CSV / JSON from S3 / GCS in parallel (a Redshift-flavored sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;bq load&lt;/code&gt; (BigQuery)&lt;/strong&gt; — same shape; loads from GCS with auto-detect schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File splitting&lt;/strong&gt; — split source into &lt;code&gt;N × num_slices&lt;/code&gt; files for parallel ingest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COMPUPDATE ON&lt;/code&gt;&lt;/strong&gt; — auto-pick column compression on first load (Redshift); Snowflake does this automatically.&lt;/li&gt;
&lt;/ul&gt;
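
&lt;p&gt;A Redshift-flavored counterpart to the Snowflake &lt;code&gt;COPY INTO&lt;/code&gt; shown below; the bucket path and IAM role ARN are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Bulk-load one day's Parquet from S3; Redshift parallelizes across slices.
COPY silver.orders
FROM 's3://lake/silver/orders/ingest_date=2026-05-11/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'   -- placeholder ARN
FORMAT AS PARQUET;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;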

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily load of 50 GB of Parquet (written by the Transform stage) from S3 into Snowflake.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;command&lt;/th&gt;
&lt;th&gt;wall-clock&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1. Stage&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CREATE STAGE&lt;/code&gt; over the S3 prefix (external stage; no &lt;code&gt;PUT&lt;/code&gt; needed, the data is already in S3)&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2. Copy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COPY INTO orders FROM @stage/2026-05-11/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~3 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3. Verify&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT COUNT(*) FROM orders WHERE …&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&amp;lt;1 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Transform stage has already written 40 Parquet files of ~1.25 GB each to &lt;code&gt;s3://lake/silver/orders/ingest_date=2026-05-11/&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The warehouse stage points at that S3 prefix via an &lt;code&gt;EXTERNAL STAGE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COPY INTO orders FROM @stage/2026-05-11/&lt;/code&gt; ingests all 40 files in parallel across the warehouse's compute slices.&lt;/li&gt;
&lt;li&gt;The whole load finishes in 2-3 minutes — vs hours for the row-by-row &lt;code&gt;INSERT&lt;/code&gt; approach.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;STATUPDATE ON&lt;/code&gt; (Redshift) or auto-stats (Snowflake) refreshes the planner so subsequent queries pick the right plan.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A Snowflake &lt;code&gt;COPY INTO&lt;/code&gt; for daily orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;lake_stage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;ingest_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ON_ERROR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'ABORT_STATEMENT'&lt;/span&gt;
&lt;span class="n"&gt;PURGE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every production load uses bulk &lt;code&gt;COPY&lt;/code&gt;; reserve single-row &lt;code&gt;INSERT&lt;/code&gt; for tests and one-off corrections.&lt;/p&gt;

&lt;h4&gt;
  
  
  Data lakes — S3, GCS, ADLS with Iceberg / Delta
&lt;/h4&gt;

&lt;p&gt;The lake-load invariant: &lt;strong&gt;a data lake load is just a write to object storage in a columnar file format (Parquet / ORC); a lakehouse load wraps that write in a table format (Iceberg / Delta) that adds ACID, time travel, and partition evolution; both patterns scale storage and compute independently&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plain lake&lt;/strong&gt; — write Parquet to a prefix; register with a catalog (Glue, Hive Metastore).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse&lt;/strong&gt; — &lt;code&gt;INSERT INTO iceberg.orders&lt;/code&gt; or &lt;code&gt;MERGE INTO delta.orders&lt;/code&gt; — ACID on object storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning&lt;/strong&gt; — by &lt;code&gt;ingest_date&lt;/code&gt; or &lt;code&gt;event_date&lt;/code&gt; for prune-friendly reads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction&lt;/strong&gt; — periodic batch job rewrites many small files into fewer large ones (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
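
&lt;p&gt;A compaction sketch for the last bullet, using Iceberg's Spark SQL &lt;code&gt;rewrite_data_files&lt;/code&gt; procedure; &lt;code&gt;lake_catalog&lt;/code&gt; and the table name are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Binpack many small streaming files into fewer large ones.
CALL lake_catalog.system.rewrite_data_files(
    table    =&amp;gt; 'lakehouse.orders',
    strategy =&amp;gt; 'binpack'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;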

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Spark Structured Streaming job that writes micro-batches to an Iceberg table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;event count in batch&lt;/th&gt;
&lt;th&gt;files written&lt;/th&gt;
&lt;th&gt;total bytes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;09:00&lt;/td&gt;
&lt;td&gt;1,200&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;8 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:01&lt;/td&gt;
&lt;td&gt;980&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;7 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:02&lt;/td&gt;
&lt;td&gt;1,350&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;9 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Spark streaming job reads events from Kafka in 1-minute trigger windows.&lt;/li&gt;
&lt;li&gt;Each batch writes one Parquet file (~8 MB) to the Iceberg table's S3 location.&lt;/li&gt;
&lt;li&gt;The Iceberg metadata layer records a new snapshot per batch — ACID is preserved across concurrent writers.&lt;/li&gt;
&lt;li&gt;After an hour, the table has 60 small files; a nightly compaction job rewrites them into a single ~500 MB file for better read performance.&lt;/li&gt;
&lt;li&gt;Trino, Spark, and Snowflake (via Iceberg external tables) can all read the same data without copying.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A Spark write to Iceberg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;writeStream&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iceberg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;outputMode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/lakehouse/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpointLocation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/checkpoints/orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trigger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processingTime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1 minute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; \
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if multiple engines need to read the same data (Spark + Trino + Snowflake), use a lakehouse table format; if only the warehouse reads it, a managed warehouse table is simpler.&lt;/p&gt;

&lt;h4&gt;
  
  
  BI tools and serving layers
&lt;/h4&gt;

&lt;p&gt;The serving-load invariant: &lt;strong&gt;BI tools (Tableau, Power BI, Looker, Metabase) read from the warehouse / lakehouse; the load stage's job is to materialize the exact shape the dashboard expects so the BI tool runs sub-second queries&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pre-aggregated marts&lt;/strong&gt; — gold-layer tables shaped for one dashboard each.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialized views&lt;/strong&gt; — warehouse-native auto-refresh of frequently queried aggregates (a sketch follows this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Caching layer&lt;/strong&gt; — BI tools cache for 5-60 min after the load completes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reverse-ETL&lt;/strong&gt; — push curated data back to operational systems (Salesforce, HubSpot).&lt;/li&gt;
&lt;/ul&gt;
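
&lt;p&gt;A materialized-view sketch for the second bullet, shown in Postgres syntax; Snowflake and BigQuery have their own &lt;code&gt;CREATE MATERIALIZED VIEW&lt;/code&gt; variants with automatic refresh:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Same shape as the gold mart, maintained by the warehouse.
CREATE MATERIALIZED VIEW gold.daily_revenue_mv AS
SELECT
    order_date   AS date_key,
    SUM(amount)  AS revenue_total,
    COUNT(*)     AS order_count
FROM silver.orders
GROUP BY order_date;

-- Postgres refresh after each load (Snowflake / BigQuery refresh automatically).
REFRESH MATERIALIZED VIEW gold.daily_revenue_mv;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;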

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;gold.daily_revenue&lt;/code&gt; mart powering the CFO dashboard.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dashboard query&lt;/th&gt;
&lt;th&gt;source table&lt;/th&gt;
&lt;th&gt;rows scanned&lt;/th&gt;
&lt;th&gt;response time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Yesterday's revenue&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gold.daily_revenue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&amp;lt;100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Last 30 days&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gold.daily_revenue&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;&amp;lt;200 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Without the mart (raw silver)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;silver.orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5 M&lt;/td&gt;
&lt;td&gt;~5 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The CFO dashboard wants "revenue per day for the last 30 days" — a tiny output but a huge underlying scan.&lt;/li&gt;
&lt;li&gt;Without a gold mart, the BI tool would aggregate 5 M silver rows on every refresh — ~5 s per refresh.&lt;/li&gt;
&lt;li&gt;With a &lt;code&gt;gold.daily_revenue&lt;/code&gt; mart (one row per day), the BI tool reads 30 rows in &amp;lt;200 ms.&lt;/li&gt;
&lt;li&gt;The Load step writes one row per day to &lt;code&gt;gold.daily_revenue&lt;/code&gt; after the silver layer finishes.&lt;/li&gt;
&lt;li&gt;End-to-end: ETL produces the mart; the BI tool reads the mart; the CFO sees a sub-second dashboard.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; The Load step that produces the daily mart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;daily_revenue&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;revenue_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt;
    &lt;span class="n"&gt;revenue_total&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;revenue_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_count&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXCLUDED&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every dashboard should read from a gold-layer mart, never from silver — fast dashboards keep stakeholder trust.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Loading row-by-row with &lt;code&gt;INSERT INTO ... VALUES&lt;/code&gt; instead of bulk &lt;code&gt;COPY INTO&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgetting to partition the destination table — queries scan the whole table for one day's data.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;ON CONFLICT&lt;/code&gt; / &lt;code&gt;MERGE&lt;/code&gt; and discovering duplicates on the second run.&lt;/li&gt;
&lt;li&gt;Letting BI tools query silver directly — slow dashboards, no reuse.&lt;/li&gt;
&lt;li&gt;Not refreshing planner statistics after load — wrong join plans, 10-100× slower queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ETL Interview Question on Choosing a Load Destination
&lt;/h3&gt;

&lt;p&gt;A retail company has 50 TB of clickstream events generated daily, plus a 5 GB curated &lt;code&gt;gold.fact_orders&lt;/code&gt; table that powers BI dashboards. &lt;strong&gt;For each, recommend the load destination and the load command — and justify the choice.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Lakehouse for Clickstream + Managed Warehouse for Gold
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLICKSTREAM (50 TB / day)
  Destination: S3 + Iceberg (lakehouse)
  Command:    Spark Structured Streaming with .writeStream.format("iceberg")
  Reason:     too big for managed warehouse storage cost; date-partition-pruned access; ML reuse

GOLD.FACT_ORDERS (5 GB)
  Destination: Snowflake managed table
  Command:    COPY INTO gold.fact_orders FROM @stage/orders/...
  Reason:     hot, joined, sub-second dashboard reads; ACID; analyst-friendly SQL ergonomics

CROSS-LAYER JOIN
  SELECT c.region, COUNT(*) clicks, SUM(o.amount) revenue
  FROM lake.clickstream c
  JOIN gold.fact_orders o ON o.user_id = c.user_id
  WHERE c.event_date = '2026-05-11'
  GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the architectural decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;What's the data volume?&lt;/td&gt;
&lt;td&gt;clickstream 50 TB, gold 5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Hot or cold?&lt;/td&gt;
&lt;td&gt;clickstream cold (date-pruned access); gold hot (every dashboard refresh)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Single-engine or multi-engine?&lt;/td&gt;
&lt;td&gt;clickstream needs ML (Spark) + analyst SQL (Trino); gold only needs warehouse SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Pick destination for clickstream&lt;/td&gt;
&lt;td&gt;S3 + Iceberg (lakehouse) — open format, ACID, multi-engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Pick destination for gold&lt;/td&gt;
&lt;td&gt;Snowflake managed — fast, ergonomic, ACID across multi-table updates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Cross-layer joins?&lt;/td&gt;
&lt;td&gt;yes — Snowflake reads Iceberg via external tables&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the clickstream lakehouse holds 50 TB in S3 at ~$1,150/month; the gold warehouse holds 5 GB in Snowflake at ~$200/month; the BI tool reads gold in &amp;lt;200 ms; the ML pipeline reads clickstream features directly from S3 without copying.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse for volume + ML reuse&lt;/strong&gt; — 50 TB at warehouse storage cost would be ~$5K/month; on S3 it's ~$1,150 and ML can read the files directly without an export step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed warehouse for hot SQL&lt;/strong&gt; — gold is small, hot, frequently joined; warehouse storage cost is negligible at 5 GB; SQL ergonomics + sub-second response is what BI users need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg as the open boundary&lt;/strong&gt; — Snowflake reads Iceberg natively; no nightly copy job, no schema drift between systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark Structured Streaming for clickstream&lt;/strong&gt; — micro-batch writes; ACID via Iceberg; replay-friendly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COPY INTO&lt;/code&gt; for gold&lt;/strong&gt; — bulk Parquet load; sub-3-min load time; auto-compression.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — each layer carries its own cost characteristic; the right destination per workload is what keeps the AWS bill sane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; more &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice problems&lt;/a&gt; for load patterns and the &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling practice&lt;/a&gt; page for star-schema design.&lt;/p&gt;





&lt;h2&gt;
  
  
  5. ETL vs ELT — transform before or after loading
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to push transforms to the warehouse vs run them upstream
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;ETL vs ELT&lt;/strong&gt; distinction is the single most-asked architecture question in fresher data-engineering interviews. The mental model: &lt;strong&gt;ETL transforms data &lt;em&gt;before&lt;/em&gt; loading it into the warehouse — clean Python / Spark jobs land curated rows in the destination; ELT loads raw data into the warehouse &lt;em&gt;first&lt;/em&gt;, then transforms it using the warehouse's own SQL engine (dbt is the canonical example)&lt;/strong&gt;. Modern cloud warehouses tilt the answer toward ELT because compute is elastic and SQL is the most-debugged transform language on earth.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xoiapnv3jekto6x5yvk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xoiapnv3jekto6x5yvk.jpeg" alt="Side-by-side comparison of ETL and ELT data pipelines: ETL row shows Sources → Transform (Spark / Python) → Warehouse (curated); ELT row shows Sources → Warehouse (raw) → Transform (dbt SQL) → Warehouse (curated); a third row at the bottom highlights modern cloud warehouses (Snowflake, BigQuery, Redshift) under ELT with a purple highlight, and PipeCode brand accents on a light card." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the textbook "ETL vs ELT" answer is "depends on the warehouse" — but in interviews, name the &lt;strong&gt;specific feature&lt;/strong&gt; that flips the choice. ELT wins when the warehouse has elastic compute (Snowflake's separate warehouses, BigQuery's slots) and a mature SQL transform layer (dbt). ETL wins when the warehouse can't afford the raw-data storage cost or when transforms are non-SQL (image processing, ML feature engineering).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  ETL — transform before loading
&lt;/h4&gt;

&lt;p&gt;The ETL invariant: &lt;strong&gt;raw data is transformed by an upstream compute layer (Python, Spark, custom services) into curated rows that land directly in the warehouse; the warehouse holds only the cleaned data; transforms happen on dedicated compute (often cheaper than warehouse credits)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compute layer&lt;/strong&gt; — Spark cluster, Python on Kubernetes, AWS Glue, custom services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse holds&lt;/strong&gt; — only the cleaned silver / gold tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage cost&lt;/strong&gt; — lower (no raw data in the warehouse).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — legacy warehouses without elastic compute, non-SQL transforms (ML features, image pipelines).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Python pipeline that cleans CSV → loads cleaned Parquet → Snowflake.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;runs on&lt;/th&gt;
&lt;th&gt;data shape&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Extract&lt;/td&gt;
&lt;td&gt;Python on EC2&lt;/td&gt;
&lt;td&gt;raw CSV (10 GB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;Python + pandas (4 CPUs)&lt;/td&gt;
&lt;td&gt;cleaned Parquet (3 GB compressed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Load&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COPY INTO snowflake.gold.orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warehouse holds 3 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A Python script reads the raw CSV from S3 into a pandas DataFrame.&lt;/li&gt;
&lt;li&gt;The script applies cleaning (&lt;code&gt;dropna&lt;/code&gt;, type coercion, business rules) — all in Python memory.&lt;/li&gt;
&lt;li&gt;The script writes the cleaned data as Parquet back to S3.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;COPY INTO&lt;/code&gt; ships the Parquet into Snowflake — only the cleaned 3 GB lands in the warehouse.&lt;/li&gt;
&lt;li&gt;The warehouse never sees the raw CSV; storage cost is bounded by the curated output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A minimal ETL outline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;snowflake.connector&lt;/span&gt;

&lt;span class="c1"&gt;# Extract
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/raw/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/staging/orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load — Snowflake COPY INTO from the staged Parquet
&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    COPY INTO gold.orders
    FROM @lake_stage/staging/orders.parquet
    FILE_FORMAT = (TYPE = PARQUET);
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; ETL is the right shape when transforms need a non-SQL runtime (Python ML features, image processing, custom validation services).&lt;/p&gt;

&lt;h4&gt;
  
  
  ELT — load first, transform inside the warehouse
&lt;/h4&gt;

&lt;p&gt;The ELT invariant: &lt;strong&gt;raw data is loaded directly into the warehouse with minimal transformation; transforms run as SQL inside the warehouse (typically orchestrated by dbt); the warehouse's elastic compute and parallel SQL engine handle the transform workload at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt; — dbt (canonical SQL transformation framework), Dataform, custom SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern&lt;/strong&gt; — &lt;code&gt;raw.orders&lt;/code&gt; → &lt;code&gt;silver.orders&lt;/code&gt; (dbt model) → &lt;code&gt;gold.fact_orders&lt;/code&gt; (dbt model).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage cost&lt;/strong&gt; — higher (warehouse holds both raw and curated data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — modern cloud warehouses with elastic compute (Snowflake, BigQuery, Databricks SQL).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A dbt project that builds silver and gold from raw.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;dbt model&lt;/th&gt;
&lt;th&gt;runs in&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;raw&lt;/td&gt;
&lt;td&gt;direct &lt;code&gt;COPY INTO raw.orders&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silver&lt;/td&gt;
&lt;td&gt;&lt;code&gt;models/silver/silver_orders.sql&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Snowflake SQL via dbt&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gold&lt;/td&gt;
&lt;td&gt;&lt;code&gt;models/gold/fact_orders.sql&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Snowflake SQL via dbt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The raw CSV is loaded straight into &lt;code&gt;raw.orders&lt;/code&gt; via &lt;code&gt;COPY INTO&lt;/code&gt; — no Python in the middle.&lt;/li&gt;
&lt;li&gt;A dbt model &lt;code&gt;silver_orders.sql&lt;/code&gt; reads from &lt;code&gt;raw.orders&lt;/code&gt; and applies dedup + type coercion as SQL.&lt;/li&gt;
&lt;li&gt;A downstream dbt model &lt;code&gt;fact_orders.sql&lt;/code&gt; reads from &lt;code&gt;silver_orders&lt;/code&gt; and applies aggregation.&lt;/li&gt;
&lt;li&gt;dbt runs the whole DAG on the warehouse's compute; transforms are SQL, version-controlled, testable.&lt;/li&gt;
&lt;li&gt;The whole "transform" stage is a &lt;code&gt;dbt run&lt;/code&gt; command — minutes for hundreds of models.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A dbt silver model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- models/silver/silver_orders.sql&lt;/span&gt;
&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'incremental'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;unique_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'order_id'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;source_ts&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;silver_loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'raw'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'orders'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_incremental&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;endif&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; ELT is the modern default; reach for it whenever your warehouse has elastic compute (Snowflake, BigQuery) and your transforms can be expressed as SQL.&lt;/p&gt;

&lt;h4&gt;
  
  
  When ELT beats ETL — modern cloud warehouses
&lt;/h4&gt;

&lt;p&gt;The cloud-warehouse invariant: &lt;strong&gt;elastic-compute warehouses (Snowflake's virtual warehouses, BigQuery's slot model, Redshift Concurrency Scaling) let you spin up and tear down transform compute on demand; the marginal cost of a transform job is small; SQL transforms become first-class with dbt&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt; — separate virtual warehouses per workload; &lt;code&gt;XS&lt;/code&gt; for cheap transforms, &lt;code&gt;XL&lt;/code&gt; for hourly loads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery&lt;/strong&gt; — slot-based pricing; transforms run on the same pool as queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redshift&lt;/strong&gt; — Concurrency Scaling for elastic warehouse compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Databricks SQL&lt;/strong&gt; — serverless SQL warehouse + Spark for non-SQL transforms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same daily transform run two ways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;compute&lt;/th&gt;
&lt;th&gt;wall-clock&lt;/th&gt;
&lt;th&gt;monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ETL (Python on EC2)&lt;/td&gt;
&lt;td&gt;dedicated EC2 instance&lt;/td&gt;
&lt;td&gt;30 min&lt;/td&gt;
&lt;td&gt;$300 (always-on)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ELT (dbt on Snowflake XS warehouse)&lt;/td&gt;
&lt;td&gt;warehouse, on-demand&lt;/td&gt;
&lt;td&gt;8 min&lt;/td&gt;
&lt;td&gt;$40 (pay-per-second)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The ETL approach runs a dedicated Python service on EC2 — the server is always on, even when the pipeline isn't running.&lt;/li&gt;
&lt;li&gt;The ELT approach runs &lt;code&gt;dbt run&lt;/code&gt; against a Snowflake XS warehouse — the warehouse spins up when the job starts and suspends when it finishes.&lt;/li&gt;
&lt;li&gt;Monthly cost: ~$300 (always-on EC2) vs ~$40 (pay-per-second warehouse); a back-of-envelope check follows this list.&lt;/li&gt;
&lt;li&gt;The SQL transforms in dbt are version-controlled, tested, and reviewable in PRs — easier collaboration than Python ETL.&lt;/li&gt;
&lt;li&gt;For most analytical workloads on modern warehouses, ELT is cheaper, faster, and easier to maintain.&lt;/li&gt;
&lt;/ol&gt;
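
&lt;p&gt;A back-of-envelope check on those numbers, as a sketch. The EC2 hourly rate and Snowflake credit price below are illustrative assumptions, not quoted prices; the shape of the gap is the point.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Assumptions (illustrative, not quoted prices):
#   EC2: ~$0.40/hour, always on
#   Snowflake XS: 1 credit/hour at ~$3/credit, billed per second
HOURS_PER_MONTH = 730

ec2_monthly = 0.40 * HOURS_PER_MONTH    # ~= $292, paid whether or not the job runs
run_hours   = (8 / 60) * 30             # one 8-minute dbt run per day
xs_monthly  = run_hours * 1 * 3.0       # ~= $12 of pure compute

print(f"EC2 ~ ${ec2_monthly:.0f}/mo vs Snowflake XS ~ ${xs_monthly:.0f}/mo")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under these assumptions the warehouse compute comes out even cheaper than the table's $40; extra runs, retries, and ad-hoc development queries make up the difference. Either way, the always-on instance loses.&lt;/p&gt;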

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A dbt project layout that captures the ELT pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;my_dbt_project/
├── dbt_project.yml
├── models/
│   ├── staging/
│   │   ├── stg_orders.sql        -- raw → typed
│   │   └── stg_customers.sql
│   ├── silver/
│   │   ├── silver_orders.sql     -- typed → deduped
│   │   └── silver_customers.sql
│   └── gold/
│       ├── fact_orders.sql        -- deduped → aggregated
│       └── dim_customer.sql
├── tests/
│   └── orders_amount_positive.sql -- assertion: amount &amp;gt; 0
└── snapshots/
    └── customer_history.sql       -- SCD2 snapshots
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; on Snowflake / BigQuery, ELT with dbt is the modern default for ~80% of analytical pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Saying "ELT is always better" — false; non-SQL transforms (ML features, image processing) still need ETL.&lt;/li&gt;
&lt;li&gt;Loading raw data straight into gold tables — skips the silver layer's dedup + type coercion contract.&lt;/li&gt;
&lt;li&gt;Running ELT on a legacy on-prem warehouse without elastic compute — transforms compete with analyst queries.&lt;/li&gt;
&lt;li&gt;Treating dbt as a "SQL runner" — it's also a testing framework, a docs generator, and a lineage tool (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Not separating raw / silver / gold layers in dbt — every model becomes a tangle.&lt;/li&gt;
&lt;/ul&gt;
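
&lt;p&gt;On that last point: dbt-core 1.5+ ships a programmatic runner, so the same project that runs models can run its tests and build its docs. A minimal sketch (the &lt;code&gt;--select silver&lt;/code&gt; selector assumes the layered project layout shown earlier):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: dbt as more than a SQL runner (requires dbt-core &amp;gt;= 1.5).
from dbt.cli.main import dbtRunner

dbt = dbtRunner()
dbt.invoke(["run", "--select", "silver"])     # build the silver models
dbt.invoke(["test", "--select", "silver"])    # run schema + data tests against them
dbt.invoke(["docs", "generate"])              # emit lineage + documentation artifacts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;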

&lt;h3&gt;
  
  
  ETL Interview Question on ETL vs ELT Pattern Selection
&lt;/h3&gt;

&lt;p&gt;A team is building a new analytics platform. Sources: 200 GB / day of clickstream events (Kafka), a 5 GB Postgres OLTP database, and a 1 GB nightly SaaS export. Warehouse choice: Snowflake. &lt;strong&gt;For each source, recommend ETL or ELT and justify.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using ELT for Most + ETL for Heavy-Compute Streaming
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CLICKSTREAM (200 GB / day, Kafka)
  Pattern: ETL upstream + ELT downstream
  Why:     volume is too high for raw-to-warehouse; Spark streaming pre-aggregates and lands hourly batches
  Tools:   Spark Structured Streaming → Iceberg → Snowflake external table

POSTGRES OLTP (5 GB)
  Pattern: ELT (raw load + dbt transform)
  Why:     small, schema-stable, SQL transforms natural
  Tools:   Fivetran CDC → Snowflake raw schema → dbt models for silver/gold

SAAS NIGHTLY EXPORT (1 GB)
  Pattern: ELT
  Why:     tiny, low-frequency; SQL-friendly transforms
  Tools:   Airbyte → Snowflake raw schema → dbt models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
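
&lt;p&gt;To make the clickstream leg concrete, here is a minimal sketch of the streaming pre-aggregation. Broker address, topic name, and bucket paths are placeholders, and the sink is plain Parquet to keep the sketch self-contained; production would write &lt;code&gt;format("iceberg")&lt;/code&gt; through an Iceberg catalog.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("clickstream_preagg").getOrCreate()

# Read raw events from Kafka (placeholder broker + topic).
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load())

# Pull one field out of the JSON payload and keep the Kafka timestamp.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.page").alias("page"),
    F.col("timestamp"),
)

# Pre-aggregate to hourly page counts -- the volume reduction that
# makes the warehouse side cheap.
hourly = (parsed
    .withWatermark("timestamp", "15 minutes")
    .groupBy(F.window("timestamp", "1 hour"), "page")
    .count())

(hourly.writeStream
    .outputMode("append")
    .format("parquet")
    .option("path", "s3://lake/bronze/clickstream/")
    .option("checkpointLocation", "s3://lake/_checkpoints/clickstream/")
    .start())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;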



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the pattern selection:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;freshness&lt;/th&gt;
&lt;th&gt;compute&lt;/th&gt;
&lt;th&gt;pattern&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clickstream&lt;/td&gt;
&lt;td&gt;200 GB / day&lt;/td&gt;
&lt;td&gt;sub-minute&lt;/td&gt;
&lt;td&gt;Spark streaming&lt;/td&gt;
&lt;td&gt;ETL upstream + ELT downstream&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres OLTP&lt;/td&gt;
&lt;td&gt;5 GB&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;td&gt;warehouse SQL&lt;/td&gt;
&lt;td&gt;ELT (Fivetran + dbt)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SaaS export&lt;/td&gt;
&lt;td&gt;1 GB&lt;/td&gt;
&lt;td&gt;daily&lt;/td&gt;
&lt;td&gt;warehouse SQL&lt;/td&gt;
&lt;td&gt;ELT (Airbyte + dbt)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; a hybrid architecture — Spark handles the high-volume streaming pre-aggregation (ETL pattern), then Snowflake + dbt handles the curated SQL transforms (ELT pattern) for everything that lands in the warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL for clickstream&lt;/strong&gt; — 200 GB / day raw into the warehouse is expensive; Spark pre-aggregates to a more compact form first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT for OLTP + SaaS&lt;/strong&gt; — small data, SQL-friendly transforms, version-controlled in dbt.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fivetran / Airbyte for raw load&lt;/strong&gt; — managed connectors handle schema evolution and incremental sync.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt for transforms&lt;/strong&gt; — SQL is reviewable, testable, and runs on Snowflake's elastic compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg + Snowflake external tables&lt;/strong&gt; — clickstream stays queryable from Snowflake without copying.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — Spark cluster for streaming (always-on); Snowflake compute pay-per-second for transforms; total bill stays bounded.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the structured warehouse-and-transform path see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; and the related &lt;a href="https://pipecode.ai/blogs/data-lake-architecture-data-engineering-interviews" rel="noopener noreferrer"&gt;data lake architecture for data engineering interviews&lt;/a&gt; blog.&lt;/p&gt;





&lt;h2&gt;
  
  
  6. ETL orchestration tools — Airflow, dbt, Spark, AWS Glue
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The tools that turn an ETL script into a production pipeline
&lt;/h3&gt;

&lt;p&gt;A single Python script that runs once is not a pipeline — it's a one-off job. Real production ETL needs &lt;strong&gt;scheduling, dependency management, retries, alerts, lineage, and observability&lt;/strong&gt;, and the modern stack is built around a handful of tools that each solve one slice of that problem: &lt;strong&gt;Apache Airflow&lt;/strong&gt; for orchestration / DAGs, &lt;strong&gt;dbt&lt;/strong&gt; for SQL transforms / tests / docs, &lt;strong&gt;Apache Spark&lt;/strong&gt; for distributed batch + streaming, &lt;strong&gt;AWS Glue / Azure Data Factory / Dataflow&lt;/strong&gt; for managed serverless ETL, and &lt;strong&gt;Fivetran / Airbyte&lt;/strong&gt; for managed source connectors.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybir14a9kw0xkpr4rvpg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fybir14a9kw0xkpr4rvpg.jpeg" alt="Modern ETL orchestration tool ecosystem diagram showing five categories laid out as tiles — orchestration (Airflow), SQL transforms (dbt), distributed processing (Apache Spark + PySpark), managed serverless ETL (AWS Glue, Azure Data Factory), and managed connectors (Fivetran, Airbyte) — each tile labeled with its logo / wordmark and a one-line role description, on a light card with PipeCode brand purple, green, and orange accents." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the modern ETL stack is &lt;strong&gt;layered&lt;/strong&gt;, not monolithic — Airflow orchestrates dbt, dbt invokes Spark, Spark reads from S3, and Fivetran handles the source connectors. Knowing which tool owns which slice (and why) is the senior signal in any ETL design round.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Apache Airflow — DAG orchestration and scheduling
&lt;/h4&gt;

&lt;p&gt;The Airflow invariant: &lt;strong&gt;Airflow defines pipelines as Python DAGs (Directed Acyclic Graphs); each DAG is a graph of tasks with explicit dependencies; the scheduler triggers DAGs on cron / sensor / event; failed tasks retry individually with configurable delays (exponential backoff is opt-in)&lt;/strong&gt;. It's the canonical orchestrator for batch ETL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG&lt;/strong&gt; — Python file that declares tasks + dependencies + schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operators&lt;/strong&gt; — &lt;code&gt;BashOperator&lt;/code&gt;, &lt;code&gt;PythonOperator&lt;/code&gt;, &lt;code&gt;SnowflakeOperator&lt;/code&gt;, &lt;code&gt;DbtRunOperator&lt;/code&gt;, ...&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensors&lt;/strong&gt; — wait for an external event (S3 file arrival, table refresh, API webhook); a sensor sketch follows this list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XCom&lt;/strong&gt; — pass small values between tasks; for large data, use object storage as the boundary.&lt;/li&gt;
&lt;/ul&gt;
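
&lt;p&gt;A minimal sensor sketch, gating a DAG on an S3 file landing. It assumes the &lt;code&gt;apache-airflow-providers-amazon&lt;/code&gt; package is installed; bucket and key are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Declared inside a DAG context, like the operators in the example below.
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

wait_for_clickstream = S3KeySensor(
    task_id="wait_for_clickstream",
    bucket_name="lake",                                  # placeholder bucket
    bucket_key="bronze/clickstream/{{ ds }}/_SUCCESS",   # templated per run date
    poke_interval=300,          # re-check every 5 minutes
    timeout=6 * 60 * 60,        # fail the task after 6 hours of waiting
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;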

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 4-task daily DAG: extract → transform → reconcile → notify.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;task&lt;/th&gt;
&lt;th&gt;runs&lt;/th&gt;
&lt;th&gt;depends on&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extract_postgres&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;02:00 daily&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transform_dbt&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;after extract&lt;/td&gt;
&lt;td&gt;&lt;code&gt;extract_postgres&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;reconcile_counts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;after transform&lt;/td&gt;
&lt;td&gt;&lt;code&gt;transform_dbt&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;notify_slack&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;after reconcile&lt;/td&gt;
&lt;td&gt;&lt;code&gt;reconcile_counts&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Airflow scheduler triggers the DAG at 02:00.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;extract_postgres&lt;/code&gt; runs a &lt;code&gt;BashOperator&lt;/code&gt; invoking a Python script — extracts yesterday's rows to S3.&lt;/li&gt;
&lt;li&gt;Once it succeeds, &lt;code&gt;transform_dbt&lt;/code&gt; runs &lt;code&gt;dbt run&lt;/code&gt; against the warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;reconcile_counts&lt;/code&gt; queries silver vs source for drift &amp;gt; 0.1%; if it fails, the DAG halts and pages on-call (a sketch of this check follows the list).&lt;/li&gt;
&lt;li&gt;On success, &lt;code&gt;notify_slack&lt;/code&gt; posts a "✓ pipeline complete" message; the BI dashboard refreshes.&lt;/li&gt;
&lt;/ol&gt;
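
&lt;p&gt;What &lt;code&gt;reconcile.py&lt;/code&gt; might contain, as a sketch. It assumes two DB-API connections are handed in and that the table names match the layers above; the non-zero exit code is what makes Airflow mark the task failed.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sys

def count_rows(conn, table: str) -&amp;gt; int:
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def reconcile(source_conn, warehouse_conn) -&amp;gt; None:
    src = count_rows(source_conn, "public.orders")    # placeholder table names
    dst = count_rows(warehouse_conn, "silver.orders")
    drift = abs(src - dst) / max(src, 1)
    if drift &amp;gt; 0.001:                                 # 0.1% threshold
        print(f"RECONCILE FAIL: source={src} silver={dst} drift={drift:.4%}")
        sys.exit(1)                                   # non-zero exit fails the task
    print(f"reconcile ok: {src} rows")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;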

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A minimal Airflow DAG:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.bash&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BashOperator&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;daily_orders_etl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 2 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# 02:00 UTC daily
&lt;/span&gt;    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

    &lt;span class="n"&gt;extract&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python /opt/scripts/extract.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transform_dbt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cd /opt/dbt &amp;amp;&amp;amp; dbt run --target prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;reconcile&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reconcile_counts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python /opt/scripts/reconcile.py&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;notify&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BashOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notify_slack&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;bash_command&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python /opt/scripts/notify.py &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Pipeline OK&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;extract&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;transform&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;reconcile&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;notify&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; Airflow is the right choice for batch-scheduled pipelines with explicit dependencies. For event-driven streaming pipelines, look at Prefect, Dagster, or native cloud orchestrators.&lt;/p&gt;

&lt;h4&gt;
  
  
  dbt and Apache Spark — SQL transforms and distributed processing
&lt;/h4&gt;

&lt;p&gt;The dbt invariant: &lt;strong&gt;dbt is a SQL transformation framework that runs models (SELECT statements) against your warehouse, with built-in tests, lineage graphs, and documentation&lt;/strong&gt;. The Spark invariant: &lt;strong&gt;Spark is a distributed compute engine that runs Python / Scala / SQL transforms across a cluster of machines, scaling from gigabytes to petabytes&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;dbt&lt;/strong&gt; — &lt;code&gt;models/&lt;/code&gt;, &lt;code&gt;tests/&lt;/code&gt;, &lt;code&gt;snapshots/&lt;/code&gt;, &lt;code&gt;seeds/&lt;/code&gt;; runs on Snowflake / BigQuery / Redshift / Databricks SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark&lt;/strong&gt; — &lt;code&gt;pyspark.sql&lt;/code&gt;, structured streaming, MLlib; reads from S3, Iceberg, Delta, Hive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When dbt wins&lt;/strong&gt; — SQL transforms on a managed warehouse; small-to-medium-scale analytics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;When Spark wins&lt;/strong&gt; — non-SQL transforms (Python ML, image / log processing), petabyte-scale data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A dbt model + a Spark job for the same logical transform.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;code&lt;/th&gt;
&lt;th&gt;runs on&lt;/th&gt;
&lt;th&gt;best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;dbt&lt;/td&gt;
&lt;td&gt;&lt;code&gt;models/silver_orders.sql&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;SQL-first analytics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;pyspark&lt;/code&gt; script&lt;/td&gt;
&lt;td&gt;EMR / Databricks cluster&lt;/td&gt;
&lt;td&gt;petabyte-scale or custom Python&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For a 1 GB dedup transform, dbt + Snowflake is simpler — write SQL, push to git, run &lt;code&gt;dbt run&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For a 1 TB transform with custom Python UDFs (ML features), Spark on EMR is the better tool.&lt;/li&gt;
&lt;li&gt;dbt is declarative — write the desired output as SELECT, let dbt handle the materialization strategy.&lt;/li&gt;
&lt;li&gt;Spark is imperative — write the transform as code, hand-tune partitioning and caching.&lt;/li&gt;
&lt;li&gt;The two tools coexist: dbt for SQL warehouse layers, Spark for upstream lake processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A Spark transform that dedups orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;

&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/bronze/orders/ingest_date=2026-05-11/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partitionBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;orderBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;over&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rn = 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;deduped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/silver/orders/ingest_date=2026-05-11/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; use dbt for SQL transforms on the warehouse; reach for Spark when transforms are non-SQL or when data outgrows warehouse compute budgets.&lt;/p&gt;

&lt;h4&gt;
  
  
  Managed services — AWS Glue, Fivetran, Airbyte
&lt;/h4&gt;

&lt;p&gt;The managed-service invariant: &lt;strong&gt;managed ETL services (AWS Glue, Azure Data Factory, GCP Dataflow, Fivetran, Airbyte) trade flexibility for operational simplicity — they handle the infrastructure, retries, scaling, and connector maintenance so the team writes business logic, not boilerplate&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AWS Glue&lt;/strong&gt; — managed serverless Spark; auto-scales; AWS-native pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fivetran&lt;/strong&gt; — managed connectors from 300+ SaaS sources to warehouses; usage-based pricing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airbyte&lt;/strong&gt; — open-source alternative to Fivetran; self-hosted or managed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataflow / ADF&lt;/strong&gt; — GCP / Azure equivalents.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Pulling Stripe charges via Fivetran vs writing a custom script.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;approach&lt;/th&gt;
&lt;th&gt;time to deploy&lt;/th&gt;
&lt;th&gt;maintenance&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Custom Python&lt;/td&gt;
&lt;td&gt;~2 weeks&lt;/td&gt;
&lt;td&gt;high (rate-limit + schema-drift bugs)&lt;/td&gt;
&lt;td&gt;engineer time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fivetran connector&lt;/td&gt;
&lt;td&gt;~2 hours&lt;/td&gt;
&lt;td&gt;low (managed)&lt;/td&gt;
&lt;td&gt;~$200/month for 100K MAR (monthly active rows)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The custom approach: write a Python script, handle pagination, build retries, deploy to Airflow, write tests, monitor schema drift, fix bugs for years (a taste of that code follows this list).&lt;/li&gt;
&lt;li&gt;Fivetran approach: click "Connect Stripe", paste API key, choose destination warehouse, click "Sync". Done in 2 hours.&lt;/li&gt;
&lt;li&gt;Custom code wins when the source is custom or volume is huge; Fivetran wins for the 80% case (standard SaaS sources).&lt;/li&gt;
&lt;li&gt;The choice depends on team size and operational maturity — a 3-person data team is better off paying Fivetran than burning engineer hours.&lt;/li&gt;
&lt;li&gt;Hybrid is common: Fivetran for standard sources, custom Python / Spark for unique ones.&lt;/li&gt;
&lt;/ol&gt;
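
&lt;p&gt;A taste of what "handle pagination, build retries" means in practice: a sketch of cursor-paginated extraction with rate-limit backoff. The endpoint conventions (&lt;code&gt;starting_after&lt;/code&gt;, &lt;code&gt;has_more&lt;/code&gt;) are hypothetical, Stripe-like but not a documented API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time
import requests

def fetch_all(url: str, api_key: str) -&amp;gt; list:
    """Pull every page from a cursor-paginated endpoint, retrying on 429s."""
    rows, cursor = [], None
    while True:
        params = {"limit": 100}
        if cursor:
            params["starting_after"] = cursor          # hypothetical cursor param
        for attempt in range(5):
            resp = requests.get(url, params=params, timeout=30,
                                headers={"Authorization": f"Bearer {api_key}"})
            if resp.status_code == 429:                # rate-limited: back off
                time.sleep(2 ** attempt)
                continue
            resp.raise_for_status()
            break
        else:
            raise RuntimeError("rate-limited on every retry")
        payload = resp.json()
        rows.extend(payload["data"])
        if not payload.get("has_more"):                # hypothetical pagination flag
            return rows
        cursor = payload["data"][-1]["id"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Multiply this by schema drift, auth rotation, and incremental-state tracking across ten sources, and the Fivetran row in the table starts to look cheap.&lt;/p&gt;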

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; AWS Glue ETL skeleton (Spark-based, managed):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.transforms&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;getResolvedOptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;awsglue.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GlueContext&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkContext&lt;/span&gt;

&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getResolvedOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JOB_NAME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;glueContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GlueContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;SparkContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;create_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_catalog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cleaned&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DropNullFields&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;glueContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write_dynamic_frame&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_options&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cleaned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connection_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://lake/silver/orders/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; managed services are the right starting point — only go custom when a specific source or transform genuinely needs it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Building a custom orchestrator instead of using Airflow (or Prefect / Dagster) — wasted years.&lt;/li&gt;
&lt;li&gt;Running dbt locally only — needs a deployment story (Airflow, dbt Cloud, GitHub Actions).&lt;/li&gt;
&lt;li&gt;Spinning up a Spark cluster for a 1 GB transform that dbt + Snowflake would do faster.&lt;/li&gt;
&lt;li&gt;Writing custom Stripe / Salesforce / HubSpot connectors when Fivetran / Airbyte handle them.&lt;/li&gt;
&lt;li&gt;Skipping monitoring / lineage — the first incident is when you learn what observability you needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ETL Interview Question on Choosing an Orchestration Stack
&lt;/h3&gt;

&lt;p&gt;A 4-person data team is bootstrapping a new analytics platform on Snowflake. Sources: 10 SaaS tools (Stripe, Salesforce, HubSpot, …), a Postgres OLTP database, and 50 GB / day of clickstream from Kafka. &lt;strong&gt;Pick the orchestration / ingestion / transform tools and explain the choices.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Fivetran + Airflow + dbt + Spark Streaming
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INGESTION
  ├── SaaS sources (Stripe, Salesforce, HubSpot, ...) → Fivetran → Snowflake raw.* schemas
  ├── Postgres OLTP → Fivetran CDC → Snowflake raw.postgres_*
  └── Kafka clickstream → Spark Structured Streaming → Iceberg → Snowflake external table

ORCHESTRATION
  └── Airflow (managed via MWAA or Astronomer)
      - Daily DAG: trigger dbt run after Fivetran loads
      - Sensor: wait for S3 clickstream Parquet arrival, then trigger dbt clickstream model

TRANSFORM
  └── dbt (Cloud or self-hosted)
      - models/staging/   ← raw → typed
      - models/silver/    ← typed → deduped + conformed
      - models/gold/      ← curated marts for BI

OBSERVABILITY
  ├── Airflow alerts → Slack on DAG failure
  ├── dbt tests → fail loudly on schema / quality assertions
  └── Snowflake account_usage → cost &amp;amp; query monitoring dashboard
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the architectural decisions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;concern&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SaaS ingestion&lt;/td&gt;
&lt;td&gt;Fivetran&lt;/td&gt;
&lt;td&gt;10 sources × 2 weeks of custom code per source = 20 engineer-weeks; Fivetran handles it for ~$1K/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OLTP ingestion&lt;/td&gt;
&lt;td&gt;Fivetran CDC&lt;/td&gt;
&lt;td&gt;zero impact on source; sub-minute freshness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clickstream ingestion&lt;/td&gt;
&lt;td&gt;Spark Structured Streaming + Iceberg&lt;/td&gt;
&lt;td&gt;50 GB / day too big for Fivetran's pricing tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Orchestration&lt;/td&gt;
&lt;td&gt;Airflow (MWAA)&lt;/td&gt;
&lt;td&gt;mature DAG semantics; team can hire engineers who know it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Transform&lt;/td&gt;
&lt;td&gt;dbt&lt;/td&gt;
&lt;td&gt;SQL-first; version-controlled; testable; runs on Snowflake's elastic compute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Snowflake + Iceberg lakehouse&lt;/td&gt;
&lt;td&gt;warehouse for curated; lakehouse for high-volume clickstream&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; an end-to-end stack that the 4-person team can build in ~2 months. Fivetran handles 80% of ingestion; Spark covers the streaming edge case; Airflow + dbt provide a clean orchestration + transform layer; Snowflake is the warehouse. Total operational burden is low; engineering time goes into business logic, not boilerplate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fivetran for standard SaaS / OLTP&lt;/strong&gt; — buys 10 connectors for the price of one engineer; bug-fixes are someone else's problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark for streaming clickstream&lt;/strong&gt; — 50 GB / day is outside Fivetran's sweet spot; Spark + Iceberg handles it cleanly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airflow as the orchestrator&lt;/strong&gt; — mature, hireable, integrates with everything.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt for SQL transforms&lt;/strong&gt; — version control, tests, lineage all built in; reviewable in PRs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake + Iceberg&lt;/strong&gt; — warehouse for hot curated; lakehouse for high-volume cold; cross-layer joins via external tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — Fivetran ~$1K/month, Snowflake ~$3-5K/month, Airflow (MWAA) ~$300/month, Spark cluster ~$1K/month → total ~$5-7K/month for a 4-person team's full stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the structured pipeline-design path see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  7. Building a Python pandas ETL pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A runnable end-to-end example you can adapt
&lt;/h3&gt;

&lt;p&gt;To make all of the above concrete, here's a &lt;strong&gt;runnable Python &lt;code&gt;pandas&lt;/code&gt; ETL pipeline&lt;/strong&gt; that pulls a CSV of orders, cleans + deduplicates the data, and writes a curated Parquet ready for warehouse load. It's short enough to read in five minutes and structured enough to extend into a production pipeline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; &lt;code&gt;pandas&lt;/code&gt; is great for the first 1-10 GB of data; beyond that, you want &lt;code&gt;pyspark&lt;/code&gt;, &lt;code&gt;polars&lt;/code&gt;, or &lt;code&gt;duckdb&lt;/code&gt;. The shape of the pipeline is the same; only the engine changes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Setup — installing pandas and pyarrow
&lt;/h4&gt;

&lt;p&gt;The setup invariant: &lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt; reads / writes CSV, JSON, and Parquet natively; &lt;code&gt;pyarrow&lt;/code&gt; is the Parquet engine &lt;code&gt;pandas&lt;/code&gt; calls under the hood; install both with &lt;code&gt;pip install pandas pyarrow&lt;/code&gt;&lt;/strong&gt;. Everything else (S3 access, database connectors) is optional.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/strong&gt; — DataFrame manipulation; the backbone of single-node Python ETL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pyarrow&lt;/code&gt;&lt;/strong&gt; — Parquet read / write engine; columnar format for warehouse-friendly Parquet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;boto3&lt;/code&gt;&lt;/strong&gt; — AWS SDK for S3 reads / writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sqlalchemy&lt;/code&gt; / &lt;code&gt;psycopg2&lt;/code&gt;&lt;/strong&gt; — database connectors for Postgres extraction (a one-call sketch follows this list).&lt;/li&gt;
&lt;/ul&gt;
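
&lt;p&gt;For the Postgres path, extraction is one call once an engine exists. A sketch; the connection string and table are placeholders.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
from sqlalchemy import create_engine

# Placeholder DSN -- point it at a read replica, never the OLTP primary.
engine = create_engine("postgresql+psycopg2://etl:secret@localhost:5432/shop")

df = pd.read_sql(
    "SELECT order_id, customer_id, amount, order_date "
    "FROM orders WHERE order_date &amp;gt;= %(since)s",
    engine,
    params={"since": "2026-05-11"},
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;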

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Install the minimal set:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;package&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DataFrame ops, CSV / Parquet IO&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pyarrow&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Parquet engine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;boto3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;S3 access (optional for local files)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;requests&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;HTTP / API extraction (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a virtual environment: &lt;code&gt;python3 -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install dependencies: &lt;code&gt;pip install pandas pyarrow boto3 requests&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Verify: &lt;code&gt;python -c "import pandas; print(pandas.__version__)"&lt;/code&gt; prints a 2.x version.&lt;/li&gt;
&lt;li&gt;Place a sample &lt;code&gt;orders.csv&lt;/code&gt; in the current directory (or use the snippet's path).&lt;/li&gt;
&lt;li&gt;Run the script: &lt;code&gt;python etl.py&lt;/code&gt; produces &lt;code&gt;cleaned_orders.parquet&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A one-line install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas pyarrow boto3 requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; pin versions in &lt;code&gt;requirements.txt&lt;/code&gt; so reruns produce identical environments.&lt;/p&gt;

&lt;h4&gt;
  
  
  Extract + Transform + Load in 30 lines of Python
&lt;/h4&gt;

&lt;p&gt;The pipeline invariant: &lt;strong&gt;extract reads from a source into a DataFrame; transform applies cleaning + dedup + typing; load writes to the destination as Parquet&lt;/strong&gt;. The whole script fits in one file and is testable end-to-end.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt; — &lt;code&gt;pd.read_csv&lt;/code&gt; / &lt;code&gt;pd.read_sql&lt;/code&gt; / &lt;code&gt;pd.read_json&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt; — &lt;code&gt;.drop_duplicates()&lt;/code&gt;, &lt;code&gt;.dropna()&lt;/code&gt;, &lt;code&gt;.astype()&lt;/code&gt;, business rules with &lt;code&gt;np.where&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt; — &lt;code&gt;.to_parquet()&lt;/code&gt; for warehouse-friendly columnar output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idempotency&lt;/strong&gt; — overwrite the destination; never append blindly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Cleaning an &lt;code&gt;orders.csv&lt;/code&gt; with 12,847 rows down to 12,835 unique rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;before&lt;/th&gt;
&lt;th&gt;after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;rows&lt;/td&gt;
&lt;td&gt;12,847&lt;/td&gt;
&lt;td&gt;12,835 (12 duplicates dropped)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;null amounts&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;0 (replaced with 0.0)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;amount&lt;/code&gt; dtype&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;float64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;order_date&lt;/code&gt; dtype&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;datetime64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pd.read_csv("orders.csv")&lt;/code&gt; reads the raw file into a DataFrame.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.drop_duplicates(subset=["order_id"], keep="last")&lt;/code&gt; collapses 12,847 rows to 12,835 unique orders.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.fillna({"amount": 0})&lt;/code&gt; replaces 38 null amounts with &lt;code&gt;0.0&lt;/code&gt; per the business rule.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.astype({"amount": "float64"})&lt;/code&gt; coerces the amount, and &lt;code&gt;pd.to_datetime(df["order_date"], errors="coerce")&lt;/code&gt; parses dates, turning unparseable values into &lt;code&gt;NaT&lt;/code&gt; so the following &lt;code&gt;dropna&lt;/code&gt; discards them.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.to_parquet("cleaned_orders.parquet", index=False)&lt;/code&gt; writes the curated output; downstream &lt;code&gt;COPY INTO&lt;/code&gt; reads it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A complete ETL script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;RAW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;OUT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cleaned_orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── Extract ─────────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ── Transform ───────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unknown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coerce&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Business rule
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_high_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10_000&lt;/span&gt;

&lt;span class="c1"&gt;# ── Load ────────────────────────────────────────────────────────────────
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pyarrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; keep extract, transform, and load as &lt;strong&gt;separate functions&lt;/strong&gt; so you can unit-test each independently.&lt;/p&gt;
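&lt;p&gt;Here is the same script reshaped that way, as a sketch, with a pytest-style unit test exercising the transform in isolation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def extract(path: str) -&amp;gt; pd.DataFrame:
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -&amp;gt; pd.DataFrame:
    out = (df.drop_duplicates(subset=["order_id"], keep="last")
             .fillna({"amount": 0.0, "status": "unknown"})
             .astype({"amount": "float64"}))
    out["order_date"] = pd.to_datetime(out["order_date"], errors="coerce")
    return out.dropna(subset=["order_date"])

def load(df: pd.DataFrame, path: str) -&amp;gt; None:
    df.to_parquet(path, index=False, engine="pyarrow")

def test_transform_keeps_last_duplicate():
    raw = pd.DataFrame({
        "order_id": [1, 1],
        "amount": [None, 5.0],
        "status": ["paid", "paid"],
        "order_date": ["2026-05-10", "2026-05-11"],
    })
    out = transform(raw)
    assert len(out) == 1
    assert out["amount"].iloc[0] == 5.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
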

&lt;h4&gt;
  
  
  Scaling out — when to move beyond pandas
&lt;/h4&gt;

&lt;p&gt;The scale-out invariant: &lt;strong&gt;&lt;code&gt;pandas&lt;/code&gt; is single-threaded and in-memory; beyond ~10 GB it slows; the upgrade path is &lt;code&gt;polars&lt;/code&gt; (single-machine, multi-core, Arrow-backed) or &lt;code&gt;pyspark&lt;/code&gt; (distributed across many machines)&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;polars&lt;/code&gt;&lt;/strong&gt; — drop-in pandas alternative; 5-10× faster on a single machine; lazy evaluation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pyspark&lt;/code&gt;&lt;/strong&gt; — distributed DataFrame API; same shape as pandas but scales to terabytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;duckdb&lt;/code&gt;&lt;/strong&gt; — embedded analytical SQL engine; great for ~100 GB datasets on a laptop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dbt + warehouse&lt;/strong&gt; — when transforms can be expressed as SQL, push them into the warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same dedup transform, three engines.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;engine&lt;/th&gt;
&lt;th&gt;time on 10 GB&lt;/th&gt;
&lt;th&gt;RAM needed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pandas&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~3 min&lt;/td&gt;
&lt;td&gt;~40 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;polars&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~30 s&lt;/td&gt;
&lt;td&gt;~12 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;pyspark&lt;/code&gt; (10-node)&lt;/td&gt;
&lt;td&gt;~10 s&lt;/td&gt;
&lt;td&gt;distributed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For 1 GB, &lt;code&gt;pandas&lt;/code&gt; is fine — 10-30 seconds on a laptop.&lt;/li&gt;
&lt;li&gt;At 10 GB, &lt;code&gt;pandas&lt;/code&gt; starts hitting memory limits on a 16 GB laptop; per the table above, &lt;code&gt;polars&lt;/code&gt; cuts RAM roughly 3× and runs ~6× faster.&lt;/li&gt;
&lt;li&gt;At 100 GB, you want &lt;code&gt;duckdb&lt;/code&gt; (single-machine columnar; see the sketch after the polars example) or &lt;code&gt;pyspark&lt;/code&gt; (distributed).&lt;/li&gt;
&lt;li&gt;At 1 TB+, a distributed engine (in practice &lt;code&gt;pyspark&lt;/code&gt; on a real cluster, or the transform pushed into the warehouse) is the practical choice.&lt;/li&gt;
&lt;li&gt;The transform &lt;em&gt;shape&lt;/em&gt; (dedup + cast + write) doesn't change — only the engine does.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; The same dedup in &lt;code&gt;polars&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unique&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
          &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill_null&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
          &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%d&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cleaned_orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
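
&lt;p&gt;Step 3 above points at &lt;code&gt;duckdb&lt;/code&gt; for the ~100 GB range. A hedged sketch of the same dedup there, assuming the file carries the &lt;code&gt;source_ts&lt;/code&gt; column used in the interview problem below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import duckdb

# DuckDB streams the CSV instead of materialising it in RAM,
# so this shape survives files far larger than memory.
duckdb.sql("""
    COPY (
        SELECT * EXCLUDE (rn)
        FROM (
            SELECT *,
                   ROW_NUMBER() OVER (
                       PARTITION BY order_id
                       ORDER BY source_ts DESC
                   ) AS rn
            FROM read_csv_auto('orders.csv')
        )
        WHERE rn = 1  -- keep only the latest row per order_id
    ) TO 'cleaned_orders.parquet' (FORMAT PARQUET)
""")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;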



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; start with &lt;code&gt;pandas&lt;/code&gt;; switch to &lt;code&gt;polars&lt;/code&gt; / &lt;code&gt;pyspark&lt;/code&gt; only when data outgrows the laptop.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Reading a 50 GB CSV with &lt;code&gt;pd.read_csv&lt;/code&gt; in one shot and watching the laptop OOM — stream it in chunks instead (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Appending to the output file instead of overwriting — silently growing duplicates.&lt;/li&gt;
&lt;li&gt;Skipping dtype coercion — &lt;code&gt;amount&lt;/code&gt; stays &lt;code&gt;object&lt;/code&gt; and aggregations break.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;errors="coerce"&lt;/code&gt; on &lt;code&gt;pd.to_datetime&lt;/code&gt; — one bad date kills the whole load.&lt;/li&gt;
&lt;li&gt;Not separating extract / transform / load into functions — untestable script.&lt;/li&gt;
&lt;/ul&gt;
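
&lt;p&gt;For that first mistake, a chunked-read sketch (the 1 M-row chunk size and the high-value filter are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Stream the file a million rows at a time so peak RAM stays bounded;
# reduce each chunk before concatenating the survivors.
chunks = pd.read_csv("orders.csv", chunksize=1_000_000)
parts = [chunk[chunk["amount"] &amp;gt; 10_000] for chunk in chunks]
high_value = pd.concat(parts, ignore_index=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;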

&lt;h3&gt;
  
  
  ETL Interview Question on Building a First Pandas Pipeline
&lt;/h3&gt;

&lt;p&gt;You're asked in a live coding round to write a Python ETL pipeline that reads &lt;code&gt;orders.csv&lt;/code&gt; (~1 GB), removes duplicate &lt;code&gt;order_id&lt;/code&gt; rows keeping the latest by &lt;code&gt;source_ts&lt;/code&gt;, replaces null &lt;code&gt;amount&lt;/code&gt; with 0, and writes a Parquet file. &lt;strong&gt;Write the runnable script.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using pandas + pyarrow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;RAW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;OUT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cleaned_orders.parquet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 1) Extract
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RAW&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_dates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# 2) Transform
# Sort by source_ts ascending so .drop_duplicates(keep="last")
# keeps the most-recent row per order_id.
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort_values&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;keep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;last&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;assign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;fillna&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float64&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3) Load
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pyarrow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Wrote &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows → &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;OUT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for a 12,847-row input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read CSV&lt;/td&gt;
&lt;td&gt;12,847&lt;/td&gt;
&lt;td&gt;raw input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sort by &lt;code&gt;(order_id, source_ts)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;12,847&lt;/td&gt;
&lt;td&gt;sorted in-place&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;drop_duplicates(keep="last")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12 duplicates dropped, latest kept&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fillna(amount=0)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;38 nulls replaced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;astype(float64)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;type coerced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write Parquet&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;columnar output, ~80 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; &lt;code&gt;cleaned_orders.parquet&lt;/code&gt; — 12,835 unique orders with non-null &lt;code&gt;amount&lt;/code&gt;, ready for a warehouse &lt;code&gt;COPY INTO&lt;/code&gt; load. Total wall-clock: ~10 seconds on a laptop for 1 GB of input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;parse_dates=["source_ts"]&lt;/code&gt;&lt;/strong&gt; — pandas reads timestamps directly into &lt;code&gt;datetime64&lt;/code&gt;, skipping a later &lt;code&gt;to_datetime&lt;/code&gt; call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sort before &lt;code&gt;drop_duplicates(keep="last")&lt;/code&gt;&lt;/strong&gt; — guarantees the latest &lt;code&gt;source_ts&lt;/code&gt; per &lt;code&gt;order_id&lt;/code&gt; survives; deduplicating unsorted data keeps an arbitrary row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;assign(amount=lambda d: d["amount"].fillna(0).astype("float64"))&lt;/code&gt;&lt;/strong&gt; — chain-friendly transform that returns a new DataFrame, easier to test than mutating in place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parquet output&lt;/strong&gt; — columnar, compressed, warehouse-friendly; ~10× smaller than CSV for the same data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pyarrow&lt;/code&gt; engine&lt;/strong&gt; — the modern Parquet backend; faster and more compatible than &lt;code&gt;fastparquet&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(N log N)&lt;/code&gt; for the sort, &lt;code&gt;O(N)&lt;/code&gt; for the rest; comfortable for the 1 GB input here, with &lt;code&gt;polars&lt;/code&gt; / &lt;code&gt;duckdb&lt;/code&gt; as the escape hatch past ~10 GB (see the scaling section above).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; more &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice problems&lt;/a&gt; for pandas-style ETL and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for end-to-end pipeline shapes.&lt;/p&gt;





&lt;h2&gt;
  
  
  Choosing your ETL stack (checklist)
&lt;/h2&gt;

&lt;p&gt;Pick the right tool for the workload, not the other way around:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For SaaS sources&lt;/strong&gt; (Stripe, Salesforce, HubSpot) → managed connectors (Fivetran, Airbyte).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For Postgres / MySQL OLTP&lt;/strong&gt; → CDC (Debezium + Kafka, or Fivetran CDC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For high-volume clickstream / logs&lt;/strong&gt; → Spark Structured Streaming + Iceberg / Delta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For SQL transforms on the warehouse&lt;/strong&gt; → dbt + Snowflake / BigQuery / Redshift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For non-SQL transforms&lt;/strong&gt; (ML features, image / NLP processing) → Spark or custom Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For batch orchestration&lt;/strong&gt; → Apache Airflow (or Prefect / Dagster for newer projects).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For BI / serving&lt;/strong&gt; → gold-layer marts in the warehouse; never let dashboards query silver.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is an ETL pipeline?
&lt;/h3&gt;

&lt;p&gt;An ETL pipeline is an automated workflow that &lt;strong&gt;Extracts&lt;/strong&gt; raw data from multiple sources, &lt;strong&gt;Transforms&lt;/strong&gt; it into a clean and structured format, and &lt;strong&gt;Loads&lt;/strong&gt; it into a destination system (data warehouse, data lake, or BI tool) for reporting and analysis. The three letters describe the stages; in practice the pipeline also handles scheduling, retries, observability, and idempotency.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between ETL and ELT?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; transforms data &lt;em&gt;before&lt;/em&gt; loading it into the warehouse (the curated, traditional pattern). &lt;strong&gt;ELT&lt;/strong&gt; loads raw data into the warehouse &lt;em&gt;first&lt;/em&gt;, then transforms it using the warehouse's own SQL engine (the modern cloud-warehouse pattern, typically with dbt). Modern Snowflake / BigQuery / Redshift workloads tilt heavily toward ELT because the warehouse compute is elastic and SQL is the most-debugged transform language.&lt;/p&gt;
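
&lt;p&gt;A hedged two-statement sketch of the ELT shape (the stage, schema, and table names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1) Load: land the raw files in the warehouse untouched
COPY INTO raw.orders
FROM @landing/orders/
FILE_FORMAT = (TYPE = PARQUET);

-- 2) Transform: use the warehouse's own SQL engine (the dbt layer)
CREATE OR REPLACE TABLE analytics.fact_orders AS
SELECT order_id,
       customer_id,
       order_date,
       COALESCE(amount, 0) AS amount
FROM raw.orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;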

&lt;h3&gt;
  
  
  Which tool is best for ETL orchestration?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is the most-deployed batch orchestrator and the safest hireable choice. &lt;strong&gt;Prefect&lt;/strong&gt; and &lt;strong&gt;Dagster&lt;/strong&gt; are modern alternatives with better Python ergonomics and first-class support for event-driven and dynamic pipelines. For SQL-only pipelines on cloud warehouses, &lt;strong&gt;dbt&lt;/strong&gt; with its built-in scheduler (dbt Cloud) handles most cases without needing a separate orchestrator.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I build an ETL pipeline in Python?
&lt;/h3&gt;

&lt;p&gt;Yes — &lt;code&gt;pandas&lt;/code&gt; + &lt;code&gt;pyarrow&lt;/code&gt; for single-machine ETL, &lt;code&gt;pyspark&lt;/code&gt; for distributed, &lt;code&gt;polars&lt;/code&gt; for fast single-machine. The pipeline shape (extract → transform → load) is the same regardless of the engine. Real production pipelines wrap the Python script in Airflow / Prefect for scheduling, retries, and observability.&lt;/p&gt;
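
&lt;p&gt;A minimal sketch of that wrapping, assuming the pipeline above is importable as a &lt;code&gt;run_pipeline&lt;/code&gt; function from a hypothetical &lt;code&gt;etl_orders&lt;/code&gt; module:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl_orders import run_pipeline  # hypothetical entry point

with DAG(
    dag_id="orders_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",  # Airflow supplies the scheduling...
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_etl",
        python_callable=run_pipeline,
        retries=3,  # ...and the retries + observability
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;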

&lt;h3&gt;
  
  
  What's the most common ETL pipeline failure mode?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Schema drift&lt;/strong&gt; — the source system silently renames a column, changes a date format, or splits a field, and the pipeline either crashes or writes wrong data. Defend with strict schema assertions at ingest, explicit alerts on drift, and a published contract with the source team. The runner-up is &lt;strong&gt;non-idempotent loads&lt;/strong&gt; — reruns produce duplicates and silent data corruption.&lt;/p&gt;
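
&lt;p&gt;One way to make those ingest assertions concrete, as a sketch rather than a library API (the expected schema is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

EXPECTED = {
    "order_id": "int64",
    "amount": "float64",
    "order_date": "datetime64[ns]",
}


def assert_schema(df: pd.DataFrame) -&amp;gt; pd.DataFrame:
    # Fail loudly at ingest instead of writing silently wrong data.
    missing = set(EXPECTED) - set(df.columns)
    if missing:
        raise ValueError(f"schema drift: missing columns {missing}")
    for col, dtype in EXPECTED.items():
        if str(df[col].dtype) != dtype:
            raise TypeError(f"schema drift: {col} is {df[col].dtype}, expected {dtype}")
    return df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;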




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;Reading is one thing; reps are another. To turn the ETL primitives in this guide into reliable interview answers, pair the reading with practice on real problems and structured courses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Drill SQL transformations&lt;/strong&gt; — the workhorse of ELT — at &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;the SQL practice page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Practice ETL pipeline design&lt;/strong&gt; with company-flavoured problems at &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;the ETL practice page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build pandas / Python ETL fluency&lt;/strong&gt; at &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;the Python practice page&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Take the structured path&lt;/strong&gt; with &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read related guides&lt;/strong&gt; — the &lt;a href="https://pipecode.ai/blogs/data-lake-architecture-data-engineering-interviews" rel="noopener noreferrer"&gt;data lake architecture for data engineering interviews&lt;/a&gt; blog covers the medallion zones every ETL pipeline writes into, and the &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; blog drills the SQL primitives every Transform step uses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The ETL pipelines you'll be asked to design in interviews use exactly the primitives in this guide — extract from messy sources, transform with idempotent SQL or Python, load into a warehouse the BI team trusts, and orchestrate the whole thing with Airflow + dbt. Practice them in &lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;the practice surface&lt;/a&gt; and the design rounds become reps, not surprises.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Snowflake for Data Engineering Interviews: A Beginner's Guide to the Cloud Data Warehouse</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Tue, 12 May 2026 04:34:10 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/snowflake-for-data-engineering-interviews-a-beginners-guide-to-the-cloud-data-warehouse-np5</link>
      <guid>https://forem.com/gowthampotureddi/snowflake-for-data-engineering-interviews-a-beginners-guide-to-the-cloud-data-warehouse-np5</guid>
      <description>&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; is the cloud-native data warehouse most modern data teams stand on — it stores petabytes, runs analytics in seconds, scales compute and storage independently, and ships features (Time Travel, zero-copy cloning, secure data sharing) that legacy MPP warehouses cannot match. For freshers preparing for data-engineering interviews, Snowflake is a high-leverage skill: the architecture is &lt;em&gt;fundamentally&lt;/em&gt; different from Postgres or MySQL, and the same two or three concepts show up in every interview loop.&lt;/p&gt;

&lt;p&gt;Think of this as a beginner-friendly &lt;strong&gt;Snowflake tutorial&lt;/strong&gt; for data engineers — a first-principles walk through the &lt;strong&gt;Snowflake data warehouse&lt;/strong&gt; from three-layer architecture to performance tuning. We start with "what is Snowflake database" in plain English, cover the killer "separation of compute and storage" idea, virtual warehouses, &lt;code&gt;COPY INTO&lt;/code&gt; for loading, Time Travel and cloning for recovery / dev, micro-partitions and query pruning for performance, and how Snowflake compares to Redshift, BigQuery, Databricks, and Azure Synapse. Every section ships worked examples and a &lt;strong&gt;Snowflake interview questions&lt;/strong&gt;-style problem with a full solution tail, in the same shape PipeCode practice problems use.&lt;/p&gt;

&lt;p&gt;If you want &lt;strong&gt;hands-on reps&lt;/strong&gt; after you read, &lt;a href="https://dev.to/explore/practice"&gt;explore practice →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/language/sql"&gt;drill SQL problems →&lt;/a&gt;, browse &lt;a href="https://dev.to/explore/practice/topic/etl"&gt;ETL practice →&lt;/a&gt;, or open &lt;a href="https://dev.to/explore/courses/etl-system-design-for-data-engineering-interviews"&gt;ETL System Design for Data Engineering Interviews →&lt;/a&gt; for a structured path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm22i9qfdrvz4s3zellp4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm22i9qfdrvz4s3zellp4.jpeg" alt="PipeCode blog header for a beginner-friendly Snowflake data engineering guide — bold title 'Snowflake for Data Engineering' with subtitle 'Cloud warehouse, virtual compute, Time Travel' and a 3-layer architecture glyph in purple, green, and orange on a dark gradient background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why Snowflake matters&lt;/li&gt;
&lt;li&gt;The three-layer architecture&lt;/li&gt;
&lt;li&gt;Separation of compute and storage&lt;/li&gt;
&lt;li&gt;Loading and querying data&lt;/li&gt;
&lt;li&gt;Time Travel and zero-copy cloning&lt;/li&gt;
&lt;li&gt;Performance optimization&lt;/li&gt;
&lt;li&gt;Snowflake vs Redshift vs BigQuery&lt;/li&gt;
&lt;li&gt;Choosing Snowflake (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why Snowflake matters
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Snowflake database — cloud-native data warehousing for analytics at scale
&lt;/h3&gt;

&lt;p&gt;So, &lt;strong&gt;what is Snowflake database&lt;/strong&gt; in one sentence? Snowflake is a &lt;strong&gt;cloud-based data warehouse platform&lt;/strong&gt; used to store, process, and analyze enormous amounts of data — orders of magnitude beyond what a single Postgres or MySQL instance can handle. Companies use it for &lt;strong&gt;data warehousing, analytics, BI dashboards, Snowflake data sharing, ETL/ELT pipelines, and ML feature storage&lt;/strong&gt;, and it runs as a managed service on &lt;strong&gt;AWS, GCP, and Azure&lt;/strong&gt; — the same Snowflake SQL surface, same UI, same features regardless of which cloud you pick.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When an interviewer asks "why Snowflake?", lead with the &lt;em&gt;workload&lt;/em&gt;, not the brand. Snowflake exists because OLTP databases (Postgres, MySQL) become slow once a single table crosses a few hundred million rows under heavy analytical reads. Snowflake separates the analytical workload from the transactional one and scales each part independently.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Data warehouse vs OLTP database — different shapes for different jobs
&lt;/h4&gt;

&lt;p&gt;The warehouse invariant: &lt;strong&gt;OLTP databases (Postgres, MySQL) are optimised for high-frequency single-row reads and writes; data warehouses (Snowflake, Redshift, BigQuery) are optimised for low-frequency, very wide scans over billions of rows; using one for the other workload produces a system that is slow on both axes&lt;/strong&gt;. The line between them is mostly about &lt;em&gt;row-store vs columnar storage&lt;/em&gt; and &lt;em&gt;transactional vs analytical query patterns&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP — row-store&lt;/strong&gt;: Postgres / MySQL store rows contiguously; reading one row of 30 columns is one disk seek.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OLAP — columnar&lt;/strong&gt;: Snowflake / Redshift / BigQuery store columns contiguously; reading one column of 100 M rows is one sequential scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transactions&lt;/strong&gt;: OLTP holds row-level locks for ACID writes; warehouses commit in batches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency model&lt;/strong&gt;: warehouses scale by spinning up parallel compute clusters; OLTP scales vertically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same 100 M-row &lt;code&gt;orders&lt;/code&gt; table, two workloads:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;Postgres (OLTP)&lt;/th&gt;
&lt;th&gt;Snowflake (warehouse)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO orders … VALUES (…)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1 ms&lt;/td&gt;
&lt;td&gt;~500 ms (batched)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT category, SUM(amount) FROM orders GROUP BY 1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;60 s scan&lt;/td&gt;
&lt;td&gt;2 s columnar scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UPDATE orders SET status='shipped' WHERE order_id = 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~1 ms&lt;/td&gt;
&lt;td&gt;~500 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 concurrent analyst dashboards&lt;/td&gt;
&lt;td&gt;dies under load&lt;/td&gt;
&lt;td&gt;each on its own warehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Postgres is great for the transactional inserts — single-row writes complete in milliseconds.&lt;/li&gt;
&lt;li&gt;Once an analyst runs &lt;code&gt;GROUP BY category&lt;/code&gt; over 100 M rows, Postgres scans every row and blocks the OLTP workload.&lt;/li&gt;
&lt;li&gt;Snowflake stores &lt;code&gt;amount&lt;/code&gt; and &lt;code&gt;category&lt;/code&gt; as separate compressed column files; the same &lt;code&gt;GROUP BY&lt;/code&gt; reads ~5% of the bytes Postgres reads.&lt;/li&gt;
&lt;li&gt;Snowflake also supports many parallel "virtual warehouses" so the 50 dashboards do not contend with the daily ETL.&lt;/li&gt;
&lt;li&gt;The right move is to keep transactional work in Postgres and &lt;strong&gt;ELT&lt;/strong&gt; the data into Snowflake for analytics.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A typical split:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Application writes        Daily ELT                Analytics
─────────────────────     ────────────────────     ─────────────────
Postgres                  source → Snowflake       Snowflake
(orders, users,             every hour or            BI dashboards,
 payments)                  every minute             ML features,
                                                     ad-hoc SQL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a query joins many tables, scans many rows, and runs on a schedule for humans to read, it belongs in a warehouse — not in your OLTP database.&lt;/p&gt;

&lt;h4&gt;
  
  
  Multi-cloud as a feature, not a buzzword
&lt;/h4&gt;

&lt;p&gt;The multi-cloud invariant: &lt;strong&gt;Snowflake runs the same control plane and SQL surface on AWS, GCP, and Azure; an account is bound to one cloud and one region, but secure data sharing crosses cloud boundaries and replication is built in&lt;/strong&gt;. You pick the cloud that matches the rest of your stack; you do not get locked into the warehouse vendor's preferred cloud.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Account region&lt;/strong&gt; — one cloud (AWS / GCP / Azure) and one region (e.g. &lt;code&gt;us-east-1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-region replication&lt;/strong&gt; — built-in; for HA and analytics close to consumers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-cloud data sharing&lt;/strong&gt; — shared databases work even when provider and consumer live on different clouds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same SQL surface&lt;/strong&gt; — &lt;code&gt;CREATE WAREHOUSE&lt;/code&gt;, &lt;code&gt;COPY INTO&lt;/code&gt;, Time Travel work identically across clouds.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A SaaS company runs ingestion on GCP and BI on AWS:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;cloud&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;product backend&lt;/td&gt;
&lt;td&gt;GCP&lt;/td&gt;
&lt;td&gt;existing team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ingestion → Snowflake&lt;/td&gt;
&lt;td&gt;GCP-region Snowflake&lt;/td&gt;
&lt;td&gt;same-cloud latency&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BI dashboards&lt;/td&gt;
&lt;td&gt;AWS-region Snowflake (read replica)&lt;/td&gt;
&lt;td&gt;analyst tools on AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ingestion writes raw events to a GCP Snowflake account; same-cloud egress is free / minimal.&lt;/li&gt;
&lt;li&gt;A Snowflake replication policy mirrors the curated schema to an AWS Snowflake account every 15 minutes.&lt;/li&gt;
&lt;li&gt;Analyst tools (Tableau, Looker, Mode) all live on AWS; queries hit the AWS account with low latency.&lt;/li&gt;
&lt;li&gt;Disaster recovery comes for free — either account can survive a single-cloud incident.&lt;/li&gt;
&lt;li&gt;The application teams pick whichever cloud suits them; the warehouse is never the point of contention.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Cross-cloud replication setup (concept):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- on the primary (GCP) account&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;REPLICATION&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;analytics_repl&lt;/span&gt;
    &lt;span class="n"&gt;OBJECT_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATABASES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ALLOWED_DATABASES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'PROD_DW'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ALLOWED_ACCOUNTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'aws_account_locator'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- on the secondary (AWS) account&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;PROD_DW&lt;/span&gt;
    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;REPLICA&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;gcp_account_locator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PROD_DW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;PROD_DW&lt;/span&gt; &lt;span class="n"&gt;REFRESH&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; let the team that &lt;em&gt;writes&lt;/em&gt; the data pick the cloud and let everyone else attach via shared databases or replicated copies.&lt;/p&gt;

&lt;h4&gt;
  
  
  Real-world use cases — where Snowflake earns its keep
&lt;/h4&gt;

&lt;p&gt;The use-case invariant: &lt;strong&gt;Snowflake is the right tool when the workload is &lt;em&gt;analytical&lt;/em&gt;, the data volume is &lt;em&gt;large&lt;/em&gt;, and the user count is &lt;em&gt;concurrent&lt;/em&gt;; it is the wrong tool for low-latency single-row reads, sub-second OLTP transactions, or kilobyte-scale lookup tables&lt;/strong&gt;. Recognising the workload is half the interview answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BI dashboards&lt;/strong&gt; — Looker, Tableau, Mode, Power BI all read from Snowflake natively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customer analytics&lt;/strong&gt; — clickstream, retention cohorts, funnel analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ML feature stores&lt;/strong&gt; — typed, time-partitioned features served to training and online inference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial reporting&lt;/strong&gt; — &lt;code&gt;NUMERIC(38,6)&lt;/code&gt; precision, ACID transactions, audit history via Time Travel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure data sharing&lt;/strong&gt; — sell anonymised datasets to partners without ETL or file transfer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An e-commerce company's Snowflake schema:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;grain&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;consumer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;td&gt;Postgres CDC&lt;/td&gt;
&lt;td&gt;BI, finance, ML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_clicks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per page view&lt;/td&gt;
&lt;td&gt;Kafka → Kinesis&lt;/td&gt;
&lt;td&gt;marketing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per customer (SCD2)&lt;/td&gt;
&lt;td&gt;Postgres CDC&lt;/td&gt;
&lt;td&gt;every fact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per product&lt;/td&gt;
&lt;td&gt;Postgres CDC&lt;/td&gt;
&lt;td&gt;every fact&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Orders, clicks, payments live transactionally in Postgres; events stream through Kafka.&lt;/li&gt;
&lt;li&gt;A CDC pipeline (Fivetran, Airbyte, custom Debezium) lands raw rows into Snowflake every few minutes.&lt;/li&gt;
&lt;li&gt;dbt models build star-schema fact / dimension tables from the raw layer.&lt;/li&gt;
&lt;li&gt;BI tools query the gold layer via &lt;code&gt;SELECT … FROM dim_customer JOIN fact_orders …&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Same &lt;code&gt;fact_orders&lt;/code&gt; table feeds the daily revenue dashboard, the monthly investor report, and the ML feature pipeline — no copies, no drift.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A minimal &lt;code&gt;fact_orders&lt;/code&gt; schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;  &lt;span class="nb"&gt;DATE&lt;/span&gt;         &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "is this a dashboard, an ML feature, or a recurring report?" → Snowflake. "Is this a real-time write?" → Postgres / DynamoDB / Cassandra.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating Snowflake as a faster Postgres — running single-row &lt;code&gt;INSERT&lt;/code&gt;s in a loop is slow because every commit is batched on object storage.&lt;/li&gt;
&lt;li&gt;Picking Snowflake when the dataset fits on one machine — a daily 1 GB CSV does not need a cloud warehouse; SQLite or DuckDB are cheaper and faster.&lt;/li&gt;
&lt;li&gt;Forgetting to suspend warehouses — every minute a warehouse runs is billed; idle warehouses are real money (see the auto-suspend sketch after this list).&lt;/li&gt;
&lt;li&gt;Storing OLTP-shaped row-by-row data — Snowflake compresses &lt;em&gt;columns&lt;/em&gt;; wide schemas with few rows are an anti-pattern.&lt;/li&gt;
&lt;li&gt;Skipping the architecture layer in interviews — "Snowflake is fast" is not an answer; "it separates compute and storage" is.&lt;/li&gt;
&lt;/ul&gt;
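
&lt;p&gt;A hedged guardrail for the suspend mistake (warehouse name and timeout are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Auto-suspend stops the meter after 60 idle seconds;
-- auto-resume wakes the warehouse on the next query.
CREATE WAREHOUSE IF NOT EXISTS bi_wh
    WAREHOUSE_SIZE = 'XSMALL'
    AUTO_SUSPEND   = 60
    AUTO_RESUME    = TRUE;

ALTER WAREHOUSE bi_wh SUSPEND;  -- or stop the meter explicitly right now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;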

&lt;h3&gt;
  
  
  Snowflake Interview Question on Picking a Warehouse vs Database
&lt;/h3&gt;

&lt;p&gt;A team is debating whether to put a 100 M-row monthly aggregate report on top of their OLTP Postgres database or load it into Snowflake first. The Postgres database also serves the live shopping cart. &lt;strong&gt;Lay out the decision criteria and propose an architecture that keeps both the cart and the report performant.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Postgres for OLTP + Snowflake for OLAP via Daily ELT
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Postgres holds the transactional truth&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;       &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;  &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;      &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- A daily ELT job lands the same rows in Snowflake&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;postgres_stage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- The monthly report runs entirely in Snowflake&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                    &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;snowflake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;actor&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;outcome&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;shopping cart&lt;/td&gt;
&lt;td&gt;inserts new order into Postgres&lt;/td&gt;
&lt;td&gt;row committed in ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;nightly ELT&lt;/td&gt;
&lt;td&gt;exports yesterday's orders to Parquet on S3&lt;/td&gt;
&lt;td&gt;one staged file per day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;nightly ELT&lt;/td&gt;
&lt;td&gt;runs &lt;code&gt;COPY INTO&lt;/code&gt; Snowflake&lt;/td&gt;
&lt;td&gt;rows added to &lt;code&gt;fact_orders&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;analyst&lt;/td&gt;
&lt;td&gt;runs monthly report on Snowflake&lt;/td&gt;
&lt;td&gt;2 s columnar scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;only sees OLTP load&lt;/td&gt;
&lt;td&gt;cart stays fast&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the cart latency stays under 100 ms because Postgres only runs OLTP work; the monthly report returns in seconds because Snowflake scans the partitioned, compressed columnar copy; the two systems never block each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Postgres for OLTP&lt;/strong&gt; — row-store + indexes + ACID transactions; perfect for single-order writes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake for OLAP&lt;/strong&gt; — columnar + massively parallel; perfect for full-table aggregations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Daily ELT&lt;/strong&gt; — moves the analytical workload to the analytical engine; freshness is "yesterday's data", which is fine for monthly reports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COPY INTO&lt;/code&gt;&lt;/strong&gt; — Snowflake's bulk loader; parallelises file ingestion across compute nodes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One source of truth&lt;/strong&gt; — Postgres remains the system of record; Snowflake is a derived copy that can be rebuilt at any time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — Postgres reads stay &lt;code&gt;O(1)&lt;/code&gt; per cart op; Snowflake aggregation is &lt;code&gt;O(N)&lt;/code&gt; on the columnar copy but runs in parallel and never touches Postgres.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for ingestion patterns and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for analytical SQL fluency.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. The three-layer architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Storage, compute (virtual warehouses), and cloud services — decoupled by design
&lt;/h3&gt;

&lt;p&gt;Snowflake's killer architectural decision is &lt;strong&gt;three independent layers&lt;/strong&gt; that scale and pay for themselves separately: a &lt;strong&gt;storage layer&lt;/strong&gt; that holds your data on cloud object storage forever, a &lt;strong&gt;compute layer&lt;/strong&gt; of "virtual warehouses" that run queries in isolated clusters, and a &lt;strong&gt;cloud services layer&lt;/strong&gt; that handles authentication, optimisation, metadata, and security. Legacy MPP warehouses (and older Redshift) couple compute and storage into a single cluster; Snowflake's split is what makes everything else (Time Travel, cloning, multi-tenant compute) possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34i0p9yvjtbtqy9nhrtl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34i0p9yvjtbtqy9nhrtl.jpeg" alt="Three-layer Snowflake architecture diagram showing the storage layer (cloud object storage with compressed micro-partitions), the compute layer (multiple virtual warehouses sized small / medium / large), and the cloud services layer (authentication, optimizer, metadata, security) connected by labeled arrows on a dark PipeCode-branded card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When an interviewer asks "explain Snowflake's architecture," name the three layers in the order &lt;em&gt;storage → compute → cloud services&lt;/em&gt; and immediately add "and the key idea is that compute and storage scale independently." That single sentence covers 70% of the architecture answer; the rest is detail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Database storage layer — compressed columnar files on object storage
&lt;/h4&gt;

&lt;p&gt;The storage invariant: &lt;strong&gt;Snowflake stores every table as a set of compressed, columnar micro-partitions (50–500 MB of uncompressed data each) on cloud object storage (S3 / GCS / ADLS); the database engine manages compression, encryption, metadata, and file organisation automatically — you write SQL, Snowflake handles the rest&lt;/strong&gt;. There is no &lt;code&gt;VACUUM&lt;/code&gt;, no manual partitioning, no index maintenance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Micro-partitions&lt;/strong&gt; — columnar files holding 50–500 MB of uncompressed data, stored compressed; automatically sized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Columnar format&lt;/strong&gt; — every column stored separately; analytical scans read only needed columns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic compression&lt;/strong&gt; — Snowflake picks the codec per column based on data distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Immutable files&lt;/strong&gt; — updates write new files; old files retained for Time Travel.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-column statistics&lt;/strong&gt; — min/max/distinct count per micro-partition; powers query pruning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 100 M-row &lt;code&gt;orders&lt;/code&gt; table laid out internally:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;what Snowflake stores&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;orders.order_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~5,000 micro-partitions, sorted by order_id, RLE-compressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;orders.amount&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;same partitions, ZSTD-compressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;orders.placed_at&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;same partitions, dictionary-encoded dates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;metadata&lt;/td&gt;
&lt;td&gt;per-partition min/max/distinct for every column&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You write &lt;code&gt;INSERT INTO orders SELECT … FROM staging&lt;/code&gt;; Snowflake doesn't write rows — it writes columnar files.&lt;/li&gt;
&lt;li&gt;Each batch produces a handful of new micro-partitions (typical batch ≈ 16 MB compressed).&lt;/li&gt;
&lt;li&gt;The cloud-services layer records per-column min/max in metadata for every new partition.&lt;/li&gt;
&lt;li&gt;A later &lt;code&gt;WHERE placed_at = '2026-05-10'&lt;/code&gt; can skip ~99% of partitions using the date min/max — that's query pruning.&lt;/li&gt;
&lt;li&gt;The original files are never modified; an &lt;code&gt;UPDATE&lt;/code&gt; writes new partitions and marks the old ones as expired (visible via Time Travel for the retention window).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A typical Snowflake CREATE TABLE with Snowflake data types (&lt;code&gt;NUMBER(p,s)&lt;/code&gt;, &lt;code&gt;TIMESTAMP_TZ&lt;/code&gt;) and a &lt;code&gt;CLUSTER BY&lt;/code&gt; for predictable partition layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMP_TZ&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;      &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- micro-partitions now naturally co-locate by date,&lt;/span&gt;
&lt;span class="c1"&gt;-- making date-range queries skip more partitions&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; you do not manage storage; you do not run &lt;code&gt;VACUUM&lt;/code&gt;. If a query is slow on a clustered table, the answer is usually &lt;em&gt;change the cluster key&lt;/em&gt;, not "rewrite the table."&lt;/p&gt;
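
&lt;p&gt;When a clustered table does slow down, a cheap first check is the built-in clustering report. A minimal sketch, assuming the &lt;code&gt;fact_orders&lt;/code&gt; table above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- JSON report: average clustering depth, partition overlaps, histogram
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_orders', '(placed_at)');
-- a depth that keeps rising over time suggests the cluster key no
-- longer matches how the data arrives or is queried
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;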

&lt;h4&gt;
  
  
  Compute layer — virtual warehouses run the queries
&lt;/h4&gt;

&lt;p&gt;The compute invariant: &lt;strong&gt;a virtual warehouse is a named, sized, isolated MPP compute cluster that runs your SQL; warehouses can be created, resumed, suspended, and resized independently; many warehouses can read the same storage simultaneously without contention&lt;/strong&gt;. The pricing model is straightforward — you pay in credits, metered per second of warehouse uptime; suspending a warehouse stops the meter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse sizes&lt;/strong&gt; — &lt;code&gt;X-SMALL&lt;/code&gt; (1 node), &lt;code&gt;SMALL&lt;/code&gt; (2), &lt;code&gt;MEDIUM&lt;/code&gt; (4), … up to &lt;code&gt;6X-LARGE&lt;/code&gt; (512).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster warehouses&lt;/strong&gt; — auto-scale parallel clusters when concurrency grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-suspend / auto-resume&lt;/strong&gt; — pause after N minutes idle; wake on demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-team isolation&lt;/strong&gt; — ETL on warehouse A, analysts on warehouse B; one cannot slow the other.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Billing&lt;/strong&gt; — per-second after a 60-second minimum.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A team-isolated warehouse design:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;warehouse&lt;/th&gt;
&lt;th&gt;size&lt;/th&gt;
&lt;th&gt;who uses&lt;/th&gt;
&lt;th&gt;typical workload&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WH_ETL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MEDIUM&lt;/td&gt;
&lt;td&gt;nightly pipeline&lt;/td&gt;
&lt;td&gt;one heavy &lt;code&gt;MERGE&lt;/code&gt; per night&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WH_BI&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SMALL&lt;/td&gt;
&lt;td&gt;dashboard tools&lt;/td&gt;
&lt;td&gt;hundreds of small concurrent queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WH_ANALYSTS&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LARGE&lt;/td&gt;
&lt;td&gt;ad-hoc SQL&lt;/td&gt;
&lt;td&gt;occasional 10 B-row scans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WH_ML&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;XLARGE&lt;/td&gt;
&lt;td&gt;feature pipeline&lt;/td&gt;
&lt;td&gt;scheduled hourly batches&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each team's queries route to their own warehouse — a misbehaving analyst query cannot block the BI dashboard.&lt;/li&gt;
&lt;li&gt;The ETL warehouse runs for ~45 min/night, then auto-suspends; you pay only for that window.&lt;/li&gt;
&lt;li&gt;The BI warehouse stays warm during business hours with multi-cluster auto-scaling so 200 concurrent dashboards never queue.&lt;/li&gt;
&lt;li&gt;The analyst warehouse spins up only when someone runs a big ad-hoc query.&lt;/li&gt;
&lt;li&gt;All four warehouses read and write the &lt;em&gt;same&lt;/em&gt; underlying tables — there is one source of truth.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Create and size a warehouse:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_BI&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;     &lt;span class="c1"&gt;-- pause after 60s idle&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
         &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
         &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;
         &lt;span class="n"&gt;SCALING_POLICY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'STANDARD'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every team gets their own warehouse named after the team. Cost attribution and noise isolation come together that way.&lt;/p&gt;
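
&lt;p&gt;Per-team spend then falls straight out of the account-usage views. A minimal sketch; the view name is standard, the 30-day window is an arbitrary choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT warehouse_name,
       SUM(credits_used) AS credits_30d
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE start_time &amp;gt;= DATEADD(day, -30, CURRENT_TIMESTAMP)
GROUP BY warehouse_name
ORDER BY credits_30d DESC;   -- one row per warehouse = one bill per team
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;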

&lt;h4&gt;
  
  
  Cloud services layer — the brain that ties it all together
&lt;/h4&gt;

&lt;p&gt;The services invariant: &lt;strong&gt;the cloud services layer handles authentication, query optimisation, metadata, transaction management, security, and access control; it is shared across all warehouses; you never interact with it directly but it powers every "Snowflake feels magic" experience&lt;/strong&gt;. It is also what makes zero-copy cloning, secure data sharing, and Time Travel cheap.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Authentication &amp;amp; RBAC&lt;/strong&gt; — users, roles, grants; role-based access at every level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query optimiser&lt;/strong&gt; — column statistics + cost model produce the execution plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata store&lt;/strong&gt; — micro-partition stats, transaction log, history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result cache&lt;/strong&gt; — recent query results returned without warehouse compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background services&lt;/strong&gt; — re-clustering, materialised view maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A query lifecycle:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;cloud services&lt;/td&gt;
&lt;td&gt;authenticate user, check role grants&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;cloud services&lt;/td&gt;
&lt;td&gt;parse SQL, compile execution plan, fetch metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;cloud services&lt;/td&gt;
&lt;td&gt;check result cache — if hit, return immediately (no compute)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;virtual warehouse&lt;/td&gt;
&lt;td&gt;nodes fetch needed micro-partitions from storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;virtual warehouse&lt;/td&gt;
&lt;td&gt;run scan / filter / aggregate in parallel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;cloud services&lt;/td&gt;
&lt;td&gt;gather results, cache, return to client&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client submits SQL with a session token; cloud services verifies the token and resolves grants.&lt;/li&gt;
&lt;li&gt;The optimiser uses table metadata (partition stats, clustering, statistics) to pick the cheapest plan.&lt;/li&gt;
&lt;li&gt;Result cache — if the &lt;em&gt;same&lt;/em&gt; SQL on the &lt;em&gt;same&lt;/em&gt; data was answered in the last 24 h, the result is returned instantly with no warehouse usage.&lt;/li&gt;
&lt;li&gt;On a miss, the active warehouse spins up nodes that fetch the needed columns from object storage.&lt;/li&gt;
&lt;li&gt;Compute aggregates and returns; the result is cached and the warehouse goes back to idle.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Result-cache demo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;USE_CACHED_RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;-- default&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                  &lt;span class="c1"&gt;-- first run: 4 s warehouse compute&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;                  &lt;span class="c1"&gt;-- second run: 60 ms cache hit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a repeated query suddenly takes seconds again, suspect that someone modified the underlying table — cache invalidates on any change.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Confusing virtual warehouses with databases — a warehouse is &lt;em&gt;compute&lt;/em&gt;, a database is &lt;em&gt;storage&lt;/em&gt;; both are needed.&lt;/li&gt;
&lt;li&gt;Sizing the warehouse for the peak instead of the average — credits double with each size step; right-size and let multi-cluster scaling absorb the peak.&lt;/li&gt;
&lt;li&gt;Leaving warehouses without auto-suspend — every idle minute is a real charge.&lt;/li&gt;
&lt;li&gt;Putting every team's queries on one warehouse — one slow query starves everyone.&lt;/li&gt;
&lt;li&gt;Forgetting the result cache exists — re-running benchmarks without disabling the cache reports unrealistically fast numbers (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
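
&lt;p&gt;A minimal benchmarking guard, using only the session parameter shown in the demo above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- honest benchmarks: disable the result cache for this session only
ALTER SESSION SET USE_CACHED_RESULT = FALSE;
SELECT COUNT(*) FROM fact_orders;            -- always burns warehouse compute
ALTER SESSION SET USE_CACHED_RESULT = TRUE;  -- restore the default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;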

&lt;h3&gt;
  
  
  Snowflake Interview Question on Designing a Multi-Team Warehouse Strategy
&lt;/h3&gt;

&lt;p&gt;A 50-person data team complains that "Snowflake is slow at 9 AM." Everyone shares one &lt;code&gt;XLARGE&lt;/code&gt; warehouse: ETL, analysts, BI, ML. &lt;strong&gt;Propose a multi-warehouse design that fixes the 9 AM contention without paying more in total credits.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Per-Team Warehouses with Auto-Suspend and Multi-Cluster Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ETL: heavy, scheduled, short bursts&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ETL&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'MEDIUM'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- BI: many small queries, business-hours concurrency&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_BI&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
         &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
         &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;-- multi-cluster for concurrency&lt;/span&gt;

&lt;span class="c1"&gt;-- Analysts: occasional big queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ANALYSTS&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LARGE'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ML: scheduled feature jobs&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ML&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'XLARGE'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;original shared XLARGE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;split into 4 warehouses, each right-sized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt; everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WH_BI&lt;/code&gt; adds multi-cluster scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;total credits/day drops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;nobody waits behind another team's query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the 9 AM dashboard lag disappears (BI auto-scales horizontally), the nightly ETL stops fighting the analyst's ad-hoc queries, and the total credit bill &lt;em&gt;drops&lt;/em&gt; because nothing is "always on" anymore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-team warehouses&lt;/strong&gt; — each team's queries route to their own compute; nobody else can starve them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sized warehouses&lt;/strong&gt; — BI needs concurrency (multi-cluster small); analysts need vertical power (large); they are not the same shape.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt;&lt;/strong&gt; — the silver bullet of Snowflake cost — warehouse billing stops 60 s after the last query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster scaling on &lt;code&gt;WH_BI&lt;/code&gt;&lt;/strong&gt; — additional clusters spin up when queue depth grows, then drop when it falls; no human tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same storage, isolated compute&lt;/strong&gt; — all four warehouses read identical tables; one source of truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — moves billing from "one big always-on warehouse" to "many right-sized warehouses billed only while running"; typical savings 30–60%.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; for end-to-end warehouse-shaping playbooks.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Separation of compute and storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Independent scaling of warehouses and data — the single most important Snowflake idea
&lt;/h3&gt;

&lt;p&gt;In a legacy warehouse (Teradata, classic Redshift), &lt;strong&gt;compute and storage are bolted together&lt;/strong&gt; — you buy a "cluster" with both, and if you need more of either you have to buy both. Snowflake's defining decision is that &lt;strong&gt;compute (virtual warehouses) and storage (cloud object storage) scale independently&lt;/strong&gt;: spin up an &lt;code&gt;XLARGE&lt;/code&gt; warehouse for a one-hour backfill, then drop back to &lt;code&gt;SMALL&lt;/code&gt;; add a petabyte of data without touching compute; never pay for capacity you are not using right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95r0lv2j1fgyj5xcwj6s.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95r0lv2j1fgyj5xcwj6s.jpeg" alt="Diagram showing separation of compute and storage in Snowflake — multiple virtual warehouses on the left (XS, S, M, L, XL boxes), a single shared object-storage layer on the right with stacked colored micro-partition icons, and horizontal arrows in both directions labeled 'scale compute independently' and 'scale storage independently' on a light PipeCode-branded card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; This is the single most-asked Snowflake interview question, in some form, across data-engineering loops. Memorise the one-line answer: "Compute lives in virtual warehouses that I size and suspend independently; storage lives once on object storage and every warehouse reads the same files."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  How the scaling actually works
&lt;/h4&gt;

&lt;p&gt;The scaling invariant: &lt;strong&gt;adding a node to a warehouse, resizing a warehouse, or creating a new warehouse never moves data; the new compute simply fetches the same micro-partitions from object storage&lt;/strong&gt;. The implication is huge — you can resize compute in seconds (no rebalancing) and the data layer never blocks an operational change.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resize&lt;/strong&gt; — &lt;code&gt;ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='LARGE'&lt;/code&gt; — takes seconds; no data motion.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New warehouse&lt;/strong&gt; — creates a new cluster pointing at the same storage; you can have N warehouses on the same data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suspend&lt;/strong&gt; — &lt;code&gt;ALTER WAREHOUSE WH_ETL SUSPEND&lt;/code&gt; — stops the compute meter; data remains on object storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resume&lt;/strong&gt; — &lt;code&gt;ALTER WAREHOUSE WH_ETL RESUME&lt;/code&gt; — spins compute back up in seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage grows independently&lt;/strong&gt; — adding 1 PB does not change warehouse sizing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A one-hour backfill at the end of the quarter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;warehouse size&lt;/th&gt;
&lt;th&gt;what's running&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;00:00–08:00&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MEDIUM&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;normal nightly ETL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;08:00–09:00&lt;/td&gt;
&lt;td&gt;resized to &lt;code&gt;XLARGE&lt;/code&gt; (4× the nodes)&lt;/td&gt;
&lt;td&gt;one-time quarterly backfill&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:00 onwards&lt;/td&gt;
&lt;td&gt;back to &lt;code&gt;MEDIUM&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;resume normal work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The team has a 2 B-row backfill that would take ~4 hours on the regular &lt;code&gt;MEDIUM&lt;/code&gt; warehouse.&lt;/li&gt;
&lt;li&gt;At 08:00 the operator runs &lt;code&gt;ALTER WAREHOUSE WH_ETL SET WAREHOUSE_SIZE='XLARGE'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The warehouse scales from 4 to 16 nodes in seconds — no data moves.&lt;/li&gt;
&lt;li&gt;The backfill completes in ~1 hour because compute is 4× larger.&lt;/li&gt;
&lt;li&gt;At 09:00 the operator runs &lt;code&gt;SET WAREHOUSE_SIZE='MEDIUM'&lt;/code&gt; and the credit bill goes back to the steady-state rate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Temporary upsize for a backfill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ETL&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'XLARGE'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- run the backfill&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;fact_orders_history&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- back to steady state&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ETL&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'MEDIUM'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; resize for the slow query, then resize back. Snowflake makes this a 5-minute operation, not a migration.&lt;/p&gt;
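
&lt;p&gt;Suspend and resume are equally metadata-cheap; a two-line sketch with the same warehouse:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER WAREHOUSE WH_ETL SUSPEND;  -- meter stops; data stays on object storage
ALTER WAREHOUSE WH_ETL RESUME;   -- compute is back in seconds, same data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;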

&lt;h4&gt;
  
  
  Why this breaks legacy assumptions
&lt;/h4&gt;

&lt;p&gt;The legacy-comparison invariant: &lt;strong&gt;Teradata, classic Redshift, and on-prem MPP warehouses all couple compute and storage; adding storage means adding compute, and resizing compute requires a data rebalance&lt;/strong&gt;. Snowflake's separation removes both constraints — the cost model and operational model are &lt;em&gt;fundamentally&lt;/em&gt; different.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Coupled (legacy)&lt;/strong&gt; — pay for over-provisioned compute year-round to handle peak storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coupled&lt;/strong&gt; — resize = rebalance = downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupled (Snowflake)&lt;/strong&gt; — pay for compute by the second, only when running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupled&lt;/strong&gt; — resize = seconds; new warehouse = seconds; no data motion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same workload, two architectures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;legacy MPP (Teradata / old Redshift)&lt;/th&gt;
&lt;th&gt;Snowflake&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;add 1 TB of data&lt;/td&gt;
&lt;td&gt;requires bigger cluster&lt;/td&gt;
&lt;td&gt;no compute change needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;resize compute&lt;/td&gt;
&lt;td&gt;hours of rebalance&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dev/test workload&lt;/td&gt;
&lt;td&gt;needs own cluster&lt;/td&gt;
&lt;td&gt;new warehouse on same data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;paying for peak&lt;/td&gt;
&lt;td&gt;always&lt;/td&gt;
&lt;td&gt;only during the peak&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A legacy 8-node Teradata cluster sized for the year-end peak runs at 20% utilisation the other 51 weeks.&lt;/li&gt;
&lt;li&gt;The same workload on Snowflake uses a SMALL warehouse 51 weeks of the year, scales to XLARGE for one week, then back.&lt;/li&gt;
&lt;li&gt;Storage costs are roughly the same (both are object-store class).&lt;/li&gt;
&lt;li&gt;Compute costs drop ~70% because you only pay for the XLARGE &lt;em&gt;during&lt;/em&gt; the week it is needed.&lt;/li&gt;
&lt;li&gt;New environments (dev, staging) are nearly free — they are just new warehouse names pointing at the same storage (or a zero-copy clone), billed only while their compute runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Dev environment using a zero-copy clone:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;PROD_DW_DEV&lt;/span&gt; &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;PROD_DW&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- new warehouse for the dev team&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_DEV&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'XSMALL'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;PROD_DW_DEV&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_DEV&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a Snowflake setup ever "feels like" a legacy MPP cluster — always-on, hard to resize, single-tenant — it is being run wrong.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cost implications and credit economics
&lt;/h4&gt;

&lt;p&gt;The credit invariant: &lt;strong&gt;storage cost is roughly constant (compressed columnar on cloud storage); compute cost is variable and dominated by &lt;em&gt;how long warehouses run&lt;/em&gt;; suspending warehouses and right-sizing them is the single highest-leverage cost lever Snowflake gives you&lt;/strong&gt;. The default account settings are not always cost-optimal; tuning them matters.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Credits&lt;/strong&gt; — Snowflake's compute currency; price varies by region and edition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse credit-per-hour&lt;/strong&gt; — XS=1, S=2, M=4, L=8, XL=16 (doubling with each size step).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AUTO_SUSPEND&lt;/code&gt;&lt;/strong&gt; — default may be 10 min; set to 60 s for spiky workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; — flat $/TB/month; minor compared to compute for most workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result cache&lt;/strong&gt; — free; queries served from cache don't burn credits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Monthly bill comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;warehouse&lt;/th&gt;
&lt;th&gt;hours/month running&lt;/th&gt;
&lt;th&gt;credits&lt;/th&gt;
&lt;th&gt;cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;naive — always-on XL&lt;/td&gt;
&lt;td&gt;XLARGE&lt;/td&gt;
&lt;td&gt;730&lt;/td&gt;
&lt;td&gt;11,680&lt;/td&gt;
&lt;td&gt;$35,040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;auto-suspended XL&lt;/td&gt;
&lt;td&gt;XLARGE&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;td&gt;1,280&lt;/td&gt;
&lt;td&gt;$3,840&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;right-sized + auto-suspend&lt;/td&gt;
&lt;td&gt;MEDIUM most, XL spike&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;td&gt;$960&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Always-on XL: 730 hours × 16 credits/hr × $3/credit = $35 k/month.&lt;/li&gt;
&lt;li&gt;Same XL but with &lt;code&gt;AUTO_SUSPEND = 60s&lt;/code&gt;: only runs when queries are active; ~80 hours/month → $3,840.&lt;/li&gt;
&lt;li&gt;Right-sized — MEDIUM for the steady state, XL only for the weekly backfill: ~$960.&lt;/li&gt;
&lt;li&gt;The data is &lt;em&gt;identical&lt;/em&gt;; only the compute schedule changes.&lt;/li&gt;
&lt;li&gt;Result cache further reduces this for repeated queries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Cost-aware warehouse config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_BI&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
         &lt;span class="n"&gt;INITIALLY_SUSPENDED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the first rule of Snowflake cost is &lt;em&gt;suspend warehouses&lt;/em&gt;; the second rule is &lt;em&gt;right-size warehouses&lt;/em&gt;; everything else is rounding error.&lt;/p&gt;
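
&lt;p&gt;A third guardrail worth sketching is a resource monitor; the monitor name and the 500-credit quota below are illustrative, not recommendations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE RESOURCE MONITOR rm_monthly
    WITH CREDIT_QUOTA = 500              -- illustrative monthly cap
         FREQUENCY = MONTHLY
         START_TIMESTAMP = IMMEDIATELY
         TRIGGERS ON 90 PERCENT DO NOTIFY
                  ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE WH_BI SET RESOURCE_MONITOR = rm_monthly;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;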

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Resizing a warehouse and waiting for "data to move" — it does not; the resize is metadata-only.&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;XLARGE&lt;/code&gt; always-on for occasional queries — run an &lt;code&gt;XSMALL&lt;/code&gt; for the routine load and resize to &lt;code&gt;XLARGE&lt;/code&gt; only for the hour it is needed.&lt;/li&gt;
&lt;li&gt;Treating the result cache as a free pass for "fast" queries that are actually expensive on a cold cache.&lt;/li&gt;
&lt;li&gt;Ignoring &lt;code&gt;AUTO_SUSPEND&lt;/code&gt; — the default of 10 minutes is wasteful for low-frequency workloads (a one-line fix, sketched after this list).&lt;/li&gt;
&lt;li&gt;Building a single shared warehouse for everyone — undoes the entire isolation benefit.&lt;/li&gt;
&lt;/ul&gt;
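
&lt;p&gt;The &lt;code&gt;AUTO_SUSPEND&lt;/code&gt; fix really is one line; a sketch against the &lt;code&gt;WH_BI&lt;/code&gt; warehouse defined earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- value is in seconds; going below 60 saves nothing because
-- billing already has a 60-second minimum per resume
ALTER WAREHOUSE WH_BI SET AUTO_SUSPEND = 60;
SHOW WAREHOUSES LIKE 'WH_%';   -- confirm the auto_suspend column
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;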

&lt;h3&gt;
  
  
  Snowflake Interview Question on Cost-Optimising a $50k Monthly Bill
&lt;/h3&gt;

&lt;p&gt;The CFO points at a $50k/month Snowflake bill. Your single &lt;code&gt;XLARGE&lt;/code&gt; warehouse is &lt;code&gt;AUTO_SUSPEND = NULL&lt;/code&gt; (never suspends). Average usage is 4 hours/day across two distinct workloads (BI in business hours, ETL at night). &lt;strong&gt;Cut the bill by at least 60% without losing performance.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Workload Isolation + Auto-Suspend + Right-Sizing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Stop the always-on XL&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_OLD&lt;/span&gt; &lt;span class="n"&gt;SUSPEND&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_OLD&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- BI: business-hours, many small queries&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_BI&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
         &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
         &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- ETL: nightly, single big batch&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ETL&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LARGE'&lt;/span&gt;
         &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;th&gt;monthly cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;original &lt;code&gt;XLARGE&lt;/code&gt; always-on&lt;/td&gt;
&lt;td&gt;730 h × 16 credits × $3 ≈ $35k/month from this warehouse alone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;usage audit: 4 BI hours/day + 1 ETL hour/night&lt;/td&gt;
&lt;td&gt;most of the bill was idle time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;split into BI SMALL + ETL LARGE&lt;/td&gt;
&lt;td&gt;both auto-suspend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;BI: 4 h/day × 30 = 120 h × 2 credits = 240 credits&lt;/td&gt;
&lt;td&gt;~$720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;ETL: 1 h/night × 30 = 30 h × 8 credits = 240 credits&lt;/td&gt;
&lt;td&gt;~$720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;new total&lt;/td&gt;
&lt;td&gt;$1,440 ≈ &lt;strong&gt;97% reduction&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; monthly Snowflake spend drops from $50k to roughly $1.5k. BI users still see sub-second dashboards (multi-cluster scaling absorbs the morning spike). ETL still completes in its nightly window (LARGE is fast enough). Nothing breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Workload isolation&lt;/strong&gt; — BI and ETL have different concurrency profiles; one warehouse cannot serve both well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt;&lt;/strong&gt; — the warehouse meter stops 60 s after the last query; idle time is no longer paid for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing&lt;/strong&gt; — BI gets SMALL with multi-cluster (concurrency); ETL gets LARGE (throughput). No need for XL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Same storage&lt;/strong&gt; — no data motion; both warehouses read the same tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visible per-warehouse cost&lt;/strong&gt; — separate warehouses surface per-team spend in &lt;code&gt;WAREHOUSE_METERING_HISTORY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — credit consumption proportional to &lt;em&gt;active query time&lt;/em&gt;, not wall-clock time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for warehouse-sizing scenarios.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Loading and querying data
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stages, Snowflake COPY INTO, file formats, and the Snowflake SQL surface
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Snowflake COPY INTO&lt;/strong&gt; command is the primary bulk-load mechanism — it reads files from a &lt;strong&gt;stage&lt;/strong&gt; (an internal or external file location) and inserts them into a table in parallel. The file format is declared explicitly (CSV, JSON, Parquet, Avro, ORC). Once data is in, you query it with standard &lt;strong&gt;Snowflake SQL&lt;/strong&gt; — &lt;code&gt;SELECT&lt;/code&gt; / &lt;code&gt;JOIN&lt;/code&gt; / &lt;code&gt;GROUP BY&lt;/code&gt; look identical to the dialect you already know.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbieddjmemgqzw6svonx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbieddjmemgqzw6svonx.jpeg" alt="Loading-flow diagram showing source files (CSV, JSON, Parquet) on S3 / GCS / ADLS being staged into a Snowflake STAGE, then bulk-loaded via COPY INTO into a fact table, then queried by a virtual warehouse — with PipeCode-branded labels and purple, green, and orange accents on a light card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Interviewers love the &lt;code&gt;COPY INTO&lt;/code&gt; question because it has clear right answers — file format, error handling, parallelism, and idempotency are all observable design choices. Practise saying "I stage the files, declare the format, and run COPY INTO with &lt;code&gt;ON_ERROR = 'SKIP_FILE'&lt;/code&gt; and a load-history check" in one sentence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Stages — external and internal file locations
&lt;/h4&gt;

&lt;p&gt;The stage invariant: &lt;strong&gt;a stage is a named file location Snowflake knows how to read from; &lt;em&gt;internal&lt;/em&gt; stages live inside Snowflake (managed for you); &lt;em&gt;external&lt;/em&gt; stages point at S3 / GCS / ADLS buckets you manage; both behave identically for &lt;code&gt;COPY INTO&lt;/code&gt;&lt;/strong&gt;. Stages are also reusable — one stage definition can serve many &lt;code&gt;COPY INTO&lt;/code&gt; statements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal stage&lt;/strong&gt; — &lt;code&gt;@~/path&lt;/code&gt; (user), &lt;code&gt;@%TABLE&lt;/code&gt; (table), &lt;code&gt;@stage_name&lt;/code&gt; (named).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External stage&lt;/strong&gt; — points at &lt;code&gt;s3://bucket/path/&lt;/code&gt;, &lt;code&gt;gs://bucket/path/&lt;/code&gt;, &lt;code&gt;azure://…&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage integration&lt;/strong&gt; — security object that grants Snowflake permission to read the bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Listing&lt;/strong&gt; — &lt;code&gt;LIST @my_stage;&lt;/code&gt; shows files visible in the stage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Define an external S3 stage:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;object&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;STORAGE INTEGRATION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;IAM trust between Snowflake and AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FILE FORMAT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;declares CSV / JSON / Parquet rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;EXTERNAL STAGE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;named location pointing at the bucket&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COPY INTO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the load command that uses the stage + format&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a &lt;code&gt;STORAGE INTEGRATION&lt;/code&gt; in Snowflake; this generates an IAM trust policy you paste into AWS.&lt;/li&gt;
&lt;li&gt;Create a &lt;code&gt;FILE FORMAT&lt;/code&gt; describing the data — &lt;code&gt;TYPE = PARQUET&lt;/code&gt; is the simplest; CSV needs more options.&lt;/li&gt;
&lt;li&gt;Create an &lt;code&gt;EXTERNAL STAGE&lt;/code&gt; that combines the integration and the bucket path.&lt;/li&gt;
&lt;li&gt;List files in the stage to confirm permissions: &lt;code&gt;LIST @prod_s3_stage&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;COPY INTO&lt;/code&gt; against the stage; Snowflake fetches files in parallel across warehouse nodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; End-to-end stage setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;STORAGE&lt;/span&gt; &lt;span class="n"&gt;INTEGRATION&lt;/span&gt; &lt;span class="n"&gt;s3_int&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;EXTERNAL_STAGE&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_PROVIDER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'S3'&lt;/span&gt;
    &lt;span class="n"&gt;ENABLED&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_AWS_ROLE_ARN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456789012:role/SnowflakeReadRole'&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_ALLOWED_LOCATIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'s3://my-bucket/snowflake/'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;ff_parquet&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;STAGE&lt;/span&gt; &lt;span class="n"&gt;prod_s3_stage&lt;/span&gt;
    &lt;span class="n"&gt;STORAGE_INTEGRATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3_int&lt;/span&gt;
    &lt;span class="n"&gt;URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/snowflake/'&lt;/span&gt;
    &lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ff_parquet&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one storage integration per AWS account, one stage per logical bucket path, one file format per file shape — keeps grants and schemas tidy.&lt;/p&gt;
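
&lt;p&gt;Before the first &lt;code&gt;COPY INTO&lt;/code&gt;, it is worth sanity-checking the stage itself. A sketch reusing &lt;code&gt;prod_s3_stage&lt;/code&gt; and &lt;code&gt;ff_parquet&lt;/code&gt; from above (the &lt;code&gt;orders/&lt;/code&gt; prefix is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;LIST @prod_s3_stage/orders/;   -- permissions OK? files visible?

-- peek at staged Parquet before loading; $1 is each row as a VARIANT
SELECT $1:order_id::NUMBER      AS order_id,
       $1:amount::NUMBER(14,2)  AS amount
FROM @prod_s3_stage/orders/ (FILE_FORMAT =&amp;gt; 'ff_parquet')
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;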

&lt;h4&gt;
  
  
  &lt;code&gt;COPY INTO&lt;/code&gt; — the bulk loader
&lt;/h4&gt;

&lt;p&gt;The COPY invariant: &lt;strong&gt;&lt;code&gt;COPY INTO table FROM @stage&lt;/code&gt; parallelises file ingestion across all nodes of the active warehouse; errors are handled by the &lt;code&gt;ON_ERROR&lt;/code&gt; policy; &lt;code&gt;COPY INTO&lt;/code&gt; is idempotent on a per-file basis — re-running the command skips files already loaded (tracked in load metadata, surfaced via &lt;code&gt;LOAD_HISTORY&lt;/code&gt;, for 64 days)&lt;/strong&gt;. The same command works for any file format the stage's file format declared.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallelism&lt;/strong&gt; — file count × warehouse nodes; more files + bigger warehouse = faster load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ON_ERROR&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;CONTINUE&lt;/code&gt; (skip bad rows), &lt;code&gt;SKIP_FILE&lt;/code&gt; (skip the whole file), &lt;code&gt;ABORT_STATEMENT&lt;/code&gt; (fail fast; the bulk-load default).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PATTERN&lt;/code&gt;&lt;/strong&gt; — regex filter on filenames; load only &lt;code&gt;.parquet&lt;/code&gt; etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load history&lt;/strong&gt; — &lt;code&gt;LOAD_HISTORY&lt;/code&gt; view records every file's commit; re-running skips loaded files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PURGE = TRUE&lt;/code&gt;&lt;/strong&gt; — delete source files after successful load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily Parquet drop into &lt;code&gt;fact_orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage prefix&lt;/th&gt;
&lt;th&gt;file&lt;/th&gt;
&lt;th&gt;loaded?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3://…/orders/dt=2026-05-10/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-0000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓ (yesterday's run)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3://…/orders/dt=2026-05-11/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-0000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;new&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3://…/orders/dt=2026-05-11/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-0001.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;new&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The previous day's &lt;code&gt;COPY INTO&lt;/code&gt; loaded &lt;code&gt;dt=2026-05-10/part-0000.parquet&lt;/code&gt;; it appears in &lt;code&gt;LOAD_HISTORY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Today's &lt;code&gt;COPY INTO&lt;/code&gt; runs against the same stage path.&lt;/li&gt;
&lt;li&gt;Snowflake consults &lt;code&gt;LOAD_HISTORY&lt;/code&gt;, sees &lt;code&gt;dt=2026-05-10/part-0000.parquet&lt;/code&gt; was already loaded, and skips it.&lt;/li&gt;
&lt;li&gt;The two new files for &lt;code&gt;dt=2026-05-11/&lt;/code&gt; are loaded in parallel.&lt;/li&gt;
&lt;li&gt;Re-running tonight's command would skip every file because all three are now in &lt;code&gt;LOAD_HISTORY&lt;/code&gt; — idempotency.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Daily idempotent load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;prod_s3_stage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ff_parquet&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'.*[.]parquet'&lt;/span&gt;
&lt;span class="n"&gt;ON_ERROR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SKIP_FILE_AND_CONTINUE'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- inspect what loaded&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INFORMATION_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COPY_HISTORY&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;TABLE_NAME&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'FACT_ORDERS'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always wrap &lt;code&gt;COPY INTO&lt;/code&gt; with a post-load row-count assertion and an alert on &lt;code&gt;LOAD_HISTORY&lt;/code&gt; errors — silent skips are how data drifts.&lt;/p&gt;
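
&lt;p&gt;A minimal version of that wrapper; the date literal stands in for "today's batch" and is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- assertion 1: the batch actually landed
SELECT COUNT(*) AS rows_loaded
FROM fact_orders
WHERE placed_at &amp;gt;= '2026-05-11';

-- assertion 2: no file failed or was partially loaded in the last hour
SELECT file_name, status, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
    TABLE_NAME =&amp;gt; 'FACT_ORDERS',
    START_TIME =&amp;gt; DATEADD(hours, -1, CURRENT_TIMESTAMP)))
WHERE status != 'Loaded';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;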

&lt;h4&gt;
  
  
  File formats — CSV vs JSON vs Parquet
&lt;/h4&gt;

&lt;p&gt;The format invariant: &lt;strong&gt;Parquet and other columnar formats (ORC) load faster, compress better, and preserve types; CSV is the lowest-common-denominator and pays a real cost in load time and schema fidelity; JSON works with &lt;code&gt;VARIANT&lt;/code&gt; columns and is fine for semi-structured payloads&lt;/strong&gt;. Default to Parquet for anything you control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parquet / ORC&lt;/strong&gt; — columnar; preserves types; smallest on-disk size; fastest load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CSV&lt;/strong&gt; — text; needs explicit schema + quote / escape rules; slowest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JSON&lt;/strong&gt; — semi-structured; loads into a &lt;code&gt;VARIANT&lt;/code&gt; column; query with &lt;code&gt;col:key&lt;/code&gt; path notation (sketched just after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avro&lt;/strong&gt; — common in streaming; binary; schema embedded.&lt;/li&gt;
&lt;/ul&gt;
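
&lt;p&gt;A minimal sketch of the &lt;code&gt;VARIANT&lt;/code&gt; path; the &lt;code&gt;raw_events&lt;/code&gt; table, the stage prefix, and the payload shape are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
FROM @prod_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = ff_json);

-- query nested keys with colon paths and explicit casts
SELECT payload:customer.id::NUMBER  AS customer_id,
       payload:event_type::STRING   AS event_type
FROM raw_events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;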

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Loading the same 1 M-row dataset:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;format&lt;/th&gt;
&lt;th&gt;file size&lt;/th&gt;
&lt;th&gt;COPY time on MEDIUM&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CSV (gzipped)&lt;/td&gt;
&lt;td&gt;220 MB&lt;/td&gt;
&lt;td&gt;90 s&lt;/td&gt;
&lt;td&gt;requires explicit format definition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JSON (gzipped)&lt;/td&gt;
&lt;td&gt;280 MB&lt;/td&gt;
&lt;td&gt;70 s&lt;/td&gt;
&lt;td&gt;landing into &lt;code&gt;VARIANT&lt;/code&gt; column&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Parquet (snappy)&lt;/td&gt;
&lt;td&gt;60 MB&lt;/td&gt;
&lt;td&gt;12 s&lt;/td&gt;
&lt;td&gt;columnar; types preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CSV file is largest and slowest because every row is reparsed as text, type-coerced, and validated.&lt;/li&gt;
&lt;li&gt;JSON is similar but lands into a single &lt;code&gt;VARIANT&lt;/code&gt; column — fast for sparse / nested data, awkward for analytics SQL.&lt;/li&gt;
&lt;li&gt;Parquet is columnar and binary; Snowflake reads only the columns it needs; load time drops 7×.&lt;/li&gt;
&lt;li&gt;Storage costs follow the same ratio — Parquet files compress better.&lt;/li&gt;
&lt;li&gt;If your source format is a choice (ETL between systems you control), pick Parquet.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Three file-format definitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;ff_csv&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
    &lt;span class="n"&gt;FIELD_DELIMITER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;
    &lt;span class="n"&gt;SKIP_HEADER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;NULL_IF&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'NULL'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'null'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;FIELD_OPTIONALLY_ENCLOSED_BY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'"'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;ff_json&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt;
    &lt;span class="n"&gt;STRIP_OUTER_ARRAY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="n"&gt;ff_parquet&lt;/span&gt;
    &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; between systems you control, Parquet. Across vendor boundaries you cannot change, CSV. For event streams, JSON or Avro.&lt;/p&gt;
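&lt;p&gt;To make the &lt;code&gt;VARIANT&lt;/code&gt; path concrete, a minimal sketch (stage and format names follow the examples above; the table and field names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- land semi-structured JSON, then pull typed fields out by path
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

COPY INTO raw_events
FROM @prod_s3_stage/events/
FILE_FORMAT = (FORMAT_NAME = ff_json);

SELECT payload:user.id::NUMBER    AS user_id,
       payload:event_type::STRING AS event_type
FROM raw_events;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;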

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Loading raw CSVs without explicit &lt;code&gt;FILE FORMAT&lt;/code&gt; — fields with embedded commas break silently.&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;COPY INTO&lt;/code&gt; without &lt;code&gt;ON_ERROR&lt;/code&gt; — one bad row aborts the entire load.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;LOAD_HISTORY&lt;/code&gt; and reloading files twice — your fact table doubles silently.&lt;/li&gt;
&lt;li&gt;Picking JSON for tabular data — wastes Snowflake's columnar strengths.&lt;/li&gt;
&lt;li&gt;Storing the access keys in code instead of a &lt;code&gt;STORAGE INTEGRATION&lt;/code&gt; — leaks the secret.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Snowflake Interview Question on a Daily S3 Drop That Sometimes Has Bad Rows
&lt;/h3&gt;

&lt;p&gt;The team gets a daily 5 GB CSV drop from a partner into S3. About 0.1% of rows have malformed amounts. The current &lt;code&gt;COPY INTO&lt;/code&gt; errors and the daily load fails. &lt;strong&gt;Design an ingestion pipeline that loads the good rows, captures the bad ones for review, and is idempotent on rerun.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;ON_ERROR = CONTINUE&lt;/code&gt; + a Rejected-Rows Table + &lt;code&gt;LOAD_HISTORY&lt;/code&gt; Check
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;   &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;     &lt;span class="n"&gt;NUMBER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;  &lt;span class="n"&gt;TIMESTAMP_NTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;rejected_orders&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- main load: skip bad rows but keep going&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;raw_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;prod_s3_stage&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;
&lt;span class="n"&gt;FILE_FORMAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FORMAT_NAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ff_csv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PATTERN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'.*[.]csv'&lt;/span&gt;
&lt;span class="n"&gt;ON_ERROR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'CONTINUE'&lt;/span&gt;
&lt;span class="n"&gt;RETURN_FAILED_ONLY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- capture the rejected rows for review&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;rejected_orders&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;VALIDATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;JOB_ID&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'_LAST'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;partner drops &lt;code&gt;orders_2026-05-11.csv&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;5 M rows; ~5 k malformed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COPY INTO&lt;/code&gt; with &lt;code&gt;ON_ERROR = CONTINUE&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;loads 4,995,000 good rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;LOAD_HISTORY&lt;/code&gt; shows file committed&lt;/td&gt;
&lt;td&gt;won't reload on rerun&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;VALIDATE(... JOB_ID =&amp;gt; '_LAST')&lt;/code&gt; returns 5 k bad rows&lt;/td&gt;
&lt;td&gt;inserted into &lt;code&gt;rejected_orders&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;partner sees rejection report, fixes upstream&lt;/td&gt;
&lt;td&gt;next day cleaner&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the daily load completes; good rows land in &lt;code&gt;raw_orders&lt;/code&gt;; bad rows are captured in &lt;code&gt;rejected_orders&lt;/code&gt; with their reasons for inspection; &lt;code&gt;LOAD_HISTORY&lt;/code&gt; ensures the same file is never loaded twice on rerun.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ON_ERROR = CONTINUE&lt;/code&gt;&lt;/strong&gt; — partial loads succeed; one bad row doesn't kill the daily pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VALIDATE(... JOB_ID =&amp;gt; '_LAST')&lt;/code&gt;&lt;/strong&gt; — captures rejected rows from the &lt;em&gt;most recent&lt;/em&gt; load for forensic review.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOAD_HISTORY&lt;/code&gt; idempotency&lt;/strong&gt; — reruns skip files already committed; safe to retry (see the check below).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Separate &lt;code&gt;rejected_orders&lt;/code&gt; table&lt;/strong&gt; — keeps the failure rate visible and reviewable, not silently lost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern-based file selection&lt;/strong&gt; — &lt;code&gt;.*[.]csv&lt;/code&gt; ensures only intended files load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — load time &lt;code&gt;O(rows / warehouse size)&lt;/code&gt;; the validate call is metadata-only.&lt;/li&gt;
&lt;/ul&gt;
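&lt;p&gt;A quick way to confirm the idempotency claim, sketched against the &lt;code&gt;INFORMATION_SCHEMA.LOAD_HISTORY&lt;/code&gt; view (the 64-day load metadata is what actually blocks re-loads; the table-name filter is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- which files has COPY already committed for this table?
SELECT file_name, last_load_time, status, row_count
FROM INFORMATION_SCHEMA.LOAD_HISTORY
WHERE table_name = 'RAW_ORDERS'
ORDER BY last_load_time DESC;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;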

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; the canonical ingestion-design syllabus is in &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  5. Time Travel and zero-copy cloning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Recovery, audit, and instant dev environments
&lt;/h3&gt;

&lt;p&gt;Snowflake's &lt;strong&gt;Time Travel&lt;/strong&gt; lets you query a table &lt;em&gt;as it existed at any point in the recent past&lt;/em&gt; (1 day by default, up to 90 days on Enterprise). &lt;strong&gt;Zero-copy cloning&lt;/strong&gt; lets you create a new database, schema, or table that shares the same underlying micro-partitions as the source — no data is copied, the clone is free, and edits diverge from that moment onward. &lt;strong&gt;Snowflake Dynamic Tables&lt;/strong&gt; build on the same immutable storage layer to give you declarative, automatically refreshed, materialised-view-style tables for the ELT layer. Together, these features turn data recovery, dev-environment provisioning, and forensic debugging from multi-hour ordeals into single SQL statements.&lt;/p&gt;
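&lt;p&gt;For a taste of the Dynamic Tables primitive, a minimal sketch (the &lt;code&gt;transform_wh&lt;/code&gt; warehouse name is an assumption; &lt;code&gt;raw_orders&lt;/code&gt; carries over from the ingestion example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- declarative ELT: Snowflake keeps this table at most 1 hour stale
CREATE OR REPLACE DYNAMIC TABLE daily_revenue
    TARGET_LAG = '1 hour'
    WAREHOUSE  = transform_wh
AS
SELECT CAST(placed_at AS DATE) AS order_date,
       SUM(amount)             AS revenue
FROM raw_orders
GROUP BY 1;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;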

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "what makes Snowflake different operationally?", the two-word answer is "Time Travel and cloning." Both are direct consequences of the immutable-micro-partition storage layer; legacy warehouses cannot offer them because their storage isn't shaped this way.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Time Travel — querying historical state
&lt;/h4&gt;

&lt;p&gt;The Time-Travel invariant: &lt;strong&gt;for every table, Snowflake retains the micro-partitions that made up its state for a &lt;em&gt;retention period&lt;/em&gt; (default 1 day on Standard, configurable up to 90 days on Enterprise); within that window, &lt;code&gt;AT (TIMESTAMP =&amp;gt; …)&lt;/code&gt; or &lt;code&gt;BEFORE (STATEMENT =&amp;gt; …)&lt;/code&gt; clauses return the table's &lt;em&gt;historical&lt;/em&gt; state&lt;/strong&gt;. The feature is the cheapest "we accidentally dropped a table" recovery on the market.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT (OFFSET =&amp;gt; -3600)&lt;/code&gt;&lt;/strong&gt; — table state 1 hour ago.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT (TIMESTAMP =&amp;gt; '2026-05-11 14:00:00')&lt;/code&gt;&lt;/strong&gt; — table state at that exact moment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BEFORE (STATEMENT =&amp;gt; '&amp;lt;query_id&amp;gt;')&lt;/code&gt;&lt;/strong&gt; — table state just before a specific query ran.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/code&gt;&lt;/strong&gt; — object-level setting (account, database, schema, or table); default 1, max 90 (Enterprise).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNDROP TABLE / DATABASE&lt;/code&gt;&lt;/strong&gt; — short-cut to restore a dropped object within retention.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A junior accidentally truncates &lt;code&gt;dim_customer&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;event&lt;/th&gt;
&lt;th&gt;what Time Travel can do&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;14:00:00&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim_customer&lt;/code&gt; healthy&lt;/td&gt;
&lt;td&gt;(normal)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:05:32&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TRUNCATE TABLE dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;rows gone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:07:10&lt;/td&gt;
&lt;td&gt;data team notices&lt;/td&gt;
&lt;td&gt;panic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14:08:00&lt;/td&gt;
&lt;td&gt;run &lt;code&gt;INSERT INTO dim_customer SELECT * FROM dim_customer AT (TIMESTAMP =&amp;gt; '2026-05-11 14:05:00')&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;rows restored&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The truncate runs; Snowflake marks the micro-partitions as expired but retains them for the retention window.&lt;/li&gt;
&lt;li&gt;The team finds the query in &lt;code&gt;QUERY_HISTORY&lt;/code&gt; and notes its &lt;code&gt;query_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SELECT * FROM dim_customer BEFORE (STATEMENT =&amp;gt; 'abc-123-def')&lt;/code&gt; returns the table as it existed just before the truncate.&lt;/li&gt;
&lt;li&gt;Wrapping the same query in an &lt;code&gt;INSERT INTO dim_customer&lt;/code&gt; restores the data in seconds — no backup tape, no S3 restore.&lt;/li&gt;
&lt;li&gt;Future incidents within the retention window are recoverable the same way.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Full recovery script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- find the offending query&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;INFORMATION_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;query_text&lt;/span&gt; &lt;span class="k"&gt;ILIKE&lt;/span&gt; &lt;span class="s1"&gt;'%TRUNCATE%dim_customer%'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- restore using BEFORE&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;BEFORE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;STATEMENT&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'abc-123-def-456'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; Time Travel saves your weekend the first time someone runs a destructive query in prod. Configure retention to match your "how long until someone notices" SLA.&lt;/p&gt;

&lt;h4&gt;
  
  
  Zero-copy cloning — instant dev / test environments
&lt;/h4&gt;

&lt;p&gt;The cloning invariant: &lt;strong&gt;&lt;code&gt;CREATE … CLONE&lt;/code&gt; produces a new object (database, schema, or table) that shares the source's underlying micro-partitions; until either side writes, the clone is &lt;em&gt;free&lt;/em&gt; in storage; once a clone writes, only the diverged partitions cost extra&lt;/strong&gt;. Cloning a 100 TB database for a feature branch takes seconds and costs near-zero.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE TABLE x_clone CLONE x&lt;/code&gt;&lt;/strong&gt; — clone one table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE SCHEMA s_dev CLONE s_prod&lt;/code&gt;&lt;/strong&gt; — clone all tables in a schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE DATABASE db_dev CLONE db_prod&lt;/code&gt;&lt;/strong&gt; — clone an entire database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy-on-write&lt;/strong&gt; — clones diverge only on the rows that actually change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone-at-point-in-time&lt;/strong&gt; — &lt;code&gt;CLONE … AT (TIMESTAMP =&amp;gt; …)&lt;/code&gt; — combines cloning and Time Travel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Stand up a dev environment in 30 seconds:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;storage cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;prod &lt;code&gt;DB_PROD&lt;/code&gt; has 100 TB of orders&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE DATABASE DB_DEV CLONE DB_PROD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;0 (shared micro-partitions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;dev team runs &lt;code&gt;INSERT&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; over a few thousand rows&lt;/td&gt;
&lt;td&gt;+ a few MB diverged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;prod queries still see prod state&lt;/td&gt;
&lt;td&gt;isolated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;dev queries see clone + diverged state&lt;/td&gt;
&lt;td&gt;isolated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CREATE DATABASE DB_DEV CLONE DB_PROD&lt;/code&gt; returns in seconds.&lt;/li&gt;
&lt;li&gt;Snowflake records that every table in &lt;code&gt;DB_DEV&lt;/code&gt; points at the same micro-partitions as &lt;code&gt;DB_PROD&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Dev team can run any DDL or DML against &lt;code&gt;DB_DEV&lt;/code&gt; without touching prod data.&lt;/li&gt;
&lt;li&gt;Each write to &lt;code&gt;DB_DEV&lt;/code&gt; writes new partitions; the &lt;em&gt;unchanged&lt;/em&gt; partitions remain shared.&lt;/li&gt;
&lt;li&gt;When dev is done, &lt;code&gt;DROP DATABASE DB_DEV&lt;/code&gt; removes only the diverged partitions; the shared ones stay with prod.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Feature-branch dev environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Snapshot prod at a clean moment&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;DB_DEV_FEATURE_X&lt;/span&gt;
    &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;DB_PROD&lt;/span&gt;
    &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;DB_DEV_FEATURE_X&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;dev_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;USAGE&lt;/span&gt;  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt;   &lt;span class="n"&gt;DB_DEV_FEATURE_X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;dev_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;DB_DEV_FEATURE_X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;public&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;dev_role&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every dev team gets their own clone. The cost of cloning is so low that "no shared dev environment" should be your default policy.&lt;/p&gt;

&lt;h4&gt;
  
  
  Retention windows and the cost of long Time Travel
&lt;/h4&gt;

&lt;p&gt;The retention-cost invariant: &lt;strong&gt;retaining expired micro-partitions for the Time-Travel window costs &lt;em&gt;storage&lt;/em&gt;, not compute; longer retention = more storage; the math is small for tables that change slowly, larger for high-churn tables&lt;/strong&gt;. Pick the retention window per table based on (a) how long it takes to notice mistakes and (b) how much the table churns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Standard edition&lt;/strong&gt; — max retention 1 day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise edition&lt;/strong&gt; — max 90 days; default still 1.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ALTER TABLE … SET DATA_RETENTION_TIME_IN_DAYS = N&lt;/code&gt;&lt;/strong&gt; — per-table override.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fail-safe&lt;/strong&gt; — additional 7-day retention beyond Time Travel; Snowflake-managed, not user-queryable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost driver&lt;/strong&gt; — high-churn tables (frequent &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt;) accumulate many historical partitions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Retention cost per table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;churn (rows/day)&lt;/th&gt;
&lt;th&gt;retention (days)&lt;/th&gt;
&lt;th&gt;extra storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;low (1 k changes)&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;tiny&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;fact_clicks&lt;/code&gt; (daily re-loads)&lt;/td&gt;
&lt;td&gt;high (50 M rows/day rewritten)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;large: ≈ 350 M rows of history (50 M × 7 days)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;dim_product&lt;/code&gt; (rarely changes)&lt;/td&gt;
&lt;td&gt;almost zero&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;td&gt;tiny&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;staging_*&lt;/code&gt; (volatile)&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;minimal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For low-churn dimensions, longer retention costs almost nothing — those tables rarely write new partitions.&lt;/li&gt;
&lt;li&gt;For high-churn facts, retention multiplies the storage cost proportionally.&lt;/li&gt;
&lt;li&gt;Staging tables don't need 7+ days — they're rebuilt daily; set retention to 1.&lt;/li&gt;
&lt;li&gt;Production facts that you'd want to recover from a logic bug deserve 7-14 days.&lt;/li&gt;
&lt;li&gt;Per-table tuning is more cost-effective than a single account-wide retention.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Right-sized retention:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;  &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_clicks&lt;/span&gt;   &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;staging_orders&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;DATA_RETENTION_TIME_IN_DAYS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the longer the retention, the longer your safety net; the longer the retention on high-churn tables, the higher the storage bill. Tune per table.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Assuming Time Travel works forever — the default is 1 day; past that, you need Enterprise + per-table retention (check the actual setting first; see the one-liner below).&lt;/li&gt;
&lt;li&gt;Treating fail-safe as user-queryable — it is not; only Snowflake support can recover from fail-safe.&lt;/li&gt;
&lt;li&gt;Cloning to "back up" — clones share storage; if you &lt;code&gt;DROP&lt;/code&gt; the source, the clone is unaffected, but they aren't a true off-cluster backup.&lt;/li&gt;
&lt;li&gt;Forgetting that updates create history — a heavy &lt;code&gt;UPDATE&lt;/code&gt; on a fact table can blow up storage costs if retention is long.&lt;/li&gt;
&lt;li&gt;Querying historical data with &lt;code&gt;AT (OFFSET =&amp;gt; -86400)&lt;/code&gt; when the table's retention is 0 — the query errors outright.&lt;/li&gt;
&lt;/ul&gt;
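&lt;p&gt;Before relying on any of the clauses above in an incident, check what retention the table actually has (table name assumed from the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- effective Time-Travel retention for one table
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE dim_customer;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;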

&lt;h3&gt;
  
  
  Snowflake Interview Question on Recovering a Mistakenly Dropped Production Table
&lt;/h3&gt;

&lt;p&gt;A junior runs &lt;code&gt;DROP TABLE dim_customer;&lt;/code&gt; in prod at 14:05:32 UTC. The team notices at 14:08:00. The account is on Enterprise edition; the &lt;code&gt;dim_customer&lt;/code&gt; table has 90-day retention configured. &lt;strong&gt;Recover the table with zero data loss and zero downtime for downstream dashboards.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;UNDROP TABLE&lt;/code&gt; (and a fallback to &lt;code&gt;CLONE … AT (TIMESTAMP =&amp;gt; …)&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- fast path: UNDROP restores the table object and its data&lt;/span&gt;
&lt;span class="n"&gt;UNDROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- or, if the table name has already been reused, clone the historical state&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dim_customer_restored&lt;/span&gt;
    &lt;span class="n"&gt;CLONE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11 14:05:00'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- verify row counts vs the source-of-truth replica&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;14:05:32&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DROP TABLE dim_customer&lt;/code&gt; runs&lt;/td&gt;
&lt;td&gt;table dropped; metadata moved to "dropped"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;14:08:00&lt;/td&gt;
&lt;td&gt;engineer notices&lt;/td&gt;
&lt;td&gt;table still recoverable via Time Travel&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;14:08:30&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UNDROP TABLE dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;table restored with full data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;14:08:45&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT COUNT(*) FROM dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;matches the row count before drop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;14:09:00&lt;/td&gt;
&lt;td&gt;downstream dashboards re-run&lt;/td&gt;
&lt;td&gt;green&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the table is back in place with every row intact; downstream queries that fired between 14:05:32 and 14:08:30 errored but those errors are transient and the next refresh succeeds; total recovery time ≈ 3 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNDROP TABLE&lt;/code&gt;&lt;/strong&gt; — Snowflake's shortcut for restoring a dropped object within the retention window; one statement, instant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;90-day retention&lt;/strong&gt; — purchased via Enterprise edition; absorbs the worst-case "we noticed a week later" recovery scenario.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLONE … AT (TIMESTAMP =&amp;gt; …)&lt;/code&gt;&lt;/strong&gt; — fallback if the table name was reused; recreates the historical state as a new table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero data motion&lt;/strong&gt; — the dropped table's micro-partitions never left storage; recovery is metadata-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No external backup needed&lt;/strong&gt; — Time Travel is the backup for the retention window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — restore is metadata-only; the retention storage cost was paid throughout the 90 days regardless of whether anyone used it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill recovery and data-quality scenarios on the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  6. Performance optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Micro-partitions, query pruning, result caching, and clustering
&lt;/h3&gt;

&lt;p&gt;Snowflake's performance story is built on three automatic layers — &lt;strong&gt;micro-partitions&lt;/strong&gt; (the storage unit), &lt;strong&gt;query pruning&lt;/strong&gt; (skip partitions whose stats prove they cannot match), and &lt;strong&gt;result caching&lt;/strong&gt; (serve identical recent queries with no compute). On top of that, you can guide the optimiser with &lt;strong&gt;clustering keys&lt;/strong&gt; for very large tables that need predictable partition layout. You will rarely tune indexes (there aren't any) — instead, you tune &lt;em&gt;which partitions exist and which the planner can skip&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When a Snowflake query is slow, the right diagnostic is the query profile in the UI — look at &lt;em&gt;partitions scanned vs total&lt;/em&gt;. If you're scanning 100% of partitions for a date-range query, the table either lacks a useful cluster key or the predicate isn't pruneable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Micro-partitions and automatic clustering
&lt;/h4&gt;

&lt;p&gt;The micro-partition invariant: &lt;strong&gt;Snowflake automatically chops every table into immutable compressed columnar files, each holding roughly 50–500 MB of uncompressed data and carrying per-column min/max/distinct statistics; "clustering" is the optional act of giving Snowflake a hint about which column(s) should drive partition layout for predictable date-range or key-range pruning&lt;/strong&gt;. Most tables don't need explicit clustering; the very large ones do.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic partitioning&lt;/strong&gt; — every insert produces new partitions; no DDL needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Statistics per partition&lt;/strong&gt; — min/max/distinct for every column; powers pruning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLUSTER BY (col)&lt;/code&gt;&lt;/strong&gt; — hint that Snowflake should keep partitions ordered on that column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Re-clustering&lt;/strong&gt; — background service that recompacts partitions when clustering drifts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pruning ratio&lt;/strong&gt; — &lt;code&gt;partitions scanned / total partitions&lt;/code&gt;; visible in query profile.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;fact_clicks&lt;/code&gt; table at 50 B rows:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;predicate &lt;code&gt;WHERE click_date = '2026-05-10'&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;partitions scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no cluster key&lt;/td&gt;
&lt;td&gt;natural append order&lt;/td&gt;
&lt;td&gt;100% (no pruning)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLUSTER BY (click_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;partitions sorted by date&lt;/td&gt;
&lt;td&gt;0.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Without clustering, Snowflake's per-partition min/max for &lt;code&gt;click_date&lt;/code&gt; covers the whole date range — every partition might match.&lt;/li&gt;
&lt;li&gt;With &lt;code&gt;CLUSTER BY (click_date)&lt;/code&gt;, each partition's date range is tight; the planner can skip every partition outside the predicate.&lt;/li&gt;
&lt;li&gt;The query goes from a full table scan to a needle-in-haystack pull.&lt;/li&gt;
&lt;li&gt;Re-clustering runs in the background as new data arrives, keeping the layout tight.&lt;/li&gt;
&lt;li&gt;The cluster key should match the most common range predicate, not every predicate.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Cluster a high-volume fact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_clicks&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;click_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;-- check clustering depth (1.0 = perfectly clustered, higher = worse)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt;&lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="n"&gt;CLUSTERING_INFORMATION&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'fact_clicks'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'(click_date)'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; most tables (&amp;lt; 1 TB) don't need explicit clustering. The very large date-partitioned facts do, and the cluster key is almost always the date column.&lt;/p&gt;

&lt;h4&gt;
  
  
  Query pruning — skip partitions whose stats prove they cannot match
&lt;/h4&gt;

&lt;p&gt;The pruning invariant: &lt;strong&gt;the optimiser uses per-partition column statistics to prove that some partitions cannot contain rows matching the &lt;code&gt;WHERE&lt;/code&gt; predicate; those partitions are &lt;em&gt;skipped&lt;/em&gt; — never read from storage — and the query reads only the relevant subset&lt;/strong&gt;. Pruning is automatic and invisible until you check the query profile.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Date predicates&lt;/strong&gt; — &lt;code&gt;WHERE date_col BETWEEN x AND y&lt;/code&gt; prunes by partition date range.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Equality predicates&lt;/strong&gt; — &lt;code&gt;WHERE col = x&lt;/code&gt; prunes by per-partition min/max.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IN (...)&lt;/code&gt; lists&lt;/strong&gt; — pruned by each value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function-wrapped columns&lt;/strong&gt; — &lt;code&gt;WHERE DATE(ts) = …&lt;/code&gt; may &lt;em&gt;not&lt;/em&gt; prune; raw column comparisons do.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile shows &lt;code&gt;Partitions scanned : 12 / 4567&lt;/code&gt;&lt;/strong&gt; — that ratio is the pruning signal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;fact_clicks&lt;/code&gt; query, different predicates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;partitions scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE click_date = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12 / 4,567 (0.3%)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE customer_id = 4242&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,567 / 4,567 (no pruning unless clustered)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE DATE(click_ts) = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,567 / 4,567 (function disables pruning)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Date predicate matches the cluster key; pruning is excellent.&lt;/li&gt;
&lt;li&gt;Customer-id predicate cannot prune because customer_ids are scattered across all partitions.&lt;/li&gt;
&lt;li&gt;Wrapping the date column in &lt;code&gt;DATE(…)&lt;/code&gt; disables pruning because Snowflake cannot use min/max on the computed value.&lt;/li&gt;
&lt;li&gt;The query profile makes this visible — "Partitions scanned: X / Y" is the first line to read.&lt;/li&gt;
&lt;li&gt;The fix for the function-wrapped predicate is to compare the raw column: &lt;code&gt;WHERE click_ts &amp;gt;= '2026-05-10' AND click_ts &amp;lt; '2026-05-11'&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Pruning-friendly date filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- prunes&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_clicks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;click_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;click_ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;  &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- does NOT prune&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_clicks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;click_ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; keep predicates on the &lt;em&gt;raw&lt;/em&gt; clustered column. Anything that wraps the column in a function disables the planner's ability to use partition statistics.&lt;/p&gt;

&lt;h4&gt;
  
  
  Result caching — free wins for repeated queries
&lt;/h4&gt;

&lt;p&gt;The cache invariant: &lt;strong&gt;the cloud-services layer remembers the &lt;em&gt;result&lt;/em&gt; of every query for 24 hours; an identical query against unchanged tables returns the cached result instantly, with zero warehouse compute&lt;/strong&gt;. The cache is account-wide — different users running the same SQL share the same cached result.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache lifetime&lt;/strong&gt; — 24 hours of inactivity; extends with re-use up to 31 days.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache key&lt;/strong&gt; — exact SQL text + same underlying data state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invalidation&lt;/strong&gt; — any change to a referenced table or any non-deterministic function (&lt;code&gt;CURRENT_TIMESTAMP&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No warehouse needed&lt;/strong&gt; — the cache responds even when the warehouse is suspended.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache misses&lt;/strong&gt; — the warehouse runs the query and the result is cached for the next user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three runs of the same dashboard query:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;run&lt;/th&gt;
&lt;th&gt;warehouse compute&lt;/th&gt;
&lt;th&gt;latency&lt;/th&gt;
&lt;th&gt;credit cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;warehouse runs&lt;/td&gt;
&lt;td&gt;2,400 ms&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 (cache hit)&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;80 ms&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 (after data change)&lt;/td&gt;
&lt;td&gt;warehouse runs&lt;/td&gt;
&lt;td&gt;2,400 ms&lt;/td&gt;
&lt;td&gt;small&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First analyst clicks the dashboard tile; the warehouse runs the query; result cached.&lt;/li&gt;
&lt;li&gt;Second analyst clicks the same tile minutes later; cache hit; warehouse is &lt;em&gt;suspended&lt;/em&gt; the entire time.&lt;/li&gt;
&lt;li&gt;ETL adds new rows to the underlying table; cache invalidates.&lt;/li&gt;
&lt;li&gt;Third analyst clicks; cache miss; warehouse spins back up to recompute.&lt;/li&gt;
&lt;li&gt;The pattern dominates BI workloads — most dashboard refreshes hit the cache because the data only changes once a day.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Verify cache behaviour:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="k"&gt;PARAMETERS&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'USE_CACHED_RESULT'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;-- TRUE by default&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SESSION&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;USE_CACHED_RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;-- compute&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;-- cache hit (~80 ms)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never benchmark Snowflake without disabling result cache for the test. Production benefits hugely from the cache; benchmarks lie when you don't account for it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Trying to create B-tree indexes — Snowflake has none; tuning happens via clustering and partition pruning.&lt;/li&gt;
&lt;li&gt;Clustering small tables — overhead exceeds benefit until you cross ~1 TB.&lt;/li&gt;
&lt;li&gt;Wrapping clustered columns in functions in &lt;code&gt;WHERE&lt;/code&gt; — disables pruning silently.&lt;/li&gt;
&lt;li&gt;Believing the result cache is "always on" — any underlying-data change invalidates.&lt;/li&gt;
&lt;li&gt;Benchmarking with cache enabled — produces misleadingly low numbers; disable cache for honest tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Snowflake Interview Question on Speeding Up a 60-Second Daily Report
&lt;/h3&gt;

&lt;p&gt;The daily revenue report on a 5 B-row &lt;code&gt;fact_orders&lt;/code&gt; table takes 60 seconds. The query is &lt;code&gt;SELECT customer_id, SUM(amount) FROM fact_orders WHERE order_date = CURRENT_DATE GROUP BY 1&lt;/code&gt;. &lt;strong&gt;Get it under 5 seconds without buying a bigger warehouse.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Clustering on &lt;code&gt;order_date&lt;/code&gt; + a Raw-Column Predicate + Materialised View
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 1. cluster the table by the most common range predicate&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- 2. rewrite the predicate so it can prune&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;          &lt;span class="c1"&gt;-- raw column, prunes&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- 3. for repeated daily access, create a materialised view&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;mv_daily_customer_revenue&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;baseline scan of entire 5 B rows&lt;/td&gt;
&lt;td&gt;60 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CLUSTER BY (order_date)&lt;/code&gt; (background re-cluster runs)&lt;/td&gt;
&lt;td&gt;tighter date min/max per partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;rerun query — pruning kicks in&lt;/td&gt;
&lt;td&gt;4 s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;result cache hit on repeated daily reads&lt;/td&gt;
&lt;td&gt;&amp;lt; 100 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;optional MV for sub-second pre-aggregated rollup&lt;/td&gt;
&lt;td&gt;&amp;lt; 50 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the daily report goes from 60 s to 4 s on the first run of the day (clustered scan), then milliseconds for repeat hits (result cache). The materialised view turns even cold runs into milliseconds for the pre-aggregated rollup.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CLUSTER BY (order_date)&lt;/code&gt;&lt;/strong&gt; — partitions co-locate by date, so the predicate prunes ~99.5% of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Raw-column predicate&lt;/strong&gt; — &lt;code&gt;WHERE order_date = CURRENT_DATE&lt;/code&gt; is pruneable; wrapping the column in a function would not be (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result cache for repeats&lt;/strong&gt; — the second and every subsequent analyst to view the dashboard gets a sub-100 ms response without any warehouse compute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Materialised view&lt;/strong&gt; — pre-aggregates the rollup; the daily query becomes a tiny aggregate read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No bigger warehouse needed&lt;/strong&gt; — the performance gains come from &lt;em&gt;scanning less data&lt;/em&gt;, not from throwing more compute at the same scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — clustering adds background re-cluster cost and the MV adds maintenance cost; both are far smaller than running an &lt;code&gt;XLARGE&lt;/code&gt; warehouse for 60 s × N analysts.&lt;/li&gt;
&lt;/ul&gt;
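
&lt;p&gt;A minimal sketch of the pruning contrast (reusing &lt;code&gt;fact_orders&lt;/code&gt; from above): keep the clustered column bare on the left-hand side of the predicate, or the per-partition min/max metadata cannot be consulted.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Prunes: the bare clustered column is compared to a constant
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id;

-- Does NOT prune: wrapping the column in a function hides it from
-- the per-partition min/max metadata, forcing a much wider scan
SELECT customer_id, SUM(amount)
FROM fact_orders
WHERE TO_VARCHAR(order_date, 'YYYY-MM-DD') = TO_VARCHAR(CURRENT_DATE, 'YYYY-MM-DD')
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;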

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; sharpen pruning-aware SQL on the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation topic&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregations&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Window-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — ETL System Design&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL System Design for DE Interviews&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Snowflake vs Redshift vs BigQuery
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How the big warehouses differ — and how to pick one
&lt;/h3&gt;

&lt;p&gt;Three cloud data warehouses dominate modern data-engineering interviews: &lt;strong&gt;Snowflake&lt;/strong&gt;, &lt;strong&gt;Amazon Redshift&lt;/strong&gt;, and &lt;strong&gt;Google BigQuery&lt;/strong&gt;. &lt;strong&gt;Snowflake vs Databricks&lt;/strong&gt; is the next-most-asked comparison, and Azure shops often add &lt;strong&gt;Synapse&lt;/strong&gt; to the shortlist — so be ready for any pairing in the wider &lt;strong&gt;Snowflake / Databricks / BigQuery / Synapse&lt;/strong&gt; decision. All four serve analytical workloads at scale; they differ in &lt;em&gt;operational model&lt;/em&gt;, &lt;em&gt;pricing&lt;/em&gt;, &lt;em&gt;cloud lock-in&lt;/em&gt;, and how they pair with transformation tools like &lt;strong&gt;dbt&lt;/strong&gt; (the &lt;strong&gt;dbt Snowflake&lt;/strong&gt; integration is the de-facto modeling layer for most Snowflake teams). Knowing the trade-offs in one or two sentences each is enough to handle the "why this one?" interview follow-up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj1e8bmrecsiq1b5nk5u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffj1e8bmrecsiq1b5nk5u.jpeg" alt="Three-column comparison infographic of Snowflake, Amazon Redshift, and Google BigQuery showing each warehouse's cloud support, compute model, scaling, semi-structured data handling, and ease of use, with PipeCode brand colors and purple-headed cards on a light background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Never say "X is best." Always frame the answer as &lt;em&gt;which tool fits which workload&lt;/em&gt;. Interviewers test whether you understand the trade-offs, not whether you can pick a winner.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Snowflake vs Redshift — compute/storage coupling and cloud lock-in
&lt;/h4&gt;

&lt;p&gt;The Redshift comparison invariant: &lt;strong&gt;classic Redshift (provisioned) tightly couples compute and storage; modern Redshift Serverless decouples them and looks more like Snowflake; Snowflake is multi-cloud while Redshift is AWS-only; Snowflake's ease-of-use is consistently rated higher but Redshift can be cheaper at steady-state on AWS&lt;/strong&gt;. Pick Redshift if you are deep in AWS and want the cheapest steady-state bill; pick Snowflake if you need multi-cloud, easier ops, or per-team isolation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud support&lt;/strong&gt; — Snowflake: AWS / GCP / Azure. Redshift: AWS only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute/storage&lt;/strong&gt; — Snowflake: fully separated. Redshift: provisioned = coupled; Serverless = separated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance&lt;/strong&gt; — Snowflake: near-zero. Redshift: some tuning (VACUUM, ANALYZE, distkey, sortkey).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt; — Snowflake: easier; resize in seconds (sketch after this list). Redshift: provisioned resize involves data redistribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semi-structured&lt;/strong&gt; — Snowflake: excellent native &lt;code&gt;VARIANT&lt;/code&gt;. Redshift: good &lt;code&gt;SUPER&lt;/code&gt; type.&lt;/li&gt;
&lt;/ul&gt;
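
&lt;p&gt;A minimal sketch of the resize claim (warehouse name is illustrative): because Snowflake's compute is decoupled from storage, a resize is a metadata operation with no data motion.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Takes effect in seconds; no data is redistributed
ALTER WAREHOUSE wh_bi SET WAREHOUSE_SIZE = 'MEDIUM';

-- Suspend/resume are equally cheap — suspended compute stops billing
ALTER WAREHOUSE wh_bi SUSPEND;
ALTER WAREHOUSE wh_bi RESUME;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;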

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same workload on both:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;Snowflake&lt;/th&gt;
&lt;th&gt;Redshift (provisioned)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;spin up a warehouse&lt;/td&gt;
&lt;td&gt;5 s&lt;/td&gt;
&lt;td&gt;minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;resize compute&lt;/td&gt;
&lt;td&gt;seconds, no data motion&lt;/td&gt;
&lt;td&gt;minutes, data redistribution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;add a TB of data&lt;/td&gt;
&lt;td&gt;no compute change&lt;/td&gt;
&lt;td&gt;may need to resize cluster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 concurrent dashboards&lt;/td&gt;
&lt;td&gt;multi-cluster auto-scales&lt;/td&gt;
&lt;td&gt;needs the Concurrency Scaling add-on&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;credit billing&lt;/td&gt;
&lt;td&gt;per-second&lt;/td&gt;
&lt;td&gt;per-hour (provisioned) / per-second (Serverless)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Both can serve the same analytical SQL workload at scale.&lt;/li&gt;
&lt;li&gt;Snowflake's operational ergonomics are simpler — no &lt;code&gt;VACUUM&lt;/code&gt;, no manual sort/distkeys, easier resize.&lt;/li&gt;
&lt;li&gt;Redshift on AWS is often cheaper at steady-state because provisioned (and reserved-instance) pricing rewards continuous utilisation.&lt;/li&gt;
&lt;li&gt;Multi-cloud needs (data in GCP + analytics in AWS) lean strongly Snowflake.&lt;/li&gt;
&lt;li&gt;The choice is rarely about features — it is about who manages the warehouse and how aggressive the cost target is.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Quick comparison line for an interview:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Snowflake : multi-cloud, separated compute/storage, near-zero maintenance,
            per-second billing, excellent VARIANT.
Redshift  : AWS-only, provisioned = coupled / Serverless = separated,
            some tuning required, cheaper at steady-state on AWS.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; deep AWS shop with steady utilisation → Redshift. Multi-cloud, spiky workload, small ops team → Snowflake.&lt;/p&gt;

&lt;h4&gt;
  
  
  Snowflake vs BigQuery — warehouse compute vs serverless
&lt;/h4&gt;

&lt;p&gt;The BigQuery comparison invariant: &lt;strong&gt;BigQuery is &lt;em&gt;serverless&lt;/em&gt; — no warehouses, just a query that scans bytes and is billed per byte scanned; Snowflake bills per second of warehouse uptime; BigQuery is GCP-only; both have excellent semi-structured support; the cost model fundamentally differs&lt;/strong&gt;. Pick BigQuery if you're on GCP and want zero compute management; pick Snowflake if you want multi-cloud or predictable monthly compute spend.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud&lt;/strong&gt; — Snowflake: AWS / GCP / Azure. BigQuery: GCP only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute model&lt;/strong&gt; — Snowflake: virtual warehouses. BigQuery: serverless slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pricing&lt;/strong&gt; — Snowflake: warehouse credits per second. BigQuery: per TB scanned (on-demand) or flat-rate slots.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency&lt;/strong&gt; — Snowflake: explicit warehouse choice. BigQuery: implicit; slots dynamically assigned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost predictability&lt;/strong&gt; — Snowflake: more predictable (you choose the warehouse). BigQuery: depends entirely on query patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Cost shape for one query:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;Snowflake (SMALL warehouse)&lt;/th&gt;
&lt;th&gt;BigQuery (on-demand)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 TB scan&lt;/td&gt;
&lt;td&gt;2 credits/hr × ~$3/credit ≈ $6/hr of uptime&lt;/td&gt;
&lt;td&gt;1 TB × $5/TB = $5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;same query, 100× (uncached)&lt;/td&gt;
&lt;td&gt;warehouse keeps running&lt;/td&gt;
&lt;td&gt;100 TB × $5 = $500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;same query, result cache hit (Snowflake)&lt;/td&gt;
&lt;td&gt;free&lt;/td&gt;
&lt;td&gt;— (see next row)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;same query, cached (BigQuery)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;free for 24 h&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For one-off queries, costs are similar.&lt;/li&gt;
&lt;li&gt;For repeated identical queries, both have result caches that make subsequent runs free.&lt;/li&gt;
&lt;li&gt;For repeated &lt;em&gt;different&lt;/em&gt; queries on the same data, BigQuery scales with bytes scanned per query; Snowflake scales with warehouse uptime.&lt;/li&gt;
&lt;li&gt;Heavy ad-hoc exploration may be cheaper on Snowflake (one warehouse, many queries) than BigQuery (each query bills bytes).&lt;/li&gt;
&lt;li&gt;Heavy variable workloads with idle gaps may be cheaper on BigQuery (no warehouse to suspend).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Quick interview-shape comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Snowflake : warehouses, per-second compute billing, multi-cloud,
            predictable cost if warehouses are right-sized.
BigQuery  : serverless, per-byte-scanned billing, GCP-only,
            cost scales with bytes per query — partition / cluster
            tables aggressively to keep bytes small.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; GCP shop with many small queries → BigQuery. Multi-cloud or heavy ad-hoc analyst workload → Snowflake.&lt;/p&gt;
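
&lt;p&gt;Because BigQuery bills by bytes scanned, the main cost lever is table layout. A hedged sketch in BigQuery DDL (dataset, table, and column names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Partition by date and cluster by customer so daily dashboard
-- queries scan one partition instead of the whole table
CREATE TABLE analytics.fact_orders (
  order_id    INT64,
  customer_id INT64,
  order_date  DATE,
  amount      NUMERIC
)
PARTITION BY order_date
CLUSTER BY customer_id;

-- Bills only the bytes in today's partition
SELECT customer_id, SUM(amount) AS revenue
FROM analytics.fact_orders
WHERE order_date = CURRENT_DATE()
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;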

&lt;h4&gt;
  
  
  When to pick what — the one-paragraph decision
&lt;/h4&gt;

&lt;p&gt;The selection invariant: &lt;strong&gt;pick the warehouse that matches (a) the cloud your data already lives in, (b) the workload shape (steady vs spiky, dashboards vs ad-hoc), and (c) the size of your data-engineering team; do not over-rotate on features — all three serve the analytical workload well&lt;/strong&gt;. The bigger questions are operational and economic, not technical.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-first&lt;/strong&gt; — match the warehouse to where the data lives (cross-cloud egress is real money).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload-first&lt;/strong&gt; — bursty / spiky → Snowflake or BigQuery on-demand; steady → Redshift provisioned/reserved or BigQuery flat-rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team-size-first&lt;/strong&gt; — small team needs near-zero maintenance → Snowflake or BigQuery; bigger team can afford Redshift tuning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-first&lt;/strong&gt; — steady AWS workload → Redshift; multi-cloud or bursty → Snowflake.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Decision matrix for three scenarios:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;scenario&lt;/th&gt;
&lt;th&gt;best fit&lt;/th&gt;
&lt;th&gt;reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;5 TB, AWS-only, steady analyst workload, small team&lt;/td&gt;
&lt;td&gt;Snowflake or Redshift Serverless&lt;/td&gt;
&lt;td&gt;both work; Snowflake is easier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 GB, GCP-only, dashboards&lt;/td&gt;
&lt;td&gt;BigQuery&lt;/td&gt;
&lt;td&gt;native fit; no warehouse to size&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50 TB, multi-cloud, weekly backfills&lt;/td&gt;
&lt;td&gt;Snowflake&lt;/td&gt;
&lt;td&gt;only one that's multi-cloud + handles spikes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with cloud — if you're locked to AWS or GCP, you've narrowed the options.&lt;/li&gt;
&lt;li&gt;Then workload shape — steady-state vs bursty changes whether warehouse-based billing or per-byte billing is cheaper.&lt;/li&gt;
&lt;li&gt;Then team size — smaller teams pay for managed-service simplicity in implicit hours saved.&lt;/li&gt;
&lt;li&gt;Cost is the &lt;em&gt;last&lt;/em&gt; check — all three are within 2× of each other for most workloads.&lt;/li&gt;
&lt;li&gt;The "right" answer is rarely a technical one; it's the one your team can operate without burning out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Decision flowchart in text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q: which cloud is the source data on?
   AWS only       → Redshift or Snowflake (Snowflake if multi-cloud likely)
   GCP only       → BigQuery or Snowflake
   Azure only     → Snowflake (BigQuery is GCP-only)
   Multi-cloud    → Snowflake

Q: workload shape?
   Steady, predictable      → committed pricing (Redshift provisioned/reserved / BigQuery flat-rate)
   Bursty, mostly idle      → on-demand (Snowflake auto-suspend / BigQuery on-demand)

Q: team size + ops appetite?
   Small + want easy        → Snowflake or BigQuery
   Big + want control       → Redshift provisioned
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the right answer is the one your team can operate at 3 AM without paging an expert.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Declaring one warehouse "best" — the correct answer is always conditional on the workload.&lt;/li&gt;
&lt;li&gt;Comparing on-demand BigQuery to provisioned Redshift — different cost models entirely.&lt;/li&gt;
&lt;li&gt;Forgetting cross-cloud egress charges when picking a warehouse on a different cloud than your source.&lt;/li&gt;
&lt;li&gt;Overestimating Snowflake's premium over Redshift — at steady state, the gap is often smaller than the operational savings.&lt;/li&gt;
&lt;li&gt;Underestimating ease-of-use — engineering hours saved by a managed warehouse are real money.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Snowflake Interview Question on Choosing a Warehouse for a Specific Scenario
&lt;/h3&gt;

&lt;p&gt;You're advising a startup: their product runs on AWS, they have ~10 TB of analytical data growing 1 TB/month, three full-time analysts, and a small data-engineering team. They want sub-second BI dashboards and a daily ETL. Budget is "reasonable, not unlimited." &lt;strong&gt;Pick one warehouse and defend the choice in two paragraphs.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Snowflake on AWS with Per-Team Warehouses + Auto-Suspend
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- ETL warehouse&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ETL&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'MEDIUM'&lt;/span&gt; &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- BI warehouse with multi-cluster for concurrency&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_BI&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SMALL'&lt;/span&gt; &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
         &lt;span class="n"&gt;MIN_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;MAX_CLUSTER_COUNT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- analyst warehouse for ad-hoc&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE&lt;/span&gt; &lt;span class="n"&gt;WH_ANALYSTS&lt;/span&gt;
    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LARGE'&lt;/span&gt; &lt;span class="n"&gt;AUTO_SUSPEND&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="n"&gt;AUTO_RESUME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;consideration&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;source-data cloud&lt;/td&gt;
&lt;td&gt;AWS — both Snowflake and Redshift fit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;workload&lt;/td&gt;
&lt;td&gt;mixed: nightly ETL (bursty) + BI (concurrent) + analyst (spiky)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;team size&lt;/td&gt;
&lt;td&gt;small data-eng team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;size today&lt;/td&gt;
&lt;td&gt;10 TB, growing 1 TB/month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;operational burden tolerance&lt;/td&gt;
&lt;td&gt;low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;result&lt;/td&gt;
&lt;td&gt;Snowflake on AWS&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the startup gets sub-second BI (result cache + multi-cluster &lt;code&gt;WH_BI&lt;/code&gt;), idle warehouses auto-suspend (cost), the data-engineering team doesn't spend Friday afternoons on &lt;code&gt;VACUUM&lt;/code&gt; and distkey tuning (no Redshift-style maintenance), and a future move off AWS doesn't force a warehouse migration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake on AWS&lt;/strong&gt; — keeps the data on the same cloud as the product; same-cloud egress is minimal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three warehouses&lt;/strong&gt; — isolates ETL from BI from analysts; one slow query never blocks another.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AUTO_SUSPEND = 60&lt;/code&gt; (seconds)&lt;/strong&gt; — idle compute is not billed; the warehouses run only when there's work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-cluster &lt;code&gt;WH_BI&lt;/code&gt;&lt;/strong&gt; — handles the 9 AM dashboard concurrency spike without a bigger warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Near-zero maintenance&lt;/strong&gt; — no &lt;code&gt;VACUUM&lt;/code&gt;, no distkey, no manual partition tuning; a small team can run it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — proportional to actual usage (a cost-cap sketch follows this list); the always-on cost of a provisioned Redshift cluster is the worst fit for a startup's spiky workload.&lt;/li&gt;
&lt;/ul&gt;
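
&lt;p&gt;To keep a "reasonable, not unlimited" budget honest, a resource monitor caps credit burn. A hedged sketch — the quota, thresholds, and monitor name are illustrative, not part of the scenario:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Monthly credit cap: notify at 80%, suspend the assigned warehouse at 100%
CREATE RESOURCE MONITOR rm_monthly
    WITH CREDIT_QUOTA = 200
         FREQUENCY = MONTHLY
         START_TIMESTAMP = IMMEDIATELY
    TRIGGERS ON 80  PERCENT DO NOTIFY
             ON 100 PERCENT DO SUSPEND;

ALTER WAREHOUSE WH_ANALYSTS SET RESOURCE_MONITOR = rm_monthly;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;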

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; see the full ETL-and-warehouse playbook in &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — ETL System Design&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL System Design for DE Interviews&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing Snowflake (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If your workload looks like…&lt;/th&gt;
&lt;th&gt;Snowflake is a good fit because…&lt;/th&gt;
&lt;th&gt;Watch out for…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Analytical SQL over 10 GB–10 PB&lt;/td&gt;
&lt;td&gt;Columnar storage + parallel compute&lt;/td&gt;
&lt;td&gt;Tiny datasets are cheaper on DuckDB / SQLite&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-team concurrent dashboards&lt;/td&gt;
&lt;td&gt;Per-team warehouses + multi-cluster&lt;/td&gt;
&lt;td&gt;Forgetting &lt;code&gt;AUTO_SUSPEND&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-cloud or cloud-agnostic&lt;/td&gt;
&lt;td&gt;Runs on AWS, GCP, Azure&lt;/td&gt;
&lt;td&gt;Cross-region egress costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spiky workloads with idle gaps&lt;/td&gt;
&lt;td&gt;Per-second billing + auto-suspend&lt;/td&gt;
&lt;td&gt;Always-on warehouses are wasteful&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need Time Travel + dev clones&lt;/td&gt;
&lt;td&gt;Built-in, zero-cost cloning&lt;/td&gt;
&lt;td&gt;Long retention on high-churn tables is expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semi-structured JSON / Parquet&lt;/td&gt;
&lt;td&gt;First-class &lt;code&gt;VARIANT&lt;/code&gt; type + &lt;code&gt;COPY INTO&lt;/code&gt; JSON&lt;/td&gt;
&lt;td&gt;Storing naturally tabular data as JSON wastes columnar compression and pruning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you propose Snowflake in a system-design round, immediately name the &lt;strong&gt;three layers&lt;/strong&gt; and the &lt;strong&gt;per-team warehouse split&lt;/strong&gt;. Those two sentences turn a generic answer into one that signals you've actually run Snowflake in production.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Snowflake?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; is a cloud-native data warehouse / data platform built on three independent layers — storage, compute (virtual warehouses), and cloud services. It runs as a managed service on AWS, GCP, and Azure, and is optimised for analytical SQL workloads over very large datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a virtual warehouse?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;virtual warehouse&lt;/strong&gt; is a named, sized, isolated compute cluster that runs your queries. Warehouses can be created, suspended, resumed, and resized independently of one another and independently of the data they read. The pricing model is per-second of warehouse uptime.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is separation of compute and storage?
&lt;/h3&gt;

&lt;p&gt;Snowflake stores every table on cloud object storage that is decoupled from any compute cluster. &lt;strong&gt;Compute&lt;/strong&gt; (virtual warehouses) and &lt;strong&gt;storage&lt;/strong&gt; scale independently — you can resize compute in seconds with no data motion and add petabytes of storage without changing compute sizing.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Time Travel?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Time Travel&lt;/strong&gt; is the ability to query a table's historical state via &lt;code&gt;AT (TIMESTAMP =&amp;gt; …)&lt;/code&gt; or &lt;code&gt;BEFORE (STATEMENT =&amp;gt; …)&lt;/code&gt; clauses, within the table's retention window (1–90 days). It powers &lt;code&gt;UNDROP TABLE&lt;/code&gt; and accidental-write recovery without external backups.&lt;/p&gt;
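
&lt;p&gt;A minimal sketch (table name and offset are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Query the table as it looked one hour ago (offset in seconds)
SELECT COUNT(*) FROM orders AT (OFFSET =&amp;gt; -60 * 60);

-- Recover an accidentally dropped table within the retention window
UNDROP TABLE orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;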

&lt;h3&gt;
  
  
  What is zero-copy cloning?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero-copy cloning&lt;/strong&gt; uses &lt;code&gt;CREATE … CLONE&lt;/code&gt; to produce a new database, schema, or table that shares the source's underlying micro-partitions. No data is copied at clone time; the clone diverges only when either side writes. Ideal for instant dev / test environments.&lt;/p&gt;
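
&lt;p&gt;A minimal sketch (names are illustrative); clones also compose with Time Travel:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Instant dev copy of production — shares micro-partitions, copies nothing
CREATE DATABASE dev_db CLONE prod_db;

-- Clone a single table as it looked one hour ago
CREATE TABLE orders_debug CLONE orders AT (OFFSET =&amp;gt; -60 * 60);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;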

&lt;h3&gt;
  
  
  How does Snowflake compare to Redshift and BigQuery?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Snowflake&lt;/strong&gt; runs on AWS, GCP, and Azure with fully separated compute and storage. &lt;strong&gt;Redshift&lt;/strong&gt; is AWS-only and was historically coupled (Redshift Serverless changes that). &lt;strong&gt;BigQuery&lt;/strong&gt; is serverless and GCP-only, billing per byte scanned. Pick by cloud, workload shape, and team-size, not by feature count.&lt;/p&gt;

&lt;h3&gt;
  
  
  How long does it take to learn Snowflake?
&lt;/h3&gt;

&lt;p&gt;If your SQL fluency is solid, the core ideas (warehouses, separation of compute and storage, &lt;code&gt;COPY INTO&lt;/code&gt;, Time Travel, cloning) take &lt;strong&gt;1–2 weeks&lt;/strong&gt; of focused practice. Advanced topics (clustering, materialised views, streams + tasks, multi-cluster tuning) take another &lt;strong&gt;2–4 weeks&lt;/strong&gt; of real-world use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; data engineering practice problems — &lt;strong&gt;SQL&lt;/strong&gt; uses the &lt;strong&gt;PostgreSQL&lt;/strong&gt; dialect, with editorials and topics aligned to the same patterns Snowflake interviewers ask. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL practice →&lt;/a&gt;, filter by &lt;a href="https://dev.to/explore/practice/topic/etl"&gt;ETL →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/aggregations"&gt;aggregations →&lt;/a&gt;, and &lt;a href="https://dev.to/subscribe"&gt;see plans →&lt;/a&gt; when you want the full library.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Amazon Redshift for Data Engineering — Columnar Storage, MPP, COPY, Distribution Keys, Spectrum</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Tue, 12 May 2026 04:31:32 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/amazon-redshift-for-data-engineering-columnar-storage-mpp-copy-distribution-keys-spectrum-2p0d</link>
      <guid>https://forem.com/gowthampotureddi/amazon-redshift-for-data-engineering-columnar-storage-mpp-copy-distribution-keys-spectrum-2p0d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Amazon Redshift&lt;/strong&gt; is the AWS cloud data warehouse that data engineers reach for when an analytical workload outgrows a regular OLTP database (Postgres, MySQL) and needs to scan billions of rows in seconds. The mental model that holds the whole product together is four primitives: &lt;strong&gt;columnar storage plus massively parallel processing (MPP) for read-heavy analytics, distribution styles (&lt;code&gt;EVEN&lt;/code&gt;, &lt;code&gt;KEY&lt;/code&gt;, &lt;code&gt;ALL&lt;/code&gt;) and sort keys for join and filter performance, the &lt;code&gt;COPY&lt;/code&gt; command plus the leader/compute-node architecture for loading and executing queries, and Redshift Spectrum plus the &lt;code&gt;VACUUM&lt;/code&gt; and &lt;code&gt;ANALYZE&lt;/code&gt; maintenance commands for querying data directly in S3 and keeping the warehouse fast over time&lt;/strong&gt;. Master those four and you can answer almost every Redshift interview question without memorizing AWS marketing.&lt;/p&gt;

&lt;p&gt;This guide walks each topic cluster end-to-end with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and a step-by-step walkthrough&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and an &lt;strong&gt;interview-style scenario with a full traced answer&lt;/strong&gt; that explains why the design is correct, what the cost is, and where beginners typically slip. Every example uses PostgreSQL-flavored SQL — the dialect Redshift speaks — so the patterns you learn here transfer directly to live coding rounds and production warehouse work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsevig32jxdsrf3hysks.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmsevig32jxdsrf3hysks.webp" alt="Bold blog header for Amazon Redshift for data engineering with PipeCode branding, a stylized columnar-storage stack icon with parallel processing nodes, AWS purple and orange accents, and pipecode.ai attribution on a dark gradient background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top Amazon Redshift interview topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, worked examples, a worked interview question, and a step-by-step traced solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in Redshift interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Columnar storage, MPP, and compression&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The architectural foundation; explains why Redshift is fast for analytics and slow for single-row writes — the OLTP-vs-OLAP question every interview opens with.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Distribution styles (&lt;code&gt;EVEN&lt;/code&gt;, &lt;code&gt;KEY&lt;/code&gt;, &lt;code&gt;ALL&lt;/code&gt;) and sort keys&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The two schema-design knobs that decide whether a 10TB join takes 30 seconds or 30 minutes; &lt;code&gt;DISTKEY&lt;/code&gt; controls data co-location for joins, &lt;code&gt;SORTKEY&lt;/code&gt; controls zone-map pruning for filters.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;COPY&lt;/code&gt; command and leader/compute-node architecture&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;COPY&lt;/code&gt; is how 99% of bulk ingestion lands in Redshift; the leader/compute split is how every query is planned, distributed, and aggregated — both topics show up in every loop.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Redshift Spectrum, &lt;code&gt;VACUUM&lt;/code&gt;, and &lt;code&gt;ANALYZE&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Spectrum lets you query S3 with SQL without loading first (the lakehouse pattern); &lt;code&gt;VACUUM&lt;/code&gt; reclaims deleted-row space and re-sorts, &lt;code&gt;ANALYZE&lt;/code&gt; refreshes planner statistics — the two commands that keep a Redshift cluster fast in production.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; the OLTP-vs-OLAP distinction is the single most important Redshift mental model. &lt;strong&gt;OLTP&lt;/strong&gt; (Postgres, MySQL) is optimized for many small writes — insert/update/delete one row at a time, with row-oriented storage. &lt;strong&gt;OLAP&lt;/strong&gt; (Redshift, Snowflake, BigQuery) is optimized for scanning huge amounts of data — &lt;code&gt;SUM&lt;/code&gt;/&lt;code&gt;AVG&lt;/code&gt;/&lt;code&gt;COUNT&lt;/code&gt; across millions of rows, with column-oriented storage and parallel compute. If your interviewer's first question is "when would you reach for Redshift over Postgres?", the right answer names this split.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Amazon Redshift Columnar Storage, MPP, and Compression
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why columnar storage + massively parallel processing makes Redshift fast for analytics
&lt;/h3&gt;

&lt;p&gt;"Why is Redshift faster than Postgres for analytics queries?" is the signature opening question — and the answer is the &lt;strong&gt;columnar + MPP + compression&lt;/strong&gt; triple. The mental model: &lt;strong&gt;a row-oriented database (Postgres) stores all columns of a row physically next to each other on disk; a columnar database (Redshift) stores all values of a single column next to each other; an aggregate query like &lt;code&gt;SUM(revenue)&lt;/code&gt; reads only the &lt;code&gt;revenue&lt;/code&gt; column block instead of every row's full payload — orders of magnitude less I/O&lt;/strong&gt;. Layer in massively parallel processing (MPP) — the work is split across many compute nodes — and you get sub-second scans across billions of rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz52a79kzags6iuon1q9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foz52a79kzags6iuon1q9.webp" alt="Two-panel diagram: left shows row-oriented vs column-oriented storage with the salary column highlighted as a single contiguous block in the columnar layout; right shows a 1-billion-row scan being split across 10 MPP compute nodes that each process 100 million rows in parallel, with Redshift purple and AWS orange brand accents." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "why is Redshift slow for single-row updates?", flip the columnar logic — to update one row, the engine has to find and rewrite the value in every column block. Row stores do this in one I/O; columnar stores do it in N I/Os (one per column). State this trade-off explicitly; it signals you understand the architecture, not just the marketing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Columnar storage — column-block reads instead of full-row scans
&lt;/h4&gt;

&lt;p&gt;The columnar invariant: &lt;strong&gt;Redshift stores each column as a separate sequence of values on disk; an analytic query like &lt;code&gt;SELECT SUM(amount) FROM orders&lt;/code&gt; reads only the &lt;code&gt;amount&lt;/code&gt; column block and skips the other columns entirely&lt;/strong&gt;. For a 50-column table where the query touches one column, that's a 50× I/O reduction compared to a row-oriented scan.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Column blocks&lt;/strong&gt; — values for a single column stored contiguously; 1MB blocks by default in Redshift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column projection&lt;/strong&gt; — the planner reads only blocks for columns referenced in &lt;code&gt;SELECT&lt;/code&gt;/&lt;code&gt;WHERE&lt;/code&gt;/&lt;code&gt;GROUP BY&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zone maps&lt;/strong&gt; — per-block min/max metadata; if a &lt;code&gt;WHERE&lt;/code&gt; predicate cannot match the block's range, the block is skipped entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Encoding per column&lt;/strong&gt; — Redshift picks a compression encoding (RAW, LZO, ZSTD, RUNLENGTH, BYTEDICT, …) per column based on data shape.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;sales&lt;/code&gt; table with 5 columns and 100 million rows; query touches only &lt;code&gt;amount&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;storage layout&lt;/th&gt;
&lt;th&gt;bytes read for &lt;code&gt;SUM(amount)&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;scan time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row-oriented (Postgres)&lt;/td&gt;
&lt;td&gt;100M × ~120 bytes per row = ~12 GB&lt;/td&gt;
&lt;td&gt;60-90s on one node&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;column-oriented (Redshift)&lt;/td&gt;
&lt;td&gt;100M × 8 bytes for the amount column = ~800 MB&lt;/td&gt;
&lt;td&gt;2-3s on one node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The query says &lt;code&gt;SELECT SUM(amount) FROM sales&lt;/code&gt; — only one column is referenced.&lt;/li&gt;
&lt;li&gt;In a row store, the engine reads every row's full payload (~120 bytes including order_id, customer_id, amount, status, ts) just to get the amount.&lt;/li&gt;
&lt;li&gt;In a columnar store, the engine reads only the contiguous &lt;code&gt;amount&lt;/code&gt; column block — ~8 bytes per value, no other columns touched.&lt;/li&gt;
&lt;li&gt;Zone maps further skip blocks whose min/max don't satisfy any &lt;code&gt;WHERE&lt;/code&gt; predicate (e.g., &lt;code&gt;WHERE order_date &amp;gt;= '2026-05-01'&lt;/code&gt; skips every block with &lt;code&gt;max(order_date) &amp;lt; 2026-05-01&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Combined with MPP (next sub-topic), the same scan that took 60-90s on a single Postgres node finishes in 2-3s across 10 Redshift compute nodes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Same query, dramatically different I/O profile in Redshift vs Postgres&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
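
&lt;p&gt;A filtered variant (assuming an &lt;code&gt;order_date&lt;/code&gt; column whose values are roughly grouped on disk) shows the zone-map skip from step 4:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Blocks whose min/max order_date range cannot contain the predicate
-- are skipped via zone maps and never read from disk
SELECT SUM(amount) AS may_revenue
FROM sales
WHERE order_date &amp;gt;= '2026-05-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;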



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the bigger the table and the fewer columns your query touches, the bigger the columnar win. Single-column aggregates over wide tables are the canonical "Redshift dominates Postgres" workload.&lt;/p&gt;

&lt;h4&gt;
  
  
  Massively parallel processing — split one query across many compute nodes
&lt;/h4&gt;

&lt;p&gt;The MPP invariant: &lt;strong&gt;a Redshift cluster has one leader node and N compute nodes; data is partitioned across the compute nodes; the leader parses the query, generates a parallel plan, and ships sub-plans to each compute node; each node processes its slice independently; the leader aggregates the partial results into a final answer&lt;/strong&gt;. A 1-billion-row scan on 10 nodes becomes 10 parallel 100-million-row scans.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leader node&lt;/strong&gt; — query parser, planner, coordinator; no data lives here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compute nodes&lt;/strong&gt; — each holds a partition of the data and executes its slice of the plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Slices per node&lt;/strong&gt; — each compute node has multiple slices (CPU cores); each slice processes a partition of the node's data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregation&lt;/strong&gt; — partial sums/counts return to the leader for the final reduce.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A scan over 1 billion rows on a 10-node cluster with 4 slices per node.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;parallelism&lt;/th&gt;
&lt;th&gt;rows / time per unit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 leader node&lt;/td&gt;
&lt;td&gt;coordinator&lt;/td&gt;
&lt;td&gt;0 (no data)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 compute nodes&lt;/td&gt;
&lt;td&gt;10×&lt;/td&gt;
&lt;td&gt;100M rows each&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 slices per node&lt;/td&gt;
&lt;td&gt;40× total&lt;/td&gt;
&lt;td&gt;25M rows per slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;per-slice scan + partial sum&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;td&gt;~0.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;leader reduce of 40 partial sums&lt;/td&gt;
&lt;td&gt;aggregate&lt;/td&gt;
&lt;td&gt;&amp;lt;0.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The leader parses &lt;code&gt;SELECT SUM(amount) FROM sales&lt;/code&gt; and generates a parallel execution plan.&lt;/li&gt;
&lt;li&gt;The plan tells each compute node: "scan your slice of &lt;code&gt;sales&lt;/code&gt;, compute a local &lt;code&gt;SUM(amount)&lt;/code&gt;, ship the partial sum to the leader."&lt;/li&gt;
&lt;li&gt;All 40 slices (10 nodes × 4 slices) execute their scans in parallel — each touches ~25M rows.&lt;/li&gt;
&lt;li&gt;Each slice returns a single number (its local partial sum) to the leader — 40 numbers total, kilobytes of network traffic.&lt;/li&gt;
&lt;li&gt;The leader sums the 40 partials into the final answer and returns it to the client — total wall-clock time ~0.3s + network + leader reduce ≈ 0.5s for a 1-billion-row scan.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The MPP magic is invisible — same SQL, distributed plan under the hood&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; MPP wins are bounded by the slowest slice (the straggler). If one slice holds 5× more data than the others (skew), the query is 5× slower than it could be — which is exactly the problem &lt;code&gt;DISTKEY&lt;/code&gt; solves (next H2).&lt;/p&gt;

&lt;h4&gt;
  
  
  Compression — smaller storage, faster scans
&lt;/h4&gt;

&lt;p&gt;The compression invariant: &lt;strong&gt;Redshift compresses each column block using a per-column encoding chosen for the data shape; compressed blocks are smaller on disk (lower storage cost) AND smaller to read (faster scans); decompression happens on the compute nodes after the block is loaded into RAM&lt;/strong&gt;. The standard recommendation is to let &lt;code&gt;COPY&lt;/code&gt; choose encodings automatically via &lt;code&gt;COMPUPDATE ON&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AUTO&lt;/code&gt;&lt;/strong&gt; — Redshift picks the best encoding per column based on a sample of the data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ZSTD&lt;/code&gt;&lt;/strong&gt; — high-ratio general-purpose encoding; the modern default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RUNLENGTH&lt;/code&gt;&lt;/strong&gt; — best for columns with long runs of repeated values (booleans, low-cardinality flags).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BYTEDICT&lt;/code&gt;&lt;/strong&gt; — best for low-cardinality string columns (status, region, category).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three columns in &lt;code&gt;orders&lt;/code&gt;, each with a different encoding choice.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;column&lt;/th&gt;
&lt;th&gt;cardinality&lt;/th&gt;
&lt;th&gt;best encoding&lt;/th&gt;
&lt;th&gt;compression ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;order_id&lt;/code&gt; (BIGINT, unique)&lt;/td&gt;
&lt;td&gt;1B distinct&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;RAW&lt;/code&gt; or &lt;code&gt;ZSTD&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~2×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;status&lt;/code&gt; (VARCHAR, low cardinality)&lt;/td&gt;
&lt;td&gt;5 distinct&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BYTEDICT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~30×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;created_at&lt;/code&gt; (TIMESTAMP, sequential)&lt;/td&gt;
&lt;td&gt;1B distinct (but ordered)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ZSTD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~4×&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;During &lt;code&gt;COPY&lt;/code&gt;, Redshift samples each column and picks the encoding that gives the best compression for that data shape.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;order_id&lt;/code&gt; is unique and large — compression is limited to ~2× because there's no repetition pattern.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;status&lt;/code&gt; has only 5 distinct values across 1B rows — &lt;code&gt;BYTEDICT&lt;/code&gt; stores a 5-entry dictionary plus one tiny index per row, giving ~30× compression.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;created_at&lt;/code&gt; is sequential timestamps — &lt;code&gt;ZSTD&lt;/code&gt; exploits the highly regular byte patterns of near-consecutive values, compressing to ~4×.&lt;/li&gt;
&lt;li&gt;Net storage cost for the 1B-row &lt;code&gt;orders&lt;/code&gt; table drops from ~120GB raw to ~25-30GB compressed — and the same column-block reads return ~4× faster because they're smaller.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Let COPY pick encodings automatically (the standard recommendation)&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://mybucket/orders.csv'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456789012:role/RedshiftCopy'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="n"&gt;COMPUPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always run the first &lt;code&gt;COPY&lt;/code&gt; of a new table with &lt;code&gt;COMPUPDATE ON&lt;/code&gt; so Redshift can pick encodings. Override manually only if you have a workload-specific reason (most teams never do).&lt;/p&gt;
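
&lt;p&gt;Two hedged follow-ups: &lt;code&gt;ANALYZE COMPRESSION&lt;/code&gt; reports the encoding Redshift would recommend per column on an already-loaded table, and explicit &lt;code&gt;ENCODE&lt;/code&gt; clauses pin a choice when you do have that workload-specific reason (column list is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Report recommended per-column encodings for an existing table
ANALYZE COMPRESSION orders;

-- Pin encodings explicitly at table creation
CREATE TABLE orders_tuned (
    order_id   BIGINT      ENCODE ZSTD,
    status     VARCHAR(16) ENCODE BYTEDICT,
    created_at TIMESTAMP   ENCODE ZSTD
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;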

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Assuming Redshift is just a "faster Postgres" — it's optimized for analytics; single-row inserts and updates are 10-100× slower than Postgres.&lt;/li&gt;
&lt;li&gt;Loading a row at a time with &lt;code&gt;INSERT INTO ... VALUES (...)&lt;/code&gt; — Redshift accumulates uncompressed blocks per insert; use &lt;code&gt;COPY&lt;/code&gt; for bulk loads instead.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;SELECT *&lt;/code&gt; against a wide table — defeats column projection; explicitly list the columns you need.&lt;/li&gt;
&lt;li&gt;Mismatched zone maps from random insertion order — without a &lt;code&gt;SORTKEY&lt;/code&gt;, zone maps are useless and the planner reads every block.&lt;/li&gt;
&lt;li&gt;Not running &lt;code&gt;COMPUPDATE&lt;/code&gt; on the first load — Redshift falls back to &lt;code&gt;RAW&lt;/code&gt; encoding, leaving columns essentially uncompressed instead of 10-30× smaller.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Amazon Redshift Interview Question on OLTP vs OLAP
&lt;/h3&gt;

&lt;p&gt;A retail company wants to add real-time analytics dashboards on top of their existing Postgres-backed e-commerce app, which currently handles ~5,000 orders per second. &lt;strong&gt;Should they run the analytics queries against Postgres directly, or load data into Redshift? Justify with the architectural primitives.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using OLTP/OLAP separation, columnar storage, and MPP
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. KEEP Postgres for the OLTP workload (the app)
   - 5,000 orders/sec is write-heavy: insert one row at a time.
   - Row-oriented storage is optimal for this — single I/O per row.
   - ACID transactions on multi-table updates (order + line items + payment) are non-negotiable.

2. ADD Redshift for the OLAP workload (the dashboards)
   - Dashboards run SUM/AVG/COUNT over millions of rows — columnar storage gives 10-50× I/O reduction.
   - MPP splits each query across the cluster — sub-second scans across billions of rows.
   - Compression cuts storage cost and further speeds up scans.

3. INGESTION pipeline
   - Stream Postgres changes via Debezium/Kafka → S3 (1-min batches).
   - Daily COPY from S3 into Redshift bronze.orders, partitioned by ingest_date.
   - Silver/gold transformations in Redshift SQL (PostgreSQL dialect, familiar to the team).

4. WHY NOT just run analytics on Postgres?
   - Each dashboard query would scan millions of rows row-by-row — minutes of wall-clock time, blocking the OLTP workload.
   - Postgres single-node scans don't parallelize across machines.
   - Storage cost grows linearly without columnar compression.

5. WHY NOT just run everything on Redshift?
   - Single-row inserts at 5,000/sec would saturate the cluster within minutes.
   - Redshift's COMMIT cost (block-level) is orders of magnitude higher than Postgres's per-row commit.
   - No multi-table ACID semantics for the order/payment write pattern the app needs.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the two workloads have opposite I/O patterns — OLTP is many small writes (row store wins), OLAP is few large scans (column store + MPP wins). Forcing both onto one engine produces 10-100× worse performance for the loser of the architectural fight. The bronze/silver/gold lake/warehouse pattern lets each engine do what it's best at, with a one-minute Kafka latency between them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the architectural decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Is the workload write-heavy (small commits)?&lt;/td&gt;
&lt;td&gt;yes (5K orders/sec) → row store / OLTP / Postgres&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is there also a read-heavy analytic workload?&lt;/td&gt;
&lt;td&gt;yes (dashboards) → column store / OLAP / Redshift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Are the workloads independent in time?&lt;/td&gt;
&lt;td&gt;yes (dashboards = batch refresh) → can decouple with CDC + Kafka&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Pick the CDC tool&lt;/td&gt;
&lt;td&gt;Debezium (reads Postgres WAL; zero impact on OLTP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Pick the warehouse&lt;/td&gt;
&lt;td&gt;Redshift (columnar, MPP, PostgreSQL SQL dialect; team familiarity)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Define the boundary&lt;/td&gt;
&lt;td&gt;bronze layer in Redshift mirrors Postgres tables 1:1 via daily COPY&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended architecture summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;technology&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Application&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;OLTP — 5K writes/sec, ACID transactions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Change capture&lt;/td&gt;
&lt;td&gt;Debezium + Kafka&lt;/td&gt;
&lt;td&gt;reads WAL, no impact on Postgres&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Landing&lt;/td&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;partitioned by ingest_date; replay-friendly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warehouse&lt;/td&gt;
&lt;td&gt;Redshift&lt;/td&gt;
&lt;td&gt;OLAP — dashboards, BI, analyst SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute model&lt;/td&gt;
&lt;td&gt;Leader + N compute nodes&lt;/td&gt;
&lt;td&gt;columnar storage + MPP + compression&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OLTP vs OLAP separation&lt;/strong&gt; — the two workloads have orthogonal I/O patterns; forcing both onto one engine guarantees one of them is 10-100× slower than necessary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row store for writes&lt;/strong&gt; — Postgres writes one row in one disk operation; Redshift would have to update N column blocks (one per column) per row — single-row writes are ~50× slower on Redshift than Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Column store for reads&lt;/strong&gt; — Redshift scans only the columns the query touches and skips zone-pruned blocks; a typical analytic query touches 5% of the table volume vs 100% in Postgres.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MPP for big scans&lt;/strong&gt; — 10 compute nodes finish a 1B-row scan in ~0.5s; one Postgres node takes 60-90s for the same scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compression for storage + speed&lt;/strong&gt; — &lt;code&gt;BYTEDICT&lt;/code&gt; on low-cardinality columns gives 30× compression; smaller blocks are also faster to read (see the encoding sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC + Kafka decoupling&lt;/strong&gt; — daily snapshots would lag by ≥24h; Debezium + Kafka gives ~1-min freshness without touching the OLTP query path.&lt;/li&gt;
&lt;/ul&gt;
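
&lt;p&gt;To make the compression point concrete, here is a minimal sketch of explicit column encodings on a hypothetical bronze-layer table. The table and column names are illustrative; in practice &lt;code&gt;COMPUPDATE ON&lt;/code&gt; (covered in section 3) picks encodings automatically on the first load:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical bronze-layer mirror with explicit encodings.
-- BYTEDICT suits low-cardinality columns (status has ~10 distinct values);
-- ZSTD is a safe general-purpose default for the rest.
CREATE TABLE bronze_orders (
    order_id    BIGINT        ENCODE ZSTD,
    customer_id BIGINT        ENCODE ZSTD,
    status      VARCHAR(20)   ENCODE BYTEDICT,
    amount      DECIMAL(18,2) ENCODE ZSTD,
    order_date  DATE          ENCODE RAW   -- sort key left RAW so zone maps stay precise
)
DISTKEY (customer_id)
SORTKEY (order_date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;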

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for analytical query patterns and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for OLTP-to-warehouse pipeline shapes.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL pipelines&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Amazon Redshift Distribution Styles and Sort Keys
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;DISTKEY&lt;/code&gt;, &lt;code&gt;DISTSTYLE&lt;/code&gt;, and &lt;code&gt;SORTKEY&lt;/code&gt; — the two schema-design knobs that decide query speed
&lt;/h3&gt;

&lt;p&gt;"How would you design the schema for a 10TB orders fact table joined to a customers dim?" is the signature schema-design question — and the answer is &lt;strong&gt;distribution styles for join co-location and sort keys for filter pruning&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;DISTSTYLE&lt;/code&gt; controls how rows are partitioned across compute nodes — &lt;code&gt;EVEN&lt;/code&gt; (round-robin), &lt;code&gt;KEY&lt;/code&gt; (hash by a column so identical keys land on the same node), or &lt;code&gt;ALL&lt;/code&gt; (full copy on every node); &lt;code&gt;SORTKEY&lt;/code&gt; controls the physical order of rows within each node — zone maps then let the planner skip blocks whose min/max range can't match a &lt;code&gt;WHERE&lt;/code&gt; predicate&lt;/strong&gt;. Get both right and a 10TB join runs in seconds; get them wrong and the same join shuffles terabytes across the network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx4ex3ahp57hgsnzna7c.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmx4ex3ahp57hgsnzna7c.webp" alt="Two-panel Redshift schema-design diagram: left shows the three distribution styles EVEN (rows round-robined to all nodes), KEY (rows hashed by customer_id so identical keys co-locate), and ALL (full copy on every node) with three compute nodes drawn; right shows a SORTKEY(order_date) layout where blocks are physically ordered by date and a WHERE order_date &amp;gt; 2026-01-01 predicate prunes most blocks via zone maps; pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the single best join optimization in Redshift is &lt;strong&gt;co-locating the join columns on the same node&lt;/strong&gt;. If &lt;code&gt;orders.customer_id&lt;/code&gt; and &lt;code&gt;customers.id&lt;/code&gt; both have &lt;code&gt;DISTKEY&lt;/code&gt; on the customer key, the join runs entirely on each compute node without any network shuffle. State this principle out loud; senior interviewers grade it specifically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;DISTSTYLE EVEN&lt;/code&gt; — round-robin distribution for skew-free workloads
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;EVEN&lt;/code&gt; invariant: &lt;strong&gt;rows are distributed round-robin across compute nodes; every node gets approximately the same number of rows, eliminating skew&lt;/strong&gt;. The trade-off is that joins between two &lt;code&gt;EVEN&lt;/code&gt;-distributed tables require a full network shuffle of one side to match join keys.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distribution&lt;/strong&gt; — round-robin; one row per node, then repeat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skew&lt;/strong&gt; — minimized; each node has &lt;code&gt;|table| / N&lt;/code&gt; rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join cost&lt;/strong&gt; — high for &lt;code&gt;EVEN&lt;/code&gt;-vs-&lt;code&gt;EVEN&lt;/code&gt; joins; the planner must shuffle one side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — tables that are rarely joined, or tables where no column has good distribution properties.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 1M-row &lt;code&gt;events&lt;/code&gt; table on a 4-node cluster with &lt;code&gt;DISTSTYLE EVEN&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;node&lt;/th&gt;
&lt;th&gt;rows held&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;node 1&lt;/td&gt;
&lt;td&gt;250,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;node 2&lt;/td&gt;
&lt;td&gt;250,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;node 3&lt;/td&gt;
&lt;td&gt;250,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;node 4&lt;/td&gt;
&lt;td&gt;250,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Perfectly balanced — every node carries equal load on any scan.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;COPY&lt;/code&gt; lands 1M rows into the cluster; the leader assigns rows round-robin to compute nodes.&lt;/li&gt;
&lt;li&gt;Node 1 gets rows 1, 5, 9, … (every 4th row); node 2 gets 2, 6, 10, …; and so on.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;SELECT COUNT(*) FROM events&lt;/code&gt; query parallelizes evenly — every node scans 250K rows.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;JOIN&lt;/code&gt; between &lt;code&gt;EVEN&lt;/code&gt;-distributed &lt;code&gt;events&lt;/code&gt; and another &lt;code&gt;EVEN&lt;/code&gt;-distributed table requires shipping one side over the network (a "broadcast" or "redistribute") to align join keys.&lt;/li&gt;
&lt;li&gt;For a 4-node cluster joining two 1M-row tables, that's 1M rows × ~120 bytes = ~120MB of network shuffle per join — fine for small tables, painful for large ones.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_ts&lt;/span&gt;   &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTSTYLE&lt;/span&gt; &lt;span class="n"&gt;EVEN&lt;/span&gt;
&lt;span class="n"&gt;SORTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_ts&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;EVEN&lt;/code&gt; is the right default when the table is small, rarely joined, or when no column has a clear "join key" or "filter key" pattern.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;DISTSTYLE KEY&lt;/code&gt; (&lt;code&gt;DISTKEY&lt;/code&gt;) — co-locate joins on the same node
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;KEY&lt;/code&gt; invariant: &lt;strong&gt;rows are hashed by the &lt;code&gt;DISTKEY&lt;/code&gt; column; all rows with the same &lt;code&gt;DISTKEY&lt;/code&gt; value land on the same compute node; joins on that key require zero network shuffle because matching rows already share a node&lt;/strong&gt;. This is the single biggest performance lever for join-heavy schemas.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DISTKEY (customer_id)&lt;/code&gt;&lt;/strong&gt; — hash by &lt;code&gt;customer_id&lt;/code&gt;; all rows for one customer co-locate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join co-location&lt;/strong&gt; — &lt;code&gt;orders DISTKEY(customer_id) JOIN customers DISTKEY(id)&lt;/code&gt; runs locally per node.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skew risk&lt;/strong&gt; — if a few &lt;code&gt;customer_id&lt;/code&gt; values have disproportionate row counts (a "hot" customer), one node becomes a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick wisely&lt;/strong&gt; — the &lt;code&gt;DISTKEY&lt;/code&gt; should be the join column AND have roughly uniform value distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 10TB &lt;code&gt;orders&lt;/code&gt; fact and a 5GB &lt;code&gt;customers&lt;/code&gt; dim, both keyed on &lt;code&gt;customer_id&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;DISTKEY&lt;/th&gt;
&lt;th&gt;distribution effect&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;orders&lt;/td&gt;
&lt;td&gt;customer_id&lt;/td&gt;
&lt;td&gt;all orders for customer 448 land on node X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;customers&lt;/td&gt;
&lt;td&gt;id&lt;/td&gt;
&lt;td&gt;customer 448's row lands on node X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JOIN orders ON customers&lt;/td&gt;
&lt;td&gt;shared key&lt;/td&gt;
&lt;td&gt;runs locally per node; zero network shuffle&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;COPY&lt;/code&gt; loads 10TB of orders with &lt;code&gt;DISTKEY (customer_id)&lt;/code&gt; — Redshift hashes each row's &lt;code&gt;customer_id&lt;/code&gt; and sends it to the matching node.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;customers&lt;/code&gt; is loaded the same way with &lt;code&gt;DISTKEY (id)&lt;/code&gt; — same hash function on the same column, so customer 448's row lands on the same node as all of customer 448's orders.&lt;/li&gt;
&lt;li&gt;When a query says &lt;code&gt;orders JOIN customers ON orders.customer_id = customers.id&lt;/code&gt;, the planner sees the co-location and generates a &lt;em&gt;local&lt;/em&gt; join per node.&lt;/li&gt;
&lt;li&gt;Each node joins its slice of &lt;code&gt;orders&lt;/code&gt; against its slice of &lt;code&gt;customers&lt;/code&gt; independently — no network shuffle, no broadcast.&lt;/li&gt;
&lt;li&gt;The join completes in ~&lt;code&gt;O(|orders| / N)&lt;/code&gt; time per node — for a 10-node cluster, a 10TB join runs ~10× faster than the &lt;code&gt;EVEN&lt;/code&gt; variant that would shuffle the whole table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SORTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the &lt;code&gt;DISTKEY&lt;/code&gt; is the column you join on most frequently. Pick it to match the foreign-key relationship of your biggest, most-joined table — every other table in the join graph should use the same key for co-location.&lt;/p&gt;
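
&lt;p&gt;To verify co-location instead of assuming it, read the distribution label on the join step in &lt;code&gt;EXPLAIN&lt;/code&gt; output. A quick check against the two tables above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;

-- Distribution labels to look for on the join step:
--   DS_DIST_NONE   : co-located join, no network shuffle (the goal)
--   DS_BCAST_INNER : inner table broadcast to every node
--   DS_DIST_BOTH   : both sides redistributed (worst case)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;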

&lt;h4&gt;
  
  
  &lt;code&gt;DISTSTYLE ALL&lt;/code&gt; — full copy on every node for small lookup tables
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ALL&lt;/code&gt; invariant: &lt;strong&gt;the entire table is replicated to every compute node; every join against this table runs locally because the data is everywhere&lt;/strong&gt;. The cost is N× storage (one copy per node); the benefit is zero-shuffle joins regardless of the other table's distribution.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — dimension/lookup tables under ~3M rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage cost&lt;/strong&gt; — N× (one copy per compute node).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join cost&lt;/strong&gt; — zero shuffle; runs locally against the small table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance cost&lt;/strong&gt; — every &lt;code&gt;COPY&lt;/code&gt; writes to all nodes; updates are N× more expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 500-row &lt;code&gt;countries&lt;/code&gt; lookup table on a 10-node cluster.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;DISTSTYLE&lt;/th&gt;
&lt;th&gt;per-node row count&lt;/th&gt;
&lt;th&gt;total storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;countries (500 rows, ~50KB)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ALL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;500 on every node&lt;/td&gt;
&lt;td&gt;500KB total (10× 50KB)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;every join against countries&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;td&gt;zero network shuffle&lt;/td&gt;
&lt;td&gt;sub-second&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;countries&lt;/code&gt; table has 500 rows — a tiny lookup table mapping country codes to country names.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DISTSTYLE ALL&lt;/code&gt; instructs Redshift to copy all 500 rows to every compute node.&lt;/li&gt;
&lt;li&gt;Total storage = &lt;code&gt;500 rows × 10 nodes = 5,000 row-copies&lt;/code&gt; (~500KB total) — negligible compared to the multi-terabyte fact tables.&lt;/li&gt;
&lt;li&gt;Any query that joins &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;countries&lt;/code&gt; runs locally on each node — Node 5's slice of &lt;code&gt;orders&lt;/code&gt; joins against Node 5's full copy of &lt;code&gt;countries&lt;/code&gt;, no shuffle.&lt;/li&gt;
&lt;li&gt;The 10× storage cost is worth paying because the join cost drops from "shuffle 10TB across the network" to "in-RAM lookup against 500 rows" — a 1000× speedup.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;countries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;country_code&lt;/span&gt; &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;country_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;region&lt;/span&gt;       &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTSTYLE&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="n"&gt;SORTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;DISTSTYLE ALL&lt;/code&gt; is the right call for any dimension/lookup table under ~3M rows that's joined frequently. Above ~3M rows, the N× storage cost outweighs the join-time savings.&lt;/p&gt;
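
&lt;p&gt;One way to spot candidates for &lt;code&gt;ALL&lt;/code&gt; is the &lt;code&gt;SVV_TABLE_INFO&lt;/code&gt; system view. A sketch, treating the ~3M-row cutoff above as a heuristic rather than a hard limit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Small tables not yet replicated — candidates for DISTSTYLE ALL
-- if they appear frequently on the inner side of joins.
SELECT "table",
       tbl_rows,
       size AS size_1mb_blocks,
       diststyle
FROM svv_table_info
WHERE tbl_rows &amp;lt; 3000000
  AND diststyle &amp;lt;&amp;gt; 'ALL'
ORDER BY tbl_rows;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;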

&lt;h4&gt;
  
  
  &lt;code&gt;SORTKEY&lt;/code&gt; — physical sort order for zone-map pruning
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;SORTKEY&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;SORTKEY (col)&lt;/code&gt; physically orders rows by &lt;code&gt;col&lt;/code&gt; within each compute node; Redshift maintains a per-block zone map (min/max of every column); a &lt;code&gt;WHERE&lt;/code&gt; predicate that matches a contiguous range of the sort key prunes ~99% of blocks without reading them&lt;/strong&gt;. This is the second-biggest performance lever after &lt;code&gt;DISTKEY&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SORTKEY (order_date)&lt;/code&gt;&lt;/strong&gt; — physically sort by date; &lt;code&gt;WHERE order_date &amp;gt; '2026-01-01'&lt;/code&gt; skips every pre-2026 block.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compound &lt;code&gt;SORTKEY (col_a, col_b)&lt;/code&gt;&lt;/strong&gt; — primary sort by &lt;code&gt;col_a&lt;/code&gt;, secondary by &lt;code&gt;col_b&lt;/code&gt;; works best when predicates filter on &lt;code&gt;col_a&lt;/code&gt; first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTERLEAVED SORTKEY (col_a, col_b)&lt;/code&gt;&lt;/strong&gt; — weights both columns equally; rarely beats compound and requires periodic &lt;code&gt;VACUUM REINDEX&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Date sort key is the canonical choice&lt;/strong&gt; — most analytical queries filter by date.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 1B-row &lt;code&gt;orders&lt;/code&gt; table with &lt;code&gt;SORTKEY (order_date)&lt;/code&gt;; query filters one month.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;bytes touched&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;raw table (1B rows, ~120B each)&lt;/td&gt;
&lt;td&gt;~120GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compressed (ZSTD ~4×)&lt;/td&gt;
&lt;td&gt;~30GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE order_date BETWEEN '2026-05-01' AND '2026-05-31'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~2.5GB (1 month of 12)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zone-map skip = 11/12 of the table&lt;/td&gt;
&lt;td&gt;~92% blocks skipped&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With &lt;code&gt;SORTKEY (order_date)&lt;/code&gt;, all rows are physically ordered by date within each compute node.&lt;/li&gt;
&lt;li&gt;Each 1MB column block has a zone-map entry recording the min and max &lt;code&gt;order_date&lt;/code&gt; value in that block.&lt;/li&gt;
&lt;li&gt;The query &lt;code&gt;WHERE order_date BETWEEN '2026-05-01' AND '2026-05-31'&lt;/code&gt; triggers a planner check: for every block, is &lt;code&gt;[block_min, block_max]&lt;/code&gt; overlapping &lt;code&gt;[2026-05-01, 2026-05-31]&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;For blocks with &lt;code&gt;max(order_date) &amp;lt; 2026-05-01&lt;/code&gt; or &lt;code&gt;min(order_date) &amp;gt; 2026-05-31&lt;/code&gt;, the block is skipped entirely — no I/O.&lt;/li&gt;
&lt;li&gt;For a 12-month dataset, ~11/12 of the blocks are skipped — the query reads ~2.5GB instead of ~30GB and finishes in a couple of seconds instead of ~30s.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Sort key already on order_date (set at CREATE TABLE)&lt;/span&gt;
&lt;span class="c1"&gt;-- This query benefits from zone-map pruning automatically:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-31'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; pick a &lt;code&gt;SORTKEY&lt;/code&gt; matching the most common &lt;code&gt;WHERE&lt;/code&gt; predicate in your workload. For event/order tables, that's almost always the timestamp column.&lt;/p&gt;
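
&lt;p&gt;Zone maps only stay sharp while the table stays sorted. &lt;code&gt;SVV_TABLE_INFO&lt;/code&gt; exposes how much of the table sits in the unsorted region, as in this sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- unsorted  = percent of rows in the unsorted region (pruning degrades as it grows)
-- stats_off = how stale the planner statistics are
SELECT "table", unsorted, stats_off
FROM svv_table_info
WHERE "table" = 'orders';

-- Restore pruning and refresh statistics when either number climbs:
VACUUM SORT ONLY orders;
ANALYZE orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;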

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting every table to &lt;code&gt;DISTSTYLE EVEN&lt;/code&gt; — joins between large tables become shuffle-heavy and 10× slower than necessary.&lt;/li&gt;
&lt;li&gt;Picking a high-skew &lt;code&gt;DISTKEY&lt;/code&gt; (e.g., &lt;code&gt;status&lt;/code&gt; with 90% of rows in one value) — one compute node holds 90% of the data and becomes the bottleneck (see the audit query after this list).&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;DISTSTYLE ALL&lt;/code&gt; on tables larger than ~3M rows — N× storage cost overwhelms the join-time savings.&lt;/li&gt;
&lt;li&gt;Forgetting to set a &lt;code&gt;SORTKEY&lt;/code&gt; on fact tables — every &lt;code&gt;WHERE&lt;/code&gt; predicate reads the whole table because unsorted data leaves every block's min/max range too wide to prune.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;INTERLEAVED SORTKEY&lt;/code&gt; without &lt;code&gt;VACUUM REINDEX&lt;/code&gt; cadence — performance degrades over time; compound sort keys are simpler and usually better.&lt;/li&gt;
&lt;/ul&gt;
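
&lt;p&gt;Several of these mistakes surface in one audit query over &lt;code&gt;SVV_TABLE_INFO&lt;/code&gt;; the skew threshold of 2 below is an illustrative cutoff:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- skew_rows = rows on the fullest slice / rows on the emptiest slice;
-- values well above ~1.5 usually mean a hot DISTKEY value.
-- sortkey1  = first sort key column, if any.
SELECT "table", diststyle, skew_rows, sortkey1
FROM svv_table_info
WHERE skew_rows &amp;gt; 2
ORDER BY skew_rows DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;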

&lt;h3&gt;
  
  
  Amazon Redshift Interview Question on Schema Design
&lt;/h3&gt;

&lt;p&gt;Design the distribution style and sort key for a 10TB &lt;code&gt;orders&lt;/code&gt; fact table (joined frequently to a 5GB &lt;code&gt;customers&lt;/code&gt; dim by &lt;code&gt;customer_id&lt;/code&gt;, queried mostly by date range), plus a 500-row &lt;code&gt;countries&lt;/code&gt; lookup table. &lt;strong&gt;Justify every choice.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DISTKEY&lt;/code&gt; co-location + &lt;code&gt;SORTKEY&lt;/code&gt; on date + &lt;code&gt;DISTSTYLE ALL&lt;/code&gt; for the lookup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- 10TB fact table: hash on join column, sort on filter column&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;product_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;country_code&lt;/span&gt; &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SORTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- 5GB dim: same key so joins co-locate&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;          &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;        &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;signup_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- 500-row lookup: replicate everywhere&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;countries&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;country_code&lt;/span&gt; &lt;span class="nb"&gt;CHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;country_name&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;DISTSTYLE&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;
&lt;span class="n"&gt;SORTKEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;country_code&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;DISTKEY (customer_id)&lt;/code&gt; on both &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;customers&lt;/code&gt; co-locates the join — every node joins its local slice without network shuffle, turning a 10TB-shuffle join into a local hash join. &lt;code&gt;SORTKEY (order_date)&lt;/code&gt; on &lt;code&gt;orders&lt;/code&gt; means date-range queries (the most common analytical predicate) prune ~90%+ of blocks via zone maps. &lt;code&gt;DISTSTYLE ALL&lt;/code&gt; on the 500-row &lt;code&gt;countries&lt;/code&gt; table replicates it to every node so country lookups never shuffle, at negligible storage cost. The three choices together turn a "30-minute" join into a "30-second" join with no query rewriting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the design walkthrough:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;What's the most common JOIN on &lt;code&gt;orders&lt;/code&gt;?&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;JOIN customers ON customer_id = customers.id&lt;/code&gt; (90% of analytic queries)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Pick &lt;code&gt;DISTKEY&lt;/code&gt; for &lt;code&gt;orders&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_id&lt;/code&gt; (matches the join key)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Pick &lt;code&gt;DISTKEY&lt;/code&gt; for &lt;code&gt;customers&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;id&lt;/code&gt; (same hash function on the same column → co-located with orders.customer_id)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;What's the most common WHERE predicate on &lt;code&gt;orders&lt;/code&gt;?&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE order_date BETWEEN ... AND ...&lt;/code&gt; (date-range filter for any time-bounded report)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Pick &lt;code&gt;SORTKEY&lt;/code&gt; for &lt;code&gt;orders&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_date&lt;/code&gt; (date-range queries prune ~90%+ of blocks)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;What about &lt;code&gt;countries&lt;/code&gt; (500 rows, joined often)?&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DISTSTYLE ALL&lt;/code&gt; (10× storage cost is ~500KB — negligible; eliminates join shuffle)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended schema summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;DISTSTYLE&lt;/th&gt;
&lt;th&gt;DISTKEY&lt;/th&gt;
&lt;th&gt;SORTKEY&lt;/th&gt;
&lt;th&gt;rationale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;orders&lt;/td&gt;
&lt;td&gt;1B (10TB)&lt;/td&gt;
&lt;td&gt;KEY&lt;/td&gt;
&lt;td&gt;customer_id&lt;/td&gt;
&lt;td&gt;order_date&lt;/td&gt;
&lt;td&gt;co-locate join + zone-map date pruning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;customers&lt;/td&gt;
&lt;td&gt;50M (5GB)&lt;/td&gt;
&lt;td&gt;KEY&lt;/td&gt;
&lt;td&gt;id&lt;/td&gt;
&lt;td&gt;(signup_date if needed)&lt;/td&gt;
&lt;td&gt;match orders.DISTKEY for join co-location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;countries&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;ALL&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;country_code&lt;/td&gt;
&lt;td&gt;tiny lookup; replicate to every node&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DISTKEY (customer_id)&lt;/code&gt; on orders + &lt;code&gt;DISTKEY (id)&lt;/code&gt; on customers&lt;/strong&gt; — same hash function on the join key means matching rows already share a compute node; the join runs locally with zero network shuffle.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SORTKEY (order_date)&lt;/code&gt; on orders&lt;/strong&gt; — physically sorts blocks by date; zone-map pruning skips ~11/12 of blocks for a monthly query, turning a 30GB scan into a 1-3GB scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DISTSTYLE ALL&lt;/code&gt; on countries&lt;/strong&gt; — replicates the 500-row lookup to every node; every join against &lt;code&gt;countries&lt;/code&gt; runs locally with no shuffle. 10× storage cost is ~500KB — irrelevant compared to the join-time savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding skew on &lt;code&gt;customer_id&lt;/code&gt;&lt;/strong&gt; — distribution is roughly uniform across millions of customers; no single customer dominates, so no node bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why not put &lt;code&gt;SORTKEY&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt; too?&lt;/strong&gt; — date-range filters are the dominant WHERE predicate; sorting by &lt;code&gt;customer_id&lt;/code&gt; would help joins but hurt the more common date filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|orders| / N + zone-pruned)&lt;/code&gt; time&lt;/strong&gt; — co-located join scales linearly with node count; zone-map pruning gives an additional 10× reduction on date filters; net is sub-30-second response on 10TB facts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins practice problems&lt;/a&gt; for join-key reasoning and &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling practice&lt;/a&gt; for star-schema design.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL join problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — dimensional modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Dimensional modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL pipelines&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Amazon Redshift &lt;code&gt;COPY&lt;/code&gt; Command and Leader/Compute Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bulk loading from S3 and how queries flow through the leader and compute nodes
&lt;/h3&gt;

&lt;p&gt;"Walk me through what happens when I run a Redshift query" is the signature execution-flow question — and the cleanest answer pairs the &lt;strong&gt;&lt;code&gt;COPY&lt;/code&gt; command&lt;/strong&gt; (how data lands in the cluster) with the &lt;strong&gt;leader/compute architecture&lt;/strong&gt; (how queries are planned and executed). The mental model: &lt;strong&gt;&lt;code&gt;COPY&lt;/code&gt; is the bulk-load command that ingests data from S3 (or other AWS sources) in parallel across all compute nodes — the only sane way to load anything at scale; the leader node parses every SQL query, builds a parallel plan, ships sub-plans to the compute nodes, and aggregates partial results; compute nodes hold the data and execute the bulk of the work&lt;/strong&gt;. Knowing both halves lets you debug any "why is this slow" question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1glyv6nqgtkv5yoenr3e.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1glyv6nqgtkv5yoenr3e.webp" alt="Two-panel Redshift architecture diagram: left shows a COPY command pulling a partitioned CSV file from an S3 bucket, with parallel arrows fanning out to multiple compute nodes that each ingest a slice in parallel; right shows the query execution flow from a user/BI client through the leader node (parse, plan, dispatch) down to multiple compute nodes (scan, join, aggregate locally) and back up to the leader for the final reduce, with the response returning to the client; PipeCode brand purple and AWS orange accents." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; &lt;code&gt;COPY&lt;/code&gt; parallelizes by default — if you give it one big file, it can only ingest as fast as one slice can pull from S3. The standard practice is to split source data into &lt;strong&gt;N × number_of_slices&lt;/strong&gt; files of roughly equal size (e.g., 40 files for a 10-node × 4-slice cluster). Every slice grabs a file in parallel, maxing out S3 throughput.&lt;/p&gt;
&lt;/blockquote&gt;
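
&lt;p&gt;The slice count to target is queryable. A small sketch using the &lt;code&gt;STV_SLICES&lt;/code&gt; system view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per slice; total_slices is the file-count multiple to aim for.
SELECT COUNT(*) AS total_slices,
       COUNT(DISTINCT node) AS nodes
FROM stv_slices;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;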

&lt;h4&gt;
  
  
  &lt;code&gt;COPY&lt;/code&gt; from S3 — the only sane way to bulk-load data
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;COPY&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;COPY tablename FROM 's3://...' IAM_ROLE 'arn:...' FORMAT AS CSV&lt;/code&gt; reads the file(s) at the S3 prefix, distributes rows across compute nodes per the table's &lt;code&gt;DISTSTYLE&lt;/code&gt;, and writes them to columnar blocks in parallel; it's 10-100× faster than &lt;code&gt;INSERT INTO ... VALUES (...)&lt;/code&gt; and is the only acceptable bulk-load method&lt;/strong&gt;. Supports CSV, JSON, Parquet, Avro, ORC, fixed-width.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FROM 's3://bucket/prefix/'&lt;/code&gt;&lt;/strong&gt; — S3 source; can be a single file or a prefix matching multiple files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IAM_ROLE 'arn:aws:iam::...'&lt;/code&gt;&lt;/strong&gt; — assumes the IAM role for S3 access; avoids credential leakage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FORMAT AS CSV&lt;/code&gt;&lt;/strong&gt; / &lt;code&gt;PARQUET&lt;/code&gt; / &lt;code&gt;JSON&lt;/code&gt; — file format hint; Parquet is the fastest because it's already columnar.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COMPUPDATE ON&lt;/code&gt;&lt;/strong&gt; — let Redshift pick compression encodings on the first load (recommended).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel ingest&lt;/strong&gt; — file count should be a multiple of &lt;code&gt;cluster_slices&lt;/code&gt; for max throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Load a partitioned daily orders CSV from S3 into the &lt;code&gt;orders&lt;/code&gt; table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;source prefix&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://analytics-lake/bronze/orders/ingest_date=2026-05-11/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;file count&lt;/td&gt;
&lt;td&gt;40 (matches 10-node × 4-slice cluster)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total size&lt;/td&gt;
&lt;td&gt;50 GB (1.25 GB per file)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;target table&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;orders&lt;/code&gt; with &lt;code&gt;DISTKEY(customer_id)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;expected load time&lt;/td&gt;
&lt;td&gt;~2-3 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The S3 source has 40 files of ~1.25GB each, one for each cluster slice.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;COPY&lt;/code&gt; command parses each file in parallel — slice 1 grabs file 1, slice 2 grabs file 2, …, slice 40 grabs file 40.&lt;/li&gt;
&lt;li&gt;Each slice parses its CSV rows, hashes by &lt;code&gt;customer_id&lt;/code&gt; (the &lt;code&gt;DISTKEY&lt;/code&gt;), and ships rows to the correct destination node.&lt;/li&gt;
&lt;li&gt;Each destination node accumulates the rows in its slice and writes them to columnar blocks with auto-chosen compression encodings.&lt;/li&gt;
&lt;li&gt;The 50GB load completes in ~2-3 minutes — vs ~6-12 hours for the equivalent &lt;code&gt;INSERT INTO ... VALUES&lt;/code&gt; row-at-a-time approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://analytics-lake/bronze/orders/ingest_date=2026-05-11/'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456789012:role/RedshiftCopy'&lt;/span&gt;
&lt;span class="n"&gt;FORMAT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="k"&gt;DELIMITER&lt;/span&gt; &lt;span class="s1"&gt;','&lt;/span&gt;
&lt;span class="n"&gt;IGNOREHEADER&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;DATEFORMAT&lt;/span&gt; &lt;span class="s1"&gt;'YYYY-MM-DD'&lt;/span&gt;
&lt;span class="n"&gt;COMPUPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;
&lt;span class="n"&gt;STATUPDATE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always use &lt;code&gt;COPY&lt;/code&gt; for &amp;gt;10K-row loads. Single-row &lt;code&gt;INSERT&lt;/code&gt;s are an anti-pattern that wastes the columnar storage layout and the parallel architecture.&lt;/p&gt;
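
&lt;p&gt;When the daily load is a merge rather than a pure append, &lt;code&gt;COPY&lt;/code&gt; still does the heavy lifting. A sketch of the standard staging-table upsert pattern (table and bucket names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1. Bulk-load the increment into a temp staging table.
CREATE TEMP TABLE orders_stage (LIKE orders);

COPY orders_stage
FROM 's3://analytics-lake/bronze/orders/ingest_date=2026-05-11/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
FORMAT AS CSV
IGNOREHEADER 1;

-- 2. Merge in a single transaction: delete superseded rows, insert fresh ones.
BEGIN;
DELETE FROM orders
USING orders_stage
WHERE orders.order_id = orders_stage.order_id;

INSERT INTO orders
SELECT * FROM orders_stage;
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;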

&lt;h4&gt;
  
  
  Leader node — query parser, planner, and result aggregator
&lt;/h4&gt;

&lt;p&gt;The leader-node invariant: &lt;strong&gt;the leader node receives every SQL query, parses it, generates a parallel execution plan, ships sub-plans to the compute nodes, collects partial results, and returns the final answer to the client; the leader holds no data and runs no scans itself&lt;/strong&gt;. It's the orchestrator, not a worker.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parses SQL&lt;/strong&gt; — checks syntax, resolves catalog metadata, validates types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generates plan&lt;/strong&gt; — chooses scan order, join order, join algorithm, and per-node sub-plans.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatches sub-plans&lt;/strong&gt; — ships compiled code to each compute node via the cluster network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregates results&lt;/strong&gt; — collects partial sums/counts and produces the final answer.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;SELECT SUM(amount) FROM orders WHERE order_date = '2026-05-11'&lt;/code&gt; on a 10-node cluster.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;who&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;parse + plan&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;compiled C++ binary per slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dispatch&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;sub-plans sent to all 40 slices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;local scan + sum&lt;/td&gt;
&lt;td&gt;compute nodes (40 slices)&lt;/td&gt;
&lt;td&gt;40 partial sums&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;network return&lt;/td&gt;
&lt;td&gt;compute → leader&lt;/td&gt;
&lt;td&gt;40 floats (kilobytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;final reduce&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;one total&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;client response&lt;/td&gt;
&lt;td&gt;leader → client&lt;/td&gt;
&lt;td&gt;one number&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client (psql, BI tool) sends the SQL string to the leader node over TCP.&lt;/li&gt;
&lt;li&gt;The leader parses the SQL, validates that &lt;code&gt;orders&lt;/code&gt; exists and &lt;code&gt;amount&lt;/code&gt;/&lt;code&gt;order_date&lt;/code&gt; are valid columns.&lt;/li&gt;
&lt;li&gt;The leader compiles the query into per-slice C++ code (Redshift caches this code for re-use).&lt;/li&gt;
&lt;li&gt;The leader ships the compiled plan to all 40 slices via the internal cluster network.&lt;/li&gt;
&lt;li&gt;Each slice scans its local data, applies the &lt;code&gt;WHERE&lt;/code&gt; predicate, computes a partial &lt;code&gt;SUM(amount)&lt;/code&gt;, and returns one number to the leader; the leader sums the 40 partials into the final answer and returns it to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- The SQL is identical to single-node SQL; parallelism is invisible&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;daily_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the leader is also where query queueing happens — if you see "queries are queued" in the console, you're at the leader's concurrency limit; scale the cluster or use Concurrency Scaling.&lt;/p&gt;
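
&lt;p&gt;Queueing is visible from SQL before it shows up in the console. A sketch against the &lt;code&gt;STV_WLM_QUERY_STATE&lt;/code&gt; system view:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- In-flight queries per WLM service class, split by queued vs. executing.
SELECT service_class,
       state,
       COUNT(*) AS queries,
       MAX(queue_time) / 1000000 AS max_queue_seconds  -- queue_time is in microseconds
FROM stv_wlm_query_state
GROUP BY service_class, state
ORDER BY service_class;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;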

&lt;h4&gt;
  
  
  Compute nodes — where data lives and scans happen
&lt;/h4&gt;

&lt;p&gt;The compute-node invariant: &lt;strong&gt;compute nodes hold the partitioned data and execute every per-slice scan, filter, join, and partial aggregate; they communicate with the leader over the cluster network and with each other when a join requires a shuffle or broadcast&lt;/strong&gt;. Each compute node has multiple slices (typically 2-16, depending on node type), each running on its own CPU core.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slices&lt;/strong&gt; — the unit of parallelism; each slice is a CPU core + a partition of the node's data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local scan&lt;/strong&gt; — every slice scans its own data without touching other slices.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Join shuffle / broadcast&lt;/strong&gt; — when join keys aren't co-located, slices ship rows across the network.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk + memory&lt;/strong&gt; — local SSD for cold blocks, RAM for hot blocks and intermediate join state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 10-node &lt;code&gt;ra3.xlplus&lt;/code&gt; cluster (each node has 4 slices, 32 GB RAM, 4 vCPUs).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;count&lt;/th&gt;
&lt;th&gt;total capacity&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;nodes&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;10× node resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;slices&lt;/td&gt;
&lt;td&gt;40 (10 × 4)&lt;/td&gt;
&lt;td&gt;40 parallel workers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RAM&lt;/td&gt;
&lt;td&gt;320 GB total&lt;/td&gt;
&lt;td&gt;shared across slices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vCPU&lt;/td&gt;
&lt;td&gt;40 cores&lt;/td&gt;
&lt;td&gt;one per slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;disk&lt;/td&gt;
&lt;td&gt;~16 TB managed storage&lt;/td&gt;
&lt;td&gt;columnar blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The cluster is provisioned with 10 &lt;code&gt;ra3.xlplus&lt;/code&gt; nodes — each with 4 slices, 32GB RAM, 4 vCPUs.&lt;/li&gt;
&lt;li&gt;Data loaded via &lt;code&gt;COPY&lt;/code&gt; is partitioned across the 40 slices per the table's &lt;code&gt;DISTSTYLE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;When the leader dispatches a query, each of the 40 slices receives its sub-plan and starts scanning.&lt;/li&gt;
&lt;li&gt;Each slice scans its local columnar blocks, applies zone-map pruning, filters rows, computes partial joins/aggregates entirely within its own RAM.&lt;/li&gt;
&lt;li&gt;If the query requires a shuffle (e.g., a join on a non-&lt;code&gt;DISTKEY&lt;/code&gt; column), slices exchange rows via the cluster network — this is the slowest part of any non-co-located join and the #1 thing &lt;code&gt;DISTKEY&lt;/code&gt; choices are designed to avoid.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Check per-slice load skew with the system view&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rows_held&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows_pre_filter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rows_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stv_blocklist&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;tbl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;oid&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stv_tbl_perm&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'orders'&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;slice&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;rows_held&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if &lt;code&gt;STV_BLOCKLIST&lt;/code&gt; shows a slice holding 5× more rows than the others, you have skew — re-evaluate the &lt;code&gt;DISTKEY&lt;/code&gt; choice or switch to &lt;code&gt;EVEN&lt;/code&gt; for that table.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;INSERT INTO ... VALUES (...)&lt;/code&gt; for bulk loads — slow, fragments storage, defeats compression.&lt;/li&gt;
&lt;li&gt;Loading from a single huge S3 file — only one slice can pull it; the other 39 sit idle.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;IAM_ROLE&lt;/code&gt; and using static credentials — security risk and key rotation pain.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;COMPUPDATE ON&lt;/code&gt; on the first load — Redshift falls back to &lt;code&gt;RAW&lt;/code&gt; encoding, losing 10-30× compression.&lt;/li&gt;
&lt;li&gt;Treating the leader as a worker — if you see leader-node CPU spikes, you have plan/dispatch overhead, not data work; rewrite to reduce result-set size.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Amazon Redshift Interview Question on &lt;code&gt;COPY&lt;/code&gt; and Execution Flow
&lt;/h3&gt;

&lt;p&gt;A daily ETL job dumps 50GB of order CSVs into S3. &lt;strong&gt;Walk through how you would (a) load that data into Redshift with the right COPY command, and (b) explain what happens when an analyst runs &lt;code&gt;SELECT product_category, SUM(revenue) FROM sales GROUP BY product_category&lt;/code&gt; against the loaded data.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using parallel &lt;code&gt;COPY&lt;/code&gt; + leader/compute query flow
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PART (a) — Loading via COPY

Step 1: Split the 50GB into 40 files of ~1.25GB each upstream
        (matches 10-node × 4-slice cluster topology).
        s3://lake/orders/ingest_date=2026-05-11/part-00001.csv ... part-00040.csv

Step 2: Run COPY in a single command — parallelism happens automatically.
        COPY orders
        FROM 's3://lake/orders/ingest_date=2026-05-11/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
        FORMAT AS CSV
        IGNOREHEADER 1
        DATEFORMAT 'YYYY-MM-DD'
        COMPUPDATE ON
        STATUPDATE ON;

Step 3: Verify the load.
        SELECT COUNT(*) FROM orders WHERE ingest_date = '2026-05-11';
        SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;

PART (b) — Query execution flow

Step 1: Client (psql, Tableau, QuickSight) sends SQL to the leader node.
Step 2: Leader parses, validates against the catalog, generates a parallel plan.
        - Realizes there's no WHERE → full table scan
        - Realizes GROUP BY product_category → hash aggregate per slice + final reduce
Step 3: Leader compiles per-slice C++ code (cached if seen before) and dispatches to 40 slices.
Step 4: Each slice scans its local data (columnar, only revenue + product_category blocks).
        - Local hash aggregate: { 'electronics' → 8421, 'apparel' → 5132, ... }
Step 5: Each slice ships its partial group map to the leader (kilobytes per slice).
Step 6: Leader merges 40 partial maps into the final result by summing per category.
Step 7: Final result (one row per product_category) returned to the client.

Total wall-clock time: ~1-3 seconds for a 1B-row table.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; part (a) maxes out S3 throughput by giving every slice its own file to pull, then uses &lt;code&gt;COMPUPDATE ON&lt;/code&gt; to let Redshift auto-pick compression encodings on the first load. Part (b) demonstrates fluency with the leader/compute split: leader plans + dispatches + aggregates, compute nodes scan + filter + locally aggregate. Each slice runs its own local hash aggregate so only the per-category partials (kilobytes) cross the network, not the raw scan output (gigabytes).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for a 1B-row &lt;code&gt;sales&lt;/code&gt; table with 50 product categories:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;location&lt;/th&gt;
&lt;th&gt;output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;client&lt;/td&gt;
&lt;td&gt;SQL string → leader (TCP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;parsed plan; hash aggregate sub-plan per slice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;compiled C++ shipped to 40 slices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;each slice&lt;/td&gt;
&lt;td&gt;scan 25M local rows; build local map of 50 categories&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;each slice → leader&lt;/td&gt;
&lt;td&gt;40 partial maps (50 entries each, ~kilobytes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;leader&lt;/td&gt;
&lt;td&gt;sum partials per category into 50 final rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;leader → client&lt;/td&gt;
&lt;td&gt;final 50-row result&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;product_category&lt;/th&gt;
&lt;th&gt;total_revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;electronics&lt;/td&gt;
&lt;td&gt;4,128,931&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;apparel&lt;/td&gt;
&lt;td&gt;2,580,420&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;home&lt;/td&gt;
&lt;td&gt;1,945,210&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;td&gt;...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;50 categories total.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;40 files for 40 slices&lt;/strong&gt; — every slice grabs its own file from S3 in parallel; load time is bounded by the slowest slice's pull + parse, not the total file count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IAM_ROLE&lt;/code&gt; instead of credentials&lt;/strong&gt; — the cluster assumes the role at load time; no static keys to rotate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COMPUPDATE ON&lt;/code&gt; on first load&lt;/strong&gt; — Redshift samples each column and picks the best encoding (BYTEDICT, ZSTD, RUNLENGTH) — 10-30× compression instead of 1×.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leader = orchestrator&lt;/strong&gt; — parses, plans, dispatches, aggregates; never scans data itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local hash aggregate per slice&lt;/strong&gt; — each slice produces a 50-row partial map; only kilobytes cross the network instead of gigabytes of raw rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|sales| / N)&lt;/code&gt; time&lt;/strong&gt; — single linear scan parallelized across 40 slices; aggregation is bounded by the number of categories (50), which fits in RAM trivially.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for the &lt;code&gt;GROUP BY&lt;/code&gt; + parallel-aggregate pattern and the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for the S3 → warehouse load shape.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Amazon Redshift Spectrum, &lt;code&gt;VACUUM&lt;/code&gt;, and &lt;code&gt;ANALYZE&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Querying S3 directly + keeping the cluster fast over time
&lt;/h3&gt;

&lt;p&gt;"How would you query a 50TB clickstream that sits in S3 without loading it into Redshift first?" is the signature Spectrum question — and the operational follow-up is always "how do you keep the cluster fast as data changes?" The mental model: &lt;strong&gt;Redshift Spectrum lets you register S3-resident Parquet/ORC/CSV files as external tables and query them with the same SQL you use for managed tables; the data never moves into Redshift; &lt;code&gt;VACUUM&lt;/code&gt; reorganizes deleted-row gaps and re-sorts blocks to restore zone-map effectiveness; &lt;code&gt;ANALYZE&lt;/code&gt; refreshes planner statistics so the optimizer picks the right join order and algorithm&lt;/strong&gt;. Spectrum extends Redshift into a lakehouse; VACUUM/ANALYZE keep the managed half healthy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyj3cvkwrvewxryha9vp.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flyj3cvkwrvewxryha9vp.webp" alt="Two-panel Redshift maintenance diagram: left shows Redshift Spectrum querying an external table that lives in S3 (with Glue/AWS catalog metadata) alongside a managed orders table — the same SELECT joins both — illustrating the lakehouse pattern; right shows a VACUUM + ANALYZE maintenance cycle with a fragmented block layout being compacted and re-sorted, plus a statistics gauge being refreshed; PipeCode purple and AWS orange brand accents." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Spectrum's per-query cost is &lt;code&gt;$5 per TB scanned&lt;/code&gt; — same as Athena. So a 10TB unpartitioned full scan costs $50 every time. Always create external tables with the same partition column you'd use in the &lt;code&gt;WHERE&lt;/code&gt; clause (date, region) so the planner can prune partitions just like managed tables — turning a $50 query into a $0.50 query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Redshift Spectrum — query S3 data directly without loading
&lt;/h4&gt;

&lt;p&gt;The Spectrum invariant: &lt;strong&gt;&lt;code&gt;CREATE EXTERNAL TABLE&lt;/code&gt; registers an S3 prefix as a table in an external schema (backed by AWS Glue Data Catalog); subsequent &lt;code&gt;SELECT&lt;/code&gt; statements pull rows directly from S3 at query time, with the work distributed across Spectrum's serverless fleet; managed Redshift tables and external Spectrum tables can be joined freely in one SQL statement&lt;/strong&gt;. The data never moves into the cluster — perfect for cold, large, or schema-evolving datasets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;External schema&lt;/strong&gt; — &lt;code&gt;CREATE EXTERNAL SCHEMA lake FROM DATA CATALOG DATABASE '...' IAM_ROLE '...'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External table&lt;/strong&gt; — &lt;code&gt;CREATE EXTERNAL TABLE lake.clickstream (...) PARTITIONED BY (event_date date) STORED AS PARQUET LOCATION 's3://...'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query syntax&lt;/strong&gt; — same &lt;code&gt;SELECT&lt;/code&gt;, joins managed + external tables freely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost model&lt;/strong&gt; — &lt;code&gt;$5/TB scanned&lt;/code&gt;; partition + column projection determine the bill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 50TB partitioned clickstream in S3 queried by a Redshift Spectrum external table.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S3 location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://feature-lake/clickstream/year=YYYY/month=MM/day=DD/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;file format&lt;/td&gt;
&lt;td&gt;Parquet (already columnar; Spectrum's preferred format)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total size&lt;/td&gt;
&lt;td&gt;50 TB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;query&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE day = '2026-05-11'&lt;/code&gt; + &lt;code&gt;SELECT user_id, event_type&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;effective scan&lt;/td&gt;
&lt;td&gt;1 TB (day partition × 2-column projection)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;spectrum cost&lt;/td&gt;
&lt;td&gt;1 TB × $5/TB = $5 per query&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The clickstream data lives in S3 as partitioned Parquet — directories like &lt;code&gt;year=2026/month=05/day=11/*.parquet&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;CREATE EXTERNAL TABLE&lt;/code&gt; registers the table in the Redshift catalog without copying any data — pure metadata.&lt;/li&gt;
&lt;li&gt;When you query &lt;code&gt;WHERE day = '2026-05-11'&lt;/code&gt;, the planner prunes to the matching day partition — a small fraction of the 50TB — and Spectrum reads only those files from S3.&lt;/li&gt;
&lt;li&gt;Spectrum scans the Parquet files using a serverless fleet that runs in parallel — independent of your Redshift cluster's compute capacity.&lt;/li&gt;
&lt;li&gt;Spectrum returns the filtered/projected rows to the Redshift cluster, which can then join them with managed tables in the same query. Net cost: 1 TB scanned × $5 = $5 per query, far cheaper than ingesting the whole 50TB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Register the external schema (one-time setup)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;DATA&lt;/span&gt; &lt;span class="k"&gt;CATALOG&lt;/span&gt;
&lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="s1"&gt;'feature_lake'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::123456789012:role/RedshiftSpectrum'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Register the external table (one-time setup)&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clickstream&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_type&lt;/span&gt; &lt;span class="nb"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;event_ts&lt;/span&gt;   &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://feature-lake/clickstream/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Query Spectrum + managed tables together&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clickstream&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-11'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
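&lt;p&gt;One operational detail the DDL above leaves implicit: for a &lt;code&gt;PARTITIONED BY&lt;/code&gt; external table, each partition must be registered in the catalog before Spectrum can prune to it — either by a Glue crawler or manually. A minimal sketch, assuming the same hypothetical &lt;code&gt;s3://feature-lake&lt;/code&gt; layout:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Register one day's partition (repeat per day, or let a Glue crawler do it)
ALTER TABLE lake.clickstream
ADD IF NOT EXISTS PARTITION (event_date = '2026-05-11')
LOCATION 's3://feature-lake/clickstream/year=2026/month=05/day=11/';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Skip this step and the &lt;code&gt;WHERE event_date = ...&lt;/code&gt; filter has no registered partition to prune to — the query simply returns nothing for that day.&lt;/p&gt;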



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; Spectrum is the right answer for cold/historical data, schema-evolving sources, and "we don't want to pay to load 50TB". For hot data with frequent joins, managed Redshift tables with &lt;code&gt;DISTKEY&lt;/code&gt; co-location are still faster.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;VACUUM&lt;/code&gt; — reclaim deleted-row space and re-sort blocks
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;VACUUM&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; in Redshift don't physically remove data — they mark rows as deleted; over time, blocks fragment with tombstone rows and become unsorted with respect to the &lt;code&gt;SORTKEY&lt;/code&gt;; &lt;code&gt;VACUUM&lt;/code&gt; reclaims the deleted-row space and re-sorts the data, restoring zone-map effectiveness and reducing storage&lt;/strong&gt;. It's the equivalent of compaction in other systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM FULL&lt;/code&gt;&lt;/strong&gt; — reclaims deleted space and re-sorts; the default and most common.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM SORT ONLY&lt;/code&gt;&lt;/strong&gt; — re-sorts without reclaiming space; useful when rows landed out of sort order but few were deleted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM DELETE ONLY&lt;/code&gt;&lt;/strong&gt; — reclaims deleted space without re-sorting; useful for delete/update-heavy tables whose sort order is still intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VACUUM REINDEX&lt;/code&gt;&lt;/strong&gt; — additionally rebuilds the interleaved sort key (rare; only for &lt;code&gt;INTERLEAVED SORTKEY&lt;/code&gt; tables).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 100M-row orders table with 5% of rows tombstoned + 10% out-of-sort.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;before VACUUM&lt;/th&gt;
&lt;th&gt;after VACUUM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;total rows on disk&lt;/td&gt;
&lt;td&gt;105M (5M tombstones)&lt;/td&gt;
&lt;td&gt;100M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;disk space&lt;/td&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;td&gt;11 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sorted region&lt;/td&gt;
&lt;td&gt;90%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;zone-map pruning ratio&lt;/td&gt;
&lt;td&gt;~7×&lt;/td&gt;
&lt;td&gt;~10×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;date-filter query time&lt;/td&gt;
&lt;td&gt;4 s&lt;/td&gt;
&lt;td&gt;2.5 s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Over weeks, &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; statements mark 5M rows as tombstoned — they still occupy disk but are skipped at read time.&lt;/li&gt;
&lt;li&gt;Late-arriving &lt;code&gt;INSERT&lt;/code&gt; rows land out of sort order — the table is now 90% sorted, 10% unsorted.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VACUUM&lt;/code&gt; (&lt;code&gt;FULL&lt;/code&gt; by default) scans the table, drops the tombstoned rows physically, and re-sorts the remaining rows by the &lt;code&gt;SORTKEY&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After &lt;code&gt;VACUUM&lt;/code&gt;, the table is 100M rows, 11 GB on disk, 100% sorted — zone maps work optimally again.&lt;/li&gt;
&lt;li&gt;The same &lt;code&gt;WHERE order_date BETWEEN ...&lt;/code&gt; query that took 4s pre-VACUUM now takes 2.5s — partly because there's less data to scan, partly because zone-map pruning is more effective on fully-sorted data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Default: reclaim space + re-sort&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Faster variant when sort order is the issue but space isn't&lt;/span&gt;
&lt;span class="k"&gt;VACUUM&lt;/span&gt; &lt;span class="n"&gt;SORT&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Check which tables need vacuuming&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;tbl_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;unsorted&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;stats_off&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SVV_TABLE_INFO&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;unsorted&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
   &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;stats_off&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; schedule a &lt;code&gt;VACUUM&lt;/code&gt; weekly for write-heavy tables (events, orders); monthly is fine for slowly-changing dimensions. Auto-VACUUM is on by default in modern Redshift but is conservative; manual VACUUMs after big backfills are still recommended.&lt;/p&gt;
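&lt;p&gt;Two operational helpers worth knowing here — hedged sketches, assuming the same &lt;code&gt;orders&lt;/code&gt; table: Redshift exposes a live progress view for running vacuums, and &lt;code&gt;VACUUM&lt;/code&gt; accepts a sort threshold so you can trade completeness for runtime after a big backfill:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Watch a running vacuum
SELECT table_name, status, time_remaining_estimate
FROM svv_vacuum_progress;

-- Vacuum until the table is at least 99% sorted (default threshold is 95%)
VACUUM FULL orders TO 99 PERCENT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;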

&lt;h4&gt;
  
  
  &lt;code&gt;ANALYZE&lt;/code&gt; — refresh planner statistics for better query plans
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ANALYZE&lt;/code&gt; invariant: &lt;strong&gt;Redshift's query optimizer depends on statistics (row counts, distinct value counts, min/max per column) to pick the right join algorithm and join order; &lt;code&gt;ANALYZE&lt;/code&gt; refreshes those statistics by sampling the table&lt;/strong&gt;. Stale statistics cause the planner to pick bad join orders — a 30-second query can become a 30-minute query overnight.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANALYZE tablename&lt;/code&gt;&lt;/strong&gt; — sample the table and refresh stats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANALYZE tablename (col1, col2)&lt;/code&gt;&lt;/strong&gt; — refresh only specific columns (faster).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-ANALYZE&lt;/strong&gt; — runs automatically when the planner thinks stats are stale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;STATUPDATE ON&lt;/code&gt; on &lt;code&gt;COPY&lt;/code&gt;&lt;/strong&gt; — refreshes stats automatically after every COPY (recommended).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 1B-row table where stats are 7 days stale.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;stale stats&lt;/th&gt;
&lt;th&gt;refreshed stats&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;planner-estimated &lt;code&gt;orders.customer_id&lt;/code&gt; cardinality&lt;/td&gt;
&lt;td&gt;100K&lt;/td&gt;
&lt;td&gt;50M (actual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chosen join algorithm&lt;/td&gt;
&lt;td&gt;nested loop&lt;/td&gt;
&lt;td&gt;hash join&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;join time&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;td&gt;30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;post-ANALYZE plan correctness&lt;/td&gt;
&lt;td&gt;wrong&lt;/td&gt;
&lt;td&gt;correct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A week ago, &lt;code&gt;orders&lt;/code&gt; had 100K rows with 1K distinct customers; today it has 1B rows with 50M distinct customers.&lt;/li&gt;
&lt;li&gt;Without &lt;code&gt;ANALYZE&lt;/code&gt;, the planner still believes the old stats: "only 100K rows, 1K customers".&lt;/li&gt;
&lt;li&gt;The planner picks a nested-loop join (which is optimal for tiny tables) and ships the wrong execution plan.&lt;/li&gt;
&lt;li&gt;The nested loop takes 30 minutes on the actual 1B rows; the right plan (hash join) would take 30 seconds.&lt;/li&gt;
&lt;li&gt;After &lt;code&gt;ANALYZE orders&lt;/code&gt;, the planner sees the true cardinalities, picks the hash join, and the same query runs in 30 seconds.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Refresh stats for a single table&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Refresh stats only on join columns (faster for huge tables)&lt;/span&gt;
&lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Check which tables have stale stats&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;stats_off&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SVV_TABLE_INFO&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;stats_off&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;stats_off&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always set &lt;code&gt;STATUPDATE ON&lt;/code&gt; on &lt;code&gt;COPY&lt;/code&gt; so stats auto-refresh after every load. Run a manual &lt;code&gt;ANALYZE&lt;/code&gt; after any backfill or major schema change.&lt;/p&gt;
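&lt;p&gt;Between a full-table &lt;code&gt;ANALYZE&lt;/code&gt; and an explicit column list there is a middle ground worth naming in interviews — &lt;code&gt;PREDICATE COLUMNS&lt;/code&gt;, which restricts stats collection to columns that have actually appeared in predicates. A small sketch on the same hypothetical &lt;code&gt;orders&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Refresh stats only for columns that have been used in WHERE / JOIN / GROUP BY so far
ANALYZE orders PREDICATE COLUMNS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;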

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using Spectrum for hot data with frequent joins — Spectrum's per-query cost adds up; managed tables with &lt;code&gt;DISTKEY&lt;/code&gt; co-location are faster and cheaper at scale.&lt;/li&gt;
&lt;li&gt;Forgetting to partition Spectrum external tables by date — a full-table scan on 50TB costs $250 per query.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;VACUUM&lt;/code&gt; for months on write-heavy tables — zone maps degrade, query times double or triple.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;ANALYZE&lt;/code&gt; after a big backfill — the planner picks wrong join algorithms, queries get 100× slower silently.&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;VACUUM&lt;/code&gt; as a no-op because "auto-VACUUM runs" — auto-VACUUM is conservative; manual VACUUMs after backfills are still required.&lt;/li&gt;
&lt;/ul&gt;
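&lt;p&gt;The partitioning mistake is cheap to catch before it costs money — a quick check (assuming the &lt;code&gt;lake.clickstream&lt;/code&gt; table from earlier) that partitions are actually registered, so a date filter has something to prune against:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- List the partitions Spectrum knows about for the external table
SELECT schemaname, tablename, values, location
FROM svv_external_partitions
WHERE tablename = 'clickstream';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;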

&lt;h3&gt;
  
  
  Amazon Redshift Interview Question on Spectrum vs Load Decision
&lt;/h3&gt;

&lt;p&gt;A retail company has a 50TB clickstream sitting in S3 (partitioned Parquet, by date), plus a 5GB curated &lt;code&gt;orders&lt;/code&gt; fact already in Redshift. &lt;strong&gt;Should they load the clickstream into Redshift via &lt;code&gt;COPY&lt;/code&gt;, or query it via Spectrum? Walk through the decision.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution: Spectrum for clickstream + managed table for orders + maintenance plan
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DECISION: Query clickstream via Spectrum; keep orders managed in Redshift.

REASONING:

1. CLICKSTREAM (50 TB, mostly cold, infrequent queries)
   - Loading 50 TB via COPY costs significant ingestion time + ~$1,500/month storage in Redshift.
   - Spectrum reads from S3 directly; pay only $5/TB scanned per query.
   - Most clickstream queries filter by date — partition pruning reads ~1 TB at a time.
   - Net Spectrum scan cost: ~$5 per query × ~10 queries/day = ~$50/day ≈ $1,500/month —
     roughly break-even with managed storage, but with no 50 TB load step and storage staying in cheap S3.

2. ORDERS (5 GB, hot, frequent joins, BI dashboards)
   - Small footprint; full managed table cost is trivial.
   - Heavy join usage with users + products; DISTKEY co-location wins.
   - Sub-second response time expected by BI tools.
   - Managed table with DISTKEY(customer_id) + SORTKEY(order_date) is the right call.

3. CROSS-LAYER JOIN — Spectrum + managed in one SQL
   SELECT u.region, COUNT(*) AS clicks, SUM(o.amount) AS revenue
   FROM lake.clickstream c
   JOIN users    u ON u.id = c.user_id
   JOIN orders   o ON o.user_id = c.user_id
   WHERE c.event_date = '2026-05-11'
   GROUP BY u.region;

4. MAINTENANCE PLAN
   - Spectrum: no VACUUM/ANALYZE needed (external; data managed by S3 + Glue).
   - orders: VACUUM weekly (CDC writes fragment storage), ANALYZE after every backfill.
   - Auto-VACUUM/ANALYZE on by default; manual runs after big migrations.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the decision turns on three axes — data size, access frequency, and query pattern. 50TB cold data with date-pruned access is exactly Spectrum's sweet spot; 5GB hot data with frequent joins is exactly managed-table territory. The cross-layer join lets you query both in one SQL statement, which is the entire point of the lakehouse-meets-warehouse pattern. The maintenance plan is non-negotiable: VACUUM keeps zone maps effective; ANALYZE keeps the planner honest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the decision walkthrough:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;What's the data size?&lt;/td&gt;
&lt;td&gt;50TB clickstream (huge), 5GB orders (small)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Is access frequent (multiple queries/day)?&lt;/td&gt;
&lt;td&gt;clickstream: occasional; orders: BI dashboards every minute&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Are queries date-prunable?&lt;/td&gt;
&lt;td&gt;clickstream: yes (one date partition per query); orders: yes (date range)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Cost of loading clickstream into Redshift&lt;/td&gt;
&lt;td&gt;~$1,500/month storage + COPY time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Cost of querying clickstream via Spectrum&lt;/td&gt;
&lt;td&gt;~$5/query × ~10/day = ~$50/day = ~$1,500/month — break-even, but Spectrum scales better&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Final decision&lt;/td&gt;
&lt;td&gt;Spectrum for clickstream (cold, prunable); managed for orders (hot, joined)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended architecture summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;technology&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;th&gt;maintenance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;clickstream (50 TB)&lt;/td&gt;
&lt;td&gt;Spectrum external table on S3 Parquet&lt;/td&gt;
&lt;td&gt;cold, date-pruned analytic queries&lt;/td&gt;
&lt;td&gt;none (external)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;orders (5 GB)&lt;/td&gt;
&lt;td&gt;managed Redshift table&lt;/td&gt;
&lt;td&gt;hot, BI dashboards, joins&lt;/td&gt;
&lt;td&gt;weekly VACUUM, ANALYZE after backfills&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cross-layer joins&lt;/td&gt;
&lt;td&gt;one SQL statement&lt;/td&gt;
&lt;td&gt;flexible analytics&lt;/td&gt;
&lt;td&gt;leader handles plan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spectrum for cold + huge + date-pruned&lt;/strong&gt; — pay only for what you scan; partition pruning means a typical query reads 1/365 of the data; storage stays in cheap S3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed for hot + small + joined&lt;/strong&gt; — &lt;code&gt;DISTKEY&lt;/code&gt; co-location + &lt;code&gt;SORTKEY&lt;/code&gt; pruning + in-cluster execution beats Spectrum on per-query latency for sub-100GB tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-layer joins in one SQL&lt;/strong&gt; — Redshift's leader plans the join across managed + external sources; Spectrum returns filtered/projected rows that participate in the hash join with managed tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VACUUM weekly&lt;/strong&gt; — CDC writes fragment storage; weekly VACUUM keeps zone maps effective and reclaims ~5-10% of disk over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANALYZE after backfills&lt;/strong&gt; — stale stats cause wrong join orders; auto-ANALYZE catches most cases but manual runs after big migrations are non-negotiable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|scanned|)&lt;/code&gt; cost per Spectrum query&lt;/strong&gt; — bounded by partition pruning + column projection; bad practice (full scan, all columns) costs 100× more than disciplined practice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice problems&lt;/a&gt; for the S3-to-warehouse load + Spectrum pattern, and the &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE practice page&lt;/a&gt; for multi-step analytical query composition.&lt;/p&gt;





&lt;h2&gt;
  
  
  Tips to crack Amazon Redshift interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Master the four primitives — columnar+MPP, distribution+sort keys, COPY+architecture, Spectrum+maintenance
&lt;/h3&gt;

&lt;p&gt;If you can explain why columnar storage + MPP beats row-store Postgres for analytics, choose the right &lt;code&gt;DISTKEY&lt;/code&gt; and &lt;code&gt;SORTKEY&lt;/code&gt; for a 10TB fact, walk through what happens on the leader and compute nodes during a &lt;code&gt;COPY&lt;/code&gt; and a &lt;code&gt;SELECT&lt;/code&gt;, and decide when to use Spectrum vs a managed table — you can answer roughly 80% of the Redshift questions that show up in a fresher or mid-level data-engineering loop. The remaining 20% is dialect-specific SQL fluency and AWS-specific operational trivia (RA3 vs DS2 node types, Concurrency Scaling, RA3 managed storage).&lt;/p&gt;

&lt;h3&gt;
  
  
  Always name OLTP vs OLAP in the first sentence
&lt;/h3&gt;

&lt;p&gt;The opening Redshift question is almost always "when would you use Redshift over Postgres?" The right first sentence names the OLTP vs OLAP split: Postgres for high-frequency small writes (row-store, ACID multi-row commits), Redshift for big analytical scans (column-store, MPP, columnar compression). Senior interviewers grade this framing specifically.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;DISTKEY&lt;/code&gt; co-location is the single biggest join optimization
&lt;/h3&gt;

&lt;p&gt;Whenever two tables join on the same column, use the same &lt;code&gt;DISTKEY&lt;/code&gt;. The matching rows land on the same compute node, the join runs locally, and you skip the network shuffle that would otherwise dominate the query time. State this principle out loud — it signals you understand the architecture, not just the syntax.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pick &lt;code&gt;SORTKEY&lt;/code&gt; for the dominant &lt;code&gt;WHERE&lt;/code&gt; predicate
&lt;/h3&gt;

&lt;p&gt;Date-range filters are the most common WHERE predicate in analytical workloads — that's why &lt;code&gt;SORTKEY (order_date)&lt;/code&gt; (or &lt;code&gt;event_ts&lt;/code&gt;) is the standard choice for fact tables. Zone-map pruning skips ~90%+ of blocks for monthly queries. Pick the wrong sort key and every query scans the whole table.&lt;/p&gt;
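&lt;p&gt;A minimal sketch combining this tip and the previous one — hypothetical &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; tables co-located on the join key, with the fact sorted by date:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers (
    id     BIGINT PRIMARY KEY,
    region VARCHAR(32)
)
DISTSTYLE KEY DISTKEY (id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      DECIMAL(12,2)
)
DISTSTYLE KEY DISTKEY (customer_id)
SORTKEY (order_date);

-- Join runs node-locally (shared DISTKEY); the date filter prunes blocks via zone maps
SELECT c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.order_date BETWEEN '2026-04-01' AND '2026-04-30'
GROUP BY c.region;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;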

&lt;h3&gt;
  
  
  Use &lt;code&gt;COPY&lt;/code&gt; not &lt;code&gt;INSERT&lt;/code&gt; for bulk loads — and split source files for parallelism
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;INSERT INTO ... VALUES&lt;/code&gt; is an anti-pattern in Redshift — it bypasses the columnar storage layout and writes uncompressed blocks. &lt;code&gt;COPY&lt;/code&gt; from S3 is 10-100× faster. Split source files into &lt;code&gt;N × num_slices&lt;/code&gt; chunks (e.g., 40 files for a 10-node × 4-slice cluster) so every slice can pull a file in parallel.&lt;/p&gt;

&lt;h3&gt;
  
  
  Spectrum for cold + huge + date-pruned; managed for hot + small + joined
&lt;/h3&gt;

&lt;p&gt;Spectrum is the right answer for cold, huge, infrequently-queried data that's already in S3 (logs, clickstream, history). Managed Redshift tables with &lt;code&gt;DISTKEY&lt;/code&gt; co-location win for hot, frequently-joined, sub-100GB data. The lakehouse pattern is "both, in one SQL".&lt;/p&gt;

&lt;h3&gt;
  
  
  Schedule &lt;code&gt;VACUUM&lt;/code&gt; weekly and &lt;code&gt;ANALYZE&lt;/code&gt; after every backfill
&lt;/h3&gt;

&lt;p&gt;Without &lt;code&gt;VACUUM&lt;/code&gt;, deleted-row tombstones accumulate and zone maps degrade — queries get 2-3× slower over time. Without &lt;code&gt;ANALYZE&lt;/code&gt;, the planner picks wrong join algorithms on changing data — queries can go 100× slower silently. Both are non-negotiable for any production cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice surface&lt;/a&gt; for PostgreSQL-dialect (Redshift-compatible) SQL. Drill the four Redshift-relevant topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;aggregations&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt;/&lt;code&gt;HAVING&lt;/code&gt;/window aggregates, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;joins&lt;/a&gt; for the join shapes that benefit from &lt;code&gt;DISTKEY&lt;/code&gt; co-location, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;window functions&lt;/a&gt; for ranking + lookback queries, &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;CTE&lt;/a&gt; for multi-step analytical pipelines. Add &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice&lt;/a&gt; for the S3-to-warehouse load pattern and &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling&lt;/a&gt; for star-schema design (the typical Redshift gold-layer shape). For broader coverage, read the related &lt;a href="https://pipecode.ai/blogs/data-lake-architecture-data-engineering-interviews" rel="noopener noreferrer"&gt;data lake architecture for data engineering interviews&lt;/a&gt; and &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; blogs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is Amazon Redshift?
&lt;/h3&gt;

&lt;p&gt;Amazon Redshift is AWS's fully-managed cloud data warehouse service, built for analytical workloads on structured data — typically terabytes to petabytes of business data. It uses &lt;strong&gt;columnar storage&lt;/strong&gt;, &lt;strong&gt;massively parallel processing (MPP)&lt;/strong&gt; across multiple compute nodes, and per-column compression to make &lt;code&gt;SUM&lt;/code&gt;/&lt;code&gt;AVG&lt;/code&gt;/&lt;code&gt;COUNT&lt;/code&gt; queries across billions of rows complete in seconds. The SQL dialect is PostgreSQL-compatible, so most existing SQL skills transfer directly.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between Amazon Redshift and PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is an &lt;strong&gt;OLTP&lt;/strong&gt; (online transaction processing) database — optimized for many small writes (insert/update/delete one row at a time) with row-oriented storage and ACID transactions across rows. Redshift is an &lt;strong&gt;OLAP&lt;/strong&gt; (online analytical processing) warehouse — optimized for scanning huge amounts of data with column-oriented storage, MPP, and compression. Use PostgreSQL for application backends; use Redshift for analytics and BI on top of them. The SQL dialects are similar (Redshift began as a fork of PostgreSQL), but the storage engine and execution model are completely different.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is columnar storage and why is it faster for analytics?
&lt;/h3&gt;

&lt;p&gt;Columnar storage means each column's values are stored physically next to each other on disk (instead of each row's values being stored together as in a row-oriented database). An analytical query like &lt;code&gt;SELECT SUM(amount) FROM orders&lt;/code&gt; reads only the &lt;code&gt;amount&lt;/code&gt; column block and skips every other column — typically a 10-50× I/O reduction. Combined with per-column compression (often 10-30× smaller than uncompressed row format) and zone-map pruning, columnar storage is the foundation of fast analytical queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does the COPY command do?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;COPY&lt;/code&gt; is Redshift's bulk-load command — it ingests data from S3 (or other AWS sources) into a Redshift table in parallel across all compute nodes. A typical command looks like &lt;code&gt;COPY orders FROM 's3://bucket/orders/' IAM_ROLE 'arn:...' FORMAT AS CSV COMPUPDATE ON&lt;/code&gt;. It's 10-100× faster than &lt;code&gt;INSERT INTO ... VALUES (...)&lt;/code&gt; for bulk loads and is the only acceptable bulk-ingestion method at production scale. Split source files into &lt;code&gt;N × num_slices&lt;/code&gt; chunks so every cluster slice can pull a file in parallel.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a distribution key (DISTKEY) and when should I use one?
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;DISTKEY&lt;/code&gt; controls how table rows are partitioned across compute nodes. With &lt;code&gt;DISTSTYLE KEY (customer_id)&lt;/code&gt;, Redshift hashes each row's &lt;code&gt;customer_id&lt;/code&gt; and sends matching values to the same node. The big payoff is &lt;strong&gt;join co-location&lt;/strong&gt;: if two tables share the same &lt;code&gt;DISTKEY&lt;/code&gt; on their join column, the join runs locally on each node with no network shuffle — turning a multi-terabyte shuffle into a local hash join. Use &lt;code&gt;DISTKEY&lt;/code&gt; whenever the table is frequently joined on a single key and that key has reasonably uniform value distribution (avoid columns with hot values).&lt;/p&gt;

&lt;h3&gt;
  
  
  What are sort keys (SORTKEY) and how do they help?
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;SORTKEY&lt;/code&gt; defines the physical order of rows within each compute node. Redshift maintains per-block min/max metadata (zone maps); a &lt;code&gt;WHERE&lt;/code&gt; predicate that matches a contiguous range of the sort key prunes ~99% of blocks without reading them. The most common choice is &lt;code&gt;SORTKEY (order_date)&lt;/code&gt; for fact tables, because date-range filters are the dominant analytic predicate. Compound sort keys (&lt;code&gt;SORTKEY (col_a, col_b)&lt;/code&gt;) work like a B-tree index for filter predicates; interleaved sort keys are rare and require periodic &lt;code&gt;VACUUM REINDEX&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is VACUUM in Redshift?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;VACUUM&lt;/code&gt; reclaims space from deleted/updated rows and re-sorts data by the &lt;code&gt;SORTKEY&lt;/code&gt;. Redshift's &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; don't physically remove rows — they tombstone them; over time, blocks fragment and become unsorted, which degrades zone-map pruning. &lt;code&gt;VACUUM&lt;/code&gt; (typically &lt;code&gt;VACUUM FULL&lt;/code&gt;) compacts the storage and restores the sort order. Schedule it weekly for write-heavy tables (events, orders); monthly is fine for slowly-changing dimensions. Auto-VACUUM runs in the background but is conservative — manual runs after big backfills are still recommended.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Redshift Spectrum?
&lt;/h3&gt;

&lt;p&gt;Redshift Spectrum lets you query data sitting in S3 directly with SQL — without loading it into the cluster first. You register an &lt;code&gt;EXTERNAL TABLE&lt;/code&gt; (backed by the AWS Glue Data Catalog) pointing at S3 Parquet/ORC/CSV files, and queries scan those files at run time. Spectrum runs on a serverless fleet independent of your Redshift cluster's compute capacity; you pay $5 per TB scanned. It's perfect for cold/historical data, the lakehouse pattern (joining managed Redshift tables with S3 external tables in one SQL), and avoiding the cost of loading 50TB clickstreams just to occasionally query them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing Amazon Redshift problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>PostgreSQL SQL Data Types: Practical Column-Type Guide</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 04:08:12 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/postgresql-sql-data-types-practical-column-type-guide-2l1b</link>
      <guid>https://forem.com/gowthampotureddi/postgresql-sql-data-types-practical-column-type-guide-2l1b</guid>
      <description>&lt;p&gt;Choosing the right &lt;strong&gt;SQL data types&lt;/strong&gt; is one of the quiet decisions that shapes &lt;strong&gt;storage&lt;/strong&gt;, &lt;strong&gt;correctness&lt;/strong&gt;, and &lt;strong&gt;query behavior&lt;/strong&gt; in PostgreSQL. In a tight SQL screen, interviewers often follow up on &lt;strong&gt;why&lt;/strong&gt; you picked a type—not only whether the query returns rows. This guide walks through the main families, common pitfalls (rounding, time zones, type mismatches), and how to reason about casts—using &lt;strong&gt;PostgreSQL&lt;/strong&gt; syntax, the same dialect PipeCode uses for practice.&lt;/p&gt;

&lt;p&gt;If you want &lt;strong&gt;hands-on reps&lt;/strong&gt; after you read, &lt;a href="https://dev.to/explore/practice"&gt;explore practice →&lt;/a&gt;, &lt;a href="https://dev.to/explore/practice/language/sql"&gt;drill SQL problems →&lt;/a&gt;, browse &lt;a href="https://dev.to/explore/practice/topic/sql"&gt;SQL by topic →&lt;/a&gt;, or open &lt;a href="https://dev.to/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang"&gt;Zero to FAANG SQL (full fundamentals) →&lt;/a&gt; for a structured path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exalq0grwf5oxbok422.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9exalq0grwf5oxbok422.jpeg" alt="PipeCode blog header for a PostgreSQL SQL data types guide with bold title text and purple accents on a dark background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;On this page&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why column types matter&lt;/li&gt;
&lt;li&gt;Numeric types&lt;/li&gt;
&lt;li&gt;Text and binary&lt;/li&gt;
&lt;li&gt;Boolean and NULL&lt;/li&gt;
&lt;li&gt;Date and time&lt;/li&gt;
&lt;li&gt;Semi-structured and other types&lt;/li&gt;
&lt;li&gt;Casting and comparison rules&lt;/li&gt;
&lt;li&gt;Choosing types (checklist)&lt;/li&gt;
&lt;li&gt;Frequently asked questions&lt;/li&gt;
&lt;li&gt;Practice on PipeCode&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Why column types matter
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Storage, comparisons, indexes, and the cost of silent coercion
&lt;/h3&gt;

&lt;p&gt;"Why did you pick that type?" is the single most common SQL-screen follow-up — and the cleanest answer is that &lt;strong&gt;a column's type controls four downstream things at once: how the value is laid out on disk, which operators compare it correctly, which indexes the planner can actually use, and when PostgreSQL has to silently coerce data behind your back&lt;/strong&gt;. Get the type right and joins are fast, comparisons are unambiguous, and disk pages are dense. Get it wrong and you ship a schema that &lt;em&gt;runs&lt;/em&gt; but quietly returns the wrong answer or scans 10× more pages than it should.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hdj3do1wwqqijln9dzu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hdj3do1wwqqijln9dzu.jpeg" alt="Diagram linking SQL column types to storage, comparisons, indexes, and implicit casting with PipeCode purple and blue accents on a light card." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you walk an interviewer through a &lt;code&gt;CREATE TABLE&lt;/code&gt;, say the &lt;strong&gt;grain&lt;/strong&gt; and the &lt;strong&gt;type&lt;/strong&gt; in the same breath: &lt;em&gt;"one row per order, &lt;code&gt;order_id&lt;/code&gt; is &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt; is &lt;code&gt;NUMERIC(14,2)&lt;/code&gt;."&lt;/em&gt; That single habit signals to the interviewer that you think about column types as design decisions, not afterthoughts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Storage footprint and on-disk layout
&lt;/h4&gt;

&lt;p&gt;The storage invariant: &lt;strong&gt;fixed-width integer and timestamp types occupy a known number of bytes (4 or 8) and never expand; variable-width types (&lt;code&gt;TEXT&lt;/code&gt;, &lt;code&gt;NUMERIC&lt;/code&gt;, &lt;code&gt;JSONB&lt;/code&gt;) carry a length prefix and grow with the value; choosing a tighter type packs more rows per 8 KB page and improves cache locality on every read&lt;/strong&gt;. A wider type is rarely free — even when the bytes look free, the planner statistics and TOAST thresholds shift.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — 4 bytes, range ±2.1 B; the default for counts and small surrogate keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — 8 bytes; required when row counts cross ~2 B or for user-facing IDs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt;&lt;/strong&gt; — variable (~2 bytes overhead + 2 bytes per 4 digits); cost grows with precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; / &lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; — variable; &lt;strong&gt;no storage penalty&lt;/strong&gt; for &lt;code&gt;TEXT&lt;/code&gt; vs &lt;code&gt;VARCHAR&lt;/code&gt; with the same content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 100 M-row &lt;code&gt;events&lt;/code&gt; table sized two ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;per-row bytes&lt;/th&gt;
&lt;th&gt;total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_id BIGINT, ts TIMESTAMPTZ, user_id BIGINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;~2.4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_id BIGINT, ts TIMESTAMPTZ, user_id TEXT (avg 18 chars)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;24 + 20 = 44&lt;/td&gt;
&lt;td&gt;~4.4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fixed-width row (&lt;code&gt;BIGINT, TIMESTAMPTZ, BIGINT&lt;/code&gt;) is 24 bytes on the heap regardless of values.&lt;/li&gt;
&lt;li&gt;Replacing the integer &lt;code&gt;user_id&lt;/code&gt; with &lt;code&gt;TEXT&lt;/code&gt; for a UUID-shaped string adds a short varlena length header (1-4 bytes) plus the bytes of the text itself.&lt;/li&gt;
&lt;li&gt;With ~100 M rows, the variable-width design adds ~2 GB to the table heap alone, before indexes.&lt;/li&gt;
&lt;li&gt;Fewer of the wider rows fit in each 8 KB page → fewer buffer-cache hits → more I/O per query.&lt;/li&gt;
&lt;li&gt;Net: same data, ~2× the disk and worse cache behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Pick the tightest correct type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;      &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;          &lt;span class="c1"&gt;-- not TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a value is a count or an internal identifier, it is an integer; reach for &lt;code&gt;TEXT&lt;/code&gt; only when the value is a real human-readable string.&lt;/p&gt;

&lt;h4&gt;
  
  
  Equality and comparison semantics
&lt;/h4&gt;

&lt;p&gt;The comparison invariant: &lt;strong&gt;PostgreSQL compares values &lt;em&gt;within&lt;/em&gt; a type cleanly, but mixing types forces an implicit cast that can produce surprises — string &lt;code&gt;'10'&lt;/code&gt; compares lexicographically (&lt;code&gt;'10' &amp;lt; '2'&lt;/code&gt;), numeric &lt;code&gt;10&lt;/code&gt; compares mathematically (&lt;code&gt;10 &amp;gt; 2&lt;/code&gt;), and timestamps compare instant-to-instant only if both sides are &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt;. The right type makes &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;=&lt;/code&gt;, and &lt;code&gt;BETWEEN&lt;/code&gt; behave the way humans expect.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;'10' &amp;lt; '2'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;&lt;/strong&gt; when both are &lt;code&gt;TEXT&lt;/code&gt; — string compare reads left-to-right.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;10 &amp;lt; 2&lt;/code&gt; is &lt;code&gt;FALSE&lt;/code&gt;&lt;/strong&gt; when both are &lt;code&gt;INTEGER&lt;/code&gt; — numeric compare.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; vs &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — PostgreSQL will compare them only after coercing one side; the answer depends on the session time zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collations on &lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'abc' = 'ABC'&lt;/code&gt; is &lt;code&gt;FALSE&lt;/code&gt; with the default &lt;code&gt;C&lt;/code&gt; collation, possibly &lt;code&gt;TRUE&lt;/code&gt; with a case-insensitive collation.&lt;/li&gt;
&lt;/ul&gt;
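&lt;p&gt;A one-line probe makes the flip concrete (illustrative; runs in any PostgreSQL session):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT '10'::text &amp;lt; '2'::text AS text_cmp,  -- TRUE: '1' sorts before '2'
       10 &amp;lt; 2                 AS int_cmp;   -- FALSE: numeric comparison
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;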

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A four-row table where the sort order flips based on whether &lt;code&gt;score&lt;/code&gt; is &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;INTEGER&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;score (as TEXT)&lt;/th&gt;
&lt;th&gt;order&lt;/th&gt;
&lt;th&gt;score (as INT)&lt;/th&gt;
&lt;th&gt;order&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"10"&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"2"&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"100"&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"9"&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stored as &lt;code&gt;TEXT&lt;/code&gt;: &lt;code&gt;ORDER BY score&lt;/code&gt; compares character-by-character; &lt;code&gt;'1'&lt;/code&gt; (0x31) sorts before &lt;code&gt;'9'&lt;/code&gt; (0x39), so &lt;code&gt;'100'&lt;/code&gt; sorts before &lt;code&gt;'2'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Stored as &lt;code&gt;INTEGER&lt;/code&gt;: &lt;code&gt;ORDER BY score&lt;/code&gt; compares the numeric value; &lt;code&gt;2 &amp;lt; 9 &amp;lt; 10 &amp;lt; 100&lt;/code&gt; — the human-expected order.&lt;/li&gt;
&lt;li&gt;The query is &lt;strong&gt;identical&lt;/strong&gt; in both cases; only the &lt;strong&gt;column type&lt;/strong&gt; changed the answer.&lt;/li&gt;
&lt;li&gt;The bug is invisible until someone audits the leaderboard and notices &lt;code&gt;"9"&lt;/code&gt; ranked above &lt;code&gt;"100"&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Always store ordinal-comparable values in a numeric type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;leaderboard&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;player_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt;     &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;player_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;leaderboard&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you ever compare values with &lt;code&gt;&amp;lt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&lt;/code&gt;, or &lt;code&gt;BETWEEN&lt;/code&gt;, the type must support those operators &lt;em&gt;natively&lt;/em&gt; — never rely on string sort for numbers or dates.&lt;/p&gt;

&lt;h4&gt;
  
  
  Index operator classes and planner statistics
&lt;/h4&gt;

&lt;p&gt;The index invariant: &lt;strong&gt;a B-tree index is built against an &lt;em&gt;operator class&lt;/em&gt; tied to a specific type; when a query casts the indexed column to another type, the planner usually has to scan instead of seek, because the comparison now runs against an expression the index was never built on&lt;/strong&gt;. The right type matches the index; the wrong type silently disables it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CREATE INDEX … ON t (col)&lt;/code&gt;&lt;/strong&gt; — default B-tree, uses the type's default operator class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1&lt;/code&gt; with matching type&lt;/strong&gt; — index seek.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1::other_type&lt;/code&gt;&lt;/strong&gt; — index seek when the cast is on the &lt;strong&gt;literal&lt;/strong&gt; side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col::other_type = $1&lt;/code&gt;&lt;/strong&gt; — sequential scan; you cast the column, not the value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;user_id BIGINT&lt;/code&gt; column with a B-tree index, queried two ways.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;th&gt;rows scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;td&gt;~1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan (literal cast)&lt;/td&gt;
&lt;td&gt;~1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;td&gt;full table&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = 42&lt;/code&gt; — both sides are &lt;code&gt;BIGINT&lt;/code&gt;; planner uses the B-tree directly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id = '42'&lt;/code&gt; — PostgreSQL coerces the string literal &lt;code&gt;'42'&lt;/code&gt; to &lt;code&gt;BIGINT&lt;/code&gt; (since &lt;code&gt;BIGINT&lt;/code&gt; is the indexed side); index still usable.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt; — the cast is on the &lt;em&gt;column&lt;/em&gt;; PostgreSQL would have to apply the &lt;code&gt;::text&lt;/code&gt; function to every row to compare; the B-tree on &lt;code&gt;user_id&lt;/code&gt; cannot help.&lt;/li&gt;
&lt;li&gt;The third predicate triggers a full sequential scan even though an index "exists on &lt;code&gt;user_id&lt;/code&gt;."&lt;/li&gt;
&lt;li&gt;Diagnosis is an &lt;code&gt;EXPLAIN&lt;/code&gt; away: &lt;code&gt;Seq Scan on … Filter: ((user_id)::text = '42'::text)&lt;/code&gt; is the giveaway.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Keep casts on the literal side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- good: cast literal, index used&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;      &lt;span class="c1"&gt;-- literal '42' coerced to BIGINT&lt;/span&gt;

&lt;span class="c1"&gt;-- bad: cast column, index killed&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you see a &lt;code&gt;::&lt;/code&gt; on a column inside a &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt;, expect a seq scan and ask whether the underlying type should change.&lt;/p&gt;
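&lt;p&gt;When the text-shaped predicate genuinely cannot be rewritten (a legacy client that always sends strings, say), an expression index is the usual escape hatch — a sketch, with the index name purely illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- index the expression itself; the cast-on-column predicate can now seek
CREATE INDEX events_user_id_text_idx ON events ((user_id::text));

-- WHERE user_id::text = '42' now matches the indexed expression → Index Scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;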

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Declaring every text column as &lt;code&gt;VARCHAR(255)&lt;/code&gt; "just in case" — wastes nothing on storage but lies in the schema about the real constraint.&lt;/li&gt;
&lt;li&gt;Storing numeric IDs as &lt;code&gt;TEXT&lt;/code&gt; because the source CSV had quotes — every downstream comparison and index becomes a hazard.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;TIMESTAMP&lt;/code&gt; and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; in joins — comparison depends on the session time zone; you have written a query that returns different rows for different users.&lt;/li&gt;
&lt;li&gt;Treating implicit coercion as free — the cost hides inside a seq scan, and only the &lt;code&gt;Filter:&lt;/code&gt; line of an &lt;code&gt;EXPLAIN&lt;/code&gt; gives it away.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;CHECK&lt;/code&gt; constraints because "the application handles it" — types and constraints together are the only durable schema.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Picking Types for an Orders Schema
&lt;/h3&gt;

&lt;p&gt;A junior teammate sends a &lt;code&gt;CREATE TABLE orders&lt;/code&gt; script: &lt;code&gt;order_id VARCHAR(255)&lt;/code&gt;, &lt;code&gt;total FLOAT&lt;/code&gt;, &lt;code&gt;customer_id TEXT&lt;/code&gt;, &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt;. The orders application is global, has ~5 M orders per day, and is joined daily to &lt;code&gt;dim_customer (customer_id BIGINT, …)&lt;/code&gt;. &lt;strong&gt;Identify every type-level risk in this schema and rewrite it so reports stay correct, joins stay indexed, and storage doesn't bloat.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Tight Native Types + &lt;code&gt;NUMERIC&lt;/code&gt; + &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; + &lt;code&gt;CHECK&lt;/code&gt; Constraints
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;     &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;total&lt;/span&gt;       &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;placed_at&lt;/span&gt;   &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;placed_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the four problems:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;original type&lt;/th&gt;
&lt;th&gt;risk&lt;/th&gt;
&lt;th&gt;fix&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id VARCHAR(255)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;lexicographic sort; wide rows; index mismatch&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGSERIAL&lt;/code&gt; / &lt;code&gt;BIGINT&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;total FLOAT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;binary rounding (0.1 + 0.2 ≠ 0.3); aggregates drift&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(14,2)&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;customer_id TEXT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;cross-type join with &lt;code&gt;dim_customer.customer_id BIGINT&lt;/code&gt;; seq scan&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGINT&lt;/code&gt; + FK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;placed_at TIMESTAMP&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;wall-clock semantics; reports differ per session TZ&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; a typed, constrained schema. The daily customer-join now uses a B-tree seek on &lt;code&gt;customer_id&lt;/code&gt;; revenue rollups are exact to the cent; "orders placed today" is unambiguous regardless of the analyst's session time zone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGSERIAL&lt;/code&gt; PK&lt;/strong&gt; — monotonic, 8-byte integer; supports range scans, packs tight, and matches every downstream join.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT customer_id&lt;/code&gt; with FK&lt;/strong&gt; — joins are type-identical, the index is usable, and orphan rows are rejected at write time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; for money&lt;/strong&gt; — exact decimal arithmetic; aggregates over millions of rows produce the same total a calculator would.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; for &lt;code&gt;placed_at&lt;/code&gt;&lt;/strong&gt; — every value is stored as a UTC instant; display converts to the session TZ; reports never silently shift by 24 h after a deploy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CHECK (total &amp;gt;= 0)&lt;/code&gt;&lt;/strong&gt; — durable invariant; even a buggy ETL run cannot insert negative revenue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — a few bytes per row of difference vs the original; in exchange, each join probe drops from the O(N) seq scan forced by the type mismatch to an O(log N) index seek.&lt;/li&gt;
&lt;/ul&gt;
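&lt;p&gt;For illustration, a grain-correct daily rollup over the fixed schema — the half-open range keeps the cast off the &lt;code&gt;placed_at&lt;/code&gt; column, so its index stays usable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- revenue per customer for one UTC day
SELECT customer_id,
       SUM(total) AS day_revenue                 -- exact NUMERIC arithmetic
FROM orders
WHERE placed_at &amp;gt;= TIMESTAMPTZ '2026-04-13 00:00:00+00'
  AND placed_at &amp;lt;  TIMESTAMPTZ '2026-04-14 00:00:00+00'
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;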

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for type-fluency reps and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregations" rel="noopener noreferrer"&gt;aggregation topic&lt;/a&gt; for grain-correct rollups.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. Numeric types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Integers for counts, NUMERIC for money, FLOAT for measurements
&lt;/h3&gt;

&lt;p&gt;PostgreSQL splits numeric types into three families: &lt;strong&gt;exact integers&lt;/strong&gt; (&lt;code&gt;SMALLINT&lt;/code&gt;, &lt;code&gt;INTEGER&lt;/code&gt;, &lt;code&gt;BIGINT&lt;/code&gt;), &lt;strong&gt;arbitrary-precision exact decimals&lt;/strong&gt; (&lt;code&gt;NUMERIC(p, s)&lt;/code&gt; / &lt;code&gt;DECIMAL&lt;/code&gt;), and &lt;strong&gt;binary floating point&lt;/strong&gt; (&lt;code&gt;REAL&lt;/code&gt;, &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;). The choice is rarely about precision in the abstract — it's about &lt;em&gt;which arithmetic errors are acceptable&lt;/em&gt;. Integers never lose precision; &lt;code&gt;NUMERIC&lt;/code&gt; is exact at a fixed scale; floats trade precision for speed and are the wrong default for currency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ai4fmsxhgogm1i0xrfu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ai4fmsxhgogm1i0xrfu.jpeg" alt="Side-by-side comparison of PostgreSQL-style integer, floating point, and numeric decimal types for counts versus money with a float rounding warning." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When asked "what type is &lt;code&gt;revenue&lt;/code&gt;?", say &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; and name &lt;code&gt;p&lt;/code&gt; and &lt;code&gt;s&lt;/code&gt; out loud — &lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; for cents up to ~$1 T, &lt;code&gt;NUMERIC(18, 4)&lt;/code&gt; for FX rates and basis points. Knowing the scale is what separates "I know decimals exist" from "I have shipped a ledger."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;INTEGER&lt;/code&gt; / &lt;code&gt;BIGINT&lt;/code&gt; — surrogate keys and counts
&lt;/h4&gt;

&lt;p&gt;The integer invariant: &lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt; is 4 bytes (range ±2.1 B) and &lt;code&gt;BIGINT&lt;/code&gt; is 8 bytes (range ±9.2 quintillion); use &lt;code&gt;INTEGER&lt;/code&gt; for small/medium counts and &lt;code&gt;BIGINT&lt;/code&gt; for surrogate keys, monotonically increasing IDs, and anything that might ever cross 2 billion&lt;/strong&gt;. Overflow is silent in some languages but is a hard error in PostgreSQL — once the sequence exceeds the column's range, every insert fails.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SMALLINT&lt;/code&gt;&lt;/strong&gt; — 2 bytes; rarely used outside tightly packed enum-like values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — 4 bytes; default for row counts, scores, age, quantities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — 8 bytes; default for primary keys on growing tables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BIGSERIAL&lt;/code&gt; / &lt;code&gt;GENERATED AS IDENTITY&lt;/code&gt;&lt;/strong&gt; — 8-byte auto-incrementing PK.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; An events table grows from 1 M to 3 B rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;th&gt;events&lt;/th&gt;
&lt;th&gt;INTEGER PK?&lt;/th&gt;
&lt;th&gt;BIGINT PK?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;td&gt;1 M&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2025&lt;/td&gt;
&lt;td&gt;500 M&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;2.5 B&lt;/td&gt;
&lt;td&gt;✗ overflow&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with &lt;code&gt;event_id INTEGER&lt;/code&gt; — fits 2.1 B values.&lt;/li&gt;
&lt;li&gt;Daily growth at 5 M / day reaches 2.1 B by mid-2026.&lt;/li&gt;
&lt;li&gt;Next &lt;code&gt;INSERT&lt;/code&gt; fails: &lt;code&gt;ERROR: integer out of range&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Migration requires &lt;code&gt;ALTER TABLE … ALTER COLUMN event_id TYPE BIGINT;&lt;/code&gt; — rewrites the entire table; locks scale with table size.&lt;/li&gt;
&lt;li&gt;Doing this at 2.1 B rows means hours of downtime; doing it at table creation is free.&lt;/li&gt;
&lt;/ol&gt;
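&lt;p&gt;A periodic headroom check (an illustrative probe) turns the surprise outage into a planned migration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- fraction of the INTEGER range the PK has already consumed
SELECT MAX(event_id)::numeric / 2147483647 AS fraction_used
FROM events;
-- alert long before fraction_used approaches 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;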

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use &lt;code&gt;BIGINT&lt;/code&gt; for any growing PK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;   &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;        &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every primary key on a table that "might be big someday" is &lt;code&gt;BIGINT&lt;/code&gt; from day one. The 4 extra bytes per row are the cheapest insurance you can buy.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; — exact decimal for currency
&lt;/h4&gt;

&lt;p&gt;The decimal invariant: &lt;strong&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt; stores &lt;code&gt;p&lt;/code&gt; total digits with &lt;code&gt;s&lt;/code&gt; of them after the decimal point; arithmetic is exact at that scale; &lt;code&gt;SUM(NUMERIC)&lt;/code&gt; over millions of rows produces the byte-identical result a careful accountant would compute by hand&lt;/strong&gt;. The cost is performance — &lt;code&gt;NUMERIC&lt;/code&gt; math is slower than integer or float — but for currency the trade-off is settled: exact wins.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt;&lt;/strong&gt; — up to 12 digits before the decimal, 2 after; ~$1 T.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(18, 4)&lt;/code&gt;&lt;/strong&gt; — FX rates, fractional cents (interest, allocations).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(38, 6)&lt;/code&gt;&lt;/strong&gt; — analytics-warehouse scale; matches Snowflake / BigQuery default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Storage&lt;/strong&gt; — ~2 bytes overhead + 2 bytes per 4 digits; cheap up to ~$1 T.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Summing 1,000 invoice lines of &lt;code&gt;$0.10&lt;/code&gt; each:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;storage type&lt;/th&gt;
&lt;th&gt;&lt;code&gt;SUM(amount)&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;99.9999999999986&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;100.00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;0.1&lt;/code&gt; cannot be represented exactly in binary floating point; the stored value is &lt;code&gt;0.1000000000000000055511…&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Adding 1,000 of these in &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; accumulates &lt;code&gt;1000 * tiny_error&lt;/code&gt;; the result drifts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; stores &lt;code&gt;0.10&lt;/code&gt; literally and adds with decimal arithmetic; 1,000 × &lt;code&gt;0.10&lt;/code&gt; is exactly &lt;code&gt;100.00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The float error is invisible until a finance lead notices a fraction-of-a-cent discrepancy on a reconciliation report.&lt;/li&gt;
&lt;li&gt;Once the column type is &lt;code&gt;NUMERIC&lt;/code&gt;, the drift is impossible by construction.&lt;/li&gt;
&lt;/ol&gt;
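&lt;p&gt;The drift reproduces in one illustrative query with &lt;code&gt;generate_series&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- 1,000 × 0.1 summed as float vs as exact decimal
SELECT SUM(0.1::double precision) AS float_sum,  -- 99.9999999999986
       SUM(0.1::numeric(14,2))    AS exact_sum   -- 100.00
FROM generate_series(1, 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;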

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Currency columns always use &lt;code&gt;NUMERIC&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;line_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;      &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;quantity&lt;/span&gt;   &lt;span class="nb"&gt;INTEGER&lt;/span&gt;        &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;line_total&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;unit_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; anything that touches money, tax, allocations, basis points, or a regulated ledger is &lt;code&gt;NUMERIC(p, s)&lt;/code&gt; — never &lt;code&gt;FLOAT&lt;/code&gt; or &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;REAL&lt;/code&gt; / &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; — binary floating point and rounding
&lt;/h4&gt;

&lt;p&gt;The float invariant: &lt;strong&gt;&lt;code&gt;REAL&lt;/code&gt; (4 bytes, ~7 decimal digits) and &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; (8 bytes, ~15 digits) follow IEEE 754; they're fast and compact but inexact at decimal fractions; their natural home is measurements where the underlying quantity is itself approximate (sensor reading, ML feature, scientific magnitude)&lt;/strong&gt;. Floats are not "lossy currency" — they are the right type for things that were never exact to begin with.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;REAL&lt;/code&gt;&lt;/strong&gt; — 4 bytes; ~7 decimal digits of precision.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/strong&gt; — 8 bytes; ~15 digits; PostgreSQL's default &lt;code&gt;FLOAT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;0.1 + 0.2 ≠ 0.3&lt;/code&gt; in both&lt;/strong&gt; — &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; yields &lt;code&gt;0.30000000000000004&lt;/code&gt;, &lt;code&gt;REAL&lt;/code&gt; about &lt;code&gt;0.30000001&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use cases&lt;/strong&gt; — physical measurements, geographic coordinates, ML scores, neural-net outputs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same 5 sensor readings stored two ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;reading&lt;/th&gt;
&lt;th&gt;&lt;code&gt;REAL&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;DOUBLE PRECISION&lt;/code&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;td&gt;23.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0.1 + 0.2&lt;/td&gt;
&lt;td&gt;0.3 (~0.30000001)&lt;/td&gt;
&lt;td&gt;0.30000000000000004&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3.141592653589793&lt;/td&gt;
&lt;td&gt;3.1415927&lt;/td&gt;
&lt;td&gt;3.141592653589793&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;REAL&lt;/code&gt; rounds aggressively after ~7 digits; fine for a temperature gauge, wrong for a price.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DOUBLE PRECISION&lt;/code&gt; keeps ~15 digits — enough for almost any measurement.&lt;/li&gt;
&lt;li&gt;Neither stores &lt;code&gt;0.1 + 0.2&lt;/code&gt; as exactly &lt;code&gt;0.3&lt;/code&gt; because base-2 cannot represent base-10 tenths.&lt;/li&gt;
&lt;li&gt;Equality (&lt;code&gt;=&lt;/code&gt;) on floats is unsafe; use a tolerance (&lt;code&gt;abs(a - b) &amp;lt; 1e-9&lt;/code&gt;) for "approximately equal" — see the sketch after this list.&lt;/li&gt;
&lt;li&gt;For currency, both are wrong — use &lt;code&gt;NUMERIC&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
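&lt;p&gt;A minimal sketch of that tolerance rule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- exact '=' on floats is unreliable; compare within a tolerance instead
SELECT 0.1::double precision + 0.2 = 0.3                  AS exact_eq,   -- FALSE
       abs((0.1::double precision + 0.2) - 0.3) &amp;lt; 1e-9 AS approx_eq;  -- TRUE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;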

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use floats for genuinely approximate measurements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensor_readings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;reading_id&lt;/span&gt;   &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;        &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;           &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temp_celsius&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;           &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you would compare the value with &lt;code&gt;=&lt;/code&gt; and care about the result, it is not a float.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting all PKs to &lt;code&gt;SERIAL&lt;/code&gt; (32-bit) and discovering the overflow in production years later.&lt;/li&gt;
&lt;li&gt;Storing money in &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; because &lt;code&gt;NUMERIC&lt;/code&gt; "is slow" — the slowdown is invisible to humans; the rounding is not.&lt;/li&gt;
&lt;li&gt;Using bare &lt;code&gt;NUMERIC&lt;/code&gt; with no &lt;code&gt;(p, s)&lt;/code&gt; — works, but forfeits the documentation value of stating the scale.&lt;/li&gt;
&lt;li&gt;Comparing floats with &lt;code&gt;=&lt;/code&gt; instead of a tolerance window.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;INTEGER&lt;/code&gt; for cents (&lt;code&gt;total_cents&lt;/code&gt;) instead of &lt;code&gt;NUMERIC(14, 2)&lt;/code&gt; — works but burdens every read with a &lt;code&gt;/100.0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Reconciling a Drifting Invoice Total
&lt;/h3&gt;

&lt;p&gt;The CFO reports that the monthly invoice total in the dashboard disagrees with the source-of-truth ledger by &lt;code&gt;$0.0000034&lt;/code&gt; on average. The dashboard sums an &lt;code&gt;invoice_lines.amount&lt;/code&gt; column declared as &lt;code&gt;DOUBLE PRECISION&lt;/code&gt;. &lt;strong&gt;Identify the cause and propose a schema fix that makes the totals byte-identical to the ledger from now on.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;NUMERIC(14, 4)&lt;/code&gt; + a Generated &lt;code&gt;line_total&lt;/code&gt; Column
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
    &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;line_total&lt;/span&gt; &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;STORED&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- nightly reconciliation&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;line_total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dash_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoice_lines&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoice_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the drift:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;th&gt;running sum (DOUBLE PRECISION)&lt;/th&gt;
&lt;th&gt;running sum (NUMERIC)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.2&lt;/td&gt;
&lt;td&gt;0.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;0.30000000000000004&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;…&lt;/td&gt;
&lt;td&gt;accumulating error&lt;/td&gt;
&lt;td&gt;exact&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;0.1&lt;/td&gt;
&lt;td&gt;99.9999999999986&lt;/td&gt;
&lt;td&gt;100.00&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; dashboard total per day now matches the ledger to the cent (or to the basis point, given scale 4). No silent drift; finance closes the books without manual adjustment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NUMERIC(14, 4)&lt;/code&gt; exact decimal arithmetic&lt;/strong&gt; — every addition stays exact at four decimal places; no IEEE 754 representation error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generated &lt;code&gt;line_total&lt;/code&gt; column&lt;/strong&gt; — eliminates a class of bugs where the application computes &lt;code&gt;qty * price&lt;/code&gt; and the database computes a slightly different number.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;STORED&lt;/code&gt; not &lt;code&gt;VIRTUAL&lt;/code&gt;&lt;/strong&gt; — value is materialised once at write time; reads are plain column reads with no per-row recomputation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tolerance check on the ETL side&lt;/strong&gt; — even with &lt;code&gt;NUMERIC&lt;/code&gt;, reconciliation should compare against the source-of-truth ledger with a 0-tolerance gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One-time &lt;code&gt;ALTER TABLE … USING&lt;/code&gt;&lt;/strong&gt; — converts existing rows in place; from then on the type system makes drift impossible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — single rewrite at migration; per-row &lt;code&gt;NUMERIC&lt;/code&gt; math is ~3× slower than &lt;code&gt;DOUBLE PRECISION&lt;/code&gt; but invisible compared to network and disk costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the structured currency-and-aggregation path see &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Text and binary
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CHAR vs VARCHAR vs TEXT, collations, and BYTEA
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has three character types — &lt;strong&gt;&lt;code&gt;CHAR(n)&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — and one binary type, &lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt;&lt;/strong&gt;. The decision rule is short: use &lt;code&gt;TEXT&lt;/code&gt; unless you have a hard reason to enforce a length cap, and store files outside the database with a URL or object-store key in the column. Most "text" bugs are not about storage at all — they are about &lt;strong&gt;collations&lt;/strong&gt;, which control how text &lt;em&gt;compares&lt;/em&gt; and &lt;em&gt;sorts&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Two strings that look identical can compare unequal under a different collation. When a join "returns no rows" on string keys, your first check after &lt;code&gt;EXPLAIN&lt;/code&gt; is &lt;code&gt;SHOW lc_collate;&lt;/code&gt; and &lt;code&gt;SELECT pg_collation_for(col1)&lt;/code&gt; on both columns.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CHAR&lt;/code&gt; vs &lt;code&gt;VARCHAR&lt;/code&gt; vs &lt;code&gt;TEXT&lt;/code&gt; — pick &lt;code&gt;TEXT&lt;/code&gt; unless you need fixed-width
&lt;/h4&gt;

&lt;p&gt;The text invariant: &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; and &lt;code&gt;VARCHAR(n)&lt;/code&gt; have the same on-disk representation in PostgreSQL — no padding, no length penalty; the only difference is the &lt;code&gt;(n)&lt;/code&gt; constraint that throws an error on overflow&lt;/strong&gt;. &lt;code&gt;CHAR(n)&lt;/code&gt; pads with spaces to length, costing both storage and surprises: most comparisons ignore the trailing pad, but casts and joins against other text types can still misbehave.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CHAR(n)&lt;/code&gt;&lt;/strong&gt; — fixed-width; pads with spaces; storage = &lt;code&gt;n&lt;/code&gt; bytes (plus a length header).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; — variable-width; rejects values longer than &lt;code&gt;n&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — variable-width; no length limit (up to 1 GB).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;citext&lt;/code&gt; extension&lt;/strong&gt; — case-insensitive text via the &lt;code&gt;citext&lt;/code&gt; type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Storing &lt;code&gt;"abc"&lt;/code&gt; three ways:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;type&lt;/th&gt;
&lt;th&gt;stored bytes&lt;/th&gt;
&lt;th&gt;trailing pad&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CHAR(5)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; + 2 pad spaces (5 bytes)&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;VARCHAR(5)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; (3 bytes)&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;abc&lt;/code&gt; (3 bytes)&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CHAR(5)&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt; plus two trailing spaces (5 chars), padding to length.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VARCHAR(5)&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt;; would reject &lt;code&gt;abcdef&lt;/code&gt; with a length-violation error.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TEXT&lt;/code&gt; stores &lt;code&gt;abc&lt;/code&gt;; would accept &lt;code&gt;abcdef&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Equality semantics differ: &lt;code&gt;CHAR(5) 'abc' = VARCHAR(5) 'abc'&lt;/code&gt; may be &lt;code&gt;TRUE&lt;/code&gt; but joining a &lt;code&gt;CHAR&lt;/code&gt; column to a &lt;code&gt;VARCHAR&lt;/code&gt; column from another table can still fail when one side preserved trailing whitespace.&lt;/li&gt;
&lt;li&gt;Default to &lt;code&gt;TEXT&lt;/code&gt; — it is the simplest and never accumulates these padding surprises.&lt;/li&gt;
&lt;/ol&gt;
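&lt;p&gt;The padding behaviour in miniature (illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- CHAR pads on write; comparisons and casts then mostly hide the pad
SELECT 'abc'::char(5) = 'abc'        AS bpchar_eq,  -- TRUE: pad ignored in comparison
       'abc'::char(5)::text          AS cast_val,   -- 'abc': cast strips trailing spaces
       octet_length('abc'::char(5))  AS stored;     -- 5: the pad really is stored
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;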

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Schema for a free-form &lt;code&gt;bio&lt;/code&gt; field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;profiles&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;bio&lt;/span&gt;     &lt;span class="nb"&gt;TEXT&lt;/span&gt;   &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; use &lt;code&gt;VARCHAR(n)&lt;/code&gt; &lt;em&gt;only&lt;/em&gt; when you genuinely want the database to enforce a maximum length (e.g., regulator-imposed &lt;code&gt;description VARCHAR(280)&lt;/code&gt;); otherwise reach for &lt;code&gt;TEXT&lt;/code&gt;.&lt;/p&gt;
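&lt;p&gt;When the cap is genuine, the enforcement is a hard error — a small illustrative check:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE notes (description VARCHAR(280));

INSERT INTO notes VALUES (repeat('x', 280));  -- OK: exactly at the cap
INSERT INTO notes VALUES (repeat('x', 281));
-- ERROR:  value too long for type character varying(280)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;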

&lt;h4&gt;
  
  
  Collations and locale-aware equality
&lt;/h4&gt;

&lt;p&gt;The collation invariant: &lt;strong&gt;a collation is a tuple of (alphabet, sort order, case-sensitivity, accent-sensitivity) that the database applies to every text comparison; the default is usually &lt;code&gt;"C"&lt;/code&gt; (binary) or the OS locale; case-insensitive matching requires either an explicit &lt;code&gt;ICU&lt;/code&gt; collation or the &lt;code&gt;citext&lt;/code&gt; extension&lt;/strong&gt;. Two databases with different locales can disagree on whether &lt;code&gt;'café' = 'cafe'&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;C&lt;/code&gt; collation&lt;/strong&gt; — byte-by-byte; fastest; case- and accent-sensitive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;en_US.UTF-8&lt;/code&gt;&lt;/strong&gt; — locale-aware; sorts &lt;code&gt;'a' &amp;lt; 'B' &amp;lt; 'c'&lt;/code&gt; (case-insensitive primary).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;und-x-icu&lt;/code&gt;&lt;/strong&gt; — ICU root locale; consistent across platforms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;citext&lt;/code&gt;&lt;/strong&gt; — case-insensitive text type; &lt;code&gt;'ABC' = 'abc'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; automatically.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Joining users by email under different collations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;left email&lt;/th&gt;
&lt;th&gt;right email&lt;/th&gt;
&lt;th&gt;join match (C)&lt;/th&gt;
&lt;th&gt;join match (citext)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alice@X.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' alice@x.com'&lt;/code&gt; (leading space)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'alice@x.com'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗ (whitespace, not case)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Default &lt;code&gt;C&lt;/code&gt; collation does a byte compare; &lt;code&gt;'A'&lt;/code&gt; (0x41) is not equal to &lt;code&gt;'a'&lt;/code&gt; (0x61).&lt;/li&gt;
&lt;li&gt;Same string with mixed case fails to join in &lt;code&gt;C&lt;/code&gt; even though humans see them as the same email.&lt;/li&gt;
&lt;li&gt;Switching the column type to &lt;code&gt;citext&lt;/code&gt; makes the database compare case-insensitively, and the second row matches.&lt;/li&gt;
&lt;li&gt;Whitespace differences still cause mismatches — &lt;code&gt;citext&lt;/code&gt; does not trim; that requires &lt;code&gt;BTRIM(col)&lt;/code&gt; in ETL.&lt;/li&gt;
&lt;li&gt;Pick one normalization rule (lowercase + trim at write time) and apply it consistently rather than relying on collation alone.&lt;/li&gt;
&lt;/ol&gt;
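&lt;p&gt;Before changing any types, it is worth quantifying how much of the mismatch is case or whitespace versus genuinely absent users — an illustrative diagnostic over the example's &lt;code&gt;events&lt;/code&gt; and &lt;code&gt;users&lt;/code&gt; tables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- events that fail the raw join, split by whether normalisation would recover them
SELECT COUNT(*) AS unmatched_raw,
       COUNT(*) FILTER (
         WHERE EXISTS (SELECT 1 FROM users u
                       WHERE LOWER(BTRIM(u.email)) = LOWER(BTRIM(e.email)))
       ) AS recoverable_by_normalisation
FROM events e
WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.email = e.email);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;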

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Use &lt;code&gt;citext&lt;/code&gt; for emails and usernames:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;citext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt;   &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you ever want &lt;code&gt;'Foo' = 'foo'&lt;/code&gt; to be &lt;code&gt;TRUE&lt;/code&gt;, set that contract at the column type, not at every &lt;code&gt;LOWER(...)&lt;/code&gt; call site.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;BYTEA&lt;/code&gt; for binary blobs vs URL-in-SQL for files
&lt;/h4&gt;

&lt;p&gt;The binary invariant: &lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt; stores raw bytes (hashes, signatures, compressed payloads, small binary tokens); large blobs (images, PDFs, ML model weights) belong in object storage (S3, GCS) with a &lt;code&gt;TEXT&lt;/code&gt; URL or key in SQL&lt;/strong&gt;. Databases are not file systems — every byte stored in &lt;code&gt;BYTEA&lt;/code&gt; slows backups, replication, and query cache.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BYTEA&lt;/code&gt;&lt;/strong&gt; — variable-length binary; up to 1 GB but typically used for ≤ 10 KB tokens.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SHA-256&lt;/code&gt; hash&lt;/strong&gt; — 32 bytes; perfect &lt;code&gt;BYTEA&lt;/code&gt; use case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large files&lt;/strong&gt; — store in S3; keep &lt;code&gt;s3_key TEXT&lt;/code&gt; in SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pg_largeobject&lt;/code&gt;&lt;/strong&gt; — legacy API; rarely worth the complexity vs object storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;documents&lt;/code&gt; table with two design choices:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;per-row storage&lt;/th&gt;
&lt;th&gt;backup time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;body BYTEA&lt;/code&gt; (10 MB PDFs, 1 M rows)&lt;/td&gt;
&lt;td&gt;10 TB of TOASTed table data
&lt;/td&gt;
&lt;td&gt;hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;s3_key TEXT&lt;/code&gt; (URL only, 1 M rows)&lt;/td&gt;
&lt;td&gt;&amp;lt; 100 MB&lt;/td&gt;
&lt;td&gt;seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing 10 MB PDFs in &lt;code&gt;BYTEA&lt;/code&gt; puts all bytes in TOAST; the table grows to 10 TB.&lt;/li&gt;
&lt;li&gt;Every &lt;code&gt;pg_dump&lt;/code&gt; reads all 10 TB; backups stretch to hours, not minutes.&lt;/li&gt;
&lt;li&gt;Replication lag grows; HA failover slows.&lt;/li&gt;
&lt;li&gt;Object storage (S3) is purpose-built for large files; the database keeps only a 50-byte &lt;code&gt;s3_key&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Reads still feel "one query" — the application fetches the URL from SQL, then streams the file from S3.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Store files externally; keep the key in SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;document_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sha256&lt;/span&gt;      &lt;span class="n"&gt;BYTEA&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;CHECK&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;octet_length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;s3_key&lt;/span&gt;      &lt;span class="nb"&gt;TEXT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;uploaded_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; a common rough threshold is 100 KB — above that, the payload belongs in object storage; below it, &lt;code&gt;BYTEA&lt;/code&gt; is usually fine.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Declaring text columns as &lt;code&gt;VARCHAR(255)&lt;/code&gt; everywhere — a habit inherited from the MySQL world; the 255 limit buys nothing in modern PostgreSQL.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;CHAR(n)&lt;/code&gt; and being surprised by trailing-space semantics — &lt;code&gt;'abc'&lt;/code&gt; and &lt;code&gt;'abc '&lt;/code&gt; compare equal as &lt;code&gt;CHAR(n)&lt;/code&gt; but can compare unequal once values are cast to &lt;code&gt;TEXT&lt;/code&gt; in join contexts.&lt;/li&gt;
&lt;li&gt;Storing emails as case-sensitive &lt;code&gt;TEXT&lt;/code&gt; and writing &lt;code&gt;LOWER(email) = LOWER($1)&lt;/code&gt; everywhere — set &lt;code&gt;citext&lt;/code&gt; once at the column.&lt;/li&gt;
&lt;li&gt;Putting megabyte payloads in &lt;code&gt;BYTEA&lt;/code&gt; and discovering the cost only when &lt;code&gt;pg_dump&lt;/code&gt; runs for six hours.&lt;/li&gt;
&lt;li&gt;Forgetting to trim whitespace at ingest — &lt;code&gt;'  alice@x.com'&lt;/code&gt; and &lt;code&gt;'alice@x.com'&lt;/code&gt; are different strings to the database.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Reconciling Case-Sensitive Email Joins
&lt;/h3&gt;

&lt;p&gt;A signup flow stores &lt;code&gt;users.email&lt;/code&gt; as &lt;code&gt;TEXT&lt;/code&gt;. The marketing dashboard joins &lt;code&gt;events.email&lt;/code&gt; (also &lt;code&gt;TEXT&lt;/code&gt;) to &lt;code&gt;users.email&lt;/code&gt; to count signed-up users. Roughly 8% of events fail to match even though the user definitely signed up. &lt;strong&gt;Diagnose the cause and propose a column-level fix that prevents recurrence.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;citext&lt;/code&gt; + Normalised Write-Time Email
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;citext&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;   &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;  &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;CITEXT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- joins now match regardless of case; rejoin to verify&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the 8% miss:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event email&lt;/th&gt;
&lt;th&gt;user email&lt;/th&gt;
&lt;th&gt;TEXT join&lt;/th&gt;
&lt;th&gt;CITEXT join&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alice@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bob@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bob@x.com&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'  carol@x.com'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;'carol@x.com'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗ (leading whitespace; fixed by &lt;code&gt;BTRIM&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the case-sensitivity portion of the miss disappears (≈ 7%); the remaining ≈ 1% is whitespace, fixed by &lt;code&gt;BTRIM&lt;/code&gt; in the &lt;code&gt;USING&lt;/code&gt; clause at migration and a &lt;code&gt;BEFORE INSERT&lt;/code&gt; trigger going forward (sketched below).&lt;/p&gt;
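
&lt;p&gt;A minimal sketch of that write-time guard (the function and trigger names &lt;code&gt;normalise_email&lt;/code&gt; and &lt;code&gt;trg_users_email&lt;/code&gt; are illustrative; &lt;code&gt;EXECUTE FUNCTION&lt;/code&gt; needs PostgreSQL 11+):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- hypothetical names; canonicalise before the row is written
CREATE OR REPLACE FUNCTION normalise_email() RETURNS trigger AS $$
BEGIN
    -- trim stray whitespace, then lower-case
    NEW.email := LOWER(BTRIM(NEW.email));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_users_email
BEFORE INSERT OR UPDATE OF email ON users
FOR EACH ROW EXECUTE FUNCTION normalise_email();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;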

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CITEXT&lt;/code&gt; columns&lt;/strong&gt; — case-insensitive by construction; downstream queries never have to wrap &lt;code&gt;LOWER(...)&lt;/code&gt; and indexes still work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOWER(BTRIM(email))&lt;/code&gt; in the &lt;code&gt;USING&lt;/code&gt; clause&lt;/strong&gt; — one-shot normalisation of existing rows during the type change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trigger or &lt;code&gt;CHECK&lt;/code&gt; enforcement going forward&lt;/strong&gt; — keeps future inserts canonical.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No more &lt;code&gt;LOWER(...)&lt;/code&gt; at every query site&lt;/strong&gt; — every analyst joins safely without remembering the casing rule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing indexes rebuild automatically&lt;/strong&gt; — &lt;code&gt;ALTER COLUMN TYPE&lt;/code&gt; rebuilds the index against the new operator class.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one rewrite at migration; per-row equality cost identical to &lt;code&gt;TEXT&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the string-fluency syllabus see &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — string manipulation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;String-manipulation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/string-manipulation" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — joins&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL join problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — SQL for DE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Zero to FAANG SQL fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Boolean and NULL
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Three-valued logic and the &lt;code&gt;WHERE flag&lt;/code&gt; trap
&lt;/h3&gt;

&lt;p&gt;PostgreSQL has a real &lt;strong&gt;&lt;code&gt;BOOLEAN&lt;/code&gt;&lt;/strong&gt; type with three values: &lt;code&gt;TRUE&lt;/code&gt;, &lt;code&gt;FALSE&lt;/code&gt;, and &lt;code&gt;NULL&lt;/code&gt;. The third value is the source of nearly every "where did my rows go?" bug — &lt;code&gt;NULL&lt;/code&gt; is &lt;em&gt;not&lt;/em&gt; false; it is &lt;em&gt;unknown&lt;/em&gt;. Filters like &lt;code&gt;WHERE flag&lt;/code&gt; silently exclude &lt;code&gt;NULL&lt;/code&gt; rows, and &lt;code&gt;WHERE NOT flag&lt;/code&gt; excludes them too, so a "true-or-not-true" pair of queries can together miss rows entirely.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Whenever you write a boolean predicate, name the third bucket out loud. "Active users are &lt;code&gt;is_active = TRUE&lt;/code&gt;; bots are &lt;code&gt;is_bot = TRUE&lt;/code&gt;; unknown is &lt;code&gt;IS NULL&lt;/code&gt; and goes into the &lt;em&gt;needs-investigation&lt;/em&gt; drawer." That habit catches the silent-exclusion bug before it ships.&lt;/p&gt;
&lt;/blockquote&gt;
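
&lt;p&gt;A sketch of that habit as a single-scan query with three named buckets, using the &lt;code&gt;events.is_bot&lt;/code&gt; example that runs through this section:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- every row lands in exactly one bucket; the three counts sum to COUNT(*)
SELECT
    COUNT(*) FILTER (WHERE is_bot IS TRUE)  AS bots,
    COUNT(*) FILTER (WHERE is_bot IS FALSE) AS humans,
    COUNT(*) FILTER (WHERE is_bot IS NULL)  AS needs_investigation
FROM events;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;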

&lt;h4&gt;
  
  
  &lt;code&gt;BOOLEAN&lt;/code&gt; literals, &lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS FALSE&lt;/code&gt; / &lt;code&gt;IS NULL&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The boolean invariant: &lt;strong&gt;&lt;code&gt;WHERE flag&lt;/code&gt; returns rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows where &lt;code&gt;flag&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; (unknown) are &lt;em&gt;also&lt;/em&gt; excluded; to include or exclude them deliberately you must use &lt;code&gt;IS NULL&lt;/code&gt; / &lt;code&gt;IS NOT NULL&lt;/code&gt; / &lt;code&gt;IS DISTINCT FROM&lt;/code&gt;&lt;/strong&gt;. Standard SQL three-valued logic treats &lt;code&gt;NULL = anything&lt;/code&gt; as &lt;code&gt;NULL&lt;/code&gt;, which is neither true nor false — and a &lt;code&gt;WHERE&lt;/code&gt; clause keeps only rows that evaluate to &lt;code&gt;TRUE&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TRUE&lt;/code&gt; / &lt;code&gt;FALSE&lt;/code&gt;&lt;/strong&gt; — the two non-null boolean values.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; — unknown; not equal to anything, including itself (demonstrated just after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS FALSE&lt;/code&gt;&lt;/strong&gt; — three-valued aware; never returns NULL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS DISTINCT FROM&lt;/code&gt;&lt;/strong&gt; — NULL-safe comparison: two NULLs count as not distinct, and a NULL versus a non-NULL counts as distinct; useful for join keys.&lt;/li&gt;
&lt;/ul&gt;
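
&lt;p&gt;A quick demonstration you can run in any session; equality on NULL yields unknown, not false:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    NULL = NULL                    AS naive_eq,       -- NULL (unknown)
    (NULL = NULL) IS NULL          AS eq_is_unknown,  -- TRUE
    NULL IS NOT DISTINCT FROM NULL AS null_safe_eq;   -- TRUE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;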

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 5-row &lt;code&gt;events&lt;/code&gt; table with a nullable &lt;code&gt;is_bot&lt;/code&gt; flag:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;event_id&lt;/th&gt;
&lt;th&gt;is_bot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;rows kept&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1, 4 (only TRUE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE NOT is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 (only FALSE)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2, 3, 5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NULL&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3, 5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE is_bot&lt;/code&gt; keeps rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows 3 and 5 (NULL) are silently dropped.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE NOT is_bot&lt;/code&gt; also keeps only rows where the predicate evaluates to &lt;code&gt;TRUE&lt;/code&gt;; &lt;code&gt;NOT is_bot&lt;/code&gt; evaluates to &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;is_bot&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, so rows 3 and 5 are &lt;em&gt;still&lt;/em&gt; silently dropped.&lt;/li&gt;
&lt;li&gt;The dashboard "Bots vs non-bots" pair (&lt;code&gt;is_bot&lt;/code&gt; true / &lt;code&gt;NOT is_bot&lt;/code&gt;) sums to 3 rows, not 5 — two rows are missing in plain sight.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IS NOT TRUE&lt;/code&gt; is three-valued aware: it returns &lt;code&gt;TRUE&lt;/code&gt; for rows 2, 3, 5 — both the false ones and the nulls.&lt;/li&gt;
&lt;li&gt;Pick the form that matches your intent and audit any dashboard that splits a column on a boolean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Three-valued-aware predicates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- bots&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- non-bots, including unknown&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- only unknown&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never write &lt;code&gt;WHERE flag&lt;/code&gt; or &lt;code&gt;WHERE NOT flag&lt;/code&gt; on a nullable boolean column without consciously deciding what NULL means.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;NOT col&lt;/code&gt; vs &lt;code&gt;col = FALSE&lt;/code&gt; with NULLs
&lt;/h4&gt;

&lt;p&gt;The negation invariant: &lt;strong&gt;&lt;code&gt;col = FALSE&lt;/code&gt; and &lt;code&gt;NOT col&lt;/code&gt; are logically the same when &lt;code&gt;col&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; or &lt;code&gt;FALSE&lt;/code&gt;, but both evaluate to &lt;code&gt;NULL&lt;/code&gt; when &lt;code&gt;col IS NULL&lt;/code&gt; — and a &lt;code&gt;WHERE&lt;/code&gt; clause keeps only &lt;code&gt;TRUE&lt;/code&gt;, so both forms silently drop nulls&lt;/strong&gt;. The fix is &lt;code&gt;COALESCE(col, FALSE)&lt;/code&gt; or &lt;code&gt;IS NOT TRUE&lt;/code&gt;, which collapse NULL into a definite answer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE col = FALSE&lt;/code&gt;&lt;/strong&gt; — keeps rows where &lt;code&gt;col&lt;/code&gt; is literally &lt;code&gt;FALSE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE NOT col&lt;/code&gt;&lt;/strong&gt; — same; both drop NULL rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE COALESCE(col, FALSE) = FALSE&lt;/code&gt;&lt;/strong&gt; — treats NULL as FALSE; keeps both.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE col IS NOT TRUE&lt;/code&gt;&lt;/strong&gt; — treats NULL as not-true; keeps both.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;events&lt;/code&gt; table; analyst writes "all non-bot events":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;th&gt;comment&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot = FALSE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;row 2 only — silent miss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE NOT is_bot&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;identical; same bug&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;rows 2, 3, 5 — correct&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE COALESCE(is_bot, FALSE) = FALSE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;also correct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Marketing asks "how many non-bot events?"; analyst writes &lt;code&gt;WHERE NOT is_bot&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Result is 1; marketing thinks bots account for 4 of 5 events.&lt;/li&gt;
&lt;li&gt;A second analyst writes &lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt; and gets 3; the difference is the NULL rows.&lt;/li&gt;
&lt;li&gt;The dashboard's "bot vs non-bot" pie chart silently undercounts by 40%.&lt;/li&gt;
&lt;li&gt;The fix is &lt;em&gt;either&lt;/em&gt; a &lt;code&gt;COALESCE&lt;/code&gt; at query time &lt;em&gt;or&lt;/em&gt; a &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; constraint at schema time — both make the NULL case explicit.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Default boolean columns to a known value at write time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- queries are now safe&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a boolean has no "unknown" business meaning, declare it &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; and remove the third bucket entirely.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;COALESCE&lt;/code&gt; and explicit NULL handling
&lt;/h4&gt;

&lt;p&gt;The COALESCE invariant: &lt;strong&gt;&lt;code&gt;COALESCE(a, b, c)&lt;/code&gt; returns the first non-NULL argument; it is the simplest way to replace NULL with a default in &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, and aggregations — but use it deliberately, because hiding NULL is the same as throwing away information&lt;/strong&gt;. The right pattern is to decide whether NULL means "no answer" or "definitely false," then code that intent.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(col, default)&lt;/code&gt;&lt;/strong&gt; — first non-NULL argument.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF(a, b)&lt;/code&gt;&lt;/strong&gt; — returns NULL when &lt;code&gt;a = b&lt;/code&gt;; useful for "treat empty string as NULL" (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;a IS DISTINCT FROM b&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;TRUE&lt;/code&gt; when values differ, treating NULL as a real value.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(col)&lt;/code&gt;&lt;/strong&gt; — ignores NULLs; &lt;code&gt;COUNT(col)&lt;/code&gt; ignores NULLs; &lt;code&gt;COUNT(*)&lt;/code&gt; includes them.&lt;/li&gt;
&lt;/ul&gt;
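
&lt;p&gt;A sketch of the &lt;code&gt;NULLIF&lt;/code&gt; pattern mentioned above, assuming a hypothetical &lt;code&gt;customers.phone&lt;/code&gt; column where some ingests wrote &lt;code&gt;''&lt;/code&gt; instead of NULL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- treat empty-after-trim strings as missing, then substitute a display default
SELECT COALESCE(NULLIF(BTRIM(phone), ''), 'unknown') AS phone_display
FROM customers;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;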

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Summing &lt;code&gt;score&lt;/code&gt; where some rows are NULL:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;row&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;expression&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(score)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(COALESCE(score, 0))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(score)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;15 (n=2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(COALESCE(score, 0))&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10 (n=3)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SUM&lt;/code&gt; ignores NULLs by SQL convention; you get the same answer with or without &lt;code&gt;COALESCE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG&lt;/code&gt; divides by &lt;code&gt;COUNT(non-NULL)&lt;/code&gt;; ignoring NULL gives 15, treating NULL as zero gives 10.&lt;/li&gt;
&lt;li&gt;The "right" answer depends on what NULL means — &lt;em&gt;missing measurement&lt;/em&gt; (use 15) vs &lt;em&gt;zero score&lt;/em&gt; (use 10).&lt;/li&gt;
&lt;li&gt;Always make the choice explicit; do not let a downstream consumer guess.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;IS DISTINCT FROM&lt;/code&gt; is the safe way to compare keys that may be NULL: &lt;code&gt;a IS DISTINCT FROM b&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; when one is NULL and the other is not (see the sketch after this list).&lt;/li&gt;
&lt;/ol&gt;
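
&lt;p&gt;A minimal sketch of the NULL-safe comparison from step 5, assuming two hypothetical snapshot tables that share an &lt;code&gt;id&lt;/code&gt; and a nullable &lt;code&gt;plan&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- rows whose plan changed, counting NULL -&gt; 'pro' and 'pro' -&gt; NULL as changes
SELECT o.id, o.plan AS old_plan, n.plan AS new_plan
FROM snapshot_old o
JOIN snapshot_new n USING (id)
WHERE o.plan IS DISTINCT FROM n.plan;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;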

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Choose the aggregation rule that matches the business question:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- "average of measurements we have"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="c1"&gt;-- "average where missing means zero"&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COALESCE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;responses&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every &lt;code&gt;COALESCE&lt;/code&gt; should answer the question "what should the missing row contribute?" in one sentence — if you cannot answer, do not coalesce.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Writing &lt;code&gt;WHERE flag = FALSE&lt;/code&gt; and assuming it includes NULL rows.&lt;/li&gt;
&lt;li&gt;Pairing &lt;code&gt;WHERE flag&lt;/code&gt; with &lt;code&gt;WHERE NOT flag&lt;/code&gt; and expecting the row counts to sum to the table size.&lt;/li&gt;
&lt;li&gt;Storing booleans as &lt;code&gt;'Y'&lt;/code&gt; / &lt;code&gt;'N'&lt;/code&gt; strings — every comparison becomes a &lt;code&gt;LOWER(...)&lt;/code&gt; hazard; use real &lt;code&gt;BOOLEAN&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Forgetting that &lt;code&gt;NULL = NULL&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, not &lt;code&gt;TRUE&lt;/code&gt; — join keys with NULL need &lt;code&gt;IS DISTINCT FROM&lt;/code&gt; or pre-coalesced values.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;AVG&lt;/code&gt; over a nullable column without deciding whether missing means zero or excluded.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on a Dashboard Missing 12% of Rows
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;events.is_bot BOOLEAN&lt;/code&gt; column is nullable. The dashboard splits "bots vs humans" with &lt;code&gt;WHERE is_bot&lt;/code&gt; and &lt;code&gt;WHERE NOT is_bot&lt;/code&gt;. The two row counts sum to 88% of the table; nobody can explain where the missing 12% went. &lt;strong&gt;Identify the cause and produce a single query pair that correctly partitions every row.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS NOT TRUE&lt;/code&gt; + a Schema-Level &lt;code&gt;NOT NULL&lt;/code&gt; Fix
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- short-term query-side fix&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bots&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;humans_or_unknown&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- long-term schema fix&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;is_bot&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE is_bot&lt;/code&gt; (old)&lt;/td&gt;
&lt;td&gt;12,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE NOT is_bot&lt;/code&gt; (old)&lt;/td&gt;
&lt;td&gt;76,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;sum&lt;/td&gt;
&lt;td&gt;88,000 of 100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;missing&lt;/td&gt;
&lt;td&gt;12,000 rows where &lt;code&gt;is_bot IS NULL&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE is_bot IS NOT TRUE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;88,000 — both FALSE and NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;bots + humans_or_unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;100,000 ✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the two-bucket dashboard sums to 100% of rows. Schema-level &lt;code&gt;NOT NULL DEFAULT FALSE&lt;/code&gt; makes future regression impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS TRUE&lt;/code&gt; / &lt;code&gt;IS NOT TRUE&lt;/code&gt; are three-valued safe&lt;/strong&gt; — they never return NULL; the &lt;code&gt;WHERE&lt;/code&gt; clause keeps exactly the rows the analyst expects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*) FILTER (WHERE …)&lt;/code&gt;&lt;/strong&gt; — single-pass two-bucket aggregation; faster than running two queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UPDATE … WHERE is_bot IS NULL&lt;/code&gt; + &lt;code&gt;SET NOT NULL&lt;/code&gt;&lt;/strong&gt; — one-shot remediation of historical NULLs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DEFAULT FALSE&lt;/code&gt;&lt;/strong&gt; — guarantees new rows start in a definite state.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No surprise on rerun&lt;/strong&gt; — the dashboard's "missing 12%" cannot reappear because the column constraints now rule it out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — one &lt;code&gt;UPDATE&lt;/code&gt;; the &lt;code&gt;FILTER&lt;/code&gt; form has the same cost as two separate &lt;code&gt;COUNT&lt;/code&gt;s combined into one scan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the safe-NULL drill set see &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — filtering&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL filtering problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — conditional aggregation&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Conditional-aggregation problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/conditional-aggregation" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Date and time
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;DATE&lt;/code&gt;, &lt;code&gt;TIME&lt;/code&gt;, &lt;code&gt;TIMESTAMP&lt;/code&gt;, and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; — instants vs wall clocks
&lt;/h3&gt;

&lt;p&gt;PostgreSQL splits time into &lt;strong&gt;calendar dates&lt;/strong&gt; (&lt;code&gt;DATE&lt;/code&gt;), &lt;strong&gt;local wall-clock times&lt;/strong&gt; (&lt;code&gt;TIME&lt;/code&gt;), &lt;strong&gt;wall-clock timestamps&lt;/strong&gt; (&lt;code&gt;TIMESTAMP WITHOUT TIME ZONE&lt;/code&gt;), and &lt;strong&gt;absolute instants&lt;/strong&gt; (&lt;code&gt;TIMESTAMP WITH TIME ZONE&lt;/code&gt;, abbreviated &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;). The two-row mental model: &lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; is what a wall clock reads at a particular spot; &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; is a point on the global timeline&lt;/strong&gt;. Every cross-region bug comes from picking the first when you wanted the second.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftibdnbfqmjb4a0uj0xrr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftibdnbfqmjb4a0uj0xrr.jpeg" alt="Diagram contrasting PostgreSQL TIMESTAMP without time zone and TIMESTAMPTZ with UTC and local clock icons and a caution on wall-clock ambiguity." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Default every event-instant column to &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; and use &lt;code&gt;TIMESTAMP&lt;/code&gt; only when the time is intentionally &lt;em&gt;local&lt;/em&gt; (a "9:00 AM recurring meeting" in the user's locale). Reporting that crosses regions becomes obviously correct or obviously wrong, with no middle ground.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;TIMESTAMP&lt;/code&gt; without time zone — local wall-clock semantics
&lt;/h4&gt;

&lt;p&gt;The wall-clock invariant: &lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; stores the literal datetime you gave it with no time-zone metadata; "2026-04-13 09:00:00" means 9:00 local &lt;em&gt;wherever you happen to be&lt;/em&gt;; comparing two &lt;code&gt;TIMESTAMP&lt;/code&gt; values is correct only if both came from the same time zone&lt;/strong&gt;. It is the right type for "9:00 morning meeting in the user's local time" — and the wrong type for "the moment the user clicked Pay."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP&lt;/code&gt; storage&lt;/strong&gt; — 8 bytes; no time-zone info.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NOW()&lt;/code&gt; returns &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — assigning it to a &lt;code&gt;TIMESTAMP&lt;/code&gt; column silently strips the zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison&lt;/strong&gt; — two &lt;code&gt;TIMESTAMP&lt;/code&gt;s compare by literal value, regardless of zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt; — "every Monday at 09:00 local time" recurring schedules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Storing a 09:00 morning meeting for two users in different zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user&lt;/th&gt;
&lt;th&gt;wall-clock time&lt;/th&gt;
&lt;th&gt;TIMESTAMP value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice (NYC)&lt;/td&gt;
&lt;td&gt;9:00 AM EDT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob (Tokyo)&lt;/td&gt;
&lt;td&gt;9:00 AM JST&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Both rows look identical because the type carries no zone — the database just stores the digits the application sent.&lt;/li&gt;
&lt;li&gt;Both meetings happen at "9:00 AM local"; they are &lt;em&gt;not&lt;/em&gt; the same UTC instant (13 hours apart).&lt;/li&gt;
&lt;li&gt;A query like &lt;code&gt;SELECT * FROM recurring_meetings WHERE start_at = '2026-04-13 09:00:00'&lt;/code&gt; returns both rows; that is the right answer for a "9 AM morning meetings" report.&lt;/li&gt;
&lt;li&gt;If the same column had been &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;, the two values would have been stored as different UTC instants and the report would have returned one of them or neither, depending on session settings.&lt;/li&gt;
&lt;li&gt;Pick &lt;code&gt;TIMESTAMP&lt;/code&gt; only when the wall-clock semantics are the actual business rule.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Recurring local-time schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;recurring_meetings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;meeting_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;    &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;local_tz&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                  &lt;span class="c1"&gt;-- 'America/New_York'&lt;/span&gt;
    &lt;span class="n"&gt;start_at&lt;/span&gt;   &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;                   &lt;span class="c1"&gt;-- intentional wall-clock&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your column answers the question "what should the clock on the wall read?", use &lt;code&gt;TIMESTAMP&lt;/code&gt;; otherwise use &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; — UTC instant, session display
&lt;/h4&gt;

&lt;p&gt;The instant invariant: &lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; stores every value as a UTC instant internally (8 bytes), regardless of the time-zone literal in the &lt;code&gt;INSERT&lt;/code&gt;; output is converted to the session's &lt;code&gt;TimeZone&lt;/code&gt; at read time; comparison is always instant-to-instant&lt;/strong&gt;. Same data ships to every region and every report agrees on "when did this happen."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; storage&lt;/strong&gt; — 8 bytes; internal representation is UTC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INSERT … TIMESTAMPTZ '2026-04-13 09:00 EDT'&lt;/code&gt;&lt;/strong&gt; — stored as 13:00 UTC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SET TimeZone = 'Asia/Tokyo'&lt;/code&gt;&lt;/strong&gt; then &lt;code&gt;SELECT ts&lt;/code&gt; — outputs &lt;code&gt;2026-04-13 22:00:00+09&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT TIME ZONE&lt;/code&gt;&lt;/strong&gt; — converts between zones in a query.&lt;/li&gt;
&lt;/ul&gt;
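
&lt;p&gt;The bullets above as a runnable session; the stored instant is identical, only the rendering changes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET TimeZone = 'UTC';
SELECT TIMESTAMPTZ '2026-04-13 09:00 America/New_York';  -- 2026-04-13 13:00:00+00
SET TimeZone = 'Asia/Tokyo';
SELECT TIMESTAMPTZ '2026-04-13 09:00 America/New_York';  -- 2026-04-13 22:00:00+09
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;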

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same UTC instant viewed from three zones:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;session TimeZone&lt;/th&gt;
&lt;th&gt;what &lt;code&gt;SELECT ts FROM events WHERE id = 1&lt;/code&gt; shows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 13:00:00+00&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;America/New_York&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 09:00:00-04&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Asia/Tokyo&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 22:00:00+09&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The instant &lt;code&gt;2026-04-13 13:00:00 UTC&lt;/code&gt; was inserted once into the table.&lt;/li&gt;
&lt;li&gt;The on-disk representation is a single 8-byte number — UTC microseconds since the epoch.&lt;/li&gt;
&lt;li&gt;Each session reads the same row, but the display function converts that instant to the session's &lt;code&gt;TimeZone&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The underlying data is identical; the &lt;em&gt;rendering&lt;/em&gt; differs.&lt;/li&gt;
&lt;li&gt;Cross-region reports stay correct because every comparison happens on the stored UTC value, not the displayed string.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Event-instant column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;click_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;  &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'24 hours'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every "when did the event happen?" column is &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;; never &lt;code&gt;TIMESTAMP&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;AT TIME ZONE&lt;/code&gt; conversions and &lt;code&gt;DATE_TRUNC&lt;/code&gt; pitfalls
&lt;/h4&gt;

&lt;p&gt;The conversion invariant: &lt;strong&gt;&lt;code&gt;ts AT TIME ZONE 'America/New_York'&lt;/code&gt; converts a &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; to a wall-clock &lt;code&gt;TIMESTAMP&lt;/code&gt; in that zone, and the &lt;em&gt;reverse&lt;/em&gt; (&lt;code&gt;TIMESTAMP AT TIME ZONE 'America/New_York'&lt;/code&gt;) interprets the wall-clock value as local to that zone and returns the corresponding UTC instant&lt;/strong&gt;; &lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; buckets by midnight in the session's &lt;code&gt;TimeZone&lt;/code&gt; (UTC on most servers) unless you convert first. The pattern for "daily count in the user's local time" is &lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'America/New_York')&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ AT TIME ZONE 'zone'&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;TIMESTAMP&lt;/code&gt; (wall clock).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMP AT TIME ZONE 'zone'&lt;/code&gt;&lt;/strong&gt; → &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; (instant).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt;&lt;/strong&gt; — truncates to midnight in the session's &lt;code&gt;TimeZone&lt;/code&gt; (UTC on most servers); usually not what regional reports want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'zone')&lt;/code&gt;&lt;/strong&gt; — uses local midnight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily clicks for a US dashboard:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;click&lt;/th&gt;
&lt;th&gt;UTC &lt;code&gt;ts&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;UTC day&lt;/th&gt;
&lt;th&gt;NY day&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 03:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;td&gt;2026-04-12 (still 23:00 prev day NY)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-13 14:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;td&gt;2026-04-13&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-14 02:00 UTC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2026-04-14&lt;/td&gt;
&lt;td&gt;2026-04-13 (still 22:00 NY)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; groups by UTC midnight; click A goes into UTC &lt;code&gt;2026-04-13&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;But the user in NY clicked at 11 PM on April 12; the dashboard credits the wrong calendar day.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ts AT TIME ZONE 'America/New_York'&lt;/code&gt; converts the instant to NY wall-clock: A becomes &lt;code&gt;2026-04-12 23:00&lt;/code&gt;, C becomes &lt;code&gt;2026-04-13 22:00&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts AT TIME ZONE 'America/New_York')&lt;/code&gt; then buckets by NY midnight; A goes into April 12, B and C into April 13.&lt;/li&gt;
&lt;li&gt;Daily counts now match the user's perception of "yesterday."&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Daily report in NY local time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;day_ny&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clicks&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if a report says "daily" or "monthly," ask whose calendar — and then &lt;code&gt;AT TIME ZONE&lt;/code&gt; before truncating.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Defaulting to &lt;code&gt;TIMESTAMP&lt;/code&gt; "because it's shorter to type" — silently breaks cross-region comparisons after the first deploy abroad.&lt;/li&gt;
&lt;li&gt;Storing &lt;code&gt;TIMESTAMP&lt;/code&gt; and then "adding the time zone in the app" — the database loses the original zone the moment you stored.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DATE_TRUNC('day', ts)&lt;/code&gt; on UTC instants for a regional dashboard — daily counts shift by hours.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NOW()&lt;/code&gt; interchangeably with &lt;code&gt;CURRENT_DATE&lt;/code&gt; — &lt;code&gt;NOW()&lt;/code&gt; is &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;, &lt;code&gt;CURRENT_DATE&lt;/code&gt; is &lt;code&gt;DATE&lt;/code&gt; in the session's zone.&lt;/li&gt;
&lt;li&gt;Forgetting daylight saving — &lt;code&gt;INTERVAL '24 hours'&lt;/code&gt; is not always "next day at the same wall-clock time" (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
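
&lt;p&gt;A sketch of the daylight-saving trap: US clocks spring forward on 2026-03-08, so adding &lt;code&gt;INTERVAL '1 day'&lt;/code&gt; (a calendar day, wall clock preserved) and &lt;code&gt;INTERVAL '24 hours'&lt;/code&gt; (a fixed duration) land an hour apart:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SET TimeZone = 'America/New_York';
SELECT TIMESTAMPTZ '2026-03-07 09:00' + INTERVAL '1 day'    AS same_wall_clock, -- 2026-03-08 09:00:00-04
       TIMESTAMPTZ '2026-03-07 09:00' + INTERVAL '24 hours' AS same_duration;   -- 2026-03-08 10:00:00-04
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;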

&lt;h3&gt;
  
  
  SQL Interview Question on a Dashboard That Shifted 24 Hours After Deploy
&lt;/h3&gt;

&lt;p&gt;The team deploys their analytics pipeline to a new region; the next morning the "orders today" dashboard shows yesterday's total. Storage is &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt; (without time zone). &lt;strong&gt;Diagnose the cause and propose a schema + query fix that survives any future deploy.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; + &lt;code&gt;AT TIME ZONE&lt;/code&gt; in the Reporting View
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;v_daily_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;placed_at&lt;/span&gt; &lt;span class="k"&gt;AT&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="s1"&gt;'America/New_York'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                                                            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;original &lt;code&gt;placed_at TIMESTAMP&lt;/code&gt; — interpreted in the application's local zone&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;redeploy moves the app to a server in &lt;code&gt;UTC&lt;/code&gt;; the same &lt;code&gt;NOW()&lt;/code&gt; capture, stored as zoneless &lt;code&gt;TIMESTAMP&lt;/code&gt;, now records UTC wall-clock digits, not NY&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;rows inserted post-deploy look 4 hours older to the dashboard's NY-day buckets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ALTER COLUMN … TYPE TIMESTAMPTZ USING … AT TIME ZONE 'America/New_York'&lt;/code&gt; reinterprets all existing rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;new &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; column stores UTC; the view's &lt;code&gt;AT TIME ZONE&lt;/code&gt; reverses to NY for display&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;dashboard buckets daily counts by NY midnight; results are stable across redeploys&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; "orders today" matches the operations team's intuition regardless of where the application server lives. Future deploys cannot reintroduce the 24-hour shift because the column type now stores instants, not wall clocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt; stores UTC&lt;/strong&gt; — the on-disk value is the same regardless of session or server zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;USING … AT TIME ZONE 'America/New_York'&lt;/code&gt;&lt;/strong&gt; — one-shot reinterpretation of legacy rows during the type migration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;AT TIME ZONE&lt;/code&gt; in the view, not the table&lt;/strong&gt; — every report stays explicit about whose calendar it uses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE_TRUNC&lt;/code&gt; on the local wall-clock&lt;/strong&gt; — daily buckets align to the user's perception of "today."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable across redeploys&lt;/strong&gt; — server moves do not change the displayed daily count.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Cost&lt;/code&gt;&lt;/strong&gt; — one rewrite per migration; per-row &lt;code&gt;AT TIME ZONE&lt;/code&gt; is essentially free (microseconds).&lt;/li&gt;
&lt;/ul&gt;
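&lt;p&gt;For reference, a minimal sketch of the migration itself, assuming the &lt;code&gt;orders.placed_at&lt;/code&gt; column from the trace above (table and column names taken from the trace; adapt to your schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- assumes placed_at was TIMESTAMP holding New York wall-clock values
ALTER TABLE orders
    ALTER COLUMN placed_at TYPE TIMESTAMPTZ
    USING placed_at AT TIME ZONE 'America/New_York';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;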

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions" rel="noopener noreferrer"&gt;date-functions practice topic&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;filtering practice topic&lt;/a&gt; for time-aware predicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions" rel="noopener noreferrer"&gt;Date-function problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;Window-function problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;Zero to FAANG SQL fundamentals (course) →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Semi-structured and other types
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;JSONB&lt;/code&gt;, &lt;code&gt;UUID&lt;/code&gt;, and arrays for flexible attributes
&lt;/h3&gt;

&lt;p&gt;PostgreSQL is a "relational with side quests" database — it has first-class &lt;strong&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/strong&gt; (binary, indexable JSON), &lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt;&lt;/strong&gt; (opaque distributed IDs), and &lt;strong&gt;array types&lt;/strong&gt; (&lt;code&gt;INTEGER[]&lt;/code&gt;, &lt;code&gt;TEXT[]&lt;/code&gt;, &lt;code&gt;JSONB[]&lt;/code&gt;) that make schema-flexible patterns possible without giving up SQL. The discipline is to use them deliberately: &lt;code&gt;JSONB&lt;/code&gt; for &lt;em&gt;truly&lt;/em&gt; sparse attributes, &lt;code&gt;UUID&lt;/code&gt; for public/distributed identifiers, arrays for short bounded lists. Reach for them often and the schema becomes hard to query; reach for them never and you write more tables than you need.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Any column that becomes a frequent filter or join key belongs in a real typed column, not nested inside &lt;code&gt;JSONB&lt;/code&gt;. Use &lt;code&gt;JSONB&lt;/code&gt; as the "everything else" bucket for attributes that vary by row.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;JSON&lt;/code&gt; vs &lt;code&gt;JSONB&lt;/code&gt; — when binary indexing matters
&lt;/h4&gt;

&lt;p&gt;The JSONB invariant: &lt;strong&gt;&lt;code&gt;JSON&lt;/code&gt; stores the input text exactly (whitespace, key order, duplicate keys preserved) and reparses on every read; &lt;code&gt;JSONB&lt;/code&gt; stores a binary-decoded representation that is faster to query, supports &lt;code&gt;GIN&lt;/code&gt; indexes, and collapses duplicate keys (last value wins) — pay the small write-time cost for read-time speed&lt;/strong&gt;. For event payloads, application config, and flexible user attributes, &lt;code&gt;JSONB&lt;/code&gt; is the default.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JSON&lt;/code&gt;&lt;/strong&gt; — text-faithful; preserves whitespace and duplicate keys; reparsed on every read, so slower.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/strong&gt; — binary; faster reads; canonical (no whitespace, no duplicate keys).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-&amp;gt;&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;JSON&lt;/code&gt; / &lt;code&gt;JSONB&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt;&lt;/strong&gt; — returns &lt;code&gt;TEXT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt; containment&lt;/strong&gt; — &lt;code&gt;'{"a": 1}'::jsonb @&amp;gt; '{"a": 1}'::jsonb&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GIN&lt;/code&gt; index&lt;/strong&gt; — &lt;code&gt;CREATE INDEX … USING GIN (jsonb_col jsonb_path_ops)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
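&lt;p&gt;A quick sanity check of the operators above, safe to paste into any PostgreSQL session (values illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT '{"plan": "pro", "seats": 3}'::jsonb -&amp;gt; 'seats';   -- 3    (returned as jsonb)
SELECT '{"plan": "pro", "seats": 3}'::jsonb -&amp;gt;&amp;gt; 'plan';   -- pro  (returned as text)
SELECT '{"plan": "pro", "seats": 3}'::jsonb @&amp;gt; '{"plan": "pro"}';  -- true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;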

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Searching event payloads for &lt;code&gt;{"plan": "pro"}&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload JSON&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;payload-&amp;gt;&amp;gt;'plan' = 'pro'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;payload JSONB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;payload @&amp;gt; '{"plan":"pro"}'::jsonb&lt;/code&gt; (with &lt;code&gt;GIN&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With plain &lt;code&gt;JSON&lt;/code&gt;, every row must be parsed at query time to extract the &lt;code&gt;plan&lt;/code&gt; key.&lt;/li&gt;
&lt;li&gt;The planner cannot use a B-tree index because the parse step is per-row.&lt;/li&gt;
&lt;li&gt;Switching the column to &lt;code&gt;JSONB&lt;/code&gt; lets you create a &lt;code&gt;GIN&lt;/code&gt; index on the document.&lt;/li&gt;
&lt;li&gt;The containment query &lt;code&gt;@&amp;gt;&lt;/code&gt; is index-eligible — PostgreSQL probes the GIN structure for documents that contain the requested subtree.&lt;/li&gt;
&lt;li&gt;On a 50 M-row table, the difference is full table scan vs sub-second seek.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Indexed JSONB column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;event_id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt;  &lt;span class="n"&gt;JSONB&lt;/span&gt;     &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ts&lt;/span&gt;       &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"plan":"pro"}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; default to &lt;code&gt;JSONB&lt;/code&gt; for any "flexible attributes" column; default to a real typed column for any attribute you filter on more than a few times a week.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;UUID&lt;/code&gt; — opaque IDs for distributed systems
&lt;/h4&gt;

&lt;p&gt;The UUID invariant: &lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; is a 16-byte fixed-width identifier that does not leak ordering or count; ideal for public IDs, multi-region writes, and any context where you don't want consumers inferring growth rate from the sequence; trade-off vs &lt;code&gt;BIGINT&lt;/code&gt; is ~2× storage and worse B-tree locality for monotonic insert patterns&lt;/strong&gt;. Use UUIDs at the &lt;em&gt;boundary&lt;/em&gt; (URLs, foreign systems) and BIGINTs internally if performance is critical.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; storage&lt;/strong&gt; — 16 bytes; &lt;code&gt;gen_random_uuid()&lt;/code&gt; is built in since PostgreSQL 13 (via &lt;code&gt;pgcrypto&lt;/code&gt; on older versions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v4 (random)&lt;/strong&gt; — uniform random; great privacy, bad B-tree locality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v7 (time-ordered)&lt;/strong&gt; — sortable by creation time; better cache behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UUID&lt;/code&gt; vs &lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; — always declare as &lt;code&gt;UUID&lt;/code&gt;; &lt;code&gt;TEXT&lt;/code&gt; UUIDs lose validation and index efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two ways to model a public order ID:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;design&lt;/th&gt;
&lt;th&gt;bytes&lt;/th&gt;
&lt;th&gt;URL example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id BIGINT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/orders/12345678&lt;/code&gt; (leaks volume)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;order_id UUID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/orders/8c3b7e2a-…&lt;/code&gt; (opaque)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;BIGINT&lt;/code&gt; is monotonic — scraping a few order URLs lets a competitor infer your daily volume.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UUID&lt;/code&gt; v4 is unguessable; &lt;code&gt;8c3b7e2a-…&lt;/code&gt; carries no information.&lt;/li&gt;
&lt;li&gt;Storage cost: 8 extra bytes per row × millions of rows is meaningful but rarely decisive.&lt;/li&gt;
&lt;li&gt;B-tree locality: random UUIDs spread inserts across the index; v7 (time-ordered) restores append-friendly behavior.&lt;/li&gt;
&lt;li&gt;For most "public ID" use cases, UUID v7 is the clean middle ground.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Internal &lt;code&gt;BIGINT&lt;/code&gt; + public &lt;code&gt;UUID&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pgcrypto&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;public_id&lt;/span&gt;   &lt;span class="n"&gt;UUID&lt;/span&gt;      &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;UNIQUE&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;gen_random_uuid&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; expose UUIDs at the API boundary; keep BIGINT joins inside the database.&lt;/p&gt;
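&lt;p&gt;A sketch of the boundary pattern against the &lt;code&gt;orders&lt;/code&gt; table above: resolve the public UUID once at the edge, then run every internal join on the &lt;code&gt;BIGINT&lt;/code&gt; key (the UUID literal here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- resolve the opaque public ID once at the API boundary
SELECT order_id
FROM orders
WHERE public_id = '8c3b7e2a-1111-4222-8333-444455556666'::uuid;
-- downstream joins then use the cheap BIGINT order_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;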

&lt;h4&gt;
  
  
  Arrays — &lt;code&gt;INTEGER[]&lt;/code&gt;, &lt;code&gt;TEXT[]&lt;/code&gt;, and the &lt;code&gt;UNNEST&lt;/code&gt; pattern
&lt;/h4&gt;

&lt;p&gt;The array invariant: &lt;strong&gt;PostgreSQL arrays are first-class typed columns; common operations are &lt;code&gt;ANY (arr)&lt;/code&gt; for membership, &lt;code&gt;arr @&amp;gt; arr&lt;/code&gt; for containment, and &lt;code&gt;UNNEST(arr)&lt;/code&gt; to flatten an array column into rows — useful when the list is &lt;em&gt;short&lt;/em&gt; (≤ ~10 items) and &lt;em&gt;bounded by the row&lt;/em&gt;&lt;/strong&gt;. For unbounded or queried-often lists, a child table is the better design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER[]&lt;/code&gt;&lt;/strong&gt; — array of integers; literal &lt;code&gt;'{1,2,3}'::int[]&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ANY (arr)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;x = ANY ('{1,2,3}'::int[])&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt; if &lt;code&gt;x&lt;/code&gt; is in the array.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;'{1,2,3}'::int[] @&amp;gt; '{2}'&lt;/code&gt; is &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;UNNEST(arr)&lt;/code&gt;&lt;/strong&gt; — produces one row per array element; pivot a row of N elements into N rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;users.role_ids INTEGER[]&lt;/code&gt; column:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;user_id&lt;/th&gt;
&lt;th&gt;role_ids&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{10, 20}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{20, 30}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;{10}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;query&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE 20 = ANY (role_ids)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1, 2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE role_ids @&amp;gt; '{10, 20}'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT user_id, UNNEST(role_ids)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;(1,10), (1,20), (2,20), (2,30), (3,10)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storing roles as &lt;code&gt;INTEGER[]&lt;/code&gt; keeps the user table compact — no separate &lt;code&gt;user_roles&lt;/code&gt; table for a small bounded set.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ANY&lt;/code&gt; is the array-side &lt;code&gt;IN&lt;/code&gt;: it tests membership of one value against the column.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;@&amp;gt;&lt;/code&gt; tests whether the column array &lt;em&gt;contains&lt;/em&gt; every element of the right-hand array.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;UNNEST&lt;/code&gt; flattens the column into rows; joining &lt;code&gt;UNNEST(role_ids)&lt;/code&gt; to &lt;code&gt;dim_role&lt;/code&gt; produces a per-role row.&lt;/li&gt;
&lt;li&gt;For unbounded role sets (10 K+) the array column gets slow and a child table wins; for typical "a user has 1-5 roles" cases, arrays are clean.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A small bounded list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;       &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;role_ids&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;    &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'{}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_role&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;ANY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role_ids&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
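&lt;p&gt;When you need one output row per role, the &lt;code&gt;UNNEST&lt;/code&gt; spelling from step 4 is an equivalent sketch against the same table (&lt;code&gt;LATERAL&lt;/code&gt; is one idiomatic way to write it):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT u.user_id, x.role_id
FROM users u
CROSS JOIN LATERAL UNNEST(u.role_ids) AS x(role_id);
-- yields (1,10), (1,20), (2,20), (2,30), (3,10) for the sample rows above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;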



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; arrays for short, bounded, rarely-filtered lists; child tables for everything else.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Storing everything as &lt;code&gt;JSONB&lt;/code&gt; because "schemas are hard" — you trade type safety and indexability for write-time convenience.&lt;/li&gt;
&lt;li&gt;Indexing &lt;code&gt;JSON&lt;/code&gt; instead of &lt;code&gt;JSONB&lt;/code&gt; — &lt;code&gt;JSON&lt;/code&gt; cannot use GIN; the index won't help.&lt;/li&gt;
&lt;li&gt;Picking UUID v4 PKs on a high-write table and watching B-tree fragmentation degrade write throughput.&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;TEXT&lt;/code&gt; UUIDs the same as &lt;code&gt;UUID&lt;/code&gt; columns — same data, different operator class, broken indexes.&lt;/li&gt;
&lt;li&gt;Storing unbounded lists in arrays — once the array pushes the row past the ~2 KB TOAST threshold, the value is stored out of line and every read pays detoast overhead, so queries slow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on Searching JSONB Payloads at 50 M-Row Scale
&lt;/h3&gt;

&lt;p&gt;A 50 M-row &lt;code&gt;events.payload JSONB&lt;/code&gt; column holds variable payloads. Marketing wants to count events where &lt;code&gt;{"plan": "pro"}&lt;/code&gt; appears in the payload, and the query takes 60 seconds. &lt;strong&gt;Make it return in under 100 ms without changing the storage shape.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a &lt;code&gt;GIN&lt;/code&gt; Index with &lt;code&gt;jsonb_path_ops&lt;/code&gt; + &lt;code&gt;@&amp;gt;&lt;/code&gt; Containment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;events_payload_gin_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;GIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="n"&gt;jsonb_path_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;@&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'{"plan":"pro"}'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;jsonb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;initial query &lt;code&gt;payload-&amp;gt;&amp;gt;'plan' = 'pro'&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;62 s, Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;switch predicate to &lt;code&gt;payload @&amp;gt; '{"plan":"pro"}'::jsonb&lt;/code&gt; (no index yet)&lt;/td&gt;
&lt;td&gt;60 s, still Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CREATE INDEX … USING GIN (payload jsonb_path_ops)&lt;/code&gt; (~5 min build)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;rerun containment query&lt;/td&gt;
&lt;td&gt;85 ms, GIN Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; confirms &lt;code&gt;Bitmap Heap Scan on events&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;one-pass&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; 60 s → 85 ms — roughly 700×, nearly three orders of magnitude — with no schema change, no application change, and no data rewrite. &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; shows the &lt;code&gt;GIN&lt;/code&gt; index handling the containment lookup.&lt;/p&gt;
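&lt;p&gt;The plan shape to look for (node names as PostgreSQL prints them; your costs and timings will differ, so treat the output as illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN ANALYZE
SELECT COUNT(*) FROM events WHERE payload @&amp;gt; '{"plan":"pro"}'::jsonb;
-- Aggregate
--   -&amp;gt;  Bitmap Heap Scan on events
--         Recheck Cond: (payload @&amp;gt; '{"plan": "pro"}'::jsonb)
--         -&amp;gt;  Bitmap Index Scan on events_payload_gin_idx
--               Index Cond: (payload @&amp;gt; '{"plan": "pro"}'::jsonb)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;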

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;@&amp;gt;&lt;/code&gt; containment is index-eligible&lt;/strong&gt; — &lt;code&gt;-&amp;gt;&amp;gt;&lt;/code&gt; text extraction is not; the operator choice unlocks the index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;jsonb_path_ops&lt;/code&gt;&lt;/strong&gt; — specialised GIN class for containment-only queries; smaller and faster than the default &lt;code&gt;jsonb_ops&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No row rewrite&lt;/strong&gt; — &lt;code&gt;CREATE INDEX&lt;/code&gt; builds a new structure without rewriting the table heap; reads continue uninterrupted (use &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;/code&gt; if writes must continue too).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalises to other keys&lt;/strong&gt; — any future &lt;code&gt;payload @&amp;gt; '{"key":"val"}'&lt;/code&gt; query benefits; no per-key index needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt; — write throughput drops slightly (GIN updates are heavier than B-tree); usually invisible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Cost&lt;/code&gt;&lt;/strong&gt; — index build is &lt;code&gt;O(N)&lt;/code&gt; one-time; reads become &lt;code&gt;O(log N)&lt;/code&gt; per query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;filtering practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for JSON-flavoured patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;SQL filtering problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;All SQL practice problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;Zero to FAANG SQL fundamentals (course) →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Casting and comparison rules
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Implicit coercion, explicit &lt;code&gt;CAST&lt;/code&gt;, and index-friendly predicates
&lt;/h3&gt;

&lt;p&gt;PostgreSQL silently coerces some type mixes (&lt;code&gt;'42'::text&lt;/code&gt; to &lt;code&gt;INTEGER&lt;/code&gt; in an &lt;code&gt;=&lt;/code&gt; context), refuses others, and lets you make the conversion explicit with &lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt;&lt;/strong&gt; or its shorthand &lt;strong&gt;&lt;code&gt;x::type&lt;/code&gt;&lt;/strong&gt;. The high-leverage rule is &lt;em&gt;where&lt;/em&gt; the cast lands: a cast on a &lt;em&gt;literal&lt;/em&gt; is free and index-friendly; a cast on a &lt;em&gt;column&lt;/em&gt; usually disables the index. Mixed-type joins are the canonical cause of "the query returns no rows" and "the query is suddenly 100× slower."&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x57pp34luchaqxy70sl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1x57pp34luchaqxy70sl.jpeg" alt="Flowchart showing mismatched column types breaking joins or filters until explicit cast or schema alignment with PipeCode brand colors." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When &lt;code&gt;EXPLAIN&lt;/code&gt; reveals &lt;code&gt;Seq Scan on …&lt;/code&gt; on a column you indexed, scan the &lt;code&gt;Filter:&lt;/code&gt; line for a &lt;code&gt;::type&lt;/code&gt; cast. The fix is usually to cast the &lt;em&gt;other&lt;/em&gt; side or — better — to change the source column's type so no cast is needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Implicit coercion — when PostgreSQL guesses
&lt;/h4&gt;

&lt;p&gt;The coercion invariant: &lt;strong&gt;PostgreSQL has a graph of allowed implicit casts (e.g., &lt;code&gt;INTEGER&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;INTEGER&lt;/code&gt; → &lt;code&gt;NUMERIC&lt;/code&gt;, untyped string literals → &lt;code&gt;INTEGER&lt;/code&gt; in some contexts) and applies them silently when one side of a binary operator differs from the other; when no implicit cast exists, the query fails with an &lt;code&gt;operator does not exist&lt;/code&gt; error&lt;/strong&gt;. Implicit coercion is convenient until it produces a different answer than expected.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INTEGER&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;BIGINT&lt;/code&gt;&lt;/strong&gt; — implicit widen; no surprise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — works for literals (&lt;code&gt;WHERE id = '42'&lt;/code&gt;); fails for columns (&lt;code&gt;WHERE t.id = b.id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DATE&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/strong&gt; — implicit widen via session zone; can shift.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BOOLEAN&lt;/code&gt; &lt;code&gt;=&lt;/code&gt; &lt;code&gt;INTEGER&lt;/code&gt;&lt;/strong&gt; — &lt;em&gt;not&lt;/em&gt; allowed; you must cast.&lt;/li&gt;
&lt;/ul&gt;
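&lt;p&gt;A minimal demonstration of the behaviours above (the last line is expected to error):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 42::INTEGER = 42::BIGINT;  -- true: implicit widen
SELECT 42 = '42';                 -- true: untyped literal coerced to INTEGER
SELECT 42 = '42'::text;           -- ERROR: operator does not exist: integer = text
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;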

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Joining a &lt;code&gt;TEXT user_id&lt;/code&gt; to a &lt;code&gt;BIGINT user_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;left.user_id (TEXT)&lt;/th&gt;
&lt;th&gt;right.user_id (BIGINT)&lt;/th&gt;
&lt;th&gt;join works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;error / Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;'042'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mismatch (lexicographic ≠ numeric)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;' 42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;mismatch (whitespace)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PostgreSQL needs both sides of &lt;code&gt;=&lt;/code&gt; to be the same type; it tries to coerce.&lt;/li&gt;
&lt;li&gt;Coercing &lt;code&gt;TEXT&lt;/code&gt; → &lt;code&gt;BIGINT&lt;/code&gt; is possible per-value (&lt;code&gt;'42'::BIGINT&lt;/code&gt;), but the planner applies it on the &lt;em&gt;column&lt;/em&gt; — disabling the index.&lt;/li&gt;
&lt;li&gt;Leading zeros, whitespace, and non-digit characters cause the cast to fail mid-query.&lt;/li&gt;
&lt;li&gt;The result is either a hard error or a slow seq scan.&lt;/li&gt;
&lt;li&gt;The fix is &lt;em&gt;upstream&lt;/em&gt;: align the source column types so no cross-type compare is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Avoid mixed-type joins:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- if you must cast, cast at write time, not query time&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never store an identifier as text on one side and as integer on the other side of a join. Pick one type at the warehouse contract level.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;CAST(x AS type)&lt;/code&gt; vs &lt;code&gt;x::type&lt;/code&gt; shorthand
&lt;/h4&gt;

&lt;p&gt;The CAST invariant: &lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt; and &lt;code&gt;x::type&lt;/code&gt; produce identical output; the longhand is SQL-standard and self-documenting; the shorthand is PostgreSQL idiomatic and shorter in expression-heavy queries&lt;/strong&gt;. Both fail with a clear error when the conversion is illegal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CAST(x AS type)&lt;/code&gt;&lt;/strong&gt; — ANSI SQL; works in every dialect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;x::type&lt;/code&gt;&lt;/strong&gt; — PostgreSQL shorthand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure modes&lt;/strong&gt; — same for both: &lt;code&gt;invalid input syntax for type integer&lt;/code&gt; etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;CAST&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;NULLIF(x, '')::INT&lt;/code&gt; collapses empty string to NULL before casting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two equivalent expressions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'42'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'not a number'&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;-- ERROR&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;            &lt;span class="c1"&gt;-- NULL&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Both &lt;code&gt;CAST&lt;/code&gt; and &lt;code&gt;::&lt;/code&gt; produce the same output type and the same value.&lt;/li&gt;
&lt;li&gt;Failing input (non-digit string) raises the same error in both forms.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NULLIF(x, '')::TYPE&lt;/code&gt; is the canonical "treat empty string as NULL" pattern.&lt;/li&gt;
&lt;li&gt;In multi-expression SELECTs, &lt;code&gt;::&lt;/code&gt; keeps lines short; in code-review-heavy contexts, &lt;code&gt;CAST&lt;/code&gt; is more legible.&lt;/li&gt;
&lt;li&gt;Use whichever your team's house style prefers; do not mix unnecessarily.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Safe cast for messy ETL data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;raw_payload&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::type&lt;/code&gt; is the three-step safe-cast pattern for noisy inputs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Index-killing casts on indexed columns
&lt;/h4&gt;

&lt;p&gt;The index-killer invariant: &lt;strong&gt;a &lt;code&gt;WHERE&lt;/code&gt; predicate that wraps an indexed column in a function — including an implicit cast — usually forces a sequential scan; the B-tree stores the original column values, so a predicate on a derived expression cannot match the indexed expression, and the planner falls back to scanning every row&lt;/strong&gt;. The same query rewritten to cast the literal instead is index-eligible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col::type = $1&lt;/code&gt;&lt;/strong&gt; — bad; column cast disables index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = $1::type&lt;/code&gt;&lt;/strong&gt; — good; literal cast, index used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LOWER(col) = $1&lt;/code&gt;&lt;/strong&gt; — bad unless you build a &lt;em&gt;functional&lt;/em&gt; index on &lt;code&gt;LOWER(col)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;col = LOWER($1)&lt;/code&gt;&lt;/strong&gt; — good; literal-side function call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;user_id BIGINT&lt;/code&gt; column indexed; two predicates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;predicate&lt;/th&gt;
&lt;th&gt;plan&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = 42&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id::text = '42'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Seq Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE user_id = '42'::bigint&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Index Scan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;user_id = 42&lt;/code&gt; matches the type of the indexed column directly.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user_id::text&lt;/code&gt; applies a function to every row; the B-tree on the original value cannot be used.&lt;/li&gt;
&lt;li&gt;Rewriting as &lt;code&gt;user_id = '42'::bigint&lt;/code&gt; casts the literal once and reuses the existing index.&lt;/li&gt;
&lt;li&gt;If you genuinely need to query &lt;em&gt;by&lt;/em&gt; the casted form, create a functional index: &lt;code&gt;CREATE INDEX ON users ((user_id::text))&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The cheapest fix is almost always to change the data type so no cast is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Cast the literal, never the column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- good&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;        &lt;span class="c1"&gt;-- literal coerced&lt;/span&gt;
&lt;span class="c1"&gt;-- bad&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'42'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;-- column cast kills the index&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every &lt;code&gt;::&lt;/code&gt; on the indexed side of a &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;JOIN&lt;/code&gt; is a code smell. Investigate before merging.&lt;/p&gt;
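&lt;p&gt;If the text form genuinely must be the lookup key, the expression index mentioned in step 4 is worth sketching (index name assumed; &lt;code&gt;events&lt;/code&gt; table from the example above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- the index stores user_id::text, so the cast predicate becomes index-eligible
CREATE INDEX events_user_id_text_idx ON events ((user_id::text));
SELECT * FROM events WHERE user_id::text = '42';  -- now an Index Scan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;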

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Joining a &lt;code&gt;TEXT user_id&lt;/code&gt; to a &lt;code&gt;BIGINT user_id&lt;/code&gt; and adding &lt;code&gt;::text&lt;/code&gt; on the BIGINT side — works but disables the index.&lt;/li&gt;
&lt;li&gt;Treating &lt;code&gt;'042' = 42&lt;/code&gt; as &lt;code&gt;TRUE&lt;/code&gt; everywhere — leading zeros are preserved in TEXT and lost in INTEGER.&lt;/li&gt;
&lt;li&gt;Mixing &lt;code&gt;TIMESTAMP&lt;/code&gt; and &lt;code&gt;TIMESTAMPTZ&lt;/code&gt; in joins — answers depend on session TZ.&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;LIKE&lt;/code&gt; against a numeric column without realising it forces a &lt;code&gt;::text&lt;/code&gt; cast.&lt;/li&gt;
&lt;li&gt;Forgetting to handle empty strings before casting — &lt;code&gt;''::INT&lt;/code&gt; is a hard error; use &lt;code&gt;NULLIF&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SQL Interview Question on a Cross-Type Join Returning Zero Rows
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;staging_users.user_id TEXT&lt;/code&gt; joined to &lt;code&gt;dim_users.user_id BIGINT&lt;/code&gt; returns 0 rows even though both tables contain &lt;code&gt;user_id = 42&lt;/code&gt;. The planner reports a &lt;code&gt;Seq Scan&lt;/code&gt; on &lt;code&gt;dim_users&lt;/code&gt;. &lt;strong&gt;Identify every contributing cause and propose a fix that produces a sound result &lt;em&gt;and&lt;/em&gt; keeps the dim's primary-key index usable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Single-Type Schema + Explicit Literal-Side Cast
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- short-term: cast the staging text to BIGINT (literal-side cast on TEXT)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_users&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- permanent fix: rewrite staging to BIGINT once&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;staging_users&lt;/span&gt;
    &lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;COLUMN&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BTRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step-by-step trace.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;symptom&lt;/th&gt;
&lt;th&gt;cause&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE d.user_id = s.user_id&lt;/code&gt; errors with operator-does-not-exist&lt;/td&gt;
&lt;td&gt;type mismatch (BIGINT vs TEXT)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;analyst rewrites as &lt;code&gt;WHERE d.user_id::text = s.user_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;"fixes" the error&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;query returns 0 rows&lt;/td&gt;
&lt;td&gt;leading whitespace in &lt;code&gt;s.user_id&lt;/code&gt; (&lt;code&gt;' 42'&lt;/code&gt;) breaks lexicographic compare&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;EXPLAIN&lt;/code&gt; shows Seq Scan on &lt;code&gt;dim_users&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;column cast on &lt;code&gt;d.user_id&lt;/code&gt; killed the PK index&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;rewrite with &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::BIGINT&lt;/code&gt; on the &lt;em&gt;staging&lt;/em&gt; side&lt;/td&gt;
&lt;td&gt;index restored, whitespace tolerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;row count matches &lt;code&gt;dim_users.user_id&lt;/code&gt; cardinality&lt;/td&gt;
&lt;td&gt;sound result&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; join now returns the expected rows, the dim's primary-key index is back in the plan, and the permanent &lt;code&gt;ALTER COLUMN&lt;/code&gt; removes the per-query cast for good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single-type schema&lt;/strong&gt; — after the &lt;code&gt;ALTER&lt;/code&gt;, both sides are &lt;code&gt;BIGINT&lt;/code&gt;; no cross-type compare ever runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Staging-side &lt;code&gt;BTRIM&lt;/code&gt; + &lt;code&gt;NULLIF&lt;/code&gt; + &lt;code&gt;::BIGINT&lt;/code&gt;&lt;/strong&gt; — handles real-world dirty input without disabling the dim's index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index on &lt;code&gt;dim_users.user_id&lt;/code&gt; preserved&lt;/strong&gt; — because the cast is on the &lt;em&gt;staging&lt;/em&gt; side, not the dim side.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whitespace-tolerant&lt;/strong&gt; — &lt;code&gt;BTRIM&lt;/code&gt; eliminates the silent-zero-rows mode.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Empty-string-safe&lt;/strong&gt; — &lt;code&gt;NULLIF(x, '')::BIGINT&lt;/code&gt; returns NULL instead of erroring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Cost&lt;/code&gt;&lt;/strong&gt; — one rewrite at the staging layer; per-query cost drops from full table scan to &lt;code&gt;O(log N)&lt;/code&gt; PK seek.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the join-fluency syllabus see the &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;SQL join problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/practice/topic/filtering" rel="noopener noreferrer"&gt;SQL filtering problems →&lt;/a&gt; · &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;Zero to FAANG SQL fundamentals (course) →&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Choosing types (checklist)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;If you are storing…&lt;/th&gt;
&lt;th&gt;Prefer…&lt;/th&gt;
&lt;th&gt;Watch out for…&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Surrogate keys, row counts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;BIGINT&lt;/code&gt; / &lt;code&gt;INTEGER&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Overflow, unnecessary &lt;code&gt;BIGSERIAL&lt;/code&gt; everywhere&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Money, rates, basis points&lt;/td&gt;
&lt;td&gt;&lt;code&gt;NUMERIC(p, s)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Float rounding in aggregates&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Labels, names, free text&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;VARCHAR(n)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Collation, padding with &lt;code&gt;CHAR&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instants in distributed systems&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TIMESTAMPTZ&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Mixing with &lt;code&gt;TIMESTAMP&lt;/code&gt; in joins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nested / sparse attributes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;JSONB&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Huge documents without indexes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Public opaque IDs&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UUID&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stringly-typed UUIDs in joins&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
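&lt;p&gt;A sketch that applies the whole checklist in one table definition (names and precisions illustrative, not prescriptive):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE orders (
    order_id  BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,         -- surrogate key
    public_id UUID          NOT NULL UNIQUE DEFAULT gen_random_uuid(), -- opaque public ID (built in since PG 13)
    total     NUMERIC(14,2) NOT NULL,                                  -- exact money
    note      TEXT,                                                    -- free text, no arbitrary cap
    attrs     JSONB         NOT NULL DEFAULT '{}'::jsonb,              -- sparse attributes
    placed_at TIMESTAMPTZ   NOT NULL DEFAULT NOW()                     -- instant, not wall clock
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;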

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you explain a schema in a live screen, say the &lt;strong&gt;grain&lt;/strong&gt; and the &lt;strong&gt;type&lt;/strong&gt; together: "one row per order, &lt;code&gt;order_id&lt;/code&gt; is &lt;code&gt;BIGINT&lt;/code&gt;, &lt;code&gt;total&lt;/code&gt; is &lt;code&gt;NUMERIC(14,2)&lt;/code&gt;."&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should I use &lt;code&gt;TEXT&lt;/code&gt; or &lt;code&gt;VARCHAR(255)&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;In PostgreSQL there is &lt;strong&gt;no storage penalty&lt;/strong&gt; for &lt;code&gt;TEXT&lt;/code&gt; vs &lt;code&gt;varchar&lt;/code&gt; with the same contents. Use &lt;strong&gt;&lt;code&gt;VARCHAR(n)&lt;/code&gt;&lt;/strong&gt; when you want the database to enforce a &lt;strong&gt;maximum length&lt;/strong&gt;; otherwise &lt;strong&gt;&lt;code&gt;TEXT&lt;/code&gt;&lt;/strong&gt; is simple and common.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is &lt;code&gt;SERIAL&lt;/code&gt; still OK for primary keys?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SERIAL&lt;/code&gt; / &lt;code&gt;BIGSERIAL&lt;/code&gt; are convenient; &lt;strong&gt;&lt;code&gt;GENERATED ... AS IDENTITY&lt;/code&gt;&lt;/strong&gt; is the standards-preferred spelling in modern PostgreSQL. Know both in interviews.&lt;/p&gt;
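&lt;p&gt;The two spellings side by side (both produce an auto-incrementing &lt;code&gt;BIGINT&lt;/code&gt; key):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE a (id BIGSERIAL PRIMARY KEY);                            -- legacy convenience
CREATE TABLE b (id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY);  -- SQL-standard spelling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;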

&lt;h3&gt;
  
  
  Why is my join returning no rows when the IDs "look the same"?
&lt;/h3&gt;

&lt;p&gt;Check &lt;strong&gt;types&lt;/strong&gt; and &lt;strong&gt;whitespace&lt;/strong&gt; on string keys. Compare plans with &lt;strong&gt;&lt;code&gt;EXPLAIN&lt;/code&gt;&lt;/strong&gt;: mismatched types can prevent &lt;strong&gt;index&lt;/strong&gt; use or change &lt;strong&gt;semantics&lt;/strong&gt; of comparison. Then rehearse on &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL-tagged problems →&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  When must I use &lt;code&gt;NUMERIC&lt;/code&gt; instead of float?
&lt;/h3&gt;

&lt;p&gt;Whenever &lt;strong&gt;exact decimal&lt;/strong&gt; behavior is required—&lt;strong&gt;currency&lt;/strong&gt;, tax, allocations—or when you must match a &lt;strong&gt;ledger&lt;/strong&gt; or &lt;strong&gt;regulatory&lt;/strong&gt; rule. Floats are for measured magnitudes where error bounds are acceptable.&lt;/p&gt;
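&lt;p&gt;A one-line reminder of why (the float output below is typical on modern PostgreSQL; the drift, not the exact digits, is the point):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 0.1::float8 + 0.2::float8 AS float_sum,    -- 0.30000000000000004
       0.1::numeric + 0.2::numeric AS exact_sum;  -- 0.3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;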




&lt;h2&gt;
  
  
  Practice on PipeCode
&lt;/h2&gt;

&lt;p&gt;PipeCode ships &lt;strong&gt;450+&lt;/strong&gt; data engineering practice problems—&lt;strong&gt;SQL&lt;/strong&gt; uses the &lt;strong&gt;PostgreSQL&lt;/strong&gt; dialect, with editorials and topics aligned to what strong companies ask. Start from &lt;a href="https://dev.to/explore/practice"&gt;Explore practice →&lt;/a&gt;, open &lt;a href="https://dev.to/explore/practice/language/sql"&gt;SQL practice →&lt;/a&gt;, filter by &lt;a href="https://dev.to/explore/practice/topic/joins"&gt;joins →&lt;/a&gt; or &lt;a href="https://dev.to/explore/practice/topic/aggregations"&gt;aggregations →&lt;/a&gt;, and &lt;a href="https://dev.to/subscribe"&gt;see plans →&lt;/a&gt; when you want the full library.&lt;/p&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
    <item>
      <title>PostgreSQL SQL Cheat Sheet — Clause Order, Joins, Aggregates, Windows</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:52:46 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/postgresql-sql-cheat-sheet-clause-order-joins-aggregates-windows-3kim</link>
      <guid>https://forem.com/gowthampotureddi/postgresql-sql-cheat-sheet-clause-order-joins-aggregates-windows-3kim</guid>
      <description>&lt;p&gt;A &lt;strong&gt;PostgreSQL SQL cheat sheet&lt;/strong&gt; is only useful when every row in it maps to something you can drop straight into a query — not a wall of syntax with no operational explanation. This guide condenses real PostgreSQL fluency to four primitives: &lt;strong&gt;the logical clause order (&lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;), the six join shapes and the grain trap they create, &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt; plus conditional aggregates for one-pass metrics, and window functions like &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, and &lt;code&gt;LEAD&lt;/code&gt; for ranking and lookback&lt;/strong&gt;. These four cover the bulk of analytical SQL — and the cheat-sheet style below is built so you can scan, copy a snippet, and tweak it for your own schema.&lt;/p&gt;

&lt;p&gt;Every section walks through a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;sub-topics with worked examples and runnable solutions&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and a &lt;strong&gt;worked interview-style scenario with a full traced answer&lt;/strong&gt;. PostgreSQL syntax throughout — the dialect that drives DataLemur, CoderPad, most product-analytics live screens, and the bulk of modern data-engineering SQL corpora.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp41gmcjpov3wjaj2quz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyp41gmcjpov3wjaj2quz.webp" alt="Bold PipeCode blog header for the PostgreSQL SQL cheat sheet with the elephant mascot and colored SQL keywords SELECT, FROM, JOIN, WHERE, WINDOW on a dark gradient background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top PostgreSQL SQL cheat sheet topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, worked examples, a worked interview question, and a step-by-step traced solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in a PostgreSQL cheat sheet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Logical clause order — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The single most useful PostgreSQL mental model: the order you write clauses is not the order the engine evaluates them; knowing the evaluation order explains most beginner parse errors, including why &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregates or column aliases.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Joins and grain — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, &lt;code&gt;CROSS&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Joins combine rows but they also change grain; a careless &lt;code&gt;1:N&lt;/code&gt; join inflates row counts silently, and the &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join is the canonical "find rows in A with no match in B" pattern (orphan customers, churned users).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and conditional aggregates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;WHERE&lt;/code&gt; filters rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after; &lt;code&gt;COUNT(*) FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;SUM(CASE WHEN …)&lt;/code&gt; express many metrics in one query — the universal duplicate finder &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; lives here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Window functions — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Per-partition ranking without collapsing rows, top-N-per-group, second-highest salary, running totals with &lt;code&gt;SUM(...) OVER (PARTITION BY ... ORDER BY ...)&lt;/code&gt;, and month-over-month deltas via &lt;code&gt;LAG&lt;/code&gt;; the most-graded primitive in modern SQL screens.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; every analytical SQL question reduces to four steps — &lt;strong&gt;filter rows, join tables without changing grain by accident, aggregate or rank, then present the result&lt;/strong&gt;. Holding the clause-order diagram in your head (Section 1) lets you write SQL outside-in: pick the grain, then the joins, then the filters, then the projection. The cheat sheet below is organized in the same order you would write a real query.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. PostgreSQL Logical Clause Order — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The seven-stage evaluation order every PostgreSQL query follows
&lt;/h3&gt;

&lt;p&gt;"Why does &lt;code&gt;WHERE customer_count &amp;gt; 5&lt;/code&gt; give me a parse error when I'm clearly counting customers?" is the signature beginner question — and the answer is &lt;strong&gt;logical clause order&lt;/strong&gt;. The mental model: &lt;strong&gt;PostgreSQL evaluates clauses in a fixed order that is different from the order you write them; &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; builds the row set, &lt;code&gt;WHERE&lt;/code&gt; filters rows, &lt;code&gt;GROUP BY&lt;/code&gt; collapses rows into groups, &lt;code&gt;HAVING&lt;/code&gt; filters groups, &lt;code&gt;SELECT&lt;/code&gt; projects columns, &lt;code&gt;ORDER BY&lt;/code&gt; sorts, &lt;code&gt;LIMIT&lt;/code&gt;/&lt;code&gt;OFFSET&lt;/code&gt; trims&lt;/strong&gt;. &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregate functions because aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;; column aliases declared in &lt;code&gt;SELECT&lt;/code&gt; cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; for the same reason.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ii8r51746vgu79rz5.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F069ii8r51746vgu79rz5.webp" alt="Horizontal seven-step PostgreSQL clause-order diagram from FROM/JOIN through WHERE, GROUP BY, HAVING, SELECT, ORDER BY, to LIMIT/OFFSET with purple and orange brand icons for each stage and pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Memorize one sentence — "From-Where-Group-Having-Select-Order-Limit" — and you can decode any PostgreSQL parse error in under five seconds. The error &lt;code&gt;column "customer_count" does not exist&lt;/code&gt; almost always means the column is a &lt;code&gt;SELECT&lt;/code&gt;-level alias being referenced in &lt;code&gt;WHERE&lt;/code&gt;, which runs three stages earlier; lift the predicate into &lt;code&gt;HAVING&lt;/code&gt; (if it references an aggregate) or repeat the expression inline in &lt;code&gt;WHERE&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
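
&lt;p&gt;A minimal sketch of that fix, using a hypothetical &lt;code&gt;orders&lt;/code&gt; table: the first query fails at parse time, the second lifts the aggregate predicate into &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Broken: the alias is computed in SELECT, which runs after WHERE
SELECT customer_id, COUNT(*) AS customer_count
FROM orders
WHERE customer_count &amp;gt; 5   -- ERROR: column "customer_count" does not exist
GROUP BY customer_id;

-- Fixed: aggregate predicates belong in HAVING
SELECT customer_id, COUNT(*) AS customer_count
FROM orders
GROUP BY customer_id
HAVING COUNT(*) &amp;gt; 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;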

&lt;h4&gt;
  
  
  &lt;code&gt;FROM&lt;/code&gt; and &lt;code&gt;JOIN&lt;/code&gt; — build the working row set
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; invariant: &lt;strong&gt;the first stage assembles a candidate row set by listing the tables (and how they join); every subsequent stage operates on this row set&lt;/strong&gt;. Subqueries in &lt;code&gt;FROM&lt;/code&gt; are also evaluated here, and &lt;code&gt;LATERAL&lt;/code&gt; joins let later subqueries reference earlier rows.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single table&lt;/strong&gt; — &lt;code&gt;FROM orders&lt;/code&gt; produces one row per &lt;code&gt;orders&lt;/code&gt; row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joined tables&lt;/strong&gt; — &lt;code&gt;FROM orders o JOIN customers c ON c.id = o.customer_id&lt;/code&gt; produces one row per matching pair.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subquery in &lt;code&gt;FROM&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;FROM (SELECT ...) t&lt;/code&gt; materializes the inner result, then treats it as a table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LATERAL&lt;/code&gt; subquery&lt;/strong&gt; — &lt;code&gt;FROM orders o, LATERAL (SELECT ... WHERE x = o.id) s&lt;/code&gt; re-evaluates the inner subquery per outer row (sketched just after this list).&lt;/li&gt;
&lt;/ul&gt;
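
&lt;p&gt;A hedged sketch of the &lt;code&gt;LATERAL&lt;/code&gt; shape from the last bullet (table and column names are illustrative): the inner subquery re-runs once per outer row, here fetching each customer's latest order.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Latest order per customer; the LATERAL body can reference c.id
SELECT c.name, recent.order_id, recent.amount
FROM customers c
LEFT JOIN LATERAL (
  SELECT o.order_id, o.amount
  FROM orders o
  WHERE o.customer_id = c.id
  ORDER BY o.order_date DESC
  LIMIT 1
) recent ON TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;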

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A &lt;code&gt;FROM&lt;/code&gt; with a &lt;code&gt;LEFT JOIN&lt;/code&gt; that produces the right row set before any filter runs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;output cardinality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;FROM customers&lt;/code&gt; alone&lt;/td&gt;
&lt;td&gt;3 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LEFT JOIN orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4 rows (Alice has 2 orders, Bob 1, Carol 0 padded with NULLs)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ready for &lt;code&gt;WHERE&lt;/code&gt; filtering&lt;/td&gt;
&lt;td&gt;4 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The engine reads &lt;code&gt;customers&lt;/code&gt; first, producing three rows (Alice, Bob, Carol).&lt;/li&gt;
&lt;li&gt;For each customer, it scans &lt;code&gt;orders&lt;/code&gt; for matching &lt;code&gt;customer_id&lt;/code&gt; rows; Alice matches 2 orders, Bob matches 1, Carol matches 0.&lt;/li&gt;
&lt;li&gt;Because the join is &lt;code&gt;LEFT&lt;/code&gt;, Carol's row is preserved with the right-side columns filled with &lt;code&gt;NULL&lt;/code&gt;s — total 4 rows.&lt;/li&gt;
&lt;li&gt;This 4-row stream is what &lt;code&gt;WHERE&lt;/code&gt; will see; no filtering has happened yet.&lt;/li&gt;
&lt;li&gt;Without understanding &lt;code&gt;FROM&lt;/code&gt; runs first, you can't reason about why a &lt;code&gt;WHERE&lt;/code&gt; predicate on the right side of a &lt;code&gt;LEFT JOIN&lt;/code&gt; silently converts the join into an &lt;code&gt;INNER JOIN&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your &lt;code&gt;LEFT JOIN&lt;/code&gt; is producing fewer rows than expected, check whether you have a &lt;code&gt;WHERE&lt;/code&gt; predicate that references the right-side table — that predicate runs after the join and discards the &lt;code&gt;NULL&lt;/code&gt;-padded rows.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; — row-level predicates before grouping
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows from the &lt;code&gt;FROM&lt;/code&gt;/&lt;code&gt;JOIN&lt;/code&gt; output before &lt;code&gt;GROUP BY&lt;/code&gt; runs; it can reference any column from the joined row set, but cannot reference aggregate functions or &lt;code&gt;SELECT&lt;/code&gt;-level aliases&lt;/strong&gt;. This is the cheapest place to drop rows — push predicates here whenever possible.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row predicates&lt;/strong&gt; — &lt;code&gt;WHERE amount &amp;gt; 30&lt;/code&gt;, &lt;code&gt;WHERE order_date &amp;gt;= '2026-01-01'&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IN&lt;/code&gt; / &lt;code&gt;EXISTS&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;WHERE customer_id IN (SELECT id FROM premium)&lt;/code&gt;, &lt;code&gt;WHERE EXISTS (...)&lt;/code&gt; (both forms sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;BETWEEN&lt;/code&gt;&lt;/strong&gt; — inclusive on both ends; &lt;code&gt;WHERE x BETWEEN 1 AND 10&lt;/code&gt; is &lt;code&gt;x &amp;gt;= 1 AND x &amp;lt;= 10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;IS NULL&lt;/code&gt; / &lt;code&gt;IS NOT NULL&lt;/code&gt;&lt;/strong&gt; — the only way to check for &lt;code&gt;NULL&lt;/code&gt;; never &lt;code&gt;= NULL&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
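
&lt;p&gt;A small sketch of the &lt;code&gt;IN&lt;/code&gt; / &lt;code&gt;EXISTS&lt;/code&gt; pair from the list above, with &lt;code&gt;premium&lt;/code&gt; as a hypothetical table; both forms express "orders placed by premium customers".&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- IN form
SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM premium);

-- EXISTS form: equivalent here, and the negated version (NOT EXISTS)
-- stays correct even when the subquery can emit NULLs
SELECT *
FROM orders o
WHERE EXISTS (SELECT 1 FROM premium p WHERE p.id = o.customer_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;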

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Filter to one day of orders before grouping.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;filter&lt;/th&gt;
&lt;th&gt;rows surviving&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;no filter&lt;/td&gt;
&lt;td&gt;12,847 (sample table holds only today's orders)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE order_date = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;12,847 (every row already matches)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WHERE order_date = '2026-05-10' AND amount &amp;gt; 100&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,290 (high-value only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;FROM orders&lt;/code&gt; returns the full row stream.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;WHERE order_date = '2026-05-10'&lt;/code&gt; is evaluated per row; rows with other dates are dropped.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AND amount &amp;gt; 100&lt;/code&gt; is evaluated next; this is a row predicate (not an aggregate), so it lives in &lt;code&gt;WHERE&lt;/code&gt; correctly.&lt;/li&gt;
&lt;li&gt;The surviving row set (4,290 rows) flows into &lt;code&gt;GROUP BY&lt;/code&gt; if one is present, otherwise into &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Pushing the date filter into &lt;code&gt;WHERE&lt;/code&gt; rather than &lt;code&gt;HAVING&lt;/code&gt; is critical for index usage: a B-tree index on &lt;code&gt;order_date&lt;/code&gt; can prune 95% of the table before any grouping happens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the predicate uses only raw row columns, it belongs in &lt;code&gt;WHERE&lt;/code&gt;; if it uses &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, it belongs in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The downstream invariant: &lt;strong&gt;after &lt;code&gt;WHERE&lt;/code&gt;, the engine evaluates &lt;code&gt;GROUP BY&lt;/code&gt; (collapsing rows into one row per distinct key combination), then &lt;code&gt;HAVING&lt;/code&gt; (filtering groups), then &lt;code&gt;SELECT&lt;/code&gt; (projecting columns and computing expressions), then &lt;code&gt;ORDER BY&lt;/code&gt; (sorting the final result), then &lt;code&gt;LIMIT&lt;/code&gt;/&lt;code&gt;OFFSET&lt;/code&gt; (trimming for pagination)&lt;/strong&gt;. &lt;code&gt;SELECT&lt;/code&gt;-level aliases become referenceable only in &lt;code&gt;ORDER BY&lt;/code&gt; and the outer query (in a subquery context).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY col1, col2&lt;/code&gt;&lt;/strong&gt; — one output row per distinct &lt;code&gt;(col1, col2)&lt;/code&gt; combination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING agg_pred&lt;/code&gt;&lt;/strong&gt; — filter groups; can reference &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT col, agg(col2) AS x&lt;/code&gt;&lt;/strong&gt; — project columns; aggregates and aliases are computed here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY x DESC, col&lt;/code&gt;&lt;/strong&gt; — can reference &lt;code&gt;SELECT&lt;/code&gt; aliases; deterministic with a tiebreaker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT N OFFSET M&lt;/code&gt;&lt;/strong&gt; — page slicing; always pair with &lt;code&gt;ORDER BY&lt;/code&gt; for determinism.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Group by customer, filter to high-spend customers, sort descending, top 5.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;stage&lt;/th&gt;
&lt;th&gt;rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FROM orders WHERE order_date = '2026-05-10'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4,290&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;GROUP BY customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1,720 (one row per customer)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;HAVING SUM(amount) &amp;gt; 500&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (high-spend)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SELECT customer_id, SUM(amount) AS spend&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (projected)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ORDER BY spend DESC, customer_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;312 (sorted)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;LIMIT 5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE&lt;/code&gt; produces 4,290 rows for one day with &lt;code&gt;amount &amp;gt; 100&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY customer_id&lt;/code&gt; collapses them into 1,720 buckets, one per customer.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HAVING SUM(amount) &amp;gt; 500&lt;/code&gt; keeps only the 312 buckets whose total spend exceeds $500.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SELECT&lt;/code&gt; computes the alias &lt;code&gt;spend = SUM(amount)&lt;/code&gt; and projects two columns.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY spend DESC, customer_id&lt;/code&gt; sorts the 312 surviving rows by descending spend with a deterministic tiebreaker; &lt;code&gt;LIMIT 5&lt;/code&gt; returns just the top five.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;spend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-05-10'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;spend&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every clause has a fixed slot; if you find yourself wanting &lt;code&gt;WHERE&lt;/code&gt; to reference an aggregate, the predicate belongs in &lt;code&gt;HAVING&lt;/code&gt; instead — and if you want &lt;code&gt;ORDER BY&lt;/code&gt; to use a long expression, alias it in &lt;code&gt;SELECT&lt;/code&gt; and reference the alias.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; — parse error; aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;. Use &lt;code&gt;HAVING&lt;/code&gt; (sketched after this list).&lt;/li&gt;
&lt;li&gt;Referencing a &lt;code&gt;SELECT&lt;/code&gt; alias in &lt;code&gt;WHERE&lt;/code&gt; — &lt;code&gt;WHERE spend &amp;gt; 100&lt;/code&gt; after &lt;code&gt;SELECT SUM(amount) AS spend&lt;/code&gt; fails; either repeat the expression or move to &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Selecting a non-aggregated, non-&lt;code&gt;GROUP BY&lt;/code&gt; column — strict PostgreSQL errors out with "must appear in GROUP BY"; some other dialects pick an arbitrary row silently.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LIMIT 5&lt;/code&gt; without &lt;code&gt;ORDER BY&lt;/code&gt; — non-deterministic; two runs of the same query can return different rows.&lt;/li&gt;
&lt;li&gt;Putting &lt;code&gt;HAVING&lt;/code&gt; before &lt;code&gt;GROUP BY&lt;/code&gt; — syntax error; the clause order is mandatory.&lt;/li&gt;
&lt;/ul&gt;
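
&lt;p&gt;Two of those fixes in one hedged sketch (hypothetical &lt;code&gt;users&lt;/code&gt; table): the duplicate finder keeps its aggregate predicate in &lt;code&gt;HAVING&lt;/code&gt;, and &lt;code&gt;LIMIT&lt;/code&gt; is paired with a deterministic &lt;code&gt;ORDER BY&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Duplicate emails: the aggregate predicate lives in HAVING, not WHERE
SELECT email, COUNT(*) AS copies
FROM users
GROUP BY email
HAVING COUNT(*) &amp;gt; 1
ORDER BY copies DESC, email  -- alias plus tiebreaker keeps LIMIT deterministic
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;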

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Clause Order
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;orders(order_id, customer_id, amount, order_date)&lt;/code&gt;, &lt;strong&gt;find every customer who placed more than 3 orders today with total spend above $500&lt;/strong&gt;. Return &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;total_spend&lt;/code&gt;, sorted by &lt;code&gt;total_spend&lt;/code&gt; descending.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;WHERE&lt;/code&gt; + &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt; in the Right Slots
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_spend&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
   &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_spend&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;WHERE order_date = CURRENT_DATE&lt;/code&gt; filters to today's row set first (cheap, index-friendly); &lt;code&gt;GROUP BY customer_id&lt;/code&gt; collapses to one row per customer; &lt;code&gt;HAVING&lt;/code&gt; evaluates the two aggregate predicates together (more than 3 orders AND total &amp;gt; $500); &lt;code&gt;SELECT&lt;/code&gt; projects the alias &lt;code&gt;total_spend&lt;/code&gt;; &lt;code&gt;ORDER BY total_spend DESC, customer_id&lt;/code&gt; produces a deterministic ordering. Single pass over today's rows with hash aggregation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for sample data on 2026-05-10:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;orders today&lt;/th&gt;
&lt;th&gt;sum(amount)&lt;/th&gt;
&lt;th&gt;passes HAVING?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;410&lt;/td&gt;
&lt;td&gt;✗ (sum ≤ 500)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;1,250&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;✗ (count ≤ 3)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;520&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three customers survive both predicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;total_spend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;1250&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;720&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;520&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; first&lt;/strong&gt; — &lt;code&gt;order_date = CURRENT_DATE&lt;/code&gt; is a row predicate using a non-aggregated column; pushing it into &lt;code&gt;WHERE&lt;/code&gt; shrinks the row set before grouping and lets the planner use a B-tree index on &lt;code&gt;order_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY customer_id&lt;/code&gt;&lt;/strong&gt; — collapses today's rows into one bucket per customer; every subsequent aggregate is computed inside this bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt; two-predicate AND&lt;/strong&gt; — &lt;code&gt;COUNT(*) &amp;gt; 3&lt;/code&gt; and &lt;code&gt;SUM(amount) &amp;gt; 500&lt;/code&gt; are both aggregate predicates; combining them with &lt;code&gt;AND&lt;/code&gt; in a single &lt;code&gt;HAVING&lt;/code&gt; is the canonical multi-condition group filter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT&lt;/code&gt; projection + alias&lt;/strong&gt; — &lt;code&gt;SUM(amount) AS total_spend&lt;/code&gt; is computed here; the alias becomes available to &lt;code&gt;ORDER BY&lt;/code&gt; (but not to &lt;code&gt;WHERE&lt;/code&gt; / &lt;code&gt;HAVING&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY total_spend DESC, customer_id&lt;/code&gt;&lt;/strong&gt; — descending sort on the metric with a deterministic tiebreaker via &lt;code&gt;customer_id&lt;/code&gt;; reviewers depend on stable ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|today's orders| + G log G)&lt;/code&gt; time&lt;/strong&gt; — single hash aggregation produces &lt;code&gt;G&lt;/code&gt; groups; final sort is &lt;code&gt;G log G&lt;/code&gt;. With an index on &lt;code&gt;(order_date, customer_id)&lt;/code&gt; the planner can stream rather than hash (index sketched below).&lt;/li&gt;
&lt;/ul&gt;
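
&lt;p&gt;A sketch of the index the last bullet assumes (the index name is illustrative); running &lt;code&gt;EXPLAIN ANALYZE&lt;/code&gt; before and after shows whether the planner switched away from a full-table hash aggregate.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Composite B-tree: date filter first, then the group key
CREATE INDEX idx_orders_date_customer ON orders (order_date, customer_id);

EXPLAIN ANALYZE
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
WHERE order_date = CURRENT_DATE
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;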

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering practice page&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; patterns and the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt; shapes.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. PostgreSQL Joins and Grain — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, &lt;code&gt;CROSS&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Joins, anti-joins, and the grain-inflation trap in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;"Why is &lt;code&gt;SUM(amount)&lt;/code&gt; returning double what I expect after I add a &lt;code&gt;JOIN&lt;/code&gt;?" is the signature grain-inflation question — and the answer is that &lt;strong&gt;joins do not just combine columns; they change the row cardinality of the result&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matching pairs, &lt;code&gt;LEFT JOIN&lt;/code&gt; keeps every left row and pads the right side with &lt;code&gt;NULL&lt;/code&gt;s, &lt;code&gt;RIGHT JOIN&lt;/code&gt; is the mirror, &lt;code&gt;FULL OUTER JOIN&lt;/code&gt; keeps both sides' unmatched rows, &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself (for hierarchies and pair queries), &lt;code&gt;CROSS JOIN&lt;/code&gt; produces a Cartesian product (one row per &lt;code&gt;(left, right)&lt;/code&gt; pair)&lt;/strong&gt;. The cardinality of any join is bounded by &lt;code&gt;|left| × |right|&lt;/code&gt;, and a &lt;code&gt;1:N&lt;/code&gt; relationship inflates left rows by &lt;code&gt;N&lt;/code&gt; — the silent source of doubled metrics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcucep3sqz2qej2vf3ic.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flcucep3sqz2qej2vf3ic.webp" alt="Venn diagrams for INNER JOIN (purple intersection of Table A and Table B) and LEFT JOIN (green Table A with a NULL pocket where Table B does not match) under a PostgreSQL SQL Cheat Sheet headline with a grain/cardinality footer label." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Before writing any join, ask "what is the grain of the result?" — orders, order lines, customer-day, or &lt;code&gt;(customer, product)&lt;/code&gt; pair. A &lt;code&gt;1:N&lt;/code&gt; join (e.g., &lt;code&gt;customers&lt;/code&gt; to &lt;code&gt;orders&lt;/code&gt;) inflates customer rows by the number of orders; &lt;code&gt;SUM(customer.lifetime_value)&lt;/code&gt; after that join returns lifetime value × order count, not lifetime value. Always state the grain out loud.&lt;/p&gt;
&lt;/blockquote&gt;
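
&lt;p&gt;A hedged sketch of that trap, with &lt;code&gt;lifetime_value&lt;/code&gt; as a hypothetical column: the first query inflates the sum by each customer's order count; the second collapses &lt;code&gt;orders&lt;/code&gt; to customer grain before joining.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Wrong: customers is 1:N to orders, so lifetime_value repeats per order
SELECT SUM(c.lifetime_value) AS total_ltv
FROM customers c
JOIN orders o ON o.customer_id = c.id;

-- Right: one row per ordering customer, so each lifetime_value counts once
SELECT SUM(c.lifetime_value) AS total_ltv
FROM customers c
JOIN (SELECT DISTINCT customer_id FROM orders) o
  ON o.customer_id = c.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;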

&lt;h4&gt;
  
  
  &lt;code&gt;INNER JOIN&lt;/code&gt; — keep only matching pairs (no padding)
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;INNER JOIN&lt;/code&gt; invariant: &lt;strong&gt;a left row is paired with a right row iff the join predicate is &lt;code&gt;TRUE&lt;/code&gt;; unmatched rows on either side are discarded; the result cardinality is the count of matching pairs&lt;/strong&gt;. This is the most common join and the fastest because the planner can short-circuit on no-match.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ON l.key = r.key&lt;/code&gt;&lt;/strong&gt; — single-column equi-join; the planner hashes the right table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-column&lt;/strong&gt; — &lt;code&gt;ON l.a = r.a AND l.b = r.b&lt;/code&gt; for composite keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-equi&lt;/strong&gt; — &lt;code&gt;ON l.range_start &amp;lt;= r.point AND l.range_end &amp;gt;= r.point&lt;/code&gt; (range join; sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;USING (col)&lt;/code&gt;&lt;/strong&gt; — shorthand when both sides share the column name; merges the column.&lt;/li&gt;
&lt;/ul&gt;
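
&lt;p&gt;A quick sketch of the non-equi shape from the list above, with &lt;code&gt;tax_brackets(range_start, range_end, rate)&lt;/code&gt; as a hypothetical table: each salary pairs with the bracket whose range contains it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Range join: the predicate is two inequalities, not an equality
SELECT e.name, e.salary, b.rate
FROM employees e
INNER JOIN tax_brackets b
  ON b.range_start &amp;lt;= e.salary
 AND b.range_end &amp;gt;= e.salary;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;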

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Two tables, three customers, two orders; one customer has no order.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol (no orders) does not appear — &lt;code&gt;INNER JOIN&lt;/code&gt; dropped her.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The engine reads &lt;code&gt;customers&lt;/code&gt; (Alice, Bob, Carol) and &lt;code&gt;orders&lt;/code&gt; (101 for Alice, 102 for Bob).&lt;/li&gt;
&lt;li&gt;For each &lt;code&gt;customers&lt;/code&gt; row, it scans &lt;code&gt;orders&lt;/code&gt; for a matching &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alice matches &lt;code&gt;order_id = 101&lt;/code&gt;; Bob matches &lt;code&gt;order_id = 102&lt;/code&gt;; Carol has no match.&lt;/li&gt;
&lt;li&gt;Carol's row is silently discarded because the join is &lt;code&gt;INNER&lt;/code&gt; — no &lt;code&gt;NULL&lt;/code&gt;-padded row is produced.&lt;/li&gt;
&lt;li&gt;The output has two rows because there were two matching pairs; in general the result cardinality is &lt;code&gt;0 ≤ N ≤ |customers| × |orders|&lt;/code&gt;: zero when nothing matches, the full product when every row matches every row.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; reach for &lt;code&gt;INNER JOIN&lt;/code&gt; whenever the question is "rows where both sides exist"; it is the smallest, fastest, most common join.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;LEFT JOIN&lt;/code&gt; — keep every left row, pad the right with &lt;code&gt;NULL&lt;/code&gt;s (anti-join trick)
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;LEFT JOIN&lt;/code&gt; invariant: &lt;strong&gt;every row from the left table appears in the output; if no right row matches, the right columns are &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;LEFT JOIN ... WHERE right.key IS NULL&lt;/code&gt; keeps exactly the left rows that had no match — the anti-join idiom&lt;/strong&gt;. &lt;code&gt;RIGHT JOIN&lt;/code&gt; is the mirror; flip the table order and use &lt;code&gt;LEFT&lt;/code&gt; for consistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — preserves every left row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right columns &lt;code&gt;NULL&lt;/code&gt;&lt;/strong&gt; when no match — the key signal for anti-joins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; anti-join&lt;/strong&gt; — "find rows in A with no match in B".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RIGHT JOIN&lt;/code&gt;&lt;/strong&gt; — mirror image; rarely needed (just flip table order and use &lt;code&gt;LEFT&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Same &lt;code&gt;customers&lt;/code&gt; + &lt;code&gt;orders&lt;/code&gt;; Carol is preserved with &lt;code&gt;NULL&lt;/code&gt; right-side columns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer&lt;/th&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each &lt;code&gt;customers&lt;/code&gt; row, scan &lt;code&gt;orders&lt;/code&gt; for a matching &lt;code&gt;customer_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Alice matches → row &lt;code&gt;(Alice, 101)&lt;/code&gt;; Bob matches → row &lt;code&gt;(Bob, 102)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Carol does not match → row &lt;code&gt;(Carol, NULL)&lt;/code&gt; is produced because the join is &lt;code&gt;LEFT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;To find Carol via the anti-join: add &lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt; after the &lt;code&gt;LEFT JOIN&lt;/code&gt;; only Carol's row passes the filter.&lt;/li&gt;
&lt;li&gt;Equivalent to &lt;code&gt;WHERE NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)&lt;/code&gt; and (under &lt;code&gt;NOT NULL&lt;/code&gt; constraints) &lt;code&gt;WHERE c.id NOT IN (SELECT customer_id FROM orders)&lt;/code&gt; — but the anti-join is immune to the &lt;code&gt;NOT IN&lt;/code&gt; &lt;code&gt;NULL&lt;/code&gt;-swallowing bug.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "find X with no Y" → &lt;code&gt;LEFT JOIN ... WHERE Y.id IS NULL&lt;/code&gt;. Memorize this; it is the most-asked join shape in SQL interviews and the cleanest fix for the &lt;code&gt;NOT IN ... NULL&lt;/code&gt; trap.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;FULL OUTER&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;, and &lt;code&gt;CROSS&lt;/code&gt; joins — the rarer shapes
&lt;/h4&gt;

&lt;p&gt;The rarer-joins invariant: &lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt; keeps every left row AND every right row (with &lt;code&gt;NULL&lt;/code&gt; padding on the unmatched side); &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself by aliasing it twice (employees-and-managers, parent-child, pair queries); &lt;code&gt;CROSS JOIN&lt;/code&gt; produces every &lt;code&gt;(left, right)&lt;/code&gt; combination — the Cartesian product&lt;/strong&gt;. Each has a narrow but important use.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL OUTER JOIN&lt;/code&gt;&lt;/strong&gt; — reconcile two sources; rows from either side without a match get padded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELF JOIN&lt;/code&gt;&lt;/strong&gt; — employee/manager, hierarchical recursion (alternative to recursive CTE), pair queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;CROSS JOIN&lt;/code&gt;&lt;/strong&gt; — generate every combination (small tables only) or paired with &lt;code&gt;LATERAL&lt;/code&gt; for top-N per row (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit cross join&lt;/strong&gt; — comma-separated tables (&lt;code&gt;FROM a, b&lt;/code&gt;) without an &lt;code&gt;ON&lt;/code&gt; is a &lt;code&gt;CROSS JOIN&lt;/code&gt; — usually a bug.&lt;/li&gt;
&lt;/ul&gt;
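
&lt;p&gt;A hedged sketch of the &lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt; top-N idiom from the third bullet (table names are illustrative): each customer row fans out to at most three of that customer's own orders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Top 3 orders per customer by amount
SELECT c.name, top_o.order_id, top_o.amount
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, amount
  FROM orders
  WHERE customer_id = c.id
  ORDER BY amount DESC
  LIMIT 3
) top_o;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt; like an &lt;code&gt;INNER&lt;/code&gt; join, &lt;code&gt;CROSS JOIN LATERAL&lt;/code&gt; drops customers with zero orders; switch to &lt;code&gt;LEFT JOIN LATERAL ... ON TRUE&lt;/code&gt; to keep them.&lt;/p&gt;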

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Self-join &lt;code&gt;employees&lt;/code&gt; to itself to surface each person's manager.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;manager_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL (CEO)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Alias the same &lt;code&gt;employees&lt;/code&gt; table twice: &lt;code&gt;e&lt;/code&gt; (for employees) and &lt;code&gt;m&lt;/code&gt; (for managers).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LEFT JOIN&lt;/code&gt; on &lt;code&gt;e.manager_id = m.emp_id&lt;/code&gt; looks each employee up against the manager rows.&lt;/li&gt;
&lt;li&gt;Alice's &lt;code&gt;manager_id&lt;/code&gt; points to Carol → row &lt;code&gt;(Alice, Carol)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Bob's &lt;code&gt;manager_id&lt;/code&gt; points to Carol → row &lt;code&gt;(Bob, Carol)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Carol is the CEO so her &lt;code&gt;manager_id IS NULL&lt;/code&gt; → no match → row &lt;code&gt;(Carol, NULL)&lt;/code&gt; because the join is &lt;code&gt;LEFT&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;manager_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;emp_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;manager_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;SELF JOIN&lt;/code&gt; is one-level hierarchy; for arbitrary-depth recursion (org chart traversal, BOM tree), reach for &lt;code&gt;WITH RECURSIVE&lt;/code&gt; instead.&lt;/p&gt;
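
&lt;p&gt;A minimal &lt;code&gt;WITH RECURSIVE&lt;/code&gt; sketch for the arbitrary-depth case, assuming the same &lt;code&gt;employees(emp_id, name, manager_id)&lt;/code&gt; shape as above: it walks from the CEO down, tracking depth.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH RECURSIVE org AS (
  -- Anchor: the CEO (no manager)
  SELECT emp_id, name, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive step: everyone reporting to a row already in org
  SELECT e.emp_id, e.name, e.manager_id, org.depth + 1
  FROM employees e
  JOIN org ON org.emp_id = e.manager_id
)
SELECT * FROM org ORDER BY depth, name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;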

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Forgetting that a &lt;code&gt;1:N&lt;/code&gt; &lt;code&gt;JOIN&lt;/code&gt; inflates the left side — &lt;code&gt;SUM(left.col)&lt;/code&gt; returns &lt;code&gt;left.col × N&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Filtering the right table inside &lt;code&gt;WHERE&lt;/code&gt; after a &lt;code&gt;LEFT JOIN&lt;/code&gt; (e.g., &lt;code&gt;WHERE o.amount &amp;gt; 0&lt;/code&gt;) — silently turns the &lt;code&gt;LEFT JOIN&lt;/code&gt; into an &lt;code&gt;INNER JOIN&lt;/code&gt; because &lt;code&gt;NULL &amp;gt; 0&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt; (fix sketched after this list).&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;NOT IN (subquery)&lt;/code&gt; when the subquery can contain &lt;code&gt;NULL&lt;/code&gt; — returns zero rows because &lt;code&gt;x NOT IN (..., NULL, ...)&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, which fails the predicate.&lt;/li&gt;
&lt;li&gt;Comma-separated &lt;code&gt;FROM a, b&lt;/code&gt; with no &lt;code&gt;ON&lt;/code&gt; clause — produces a Cartesian product (&lt;code&gt;CROSS JOIN&lt;/code&gt;); usually a bug.&lt;/li&gt;
&lt;li&gt;Joining on the wrong column (&lt;code&gt;o.id = c.id&lt;/code&gt; instead of &lt;code&gt;o.customer_id = c.id&lt;/code&gt;) — produces nonsense rows.&lt;/li&gt;
&lt;/ul&gt;
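
&lt;p&gt;The second mistake in that list has a one-line fix worth sketching: moving the right-side predicate from &lt;code&gt;WHERE&lt;/code&gt; into the &lt;code&gt;ON&lt;/code&gt; clause keeps the join genuinely &lt;code&gt;LEFT&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Silently INNER: NULL &amp;gt; 0 is NULL, so NULL-padded customers are dropped
SELECT c.name, o.amount
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.amount &amp;gt; 0;

-- Still LEFT: the predicate filters orders before pairing; Carol survives
SELECT c.name, o.amount
FROM customers c
LEFT JOIN orders o
  ON o.customer_id = c.id
 AND o.amount &amp;gt; 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;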

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Customers With No Orders
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;customers(id, name)&lt;/code&gt; and &lt;code&gt;orders(order_id, customer_id, amount)&lt;/code&gt;, &lt;strong&gt;return the names of customers who have never placed an order&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;LEFT JOIN ... WHERE orders.order_id IS NULL&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the &lt;code&gt;LEFT JOIN&lt;/code&gt; preserves every customer row regardless of whether a matching order exists; for matched customers, &lt;code&gt;o.order_id&lt;/code&gt; carries a real value; for unmatched customers, the right-side columns are &lt;code&gt;NULL&lt;/code&gt; and the &lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt; predicate is &lt;code&gt;TRUE&lt;/code&gt;; the filter keeps only the unmatched customers — the anti-join. Single pass over &lt;code&gt;customers&lt;/code&gt;; one keyed lookup into &lt;code&gt;orders&lt;/code&gt; per customer; no subquery materialization needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customers.id&lt;/th&gt;
&lt;th&gt;customers.name&lt;/th&gt;
&lt;th&gt;LEFT JOIN orders.order_id&lt;/th&gt;
&lt;th&gt;IS NULL?&lt;/th&gt;
&lt;th&gt;survives?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Carol and Dan survive the filter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt; semantics&lt;/strong&gt; — keeps every left row; right side is &lt;code&gt;NULL&lt;/code&gt; when there is no match. This &lt;code&gt;NULL&lt;/code&gt; is the entire signal we filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE o.order_id IS NULL&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;o.order_id&lt;/code&gt; is the right-side primary key; it is &lt;code&gt;NULL&lt;/code&gt; only when the join produced a synthetic unmatched row. A real-&lt;code&gt;NULL&lt;/code&gt; order-id from the source table never happens because primary keys are &lt;code&gt;NOT NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-join semantics&lt;/strong&gt; — equivalent to &lt;code&gt;NOT EXISTS (SELECT 1 FROM orders WHERE customer_id = c.id)&lt;/code&gt;; PostgreSQL plans &lt;code&gt;NOT EXISTS&lt;/code&gt; directly as an anti-join, and the &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; form usually performs comparably, so pick whichever reads clearer to your team.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;NULL&lt;/code&gt;-swallowing&lt;/strong&gt; — unlike &lt;code&gt;NOT IN&lt;/code&gt;, the predicate is &lt;code&gt;IS NULL&lt;/code&gt;, which is well-defined for &lt;code&gt;NULL&lt;/code&gt; values. There is no silent zero-row failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY c.name&lt;/code&gt;&lt;/strong&gt; — deterministic ordering for reviewer stability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|customers| + |orders|)&lt;/code&gt; time&lt;/strong&gt; — hash-join build on &lt;code&gt;orders.customer_id&lt;/code&gt;, single probe per customer. With an index on &lt;code&gt;orders.customer_id&lt;/code&gt; this is near-linear.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins practice page&lt;/a&gt; for &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, and anti-join shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling practice page&lt;/a&gt; for &lt;code&gt;NULL&lt;/code&gt;-aware predicates.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. PostgreSQL &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and Conditional Aggregates
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;GROUP BY&lt;/code&gt; with &lt;code&gt;HAVING&lt;/code&gt;, &lt;code&gt;FILTER&lt;/code&gt;, and &lt;code&gt;CASE&lt;/code&gt; for one-pass metrics in PostgreSQL
&lt;/h3&gt;

&lt;p&gt;"Compute total revenue, refunded revenue, and the percentage refunded — in a single query" is the signature conditional-aggregate prompt — and the cleanest PostgreSQL answer is &lt;strong&gt;&lt;code&gt;SUM(... ) FILTER (WHERE …)&lt;/code&gt; clauses inside a single &lt;code&gt;SELECT&lt;/code&gt;&lt;/strong&gt;. The mental model: &lt;strong&gt;&lt;code&gt;GROUP BY col&lt;/code&gt; collapses rows into buckets; &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(...)&lt;/code&gt;, &lt;code&gt;AVG(...)&lt;/code&gt;, &lt;code&gt;MIN(...)&lt;/code&gt;, &lt;code&gt;MAX(...)&lt;/code&gt; summarize each bucket; &lt;code&gt;WHERE&lt;/code&gt; filters individual rows before grouping; &lt;code&gt;HAVING&lt;/code&gt; filters groups after grouping; &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;CASE WHEN …&lt;/code&gt; express conditional aggregates that count or sum only certain rows per group&lt;/strong&gt;. The duplicate-finder pattern &lt;code&gt;GROUP BY key HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; lives here too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo6xn6pursz7krqfnz0a.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbo6xn6pursz7krqfnz0a.webp" alt="Two-panel PostgreSQL SQL Cheat Sheet diagram: left panel WHERE filters individual rows (orange funnel with rows flowing into filtered output); right panel HAVING filters groups (group boxes labeled kept and rejected separated by a purple filter bar) with pipecode.ai attribution." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; PostgreSQL supports the SQL standard &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; clause on every aggregate — &lt;code&gt;COUNT(*) FILTER (WHERE status = 'refunded')&lt;/code&gt;. It produces clearer queries than &lt;code&gt;SUM(CASE WHEN … THEN 1 ELSE 0 END)&lt;/code&gt; and is exactly what interviewers like to see. The &lt;code&gt;CASE&lt;/code&gt; variant still works for portability across dialects.&lt;/p&gt;
&lt;/blockquote&gt;
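
&lt;p&gt;As a minimal sketch of the two equivalent forms (the &lt;code&gt;orders(customer_id, status)&lt;/code&gt; shape here is illustrative, matching the refund example used later):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Both output columns are identical: the CASE form yields NULL for
-- non-matching rows, and COUNT skips NULLs.
SELECT customer_id,
       COUNT(*) FILTER (WHERE status = 'refunded')     AS refunds_filter,
       COUNT(CASE WHEN status = 'refunded' THEN 1 END) AS refunds_case
FROM orders
GROUP BY customer_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;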

&lt;h4&gt;
  
  
  &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; — &lt;code&gt;NULL&lt;/code&gt;-aware aggregates
&lt;/h4&gt;

&lt;p&gt;The aggregate-&lt;code&gt;NULL&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt; counts every row including ones with &lt;code&gt;NULL&lt;/code&gt; columns; &lt;code&gt;COUNT(col)&lt;/code&gt; counts only rows where &lt;code&gt;col&lt;/code&gt; is not &lt;code&gt;NULL&lt;/code&gt;; &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt; skip &lt;code&gt;NULL&lt;/code&gt; values entirely; if every value in a group is &lt;code&gt;NULL&lt;/code&gt;, the result is &lt;code&gt;NULL&lt;/code&gt; (not &lt;code&gt;0&lt;/code&gt;)&lt;/strong&gt;. The distinction between &lt;code&gt;COUNT(*)&lt;/code&gt; and &lt;code&gt;COUNT(col)&lt;/code&gt; is the #1 source of "my counts are off by 10%" bugs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — every row in the bucket, regardless of &lt;code&gt;NULL&lt;/code&gt;s.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(col)&lt;/code&gt;&lt;/strong&gt; — non-&lt;code&gt;NULL&lt;/code&gt; values of &lt;code&gt;col&lt;/code&gt; only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt;&lt;/strong&gt; — unique non-&lt;code&gt;NULL&lt;/code&gt; values; essential after a &lt;code&gt;JOIN&lt;/code&gt; that may have inflated rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt;&lt;/strong&gt; — numeric only; &lt;code&gt;AVG&lt;/code&gt; is sum-of-non-null-divided-by-count-of-non-null, so &lt;code&gt;NULL&lt;/code&gt; does &lt;strong&gt;not&lt;/strong&gt; count as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three rows in one customer's bucket: &lt;code&gt;amount&lt;/code&gt; = &lt;code&gt;10, NULL, 30&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;aggregate&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;COUNT(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;SUM(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AVG(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MIN(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;MAX(amount)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;COUNT(*)&lt;/code&gt; = 3 because every row in the bucket counts, regardless of &lt;code&gt;amount&lt;/code&gt;'s value.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COUNT(amount)&lt;/code&gt; = 2 because the &lt;code&gt;NULL&lt;/code&gt; row is skipped; only &lt;code&gt;10&lt;/code&gt; and &lt;code&gt;30&lt;/code&gt; contribute.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount)&lt;/code&gt; = 10 + 30 = 40; the &lt;code&gt;NULL&lt;/code&gt; is treated as missing, not as &lt;code&gt;0&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG(amount)&lt;/code&gt; = (10 + 30) / 2 = 20; the denominator is &lt;code&gt;COUNT(amount) = 2&lt;/code&gt;, not &lt;code&gt;COUNT(*) = 3&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt; skip the &lt;code&gt;NULL&lt;/code&gt; and return the smallest/largest non-&lt;code&gt;NULL&lt;/code&gt; value.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_known&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hi&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the metric is "people who clicked" use &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt;; if it is "click events" use &lt;code&gt;COUNT(*)&lt;/code&gt;; if it is "rows with a known value" use &lt;code&gt;COUNT(col)&lt;/code&gt;.&lt;/p&gt;
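
&lt;p&gt;A quick sketch of the distinction, assuming a hypothetical &lt;code&gt;clicks(user_id, clicked_at)&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Three "how many" questions, three different COUNTs.
SELECT COUNT(DISTINCT user_id) AS people_who_clicked,   -- unique non-NULL users
       COUNT(*)                AS click_events,         -- every row
       COUNT(user_id)          AS rows_with_known_user  -- non-NULL user_id only
FROM clicks;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;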

&lt;h4&gt;
  
  
  &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; — row filter vs group filter
&lt;/h4&gt;

&lt;p&gt;The two-clause invariant: &lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt; runs before &lt;code&gt;GROUP BY&lt;/code&gt; and references raw row columns; &lt;code&gt;HAVING&lt;/code&gt; runs after grouping and can reference aggregate functions; trying to use &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; is a parse error because aggregates do not exist until after grouping&lt;/strong&gt;. Both can appear in the same query.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — filter rows; uses &lt;code&gt;col&lt;/code&gt;, &lt;code&gt;col2&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING&lt;/code&gt;&lt;/strong&gt; — filter groups; uses &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order of evaluation&lt;/strong&gt; — &lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; — push predicates into &lt;code&gt;WHERE&lt;/code&gt; whenever possible; &lt;code&gt;WHERE&lt;/code&gt; filters before the (often expensive) sort/hash step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Six employees across &lt;code&gt;eng&lt;/code&gt; and &lt;code&gt;sales&lt;/code&gt;; find departments whose average salary exceeds 50,000 across employees earning more than 30,000.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;40,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;20,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;WHERE salary &amp;gt; 30000&lt;/code&gt; drops the two rows below the threshold (eng 25,000 and sales 20,000) — 4 rows remain.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GROUP BY department&lt;/code&gt; collapses to two buckets: eng (40,000 + 70,000) and sales (60,000 + 60,000).&lt;/li&gt;
&lt;li&gt;PostgreSQL computes &lt;code&gt;AVG(salary)&lt;/code&gt; per bucket: eng = 55,000; sales = 60,000.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;HAVING AVG(salary) &amp;gt; 50000&lt;/code&gt; keeps both buckets (both averages exceed 50,000).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;SELECT&lt;/code&gt; projects the department name and its average; final result is two rows.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; aggregate predicate → &lt;code&gt;HAVING&lt;/code&gt;; row predicate → &lt;code&gt;WHERE&lt;/code&gt;. If the predicate uses &lt;code&gt;SUM&lt;/code&gt; / &lt;code&gt;COUNT&lt;/code&gt; / &lt;code&gt;AVG&lt;/code&gt; / &lt;code&gt;MIN&lt;/code&gt; / &lt;code&gt;MAX&lt;/code&gt;, it must live in &lt;code&gt;HAVING&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; and &lt;code&gt;CASE&lt;/code&gt; — conditional aggregates
&lt;/h4&gt;

&lt;p&gt;The conditional-aggregate invariant: &lt;strong&gt;&lt;code&gt;SUM(col) FILTER (WHERE pred)&lt;/code&gt; and &lt;code&gt;COUNT(*) FILTER (WHERE pred)&lt;/code&gt; apply the aggregate only to rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; the portable alternative is &lt;code&gt;SUM(CASE WHEN pred THEN col ELSE 0 END)&lt;/code&gt; and &lt;code&gt;COUNT(CASE WHEN pred THEN 1 END)&lt;/code&gt;&lt;/strong&gt;. PostgreSQL supports both; pick &lt;code&gt;FILTER&lt;/code&gt; for clarity in PostgreSQL-only code, &lt;code&gt;CASE&lt;/code&gt; for cross-dialect portability.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FILTER (WHERE …)&lt;/code&gt;&lt;/strong&gt; — PostgreSQL/SQL-standard syntax; applies per-aggregate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(CASE WHEN … THEN col ELSE 0 END)&lt;/code&gt;&lt;/strong&gt; — portable across dialects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(CASE WHEN … THEN 1 END)&lt;/code&gt;&lt;/strong&gt; — counts only matching rows; the implicit &lt;code&gt;ELSE NULL&lt;/code&gt; produces &lt;code&gt;NULL&lt;/code&gt;s, which &lt;code&gt;COUNT&lt;/code&gt; skips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multiple aggregates, one query&lt;/strong&gt; — combine many &lt;code&gt;FILTER&lt;/code&gt; clauses to compute several metrics in one pass.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; One pass over &lt;code&gt;orders&lt;/code&gt; to compute total revenue, refunded revenue, and the refund rate.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;total_revenue&lt;/th&gt;
&lt;th&gt;refunded_revenue&lt;/th&gt;
&lt;th&gt;refund_pct&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;500&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;10.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;800&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;td&gt;25.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount)&lt;/code&gt; aggregates every row in the bucket → &lt;code&gt;total_revenue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount) FILTER (WHERE status = 'refunded')&lt;/code&gt; aggregates only refunded rows → &lt;code&gt;refunded_revenue&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The refund percentage is &lt;code&gt;refunded_revenue / total_revenue * 100&lt;/code&gt;; cast one side to &lt;code&gt;NUMERIC&lt;/code&gt; to avoid integer division.&lt;/li&gt;
&lt;li&gt;PostgreSQL evaluates every &lt;code&gt;FILTER&lt;/code&gt; independently per row of input; one scan computes all metrics.&lt;/li&gt;
&lt;li&gt;The portable variant uses &lt;code&gt;SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)&lt;/code&gt; — same result, slightly more verbose.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                              &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refunded_revenue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;FILTER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'refunded'&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;NUMERIC&lt;/span&gt;
         &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;refund_pct&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; whenever you find yourself running two queries with different &lt;code&gt;WHERE&lt;/code&gt; clauses against the same table and joining the results, refactor to a single query with two &lt;code&gt;FILTER&lt;/code&gt; clauses — same answer, half the cost.&lt;/p&gt;
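
&lt;p&gt;A sketch of that refactor over the same illustrative &lt;code&gt;orders&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Before: two scans of the same table.
--   SELECT SUM(amount) FROM orders WHERE status = 'paid';
--   SELECT SUM(amount) FROM orders WHERE status = 'refunded';
-- After: one scan, two FILTER clauses.
SELECT SUM(amount) FILTER (WHERE status = 'paid')     AS paid_total,
       SUM(amount) FILTER (WHERE status = 'refunded') AS refunded_total
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;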

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt; — parse error; aggregates do not exist until after &lt;code&gt;GROUP BY&lt;/code&gt;. Use &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVG(col)&lt;/code&gt; and assuming &lt;code&gt;NULL&lt;/code&gt; rows count as &lt;code&gt;0&lt;/code&gt; — they are excluded from both numerator and denominator. Use &lt;code&gt;AVG(COALESCE(col, 0))&lt;/code&gt; only if "missing means 0" is the business rule.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;COUNT(DISTINCT col)&lt;/code&gt; forgotten after a &lt;code&gt;JOIN&lt;/code&gt; that inflates rows — reports inflated counts.&lt;/li&gt;
&lt;li&gt;Integer division — &lt;code&gt;5 / 100 = 0&lt;/code&gt; in PostgreSQL. Cast one operand to &lt;code&gt;NUMERIC&lt;/code&gt; or &lt;code&gt;FLOAT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Division by zero — &lt;code&gt;NULLIF(denom, 0)&lt;/code&gt; converts &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;NULL&lt;/code&gt;, so the division returns &lt;code&gt;NULL&lt;/code&gt; instead of erroring. Both of these fixes are shown in the snippet below.&lt;/li&gt;
&lt;/ul&gt;
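
&lt;p&gt;A minimal sketch of the last two pitfalls and their fixes:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT 5 / 100           AS int_div,   -- 0: integer division truncates
       5::NUMERIC / 100  AS num_div,   -- 0.05: cast one operand first
       10 / NULLIF(0, 0) AS safe_div;  -- NULL instead of a division-by-zero error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;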

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Duplicate Emails
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;users(id, email)&lt;/code&gt;, &lt;strong&gt;return every email that appears more than once&lt;/strong&gt;, along with the number of copies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;GROUP BY email HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n_copies&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; &lt;code&gt;GROUP BY email&lt;/code&gt; collapses every row with the same email into a single bucket; &lt;code&gt;COUNT(*)&lt;/code&gt; counts how many rows fell into each bucket; &lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt; keeps only buckets with at least two rows; &lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt; produces a deterministic, reviewer-friendly output. Single pass over &lt;code&gt;users&lt;/code&gt;; sort cost dominates only when email cardinality is huge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:carol@example.com"&gt;carol@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FROM users&lt;/code&gt;&lt;/strong&gt; — read all six rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No &lt;code&gt;WHERE&lt;/code&gt;&lt;/strong&gt; — every row passes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — three buckets: alice (3 rows), bob (2 rows), carol (1 row).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — 3, 2, 1 respectively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — drops the carol bucket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — alice (3), then bob (2).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;email&lt;/th&gt;
&lt;th&gt;n_copies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="mailto:bob@example.com"&gt;bob@example.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY email&lt;/code&gt;&lt;/strong&gt; — collapses to one bucket per distinct email; the bucket is the unit of all subsequent aggregates and group-level filters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — counts every row in the bucket, perfect for "how many copies".&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING COUNT(*) &amp;gt; 1&lt;/code&gt;&lt;/strong&gt; — group-level filter; the aggregate predicate must live here, not in &lt;code&gt;WHERE&lt;/code&gt;. This is the precise interview signal for duplicate detection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY n_copies DESC, email&lt;/code&gt;&lt;/strong&gt; — deterministic ordering; tie-broken by &lt;code&gt;email&lt;/code&gt; so the output is stable across runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|users| + G log G)&lt;/code&gt; time&lt;/strong&gt; — single hash aggregation produces &lt;code&gt;G&lt;/code&gt; group rows; the final sort is &lt;code&gt;G log G&lt;/code&gt;. With an index on &lt;code&gt;email&lt;/code&gt;, the planner may choose a pre-sorted &lt;code&gt;GroupAggregate&lt;/code&gt; and skip the hash step.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation practice page&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; shapes, and the &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering practice page&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; vs &lt;code&gt;HAVING&lt;/code&gt; distinctions.&lt;/p&gt;





&lt;h2&gt;
  
  
  4. PostgreSQL Window Functions — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ranking, top-N-per-group, running totals, and lookback in PostgreSQL window functions
&lt;/h3&gt;

&lt;p&gt;"Find the second-highest distinct salary" and "compute a running total of daily revenue" are the two signature window-function prompts — and both reduce to a &lt;strong&gt;window function with &lt;code&gt;OVER (PARTITION BY … ORDER BY …)&lt;/code&gt;&lt;/strong&gt;. The mental model: &lt;strong&gt;a window function computes a value across a window of rows related to the current row without collapsing the rows like &lt;code&gt;GROUP BY&lt;/code&gt; does; &lt;code&gt;OVER (PARTITION BY col)&lt;/code&gt; defines the window boundary; &lt;code&gt;OVER (ORDER BY col)&lt;/code&gt; defines the order within the window&lt;/strong&gt;. &lt;code&gt;ROW_NUMBER&lt;/code&gt; assigns unique integers; &lt;code&gt;RANK&lt;/code&gt; skips after ties (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; does not skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;); &lt;code&gt;LAG&lt;/code&gt; looks back; &lt;code&gt;LEAD&lt;/code&gt; looks forward; &lt;code&gt;SUM/AVG/COUNT(...) OVER (...)&lt;/code&gt; compute running totals and moving averages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypy49fa51qwnlgekf61n.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fypy49fa51qwnlgekf61n.webp" alt="Three-column comparison of PostgreSQL window functions on a salary ladder with tied rows: ROW_NUMBER yields unique 1-2-3-4, RANK yields 1-2-2-4 with a +2 skip, DENSE_RANK yields 1-2-2-3 with no gap; a caption explains DENSE_RANK equals N for the Nth distinct value, plus an inset showing a running total via SUM(amount) OVER (ORDER BY date) on a small sales table." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; because they execute &lt;em&gt;after&lt;/em&gt; &lt;code&gt;WHERE&lt;/code&gt;. Wrap the window in a CTE or subquery, then filter on the alias. The error &lt;code&gt;column "rn" does not exist&lt;/code&gt; after writing &lt;code&gt;WHERE rn = 1&lt;/code&gt; almost always means you forgot this rule.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  &lt;code&gt;ROW_NUMBER&lt;/code&gt; — unique sequential numbering per partition
&lt;/h4&gt;

&lt;p&gt;The &lt;code&gt;ROW_NUMBER&lt;/code&gt; invariant: &lt;strong&gt;&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; assigns a unique integer &lt;code&gt;1, 2, 3, …&lt;/code&gt; to every row inside each partition &lt;code&gt;p&lt;/code&gt;, ordered by &lt;code&gt;o&lt;/code&gt;; ties in &lt;code&gt;o&lt;/code&gt; are broken arbitrarily by the planner&lt;/strong&gt;. Use it when you need a unique sequence per group regardless of tie semantics — most often for deduplication (keep &lt;code&gt;rn = 1&lt;/code&gt; per business key).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (PARTITION BY …)&lt;/code&gt;&lt;/strong&gt; — bucket the rows; without this, the window is the whole result set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;OVER (ORDER BY …)&lt;/code&gt;&lt;/strong&gt; — order within the bucket; required for &lt;code&gt;ROW_NUMBER&lt;/code&gt;/&lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ties broken arbitrarily&lt;/strong&gt; — add a tiebreaker column to &lt;code&gt;ORDER BY&lt;/code&gt; for determinism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top-N-per-group&lt;/strong&gt; — &lt;code&gt;WHERE rn &amp;lt;= N&lt;/code&gt; after &lt;code&gt;ROW_NUMBER&lt;/code&gt;; works only when ties at rank N can be ignored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; &lt;code&gt;employees&lt;/code&gt; with three engineers; rank by salary descending.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bob and Carol tie on salary; &lt;code&gt;ROW_NUMBER&lt;/code&gt; still gives them unique ranks (plan-dependent unless you add a tiebreaker).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;PARTITION BY department&lt;/code&gt; defines the boundary — only &lt;code&gt;eng&lt;/code&gt; rows are compared with each other; if there were a &lt;code&gt;sales&lt;/code&gt; partition it would have its own &lt;code&gt;1, 2, 3&lt;/code&gt; sequence.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ORDER BY salary DESC, name&lt;/code&gt; orders rows within the partition: Alice (90,000) first, then Bob and Carol (tied at 80,000) broken by name.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER()&lt;/code&gt; assigns &lt;code&gt;1, 2, 3&lt;/code&gt; sequentially regardless of ties; Bob gets &lt;code&gt;2&lt;/code&gt; and Carol gets &lt;code&gt;3&lt;/code&gt; because &lt;code&gt;name&lt;/code&gt; breaks the tie.&lt;/li&gt;
&lt;li&gt;Without the &lt;code&gt;, name&lt;/code&gt; tiebreaker, Bob/Carol order is undefined — two query runs could swap them.&lt;/li&gt;
&lt;li&gt;To deduplicate a table that has multiple rows per &lt;code&gt;(business_key, source_ts)&lt;/code&gt;, use &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1&lt;/code&gt; to keep the latest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
         &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
         &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;
       &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;ROW_NUMBER&lt;/code&gt; is the right tool for &lt;em&gt;deduplication&lt;/em&gt; (&lt;code&gt;WHERE rn = 1&lt;/code&gt;) and for ordered streams; reach for &lt;code&gt;RANK&lt;/code&gt; or &lt;code&gt;DENSE_RANK&lt;/code&gt; when ties must be honored.&lt;/p&gt;
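
&lt;p&gt;A sketch of that deduplication shape, assuming a hypothetical staging table &lt;code&gt;events(business_key, source_ts, payload)&lt;/code&gt; where the latest row per key should win:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH numbered AS (
    SELECT business_key,
           source_ts,
           payload,
           ROW_NUMBER() OVER (
               PARTITION BY business_key
               ORDER BY source_ts DESC      -- latest row gets rn = 1
           ) AS rn
    FROM events
)
SELECT business_key, source_ts, payload
FROM numbered
WHERE rn = 1;                               -- keep exactly one row per key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;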

&lt;h4&gt;
  
  
  &lt;code&gt;RANK&lt;/code&gt; vs &lt;code&gt;DENSE_RANK&lt;/code&gt; — tie semantics
&lt;/h4&gt;

&lt;p&gt;The rank-vs-dense-rank invariant: &lt;strong&gt;both assign the same rank to tied rows; &lt;code&gt;RANK&lt;/code&gt; then skips the next &lt;code&gt;k-1&lt;/code&gt; ranks (gap), while &lt;code&gt;DENSE_RANK&lt;/code&gt; continues without a gap&lt;/strong&gt;. For "find the Nth distinct value" questions, &lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the correct filter; for "find the Nth row in skip-aware ranking order", &lt;code&gt;RANK = N&lt;/code&gt; is correct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 4&lt;/code&gt; — skips after ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 2, 3&lt;/code&gt; — no skip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;1, 2, 3, 4&lt;/code&gt; — never ties.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pick by semantics&lt;/strong&gt; — "Nth highest distinct salary" → &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; "Nth-ranked row in skip ordering" → &lt;code&gt;RANK = N&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Four employees; Bob and Carol tied at second-highest salary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;th&gt;dense_rank&lt;/th&gt;
&lt;th&gt;row_number&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;RANK&lt;/code&gt; jumps &lt;code&gt;2 → 4&lt;/code&gt; (skipping &lt;code&gt;3&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; continues &lt;code&gt;2 → 3&lt;/code&gt; (no skip).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All three window functions agree on Alice (rank 1) because she is alone at the top.&lt;/li&gt;
&lt;li&gt;Bob and Carol both get &lt;code&gt;rank = 2&lt;/code&gt; and &lt;code&gt;dense_rank = 2&lt;/code&gt; because they tie on salary; &lt;code&gt;row_number&lt;/code&gt; gives them distinct values 2 and 3.&lt;/li&gt;
&lt;li&gt;Dan is the next-lowest salary; &lt;code&gt;RANK&lt;/code&gt; skips ahead by the number of tied rows (2 tied → next rank is &lt;code&gt;2 + 2 = 4&lt;/code&gt;); &lt;code&gt;DENSE_RANK&lt;/code&gt; continues with no gap (&lt;code&gt;3&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;For "second highest distinct salary", &lt;code&gt;DENSE_RANK = 2&lt;/code&gt; correctly returns 80,000; &lt;code&gt;RANK = 2&lt;/code&gt; would also work here, but &lt;code&gt;RANK&lt;/code&gt; would &lt;em&gt;not&lt;/em&gt; return 80,000 if three people tied for first (it would skip to 4).&lt;/li&gt;
&lt;li&gt;For "top 3 distinct salaries", use &lt;code&gt;DENSE_RANK &amp;lt;= 3&lt;/code&gt; — it returns Alice, Bob, Carol, Dan (four rows because Bob/Carol both have &lt;code&gt;dr = 2&lt;/code&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rnk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; "second highest salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;; "top 3 distinct salaries" → &lt;code&gt;DENSE_RANK &amp;lt;= 3&lt;/code&gt;; never use &lt;code&gt;RANK&lt;/code&gt; for these unless the spec explicitly says ties should consume rank slots.&lt;/p&gt;
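
&lt;p&gt;Applied to the classic prompt, a sketch of "second-highest distinct salary" over the same &lt;code&gt;employees&lt;/code&gt; data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH ranked AS (
    SELECT salary,
           DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
)
SELECT DISTINCT salary   -- DISTINCT: tied rows share the same dr
FROM ranked
WHERE dr = 2;            -- 80,000 for the sample data above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;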

&lt;h4&gt;
  
  
  &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;, and running totals — lookback, lookahead, and &lt;code&gt;SUM(...) OVER (...)&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The lookback-and-running invariant: &lt;strong&gt;&lt;code&gt;LAG(col, n)&lt;/code&gt; returns the value of &lt;code&gt;col&lt;/code&gt; &lt;code&gt;n&lt;/code&gt; rows back within the partition (default &lt;code&gt;n=1&lt;/code&gt;); &lt;code&gt;LEAD(col, n)&lt;/code&gt; is the symmetric forward; &lt;code&gt;SUM(col) OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; produces a running total within each partition&lt;/strong&gt;. These three primitives drive month-over-month deltas, sessionization, running balances, and moving averages.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LAG(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — previous day's amount.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEAD(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — next day's amount.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;amount - LAG(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — day-over-day delta.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(amount) OVER (ORDER BY date)&lt;/code&gt;&lt;/strong&gt; — running total from start of partition through current row.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Three days of sales; compute previous-day amount, day-over-day delta, and running total.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sales_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;th&gt;prev_amount&lt;/th&gt;
&lt;th&gt;dod_delta&lt;/th&gt;
&lt;th&gt;running_total&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-09&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;NULL&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-10&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;230&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026-05-11&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;td&gt;130&lt;/td&gt;
&lt;td&gt;-10&lt;/td&gt;
&lt;td&gt;350&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The first day has &lt;code&gt;LAG = NULL&lt;/code&gt; because no prior row exists; consumers usually &lt;code&gt;COALESCE(delta, 0)&lt;/code&gt; for display.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;LAG(amount) OVER (ORDER BY sales_date)&lt;/code&gt; returns the previous row's amount, ordered by date.&lt;/li&gt;
&lt;li&gt;Day 1 (May 9): no previous row, so &lt;code&gt;LAG = NULL&lt;/code&gt;; &lt;code&gt;amount - LAG = NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Day 2 (May 10): &lt;code&gt;LAG = 100&lt;/code&gt;; &lt;code&gt;delta = 130 - 100 = 30&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Day 3 (May 11): &lt;code&gt;LAG = 130&lt;/code&gt;; &lt;code&gt;delta = 120 - 130 = -10&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SUM(amount) OVER (ORDER BY sales_date)&lt;/code&gt; accumulates from the start of the partition through the current row: 100, 230, 350.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;prev_amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;LAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dod_delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;running_total&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;sales_date&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; &lt;code&gt;LAG&lt;/code&gt; for "compare this row to its predecessor" (delta, retention, gap); &lt;code&gt;LEAD&lt;/code&gt; for "what happens next" (sessionization, churn-from-here); &lt;code&gt;SUM(...) OVER (...)&lt;/code&gt; for running totals — always &lt;code&gt;PARTITION BY&lt;/code&gt; the entity if the table holds multiple series.&lt;/p&gt;
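
&lt;p&gt;A sketch of that last point, assuming the &lt;code&gt;sales&lt;/code&gt; table also carries a hypothetical &lt;code&gt;customer_id&lt;/code&gt; so it holds one series per customer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Without PARTITION BY customer_id the totals of every customer
-- would blend into a single global series.
SELECT customer_id,
       sales_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY sales_date
       ) AS running_total
FROM sales
ORDER BY customer_id, sales_date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;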

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;RANK&lt;/code&gt; when the question wants the Nth &lt;em&gt;distinct&lt;/em&gt; value — &lt;code&gt;RANK = 2&lt;/code&gt; skips entirely if two rows tie for first.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;PARTITION BY&lt;/code&gt; for a per-group ranking — produces a global ranking instead of per-department.&lt;/li&gt;
&lt;li&gt;Referencing the window-function alias in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt; — window functions execute after &lt;code&gt;WHERE&lt;/code&gt;; wrap in a CTE or subquery first.&lt;/li&gt;
&lt;li&gt;Confusing &lt;code&gt;LAG&lt;/code&gt; (previous) with &lt;code&gt;LEAD&lt;/code&gt; (next) — quietly produces inverted deltas.&lt;/li&gt;
&lt;li&gt;Forgetting &lt;code&gt;ORDER BY&lt;/code&gt; inside &lt;code&gt;OVER&lt;/code&gt; for &lt;code&gt;ROW_NUMBER&lt;/code&gt;/&lt;code&gt;RANK&lt;/code&gt;/&lt;code&gt;LAG&lt;/code&gt;/&lt;code&gt;LEAD&lt;/code&gt; — required; the result is non-deterministic without it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  PostgreSQL Interview Question on Top 3 Salaries Per Department
&lt;/h3&gt;

&lt;p&gt;Given &lt;code&gt;employees(emp_id, name, department, salary)&lt;/code&gt;, &lt;strong&gt;return the top 3 distinct salaries per department&lt;/strong&gt;, with ties at rank 3 included. Output &lt;code&gt;department&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;salary&lt;/code&gt;, and the rank.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt; in a CTE
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;DENSE_RANK&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the CTE &lt;code&gt;ranked&lt;/code&gt; materializes a per-department &lt;code&gt;DENSE_RANK&lt;/code&gt; keyed by salary descending — &lt;code&gt;dr = 1&lt;/code&gt; is the highest distinct salary in that department, &lt;code&gt;dr = 2&lt;/code&gt; is the second-highest, and so on; the outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; keeps every row whose salary is in the top three distinct salaries of its department, including all ties at rank 3; the &lt;code&gt;ORDER BY&lt;/code&gt; produces a deterministic, reviewer-friendly output. &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt; because the spec wants the top three &lt;em&gt;distinct&lt;/em&gt; salaries; &lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;ROW_NUMBER&lt;/code&gt; because ties at rank 3 must be retained.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the sample input:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;emp_id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;70,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;90,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;80,000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — partition by &lt;code&gt;department&lt;/code&gt;; order by &lt;code&gt;salary DESC&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; per partition&lt;/strong&gt; — eng: Alice → 1, Bob → 2, Carol → 2, Dan → 3, Eve → 4. sales: Frank → 1, Grace → 2, Heidi → 3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt;&lt;/strong&gt; — drops Eve (&lt;code&gt;dr = 4&lt;/code&gt;); keeps both Bob and Carol (tied at 2) and Dan (3).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY department, dr, name&lt;/code&gt;&lt;/strong&gt; — eng rows first, then sales; within department by &lt;code&gt;dr&lt;/code&gt;, then &lt;code&gt;name&lt;/code&gt; for tiebreak.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;dr&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eng&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Grace&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sales&lt;/td&gt;
&lt;td&gt;Heidi&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CTE &lt;code&gt;ranked&lt;/code&gt;&lt;/strong&gt; — names the intermediate ranked result; the outer query then filters it like a regular table. Far cleaner than a nested subquery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY department&lt;/code&gt;&lt;/strong&gt; — restarts the rank at each department boundary; without this, the rank is global and the answer is wrong.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/strong&gt; — defines "highest first" inside each partition; required for any deterministic ranking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DENSE_RANK&lt;/code&gt; over &lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — the spec wants the top three &lt;em&gt;distinct&lt;/em&gt; salaries; &lt;code&gt;RANK&lt;/code&gt; would skip after ties and miss the third distinct salary if there is a two-way tie above it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE dr &amp;lt;= 3&lt;/code&gt; in the outer&lt;/strong&gt; — window functions cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; of the same &lt;code&gt;SELECT&lt;/code&gt;; the CTE provides the materialized column the outer can filter on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(N log N)&lt;/code&gt; time&lt;/strong&gt; — sort within each partition dominates; with an index on &lt;code&gt;(department, salary DESC)&lt;/code&gt; the planner can read rows pre-sorted from the index instead of sorting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window-function practice problems&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE practice problems&lt;/a&gt; on PipeCode.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — window functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL window-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — CTE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL CTE problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — date functions&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL date-function problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to use this PostgreSQL cheat sheet effectively
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Hold the clause-order diagram in your head
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;FROM&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt;. Memorize this sentence and 80% of "weird" PostgreSQL parse errors decode themselves in five seconds. The error &lt;code&gt;column "x" does not exist&lt;/code&gt; almost always means you referenced a &lt;code&gt;SELECT&lt;/code&gt; alias in &lt;code&gt;WHERE&lt;/code&gt;; the error &lt;code&gt;aggregate functions are not allowed in WHERE&lt;/code&gt; means you wanted &lt;code&gt;HAVING&lt;/code&gt; instead.&lt;/p&gt;
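
&lt;p&gt;A minimal sketch of both failure modes and their fixes — the &lt;code&gt;employees&lt;/code&gt; table and its columns are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Fails with: column "total_comp" does not exist
-- (WHERE runs in stage 2; the SELECT alias is computed in stage 5)
SELECT salary + bonus AS total_comp
FROM employees
WHERE total_comp &amp;gt; 100000;

-- Fails with: aggregate functions are not allowed in WHERE
SELECT department, COUNT(*)
FROM employees
WHERE COUNT(*) &amp;gt; 5
GROUP BY department;

-- Fixes: repeat the expression in WHERE; move the aggregate predicate to HAVING
SELECT salary + bonus AS total_comp
FROM employees
WHERE salary + bonus &amp;gt; 100000;

SELECT department, COUNT(*)
FROM employees
GROUP BY department
HAVING COUNT(*) &amp;gt; 5;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;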

&lt;h3&gt;
  
  
  State the grain before any &lt;code&gt;JOIN&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Before writing the &lt;code&gt;JOIN&lt;/code&gt;, name the grain you're producing: "this is order-line grain", "this is customer-day grain", "this is &lt;code&gt;(customer, product)&lt;/code&gt; grain". The single most common bug in analytical SQL is &lt;code&gt;SUM(left.col)&lt;/code&gt; after a &lt;code&gt;1:N&lt;/code&gt; join — the metric is silently multiplied by &lt;code&gt;N&lt;/code&gt;. If grain doubles, you'll spot it immediately.&lt;/p&gt;
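
&lt;p&gt;A minimal sketch of the fan-out bug and its fix — &lt;code&gt;orders&lt;/code&gt; and &lt;code&gt;order_lines&lt;/code&gt; are illustrative, with one order fanning out to many lines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- WRONG: after the 1:N join, shipping_fee repeats once per line, so the SUM is inflated
SELECT SUM(o.shipping_fee) AS shipping_total
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id;

-- RIGHT: pre-aggregate the N side to order grain, then join 1:1
SELECT SUM(o.shipping_fee) AS shipping_total,
       SUM(l.line_revenue) AS revenue_total
FROM orders o
JOIN (
    SELECT order_id, SUM(qty * unit_price) AS line_revenue
    FROM order_lines
    GROUP BY order_id
) l ON l.order_id = o.order_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;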

&lt;h3&gt;
  
  
  Use &lt;code&gt;LEFT JOIN ... IS NULL&lt;/code&gt; over &lt;code&gt;NOT IN&lt;/code&gt; for anti-joins
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;NOT IN (subquery)&lt;/code&gt; returns zero rows when the subquery contains a single &lt;code&gt;NULL&lt;/code&gt; because &lt;code&gt;x NOT IN (..., NULL, ...)&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;, which fails the &lt;code&gt;WHERE&lt;/code&gt; predicate. &lt;code&gt;LEFT JOIN ... WHERE right.id IS NULL&lt;/code&gt; and &lt;code&gt;NOT EXISTS (...)&lt;/code&gt; are immune. Production engineers who have been burned once never write &lt;code&gt;NOT IN&lt;/code&gt; again.&lt;/p&gt;
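
&lt;p&gt;A minimal sketch, assuming an illustrative &lt;code&gt;customers&lt;/code&gt; / &lt;code&gt;orders&lt;/code&gt; pair where &lt;code&gt;orders.customer_id&lt;/code&gt; is nullable:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Returns ZERO rows if any orders.customer_id is NULL
SELECT *
FROM customers
WHERE id NOT IN (SELECT customer_id FROM orders);

-- NULL-safe anti-joins: both return customers with no orders
SELECT c.*
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.customer_id IS NULL;

SELECT c.*
FROM customers c
WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;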

&lt;h3&gt;
  
  
  Pick &lt;code&gt;DENSE_RANK&lt;/code&gt; for "Nth distinct"; pick &lt;code&gt;ROW_NUMBER&lt;/code&gt; for deduplication
&lt;/h3&gt;

&lt;p&gt;The single most-graded ranking distinction: &lt;strong&gt;&lt;code&gt;DENSE_RANK = N&lt;/code&gt; is the Nth distinct value; &lt;code&gt;RANK = N&lt;/code&gt; is the Nth row in skip-aware ranking order; &lt;code&gt;ROW_NUMBER = N&lt;/code&gt; is the Nth row in arbitrary order&lt;/strong&gt;. For "second-highest distinct salary" → &lt;code&gt;DENSE_RANK = 2&lt;/code&gt;. For "remove duplicate rows keeping the canonical one" → &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker) = 1&lt;/code&gt;.&lt;/p&gt;
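
&lt;p&gt;A minimal sketch of both patterns — &lt;code&gt;employees&lt;/code&gt; and &lt;code&gt;users&lt;/code&gt; are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Second-highest DISTINCT salary: DENSE_RANK = 2
WITH d AS (
    SELECT salary, DENSE_RANK() OVER (ORDER BY salary DESC) AS dr
    FROM employees
)
SELECT DISTINCT salary FROM d WHERE dr = 2;

-- Deduplicate: keep one canonical row per email (latest updated_at wins, id breaks ties)
WITH r AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC, id) AS rn
    FROM users
)
SELECT * FROM r WHERE rn = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;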

&lt;h3&gt;
  
  
  Use &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; for one-pass conditional metrics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SUM(amount) FILTER (WHERE status = 'refunded')&lt;/code&gt; is cleaner than &lt;code&gt;SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END)&lt;/code&gt; — PostgreSQL supports both. Use &lt;code&gt;FILTER&lt;/code&gt; in PostgreSQL-only code, &lt;code&gt;CASE&lt;/code&gt; for cross-dialect portability. One scan, many metrics, half the cost of two queries.&lt;/p&gt;
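
&lt;p&gt;A one-pass sketch with several conditional metrics — the &lt;code&gt;payments&lt;/code&gt; table is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    COUNT(*)                                        AS all_payments,
    SUM(amount) FILTER (WHERE status = 'refunded')  AS refunded_total,
    COUNT(*)    FILTER (WHERE status = 'failed')    AS failed_count,
    -- portable cross-dialect equivalent of refunded_total
    SUM(CASE WHEN status = 'refunded' THEN amount ELSE 0 END) AS refunded_total_portable
FROM payments;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;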

&lt;h3&gt;
  
  
  Always &lt;code&gt;ORDER BY&lt;/code&gt; + tiebreaker; pair &lt;code&gt;LIMIT&lt;/code&gt; with &lt;code&gt;ORDER BY&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Window functions, &lt;code&gt;LIMIT N&lt;/code&gt;, and "top result" queries all require an &lt;code&gt;ORDER BY&lt;/code&gt; with a &lt;em&gt;deterministic&lt;/em&gt; tiebreaker (e.g., &lt;code&gt;ORDER BY salary DESC, name&lt;/code&gt;). Without one, two runs of the same query can return different rows in the tie band — silently wrong in production and visibly wrong in an interview if the reviewer's reference answer locks an ordering.&lt;/p&gt;
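
&lt;p&gt;A minimal sketch on the same illustrative &lt;code&gt;employees&lt;/code&gt; table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Non-deterministic: two rows tied on salary may swap between runs
SELECT name, salary FROM employees ORDER BY salary DESC LIMIT 3;

-- Deterministic: the name tiebreaker pins the order inside the tie band
SELECT name, salary FROM employees ORDER BY salary DESC, name LIMIT 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;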

&lt;h3&gt;
  
  
  Use PostgreSQL-specific helpers — &lt;code&gt;EXTRACT&lt;/code&gt;, &lt;code&gt;DATE_TRUNC&lt;/code&gt;, &lt;code&gt;INTERVAL&lt;/code&gt;, &lt;code&gt;::DATE&lt;/code&gt; cast
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;EXTRACT(MONTH FROM ts)&lt;/code&gt;, &lt;code&gt;DATE_TRUNC('month', ts)&lt;/code&gt;, &lt;code&gt;ts - INTERVAL '1 month'&lt;/code&gt;, &lt;code&gt;ts::DATE&lt;/code&gt;. These four cover 95% of date arithmetic. Reach for &lt;code&gt;DATE_TRUNC&lt;/code&gt; whenever the spec says "by month" or "by week" — it groups timestamps to the bucket boundary deterministically.&lt;/p&gt;
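
&lt;p&gt;A minimal "revenue by month" sketch — the &lt;code&gt;orders&lt;/code&gt; table and its columns are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Bucket timestamps to month boundaries; look back six whole months
SELECT DATE_TRUNC('month', order_ts) AS month,
       SUM(amount)                   AS revenue
FROM orders
WHERE order_ts &amp;gt;= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '6 months'
GROUP BY 1
ORDER BY 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;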

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice surface&lt;/a&gt; for the all-language SQL corpus. Drill the four-primitive pages: &lt;a href="https://pipecode.ai/explore/practice/topic/filtering/sql" rel="noopener noreferrer"&gt;SQL filtering&lt;/a&gt; for &lt;code&gt;WHERE&lt;/code&gt; patterns, &lt;a href="https://pipecode.ai/explore/practice/topic/joins/sql" rel="noopener noreferrer"&gt;SQL joins&lt;/a&gt; for join shapes, &lt;a href="https://pipecode.ai/explore/practice/topic/aggregation/sql" rel="noopener noreferrer"&gt;SQL aggregation&lt;/a&gt; for &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions/sql" rel="noopener noreferrer"&gt;SQL window functions&lt;/a&gt; for ranking and lookback. Add adjacent topics: &lt;a href="https://pipecode.ai/explore/practice/topic/cte" rel="noopener noreferrer"&gt;SQL CTE&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/subqueries/sql" rel="noopener noreferrer"&gt;SQL subqueries&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/null-handling/sql" rel="noopener noreferrer"&gt;SQL null-handling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/date-functions/sql" rel="noopener noreferrer"&gt;SQL date functions&lt;/a&gt;. The &lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;interview courses page&lt;/a&gt; bundles structured curricula — start with &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;. For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or read the related &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; and &lt;a href="https://pipecode.ai/blogs/data-lake-architecture-data-engineering-interviews" rel="noopener noreferrer"&gt;data lake architecture for data engineering interviews&lt;/a&gt; blogs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is the logical clause order in a PostgreSQL query?
&lt;/h3&gt;

&lt;p&gt;PostgreSQL evaluates clauses in the order &lt;strong&gt;&lt;code&gt;FROM&lt;/code&gt; / &lt;code&gt;JOIN&lt;/code&gt; → &lt;code&gt;WHERE&lt;/code&gt; → &lt;code&gt;GROUP BY&lt;/code&gt; → &lt;code&gt;HAVING&lt;/code&gt; → &lt;code&gt;SELECT&lt;/code&gt; → &lt;code&gt;ORDER BY&lt;/code&gt; → &lt;code&gt;LIMIT&lt;/code&gt; / &lt;code&gt;OFFSET&lt;/code&gt;&lt;/strong&gt;, regardless of the order you write them. This is why &lt;code&gt;WHERE&lt;/code&gt; cannot reference aggregate functions (they don't exist until after &lt;code&gt;GROUP BY&lt;/code&gt;) and why &lt;code&gt;SELECT&lt;/code&gt;-level aliases cannot be referenced in &lt;code&gt;WHERE&lt;/code&gt; (they're computed in stage 5). Aliases become available in &lt;code&gt;ORDER BY&lt;/code&gt; (and, as a PostgreSQL extension, in &lt;code&gt;GROUP BY&lt;/code&gt;), or in an outer query when the &lt;code&gt;SELECT&lt;/code&gt; is nested.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;HAVING&lt;/code&gt; in PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;WHERE&lt;/code&gt; filters individual rows &lt;strong&gt;before&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference only raw row columns. &lt;code&gt;HAVING&lt;/code&gt; filters whole groups &lt;strong&gt;after&lt;/strong&gt; the &lt;code&gt;GROUP BY&lt;/code&gt; step and can reference aggregate functions like &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(col)&lt;/code&gt;, &lt;code&gt;AVG(col)&lt;/code&gt;. Trying to use an aggregate in &lt;code&gt;WHERE&lt;/code&gt; (e.g., &lt;code&gt;WHERE COUNT(*) &amp;gt; 1&lt;/code&gt;) is a parse error because the aggregate does not yet exist. Both clauses can appear in the same query.&lt;/p&gt;
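
&lt;p&gt;A minimal sketch with both clauses in one query — the &lt;code&gt;orders&lt;/code&gt; table is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customer_id, COUNT(*) AS order_count
FROM orders
WHERE status = 'completed'   -- row-level filter, runs before GROUP BY
GROUP BY customer_id
HAVING COUNT(*) &amp;gt; 10;        -- group-level filter, runs after GROUP BY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;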

&lt;h3&gt;
  
  
  How do I find rows in table A that have no match in table B?
&lt;/h3&gt;

&lt;p&gt;The canonical PostgreSQL pattern is &lt;code&gt;SELECT a.* FROM a LEFT JOIN b ON b.fk = a.pk WHERE b.pk IS NULL&lt;/code&gt; — the &lt;code&gt;LEFT JOIN&lt;/code&gt; preserves every left row, and the &lt;code&gt;WHERE b.pk IS NULL&lt;/code&gt; filter keeps only the ones where no right-side match was found. This is the &lt;strong&gt;anti-join&lt;/strong&gt; pattern. Equivalent to &lt;code&gt;WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.fk = a.pk)&lt;/code&gt;. Both are safer than &lt;code&gt;NOT IN (subquery)&lt;/code&gt;, which returns zero rows if the subquery contains a single &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, and &lt;code&gt;ROW_NUMBER&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;All three assign integers within a window. &lt;code&gt;ROW_NUMBER&lt;/code&gt; gives every row a unique sequential integer (&lt;code&gt;1, 2, 3, 4&lt;/code&gt;), even on ties. &lt;code&gt;RANK&lt;/code&gt; gives tied rows the same rank but skips after them (&lt;code&gt;1, 2, 2, 4&lt;/code&gt;). &lt;code&gt;DENSE_RANK&lt;/code&gt; gives tied rows the same rank with no skip (&lt;code&gt;1, 2, 2, 3&lt;/code&gt;). For "Nth distinct value" use &lt;code&gt;DENSE_RANK = N&lt;/code&gt;; for "Nth row in skip-aware ranking order" use &lt;code&gt;RANK = N&lt;/code&gt;; for "Nth row in arbitrary order" or "deduplicate keeping one canonical row" use &lt;code&gt;ROW_NUMBER = 1&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does &lt;code&gt;FILTER (WHERE …)&lt;/code&gt; do in PostgreSQL aggregates?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;SUM(col) FILTER (WHERE pred)&lt;/code&gt; and &lt;code&gt;COUNT(*) FILTER (WHERE pred)&lt;/code&gt; apply the aggregate only to rows where the predicate is &lt;code&gt;TRUE&lt;/code&gt;; rows where the predicate is &lt;code&gt;FALSE&lt;/code&gt; or &lt;code&gt;NULL&lt;/code&gt; are skipped for &lt;em&gt;that aggregate&lt;/em&gt;, while other aggregates in the same &lt;code&gt;SELECT&lt;/code&gt; still see them. The portable cross-dialect equivalent is &lt;code&gt;SUM(CASE WHEN pred THEN col ELSE 0 END)&lt;/code&gt; and &lt;code&gt;COUNT(CASE WHEN pred THEN 1 END)&lt;/code&gt;. Use &lt;code&gt;FILTER&lt;/code&gt; for clarity in PostgreSQL-only code.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I compute a running total in PostgreSQL?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;SUM(col) OVER (PARTITION BY p ORDER BY o)&lt;/code&gt; — the window aggregate accumulates from the start of each partition through the current row in the order defined by &lt;code&gt;ORDER BY&lt;/code&gt;. Example: &lt;code&gt;SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date)&lt;/code&gt; gives a per-customer running total of order amounts ordered by date. Drop &lt;code&gt;PARTITION BY&lt;/code&gt; for a single global running total.&lt;/p&gt;
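
&lt;p&gt;A minimal sketch — table and column names are illustrative; the explicit &lt;code&gt;ROWS&lt;/code&gt; frame avoids the default &lt;code&gt;RANGE&lt;/code&gt; frame, which lumps together peer rows that tie on the &lt;code&gt;ORDER BY&lt;/code&gt; value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (
           PARTITION BY customer_id
           ORDER BY order_date, order_id   -- tiebreaker keeps the total deterministic
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS running_total
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;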

&lt;h3&gt;
  
  
  Why is &lt;code&gt;LIMIT 5&lt;/code&gt; returning different rows on different runs?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;LIMIT&lt;/code&gt; without &lt;code&gt;ORDER BY&lt;/code&gt; is non-deterministic — PostgreSQL returns whatever rows it sees first, which depends on the query plan, parallelism, and table physical layout. Always pair &lt;code&gt;LIMIT&lt;/code&gt; with &lt;code&gt;ORDER BY &amp;lt;col&amp;gt; DESC, &amp;lt;tiebreaker&amp;gt;&lt;/code&gt; so two runs return the same rows. Reviewers depend on stable ordering, and dashboards break silently when row order drifts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing PostgreSQL SQL problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Lake Architecture for Data Engineering Interviews</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:20:43 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/data-lake-architecture-for-data-engineering-interviews-32e1</link>
      <guid>https://forem.com/gowthampotureddi/data-lake-architecture-for-data-engineering-interviews-32e1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data lake architecture&lt;/strong&gt; questions in data-engineering interviews almost always reduce to four primitives: &lt;strong&gt;medallion zones (bronze → silver → gold) for progressive refinement, an ingestion → metadata catalog → compute flow on object storage, the lake vs cloud warehouse vs lakehouse decision driven by open table formats (Iceberg, Delta, Hudi), and a disciplined answer shape that covers grain, idempotency, lineage, and aggregate reconciliation&lt;/strong&gt;. Whether the prompt is "design our analytics lake from scratch", "how would you land CDC from Postgres into the lake", "when would you pick a lakehouse over a warehouse", or "why do counts drift between the lake and the source app", interviewers grade the same handful of mental models — and candidates who skip straight to vendor names without naming the primitives lose the round.&lt;/p&gt;

&lt;p&gt;This guide walks four topic clusters end-to-end, each with a &lt;strong&gt;detailed topic explanation&lt;/strong&gt;, &lt;strong&gt;per-sub-topic explanation with a worked example and its solution&lt;/strong&gt;, &lt;strong&gt;common beginner mistakes&lt;/strong&gt;, and an &lt;strong&gt;interview-style scenario with a full answer&lt;/strong&gt; that traces the design step by step. Every section ends with a concept-by-concept breakdown that explains why the design works, what it costs, and where beginners typically slip. Storage examples assume an S3-style object store on the cloud, but every primitive transfers to GCS, Azure Blob / ADLS, or any other modern object backend.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogczu08vrn78ssklfoxg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fogczu08vrn78ssklfoxg.webp" alt="Bold blog header for data lake architecture and data engineering interviews with PipeCode branding, layered storage stack icon in purple, green, and orange, and pipecode.ai attribution on a dark gradient background." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Top data lake architecture interview topics
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;four numbered sections below&lt;/strong&gt; follow this &lt;strong&gt;topic map&lt;/strong&gt; — one row per &lt;strong&gt;H2&lt;/strong&gt;, every row expanded into a full section with sub-topics, a worked scenario, an interview-style design question, and a step-by-step solution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Why it shows up in DE interviews&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Bronze / silver / gold medallion zones&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Progressive refinement is the single biggest lake-architecture concept; interviewers grade whether you know which transformations belong in landing/bronze vs refined/silver vs curated/gold and how SLAs differ per layer.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ingestion → catalog → compute flow on object storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sources land into S3/GCS/ABS, register in a Hive/Glue/Unity catalog, and are queried by Spark, Trino, or warehouse external tables; the small-file problem, partition pruning, and schema evolution all live here.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Lake vs cloud warehouse vs lakehouse — and Iceberg / Delta / Hudi&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The pattern-selection question is canonical; open table formats are what turn a lake into a lakehouse and bring ACID, time travel, and partition evolution to object storage.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Interview answer shape — grain, idempotency, lineage, reconciliation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Even system-design rounds reduce to a five-step template: clarify grain, separate landing from conformed, make loads idempotent, attach lineage keys, and reconcile aggregates against the source.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Beginner-friendly framing:&lt;/strong&gt; A data lake is &lt;strong&gt;cheap, durable object storage&lt;/strong&gt; plus &lt;strong&gt;conventions for layout, metadata, and processing&lt;/strong&gt;. The "lake vs warehouse" decision is rarely binary — most large organizations run a blend, with the lake handling flexible high-volume ingestion and ML feature stores while a warehouse or lakehouse handles curated SQL analytics. Interviews test whether you can place each workload on the right side of that line and explain the trade-offs without reaching for vendor names as a substitute for first principles.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  1. Bronze / Silver / Gold Medallion Zones for Data Lake Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Progressive refinement through landing/bronze, refined/silver, and curated/gold zones
&lt;/h3&gt;

&lt;p&gt;"Walk me through how you would lay out an analytics lake from scratch" is the signature opening prompt — and the cleanest answer is &lt;strong&gt;medallion architecture&lt;/strong&gt; with three numbered zones. The mental model: &lt;strong&gt;landing/bronze is an append-only mirror of the source payloads with minimal transformation; refined/silver applies dedup, type coercion, and conformed business keys; curated/gold publishes subject-area tables and star-schema facts/dims that downstream applications and BI tools consume&lt;/strong&gt;. Each zone has a different SLA, different read/write permissions, and different retention. The names vary across vendors — Databricks coined "bronze/silver/gold", AWS uses "raw/curated/consumption", Microsoft uses "landing/refined/analytics" — but the three-tier shape is universal.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu6o9g0wg8ccvecdn3jt.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgu6o9g0wg8ccvecdn3jt.webp" alt="Medallion zone diagram showing landing/bronze (raw, append-only) flowing into refined/silver (dedupe, type) flowing into curated/gold (star and subject tables analytics-ready) on a dark PipeCode-branded card with green and purple accents." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When you whiteboard the medallion zones, label each box with &lt;strong&gt;who writes&lt;/strong&gt;, &lt;strong&gt;who reads&lt;/strong&gt;, and &lt;strong&gt;what breaks if the job reruns&lt;/strong&gt;. Idempotent writes and clear grain matter as much in a lake as they do in a warehouse — interviewers grade the candidate who naturally adds these annotations without prompting.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Landing / bronze — append-only mirror of source payloads
&lt;/h4&gt;

&lt;p&gt;The landing-zone invariant: &lt;strong&gt;bronze is an append-only, immutable copy of the source payload with minimal transformation; the schema is captured but not enforced; partitioning is by &lt;code&gt;ingest_date&lt;/code&gt; (or &lt;code&gt;ingest_hour&lt;/code&gt; for high-frequency sources); replays are safe because writes never overwrite&lt;/strong&gt;. The zone optimizes for fidelity and replay, not query performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-only writes&lt;/strong&gt; — every batch produces a new file under a date-partitioned prefix; &lt;code&gt;MERGE&lt;/code&gt; and &lt;code&gt;UPDATE&lt;/code&gt; are forbidden.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Source-payload fidelity&lt;/strong&gt; — store the raw shape (JSON, Avro, CSV, Parquet snapshot) plus an &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;source_ts&lt;/code&gt; per row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition by &lt;code&gt;ingest_date&lt;/code&gt;&lt;/strong&gt; — makes back-fill, replay, and audit trivially scoped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retention&lt;/strong&gt; — keep 30–90 days at a minimum; audits and reconciliations need historical bronze.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Postgres CDC pipeline lands daily JSON snapshots into &lt;code&gt;s3://analytics-lake/bronze/orders/&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;prefix&lt;/th&gt;
&lt;th&gt;files&lt;/th&gt;
&lt;th&gt;purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-11/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 11 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-12/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 12 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;bronze/orders/ingest_date=2026-04-13/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;part-00000.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Apr 13 snapshot&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The source app emits one JSON snapshot per day at 02:00 UTC.&lt;/li&gt;
&lt;li&gt;The ingestion job lands each snapshot under a calendar-keyed prefix &lt;code&gt;bronze/orders/ingest_date=YYYY-MM-DD/&lt;/code&gt; so partition pruning works for any date filter downstream.&lt;/li&gt;
&lt;li&gt;Each batch is also stamped with a unique &lt;code&gt;ingest_id&lt;/code&gt; (timestamp + UUID) sub-prefix so retries write fresh files instead of overwriting a previous attempt.&lt;/li&gt;
&lt;li&gt;Files inside a partition are append-only &lt;code&gt;part-NNNNN.json&lt;/code&gt;; bronze never edits a written file — corrected payloads land as new files under a new &lt;code&gt;ingest_id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;After three days you have three day-partitions; each is independently re-readable with &lt;code&gt;WHERE ingest_date = 'YYYY-MM-DD'&lt;/code&gt; and any single day can be replayed without touching the others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A landing-zone object layout:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://analytics-lake/bronze/orders/
  ingest_date=2026-04-13/
    ingest_id=20260413T0200Z/
      part-00000.json
      part-00001.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never edit a bronze file. If a payload is wrong, drop a corrected file under a new &lt;code&gt;ingest_id&lt;/code&gt; and let the silver-layer dedup logic resolve it; never overwrite history.&lt;/p&gt;

&lt;h4&gt;
  
  
  Refined / silver — deduped, typed, conformed business keys
&lt;/h4&gt;

&lt;p&gt;The refined-zone invariant: &lt;strong&gt;silver applies dedup against natural or business keys, coerces types to a canonical schema, conforms key columns across sources, and may emit slowly-changing-dimension (SCD) history; the zone is the single source of truth for downstream application code and most analyst SQL&lt;/strong&gt;. Idempotency at the silver layer is non-negotiable — re-running a daily job must produce byte-identical output.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dedup on &lt;code&gt;(business_key, source_ts)&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY business_key ORDER BY source_ts DESC) = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type coercion&lt;/strong&gt; — JSON strings → typed columns; epoch ms → &lt;code&gt;TIMESTAMP&lt;/code&gt;; cents → &lt;code&gt;DECIMAL(18,2)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed dimensions&lt;/strong&gt; — &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;geo_id&lt;/code&gt; mapped to one canonical form across every source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SCD type 2&lt;/strong&gt; — emit &lt;code&gt;(valid_from, valid_to, is_current)&lt;/code&gt; columns when downstream consumers need point-in-time joins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Bronze &lt;code&gt;orders&lt;/code&gt; rows arrive twice for &lt;code&gt;order_id=448&lt;/code&gt; due to a CDC retry; silver dedup keeps the latest.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;source_ts&lt;/th&gt;
&lt;th&gt;bronze_rn&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-04-12 09:30:00&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;td&gt;2026-04-12 09:30:15&lt;/td&gt;
&lt;td&gt;1 (kept)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;449&lt;/td&gt;
&lt;td&gt;2026-04-12 10:00:00&lt;/td&gt;
&lt;td&gt;1 (kept)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze contains three rows for &lt;code&gt;ingest_date = 2026-04-12&lt;/code&gt;: two for &lt;code&gt;order_id = 448&lt;/code&gt; (a CDC retry produced two payloads at 09:30:00 and 09:30:15) and one for &lt;code&gt;order_id = 449&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC)&lt;/code&gt; numbers rows independently inside each &lt;code&gt;order_id&lt;/code&gt; group, with the latest &lt;code&gt;source_ts&lt;/code&gt; getting &lt;code&gt;rn = 1&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;order_id = 448&lt;/code&gt;: the row at 09:30:15 is later, so it gets &lt;code&gt;rn = 1&lt;/code&gt;; the 09:30:00 row gets &lt;code&gt;rn = 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For &lt;code&gt;order_id = 449&lt;/code&gt;: only one row, so it gets &lt;code&gt;rn = 1&lt;/code&gt; automatically.&lt;/li&gt;
&lt;li&gt;The outer &lt;code&gt;WHERE rn = 1&lt;/code&gt; keeps two rows — the latest &lt;code&gt;order_id = 448&lt;/code&gt; and the only &lt;code&gt;order_id = 449&lt;/code&gt; — and silently drops the duplicate, producing a deterministic single-row-per-business-key silver table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
           &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
               &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;
               &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
           &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-12'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;order_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source_ts&lt;/span&gt;                         &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;as_of_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;silver_loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;ranked&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the silver zone is where ETL bugs hide — invest in unit-tested dedup logic, schema-evolution tests, and aggregate reconciliation against bronze totals before promoting to gold.&lt;/p&gt;

&lt;h4&gt;
  
  
  Curated / gold — subject-area tables and star schemas
&lt;/h4&gt;

&lt;p&gt;The curated-zone invariant: &lt;strong&gt;gold publishes tables shaped for downstream consumption: dimensional models (fact tables + conformed dimensions), subject-area marts, or one-big-table (OBT) flattenings; SLAs are stricter, freshness is tracked, and consumer contracts are explicit&lt;/strong&gt;. Each gold table maps to exactly one consumer class — analysts, dashboards, ML feature pipelines, or reverse-ETL into operational systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Star schema&lt;/strong&gt; — &lt;code&gt;fact_orders&lt;/code&gt; joined to &lt;code&gt;dim_customer&lt;/code&gt;, &lt;code&gt;dim_product&lt;/code&gt;, &lt;code&gt;dim_date&lt;/code&gt;; one row per business event.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subject-area marts&lt;/strong&gt; — domain-scoped denormalized tables (e.g., &lt;code&gt;mart_marketing_attribution&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OBT flattening&lt;/strong&gt; — when consumers prefer one wide table over a join (Looker, Power BI dashboards).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer contracts&lt;/strong&gt; — column types, refresh cadence, breakage policy declared in &lt;code&gt;dbt&lt;/code&gt;-style metadata.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A gold star schema for the orders subject area.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;table&lt;/th&gt;
&lt;th&gt;grain&lt;/th&gt;
&lt;th&gt;example columns&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fact_orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per order line&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_id&lt;/code&gt;, &lt;code&gt;line_id&lt;/code&gt;, &lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;date_key&lt;/code&gt;, &lt;code&gt;qty&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_customer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per customer (SCD2)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;customer_id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;valid_from&lt;/code&gt;, &lt;code&gt;valid_to&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_product&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per product&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;product_id&lt;/code&gt;, &lt;code&gt;category&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;dim_date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;one row per calendar date&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;date_key&lt;/code&gt;, &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;iso_week&lt;/code&gt;, &lt;code&gt;is_weekend&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;fact_orders&lt;/code&gt; is the central transactional table at order-line grain — one row per line item, with numeric measures (&lt;code&gt;qty&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;) and foreign-key columns to every dimension.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_customer&lt;/code&gt; is an SCD2 dimension: a single real-world customer can appear in multiple rows over time, each with &lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; / &lt;code&gt;is_current&lt;/code&gt; columns to capture historical attribute changes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_product&lt;/code&gt; is a simpler Type-1 dimension: one row per product, current state only — overwrites on update with no history.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;dim_date&lt;/code&gt; is the conformed date dimension: one row per calendar date with pre-computed week, month, quarter, year, and &lt;code&gt;is_weekend&lt;/code&gt; columns so dashboards never have to compute date math at query time.&lt;/li&gt;
&lt;li&gt;Joins from &lt;code&gt;fact_orders&lt;/code&gt; to each dimension use the surrogate keys (&lt;code&gt;customer_key&lt;/code&gt;, &lt;code&gt;product_key&lt;/code&gt;, &lt;code&gt;date_key&lt;/code&gt;) — never the natural business IDs — so SCD2 history is preserved when the same customer's row evolves over time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;line_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;qty&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_lines&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
 &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_from&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;dc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;valid_to&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="n"&gt;dp&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="n"&gt;dd&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;dd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; gold tables are the only zone consumers should reference by name; if a dashboard reads silver directly, your contract is leaking. Use views or feature-flagged exposures rather than letting consumers couple to interim grains.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating bronze as a junk drawer with no &lt;code&gt;ingest_date&lt;/code&gt; partitioning — replay and audit become impossible.&lt;/li&gt;
&lt;li&gt;Doing dedup at gold instead of silver — every downstream job has to repeat the work and answers diverge.&lt;/li&gt;
&lt;li&gt;Letting consumers query silver directly — silver schemas can change without notice; gold contracts are explicit.&lt;/li&gt;
&lt;li&gt;Skipping &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;source_ts&lt;/code&gt; lineage columns — when counts drift, you have no way to reconstruct what landed when.&lt;/li&gt;
&lt;li&gt;Mixing batch and streaming writes into the same bronze prefix without a partition key for write-mode — late arrivals overwrite eager batches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on Designing Layered Zones
&lt;/h3&gt;

&lt;p&gt;A team dumps daily JSON exports of &lt;code&gt;orders&lt;/code&gt; into a single S3 prefix. Analysts complain that order counts drift versus the source application by 0.5–2% on most days. &lt;strong&gt;Design a three-zone medallion layout that fixes the drift, makes the discrepancy investigable, and supports daily reruns without producing duplicates.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Bronze (append-only) + Silver (dedup) + Gold (star schema)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Move existing daily dumps into:
     s3://analytics-lake/bronze/orders/ingest_date=YYYY-MM-DD/ingest_id=&amp;lt;batch&amp;gt;/
   Append-only; never overwrite a date partition.

2. Build silver/orders as a daily MERGE that:
     - Dedups bronze rows by ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY source_ts DESC) = 1
     - Coerces JSON fields to a typed schema
     - Joins against dim_customer / dim_product on conformed keys
     - Carries ingest_id + source_ts as lineage columns

3. Promote to gold/fact_orders only after a silver-vs-source aggregate-reconciliation job
   passes a tolerance threshold (e.g., |silver_count - source_count| / source_count &amp;lt; 0.001).

4. Surface a row-count + revenue-sum dashboard sourced from BOTH bronze and the source app's
   replica, so any future drift surfaces within one ingest cycle.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the append-only bronze layer makes the discrepancy &lt;em&gt;investigable&lt;/em&gt; — every historical payload is preserved with &lt;code&gt;ingest_id&lt;/code&gt; and &lt;code&gt;ingest_date&lt;/code&gt;, so analysts can replay any day's source state; the silver dedup converts CDC retries and late-arriving rows into a deterministic single row per &lt;code&gt;order_id&lt;/code&gt;; the gold layer is gated by an aggregate-reconciliation step that catches drift before it reaches dashboards; and the dual-source row-count dashboard surfaces residual drift immediately. The combination addresses both the &lt;em&gt;prevention&lt;/em&gt; (idempotent dedup) and &lt;em&gt;detection&lt;/em&gt; (reconciliation + dashboard) sides of the failure mode.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the drift scenario on 2026-04-12:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;observation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;bronze ingests &lt;code&gt;ingest_date=2026-04-12&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;12,847 raw rows including 12 CDC retries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;silver dedup keeps &lt;code&gt;rn = 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;12,835 unique &lt;code&gt;order_id&lt;/code&gt;s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;source-app replica reports&lt;/td&gt;
&lt;td&gt;12,835 orders for 2026-04-12&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;reconciliation passes&lt;/td&gt;
&lt;td&gt;drift = 0 / 12,835 = 0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;promote to gold/fact_orders&lt;/td&gt;
&lt;td&gt;12,835 fact rows; counts match dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the fixed-state contract per ingest day:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;bronze&lt;/th&gt;
&lt;th&gt;silver&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;gold&lt;/th&gt;
&lt;th&gt;drift&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row count&lt;/td&gt;
&lt;td&gt;12,847&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;total revenue&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Append-only bronze with &lt;code&gt;ingest_date&lt;/code&gt; partitioning&lt;/strong&gt; — every payload is preserved and addressable; replay is a &lt;code&gt;WHERE ingest_date = ...&lt;/code&gt; filter rather than a re-ingest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Silver dedup via &lt;code&gt;ROW_NUMBER&lt;/code&gt; over &lt;code&gt;(order_id ORDER BY source_ts DESC)&lt;/code&gt;&lt;/strong&gt; — collapses CDC retries to a deterministic single row per business key; idempotent on rerun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage columns &lt;code&gt;ingest_id&lt;/code&gt; + &lt;code&gt;source_ts&lt;/code&gt;&lt;/strong&gt; — every silver row points back to a specific bronze file and source moment; forensic debugging is one join away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate reconciliation gate before gold&lt;/strong&gt; — drift cannot reach dashboards because gold is gated on &lt;code&gt;|silver - source| / source &amp;lt; threshold&lt;/code&gt;; failures page the on-call rather than silently corrupting the BI tool (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dual-source dashboard&lt;/strong&gt; — surfaces drift instantly even when reconciliation isn't perfect; the early-warning loop pays for itself at the first incident.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;O(|bronze|)&lt;/code&gt; time per day&lt;/strong&gt; — a single linear scan plus a window function for dedup; reconciliation adds one aggregate per zone, negligible compared to ingest cost.&lt;/li&gt;
&lt;/ul&gt;
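
&lt;p&gt;A minimal sketch of that reconciliation gate — schema, table, and column names are illustrative, and the 0.1% tolerance matches the earlier plan:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical promotion gate: compare silver against the source replica for one
-- ingest day and return a row only when relative drift breaches the tolerance.
WITH s AS (
    SELECT COUNT(*) AS silver_count
    FROM silver.orders
    WHERE ingest_date = DATE '2026-04-12'
),
r AS (
    SELECT COUNT(*) AS source_count
    FROM source_replica.orders
    WHERE created_at::DATE = DATE '2026-04-12'
)
SELECT silver_count,
       source_count,
       ABS(silver_count - source_count)::NUMERIC / source_count AS drift
FROM s CROSS JOIN r
WHERE ABS(silver_count - source_count)::NUMERIC / source_count &amp;gt;= 0.001;
-- Non-empty result =&amp;gt; block the gold promotion and page the on-call.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;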

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for medallion-zone problems and the &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling practice page&lt;/a&gt; for star-schema patterns at the gold layer.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL pipelines&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — dimensional modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Dimensional modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Ingestion → Catalog → Compute Flow on Object Storage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sources to query engines through metadata catalogs in data lake architecture
&lt;/h3&gt;

&lt;p&gt;"How does data physically get from a Postgres source into a query engine like Spark or Trino on the lake?" is the signature design follow-up — and the cleanest answer is the &lt;strong&gt;ingest → register → query&lt;/strong&gt; flow with three distinct components. The mental model: &lt;strong&gt;sources (databases, APIs, streaming platforms, file feeds) ingest into object storage as files; a metadata catalog (Hive Metastore, AWS Glue, Unity Catalog, Polaris, Iceberg REST catalog) maps logical tables to physical file paths and column schemas; compute engines (Spark, Trino, Presto, DuckDB-in-the-cloud, Snowflake external tables) read the catalog to discover tables and read the object store to fetch data&lt;/strong&gt;. The decoupling is the entire value proposition — many engines can read the same footprint, and storage scales independently from compute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetes9pdzu83ptnq03zpd.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetes9pdzu83ptnq03zpd.webp" alt="Architecture flow diagram showing sources (DB, API, files, stream) ingesting into object storage lake, registering into metadata catalog, then compute and query engines (Spark, SQL) reading via curated-read paths in PipeCode brand styling." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; When the interviewer asks "where does Spark get the schema from?", the answer is the &lt;strong&gt;catalog&lt;/strong&gt;, not the file. Files (Parquet, ORC, Avro) carry their own schema in the footer, but the catalog is what makes a logical table addressable across sessions and engines. State this distinction explicitly — it separates candidates who learned data lake architecture by reading docs from those who learned by debugging production.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Object storage as the storage layer — S3, GCS, ADLS
&lt;/h4&gt;

&lt;p&gt;The object-store invariant: &lt;strong&gt;modern lakes use cloud object storage (Amazon S3, Google Cloud Storage, Azure Data Lake Storage / ADLS Gen2) rather than HDFS; storage is infinitely scalable, durable, and decoupled from compute, with eventual-consistency semantics that the table format is responsible for masking&lt;/strong&gt;. Files are typically Parquet (columnar, compressed) or ORC; Avro shows up in streaming pipelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hive-style partitioning&lt;/strong&gt; — &lt;code&gt;s3://bucket/table/col=value/file.parquet&lt;/code&gt; for partition pruning at query time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File sizes&lt;/strong&gt; — target 128MB-1GB per file; smaller files trigger the small-file problem (excessive metadata, slow planning).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compaction&lt;/strong&gt; — periodic batch jobs that rewrite many small files into fewer large ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual consistency&lt;/strong&gt; — S3 was eventually consistent for many years; the table format handles the retry / commit semantics that mask this from queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Hive-style partition layout for a daily-loaded &lt;code&gt;orders&lt;/code&gt; table in silver.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;prefix&lt;/th&gt;
&lt;th&gt;role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;table root&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;…/ingest_date=2026-04-13/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;partition value&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;…/ingest_date=2026-04-13/part-00000.parquet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;data file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;…/_delta_log/&lt;/code&gt; or &lt;code&gt;…/metadata/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;table-format metadata (if Delta/Iceberg)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The table root &lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt; is the registered location in the catalog; everything under it belongs to one logical table.&lt;/li&gt;
&lt;li&gt;Each child prefix &lt;code&gt;ingest_date=YYYY-MM-DD/&lt;/code&gt; is one Hive partition value; the &lt;code&gt;key=value&lt;/code&gt; syntax is the convention every engine (Spark, Trino, Athena, Snowflake) recognizes.&lt;/li&gt;
&lt;li&gt;Inside each partition, multiple Parquet files (~180MB each) split the data so a Spark reader can fetch them in parallel; the file count is bounded by your micro-batch size.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;_delta_log/&lt;/code&gt; (Delta) or &lt;code&gt;metadata/&lt;/code&gt; (Iceberg) prefix holds the table-format commit log — a sequence of JSON files describing every transaction, which is what gives you ACID and time travel on top of plain object storage.&lt;/li&gt;
&lt;li&gt;A query with &lt;code&gt;WHERE ingest_date = '2026-04-13'&lt;/code&gt; triggers partition pruning: the planner reads only files under that one prefix, skipping every other day's files entirely — the difference between 200ms and 60s.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Object layout for a partitioned silver table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://analytics-lake/silver/orders/
  ingest_date=2026-04-13/
    part-00000.parquet  (180MB, 1.2M rows)
    part-00001.parquet  (165MB, 1.1M rows)
  ingest_date=2026-04-12/
    part-00000.parquet  (175MB)
  _delta_log/                              # Delta Lake commit log
    00000000000000000001.json
    00000000000000000002.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your average file size is below 50MB, schedule a daily compaction job; if it's above 1GB, your partitions are too coarse. Both extremes hurt query latency.&lt;/p&gt;
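
&lt;p&gt;With an open table format, the compaction job can be a single statement — a sketch assuming a Delta Lake table and an engine that supports Delta's &lt;code&gt;OPTIMIZE&lt;/code&gt;; the commented line shows the analogous Iceberg Spark procedure with a hypothetical catalog name:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Delta Lake: bin-pack the small files in one day's partition
OPTIMIZE silver.orders WHERE ingest_date = '2026-04-13';

-- Iceberg equivalent (Spark procedure; 'lake' is a hypothetical catalog)
-- CALL lake.system.rewrite_data_files(table =&amp;gt; 'silver.orders');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;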

&lt;h4&gt;
  
  
  Metadata catalog — Hive Metastore, AWS Glue, Unity Catalog
&lt;/h4&gt;

&lt;p&gt;The catalog invariant: &lt;strong&gt;a metadata catalog maps logical names (&lt;code&gt;silver.orders&lt;/code&gt;) to physical locations (&lt;code&gt;s3://analytics-lake/silver/orders&lt;/code&gt;), column schemas, partition definitions, and table properties; it is the single source of truth for "what tables exist" across every compute engine that reads the lake&lt;/strong&gt;. The catalog can be a long-running service (Hive Metastore, AWS Glue Data Catalog, Databricks Unity Catalog) or a REST API on top of files (Iceberg REST catalog, Polaris).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logical → physical mapping&lt;/strong&gt; — &lt;code&gt;silver.orders&lt;/code&gt; → &lt;code&gt;s3://...&lt;/code&gt;; column names, types, partition keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Engine-agnostic&lt;/strong&gt; — Spark, Trino, Presto, Snowflake external tables, Athena, DuckDB all read the same catalog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema evolution&lt;/strong&gt; — add column, widen type, rename (with caveats); the catalog records the evolution history.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Permissions&lt;/strong&gt; — many catalogs (Unity, Glue with Lake Formation) carry table/column-level access policies.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Registering a partitioned &lt;code&gt;silver.orders&lt;/code&gt; table in Glue.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;field&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;logical name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;silver.orders&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;location&lt;/td&gt;
&lt;td&gt;&lt;code&gt;s3://analytics-lake/silver/orders/&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;input format&lt;/td&gt;
&lt;td&gt;&lt;code&gt;parquet&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;partition keys&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ingest_date STRING&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema&lt;/td&gt;
&lt;td&gt;&lt;code&gt;order_id BIGINT, customer_id BIGINT, amount DECIMAL(18,2), source_ts TIMESTAMP, ingest_id STRING&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;CREATE EXTERNAL TABLE silver.orders&lt;/code&gt; declares a logical name in the catalog without copying or moving any data files.&lt;/li&gt;
&lt;li&gt;The column list (&lt;code&gt;order_id BIGINT&lt;/code&gt;, …) declares the schema the engine should expect; Parquet files store their own schema in the footer, but the catalog is the canonical answer the planner trusts.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;PARTITIONED BY (ingest_date STRING)&lt;/code&gt; declares the partition column; this column is &lt;em&gt;derived from the prefix path&lt;/em&gt;, not stored in the data files, which keeps each partition lean.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;LOCATION 's3://analytics-lake/silver/orders/'&lt;/code&gt; is the prefix the engine scans when reading; data files must already exist at this location.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;MSCK REPAIR TABLE silver.orders&lt;/code&gt; walks the S3 prefix, discovers any partition values it doesn't yet know about, and registers them; without this command after a backfill, the planner returns zero rows for the new dates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;EXTERNAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;source_ts&lt;/span&gt;    &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ingest_id&lt;/span&gt;    &lt;span class="n"&gt;STRING&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;STORED&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;PARQUET&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://analytics-lake/silver/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;MSCK&lt;/span&gt; &lt;span class="n"&gt;REPAIR&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always run &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; (or the engine equivalent) after a backfill that adds new partition prefixes; otherwise the catalog won't know about them and the partition predicate will return zero rows.&lt;/p&gt;
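
&lt;p&gt;When a backfill adds a known set of prefixes, registering them explicitly is cheaper than a full prefix walk. A minimal sketch in Hive/Athena-style DDL; the date value is illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Register one known partition instead of rescanning the whole table root
ALTER TABLE silver.orders ADD IF NOT EXISTS
  PARTITION (ingest_date = '2026-04-14')
  LOCATION 's3://analytics-lake/silver/orders/ingest_date=2026-04-14/';

-- Spark equivalent after any external change to the files:
-- REFRESH TABLE silver.orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;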

&lt;h4&gt;
  
  
  Compute engines — Spark, Trino, Presto, DuckDB
&lt;/h4&gt;

&lt;p&gt;The compute invariant: &lt;strong&gt;compute engines read the catalog to discover tables, plan queries with partition pruning and predicate pushdown, then read the relevant Parquet/ORC files from object storage; storage and compute scale independently and the same data can be queried by multiple engines simultaneously&lt;/strong&gt;. Spark dominates for batch + streaming pipelines; Trino/Presto dominate for interactive SQL; DuckDB is rising for single-node analytics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spark&lt;/strong&gt; — JVM, batch + streaming, rich ecosystem (Iceberg/Delta connectors, Spark SQL, MLlib, Structured Streaming).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trino / Presto&lt;/strong&gt; — interactive SQL across many catalogs; great for federated queries across lake + warehouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DuckDB&lt;/strong&gt; — single-node, embeddable, blazing fast for sub-TB analytics; popular for ad-hoc + notebooks (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake / BigQuery / Redshift external tables&lt;/strong&gt; — read lake data from inside a managed warehouse.&lt;/li&gt;
&lt;/ul&gt;
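
&lt;p&gt;The "same bytes, many engines" claim is easy to demo. A minimal sketch in DuckDB SQL, assuming the &lt;code&gt;httpfs&lt;/code&gt; extension is available and S3 credentials are configured:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Same Parquet files the Spark cluster reads, queried from a laptop
INSTALL httpfs;
LOAD httpfs;

SELECT SUM(amount) AS daily_revenue
FROM read_parquet('s3://analytics-lake/silver/orders/ingest_date=2026-04-13/*.parquet');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;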

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A Spark SQL query against &lt;code&gt;silver.orders&lt;/code&gt; with partition pruning.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;data scanned&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;catalog&lt;/td&gt;
&lt;td&gt;resolve &lt;code&gt;silver.orders&lt;/code&gt; → &lt;code&gt;s3://...&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;metadata only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;planner&lt;/td&gt;
&lt;td&gt;prune partitions for &lt;code&gt;ingest_date = '2026-04-13'&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;one partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spark workers&lt;/td&gt;
&lt;td&gt;read Parquet column-block for &lt;code&gt;amount&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~50MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;executor&lt;/td&gt;
&lt;td&gt;aggregate &lt;code&gt;SUM(amount)&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;local&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spark resolves &lt;code&gt;silver.orders&lt;/code&gt; against the catalog — pure metadata fetch, zero data scanned, returns the location plus the partition schema.&lt;/li&gt;
&lt;li&gt;The planner sees &lt;code&gt;WHERE ingest_date = '2026-04-13'&lt;/code&gt; and prunes the partition list to a single value, so workers only need to list files under one S3 prefix instead of all of them.&lt;/li&gt;
&lt;li&gt;Workers issue an S3 &lt;code&gt;LIST&lt;/code&gt; for that single partition, fetching a list of ~one to ten Parquet file paths.&lt;/li&gt;
&lt;li&gt;Each Parquet reader uses footer metadata to skip every column except &lt;code&gt;amount&lt;/code&gt;, then streams just that column block — typically 50MB instead of the full 1GB Parquet file.&lt;/li&gt;
&lt;li&gt;Each task computes a partial &lt;code&gt;SUM(amount)&lt;/code&gt; locally; a final shuffle sums the partial values to one number — the entire query is &lt;code&gt;O(rows in one partition)&lt;/code&gt; and runs in sub-second time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;daily_revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always include the partition key in your &lt;code&gt;WHERE&lt;/code&gt; clause to enable partition pruning; without it, the planner reads every partition (terabytes), and your query goes from 500ms to 50 seconds.&lt;/p&gt;
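
&lt;p&gt;You can verify pruning in the plan instead of trusting the runtime. A minimal sketch, assuming Spark; the exact plan text varies by engine and version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN
SELECT SUM(amount) AS daily_revenue
FROM silver.orders
WHERE ingest_date = '2026-04-13';

-- In the scan node, look for something like:
--   PartitionFilters: [isnotnull(ingest_date), (ingest_date = 2026-04-13)]
-- An empty PartitionFilters list means the query scans every partition.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;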

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping the catalog and reading raw S3 paths in every job — schemas drift, no central source of truth, no permissions.&lt;/li&gt;
&lt;li&gt;Ignoring file-size budgets — millions of 5KB files (the small-file problem) make Spark planning slower than the actual scan.&lt;/li&gt;
&lt;li&gt;Not declaring partition keys — full-table scans on every query, costs balloon by 100x.&lt;/li&gt;
&lt;li&gt;Mixing file formats inside one logical table (some Parquet, some JSON) — the planner can't push predicates and queries error out.&lt;/li&gt;
&lt;li&gt;Forgetting to refresh the catalog after a backfill — &lt;code&gt;MSCK REPAIR TABLE&lt;/code&gt; or &lt;code&gt;REFRESH TABLE&lt;/code&gt; is the single most-forgotten command.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on CDC Ingestion from Postgres
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Design a near-real-time ingestion pipeline that lands changes from a 10TB Postgres database into the lake, registers them in a catalog, and exposes them to Spark and Trino with sub-five-minute freshness.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Debezium → Kafka → Iceberg with Hive Metastore
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Postgres (with logical replication enabled)
      │
      ▼
Debezium connector (CDC reader, emits change events)
      │
      ▼
Kafka topic per table (key = primary key; value = before/after JSON or Avro)
      │
      ▼
Spark Structured Streaming job (1-minute trigger):
      - Reads Kafka topic
      - Writes to bronze.orders_cdc as append-only Iceberg files (partitioned by event_date)
      │
      ▼
Hive Metastore / Glue catalog:
      - bronze.orders_cdc registered with Iceberg metadata
      - silver.orders_current registered as a Spark MERGE-on-read view
      │
      ▼
Compute consumers:
      - Trino: SELECT * FROM silver.orders_current WHERE event_date = today
      - Spark batch: nightly compaction + table-maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; Debezium reads the Postgres write-ahead log (WAL) directly via logical replication, so it captures every insert/update/delete with no impact on the source; Kafka decouples the producer from the consumer and absorbs traffic spikes; the Spark Structured Streaming job runs with a one-minute trigger, so the lake is at most one minute behind; Iceberg's ACID transactions make concurrent micro-batch writes safe; the Hive Metastore registers the table once, and both Trino and Spark see the same schema; partitioning by &lt;code&gt;event_date&lt;/code&gt; enables prune-friendly time-window queries; nightly compaction keeps file sizes in the 128MB-1GB sweet spot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for an order update at &lt;code&gt;09:30:00.000&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;component&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.000&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;UPDATE orders SET status='shipped' WHERE order_id=448&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.150&lt;/td&gt;
&lt;td&gt;Debezium&lt;/td&gt;
&lt;td&gt;reads WAL, emits change event to Kafka&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:00.300&lt;/td&gt;
&lt;td&gt;Kafka&lt;/td&gt;
&lt;td&gt;persists change event to topic &lt;code&gt;orders.cdc&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:30.000&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;td&gt;next 1-min trigger; reads change events&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:35.000&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;td&gt;writes Parquet to &lt;code&gt;bronze.orders_cdc/event_date=2026-04-13/&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:35.500&lt;/td&gt;
&lt;td&gt;Iceberg&lt;/td&gt;
&lt;td&gt;commits new snapshot; catalog updated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09:30:40.000&lt;/td&gt;
&lt;td&gt;Trino&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SELECT … FROM silver.orders_current WHERE order_id=448&lt;/code&gt; returns updated row&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;End-to-end latency: ~40 seconds. Well within the five-minute SLA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the consumer-visible contract per minute:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;target&lt;/th&gt;
&lt;th&gt;actual&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;freshness (P50)&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 min&lt;/td&gt;
&lt;td&gt;~40 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;freshness (P99)&lt;/td&gt;
&lt;td&gt;&amp;lt; 5 min&lt;/td&gt;
&lt;td&gt;~2 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dropped events&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema-drift incidents&lt;/td&gt;
&lt;td&gt;&amp;lt; 1/quarter&lt;/td&gt;
&lt;td&gt;0 last quarter&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Postgres logical replication + Debezium&lt;/strong&gt; — captures every row change at the WAL layer; no impact on source query performance; no missed events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kafka as the decoupler&lt;/strong&gt; — handles backpressure, replays, and multiple downstream consumers; lake outages don't lose source events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark Structured Streaming with 1-minute trigger&lt;/strong&gt; — micro-batch sweet spot; the latency vs throughput trade-off favors throughput here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg table format&lt;/strong&gt; — ACID commits make concurrent micro-batch writes safe; time travel makes "what did the table look like at 09:30?" a one-line query (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hive Metastore as the unified catalog&lt;/strong&gt; — Spark and Trino see the same schema; no per-engine duplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;event_date&lt;/code&gt; partitioning + nightly compaction&lt;/strong&gt; — bounds query scan size and keeps file count manageable; both maintenance jobs are idempotent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end latency ~40s&lt;/strong&gt; — well inside the 5-min SLA; the 4.5-min headroom absorbs Kafka rebalances and Spark micro-batch jitter without alerting.&lt;/li&gt;
&lt;/ul&gt;
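
&lt;p&gt;The time-travel one-liner referenced above. A minimal sketch in Spark SQL (3.3+ syntax) against Iceberg; the snapshot ID is a placeholder:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- "What did the table look like at 09:30?"
SELECT COUNT(*) AS rows_as_of_0930
FROM bronze.orders_cdc TIMESTAMP AS OF '2026-04-13 09:30:00';

-- Pin an exact snapshot from the commit log instead:
-- SELECT * FROM bronze.orders_cdc VERSION AS OF 1234567890123456789;

-- Trino spelling:
-- SELECT ... FROM bronze.orders_cdc FOR TIMESTAMP AS OF TIMESTAMP '2026-04-13 09:30:00 UTC';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;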

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; Drill the &lt;a href="https://pipecode.ai/explore/practice/topic/streaming" rel="noopener noreferrer"&gt;streaming practice page&lt;/a&gt; for Kafka + micro-batch problems and the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt; for PySpark Structured Streaming patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — streaming&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Streaming practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/streaming" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;PYTHON&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — Python&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Python practice for data pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL pipelines&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Lake vs Cloud Warehouse vs Lakehouse — Iceberg, Delta, Hudi
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pattern selection and open table formats in data lake architecture
&lt;/h3&gt;

&lt;p&gt;"When would you pick a lakehouse over a warehouse?" and "what is the difference between Iceberg, Delta Lake, and Hudi?" are the two signature pattern-selection prompts — and they share one mental model: &lt;strong&gt;a data lake is files + a catalog; a cloud warehouse is a managed ACID SQL system with proprietary storage; a lakehouse is a lake plus an open table format that adds ACID, time travel, partition evolution, and concurrent writers — bringing warehouse-like semantics to object storage&lt;/strong&gt;. Iceberg, Delta Lake, and Hudi are the three dominant open table formats, each with slightly different trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro52s5ukbr1vfkb80l56.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fro52s5ukbr1vfkb80l56.webp" alt="Three-column comparison infographic of Data Lake, Cloud Warehouse, and Lakehouse storage architectures showing strengths (modern flexible files, structured SQL, hybrid design) and watch-outs (data quality challenges, limited unstructured support, increasing complexity) with PipeCode brand colors." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Most large organizations run a &lt;strong&gt;blend&lt;/strong&gt;: lake for flexible high-volume ingestion and ML feature stores, warehouse or lakehouse SQL for curated analytics. Don't propose a single-pattern solution to a system-design question — describe the boundary between the two and the contracts that flow across it. That's the senior signal.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Data lake — files on object storage with a catalog
&lt;/h4&gt;

&lt;p&gt;The lake invariant: &lt;strong&gt;a data lake is object storage (S3/GCS/ADLS) plus a metadata catalog plus open file formats (Parquet, ORC, Avro) plus convention-based partitioning; reads are cheap and parallel, multi-file writes are not atomic unless wrapped in a table format (a failed job can leave partial output visible), and the cost model is storage + compute-at-query-time&lt;/strong&gt;. Lakes shine when data shapes are diverse and high-volume.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt; — accepts any data format; massive scale; cheap storage; many engines can read.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch-outs&lt;/strong&gt; — no ACID without a table format; no time travel; concurrent writes can corrupt the table; small-file problem.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — ML feature stores, log archives, raw event data, ingestion landing zones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — storage ~$0.023/GB/month (S3 Standard); compute pay-per-query.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 50TB clickstream feature store in S3 + Glue.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;S3 Standard, ~$1,180/month for 50TB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;catalog&lt;/td&gt;
&lt;td&gt;AWS Glue (free for first million objects)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;Athena, ~$5/TB scanned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;typical query&lt;/td&gt;
&lt;td&gt;scan 100GB → ~$0.50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage line: 50TB × 1,024GB/TB × $0.023/GB/month (S3 Standard pricing) ≈ $1,180/month — this is the floor regardless of query activity.&lt;/li&gt;
&lt;li&gt;Catalog line: AWS Glue is free for the first million metadata objects; a 50TB clickstream table partitioned by year/month/day fits comfortably under that limit.&lt;/li&gt;
&lt;li&gt;Compute line: Athena charges per TB scanned, not per query — write efficient SQL (use the partition predicate, project only needed columns) and you pay only for what you actually read.&lt;/li&gt;
&lt;li&gt;Typical query: a partition-pruned + column-projected scan touches ~100GB → 0.1 TB × $5/TB ≈ $0.50; an unpruned full-table scan would touch 50TB → $250 per query.&lt;/li&gt;
&lt;li&gt;Net at this scale: storage dominates the monthly bill (~$1,180) and compute scales linearly with query discipline — bad queries cost real money, good queries are nearly free.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A lake-first deployment for clickstream:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://feature-lake/raw_events/year=2026/month=04/day=13/
  part-00000.parquet
  part-00001.parquet
   …
Glue catalog: feature_lake.raw_events
Athena query: SELECT user_id, COUNT(*) FROM feature_lake.raw_events WHERE year = '2026' AND month = '04' AND day = '13' GROUP BY user_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; a pure lake is the right answer when data is high-volume, schema-flexible, and primarily consumed by ML or batch analytics; reach for a lakehouse the moment you need ACID or concurrent writers.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud warehouse — managed ACID SQL on proprietary storage
&lt;/h4&gt;

&lt;p&gt;The warehouse invariant: &lt;strong&gt;a cloud warehouse (Snowflake, BigQuery, Redshift, Synapse) is a managed system that owns both storage and compute, exposes SQL as the primary interface, provides ACID transactions out of the box, and handles indexing, statistics, and query optimization automatically&lt;/strong&gt;. Warehouses shine when data is structured and the primary consumer is analyst SQL.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strengths&lt;/strong&gt; — mature SQL; ACID; managed governance products (RBAC, masking); workload management.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watch-outs&lt;/strong&gt; — proprietary storage = vendor lock; cost at huge semi-structured scale; less flexible for non-tabular data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best fit&lt;/strong&gt; — curated analytics, BI dashboards, financial reporting, dimensional models.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — ~$2-5 per credit-hour or per-TB-scanned; storage ~$0.02-0.04/GB/month (compressed).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 5TB curated finance mart in Snowflake.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;attribute&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage&lt;/td&gt;
&lt;td&gt;Snowflake, ~$200/month for 5TB compressed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;compute&lt;/td&gt;
&lt;td&gt;Small warehouse, ~$2/credit-hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;typical query&lt;/td&gt;
&lt;td&gt;dashboard refresh in ~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACID&lt;/td&gt;
&lt;td&gt;full transactions across multi-table updates&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage line: Snowflake stores data compressed (typically 3-5x smaller than raw) at ~$23-46/TB/month of stored data, so the 5TB compressed footprint above lands around $200/month at on-demand rates.&lt;/li&gt;
&lt;li&gt;Compute line: a Small warehouse runs at ~$2/credit-hour; nightly ELT jobs plus business-hours dashboards consume ~50-200 credits/month for a finance mart of this size.&lt;/li&gt;
&lt;li&gt;Typical query: dashboard refresh hits a sub-30-second target because data is co-located with compute and the planner has full statistics.&lt;/li&gt;
&lt;li&gt;ACID guarantee: multi-table updates within a &lt;code&gt;BEGIN ... COMMIT&lt;/code&gt; block are atomic — the finance close cannot land half-updated, which is the whole reason finance reports run on a warehouse rather than a raw lake.&lt;/li&gt;
&lt;li&gt;Net at 5TB scale: the warehouse premium (~$200 storage) is small versus a lake's ~$115 equivalent; ergonomics, SQL-first BI integration, and ACID tilt the choice clearly toward warehouse.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A curated star schema in Snowflake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;finance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fact_revenue&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;date_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;product_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_lines&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; warehouses are the right answer when SQL ergonomics and ACID matter more than format flexibility; reach for a lakehouse when you need both &lt;em&gt;and&lt;/em&gt; the ability to query the same data from outside the warehouse.&lt;/p&gt;
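
&lt;p&gt;The multi-table atomicity from step 4, spelled out. A minimal sketch in Snowflake-style SQL; &lt;code&gt;finance.close_status&lt;/code&gt; and the key values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

UPDATE finance.fact_revenue
SET revenue = revenue + 125.00
WHERE date_key = 20260413 AND region_key = 7;

UPDATE finance.close_status
SET closed = TRUE
WHERE period = '2026-04';

COMMIT;  -- both updates land together, or neither does
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;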

&lt;h4&gt;
  
  
  Lakehouse with Iceberg / Delta / Hudi — ACID on object storage
&lt;/h4&gt;

&lt;p&gt;The lakehouse invariant: &lt;strong&gt;a lakehouse is an open-table-format layer (Apache Iceberg, Delta Lake, Apache Hudi) on top of object storage that adds ACID transactions, schema evolution, partition evolution, time travel, and safe concurrent writers; the data sits in standard Parquet files but is governed by a JSON/Avro commit log that any engine can read&lt;/strong&gt;. Lakehouse architectures combine lake scale with warehouse-like semantics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Apache Iceberg&lt;/strong&gt; — table format invented at Netflix; broad engine support (Spark, Trino, Snowflake, BigQuery, Dremio); REST catalog spec.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Lake&lt;/strong&gt; — invented at Databricks; strong Spark integration; commit log in &lt;code&gt;_delta_log/&lt;/code&gt;; OSS Delta works across engines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Hudi&lt;/strong&gt; — invented at Uber; optimized for upsert-heavy CDC workloads; merge-on-read and copy-on-write modes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;All three&lt;/strong&gt; — provide ACID, time travel, schema evolution, and partition pruning; pick by ecosystem and team skill.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A 50TB lakehouse on S3 + Iceberg + Spark/Trino.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;dimension&lt;/th&gt;
&lt;th&gt;data lake&lt;/th&gt;
&lt;th&gt;warehouse&lt;/th&gt;
&lt;th&gt;lakehouse (Iceberg)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;storage cost&lt;/td&gt;
&lt;td&gt;✓ cheap&lt;/td&gt;
&lt;td&gt;✗ expensive&lt;/td&gt;
&lt;td&gt;✓ cheap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ACID transactions&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;concurrent writers&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;time travel&lt;/td&gt;
&lt;td&gt;✗&lt;/td&gt;
&lt;td&gt;depends&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;schema evolution&lt;/td&gt;
&lt;td&gt;manual&lt;/td&gt;
&lt;td&gt;managed&lt;/td&gt;
&lt;td&gt;managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor lock&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;high&lt;/td&gt;
&lt;td&gt;low (open standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML / Python access&lt;/td&gt;
&lt;td&gt;direct&lt;/td&gt;
&lt;td&gt;via connector&lt;/td&gt;
&lt;td&gt;direct&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Storage cost row: lake and lakehouse both win because data sits in cheap object storage; warehouse loses at scale because storage is bundled with managed compute.&lt;/li&gt;
&lt;li&gt;ACID + concurrent writers rows: warehouse and lakehouse both provide them out of the box; pure lake does not — concurrent writers can corrupt a lake table without an open table format on top.&lt;/li&gt;
&lt;li&gt;Time travel row: only the lakehouse exposes it natively via the Iceberg/Delta snapshot log; some warehouses offer it as a managed feature; pure lake has no concept.&lt;/li&gt;
&lt;li&gt;Schema evolution row: lakehouse and warehouse both manage adding/widening columns as a metadata commit; pure-lake users do it manually with file rewrites.&lt;/li&gt;
&lt;li&gt;Vendor lock + ML/Python rows: pure lake is open standard; lakehouse is open standard with a richer feature set; warehouse is proprietary and ML access usually requires connectors that copy data back out — which is why ML teams gravitate to lake/lakehouse for feature stores.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Creating an Iceberg table via Spark SQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;lakehouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;     &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;       &lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;order_date&lt;/span&gt;   &lt;span class="nb"&gt;DATE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;iceberg&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;LOCATION&lt;/span&gt; &lt;span class="s1"&gt;'s3://lakehouse-bucket/orders/'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;lakehouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;staging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders_delta&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the lakehouse pattern is the right answer when you need ACID + time travel + concurrent writers + the ability to query from multiple engines; pick Iceberg for the broadest engine support, Delta for tightest Databricks/Spark integration, Hudi for upsert-heavy CDC.&lt;/p&gt;
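
&lt;p&gt;Evolution in a lakehouse is a metadata commit, not a file rewrite. A minimal sketch, assuming the Iceberg Spark SQL extensions are enabled; the &lt;code&gt;discount&lt;/code&gt; column is illustrative, and &lt;code&gt;order_date_day&lt;/code&gt; is the partition-field name Iceberg auto-generates for &lt;code&gt;days(order_date)&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Schema evolution: metadata-only, no data files touched
ALTER TABLE lakehouse.orders ADD COLUMN discount DECIMAL(9,2);

-- Partition evolution: new writes use the new spec, old files stay readable
ALTER TABLE lakehouse.orders
  REPLACE PARTITION FIELD order_date_day WITH months(order_date);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;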

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Conflating "lake" with "Hadoop / HDFS" — modern lakes are object storage; HDFS is the legacy on-prem variant.&lt;/li&gt;
&lt;li&gt;Picking a lakehouse "because it's modern" without matching it to the workload — for pure curated SQL analytics, a warehouse is often simpler and cheaper.&lt;/li&gt;
&lt;li&gt;Treating Iceberg / Delta / Hudi as interchangeable — Hudi is upsert-tuned; Delta is Spark-tightest; Iceberg is most engine-agnostic. The choice has long-term implications.&lt;/li&gt;
&lt;li&gt;Forgetting that lakehouses still need governance — IAM, lineage, quality tests, contracts; the "open" part is the storage, not the operational discipline.&lt;/li&gt;
&lt;li&gt;Underestimating the operational cost of running an open lakehouse vs a managed warehouse — engineering time matters.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on Pattern Selection
&lt;/h3&gt;

&lt;p&gt;A retail company stores 200TB of clickstream events plus a 5TB curated finance mart and a 1TB ML feature store. &lt;strong&gt;Should they run on a pure data lake, a cloud warehouse, a lakehouse, or a hybrid? Walk through your decision.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a Hybrid — Lakehouse for Clickstream + Features, Warehouse for Finance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workload                  Volume    Pattern recommended    Why
────────────────────────  ────────  ─────────────────────  ─────────────────────────────────────────
Clickstream events        200 TB    Lakehouse (Iceberg)    Volume + schema flexibility + ML access
ML feature store           1 TB     Lakehouse (Iceberg)    Same engine, same catalog as clickstream
Curated finance mart       5 TB     Cloud warehouse        SQL ergonomics, ACID across many tables, BI tools
                                                            Snowflake / BigQuery / Redshift

Boundary contract:
  - Clickstream + features stay in S3 + Iceberg
  - Finance mart loads nightly from Iceberg via Snowflake external tables
  - Reverse-ETL syncs finance summaries back into the lakehouse for ML feature joins
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; clickstream at 200TB is the workload that justifies the cheaper object-storage cost model; the lakehouse table format adds ACID and time travel that the team will need for replays and audits; the ML feature store sits on the same engine + catalog so feature engineers can &lt;code&gt;JOIN&lt;/code&gt; against clickstream without a cross-system data hop; the finance mart at only 5TB is small enough that warehouse storage cost is negligible, and the team's BI tools and analyst SQL ergonomics dominate the decision; the boundary contract (Snowflake external tables) lets finance read curated lake tables without copying them, and reverse-ETL closes the loop for ML.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of the decision:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;question&lt;/th&gt;
&lt;th&gt;answer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Is volume &amp;gt; 50TB?&lt;/td&gt;
&lt;td&gt;yes (clickstream) → lake or lakehouse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Need ACID + concurrent writers + time travel?&lt;/td&gt;
&lt;td&gt;yes (CDC + ML feature recomputation) → lakehouse, not pure lake&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Pick a table format&lt;/td&gt;
&lt;td&gt;Iceberg (broadest engine support across Spark, Trino, Snowflake)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Is the curated SQL workload &amp;lt; 10TB?&lt;/td&gt;
&lt;td&gt;yes (finance, 5TB) → warehouse is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Pick a warehouse&lt;/td&gt;
&lt;td&gt;Snowflake (ergonomics + multi-cloud + Iceberg external table support)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Boundary contract&lt;/td&gt;
&lt;td&gt;Snowflake external tables on Iceberg; reverse-ETL nightly job&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended architecture summary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;zone&lt;/th&gt;
&lt;th&gt;technology&lt;/th&gt;
&lt;th&gt;volume&lt;/th&gt;
&lt;th&gt;primary consumer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Clickstream lakehouse&lt;/td&gt;
&lt;td&gt;S3 + Iceberg + Spark/Trino&lt;/td&gt;
&lt;td&gt;200 TB&lt;/td&gt;
&lt;td&gt;ML pipelines, analyst SQL via Trino&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML feature store&lt;/td&gt;
&lt;td&gt;S3 + Iceberg + Spark&lt;/td&gt;
&lt;td&gt;1 TB&lt;/td&gt;
&lt;td&gt;ML training + serving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance warehouse&lt;/td&gt;
&lt;td&gt;Snowflake (managed)&lt;/td&gt;
&lt;td&gt;5 TB&lt;/td&gt;
&lt;td&gt;Finance analysts, BI dashboards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Boundary&lt;/td&gt;
&lt;td&gt;Snowflake external tables on Iceberg&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;finance reads curated lake data zero-copy&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Volume-driven storage choice&lt;/strong&gt; — 200TB at warehouse storage cost ($0.02-0.04/GB/month) = ~$5K/month; same data on S3 = ~$4.7K/month &lt;em&gt;and&lt;/em&gt; available to ML directly. The cost gap widens with growth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lakehouse for ACID + time travel&lt;/strong&gt; — clickstream replays and ML feature recomputation need transactional snapshots; a pure lake without Iceberg cannot give you that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Warehouse for curated SQL&lt;/strong&gt; — finance analysts live in BI tools; warehouse SQL ergonomics + ACID across multi-table updates dominates the cost-per-query argument at 5TB scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg as the open boundary&lt;/strong&gt; — Snowflake reads Iceberg tables natively via external tables; no nightly copy job, no schema drift between systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reverse-ETL closes the loop&lt;/strong&gt; — finance summaries flow back to the lakehouse so ML features can &lt;code&gt;JOIN&lt;/code&gt; against revenue without leaving the lake stack.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational cost trade-off&lt;/strong&gt; — running both a lakehouse and a warehouse is more engineering than a single managed warehouse; the cost is justified at this volume mix but not at 5TB total.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice problems&lt;/a&gt; for warehouse-style queries and &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice&lt;/a&gt; for star-schema and OBT patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — dimensional modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Dimensional modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — data modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Data modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Interview Answer Shape — Grain, Idempotency, Lineage, Reconciliation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A five-step template for data lake design rounds
&lt;/h3&gt;

&lt;p&gt;"Design our company's analytics data lake" is the canonical open-ended system-design prompt — and the cleanest answer is a &lt;strong&gt;five-step template&lt;/strong&gt; that walks the interviewer through the load-bearing decisions in a fixed order. The mental model: &lt;strong&gt;clarify grain → separate landing from conformed → make loads idempotent → attach lineage keys → reconcile aggregates against source&lt;/strong&gt;. Following this template demonstrates that you have shipped data pipelines before, and it gives the interviewer five concrete spots to drill deeper. Candidates who jump straight to vendor names or who skip the grain question lose the round, regardless of how many tools they can name.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2gytmc0nfg8udalzf9f.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv2gytmc0nfg8udalzf9f.webp" alt="Interview answer shape checklist for data lake design questions: clarify grain, separate landing vs conformed, idempotent loads, row lineage keys, aggregate reconciliation to source — with green checkmarks and PipeCode branding." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; State the template out loud at the start: "I'd answer this in five steps — first clarify grain, then separate landing from conformed, then make loads idempotent, then attach lineage keys, then explain how I'd reconcile aggregates against the source." This gives the interviewer a road map and makes it easy for them to interrupt at any step with "tell me more about X" — which is exactly the signal you want.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Step 1 — Clarify the grain and the metric definition
&lt;/h4&gt;

&lt;p&gt;The grain invariant: &lt;strong&gt;the grain of a fact table is the business event one row represents — orders, order lines, shipments, page views, user-day, user-session — and ambiguous grain is the single most common bug in data engineering&lt;/strong&gt;. Ask the interviewer "are we counting orders or order lines?" before drawing a box. The answer changes joins, group-bys, and reconciliation totals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order grain&lt;/strong&gt; — one row per order; &lt;code&gt;COUNT(*)&lt;/code&gt; = number of orders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Order-line grain&lt;/strong&gt; — one row per line item; &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt; = number of orders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User-day grain&lt;/strong&gt; — one row per user per day; &lt;code&gt;SUM(events)&lt;/code&gt; = events per user per day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session grain&lt;/strong&gt; — one row per session; rolling &lt;code&gt;LAG&lt;/code&gt; over events to define session boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; "How many orders did we ship last week?" against &lt;code&gt;fact_shipments&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;grain candidate&lt;/th&gt;
&lt;th&gt;implied metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;order grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(*) WHERE shipped_date BETWEEN ...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;order-line grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT order_id) WHERE shipped_date BETWEEN ...&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;shipment grain&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COUNT(DISTINCT order_id) WHERE shipment_event = 'shipped'&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the table is at &lt;em&gt;order grain&lt;/em&gt; (one row per order), &lt;code&gt;COUNT(*) WHERE shipped_date BETWEEN ...&lt;/code&gt; directly counts orders shipped — clean and simple.&lt;/li&gt;
&lt;li&gt;If the table is at &lt;em&gt;order-line grain&lt;/em&gt; (one row per item per order), &lt;code&gt;COUNT(*)&lt;/code&gt; over-counts every multi-item order; the right answer becomes &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the table is at &lt;em&gt;shipment grain&lt;/em&gt; (one row per shipment event per line, including partial shipments and cancellations), filter by &lt;code&gt;event_type = 'shipped'&lt;/code&gt; first and then &lt;code&gt;COUNT(DISTINCT order_id)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Without naming the grain, the same SQL can produce three different "right" numbers — and the analyst, dashboard, and source-of-truth Slack thread will each pick a different one.&lt;/li&gt;
&lt;li&gt;Stating the grain in the first sentence of every interview answer prevents this entire class of bug — and the same rule applies in production: every fact table should have its grain documented in the catalog comment.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Always state grain explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"This fact_shipments table has shipment grain — one row per shipment_event per order_line.
For 'orders shipped last week' I'll do COUNT(DISTINCT order_id) where event_type = 'shipped'
and shipped_date BETWEEN start_of_week AND end_of_week."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; the first sentence of every interview answer should name the grain. Even if the interviewer doesn't ask, declaring grain demonstrates senior intent.&lt;/p&gt;
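
&lt;p&gt;Documenting grain where the next reader will find it. A minimal sketch assuming Trino-style &lt;code&gt;COMMENT ON&lt;/code&gt;; Spark users can set a table property instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;COMMENT ON TABLE silver.fact_shipments IS
  'Grain: one row per shipment_event per order_line. Count orders via COUNT(DISTINCT order_id).';

-- Spark SQL equivalent:
-- ALTER TABLE silver.fact_shipments
--   SET TBLPROPERTIES ('comment' = 'Grain: one row per shipment_event per order_line');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;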

&lt;h4&gt;
  
  
  Step 2 — Separate landing from conformed (bronze vs silver)
&lt;/h4&gt;

&lt;p&gt;The separation invariant: &lt;strong&gt;landing is what the source sent; conformed is what the business agrees to call truth; never let analysts query landing directly because schemas change without notice&lt;/strong&gt;. The bronze/silver split is the architectural manifestation of this rule.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Landing / bronze&lt;/strong&gt; — append-only, source-fidelity, partitioned by &lt;code&gt;ingest_date&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformed / silver&lt;/strong&gt; — deduplicated, typed, with conformed business keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curated / gold&lt;/strong&gt; — subject-area marts and dimensional models for downstream consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boundary&lt;/strong&gt; — only the silver and gold layers carry consumer contracts; bronze is for re-processors only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A pipeline lands daily JSON snapshots; without a separation layer, analysts join directly against &lt;code&gt;bronze.orders&lt;/code&gt; and break every time the source adds a column.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;layer&lt;/th&gt;
&lt;th&gt;who reads&lt;/th&gt;
&lt;th&gt;breakage tolerance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;bronze.orders&lt;/td&gt;
&lt;td&gt;re-processors only&lt;/td&gt;
&lt;td&gt;high (re-process on demand)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silver.orders&lt;/td&gt;
&lt;td&gt;analyst ad-hoc, ML&lt;/td&gt;
&lt;td&gt;low (contract change ≥ 30 days notice)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gold.fact_orders&lt;/td&gt;
&lt;td&gt;dashboards, BI&lt;/td&gt;
&lt;td&gt;zero (versioned column contracts)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bronze is owned by the re-processors only — no SLA, no consumer contract; analysts who query it get whatever the source app emitted today, including freshly renamed columns and broken types.&lt;/li&gt;
&lt;li&gt;Silver is the contract layer — analyst ad-hoc SQL, ML feature pipelines, and reverse-ETL all read it; breakage requires ≥30-day notice so consumers can adapt.&lt;/li&gt;
&lt;li&gt;Gold has zero breakage tolerance — dashboards and BI tools couple to specific column names + types; any change requires explicit version bumping (&lt;code&gt;gold.fact_orders_v2&lt;/code&gt;) so old dashboards keep working.&lt;/li&gt;
&lt;li&gt;Without these boundaries, a source app's column rename cascades immediately into a broken executive dashboard, and the data team learns about it from a Slack screenshot.&lt;/li&gt;
&lt;li&gt;With these boundaries, the silver-layer owner absorbs the upstream change inside the dedup logic, gold contracts stay intact, and the dashboard never breaks — the architecture has done its job.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; State the layer boundaries explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"I'd split the platform into three layers — bronze for raw landing, silver for conformed,
gold for analytics-ready. Bronze is for re-processors only; analysts and dashboards read
silver and gold. The boundary contract is documented and breakage requires 30-day notice
for silver and explicit version bumping for gold."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; any answer that allows analysts to query the landing zone has a hidden bug-factory; the bronze/silver split is what prevents source-schema chaos from cascading into BI.&lt;/p&gt;
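
&lt;p&gt;The boundary is enforceable, not just documented. A minimal sketch in Snowflake-style grants; the &lt;code&gt;analyst&lt;/code&gt; role is illustrative and the exact syntax varies by engine:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Analysts can read silver and gold, never bronze
REVOKE ALL PRIVILEGES ON SCHEMA bronze FROM ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA silver TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA gold TO ROLE analyst;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;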

&lt;h4&gt;
  
  
  Step 3 — Idempotent loads — same input → same output, every time
&lt;/h4&gt;

&lt;p&gt;The idempotency invariant: &lt;strong&gt;a daily load is idempotent if re-running it (after any failure, manual intervention, or backfill) produces byte-identical output; without idempotency, retries cause duplicates and counts drift silently&lt;/strong&gt;. Idempotency is achieved through &lt;code&gt;MERGE&lt;/code&gt; instead of &lt;code&gt;INSERT&lt;/code&gt;, partition-overwrite semantics, or table-format ACID transactions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;MERGE&lt;/code&gt; on a business key&lt;/strong&gt; — &lt;code&gt;WHEN MATCHED UPDATE SET *&lt;/code&gt; + &lt;code&gt;WHEN NOT MATCHED INSERT *&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition overwrite&lt;/strong&gt; — &lt;code&gt;INSERT OVERWRITE TABLE silver.orders PARTITION (ingest_date='2026-04-13')&lt;/code&gt; (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iceberg / Delta &lt;code&gt;MERGE&lt;/code&gt;&lt;/strong&gt; — ACID transaction; safe for concurrent writers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functional idempotency&lt;/strong&gt; — pure transformations whose output depends only on inputs, never on &lt;code&gt;NOW()&lt;/code&gt; or random.&lt;/li&gt;
&lt;/ul&gt;
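
&lt;p&gt;The partition-overwrite variant from the list above. A minimal sketch in Hive/Spark SQL; rerunning it for the same date replaces the partition rather than appending (add dedup logic as in the &lt;code&gt;MERGE&lt;/code&gt; worked example below if bronze can contain replays):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Recompute one day from bronze and replace the silver partition in place
INSERT OVERWRITE TABLE silver.orders
PARTITION (ingest_date = '2026-04-13')
SELECT order_id, customer_id, amount, source_ts, ingest_id
FROM bronze.orders
WHERE ingest_date = '2026-04-13';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;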

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; A retry on a half-completed daily load should produce the same final state as the original successful run.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;run&lt;/th&gt;
&lt;th&gt;rows in silver before&lt;/th&gt;
&lt;th&gt;rows after&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;original&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;retry (after partial failure)&lt;/td&gt;
&lt;td&gt;12,401&lt;/td&gt;
&lt;td&gt;12,835 (no duplicates)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;backfill 2026-04-12 a week later&lt;/td&gt;
&lt;td&gt;12,820&lt;/td&gt;
&lt;td&gt;12,820 (overwritten cleanly)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Original run: silver starts at 0 rows; the &lt;code&gt;MERGE&lt;/code&gt; writes 12,835 unique rows after dedup; final count = 12,835.&lt;/li&gt;
&lt;li&gt;Retry after a partial failure: silver already has 12,401 rows (the partial write that crashed); the &lt;code&gt;MERGE&lt;/code&gt; updates the existing rows and inserts only the missing 434; final count = 12,835 — no duplicates.&lt;/li&gt;
&lt;li&gt;Backfill 2026-04-12 a week later: partition-overwrite semantics drop the existing 12,820 rows for that date and replace them with the freshly recomputed 12,820; final count = 12,820 — clean.&lt;/li&gt;
&lt;li&gt;The key invariant: every rerun produces the same final state regardless of the starting state — that's what idempotency means.&lt;/li&gt;
&lt;li&gt;Without idempotency, the retry would have inserted 434 duplicate rows (12,835 - 12,401), and the backfill would have either errored on the unique constraint or silently created shadow data that broke the next dashboard refresh.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;MERGE&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ingest_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
    &lt;span class="n"&gt;QUALIFY&lt;/span&gt; &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;MATCHED&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your interviewer asks "what happens if this job runs twice", and your answer involves any kind of cleanup script, you don't have idempotency — restructure.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 4 — Attach row-level lineage keys — &lt;code&gt;ingest_id&lt;/code&gt;, &lt;code&gt;source_ts&lt;/code&gt;, &lt;code&gt;pipeline_version&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The lineage invariant: &lt;strong&gt;every silver and gold row carries the columns that let you reconstruct &lt;em&gt;which source payload&lt;/em&gt; produced it and &lt;em&gt;which pipeline version&lt;/em&gt; transformed it; without lineage, debugging "why does this row look wrong" is forensic archaeology&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ingest_id&lt;/code&gt;&lt;/strong&gt; — unique identifier of the bronze batch (e.g., timestamp + UUID).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;source_ts&lt;/code&gt;&lt;/strong&gt; — timestamp from the source system (CDC) for ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pipeline_version&lt;/code&gt;&lt;/strong&gt; — git SHA or version tag of the transformation code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;silver_loaded_at&lt;/code&gt;&lt;/strong&gt; — when the row entered silver; useful for SLA metrics.&lt;/li&gt;
&lt;/ul&gt;
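
&lt;p&gt;A minimal DDL sketch of what these columns look like on the silver table (the types and the non-lineage columns are illustrative assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Silver table with row-level lineage columns (a sketch).
CREATE TABLE silver.orders (
    order_id         BIGINT,
    customer_id      BIGINT,
    amount           DECIMAL(18,2),
    -- lineage columns
    ingest_id        VARCHAR,    -- bronze batch id: timestamp + UUID
    source_ts        TIMESTAMP,  -- source-system event time (CDC ordering)
    pipeline_version VARCHAR,    -- git SHA / version tag of the transform
    silver_loaded_at TIMESTAMP   -- when the row entered silver (SLA metrics)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;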

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Analysts notice that &lt;code&gt;revenue&lt;/code&gt; for &lt;code&gt;order_id=448&lt;/code&gt; is wrong; with lineage, they can trace it back to the exact bronze file and pipeline version that produced it.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;field&lt;/th&gt;
&lt;th&gt;value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;order_id&lt;/td&gt;
&lt;td&gt;448&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;revenue&lt;/td&gt;
&lt;td&gt;$99.00 (wrong; should be $999.00)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ingest_id&lt;/td&gt;
&lt;td&gt;&lt;code&gt;20260412T0200Z_a3f2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;source_ts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-12 09:30:15&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pipeline_version&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;v2.1.7&lt;/code&gt; (commit &lt;code&gt;b3a4d72&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;silver_loaded_at&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2026-04-12 02:15:32&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;An analyst notices &lt;code&gt;order_id = 448&lt;/code&gt; shows revenue $99 instead of the expected $999 in the BI dashboard.&lt;/li&gt;
&lt;li&gt;They look the row up in silver: &lt;code&gt;SELECT ingest_id, source_ts, pipeline_version FROM silver.orders WHERE order_id = 448&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The result tells them exactly which bronze batch produced this row (&lt;code&gt;ingest_id = '20260412T0200Z_a3f2'&lt;/code&gt;), the source moment (&lt;code&gt;source_ts = 2026-04-12 09:30:15&lt;/code&gt;), and the pipeline version that ran (&lt;code&gt;v2.1.7&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;They open the bronze file at that &lt;code&gt;ingest_id&lt;/code&gt;. If the source payload already shows $99, it's a source bug — file a ticket with the upstream team and replay from a known-good &lt;code&gt;source_ts&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If the bronze payload shows $999 but silver shows $99, the bug is in the pipeline. Run &lt;code&gt;git log v2.1.7&lt;/code&gt; to find the exact commit, fix the transformation, deploy &lt;code&gt;v2.1.8&lt;/code&gt;, and backfill the affected &lt;code&gt;ingest_date&lt;/code&gt; partition — total recovery time ~30 minutes instead of multi-day forensic SQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; Carry lineage in every silver row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DECIMAL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;            &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;ingest_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="s1"&gt;'v2.1.7'&lt;/span&gt;                          &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pipeline_version&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;                 &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;silver_loaded_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;deduped&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if the dashboard shows a wrong number and you can't answer "which source file produced this row?" in under five minutes, your lineage isn't strong enough.&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 5 — Aggregate reconciliation against the source
&lt;/h4&gt;

&lt;p&gt;The reconciliation invariant: &lt;strong&gt;a daily job compares aggregate metrics (row counts, sums, distinct counts) between the lake and the source system, alerts on drift above a tolerance, and blocks promotion to gold until the drift is investigated&lt;/strong&gt;. Reconciliation is the difference between "we trust the lake" and "we hope the lake is right".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Row count&lt;/strong&gt; — &lt;code&gt;COUNT(*)&lt;/code&gt; in lake vs source for the same time window.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sum reconciliation&lt;/strong&gt; — &lt;code&gt;SUM(amount)&lt;/code&gt; in lake vs source.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distinct count&lt;/strong&gt; — &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; to catch dedup bugs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tolerance threshold&lt;/strong&gt; — typically 0.1% for high-volume facts, 0.01% for finance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Worked example.&lt;/strong&gt; Daily reconciliation between silver and source-app replica.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;metric&lt;/th&gt;
&lt;th&gt;silver&lt;/th&gt;
&lt;th&gt;source&lt;/th&gt;
&lt;th&gt;drift&lt;/th&gt;
&lt;th&gt;passes?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;row count&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;12,835&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sum(amount)&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;$4,128,931&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;count(distinct user_id)&lt;/td&gt;
&lt;td&gt;8,712&lt;/td&gt;
&lt;td&gt;8,712&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step explanation.&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The daily reconciliation job runs after the silver load completes for the prior day.&lt;/li&gt;
&lt;li&gt;It computes three metrics over &lt;code&gt;silver.orders&lt;/code&gt;: &lt;code&gt;COUNT(*)&lt;/code&gt;, &lt;code&gt;SUM(amount)&lt;/code&gt;, and &lt;code&gt;COUNT(DISTINCT user_id)&lt;/code&gt; for the same date.&lt;/li&gt;
&lt;li&gt;It computes the same three metrics over &lt;code&gt;source_replica.orders&lt;/code&gt; (a read-only replica of the source-app database) for the same date.&lt;/li&gt;
&lt;li&gt;For each metric, drift is calculated as &lt;code&gt;ABS(silver - source) / source&lt;/code&gt;; the gate passes only if every metric is below the tolerance (0.001 = 0.1% for facts; 0.0001 for finance).&lt;/li&gt;
&lt;li&gt;If all three pass: silver promotes to gold and the dashboard refresh proceeds. If any fail: the gate blocks promotion, pages the on-call engineer, and emits the failing metric to a drift dashboard for investigation — the BI team never sees stale or wrong numbers because they see no refresh.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Worked-example solution.&lt;/strong&gt; A reconciliation gate before gold promotion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;source_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;source_replica&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;order_ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2026-04-13'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;row_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_drift&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;CASE&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;
         &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;ABS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;
        &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'PASS'&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'FAIL'&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;gate&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lake&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
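
&lt;p&gt;The gate above checks row count and sum; the bullet list also calls for a distinct count to catch dedup bugs. Extending the gate is mechanical; a sketch reusing the same CTE shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH lake AS (
    SELECT COUNT(*) AS n, SUM(amount) AS s, COUNT(DISTINCT user_id) AS d
    FROM silver.orders
    WHERE source_ts::DATE = '2026-04-13'
),
src AS (
    SELECT COUNT(*) AS n, SUM(amount) AS s, COUNT(DISTINCT user_id) AS d
    FROM source_replica.orders
    WHERE order_ts::DATE = '2026-04-13'
)
SELECT CASE
           WHEN ABS(lake.n - src.n) * 1.0 / src.n &amp;lt; 0.001
            AND ABS(lake.s - src.s) * 1.0 / src.s &amp;lt; 0.001
            AND ABS(lake.d - src.d) * 1.0 / src.d &amp;lt; 0.001
           THEN 'PASS'
           ELSE 'FAIL'
       END AS gate
FROM lake, src;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;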



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; never promote to gold without a reconciliation gate; the BI team will discover any drift the hard way otherwise, and trust takes years to rebuild.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping Step 1 (grain) and going straight to architecture — every downstream answer is wrong if grain is wrong.&lt;/li&gt;
&lt;li&gt;Letting analysts query the bronze zone directly — schema drift cascades into BI dashboards.&lt;/li&gt;
&lt;li&gt;"Idempotent" loads that depend on &lt;code&gt;NOW()&lt;/code&gt; — re-runs produce different rows; not actually idempotent.&lt;/li&gt;
&lt;li&gt;Lineage limited to the pipeline level (not the row level) — debugging "this row is wrong" is a multi-day forensic effort.&lt;/li&gt;
&lt;li&gt;Reconciliation that only checks row counts but not sums — &lt;code&gt;COUNT(*)&lt;/code&gt; can still match when dedup keeps the wrong version of a row; only &lt;code&gt;SUM(amount)&lt;/code&gt; exposes that corruption.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data Lake Interview Question on a Full System-Design Walkthrough
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Walk through your end-to-end answer to "design our company's analytics data lake" using the five-step template.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using the Five-Step Template
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. CLARIFY GRAIN
   "Before I draw any boxes — what's the canonical fact event? Orders, order lines, shipments?
    What's the metric we ultimately care about? Revenue, user counts, latency?"
   → assume: order grain; canonical metric = daily revenue per region.

2. SEPARATE LANDING FROM CONFORMED
   bronze.orders   ← S3 append-only daily JSON, partitioned by ingest_date
   silver.orders   ← deduped + typed + conformed customer_key/region_key
   gold.fact_orders ← star schema with dim_customer, dim_region, dim_date

3. IDEMPOTENT LOADS
   - bronze: append-only writes by ingest_id (never overwrite)
   - silver: MERGE on order_id with QUALIFY ROW_NUMBER() = 1 dedup
   - gold: INSERT OVERWRITE PARTITION (date_key) for the affected day(s)
   Re-runs produce byte-identical output.

4. ROW-LEVEL LINEAGE
   Carry ingest_id, source_ts, pipeline_version, silver_loaded_at on every silver row.
   Carry silver_loaded_at and pipeline_version on every gold row.
   Forensic queries: "show me every silver.orders row where pipeline_version='v2.1.6'."

5. AGGREGATE RECONCILIATION
   Daily SQL job: compare COUNT(*), SUM(amount), COUNT(DISTINCT user_id) between
   silver.orders and the source-app replica for the prior day. Drift &amp;gt; 0.1% blocks
   gold promotion and pages on-call. Drift dashboard surfaces history at a glance.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; the template gives the interviewer a clear road map (so they know where to drill) while demonstrating that the candidate has shipped this kind of pipeline before; each step addresses a specific failure mode (grain ambiguity, schema drift, retry duplicates, debugging dead-ends, silent data corruption); the order is non-arbitrary — Step N depends on Step N-1, and skipping any step weakens the foundation; every step has a concrete artifact (a layer, a SQL pattern, a column, a job) so the interviewer can ask "show me what that looks like" and get a specific answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; through a sample interview round:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;time (min)&lt;/th&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;candidate output&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;grain&lt;/td&gt;
&lt;td&gt;"Are we counting orders or order lines? Confirmed: orders."&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2-7&lt;/td&gt;
&lt;td&gt;landing vs conformed&lt;/td&gt;
&lt;td&gt;drew bronze/silver/gold split with ownership boxes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7-12&lt;/td&gt;
&lt;td&gt;idempotency&lt;/td&gt;
&lt;td&gt;walked through silver MERGE; named QUALIFY ROW_NUMBER dedup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;td&gt;lineage&lt;/td&gt;
&lt;td&gt;listed &lt;code&gt;ingest_id&lt;/code&gt;, &lt;code&gt;source_ts&lt;/code&gt;, &lt;code&gt;pipeline_version&lt;/code&gt; columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;reconciliation&lt;/td&gt;
&lt;td&gt;sketched daily-reconciliation SQL job + drift dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20-25&lt;/td&gt;
&lt;td&gt;open questions&lt;/td&gt;
&lt;td&gt;streaming variant, schema evolution, multi-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt; the recommended interview-round shape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;minutes&lt;/th&gt;
&lt;th&gt;failure mode addressed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 — grain&lt;/td&gt;
&lt;td&gt;0-2&lt;/td&gt;
&lt;td&gt;ambiguous metric → wrong joins&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2 — landing vs conformed&lt;/td&gt;
&lt;td&gt;2-7&lt;/td&gt;
&lt;td&gt;source-schema drift → BI breakage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 — idempotency&lt;/td&gt;
&lt;td&gt;7-12&lt;/td&gt;
&lt;td&gt;retries → duplicates → drift&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 — lineage&lt;/td&gt;
&lt;td&gt;12-15&lt;/td&gt;
&lt;td&gt;"why is this row wrong" → forensic dead-end&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5 — reconciliation&lt;/td&gt;
&lt;td&gt;15-20&lt;/td&gt;
&lt;td&gt;silent corruption → trust loss&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1 anchors the conversation in business semantics&lt;/strong&gt; — grain is the foundation; getting it right makes Steps 2-5 simpler, getting it wrong makes them all moot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2 turns architecture into ownership&lt;/strong&gt; — naming the layer boundary makes it easy to talk about who reads what, who's allowed to break what, and what notice consumers get.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 prevents the most common production incident&lt;/strong&gt; — non-idempotent loads are the #1 source of "duplicate row" bug reports; demonstrating the &lt;code&gt;MERGE&lt;/code&gt; + &lt;code&gt;QUALIFY ROW_NUMBER&lt;/code&gt; pattern signals senior fluency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 turns debugging from hours to minutes&lt;/strong&gt; — lineage columns are the difference between "I can fix this in 10 min" and "I'll get back to you tomorrow."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5 is the operational backstop&lt;/strong&gt; — even with steps 1-4 done well, you need reconciliation to catch the failures you didn't anticipate; the gate-before-promotion pattern blocks drift before consumers see it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The template's value compounds&lt;/strong&gt; — each step makes the next one easier, and skipping any step weakens the foundation that the later steps build on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; More &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice problems&lt;/a&gt; for end-to-end pipeline design and the &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice page&lt;/a&gt; for grain and dimensional patterns. Course: &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;ETL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — ETL pipelines&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;ETL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — data modeling&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Data modeling problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Language — SQL&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;All SQL practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to crack data lake architecture interviews
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Master the four primitives — zones, ingest flow, pattern selection, answer template
&lt;/h3&gt;

&lt;p&gt;If you can draw the bronze/silver/gold zones with ownership labels, walk the ingest → catalog → compute flow without skipping the catalog, articulate when a lakehouse beats a warehouse and when it doesn't, and structure your answer using the five-step grain → landing/conformed → idempotency → lineage → reconciliation template — you can clear most data-engineering system-design rounds. What remains is dialect-specific (Spark vs Snowflake idioms, Iceberg vs Delta semantics) and behavioral.&lt;/p&gt;

&lt;h3&gt;
  
  
  Always state grain in the first sentence
&lt;/h3&gt;

&lt;p&gt;Before drawing any boxes, name the grain: "this is order-line grain" or "this is user-day grain". Most wrong answers in a system-design round trace back to a grain ambiguity that nobody named. Stating grain explicitly costs five seconds and saves the entire round.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pick Iceberg unless you have a reason not to
&lt;/h3&gt;

&lt;p&gt;Iceberg has the broadest engine support (Spark, Trino, Snowflake, BigQuery, Dremio, Athena) and is the most engine-agnostic of the three open table formats. Pick Delta if your stack is Databricks-centric and Spark-only. Pick Hudi if your workload is upsert-heavy CDC. State the choice and the reason out loud — "I'd pick Iceberg for engine portability" — interviewers grade the reasoning more than the choice.&lt;/p&gt;
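
&lt;p&gt;If the interviewer asks what the choice looks like in code, one statement is enough. A sketch in Spark SQL with Iceberg; the catalog, table, and partition spec are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Iceberg table via Spark SQL; days(order_ts) is Iceberg's
-- hidden-partitioning transform (names are illustrative).
CREATE TABLE lake.silver.orders (
    order_id  BIGINT,
    amount    DECIMAL(18,2),
    order_ts  TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(order_ts));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;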

&lt;h3&gt;
  
  
  Treat idempotency as table stakes, not advanced
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;MERGE&lt;/code&gt; instead of &lt;code&gt;INSERT&lt;/code&gt;, partition-overwrite for backfill, and pure transformations whose output depends only on inputs (never &lt;code&gt;NOW()&lt;/code&gt; or random) — these are baseline expectations, not advanced techniques. If you forget to mention idempotency in a system-design round, the interviewer will assume you have not shipped a production pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Spark for batch, Trino for interactive, DuckDB for ad-hoc
&lt;/h3&gt;

&lt;p&gt;Spark dominates batch + streaming with the richest connector ecosystem; Trino dominates federated interactive SQL across many catalogs; DuckDB is rising fast for single-node ad-hoc analytics under 1TB. Naming the right tool for the workload (without over-explaining) signals breadth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reconciliation is what separates "we trust the lake" from "we hope the lake is right"
&lt;/h3&gt;

&lt;p&gt;Always include a reconciliation step that compares aggregate metrics between the lake and the source system, alerts on drift above a tolerance, and blocks gold promotion until drift is investigated. The five seconds it takes to mention reconciliation is the difference between a senior signal and a mid-level signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice page&lt;/a&gt; for medallion-zone and end-to-end pipeline problems. Drill the related topic pages: &lt;a href="https://pipecode.ai/explore/practice/topic/streaming" rel="noopener noreferrer"&gt;streaming&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/dimensional-modeling" rel="noopener noreferrer"&gt;dimensional modeling&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/data-modeling" rel="noopener noreferrer"&gt;data modeling practice&lt;/a&gt;. The interview-first courses page bundles structured curricula — start with &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;, or &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;. For broader coverage, &lt;a href="https://pipecode.ai/explore/practice/topics" rel="noopener noreferrer"&gt;browse by topic&lt;/a&gt; or read the &lt;a href="https://pipecode.ai/blogs/sql-interview-questions-for-data-engineering" rel="noopener noreferrer"&gt;SQL interview questions for data engineering&lt;/a&gt; and &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top data engineering interview questions 2026&lt;/a&gt; blogs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is data lake architecture?
&lt;/h3&gt;

&lt;p&gt;Data lake architecture is the set of conventions — &lt;strong&gt;layered zones (bronze/silver/gold), an ingest → catalog → compute flow on object storage, an open table format for ACID semantics, and disciplined ownership and quality contracts&lt;/strong&gt; — that turn raw object storage into a trustworthy analytics platform. Without these conventions, a "data lake" devolves into a data swamp where nobody can trust the numbers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between a data lake, a data warehouse, and a lakehouse?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;data lake&lt;/strong&gt; is cheap, flexible object storage with file-based reads and no built-in ACID; a &lt;strong&gt;cloud warehouse&lt;/strong&gt; (Snowflake, BigQuery, Redshift) is a managed system with proprietary storage, full ACID, and SQL-first ergonomics; a &lt;strong&gt;lakehouse&lt;/strong&gt; is a lake plus an open table format (Iceberg, Delta Lake, Hudi) that adds ACID, time travel, schema evolution, and concurrent writers — bringing warehouse-like semantics to object storage. Most organizations run a hybrid: lake/lakehouse for high-volume + ML workloads, warehouse for curated SQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are bronze, silver, and gold layers?
&lt;/h3&gt;

&lt;p&gt;Bronze (or landing/raw) is an &lt;strong&gt;append-only mirror&lt;/strong&gt; of source payloads with minimal transformation. Silver (or refined/conformed) applies &lt;strong&gt;dedup, type coercion, and conformed business keys&lt;/strong&gt;; this is the source of truth for downstream applications. Gold (or curated/consumption) publishes &lt;strong&gt;subject-area marts and star-schema fact + dim tables&lt;/strong&gt; for analyst SQL and BI dashboards. The names vary across vendors but the three-tier shape is universal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need Iceberg, Delta Lake, or Hudi for every project?
&lt;/h3&gt;

&lt;p&gt;No. Small teams can start with well-partitioned &lt;strong&gt;Parquet&lt;/strong&gt; and strict naming conventions. Reach for an open table format when you need &lt;strong&gt;ACID transactions, concurrent writers, partition evolution, time travel, or simpler upserts and deletes&lt;/strong&gt;. Pick &lt;strong&gt;Iceberg&lt;/strong&gt; for the broadest engine support, &lt;strong&gt;Delta&lt;/strong&gt; for tightest Databricks/Spark integration, &lt;strong&gt;Hudi&lt;/strong&gt; for upsert-heavy CDC workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the small-file problem?
&lt;/h3&gt;

&lt;p&gt;When a lake table accumulates millions of small files (e.g., 5KB each from frequent micro-batch writes), query planning spends more time &lt;strong&gt;listing files in the catalog and metastore&lt;/strong&gt; than actually scanning data — a Spark or Trino query that should take 500ms can take 50 seconds. The fix is &lt;strong&gt;scheduled compaction jobs&lt;/strong&gt; that rewrite many small files into fewer 128MB-1GB files, plus targeting larger micro-batch sizes upstream.&lt;/p&gt;
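
&lt;p&gt;Both major table formats ship a compaction command, so naming one concretely is an easy win. Two sketches (the Iceberg catalog name is an illustrative assumption):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Delta Lake: rewrite small files into larger ones.
OPTIMIZE silver.orders;

-- Apache Iceberg: the equivalent Spark stored procedure.
CALL catalog.system.rewrite_data_files(table =&amp;gt; 'silver.orders');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;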

&lt;h3&gt;
  
  
  How do I handle schema evolution in a data lake?
&lt;/h3&gt;

&lt;p&gt;Open table formats handle schema evolution gracefully — adding a column or widening a type is a single metadata commit. Without a table format, schema evolution requires &lt;strong&gt;rewriting partitions&lt;/strong&gt; or carrying a column-version field on every row. Either way, the silver layer should be the &lt;strong&gt;schema-stability boundary&lt;/strong&gt;: bronze accepts whatever the source sends, silver enforces a canonical schema, and changes to silver require explicit consumer notice (typically 30 days).&lt;/p&gt;
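
&lt;p&gt;On a table format, the "single metadata commit" is literally one statement. A sketch in Iceberg-style Spark SQL; the column names are illustrative assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Metadata-only schema changes: no data files are rewritten.
ALTER TABLE silver.orders ADD COLUMN discount_pct DECIMAL(5,2);

-- Type widening (e.g., INT to BIGINT) is likewise a metadata
-- commit in Iceberg.
ALTER TABLE silver.orders ALTER COLUMN order_id TYPE BIGINT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;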

&lt;h3&gt;
  
  
  How does this connect to data engineering interviews on PipeCode?
&lt;/h3&gt;

&lt;p&gt;System-design questions still reduce to &lt;strong&gt;SQL queries, Python data transforms, and dimensional modeling decisions&lt;/strong&gt;. PipeCode focuses on those signals with &lt;strong&gt;450+&lt;/strong&gt; problems — drill SQL aggregations and joins, Python pipeline patterns, and dimensional models, then layer on system-design depth via the courses. Use &lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;Practice&lt;/a&gt; once you can draw the medallion zones and the ingest → catalog → compute flow confidently.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing data lake architecture problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>interview</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Engineering Roadmap for Freshers (2026): A 13-Step Beginner's Guide from SQL to Your First Data Engineering Job</title>
      <dc:creator>Gowtham Potureddi</dc:creator>
      <pubDate>Mon, 11 May 2026 03:17:57 +0000</pubDate>
      <link>https://forem.com/gowthampotureddi/data-engineering-roadmap-for-freshers-2026-a-13-step-beginners-guide-from-sql-to-your-first-4b51</link>
      <guid>https://forem.com/gowthampotureddi/data-engineering-roadmap-for-freshers-2026-a-13-step-beginners-guide-from-sql-to-your-first-4b51</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data engineering&lt;/strong&gt; is one of the fastest-growing tech careers in 2026. Companies collect huge amounts of data every day, and &lt;strong&gt;data engineers&lt;/strong&gt; build the systems that &lt;strong&gt;collect, clean, transform, store, and deliver&lt;/strong&gt; that data so analysts, scientists, and product teams can use it. If you're a fresher and confused about where to start, this &lt;strong&gt;data engineering roadmap for freshers&lt;/strong&gt; lays out a clear, ordered 13-step path — what to learn first, what to learn next, what to build, and how to prove the work to a recruiter.&lt;/p&gt;

&lt;p&gt;This guide is a beginner-first walkthrough for &lt;strong&gt;how to become a data engineer in 2026&lt;/strong&gt; without a CS degree, three certificates, or a Spark cluster on day one. The 13 steps are grouped into five learning blocks below, each with a tiny worked example you can run on your laptop. Most freshers fail because they jump to Spark too early, ignore SQL depth, avoid projects, or watch tutorials without practising — the roadmap below fixes all four. Examples use &lt;strong&gt;PostgreSQL&lt;/strong&gt; SQL (the dialect every coding-environment interview defaults to) and standard-library &lt;strong&gt;Python&lt;/strong&gt; so you can run everything on a laptop without setup overhead. Default plan: &lt;strong&gt;about 6–9 months at 10–15 hours per week&lt;/strong&gt; to be job-ready, &lt;strong&gt;9–12 months at 6–8 hours per week&lt;/strong&gt; for working learners.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0w43zecdwk9uoxiax9k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz0w43zecdwk9uoxiax9k.jpeg" alt="Bold 2026 data engineering roadmap header for freshers — SQL, Python, modeling, ETL on a dark purple background." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Step 1 — Master SQL: The Most Important Skill for a Data Engineer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  SQL fundamentals, joins, aggregations, window functions, and the queries you'll write every day
&lt;/h3&gt;

&lt;p&gt;SQL is the &lt;strong&gt;foundation of data engineering&lt;/strong&gt; — you'll write it daily for querying, cleaning, transforming, joining datasets, building reports, and writing ETL logic. Master SQL first; everything else becomes easier.&lt;/p&gt;

&lt;p&gt;The five SQL skill clusters every fresher needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Basics&lt;/strong&gt; — &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, &lt;code&gt;LIMIT&lt;/code&gt;, &lt;code&gt;DISTINCT&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregations&lt;/strong&gt; — &lt;code&gt;COUNT&lt;/code&gt;, &lt;code&gt;SUM&lt;/code&gt;, &lt;code&gt;AVG&lt;/code&gt;, &lt;code&gt;MIN&lt;/code&gt;, &lt;code&gt;MAX&lt;/code&gt;, &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Joins&lt;/strong&gt; — &lt;code&gt;INNER&lt;/code&gt;, &lt;code&gt;LEFT&lt;/code&gt;, &lt;code&gt;RIGHT&lt;/code&gt;, &lt;code&gt;FULL&lt;/code&gt;, &lt;code&gt;SELF&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Window functions&lt;/strong&gt; — &lt;code&gt;ROW_NUMBER&lt;/code&gt;, &lt;code&gt;RANK&lt;/code&gt;, &lt;code&gt;DENSE_RANK&lt;/code&gt;, &lt;code&gt;LAG&lt;/code&gt;, &lt;code&gt;LEAD&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced&lt;/strong&gt; — CTEs, subqueries, &lt;code&gt;CASE&lt;/code&gt;, &lt;code&gt;NULL&lt;/code&gt; handling, date functions, indexes, query optimisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqhu41ryew4f52c8oyn4.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqhu41ryew4f52c8oyn4.jpeg" alt="Phase timeline table showing the four-phase data engineering roadmap for freshers — weeks 1-26 with one shippable proof per phase." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; SQL is non-negotiable. Drill it daily on a free coding environment (DataLemur, LeetCode SQL, StrataScratch, HackerRank SQL). Most fresher rejections at the SQL screen are not from missing syntax — they are from joining at the wrong grain or putting an aggregate in the wrong clause.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  SQL basics — &lt;code&gt;SELECT&lt;/code&gt;, &lt;code&gt;WHERE&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, &lt;code&gt;LIMIT&lt;/code&gt;, &lt;code&gt;DISTINCT&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The bedrock SQL shape: &lt;code&gt;SELECT cols FROM table WHERE row_filter ORDER BY col DESC LIMIT N&lt;/code&gt;. That one query covers most "show me the top X by Y" prompts you'll ever write.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELECT cols FROM table&lt;/code&gt;&lt;/strong&gt; — pick the columns you actually need; never &lt;code&gt;SELECT *&lt;/code&gt; in production.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;WHERE filter&lt;/code&gt;&lt;/strong&gt; — row-level predicate; runs before grouping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY col DESC&lt;/code&gt;&lt;/strong&gt; — sort the result; &lt;code&gt;ASC&lt;/code&gt; is default, &lt;code&gt;DESC&lt;/code&gt; is biggest-first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LIMIT N&lt;/code&gt;&lt;/strong&gt; — keep only the top &lt;code&gt;N&lt;/code&gt; rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DISTINCT col&lt;/code&gt;&lt;/strong&gt; — collapse duplicate values so each distinct value appears exactly once.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 4-row &lt;code&gt;employees&lt;/code&gt; table with &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;salary&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;45000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return the names and salaries of employees who earn more than 50,000, sorted from highest to lowest salary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;WHERE salary &amp;gt; 50000&lt;/code&gt; runs first and drops Bob (45000). The remaining three rows are then sorted by salary in descending order, so Carol (highest) comes first, Alice second, Dan third. No &lt;code&gt;LIMIT&lt;/code&gt;, so all three qualifying rows are returned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM employees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all 4 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE salary &amp;gt; 50000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop Bob (45000); 3 rows left&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;sort: Carol (90000) → Alice (70000) → Dan (55000)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT name, salary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the two named columns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;70000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always name the columns in the &lt;code&gt;SELECT&lt;/code&gt;; &lt;code&gt;SELECT *&lt;/code&gt; outside an exploratory REPL is a code smell.&lt;/p&gt;

&lt;h4&gt;
  
  
  Aggregations — &lt;code&gt;GROUP BY&lt;/code&gt; + &lt;code&gt;HAVING&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;The aggregation shape: &lt;code&gt;SELECT dim, AGG(col) FROM table GROUP BY dim HAVING AGG_filter&lt;/code&gt;. &lt;code&gt;GROUP BY&lt;/code&gt; collapses many rows to one row per group; &lt;code&gt;HAVING&lt;/code&gt; filters the resulting groups (you cannot put an aggregate in &lt;code&gt;WHERE&lt;/code&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COUNT(*)&lt;/code&gt;&lt;/strong&gt; — number of rows per group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SUM(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;AVG(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;MIN(col)&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;MAX(col)&lt;/code&gt;&lt;/strong&gt; — collapse a numeric column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;GROUP BY dim&lt;/code&gt;&lt;/strong&gt; — one output row per distinct value of &lt;code&gt;dim&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;HAVING AGG &amp;gt; N&lt;/code&gt;&lt;/strong&gt; — keep only groups whose aggregate exceeds &lt;code&gt;N&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 6-row &lt;code&gt;employees&lt;/code&gt; table with &lt;code&gt;department&lt;/code&gt; and &lt;code&gt;salary&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return the average salary per department, but only show departments whose average exceeds 60,000.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;GROUP BY department&lt;/code&gt; collapses the six rows into three groups — Engineering, Sales, Marketing. &lt;code&gt;AVG(salary)&lt;/code&gt; computes the per-group average: Engineering 85000, Sales 52500, Marketing 62500. &lt;code&gt;HAVING AVG(salary) &amp;gt; 60000&lt;/code&gt; then drops Sales (52500 fails the threshold) and keeps the other two.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM employees&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all 6 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROUP BY department&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3 groups — Engineering (2 rows), Sales (2 rows), Marketing (2 rows)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;AVG(salary)&lt;/code&gt; per group&lt;/td&gt;
&lt;td&gt;Engineering 85000, Sales 52500, Marketing 62500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HAVING AVG(salary) &amp;gt; 60000&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop Sales (52500 fails); keep Engineering + Marketing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT department, AVG(salary) AS avg_salary&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the 2 surviving rows&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;avg_salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;85000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;62500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; row predicates → &lt;code&gt;WHERE&lt;/code&gt;; aggregate predicates → &lt;code&gt;HAVING&lt;/code&gt;. Putting &lt;code&gt;AVG(salary) &amp;gt; X&lt;/code&gt; in &lt;code&gt;WHERE&lt;/code&gt; is an error — aggregates aren't allowed there because &lt;code&gt;WHERE&lt;/code&gt; runs before grouping.&lt;/p&gt;

&lt;h4&gt;
  
  
  Joins — connecting tables on a common key
&lt;/h4&gt;

&lt;p&gt;Joins combine columns from two tables on a matching key. The four every fresher needs: &lt;strong&gt;&lt;code&gt;INNER&lt;/code&gt;&lt;/strong&gt; (only matched rows survive), &lt;strong&gt;&lt;code&gt;LEFT&lt;/code&gt;&lt;/strong&gt; (all rows from the left table, even unmatched), &lt;strong&gt;&lt;code&gt;RIGHT&lt;/code&gt;&lt;/strong&gt; (mirror of LEFT, rarely used), &lt;strong&gt;&lt;code&gt;FULL&lt;/code&gt;&lt;/strong&gt; (all rows from both sides). &lt;code&gt;SELF JOIN&lt;/code&gt; joins a table to itself for hierarchies (manager / employee, parent / child).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;INNER JOIN&lt;/code&gt;&lt;/strong&gt; — strict match on both sides.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;LEFT JOIN&lt;/code&gt;&lt;/strong&gt; — keep every left row; &lt;code&gt;NULL&lt;/code&gt; on the right when no match (sketched after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;RIGHT JOIN&lt;/code&gt;&lt;/strong&gt; — same as LEFT with sides swapped; usually rewrite as LEFT.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;FULL JOIN&lt;/code&gt;&lt;/strong&gt; — keep every row from both sides; useful for reconciliation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SELF JOIN&lt;/code&gt;&lt;/strong&gt; — alias the same table twice (&lt;code&gt;employees a JOIN employees b ON a.manager_id = b.id&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
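
&lt;p&gt;To see the LEFT-vs-INNER difference in miniature, here is a sketch that assumes &lt;code&gt;customers&lt;/code&gt; has one extra, hypothetical row &lt;code&gt;C3&lt;/code&gt; (Eve) with no orders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Assume customers also contains (C3, 'Eve'), who has no orders.
SELECT c.customer_name, o.order_id
FROM customers c
LEFT JOIN orders o
    ON o.customer_id = c.customer_id;
-- Eve still appears, with order_id = NULL;
-- an INNER JOIN would drop her row entirely.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;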

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; An &lt;code&gt;orders&lt;/code&gt; table and a &lt;code&gt;customers&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;customers&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Return one row per order showing &lt;code&gt;order_id&lt;/code&gt; and the matching &lt;code&gt;customer_name&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;INNER JOIN&lt;/code&gt; (the default form when you just write &lt;code&gt;JOIN&lt;/code&gt;) matches each order to its customer using &lt;code&gt;customer_id&lt;/code&gt;. Order 101 → Alice, order 102 → Bob, order 103 → Alice. All three orders have a matching customer, so every order survives.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;scan &lt;code&gt;orders&lt;/code&gt; (left side)&lt;/td&gt;
&lt;td&gt;3 rows: 101→C1, 102→C2, 103→C1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;for each row, look up &lt;code&gt;customer_id&lt;/code&gt; in &lt;code&gt;customers&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;C1→Alice (twice), C2→Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INNER JOIN&lt;/code&gt; keeps only matched pairs&lt;/td&gt;
&lt;td&gt;all 3 orders matched, 0 dropped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT o.order_id, c.customer_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;project the 2 named columns&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always give every table a short alias (&lt;code&gt;o&lt;/code&gt;, &lt;code&gt;c&lt;/code&gt;) and prefix every column (&lt;code&gt;o.order_id&lt;/code&gt;, &lt;code&gt;c.customer_name&lt;/code&gt;) — the SQL becomes self-documenting.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;code&gt;SELECT *&lt;/code&gt; everywhere — production queries always name the columns.&lt;/li&gt;
&lt;li&gt;Putting an aggregate in &lt;code&gt;WHERE&lt;/code&gt; instead of &lt;code&gt;HAVING&lt;/code&gt; — PostgreSQL rejects it with "aggregate functions are not allowed in WHERE".&lt;/li&gt;
&lt;li&gt;Joining at the wrong grain (one-to-many without thinking) — the #1 source of "the number is suddenly 3× too high" bugs; see the sketch after this list.&lt;/li&gt;
&lt;li&gt;Memorising syntax without internalising &lt;strong&gt;which side keeps its rows&lt;/strong&gt; in a &lt;code&gt;LEFT JOIN&lt;/code&gt; — the part that breaks numbers.&lt;/li&gt;
&lt;li&gt;Skipping window functions because they "look hard" — interviewers love them; they take a week to learn.&lt;/li&gt;
&lt;/ul&gt;
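&lt;p&gt;The wrong-grain bullet deserves a demo. A minimal Pandas sketch (toy numbers invented for this demo) of how a one-to-many join silently multiplies a one-side metric:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# One customer (the "one" side) carries a credit_limit of 1000.
customers = pd.DataFrame({"customer_id": ["C1"], "credit_limit": [1000]})
# Three orders (the "many" side) all belong to that customer.
orders = pd.DataFrame({"order_id": [101, 102, 103],
                       "customer_id": ["C1", "C1", "C1"]})

# Joining the one-side metric onto the many side copies it once per order.
joined = orders.merge(customers, on="customer_id")

print(joined["credit_limit"].sum())     # 3000 - three copies of the same 1000
print(customers["credit_limit"].sum())  # 1000 - the number you actually wanted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The fix is to aggregate to your reporting grain &lt;em&gt;before&lt;/em&gt; joining, or to sum from the table that owns the metric.&lt;/p&gt;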

&lt;h3&gt;
  
  
  Worked Problem on Ranking Top Earners per Department with Window Functions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A 6-row &lt;code&gt;employees&lt;/code&gt; table mixing departments and salaries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Rank each employee by salary &lt;strong&gt;within their department&lt;/strong&gt; (highest = rank 1) and return only the &lt;strong&gt;top earner per department&lt;/strong&gt;. Use a window function — pure &lt;code&gt;GROUP BY&lt;/code&gt; cannot keep both the rank and the row's other columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ROW_NUMBER&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
            &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)&lt;/code&gt; assigns a strict 1, 2, 3 sequence within each department, ordered by salary from highest to lowest. The outer &lt;code&gt;WHERE rank = 1&lt;/code&gt; keeps only the top-paid row per department. The wrapping subquery is needed because PostgreSQL evaluates window functions &lt;em&gt;after&lt;/em&gt; &lt;code&gt;WHERE&lt;/code&gt;, so we cannot filter &lt;code&gt;rank = 1&lt;/code&gt; in the same level where we compute it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the input rows above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;department&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;salary&lt;/th&gt;
&lt;th&gt;rank&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;90000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engineering&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;80000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Eve&lt;/td&gt;
&lt;td&gt;65000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;Frank&lt;/td&gt;
&lt;td&gt;60000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Dan&lt;/td&gt;
&lt;td&gt;55000&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sales&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;td&gt;50000&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After &lt;code&gt;WHERE rank = 1&lt;/code&gt;: three rows — one per department, the top earner.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;PARTITION BY department&lt;/code&gt;&lt;/strong&gt; — defines the group inside which the ranking happens; without it, the rank would be global across all employees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ORDER BY salary DESC&lt;/code&gt;&lt;/strong&gt; — descending so rank 1 is the highest-paid; ascending would give the lowest.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;ROW_NUMBER&lt;/code&gt; not &lt;code&gt;RANK&lt;/code&gt;&lt;/strong&gt; — a strict 1, 2, 3 sequence; even on salary ties it emits exactly one rank-1 row per partition, which is what "top earner" demands (see the tie-breaking sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outer &lt;code&gt;WHERE rank = 1&lt;/code&gt; filter&lt;/strong&gt; — Postgres cannot filter window-function output in the same query level; the wrap is required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One row per department guaranteed&lt;/strong&gt; — &lt;code&gt;ROW_NUMBER&lt;/code&gt; (unlike &lt;code&gt;RANK&lt;/code&gt; or &lt;code&gt;DENSE_RANK&lt;/code&gt;) never assigns the same number twice within a partition, so the result has exactly one row per group.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(N log N)&lt;/code&gt; from the partitioned sort; with an index on &lt;code&gt;(department, salary DESC)&lt;/code&gt; the sort can be skipped, leaving an &lt;code&gt;O(N)&lt;/code&gt; scan.&lt;/li&gt;
&lt;/ul&gt;
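&lt;p&gt;A quick way to see the &lt;code&gt;ROW_NUMBER&lt;/code&gt; vs &lt;code&gt;RANK&lt;/code&gt; tie behaviour for yourself is a tiny &lt;code&gt;sqlite3&lt;/code&gt; sketch (SQLite 3.25+ supports window functions; the tied salaries are invented for the demo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employees (name TEXT, department TEXT, salary INT);
    INSERT INTO employees VALUES
        ('Alice', 'Engineering', 90000),
        ('Bob',   'Engineering', 90000);  -- deliberate tie
""")

for row in con.execute("""
    SELECT name,
           ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn,
           RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS rnk
    FROM employees
"""):
    print(row)

# One of the two rows gets rn = 1 and the other rn = 2 (the tie is broken
# arbitrarily), but BOTH get rnk = 1 - so filtering on RANK() = 1 would
# return two "top earners" for the department.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;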

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; drill the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; for short curated reps; the structured path for fresher SQL is &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  2. Step 2 — Learn Python for Data Engineering
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Python, file handling, Pandas, and the API requests every DE writes
&lt;/h3&gt;

&lt;p&gt;Python is the &lt;strong&gt;glue language&lt;/strong&gt; for everything outside the database — ETL scripts, automation, data pipelines, API integrations, transformations. You don't need to be a Python wizard; you need to be fluent at reading CSVs, calling APIs, transforming data with Pandas, and writing small testable functions.&lt;/p&gt;

&lt;p&gt;Three Python skill clusters every fresher needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core Python&lt;/strong&gt; — variables, loops, functions, lists / dicts / sets, classes, exception handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File handling&lt;/strong&gt; — read and write CSV, JSON, and Excel files using the standard library and Pandas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries&lt;/strong&gt; — &lt;strong&gt;Pandas&lt;/strong&gt; for data transformation; &lt;strong&gt;Requests&lt;/strong&gt; for API calls; &lt;strong&gt;PySpark&lt;/strong&gt; later (Step 6) for big-data processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8oc72hmmtznwd591vwk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc8oc72hmmtznwd591vwk.jpeg" alt="Diagram of what a data engineer actually does — sources, pipelines, warehouse, consumers — with the data engineer owning the middle two stages." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the 10% of Python you actually use day-to-day is &lt;code&gt;csv&lt;/code&gt;, &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;pathlib&lt;/code&gt;, &lt;code&gt;collections&lt;/code&gt;, &lt;code&gt;dataclasses&lt;/code&gt;, &lt;code&gt;typing&lt;/code&gt;, and &lt;code&gt;pandas&lt;/code&gt;. Skip metaclasses, descriptors, and async event loops on day one — they're irrelevant to fresher DE work.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Core Python — loops, lists, and small functions
&lt;/h4&gt;

&lt;p&gt;The fresher Python invariant: write small, testable functions that loop over lists and dicts. Type hints (&lt;code&gt;def f(x: int) -&amp;gt; int:&lt;/code&gt;) make a 2-month-old script readable when you come back to it.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Variables and types&lt;/strong&gt; — &lt;code&gt;int&lt;/code&gt;, &lt;code&gt;float&lt;/code&gt;, &lt;code&gt;str&lt;/code&gt;, &lt;code&gt;bool&lt;/code&gt;, &lt;code&gt;None&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lists, dicts, sets&lt;/strong&gt; — ordered, key-value, unique-only.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loops&lt;/strong&gt; — &lt;code&gt;for x in xs:&lt;/code&gt; over iterables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Functions&lt;/strong&gt; — single-responsibility; takes inputs, returns outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exception handling&lt;/strong&gt; — &lt;code&gt;try / except FileNotFoundError&lt;/code&gt; for fragile I/O (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
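&lt;p&gt;A minimal sketch pulling those bullets together: a single-responsibility function with type hints and a &lt;code&gt;FileNotFoundError&lt;/code&gt; guard (the filename is hypothetical).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

def read_text_or_default(path: Path, default: str = "") -&amp;gt; str:
    """Return the file's contents, or a default when the file is missing."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        return default

# "missing.txt" is a hypothetical path used only to exercise the fallback branch.
print(read_text_or_default(Path("missing.txt"), default="(no file)"))  # (no file)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;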

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A Python list of three integers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Multiply each number by 2 and print the result. Show the canonical &lt;code&gt;for&lt;/code&gt; loop pattern that every other Python data-engineering script will mirror.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;for num in data:&lt;/code&gt; walks the list one element at a time, binding the current value to &lt;code&gt;num&lt;/code&gt;. Inside the loop body, &lt;code&gt;num * 2&lt;/code&gt; doubles the value and &lt;code&gt;print(...)&lt;/code&gt; writes it to stdout. The pattern generalises directly to "for every row in this CSV, do something" — replace &lt;code&gt;data&lt;/code&gt; with &lt;code&gt;csv.DictReader(f)&lt;/code&gt; and you have an ETL skeleton.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;iteration&lt;/th&gt;
&lt;th&gt;&lt;code&gt;num&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;num * 2&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;stdout&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;6&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;end&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;loop exits when list is exhausted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2
4
6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if your Python script grows past 100 lines and has zero functions, it's a notebook draft, not a script — refactor before sharing it.&lt;/p&gt;
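&lt;p&gt;Before moving on, here is the &lt;code&gt;csv.DictReader&lt;/code&gt; swap from the explanation above, spelled out as a minimal sketch (assuming a &lt;code&gt;sales.csv&lt;/code&gt; like the one used in the Pandas section below):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

# Same loop shape as `for num in data:` - only the iterable changed.
with open("sales.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):       # each row is a dict keyed by header
        print(int(row["amount"]) * 2)   # the "transform" step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;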

&lt;h4&gt;
  
  
  File handling — reading CSV and JSON
&lt;/h4&gt;

&lt;p&gt;Most data-engineering Python is reading a file, transforming the contents, and writing the result somewhere. The standard library has &lt;code&gt;csv&lt;/code&gt; and &lt;code&gt;json&lt;/code&gt; modules that cover 90% of fresher needs; for anything richer reach for Pandas.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;open(path, encoding='utf-8')&lt;/code&gt;&lt;/strong&gt; — open a text file safely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;csv.DictReader(f)&lt;/code&gt;&lt;/strong&gt; — iterate CSV rows as dictionaries (column-name access).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;json.load(f)&lt;/code&gt;&lt;/strong&gt; — parse a JSON file into a Python dict / list.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pathlib.Path('file.csv')&lt;/code&gt;&lt;/strong&gt; — modern path object; works on Windows, macOS, Linux.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;data.json&lt;/code&gt; file containing one JSON object.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"salary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;70000&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Open &lt;code&gt;data.json&lt;/code&gt;, parse it into a Python dict, and print the parsed result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;with open("data.json") as f:&lt;/code&gt; opens the file safely (the &lt;code&gt;with&lt;/code&gt; block guarantees the file is closed when the block exits, even on error). &lt;code&gt;json.load(f)&lt;/code&gt; parses the file's contents into a Python object — a dict here because the JSON started with &lt;code&gt;{&lt;/code&gt;. Printing the dict shows the parsed data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;with open("data.json") as f&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;file handle &lt;code&gt;f&lt;/code&gt; opens in text mode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;json.load(f)&lt;/code&gt; reads the file's text&lt;/td&gt;
&lt;td&gt;parses JSON object → Python &lt;code&gt;dict&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;bind result to &lt;code&gt;data&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;data = {"name": "Alice", "salary": 70000}&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;exit &lt;code&gt;with&lt;/code&gt; block&lt;/td&gt;
&lt;td&gt;file auto-closed (even on error)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(data)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dict printed to stdout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'name': 'Alice', 'salary': 70000}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always use &lt;code&gt;with open(...)&lt;/code&gt; rather than the bare &lt;code&gt;open()&lt;/code&gt; call — it auto-closes the file and handles exceptions cleanly.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pandas for tabular data — &lt;code&gt;read_csv&lt;/code&gt;, &lt;code&gt;groupby&lt;/code&gt;, &lt;code&gt;sum&lt;/code&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Pandas&lt;/strong&gt; is the Python library every DE uses for transforming tabular data. The three operations you'll do hundreds of times: read a CSV into a &lt;code&gt;DataFrame&lt;/code&gt;, group by one or more columns, aggregate with &lt;code&gt;sum&lt;/code&gt; / &lt;code&gt;mean&lt;/code&gt; / &lt;code&gt;count&lt;/code&gt;. &lt;strong&gt;Requests&lt;/strong&gt; is the API-call counterpart.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pd.read_csv('file.csv')&lt;/code&gt;&lt;/strong&gt; — read a CSV into a &lt;code&gt;DataFrame&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;df.groupby('col')&lt;/code&gt;&lt;/strong&gt; — group rows by a column.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;.sum()&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;.mean()&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;.count()&lt;/code&gt;&lt;/strong&gt; — aggregate the groups.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;requests.get(url).json()&lt;/code&gt;&lt;/strong&gt; — fetch a URL and parse the JSON response (see the sketch after this list).&lt;/li&gt;
&lt;/ul&gt;
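&lt;p&gt;A hedged sketch of the Requests call (the URL is a placeholder, not a real API): always set a timeout and turn HTTP errors into exceptions before trusting the payload.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

url = "https://api.example.com/orders"  # placeholder endpoint for the demo

try:
    resp = requests.get(url, timeout=10)  # never call an API without a timeout
    resp.raise_for_status()               # 4xx/5xx becomes an exception
    data = resp.json()
except requests.RequestException as exc:
    data = []
    print(f"API call failed: {exc}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;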

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file with 5 rows across two regions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Read &lt;code&gt;sales.csv&lt;/code&gt; into a Pandas DataFrame, group by &lt;code&gt;region&lt;/code&gt;, and print the sum of &lt;code&gt;amount&lt;/code&gt; per region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt; loads the entire CSV into a &lt;code&gt;DataFrame&lt;/code&gt;, with the first row treated as column headers. &lt;code&gt;df.groupby("region")&lt;/code&gt; produces a grouped object that buckets rows by region. &lt;code&gt;.sum()&lt;/code&gt; aggregates every numeric column within each bucket — here that's &lt;code&gt;order_id&lt;/code&gt; (sum of IDs, usually meaningless) and &lt;code&gt;amount&lt;/code&gt; (the metric we care about).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DataFrame with 5 rows × 3 columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.groupby("region")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bucket rows: North = {1, 2, 5}; South = {3, 4}&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;.sum()&lt;/code&gt; per bucket&lt;/td&gt;
&lt;td&gt;North: order_id sum = 8, amount = 320 (100+150+70); South: order_id sum = 7, amount = 200 (80+120)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(...)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the two-row grouped frame prints to stdout&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        order_id  amount
region
North          8     320
South          7     200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; when the data fits in memory and you don't need a database, Pandas is quicker to write than the equivalent SQL — but for anything past a few million rows, push the work back into SQL or PySpark.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Skipping type hints — code becomes unreadable in 2 months.&lt;/li&gt;
&lt;li&gt;Reading huge CSVs into Pandas without &lt;code&gt;chunksize&lt;/code&gt; — your laptop runs out of RAM (see the chunked sketch after this list).&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;requests&lt;/code&gt; without a timeout — a hung API call freezes your script forever (&lt;code&gt;requests.get(url, timeout=10)&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Not handling &lt;code&gt;None&lt;/code&gt; / missing values — &lt;code&gt;int(None)&lt;/code&gt; crashes with a &lt;code&gt;TypeError&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Writing 200-line scripts as one big block — break into &lt;code&gt;def&lt;/code&gt;-defined functions.&lt;/li&gt;
&lt;/ul&gt;
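&lt;p&gt;For the &lt;code&gt;chunksize&lt;/code&gt; point, a minimal sketch that streams a large &lt;code&gt;sales.csv&lt;/code&gt; without loading it all at once (the 100k chunk size is an arbitrary choice for the demo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Fold each 100k-row chunk into a running per-region total;
# peak memory stays at one chunk instead of the whole file.
totals = None
for chunk in pd.read_csv("sales.csv", chunksize=100_000):
    part = chunk.groupby("region")["amount"].sum()
    totals = part if totals is None else totals.add(part, fill_value=0)

print(totals)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;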

&lt;h3&gt;
  
  
  Worked Problem on Building a CSV-to-Summary Python ETL Script
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file with 5 rows.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;120&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Build a small Python ETL script that reads &lt;code&gt;sales.csv&lt;/code&gt;, sums &lt;code&gt;amount&lt;/code&gt; per &lt;code&gt;region&lt;/code&gt;, writes the result to &lt;code&gt;summary.csv&lt;/code&gt;, and prints the count of rows processed. This is the canonical Phase-1 portfolio script every fresher should ship to GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using Pandas + a writeable summary path
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;summarise_sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;as_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;summarise_sales&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The function takes two &lt;code&gt;Path&lt;/code&gt; objects so it's testable — you can call it from a test with mock paths instead of hardcoding filenames. &lt;code&gt;pd.read_csv(input_path)&lt;/code&gt; loads the CSV, &lt;code&gt;groupby("region", as_index=False)["amount"].sum()&lt;/code&gt; produces a clean two-column summary (&lt;code&gt;as_index=False&lt;/code&gt; keeps &lt;code&gt;region&lt;/code&gt; as a column rather than becoming the index), and &lt;code&gt;to_csv(output_path, index=False)&lt;/code&gt; writes the summary back out without Pandas' default integer index column. The function returns the row count so the caller can log a clean status line.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;summary.csv&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;stdout: &lt;code&gt;processed 5 rows&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the input rows above:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.read_csv("sales.csv")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;DataFrame with 5 rows × 3 columns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.groupby("region", as_index=False)["amount"].sum()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2-row summary DataFrame&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;summary.to_csv("summary.csv", index=False)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;file written to disk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;return len(df)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;returns &lt;code&gt;5&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(f"processed {rows} rows")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stdout: &lt;code&gt;processed 5 rows&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Path&lt;/code&gt; objects for testable I/O&lt;/strong&gt; — paths are inputs, not hardcoded constants, so the function works with any source / destination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;groupby(..., as_index=False)&lt;/code&gt;&lt;/strong&gt; — keeps &lt;code&gt;region&lt;/code&gt; as a regular column instead of the DataFrame index; the resulting CSV reads naturally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;["amount"].sum()&lt;/code&gt;&lt;/strong&gt; — selects the metric column before aggregation; otherwise Pandas would also sum &lt;code&gt;order_id&lt;/code&gt;, which is meaningless.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;to_csv(..., index=False)&lt;/code&gt;&lt;/strong&gt; — suppresses Pandas' default integer index column; the CSV has only the two columns you actually want.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return + print separation&lt;/strong&gt; — the function returns a value (good for tests); the caller decides whether to print it (good for scripts vs imports).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(N)&lt;/code&gt; where &lt;code&gt;N&lt;/code&gt; is the input row count; fits in memory up to a few million rows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for fresher Python reps see &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;; the structured path is &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for Data Engineering Interviews — Complete Fundamentals&lt;/a&gt;.&lt;/p&gt;





&lt;h2&gt;
  
  
  3. Steps 3-5 — Databases, Data Warehousing, and ETL/ELT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How data is stored, modeled, and moved through pipelines
&lt;/h3&gt;

&lt;p&gt;Three closely-related steps in one section because they answer the same question: &lt;em&gt;where does the data live, and how does it get there?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 3 — Databases.&lt;/strong&gt; Relational (PostgreSQL, MySQL) for transactional workloads; NoSQL (MongoDB, Cassandra, Redis) for specialised cases. Learn keys, normalisation, transactions, indexing, ACID.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 4 — Data Warehousing.&lt;/strong&gt; Snowflake, BigQuery, Redshift store analytics-ready data in &lt;strong&gt;fact tables&lt;/strong&gt; + &lt;strong&gt;dimension tables&lt;/strong&gt;, organised as a &lt;strong&gt;star schema&lt;/strong&gt; (fact in the middle, dimensions hanging off). Heavily asked in interviews.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 5 — ETL / ELT.&lt;/strong&gt; &lt;strong&gt;ETL&lt;/strong&gt; = Extract → Transform → Load (transform before loading). &lt;strong&gt;ELT&lt;/strong&gt; = Extract → Load → Transform (load raw, then transform inside the warehouse). Plus batch vs streaming pipelines, incremental loads, and CDC (change data capture); a safe-rerun (idempotent) load sketch follows this list.&lt;/li&gt;
&lt;/ul&gt;
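&lt;p&gt;Incremental loads only stay safe if a re-run cannot duplicate rows. A minimal sketch of that idempotent (safe-rerun) pattern, using &lt;code&gt;sqlite3&lt;/code&gt; (3.24+ for upsert syntax) and an upsert keyed on the primary key — table and rows invented for the demo:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INT PRIMARY KEY, amount NUMERIC)")

def load_batch(rows):
    # Upsert: re-running the same batch updates in place instead of duplicating.
    con.executemany(
        """INSERT INTO sales (order_id, amount) VALUES (?, ?)
           ON CONFLICT (order_id) DO UPDATE SET amount = excluded.amount""",
        rows,
    )

batch = [(101, 100), (102, 200)]
load_batch(batch)
load_batch(batch)  # safe re-run

print(con.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,) - not 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;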

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs7y2cc51ouqr2i8xr90.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqs7y2cc51ouqr2i8xr90.jpeg" alt="ETL flow diagram for freshers — source CSV through staging table to a curated warehouse table with the safe-rerun (idempotent) pattern." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; the three databases worth installing for practice: &lt;strong&gt;PostgreSQL&lt;/strong&gt; (covers 90% of relational SQL you'll see at work), &lt;strong&gt;SQLite&lt;/strong&gt; (zero-setup local dev), and one &lt;strong&gt;NoSQL&lt;/strong&gt; (MongoDB is the friendliest). Skip Redis until you genuinely need a cache; skip Cassandra until you genuinely have wide-column data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Relational databases — tables, keys, normalisation, ACID
&lt;/h4&gt;

&lt;p&gt;Relational databases store data in &lt;strong&gt;tables&lt;/strong&gt; with &lt;strong&gt;primary keys&lt;/strong&gt; (one column uniquely identifies each row) and &lt;strong&gt;foreign keys&lt;/strong&gt; (a column in one table references the primary key of another). &lt;strong&gt;Normalisation&lt;/strong&gt; splits data so each fact lives in exactly one place — no duplication, no inconsistency. &lt;strong&gt;ACID&lt;/strong&gt; properties (Atomicity, Consistency, Isolation, Durability) guarantee that transactions either fully succeed or fully roll back.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Primary key&lt;/strong&gt; — uniquely identifies a row (&lt;code&gt;customer_id&lt;/code&gt; in &lt;code&gt;customers&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreign key&lt;/strong&gt; — points to another table's primary key (&lt;code&gt;orders.customer_id → customers.customer_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalisation&lt;/strong&gt; — &lt;code&gt;1NF&lt;/code&gt; / &lt;code&gt;2NF&lt;/code&gt; / &lt;code&gt;3NF&lt;/code&gt; — split tables until each fact lives once.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indexing&lt;/strong&gt; — speeds up lookups; trade-off is slower writes and extra storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ACID transactions&lt;/strong&gt; — &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; (or &lt;code&gt;ROLLBACK;&lt;/code&gt; on failure).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A two-table relational design — &lt;code&gt;orders&lt;/code&gt; references &lt;code&gt;customers&lt;/code&gt; via the foreign key &lt;code&gt;customer_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;customers&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the &lt;code&gt;CREATE TABLE&lt;/code&gt; statements for &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;orders&lt;/code&gt; with proper primary keys and a foreign key from &lt;code&gt;orders&lt;/code&gt; to &lt;code&gt;customers&lt;/code&gt;. Then write a transactional &lt;code&gt;INSERT&lt;/code&gt; that adds a new customer plus their first order &lt;strong&gt;atomically&lt;/strong&gt; — both rows commit or neither does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;   &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;order_id&lt;/span&gt;    &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;amount&lt;/span&gt;      &lt;span class="nb"&gt;NUMERIC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'C3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Carol'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;104&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'C3'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;COMMIT&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;customers&lt;/code&gt; table declares &lt;code&gt;customer_id&lt;/code&gt; as &lt;code&gt;PRIMARY KEY&lt;/code&gt; (uniqueness + index automatically created). The &lt;code&gt;orders&lt;/code&gt; table's &lt;code&gt;customer_id&lt;/code&gt; is &lt;code&gt;REFERENCES customers(customer_id)&lt;/code&gt; — a foreign key that prevents you from inserting an order for a non-existent customer. The &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; block makes both inserts a single &lt;strong&gt;atomic transaction&lt;/strong&gt;: if the second insert fails for any reason, the first is rolled back too — the database never ends up half-written, with Carol present but her order missing, and the foreign key guarantees no order can ever point to a missing customer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;statement&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE customers&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty table, &lt;code&gt;customer_id&lt;/code&gt; enforced unique&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;CREATE TABLE orders&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty table; FK rejects orphan &lt;code&gt;customer_id&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BEGIN&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;open a transaction — changes are invisible until commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO customers ('C3', 'Carol')&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;row staged; FK in &lt;code&gt;orders&lt;/code&gt; will accept C3 later&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;INSERT INTO orders (104, 'C3', 75)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;row staged; FK satisfied because C3 exists in-tx&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COMMIT&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;both rows persisted atomically; on error, both rolled back&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the transaction commits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C3&lt;/td&gt;
&lt;td&gt;Carol&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;101&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;104&lt;/td&gt;
&lt;td&gt;C3&lt;/td&gt;
&lt;td&gt;75&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every multi-row write that has to be "all or nothing" goes inside a &lt;code&gt;BEGIN; … COMMIT;&lt;/code&gt; block — that's the entire point of a relational database.&lt;/p&gt;
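&lt;p&gt;You can watch the rollback half of that guarantee from Python with &lt;code&gt;sqlite3&lt;/code&gt;, whose connection context manager commits on success and rolls back on any exception (the deliberately broken insert is invented for the demo):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, customer_name TEXT)")
con.execute("CREATE TABLE orders (order_id INT PRIMARY KEY, customer_id TEXT NOT NULL)")

try:
    with con:  # BEGIN ... COMMIT, or ROLLBACK if the block raises
        con.execute("INSERT INTO customers VALUES ('C3', 'Carol')")
        # NULL customer_id violates NOT NULL, so the whole block fails.
        con.execute("INSERT INTO orders VALUES (104, NULL)")
except sqlite3.IntegrityError as exc:
    print(f"rolled back: {exc}")

# Carol's insert was rolled back along with the failed order.
print(con.execute("SELECT COUNT(*) FROM customers").fetchone())  # (0,)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;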

&lt;h4&gt;
  
  
  Data warehousing — fact tables, dimension tables, star schema
&lt;/h4&gt;

&lt;p&gt;A &lt;strong&gt;data warehouse&lt;/strong&gt; stores analytics-ready data optimised for fast &lt;code&gt;SELECT&lt;/code&gt; queries (not for high-volume &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt;). The canonical model is the &lt;strong&gt;star schema&lt;/strong&gt; — one &lt;strong&gt;fact table&lt;/strong&gt; in the middle that records events (sales, clicks, logins) surrounded by &lt;strong&gt;dimension tables&lt;/strong&gt; that describe context (customers, products, dates). Heavily tested at interviews.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fact table&lt;/strong&gt; — measures events; mostly numeric columns + foreign keys to dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dimension table&lt;/strong&gt; — descriptive context; mostly text columns (customer name, product category).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star schema&lt;/strong&gt; — one fact in the centre, dimensions hanging off as star points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake schema&lt;/strong&gt; — dimensions further normalised into sub-dimensions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partitioning / clustering&lt;/strong&gt; — physical layout choices that speed up filtered queries.&lt;/li&gt;
&lt;/ul&gt;
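
&lt;p&gt;To make those shapes concrete, here is a minimal DDL sketch of the star used below (types are illustrative; warehouse dialects differ):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- dimensions: descriptive context, exactly one row per key
CREATE TABLE dim_customer (customer_id VARCHAR PRIMARY KEY, customer_name VARCHAR);
CREATE TABLE dim_product  (product_id  VARCHAR PRIMARY KEY, product_name  VARCHAR);
CREATE TABLE dim_date     (date_id INT PRIMARY KEY, day INT, month INT, year INT);

-- fact: one row per event, a numeric measure + a key into each dimension
CREATE TABLE fact_sales (
    sale_id     VARCHAR PRIMARY KEY,
    date_id     INT     REFERENCES dim_date(date_id),
    customer_id VARCHAR REFERENCES dim_customer(customer_id),
    product_id  VARCHAR REFERENCES dim_product(product_id),
    amount      NUMERIC
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;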

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A star-schema design for an e-commerce sales fact with three dimensions.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fact_sales&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;sale_id&lt;/th&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;S1&lt;/td&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S2&lt;/td&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_customer&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_product&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;dim_date&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date_id&lt;/th&gt;
&lt;th&gt;day&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;20260501&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a query that joins the fact to all three dimensions and returns &lt;code&gt;customer_name&lt;/code&gt;, &lt;code&gt;product_name&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, and &lt;code&gt;amount&lt;/code&gt; for every sale. This is the canonical "fact + dim rollup" report every BI dashboard runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;fact_sales&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_product&lt;/span&gt;  &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;dim_date&lt;/span&gt;     &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The fact table sits in the middle and is joined once to each dimension on the matching dimension key. Because each dimension has exactly one row per dimension key, the joins do not multiply rows — the output has the same number of rows as &lt;code&gt;fact_sales&lt;/code&gt;. The &lt;code&gt;SELECT&lt;/code&gt; then pulls the descriptive columns from the dimensions plus the &lt;code&gt;amount&lt;/code&gt; from the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;scan &lt;code&gt;fact_sales&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;2 rows (S1, S2)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_customer&lt;/code&gt; on &lt;code&gt;customer_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;S1 → Alice, S2 → Bob; row count unchanged (1:1)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_product&lt;/code&gt; on &lt;code&gt;product_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;S1 → Book, S2 → Headphones; row count unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;join &lt;code&gt;dim_date&lt;/code&gt; on &lt;code&gt;date_id&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;both rows pick up month=5; row count unchanged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SELECT&lt;/code&gt; 4 projected columns&lt;/td&gt;
&lt;td&gt;final 2-row report&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_name&lt;/th&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Alice&lt;/td&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bob&lt;/td&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; fact tables hold the &lt;em&gt;measure&lt;/em&gt;; dimensions hold the &lt;em&gt;context&lt;/em&gt;. If you can't tell whether a column belongs in the fact or the dim, ask "is this a number we'll aggregate, or text we'll group by?"&lt;/p&gt;
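
&lt;p&gt;The rule of thumb in query form — the same star rolled up, aggregating the fact's measure and grouping by the dimensions' text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    p.product_name,
    d.month,
    SUM(f.amount) AS revenue        -- measure from the fact
FROM fact_sales f
JOIN dim_product p ON p.product_id = f.product_id
JOIN dim_date    d ON d.date_id    = f.date_id
GROUP BY p.product_name, d.month;   -- context from the dims
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;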

&lt;h4&gt;
  
  
  ETL vs ELT, batch vs streaming, and CDC in plain English
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; = &lt;em&gt;extract, transform, load&lt;/em&gt; — read source data, transform it in a separate engine (Spark, Python), then load the clean result into the warehouse. &lt;strong&gt;ELT&lt;/strong&gt; = &lt;em&gt;extract, load, transform&lt;/em&gt; — load the raw source straight into the warehouse, then transform with SQL. Modern cloud warehouses are powerful enough that ELT has become the default. &lt;strong&gt;Batch&lt;/strong&gt; processes data on a schedule (every hour / day); &lt;strong&gt;streaming&lt;/strong&gt; processes data as it arrives (sub-second). &lt;strong&gt;CDC&lt;/strong&gt; (change data capture) tracks &lt;code&gt;INSERT&lt;/code&gt; / &lt;code&gt;UPDATE&lt;/code&gt; / &lt;code&gt;DELETE&lt;/code&gt; events on a source so the warehouse stays in sync without re-loading the whole table.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ETL&lt;/strong&gt; — transform outside the warehouse (older pattern; Spark, Python, custom).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ELT&lt;/strong&gt; — transform inside the warehouse with SQL (newer; dbt, Snowflake, BigQuery).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch&lt;/strong&gt; — scheduled jobs (hourly, daily); cheaper, simpler, slightly stale data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming&lt;/strong&gt; — event-by-event processing (Kafka, Flink); fresher, more expensive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CDC&lt;/strong&gt; — incremental change tracking; loads only what changed since last run.&lt;/li&gt;
&lt;/ul&gt;
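
&lt;p&gt;Of these, CDC is the easiest to misread, so here it is in one query. Assuming a hypothetical &lt;code&gt;customers_changes&lt;/code&gt; feed that holds only the rows changed since the last run (with an &lt;code&gt;op&lt;/code&gt; flag), the apply step is a single &lt;code&gt;MERGE&lt;/code&gt; — a sketch in the standard form; dialect details vary:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- customers_changes: op = 'I' (insert), 'U' (update), or 'D' (delete)
MERGE INTO customers t
USING customers_changes s
  ON t.customer_id = s.customer_id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET customer_name = s.customer_name
WHEN NOT MATCHED AND s.op &lt;&gt; 'D' THEN
  INSERT (customer_id, customer_name) VALUES (s.customer_id, s.customer_name);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;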

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A daily-batch ETL skeleton in Python that loads yesterday's orders, transforms them, and writes a curated table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source: raw orders dropped to S3 daily under s3://orders/2026-05-08/orders.csv
target: warehouse table fact_orders, partitioned by order_date
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the three-stage &lt;strong&gt;ETL&lt;/strong&gt; pipeline shape — &lt;code&gt;extract&lt;/code&gt; reads the CSV, &lt;code&gt;transform&lt;/code&gt; cleans / dedupes / casts types, &lt;code&gt;load&lt;/code&gt; writes to the warehouse. Use plain Python pseudocode; the goal is the &lt;em&gt;shape&lt;/em&gt;, not a runnable example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://orders/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/orders.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="n"&gt;dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partition_date&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# warehouse-specific COPY or INSERT INTO fact_orders WHERE order_date = partition_date
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;extract&lt;/code&gt; is the only function that knows where the source is; &lt;code&gt;transform&lt;/code&gt; is pure (no I/O) and easy to unit-test; &lt;code&gt;load&lt;/code&gt; is the only function that writes to the warehouse. Splitting the pipeline into three named functions makes the script readable, testable, and easy to swap (you can replace &lt;code&gt;extract&lt;/code&gt; with a Postgres reader without touching &lt;code&gt;transform&lt;/code&gt;). The dedupe + type-cast inside &lt;code&gt;transform&lt;/code&gt; is the canonical "raw → curated" cleaning step.&lt;/p&gt;
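
&lt;p&gt;Because &lt;code&gt;transform&lt;/code&gt; is pure, it can be exercised with an in-memory frame — a minimal check (the fixture values are made up):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# tiny fixture: one duplicated order_id, string-typed amounts
df = pd.DataFrame({
    "order_id":   [1, 1, 2],
    "order_date": ["2026-05-08", "2026-05-08", "2026-05-08"],
    "amount":     ["100", "100", "200"],
})

out = transform(df)                   # transform() as defined above
assert len(out) == 2                  # duplicate order_id dropped
assert out["amount"].dtype == float   # amount cast to numeric
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;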

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;function&lt;/th&gt;
&lt;th&gt;what it does&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;extract("2026-05-08")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;read S3 path for that day&lt;/td&gt;
&lt;td&gt;raw DataFrame from CSV&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step a&lt;/td&gt;
&lt;td&gt;&lt;code&gt;drop_duplicates(subset=["order_id"])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;duplicate orders removed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step b&lt;/td&gt;
&lt;td&gt;&lt;code&gt;pd.to_datetime(...).dt.date&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;order_date&lt;/code&gt; cast to date type&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;transform(df)&lt;/code&gt; step c&lt;/td&gt;
&lt;td&gt;&lt;code&gt;astype(float)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;amount&lt;/code&gt; cast to float&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;load(df, date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;warehouse &lt;code&gt;COPY&lt;/code&gt; / &lt;code&gt;INSERT&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;row count returned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;print(...)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;stdout summary&lt;/td&gt;
&lt;td&gt;&lt;code&gt;loaded 5 rows for 2026-05-08&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loaded 5 rows for 2026-05-08
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; always separate &lt;code&gt;extract&lt;/code&gt;, &lt;code&gt;transform&lt;/code&gt;, &lt;code&gt;load&lt;/code&gt; into three named functions — even when the pipeline is small. The shape is what reviewers look for.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Treating data warehouses like OLTP databases — running thousands of &lt;code&gt;UPDATE&lt;/code&gt;s per minute (warehouses optimise for &lt;code&gt;SELECT&lt;/code&gt;, not &lt;code&gt;UPDATE&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Modelling everything in one wide table — kills performance and makes joins impossible later.&lt;/li&gt;
&lt;li&gt;Confusing batch and streaming — batch is the default; pick streaming only when you genuinely need sub-second freshness.&lt;/li&gt;
&lt;li&gt;Forgetting CDC — re-loading the whole &lt;code&gt;customers&lt;/code&gt; table every night when only 100 rows changed wastes hours.&lt;/li&gt;
&lt;li&gt;Skipping the staging step — going source → curated directly means you can't reproduce yesterday's run.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Building an Idempotent Daily ETL with Quality Checks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A daily CSV &lt;code&gt;orders_2026-05-08.csv&lt;/code&gt; that lands in S3. The warehouse has a &lt;code&gt;fact_orders&lt;/code&gt; table partitioned by &lt;code&gt;order_date&lt;/code&gt;. The pipeline must be &lt;strong&gt;idempotent&lt;/strong&gt; — running it twice with the same input produces the same output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a Python script that loads the daily CSV, replaces today's partition (so a rerun does not double-count), and runs three data-quality checks (row count &amp;gt; 0, no &lt;code&gt;NULL&lt;/code&gt; order_ids, no duplicate order_ids). Fail loudly with a non-zero exit code if any check fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;DELETE&lt;/code&gt; of today's partition + three quality checks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;csv_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DELETE FROM fact_orders WHERE order_date = %s;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iterrows&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT INTO fact_orders (order_id, order_date, amount) VALUES (%s, %s, %s);&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM fact_orders WHERE order_date = %s;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="p"&gt;,))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM fact_orders WHERE order_id IS NULL;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            SELECT COUNT(*) FROM (
              SELECT order_id, COUNT(*) c FROM fact_orders GROUP BY 1 HAVING COUNT(*) &amp;gt; 1
            ) d;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cur&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetchone&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;commit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;warehouse&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders_&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LOAD_DATE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;DELETE FROM fact_orders WHERE order_date = LOAD_DATE&lt;/code&gt; wipes today's partition before re-inserting — that's what makes the pipeline idempotent (a rerun overwrites today's slice instead of appending). The &lt;code&gt;INSERT&lt;/code&gt; loop loads every CSV row with explicit type casts so dates land as dates and amounts land as numbers. Three quality checks then verify the load worked — non-zero row count, no null primary keys, no duplicates. Any failure returns exit code &lt;code&gt;1&lt;/code&gt; so the orchestrator (Airflow, cron) notices automatically and the developer is paged.&lt;/p&gt;
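
&lt;p&gt;One design note: the row-by-row loop keeps the example readable, but the same load can be sent as a single batch — a hedged drop-in for the &lt;code&gt;iterrows()&lt;/code&gt; loop above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# drop-in replacement for the per-row INSERT loop: executemany sends
# the same statement for every row with far fewer client round trips
rows = [
    (int(r.order_id), str(r.order_date), float(r.amount))
    for r in df.itertuples(index=False)
]
cur.executemany(
    "INSERT INTO fact_orders (order_id, order_date, amount) VALUES (%s, %s, %s);",
    rows,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;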

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After a healthy run:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2026-05-08&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Exit code: &lt;code&gt;0&lt;/code&gt;. A second run of the same script produces an identical &lt;code&gt;fact_orders&lt;/code&gt; (idempotent).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for a clean 3-row CSV:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DELETE WHERE order_date = '2026-05-08'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;today's partition wiped&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INSERT&lt;/code&gt; 3 CSV rows&lt;/td&gt;
&lt;td&gt;3 rows in today's partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;row-count check&lt;/td&gt;
&lt;td&gt;3 &amp;gt; 0 → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;null-PK check&lt;/td&gt;
&lt;td&gt;0 nulls → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;duplicate-PK check&lt;/td&gt;
&lt;td&gt;0 dupes → pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;commit&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;exit &lt;code&gt;0&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;DELETE&lt;/code&gt; of today's partition before insert&lt;/strong&gt; — makes the pipeline idempotent; rerun overwrites instead of appending.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;explicit type casts in the &lt;code&gt;INSERT&lt;/code&gt;&lt;/strong&gt; — &lt;code&gt;int()&lt;/code&gt;, &lt;code&gt;float()&lt;/code&gt;, ISO date strings make the warehouse see clean types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;three quality checks inside the same job&lt;/strong&gt; — checks live next to the load, not in a "we'll add monitoring later" backlog.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;non-zero exit code on failure&lt;/strong&gt; — Airflow / cron / GitHub Actions detect the failure automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;conn.commit()&lt;/code&gt; only on success&lt;/strong&gt; — bad runs roll back; the warehouse is never left half-loaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — &lt;code&gt;O(rows in today's CSV)&lt;/code&gt;; the historical &lt;code&gt;fact_orders&lt;/code&gt; is only scanned for the duplicate check.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the structured ETL learning path see &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design for Data Engineering Interviews&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling course&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practice:&lt;/strong&gt; &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice problems →&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  4. Steps 6-9 — Apache Spark, Airflow, Cloud, and Data Modeling
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From single-machine SQL and Pandas to production-scale pipelines
&lt;/h3&gt;

&lt;p&gt;After SQL, Python, databases, and ETL fundamentals are solid, four scaling skills turn you from a script-writer into a production data engineer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 6 — Apache Spark.&lt;/strong&gt; The industry standard for large-scale processing; PySpark is its Python API. Learn DataFrames, transformations, actions, Spark SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 7 — Workflow orchestration.&lt;/strong&gt; Apache Airflow runs your pipelines on a schedule. Learn DAGs (directed acyclic graphs), tasks, operators, dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 8 — Cloud platforms.&lt;/strong&gt; Modern data engineering lives on AWS, Azure, or GCP. Pick &lt;strong&gt;AWS first&lt;/strong&gt; — it's the platform interviews ask about most. Learn S3, EC2, Lambda, Glue, Redshift, IAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 9 — Data modeling.&lt;/strong&gt; OLTP vs OLAP, normalisation vs denormalisation, slowly changing dimensions (SCDs), fact-vs-dim design. Read Kimball's &lt;strong&gt;The Data Warehouse Toolkit&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; these are scale-and-production skills. Don't open them until SQL and Python are second nature. The most common fresher failure mode is &lt;em&gt;"I learned Spark but I can't write a &lt;code&gt;LEFT JOIN&lt;/code&gt; correctly under pressure."&lt;/em&gt; Master the foundations first.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Apache Spark + PySpark — the big-data engine
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt; processes data that doesn't fit on a single machine by splitting work across a cluster. &lt;strong&gt;PySpark&lt;/strong&gt; is its Python API — almost everything you do in Pandas has a PySpark equivalent, just distributed. The simplest entry point is &lt;code&gt;SparkSession.builder.getOrCreate()&lt;/code&gt; followed by &lt;code&gt;spark.read.csv(...)&lt;/code&gt; to load data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;SparkSession&lt;/code&gt;&lt;/strong&gt; — the entry point; creates the cluster connection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DataFrame&lt;/strong&gt; — the main abstraction; like a Pandas DataFrame but distributed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transformations&lt;/strong&gt; — &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt; — lazy, build a plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actions&lt;/strong&gt; — &lt;code&gt;show&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, &lt;code&gt;write&lt;/code&gt; — trigger actual execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Spark SQL&lt;/strong&gt; — register a DataFrame as a table and run SQL against it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A &lt;code&gt;sales.csv&lt;/code&gt; file similar to the Pandas example, but big enough that we want Spark to process it on a cluster.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;region&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;South&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;North&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a minimal PySpark script that reads &lt;code&gt;sales.csv&lt;/code&gt; and shows the first few rows. Show the canonical &lt;code&gt;SparkSession&lt;/code&gt; setup that every PySpark script begins with.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inferSchema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;SparkSession.builder.appName("demo").getOrCreate()&lt;/code&gt; either creates a new Spark session or attaches to an existing one — either way, you end up with a &lt;code&gt;spark&lt;/code&gt; object that knows how to talk to the cluster. &lt;code&gt;spark.read.csv("sales.csv", header=True, inferSchema=True)&lt;/code&gt; loads the file as a DataFrame, treating the first row as headers and inferring column types. &lt;code&gt;df.show()&lt;/code&gt; is an &lt;em&gt;action&lt;/em&gt; that triggers execution and prints the first 20 rows to stdout.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;call&lt;/th&gt;
&lt;th&gt;kind&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SparkSession.builder.appName(...).getOrCreate()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;setup&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;spark&lt;/code&gt; session attached to a (local or cluster) executor&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;spark.read.csv(..., header=True, inferSchema=True)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;transformation (lazy)&lt;/td&gt;
&lt;td&gt;DataFrame plan registered — no rows scanned yet&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;df.show()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;action&lt;/td&gt;
&lt;td&gt;plan executes: read CSV → infer types → render first 20 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;stdout&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;grid-formatted table printed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------+------+------+
|order_id|region|amount|
+--------+------+------+
|       1| North|   100|
|       2| South|   200|
|       3| North|   150|
+--------+------+------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; in PySpark, transformations (&lt;code&gt;filter&lt;/code&gt;, &lt;code&gt;select&lt;/code&gt;, &lt;code&gt;groupBy&lt;/code&gt;) are &lt;em&gt;lazy&lt;/em&gt; — nothing runs until you call an action like &lt;code&gt;.show()&lt;/code&gt;, &lt;code&gt;.count()&lt;/code&gt;, or a &lt;code&gt;.write.save(...)&lt;/code&gt;. That's why the same PySpark code can be reused for 1 GB and 1 TB datasets.&lt;/p&gt;
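
&lt;p&gt;A short sketch of that laziness, extending the script above — the transformations only record a plan, and the &lt;code&gt;show()&lt;/code&gt; action runs it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# transformations: nothing executes yet, Spark just builds the plan
totals = (
    df.filter(F.col("amount") &gt; 0)
      .groupBy("region")
      .agg(F.sum("amount").alias("total"))
)

totals.show()   # action: the whole plan runs now
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;On the sample &lt;code&gt;sales.csv&lt;/code&gt; this prints a total of 250 for North and 200 for South.&lt;/p&gt;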

&lt;h4&gt;
  
  
  Apache Airflow — DAGs, tasks, scheduling
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Airflow&lt;/strong&gt; is the workflow orchestrator most data teams use. You write a &lt;strong&gt;DAG&lt;/strong&gt; (directed acyclic graph) of &lt;strong&gt;tasks&lt;/strong&gt;; Airflow runs them on a schedule, respects dependencies, retries failures, and surfaces alerts. The minimum viable DAG is two tasks chained with &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAG&lt;/strong&gt; — the workflow; a Python file in the &lt;code&gt;dags/&lt;/code&gt; directory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task&lt;/strong&gt; — a single unit of work (run a SQL query, call an API, run a PySpark job).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operator&lt;/strong&gt; — a reusable task type (&lt;code&gt;BashOperator&lt;/code&gt;, &lt;code&gt;PythonOperator&lt;/code&gt;, &lt;code&gt;SQLExecuteQueryOperator&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dependencies&lt;/strong&gt; — &lt;code&gt;task1 &amp;gt;&amp;gt; task2&lt;/code&gt; means "run task1 then task2."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule&lt;/strong&gt; — &lt;code&gt;schedule_interval='@daily'&lt;/code&gt;, &lt;code&gt;'0 3 * * *'&lt;/code&gt; (cron), or &lt;code&gt;None&lt;/code&gt; for manual.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A simple two-stage daily ETL — extract data from an API, load it into a warehouse table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;task1: extract — call the API, write raw JSON to S3
task2: load — read S3 JSON, INSERT into fact_events
schedule: daily at 03:00 UTC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write a minimal Airflow DAG that defines two &lt;code&gt;PythonOperator&lt;/code&gt; tasks &lt;code&gt;extract_task&lt;/code&gt; and &lt;code&gt;load_task&lt;/code&gt;, and chains them so &lt;code&gt;load_task&lt;/code&gt; only runs after &lt;code&gt;extract_task&lt;/code&gt; succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# call the API, write raw JSON to S3
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# read S3 JSON, INSERT into fact_events
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;etl_pipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@daily&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;load_task&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;load&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;extract_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;load_task&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; The &lt;code&gt;with DAG(...) as dag:&lt;/code&gt; block defines the DAG metadata — its name, when it starts, how often it runs (&lt;code&gt;@daily&lt;/code&gt; is shorthand for "every day at midnight"), and whether to backfill missed runs (&lt;code&gt;catchup=False&lt;/code&gt; means "no, just run from now on"). Two &lt;code&gt;PythonOperator&lt;/code&gt; tasks wrap the actual Python functions. &lt;code&gt;extract_task &amp;gt;&amp;gt; load_task&lt;/code&gt; declares the dependency — Airflow will only run &lt;code&gt;load_task&lt;/code&gt; if &lt;code&gt;extract_task&lt;/code&gt; succeeds.&lt;/p&gt;
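
&lt;p&gt;The intro said Airflow "retries failures", but the minimal DAG never configures that. A hedged sketch of the same DAG header with retry settings (the values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from datetime import datetime, timedelta
# (DAG and PythonOperator imports as in the example above)

default_args = {
    "retries": 2,                         # rerun a failed task up to twice
    "retry_delay": timedelta(minutes=5),  # wait 5 minutes between attempts
}

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2026, 5, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,            # applied to every task in the DAG
) as dag:
    ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;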

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;when&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;DAG(...)&lt;/code&gt; instantiates&lt;/td&gt;
&lt;td&gt;DAG &lt;code&gt;etl_pipeline&lt;/code&gt; registered in Airflow metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;PythonOperator(...)&lt;/code&gt; ×2&lt;/td&gt;
&lt;td&gt;two tasks attached to the DAG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;parse time&lt;/td&gt;
&lt;td&gt;&lt;code&gt;extract_task &amp;gt;&amp;gt; load_task&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;dependency edge added (extract → load)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;every day at midnight&lt;/td&gt;
&lt;td&gt;scheduler triggers a DAG run&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;extract&lt;/code&gt; task starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;extract succeeds&lt;/td&gt;
&lt;td&gt;scheduler sees green upstream&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;load&lt;/code&gt; task starts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;load succeeds&lt;/td&gt;
&lt;td&gt;DAG run marked success&lt;/td&gt;
&lt;td&gt;green tick in calendar grid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6′&lt;/td&gt;
&lt;td&gt;extract fails&lt;/td&gt;
&lt;td&gt;downstream skipped&lt;/td&gt;
&lt;td&gt;red tick; alert fires&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In the Airflow UI, this DAG appears as two boxes connected by an arrow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[extract] → [load]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each daily run produces a tick in the calendar grid; failures are red, successes are green.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; one DAG = one logical workflow. If you find yourself writing 50 tasks in a single DAG, you probably want 5 DAGs of 10 tasks each — easier to debug, easier to retry.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud platforms — AWS first, then expand
&lt;/h4&gt;

&lt;p&gt;Modern data engineering is cloud-based. &lt;strong&gt;Pick one platform first&lt;/strong&gt; and learn its data services before branching out — most teams use AWS, so it's the highest-leverage starting point. Azure and GCP are equally valid second choices once you have one cloud under your belt.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;S3&lt;/strong&gt; — object storage; where raw data lands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EC2&lt;/strong&gt; — virtual machines; rarely touched directly anymore.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lambda&lt;/strong&gt; — serverless functions; great for small ETL triggers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Glue&lt;/strong&gt; — managed ETL service; runs Spark jobs without you managing the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Redshift&lt;/strong&gt; — AWS data warehouse; SQL-compatible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM&lt;/strong&gt; — identity and access; &lt;em&gt;non-optional&lt;/em&gt; — every cloud bug eventually traces back to permissions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You have a daily CSV that lands in S3 at &lt;code&gt;s3://my-bucket/orders/{date}/orders.csv&lt;/code&gt; and a Redshift table &lt;code&gt;fact_orders&lt;/code&gt; to load it into.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the AWS CLI / SQL pseudocode that copies the CSV from S3 into Redshift on a schedule. (Don't worry about IAM details; the goal is the &lt;em&gt;shape&lt;/em&gt;.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Inside Redshift, run on a schedule from Airflow / cron&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;fact_orders&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'s3://my-bucket/orders/2026-05-08/orders.csv'&lt;/span&gt;
&lt;span class="n"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="s1"&gt;'arn:aws:iam::ACCOUNT:role/RedshiftS3ReadRole'&lt;/span&gt;
&lt;span class="n"&gt;CSV&lt;/span&gt;
&lt;span class="n"&gt;IGNOREHEADER&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;COPY ... FROM 's3://...'&lt;/code&gt; is the Redshift-specific bulk-load command — it pulls a file directly from S3 into a table without needing an intermediate machine. &lt;code&gt;IAM_ROLE&lt;/code&gt; references an AWS IAM role that grants Redshift permission to read that S3 bucket — without this, the copy fails with a permission error. &lt;code&gt;CSV&lt;/code&gt; tells Redshift the file format; &lt;code&gt;IGNOREHEADER 1&lt;/code&gt; skips the column-header row.&lt;/p&gt;
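
&lt;p&gt;For the CLI half of the question, the orchestrator-side step can be as small as a pre-flight existence check before submitting the &lt;code&gt;COPY&lt;/code&gt; (pseudocode; bucket and date as in the example):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# pre-flight: fail fast if today's file has not landed yet
aws s3 ls s3://my-bucket/orders/2026-05-08/orders.csv || exit 1
# then submit the COPY above via psql or the Redshift Data API
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;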

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;actor&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Airflow / cron&lt;/td&gt;
&lt;td&gt;submits &lt;code&gt;COPY&lt;/code&gt; to Redshift&lt;/td&gt;
&lt;td&gt;command queued&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Redshift leader&lt;/td&gt;
&lt;td&gt;assumes &lt;code&gt;RedshiftS3ReadRole&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;temporary AWS credentials obtained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Redshift compute nodes&lt;/td&gt;
&lt;td&gt;parallel-fetch the S3 object&lt;/td&gt;
&lt;td&gt;bytes streamed direct to slices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;parser&lt;/td&gt;
&lt;td&gt;apply &lt;code&gt;CSV, IGNOREHEADER 1&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;header row skipped; data rows parsed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;loader&lt;/td&gt;
&lt;td&gt;bulk-insert into &lt;code&gt;fact_orders&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;rows committed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;system catalogue&lt;/td&gt;
&lt;td&gt;log to &lt;code&gt;STL_LOAD_COMMITS&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;row count + reject count recorded&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CSV's rows land in &lt;code&gt;fact_orders&lt;/code&gt;. A small status row is logged in &lt;code&gt;STL_LOAD_COMMITS&lt;/code&gt; showing how many rows were copied and whether any were rejected.&lt;/p&gt;
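
&lt;p&gt;You can verify the load without leaving SQL by querying the catalogue directly. A minimal check (the &lt;code&gt;LIKE&lt;/code&gt; filter is just an illustrative way to find this file's loads):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Recent committed loads; column names per the Redshift system tables.
SELECT query, TRIM(filename) AS filename, lines_scanned, curtime
FROM stl_load_commits
WHERE filename LIKE '%orders%'
ORDER BY curtime DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;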

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; for AWS, S3 + IAM are the two services you actually need to be fluent in. Everything else (Lambda, Glue, Redshift) layers on top.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Learning Spark before your SQL is solid — Spark is just bigger SQL with more failure modes.&lt;/li&gt;
&lt;li&gt;Writing 1,000-line Airflow DAGs — split into smaller DAGs that each do one thing.&lt;/li&gt;
&lt;li&gt;Storing AWS credentials in code — always use IAM roles or environment variables, never hardcode.&lt;/li&gt;
&lt;li&gt;Ignoring data modeling because it "feels theoretical" — interviewers test it heavily.&lt;/li&gt;
&lt;li&gt;Trying to learn all three clouds at once — pick AWS first; the others are easier once you know one.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Designing a Slowly Changing Dimension (Type 2) for Customer Addresses
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; Customer &lt;code&gt;C1&lt;/code&gt; lives at "12 Old St" until 2026-03-15, then moves to "88 New Ave". The fact tables need to know which address was current at the time of each historical sale. After the change is applied, the dimension should look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;12 Old St&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;88 New Ave&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;(NULL)&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Write the SQL to handle the address change as an &lt;strong&gt;SCD Type 2&lt;/strong&gt; update — close the old row by setting &lt;code&gt;valid_to&lt;/code&gt; and &lt;code&gt;is_current = FALSE&lt;/code&gt;, then insert a new row with the new address and &lt;code&gt;is_current = TRUE&lt;/code&gt;. This pattern preserves historical correctness without losing the past.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using &lt;code&gt;UPDATE&lt;/code&gt; to close the old row + &lt;code&gt;INSERT&lt;/code&gt; for the new one
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Step 1: close the existing current row&lt;/span&gt;
&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-14'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'C1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Step 2: insert the new current row&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;dim_customer&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;valid_to&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'C1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'88 New Ave'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-03-15'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; SCD Type 2 keeps full history by &lt;em&gt;adding&lt;/em&gt; new rows rather than overwriting old ones. The &lt;code&gt;UPDATE&lt;/code&gt; finds the row where &lt;code&gt;is_current = TRUE&lt;/code&gt; for the customer and closes it — sets &lt;code&gt;valid_to&lt;/code&gt; to the day before the change and &lt;code&gt;is_current&lt;/code&gt; to &lt;code&gt;FALSE&lt;/code&gt;. The &lt;code&gt;INSERT&lt;/code&gt; then adds the new row with &lt;code&gt;valid_from&lt;/code&gt; set to the change date, &lt;code&gt;valid_to&lt;/code&gt; left &lt;code&gt;NULL&lt;/code&gt; (still current), and &lt;code&gt;is_current = TRUE&lt;/code&gt;. Historical fact tables can join to this dim with a date predicate to find the address that was current at the time of each sale.&lt;/p&gt;
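
&lt;p&gt;One production refinement: wrap the pair in a single transaction so a concurrent reader never observes a customer with zero current rows. A minimal sketch in PostgreSQL syntax:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;BEGIN;

-- Close the existing current row
UPDATE dim_customer
SET valid_to = DATE '2026-03-14', is_current = FALSE
WHERE customer_id = 'C1' AND is_current = TRUE;

-- Insert the new current row
INSERT INTO dim_customer (customer_id, address, valid_from, valid_to, is_current)
VALUES ('C1', '88 New Ave', DATE '2026-03-15', NULL, TRUE);

COMMIT;  -- readers see the old state or the new state, never the gap between
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;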

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After the two statements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;address&lt;/th&gt;
&lt;th&gt;valid_from&lt;/th&gt;
&lt;th&gt;valid_to&lt;/th&gt;
&lt;th&gt;is_current&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;12 Old St&lt;/td&gt;
&lt;td&gt;2025-01-01&lt;/td&gt;
&lt;td&gt;2026-03-14&lt;/td&gt;
&lt;td&gt;FALSE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;88 New Ave&lt;/td&gt;
&lt;td&gt;2026-03-15&lt;/td&gt;
&lt;td&gt;(NULL)&lt;/td&gt;
&lt;td&gt;TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A query like &lt;code&gt;WHERE is_current = TRUE&lt;/code&gt; returns only the current address. A historical join uses &lt;code&gt;WHERE sale_date BETWEEN valid_from AND COALESCE(valid_to, DATE '9999-12-31')&lt;/code&gt; to pick the right address per sale.&lt;/p&gt;
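
&lt;p&gt;Spelled out, that historical join looks like the sketch below, assuming a hypothetical &lt;code&gt;fact_sales&lt;/code&gt; table with &lt;code&gt;customer_id&lt;/code&gt; and &lt;code&gt;sale_date&lt;/code&gt; columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- fact_sales and its columns are illustrative; the BETWEEN predicate is the pattern.
SELECT
    f.sale_id,
    f.sale_date,
    d.address AS address_at_sale
FROM fact_sales f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND f.sale_date BETWEEN d.valid_from
                     AND COALESCE(d.valid_to, DATE '9999-12-31');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;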

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; for the change on 2026-03-15:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;UPDATE&lt;/code&gt; closes row 1&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;valid_to = 2026-03-14&lt;/code&gt;, &lt;code&gt;is_current = FALSE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;INSERT&lt;/code&gt; adds row 2&lt;/td&gt;
&lt;td&gt;new row with &lt;code&gt;valid_from = 2026-03-15&lt;/code&gt;, &lt;code&gt;is_current = TRUE&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;dimension now has 2 rows for C1&lt;/td&gt;
&lt;td&gt;one historical, one current&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SCD Type 2 keeps history&lt;/strong&gt; — old rows are not overwritten; both versions of the customer's address coexist with date ranges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;valid_from&lt;/code&gt; / &lt;code&gt;valid_to&lt;/code&gt; define the row's lifetime&lt;/strong&gt; — the date range during which this row was the truth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;is_current = TRUE&lt;/code&gt; flag&lt;/strong&gt; — shortcut for dashboards that always want the latest; saves an &lt;code&gt;ORDER BY ... LIMIT 1&lt;/code&gt; lookup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;historical joins use &lt;code&gt;BETWEEN&lt;/code&gt;&lt;/strong&gt; — pick the dim row whose date range contains the fact row's date.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;COALESCE(valid_to, '9999-12-31')&lt;/code&gt;&lt;/strong&gt; — handles the open-ended current row whose &lt;code&gt;valid_to&lt;/code&gt; is &lt;code&gt;NULL&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — two row-level operations; constant time per dimension change.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for the deeper modeling syllabus see &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling for Data Engineering Interviews&lt;/a&gt;; when you do start Spark, the gentle entry point is &lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;PySpark Fundamentals&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — PySpark&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;PySpark Fundamentals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/pyspark-fundamentals" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — Spark internals&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Apache Spark Internals&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/apache-spark-internals-for-data-engineering-interviews" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Topic — slowly changing data&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SCD practice problems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice/topic/slowly-changing-data" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Steps 10-13 — Streaming, Portfolio Projects, Git, and Interview Prep
&lt;/h2&gt;

&lt;h3&gt;
  
  
  From skills to a job offer — proving the work and clearing the loop
&lt;/h3&gt;

&lt;p&gt;The last four steps turn your skills into a job offer. Streaming systems handle real-time data, portfolio projects prove you can ship, Git makes your code visible, and interview prep closes the deal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 10 — Streaming systems.&lt;/strong&gt; Kafka, event-driven architectures, message queues, real-time processing. Required for advanced roles; optional for first jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 11 — Build five portfolio projects.&lt;/strong&gt; SQL analytics, Python ETL, Airflow pipeline, PySpark large-data, cloud deployment. Put all on GitHub.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 12 — Master Git.&lt;/strong&gt; &lt;code&gt;clone&lt;/code&gt;, &lt;code&gt;add&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;, &lt;code&gt;push&lt;/code&gt;, &lt;code&gt;branch&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt; — every company uses Git from day one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 13 — Interview prep.&lt;/strong&gt; SQL questions (joins, windows, aggregations, ranking), Python questions (dicts, strings, lists, hashmaps), system-design basics (ETL architecture, lake vs warehouse, batch vs streaming, scalability).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79sm4v2y8s8t1qolijvi.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79sm4v2y8s8t1qolijvi.jpeg" alt="Proof-by-phase checklist mapping each data engineering roadmap phase to a GitHub repo and a resume bullet a fresher can show recruiters." width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; projects beat certificates. A GitHub repo with a clean README and a runnable pipeline outperforms a stack of certifications. Your top-of-funnel signal to recruiters is &lt;em&gt;"here's the URL to my orders-batch-etl project"&lt;/em&gt; — not your transcript.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Streaming systems — Kafka in plain English
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Kafka&lt;/strong&gt; is a distributed, append-only event log (often summarized as a message queue) that lets producers publish events to a "topic" and consumers read them in order. Event-driven architectures use Kafka as the spine — payment events flow in, multiple downstream consumers (fraud detection, analytics, notifications) read the same stream independently. &lt;strong&gt;Required for advanced / senior DE roles; optional for fresher first jobs.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Producer&lt;/strong&gt; — writes events to a Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Topic&lt;/strong&gt; — a named append-only log; events stay in order.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consumer&lt;/strong&gt; — reads events from a topic; multiple consumers per topic are fine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition&lt;/strong&gt; — topics are split into partitions for parallelism.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use case&lt;/strong&gt; — live payment events flowing into a fraud-detection model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; A payment event payload that a producer wants to publish to the &lt;code&gt;payments&lt;/code&gt; topic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PAY-1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;250.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08T10:15:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the producer-side Python code that publishes this event to a Kafka topic called &lt;code&gt;payments&lt;/code&gt;. Use &lt;code&gt;kafka-python&lt;/code&gt; (a widely used client). Include just the producer setup + send call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;KafkaProducer&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;bootstrap_servers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;value_serializer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PAY-1001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;250.00&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;currency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;USD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;U42&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-05-08T10:15:00Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;KafkaProducer(bootstrap_servers=[...])&lt;/code&gt; connects to one or more Kafka brokers. The &lt;code&gt;value_serializer&lt;/code&gt; lambda turns the Python dict into JSON bytes (Kafka stores raw bytes, not Python objects). &lt;code&gt;producer.send("payments", event)&lt;/code&gt; queues the event for delivery to the &lt;code&gt;payments&lt;/code&gt; topic; &lt;code&gt;producer.flush()&lt;/code&gt; blocks until the queued messages are actually sent. Downstream consumers (fraud detection, analytics) can read this event independently and in order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;action&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;KafkaProducer(bootstrap_servers=[...])&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;TCP connection to broker established; metadata fetched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;value_serializer = json.dumps(...).encode("utf-8")&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;every send will convert dict → JSON bytes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;producer.send("payments", event)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;record buffered in the producer's in-memory queue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;the producer's partitioner picks a partition (hashed key or round-robin)&lt;/td&gt;
&lt;td&gt;record assigned to a partition log&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;producer.flush()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;blocks until all buffered records are acknowledged&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;subscribed consumers call &lt;code&gt;poll()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;the new event is returned in partition order&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The event is appended to the &lt;code&gt;payments&lt;/code&gt; topic. Any consumer subscribed to &lt;code&gt;payments&lt;/code&gt; will receive it on its next &lt;code&gt;poll()&lt;/code&gt; call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"payment_id": "PAY-1001", "amount": 250.0, "currency": "USD", "user_id": "U42", "ts": "2026-05-08T10:15:00Z"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
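
&lt;p&gt;The consumer side is symmetric. A minimal &lt;code&gt;kafka-python&lt;/code&gt; sketch, where the group id is a hypothetical name and &lt;code&gt;auto_offset_reset&lt;/code&gt; simply makes a fresh consumer start from the beginning of the topic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json
from kafka import KafkaConsumer

# Sketch only: "fraud-detection" is a hypothetical consumer-group name.
consumer = KafkaConsumer(
    "payments",
    bootstrap_servers=["localhost:9092"],
    group_id="fraud-detection",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:      # blocks, yielding events in partition order
    event = message.value     # the dict the producer serialized
    print(event["payment_id"], event["amount"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;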



&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; fresher first jobs rarely need Kafka. Master batch (Step 5) before opening Step 10. Mention Kafka in interviews only if you've actually shipped a project that uses it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Five portfolio projects — what to build, in order
&lt;/h4&gt;

&lt;p&gt;Projects matter more than certificates. Build all five and put them on GitHub with a clean &lt;code&gt;README.md&lt;/code&gt; for each. The five build on each other — by the end you have a production-grade portfolio.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Project 1 — SQL analytics.&lt;/strong&gt; E-commerce sales dashboard built entirely in SQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 2 — Python ETL.&lt;/strong&gt; Extract API data → clean → store in PostgreSQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 3 — Airflow pipeline.&lt;/strong&gt; Schedule the Python ETL as a daily DAG.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 4 — PySpark large-data pipeline.&lt;/strong&gt; Process millions of rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project 5 — Cloud project.&lt;/strong&gt; Deploy the ETL pipeline on AWS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; Project 1 — an e-commerce dataset (&lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;customers&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;) for which you'll write the SQL behind a sales dashboard.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;orders&lt;/code&gt; (sample):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;order_id&lt;/th&gt;
&lt;th&gt;customer_id&lt;/th&gt;
&lt;th&gt;product_id&lt;/th&gt;
&lt;th&gt;order_date&lt;/th&gt;
&lt;th&gt;amount&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;C2&lt;/td&gt;
&lt;td&gt;P2&lt;/td&gt;
&lt;td&gt;2026-04-15&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;C1&lt;/td&gt;
&lt;td&gt;P1&lt;/td&gt;
&lt;td&gt;2026-05-01&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; For Project 1, write the canonical "monthly revenue per product" SQL. This is the single query the entire dashboard hangs off — get this right and the rest of the dashboard is just &lt;code&gt;WHERE&lt;/code&gt; and &lt;code&gt;ORDER BY&lt;/code&gt; variations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2026-01-01'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;DATE_TRUNC('month', o.order_date)&lt;/code&gt; collapses every order date to the first day of its month, so all April orders aggregate together. The &lt;code&gt;JOIN&lt;/code&gt; brings in &lt;code&gt;product_name&lt;/code&gt; from &lt;code&gt;products&lt;/code&gt; so the dashboard can label rows. &lt;code&gt;GROUP BY&lt;/code&gt; collapses to one row per (product, month). &lt;code&gt;ORDER BY&lt;/code&gt; produces a chronologically readable result. Wrap this in a saved view or a dbt model and the dashboard renders automatically.&lt;/p&gt;
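
&lt;p&gt;Wrapping it as a view costs one line of ceremony. A sketch (the view name is illustrative; the date filter moves downstream so every dashboard variation stays a plain &lt;code&gt;SELECT&lt;/code&gt;):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Hypothetical view name; dashboard variations become simple filters.
CREATE OR REPLACE VIEW monthly_product_revenue AS
SELECT
    p.product_name,
    DATE_TRUNC('month', o.order_date) AS month,
    SUM(o.amount) AS revenue
FROM orders o
JOIN products p ON p.product_id = o.product_id
GROUP BY p.product_name, DATE_TRUNC('month', o.order_date);

SELECT * FROM monthly_product_revenue WHERE month &amp;gt;= DATE '2026-04-01';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;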

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;clause&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FROM orders o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;scan all order rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;JOIN products p ON p.product_id = o.product_id&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;each order picks up its &lt;code&gt;product_name&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;WHERE o.order_date &amp;gt;= DATE '2026-01-01'&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;drop pre-2026 rows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DATE_TRUNC('month', o.order_date)&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;every date snapped to month-start (e.g. 2026-04-15 → 2026-04-01)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROUP BY product_name, month&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;bucket by (product, month)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;SUM(o.amount)&lt;/code&gt; per bucket&lt;/td&gt;
&lt;td&gt;revenue total per group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ORDER BY month, product_name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;chronological, then alphabetical&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;product_name&lt;/th&gt;
&lt;th&gt;month&lt;/th&gt;
&lt;th&gt;revenue&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Headphones&lt;/td&gt;
&lt;td&gt;2026-04-01&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Book&lt;/td&gt;
&lt;td&gt;2026-05-01&lt;/td&gt;
&lt;td&gt;150&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;(With only three sample orders, every (product, month) bucket happens to contain a single order; at realistic volume each output row would aggregate many orders.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; every Project 1 SQL should be runnable on a free PostgreSQL sandbox with a 100-row sample dataset. Put both the SQL and the sample data in your GitHub repo so a recruiter can clone and run it in 60 seconds.&lt;/p&gt;
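
&lt;p&gt;A minimal &lt;code&gt;schema.sql&lt;/code&gt; plus seed data makes that 60-second clone-and-run concrete. A sketch in PostgreSQL syntax, with product names matching the sample output above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Minimal sketch of schema.sql + seed rows so the query above runs as-is.
CREATE TABLE products (
    product_id   TEXT PRIMARY KEY,
    product_name TEXT NOT NULL
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id TEXT NOT NULL,
    product_id  TEXT NOT NULL REFERENCES products (product_id),
    order_date  DATE NOT NULL,
    amount      NUMERIC(10, 2) NOT NULL
);

INSERT INTO products VALUES ('P1', 'Book'), ('P2', 'Headphones');
INSERT INTO orders VALUES
    (1, 'C1', 'P1', DATE '2026-04-01', 100),
    (2, 'C2', 'P2', DATE '2026-04-15', 200),
    (3, 'C1', 'P1', DATE '2026-05-01', 150);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;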

&lt;h4&gt;
  
  
  Git, GitHub, and the resume bullet
&lt;/h4&gt;

&lt;p&gt;Git is &lt;strong&gt;non-optional infrastructure&lt;/strong&gt;. Every team's workflow assumes you can clone a repo, branch off, commit, and push. The bare minimum command set fits on a single screen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git clone &amp;lt;url&amp;gt;&lt;/code&gt;&lt;/strong&gt; — copy a remote repo locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git checkout -b feature/x&lt;/code&gt;&lt;/strong&gt; — create + switch to a new branch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git add file&lt;/code&gt;&lt;/strong&gt; — stage a change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git commit -m "..."&lt;/code&gt;&lt;/strong&gt; — record the staged changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git push origin &amp;lt;branch&amp;gt;&lt;/code&gt;&lt;/strong&gt; — push the branch to GitHub; open a pull request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;git merge&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;git rebase&lt;/code&gt;&lt;/strong&gt; — combine branches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You've finished Project 1 (the SQL analytics dashboard) on your laptop and want to push it to GitHub under your account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Show the canonical six-command workflow: clone an empty template repo, branch off, add the files you've written, commit with a descriptive message, push the branch, and open a pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/&amp;lt;you&amp;gt;/sql-sales-dashboard.git
&lt;span class="nb"&gt;cd &lt;/span&gt;sql-sales-dashboard
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/initial-dashboard
&lt;span class="c"&gt;# (write README.md, schema.sql, queries.sql, sample-data/)&lt;/span&gt;
git add README.md schema.sql queries.sql sample-data/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add Project 1: SQL sales dashboard with sample data"&lt;/span&gt;
git push origin feature/initial-dashboard
&lt;span class="c"&gt;# open a pull request on github.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; &lt;code&gt;clone&lt;/code&gt; brings the empty repo to your laptop. &lt;code&gt;checkout -b&lt;/code&gt; creates a feature branch (never push to &lt;code&gt;main&lt;/code&gt; directly — even on your own repos, build the habit). After writing the project files, &lt;code&gt;add&lt;/code&gt; stages them, &lt;code&gt;commit&lt;/code&gt; records the change with a one-line message that future-you can scan, and &lt;code&gt;push&lt;/code&gt; sends the branch to GitHub. The pull request is the artifact a recruiter or interviewer will actually look at.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;command&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git clone &amp;lt;url&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;empty repo copied to laptop&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;cd sql-sales-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;move into the working tree&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git checkout -b feature/initial-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;new branch created and checked out&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;write &lt;code&gt;README.md&lt;/code&gt;, &lt;code&gt;schema.sql&lt;/code&gt;, &lt;code&gt;queries.sql&lt;/code&gt;, &lt;code&gt;sample-data/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;working tree now has 4 untracked items&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git add ...&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;files staged for commit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git commit -m "..."&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;snapshot recorded with descriptive message&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;git push origin feature/initial-dashboard&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;branch published to GitHub&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;open PR on github.com&lt;/td&gt;
&lt;td&gt;reviewable artifact link a recruiter can click&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A GitHub repo URL with a feature branch and a pull request — both visible to anyone you share the link with. The README renders directly on the repo home page, becoming your portfolio artifact.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Rule of thumb:&lt;/em&gt; if you can't clone, branch, commit, and push within 60 seconds without looking commands up, Git is still on your to-do list. Practice it daily until it's muscle memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common beginner mistakes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Trying to learn Kafka before mastering batch ETL — Kafka adds complexity without removing any.&lt;/li&gt;
&lt;li&gt;Building one giant project instead of five small ones — recruiters skim; five clear repos beat one tangled one.&lt;/li&gt;
&lt;li&gt;Pushing to &lt;code&gt;main&lt;/code&gt; directly — every commit becomes part of history with no review trail.&lt;/li&gt;
&lt;li&gt;No &lt;code&gt;README.md&lt;/code&gt; per project — repos without READMEs are invisible.&lt;/li&gt;
&lt;li&gt;Skipping interview prep — solid skills + zero practice = solid skills wasted at the screen.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Worked Problem on Picking Project 2 and Writing the Resume Bullet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Example input.&lt;/strong&gt; You've shipped Project 1 (SQL dashboard). Project 2 is the Python ETL — extract from an API, clean, store in PostgreSQL. The repo will be &lt;code&gt;python-api-etl&lt;/code&gt;. The recruiter call is in two weeks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question.&lt;/strong&gt; Sketch the four-file layout for the Project 2 repo plus the one-line resume bullet you'll lead with on the recruiter call. The goal: a stranger should be able to read the repo, run it locally, and understand the work in 5 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Solution Using a four-file repo layout + a metric-led resume bullet
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Code solution.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python-api-etl/
├── README.md           # 60-second pitch + how to run
├── etl.py              # extract / transform / load functions
├── tests/
│   └── test_etl.py     # one test per function
└── requirements.txt    # pinned dependencies
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resume bullet (lead with the metric):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Built a Python ETL pipeline that ingests 10K daily API records into a PostgreSQL warehouse with row-level validation and CI-friendly exit codes.&lt;/strong&gt; &lt;em&gt;(github.com/&amp;lt;you&amp;gt;/python-api-etl)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Explanation of code.&lt;/strong&gt; Four files is the floor — one for documentation, one for code, one for tests, one for dependencies. The &lt;code&gt;README&lt;/code&gt; is what a recruiter sees first; lead with the &lt;em&gt;what&lt;/em&gt; and &lt;em&gt;how to run&lt;/em&gt;, then explain the &lt;em&gt;why&lt;/em&gt;. The resume bullet leads with a quantitative metric (&lt;code&gt;10K daily records&lt;/code&gt;) and ends with the GitHub URL — recruiters scan for both, in that order.&lt;/p&gt;
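
&lt;p&gt;For &lt;code&gt;etl.py&lt;/code&gt; itself, three named functions plus a &lt;code&gt;__main__&lt;/code&gt; guard are all a reviewer needs. A minimal sketch; the API URL, connection string, and table name are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;"""Minimal sketch of etl.py. URL, DSN, and table name are placeholders."""
import requests
import psycopg2

API_URL = "https://api.example.com/records"  # placeholder endpoint
PG_DSN = "dbname=warehouse user=etl"         # placeholder connection string


def extract():
    """Pull the day's records from the API as a list of dicts."""
    resp = requests.get(API_URL, timeout=30)
    resp.raise_for_status()
    return resp.json()


def transform(records):
    """Row-level validation: drop rows missing an id, coerce amount to float."""
    return [
        (r["id"], float(r["amount"]))
        for r in records
        if r.get("id") is not None
    ]


def load(rows):
    """Bulk-insert cleaned rows into PostgreSQL."""
    with psycopg2.connect(PG_DSN) as conn, conn.cursor() as cur:
        cur.executemany(
            "INSERT INTO api_records (id, amount) VALUES (%s, %s)", rows
        )


if __name__ == "__main__":
    # Any uncaught exception exits non-zero: the CI-friendly exit code
    # the resume bullet mentions.
    load(transform(extract()))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;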

&lt;p&gt;&lt;strong&gt;Output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Your GitHub now has a runnable, documented ETL repo. The recruiter receives the link, clicks through, sees the README, and forwards your resume to the hiring manager. The bullet on the resume becomes the first sentence of the recruiter's pitch to the hiring manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-step trace&lt;/strong&gt; of how a recruiter reads it:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;step&lt;/th&gt;
&lt;th&gt;recruiter action&lt;/th&gt;
&lt;th&gt;what they see&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;clicks the GitHub link in the resume&lt;/td&gt;
&lt;td&gt;repo home page with the README rendered&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;scans the first paragraph of README&lt;/td&gt;
&lt;td&gt;"Daily API → PostgreSQL with validation"&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;scrolls to "How to run"&lt;/td&gt;
&lt;td&gt;three commands (&lt;code&gt;git clone&lt;/code&gt;, &lt;code&gt;pip install&lt;/code&gt;, &lt;code&gt;python etl.py&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;clicks &lt;code&gt;etl.py&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;sees three named functions; reads in 30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;clicks &lt;code&gt;tests/&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;tests exist; quality signal confirmed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this works&lt;/strong&gt; — concept by concept:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;one repo per project&lt;/strong&gt; — recruiters skim; five clean repos beat one tangled monorepo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;README-first design&lt;/strong&gt; — the home page is the pitch; lead with what + how to run.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tests in &lt;code&gt;tests/&lt;/code&gt;&lt;/strong&gt; — even one test per function is a quality signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pinned &lt;code&gt;requirements.txt&lt;/code&gt;&lt;/strong&gt; — anyone can clone and run; no "works on my machine" surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;metric-led resume bullet&lt;/strong&gt; — &lt;code&gt;10K daily records&lt;/code&gt; is concrete; "ETL pipeline" alone is generic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — about a weekend of focused work for the project; 30 minutes for the bullet.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inline CTA:&lt;/strong&gt; for fresher interview reps see &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;, and the canonical course path &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;SQL&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Hub — all practice&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Browse all data-engineering practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/practice" rel="noopener noreferrer"&gt;Practice →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Course — SQL for DE&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;SQL for Data Engineering Interviews&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;View course →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;COURSE&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Hub — all courses&lt;/span&gt;&lt;br&gt;
&lt;strong&gt;Browse all DE courses&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pipecode.ai/explore/courses" rel="noopener noreferrer"&gt;View courses →&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips to master the data engineering roadmap (best learning order + timeline)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Follow the order — and the calendar
&lt;/h3&gt;

&lt;p&gt;The 13 steps above have a &lt;strong&gt;best learning order&lt;/strong&gt; that works for most freshers — skip ahead at your own risk. The order plus a realistic timeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order:&lt;/strong&gt; SQL → Python → Databases → Pandas → ETL concepts → Data Warehousing → PySpark → Airflow → Cloud (AWS) → Kafka → Projects → Git → Interview prep.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2-3 months&lt;/strong&gt; — SQL + Python basics solid.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4-6 months&lt;/strong&gt; — intermediate DE (warehousing, ETL, modeling).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6-9 months&lt;/strong&gt; — job-ready (Airflow, cloud, projects shipped).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9-12 months&lt;/strong&gt; — strong fresher profile (Spark, streaming basics, polished portfolio).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Most freshers fail for the same four reasons — avoid them
&lt;/h3&gt;

&lt;p&gt;The failure modes are predictable. Watch for these in your own routine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jumping to Spark too early.&lt;/strong&gt; Spark is just bigger SQL with more failure modes; without solid SQL it's noise.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ignoring SQL depth.&lt;/strong&gt; Beyond &lt;code&gt;SELECT&lt;/code&gt; and &lt;code&gt;JOIN&lt;/code&gt;, the bar at the screen is window functions + grain reasoning. Drill them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoiding projects.&lt;/strong&gt; Tutorials and certifications are signals; &lt;em&gt;shipped code on GitHub&lt;/em&gt; is proof.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Watching tutorials without practice.&lt;/strong&gt; Watch the video → close it → rebuild the example without it. If you can't, you didn't learn it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The winning formula
&lt;/h3&gt;

&lt;p&gt;Every successful fresher career follows the same five-step loop: &lt;strong&gt;learn → practice → build → publish → interview&lt;/strong&gt;. Pick a topic, drill it in a coding environment, build a small artifact, push to GitHub, then interview for jobs that touch that topic. Repeat for each step in the roadmap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Books worth buying
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Designing Data-Intensive Applications&lt;/strong&gt; (Martin Kleppmann) — the modern systems book; read once a quarter.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Data Warehouse Toolkit&lt;/strong&gt; (Ralph Kimball and Margy Ross) — the canonical dimensional-modeling reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where to practice on PipeCode
&lt;/h3&gt;

&lt;p&gt;Start with the &lt;a href="https://pipecode.ai/explore/practice/language/sql" rel="noopener noreferrer"&gt;SQL practice page&lt;/a&gt; and the &lt;a href="https://pipecode.ai/explore/practice/language/python" rel="noopener noreferrer"&gt;Python practice page&lt;/a&gt;; the structured paths are &lt;a href="https://pipecode.ai/explore/courses/sql-for-data-engineering-interviews-from-zero-to-faang" rel="noopener noreferrer"&gt;SQL for Data Engineering Interviews — From Zero to FAANG&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/python-for-data-engineering-interviews-the-complete-fundamentals" rel="noopener noreferrer"&gt;Python for Data Engineering Interviews — Complete Fundamentals&lt;/a&gt;. After SQL and Python land, drill &lt;a href="https://pipecode.ai/explore/practice/topic/etl" rel="noopener noreferrer"&gt;ETL practice&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/window-functions" rel="noopener noreferrer"&gt;window functions&lt;/a&gt;, &lt;a href="https://pipecode.ai/explore/practice/topic/joins" rel="noopener noreferrer"&gt;joins&lt;/a&gt;, and the deeper &lt;a href="https://pipecode.ai/explore/courses/data-modeling-for-data-engineering-interviews" rel="noopener noreferrer"&gt;Data Modeling course&lt;/a&gt; and &lt;a href="https://pipecode.ai/explore/courses/etl-system-design-for-data-engineering-interviews" rel="noopener noreferrer"&gt;ETL System Design course&lt;/a&gt;. Pivot to peer guides — the &lt;a href="https://pipecode.ai/blogs/airbnb-data-engineering-interview-questions-prep-guide" rel="noopener noreferrer"&gt;Airbnb DE interview guide&lt;/a&gt;, the &lt;a href="https://pipecode.ai/blogs/top-data-engineering-interview-questions-2026" rel="noopener noreferrer"&gt;top DE interview questions 2026&lt;/a&gt;, and the &lt;a href="https://pipecode.ai/blogs/sql-data-types-postgresql-guide" rel="noopener noreferrer"&gt;SQL data types Postgres guide&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How long does it really take to become a data engineer in 2026?
&lt;/h3&gt;

&lt;p&gt;If you're consistent, &lt;strong&gt;6-9 months&lt;/strong&gt; at 10-15 hours per week is enough to be &lt;strong&gt;job-ready&lt;/strong&gt; for junior / fresher data-engineering roles; &lt;strong&gt;9-12 months&lt;/strong&gt; produces a &lt;strong&gt;strong fresher profile&lt;/strong&gt; with Spark, streaming basics, and a polished portfolio. The 2-3 month mark is where SQL and Python basics click; 4-6 months gets you through warehousing, ETL, and modeling. The single biggest predictor of speed is &lt;strong&gt;consistency&lt;/strong&gt; — 10 hours a week for 6 months beats 40 hours a week for 6 weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to learn all 13 steps before applying for jobs?
&lt;/h3&gt;

&lt;p&gt;No — start applying as soon as &lt;strong&gt;Steps 1-5&lt;/strong&gt; are solid (SQL, Python, databases, warehousing, ETL/ELT). Roles you can target with the first five steps done: junior data engineer, junior analytics engineer, data engineer intern, ETL developer trainee. Steps 6-9 (Spark, Airflow, Cloud, Modeling) turn "hireable" into "competitive." Steps 10-13 (Streaming, Projects, Git, Interview prep) close the deal. Apply earlier than you think you should — interviewing is itself a skill that needs reps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I master one cloud or learn all three (AWS, Azure, GCP)?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pick one first&lt;/strong&gt; and master its core data services before touching the others. &lt;strong&gt;AWS&lt;/strong&gt; is the most asked at fresher interviews and the most widely deployed in industry — start there. The core AWS services for fresher DE work: &lt;strong&gt;S3&lt;/strong&gt; (object storage), &lt;strong&gt;IAM&lt;/strong&gt; (access control), &lt;strong&gt;Lambda&lt;/strong&gt; (serverless functions), &lt;strong&gt;Glue&lt;/strong&gt; (managed ETL), &lt;strong&gt;Redshift&lt;/strong&gt; (warehouse). Once you have one cloud under your belt, the other two are easy because the concepts (object storage, IAM, serverless, managed ETL, warehouse) are the same — only the names change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is Apache Spark required for fresher data-engineering jobs?
&lt;/h3&gt;

&lt;p&gt;For most fresher first jobs, &lt;strong&gt;no&lt;/strong&gt; — but knowing &lt;em&gt;what Spark is&lt;/em&gt; and &lt;em&gt;when it appears&lt;/em&gt; is required. The honest fresher posture: &lt;em&gt;"I've shipped batch ETL with Python and SQL; I know Spark is the next step when data outgrows a single machine; I've done the PySpark Fundamentals tutorial and would learn the rest on the job."&lt;/em&gt; That's enough for 80% of fresher screens. Roles at Spark-heavy shops (Databricks customers, ad-tech, large e-commerce) will test deeper — for those, ship a PySpark project as part of your Step 11 portfolio.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does a data engineer actually do day-to-day?
&lt;/h3&gt;

&lt;p&gt;Day-to-day, a data engineer &lt;strong&gt;writes SQL queries, builds and maintains batch pipelines, models new tables, fixes data quality issues, and reviews other engineers' pipelines&lt;/strong&gt;. A typical week: Monday — investigate a Slack message about a wrong dashboard number (usually a grain or null-handling bug); Tuesday-Wednesday — model a new dimension table for a product launch; Thursday — code review on a teammate's Airflow DAG; Friday — add a quality check that would have caught Monday's bug. Spark, Kafka, and lakehouse architecture appear at scale-heavy companies; the day-to-day at most companies is SQL + modeling + pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the difference between a data engineer, data analyst, and data scientist?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Data engineers build the pipelines and tables&lt;/strong&gt;; analysts query them for business questions; scientists run experiments and ML models on top. In a typical e-commerce team: a DE owns the daily ETL that loads &lt;code&gt;cur_orders&lt;/code&gt;; an analyst writes the SQL behind the daily revenue dashboard; a scientist runs the A/B test that decides whether the new checkout flow ships. The roles overlap on SQL — every analytics person writes it — but only DEs own the &lt;em&gt;infrastructure&lt;/em&gt; that produces the tables everyone else queries. Salaries also follow this stack — DEs are typically paid more than analysts and on par with scientists at most companies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Start practicing data engineering interview problems
&lt;/h2&gt;

</description>
      <category>python</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>interview</category>
    </item>
  </channel>
</rss>
