<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sandeep Bhatia</title>
    <description>The latest articles on Forem by Sandeep Bhatia (@sandeep_bhatia_e5942a9ad6).</description>
    <link>https://forem.com/sandeep_bhatia_e5942a9ad6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2941240%2Fff2b44e8-babc-4ee5-8a2c-a194d041acb1.png</url>
      <title>Forem: Sandeep Bhatia</title>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sandeep_bhatia_e5942a9ad6"/>
    <language>en</language>
    <item>
      <title>Healthcare Analytics Has a Cloud Cost Problem. And It's Not the One You Think.</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Wed, 18 Mar 2026 06:21:31 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/healthcare-analytics-has-a-cloud-cost-problem-and-its-not-the-one-you-think-1l4a</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/healthcare-analytics-has-a-cloud-cost-problem-and-its-not-the-one-you-think-1l4a</guid>
      <description>&lt;p&gt;Healthcare analytics organizations are quietly building some of the most complex cloud data architectures in existence. Petabytes of EHR data. Claims streams arriving in near-real time. Clinical trial datasets that span multiple sites, multiple countries, and multiple consent regimes. Predictive models for patient readmission risk running on top of BigQuery ML. Databricks clusters processing Epic and Cerner extracts into curated Delta tables every four hours.&lt;/p&gt;

&lt;p&gt;And almost all of them are significantly overspending.&lt;/p&gt;

&lt;p&gt;Not because they chose the wrong vendors. Not because they over-built their infrastructure. But because the skills required to manage cost at this level of complexity demand something the industry hasn't fully articulated yet: you need to be both a FinOps engineer and a deep query performance specialist, simultaneously, across multiple platforms that each have completely different cost models.&lt;/p&gt;

&lt;p&gt;I want to walk through exactly what that looks like in practice — where the money goes, why it's so hard to catch, and what the engineers who are successfully tackling it are doing differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack That Healthcare Analytics Actually Runs On
&lt;/h2&gt;

&lt;p&gt;Before we get into cost traps, it's worth describing the architecture I see most often.&lt;/p&gt;

&lt;p&gt;On the Azure side: Databricks for data engineering and ML pipeline work, Azure Synapse Analytics SQL Pools for the analytical warehouse that clinical and operations teams query directly. Jupyter notebooks scattered across the Databricks workspace — some attached to long-running clusters, some firing up their own clusters, some nobody has opened in six months but the cluster is still warm. The orchestration is usually Azure Data Factory or Databricks Workflows, and the governance layer is Unity Catalog if the team is reasonably mature.&lt;/p&gt;

&lt;p&gt;On the GCP side — often run in parallel by the same organization — BigQuery for analytics, BigQuery ML for the predictive modeling work, and Cloud Composer (managed Airflow) to schedule everything. The BigQuery workload tends to be heavier on ad-hoc query traffic because the SQL interface is familiar to clinical analysts who don't want to touch PySpark.&lt;/p&gt;

&lt;p&gt;What drives this dual-cloud topology? Often it's acquisition history. A health system acquires a specialty pharmacy data company that runs on Azure. The core analytics platform was built on GCP three years ago. Nobody wants to migrate anything, and honestly they shouldn't — migration risk in healthcare data environments is real. So you end up with both.&lt;/p&gt;

&lt;p&gt;The result is a cost management problem that spans two major clouds, three or four distinct compute billing models, and a dozen different ways for a single query to silently consume thousands of dollars.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Healthcare Companies Are Now Requiring of Their Data Platform Engineers
&lt;/h2&gt;

&lt;p&gt;Something has shifted in the last eighteen months in the healthcare analytics job market. The job descriptions that used to read "strong SQL skills, experience with ETL pipelines" now require a genuinely different profile.&lt;/p&gt;

&lt;p&gt;Here is what leading healthcare analytics companies are now asking for in their job descriptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6+ years of hands-on data engineering experience, with significant exposure to &lt;strong&gt;healthcare clinical data&lt;/strong&gt; — EHR systems, claims, clinical trials, population health datasets&lt;/li&gt;
&lt;li&gt;Proficiency with &lt;strong&gt;Databricks&lt;/strong&gt; for building and optimizing data pipelines, including Unity Catalog, Medallion architecture, and Delta Lake optimization&lt;/li&gt;
&lt;li&gt;Strong skills in &lt;strong&gt;Azure Synapse&lt;/strong&gt; and/or &lt;strong&gt;BigQuery&lt;/strong&gt; with an understanding of their respective cost models, not just their query interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost awareness as a first-class skill&lt;/strong&gt; — the ability to estimate, attribute, and reduce compute spend across platforms, not just as an afterthought but as part of pipeline design&lt;/li&gt;
&lt;li&gt;Experience with &lt;strong&gt;Jupyter Notebooks governance&lt;/strong&gt; — understanding which notebooks are driving cluster costs, which are idle, and how to build guardrails around ad-hoc compute consumption&lt;/li&gt;
&lt;li&gt;Familiarity with &lt;strong&gt;FinOps practices&lt;/strong&gt;: tag-based cost allocation, chargeback to clinical vs. research vs. operations teams, and the tooling required to make that visible&lt;/li&gt;
&lt;li&gt;Understanding of &lt;strong&gt;HIPAA-compliant approaches to cost monitoring&lt;/strong&gt;, which means you can't just pipe query logs to an external SaaS without thinking about what's in the metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point deserves its own paragraph. Cost monitoring tools that work cleanly for a retail analytics team create compliance headaches in healthcare environments. Query text can contain patient identifiers. Execution logs can reveal which clinical cohorts are being analyzed. Any tooling that touches query metadata needs to be evaluated not just for technical capability but for data governance suitability.&lt;/p&gt;

&lt;p&gt;The companies asking for all of this in a single engineer are essentially describing a new subspecialty that sits at the intersection of database performance engineering and FinOps. Let me show you what that looks like across the two most common platform combinations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 1: Databricks on Azure + Synapse SQL Pools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;A mid-sized health system I worked with had this setup: Databricks handling the heavy transformation work — ingesting Epic ADT feeds, normalizing Cerner lab result streams, building the curated patient encounter tables that downstream systems consumed. Synapse SQL Pools sat downstream, serving as the analytical warehouse for clinical operations and population health queries. Finance was getting charged a blended rate for "data platform." Nobody knew how much was Databricks versus Synapse, and nobody knew how much of either was actually necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Trap 1: Databricks Cluster Scaling Gone Wrong
&lt;/h3&gt;

&lt;p&gt;Databricks auto-scaling is genuinely useful. It's also one of the most common sources of uncontrolled spend I've seen.&lt;/p&gt;

&lt;p&gt;The problem is that autoscaling responds to pending tasks, not to business priority. An EHR ingestion job that runs at 3 AM and a data scientist running an exploratory cohort analysis at 2 PM can both trigger a cluster scale-out event. But they have completely different cost profiles and completely different latency requirements. The ingestion job can run on a smaller cluster over a longer window. The analyst query needs to complete before the afternoon clinical review meeting.&lt;/p&gt;

&lt;p&gt;What I typically find when I dig into &lt;code&gt;system.billing.usage&lt;/code&gt; on these clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;sku_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage_quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dbus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage_quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;list_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;est_cost&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;usage&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Almost universally, the top cost hours are not the scheduled pipeline windows. They are the hours immediately surrounding when data scientists opened notebooks and started exploring — and the clusters scaled out to accommodate the interactive workload, then didn't scale back in before the next scheduled job ran and kept the cluster alive.&lt;/p&gt;

&lt;p&gt;The fix is workload isolation by job type, not just by time of day. Scheduled EHR ingestion pipelines should run on dedicated job clusters with fixed sizing, sized for the actual task. Interactive exploration should happen on separate clusters with aggressive auto-terminate configured (ten minutes of inactivity is a reasonable starting point for clinical analytics environments where the analyst might step away for a call).&lt;/p&gt;
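&lt;p&gt;A quick way to see how spend splits between those two workload types is the SKU name in the billing table — all-purpose SKUs are the interactive clusters. A minimal sketch (SKU naming varies by tier and cloud, so treat the &lt;code&gt;LIKE&lt;/code&gt; patterns as a starting point, not a taxonomy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rough split of DBU consumption: scheduled job compute vs. interactive compute
SELECT
    CASE
        WHEN sku_name LIKE '%ALL_PURPOSE%' THEN 'interactive'
        WHEN sku_name LIKE '%JOBS%' THEN 'scheduled'
        ELSE 'other'
    END AS workload_class,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date &amp;gt;= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1
ORDER BY 2 DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the interactive share dominates, the auto-terminate and isolation work above will pay for itself quickly.&lt;/p&gt;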

&lt;h3&gt;
  
  
  Cost Trap 2: Shuffle Spill in Clinical Data Joins
&lt;/h3&gt;

&lt;p&gt;The most expensive Spark anti-pattern I see in healthcare pipelines is the shuffle spill on patient-encounter joins. The reason it's endemic to healthcare specifically: encounter tables are enormous (millions of rows per patient-year for high-utilization populations), and the analytical question almost always requires joining them back to patient demographic tables, claims tables, and diagnosis code tables simultaneously.&lt;/p&gt;

&lt;p&gt;When Spark can't fit the shuffle exchange in memory, it spills to local disk and potentially to remote storage. By the time a single claims adjudication query has spilled 15 GB to remote storage, you're looking at significant I/O cost on top of the DBU cost — and the query is also dramatically slower than it needs to be.&lt;/p&gt;

&lt;p&gt;The signature I look for in &lt;code&gt;system.query.history&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;statement_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spilled_local_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;read_bytes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1073741824&lt;/span&gt;  &lt;span class="c1"&gt;-- 1 GB&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;execution_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'FINISHED'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The remediation depends on the specific join. For joins where one side is a small reference table (ICD-10 code descriptions, formulary data, provider rosters), adding a &lt;code&gt;/*+ BROADCAST(small_table) */&lt;/code&gt; hint eliminates the shuffle entirely. For genuine large-to-large joins (patient demographics to encounter history), the answer is usually Z-ORDER on the join columns and ensuring partition pruning is firing. If the claims table is partitioned by &lt;code&gt;service_date&lt;/code&gt; and your join query doesn't include a date filter, you're scanning the entire table every time.&lt;/p&gt;
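&lt;p&gt;Concretely, the two remediations look something like this — the table and column names are illustrative, following the clinical examples above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Small reference table: broadcast it and skip the shuffle entirely
SELECT /*+ BROADCAST(d) */
    e.patient_id,
    d.code_description
FROM clinical.patient_encounters e
JOIN ref.icd10_codes d
  ON e.primary_diagnosis_code = d.icd10_code
WHERE e.service_date &amp;gt;= '2024-01-01';

-- Large-to-large join: co-locate the Delta files on the join key
OPTIMIZE clinical.patient_encounters
ZORDER BY (patient_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;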

&lt;h3&gt;
  
  
  Cost Trap 3: Synapse DWU Over-Provisioning
&lt;/h3&gt;

&lt;p&gt;Synapse dedicated SQL Pools bill by DWU-hour whether queries are running or not; the meter stops only when the pool is paused. DW300c, DW500c, DW1000c — each step up is a significant price increase, and performance does not scale linearly with price for every workload type.&lt;/p&gt;

&lt;p&gt;The pattern I see most often: the Synapse pool was provisioned at DW1000c for a large historical load that ran during initial deployment, and nobody ever downsized it. The ongoing operational workload — population health dashboard queries, clinical operations reports, care management cohort queries — is actually quite manageable at DW300c or DW500c. The expensive compute is idle most of the time.&lt;/p&gt;

&lt;p&gt;Query the DMVs to see what you're actually running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_elapsed_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;elapsed_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows_returned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dm_pdw_exec_requests&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Completed'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GETDATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the vast majority of completed requests show elapsed times under 30 seconds and row counts in the thousands rather than millions, you're running an enterprise warehouse at clinical-trial-data-load scale. That's an expensive mismatch.&lt;/p&gt;
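&lt;p&gt;When that's the case, resizing is a one-line operation run from the &lt;code&gt;master&lt;/code&gt; database (the pool name here is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Scale the dedicated SQL pool down to DW300c
ALTER DATABASE healthdw MODIFY (SERVICE_OBJECTIVE = 'DW300c');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Measure the operational workload at the lower tier for a week before committing; scaling back up is the same one-liner.&lt;/p&gt;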

&lt;p&gt;The more subtle version of this problem is the &lt;strong&gt;distribution key mismatch&lt;/strong&gt;. Synapse distributes data across 60 distributions. If your &lt;code&gt;patient_encounters&lt;/code&gt; table is hash-distributed on &lt;code&gt;encounter_id&lt;/code&gt; but your most common analytical query joins on &lt;code&gt;patient_id&lt;/code&gt;, every join requires a data movement operation — shuffling rows between distributions before the join can execute. When I pull &lt;code&gt;sys.dm_pdw_request_steps&lt;/code&gt; and see routine population health queries dominated by &lt;code&gt;ShuffleMoveOperation&lt;/code&gt; steps moving millions of rows, this is almost always the cause.&lt;/p&gt;
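&lt;p&gt;One way to surface those data movement steps (a sketch; remember these DMVs retain only the most recent 10,000 requests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the queries paying the data movement tax
SELECT
    s.request_id,
    s.operation_type,
    s.row_count,
    s.total_elapsed_time / 1000.0 AS elapsed_sec
FROM sys.dm_pdw_request_steps s
WHERE s.operation_type IN ('ShuffleMoveOperation', 'BroadcastMoveOperation')
ORDER BY s.row_count DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;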

&lt;p&gt;The fix requires a table rebuild with the correct distribution key. It's disruptive, but it's often the single highest-leverage Synapse optimization available.&lt;/p&gt;
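&lt;p&gt;The rebuild itself is a CTAS followed by a rename swap, sketched here with the table names from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Re-create the table distributed on the actual join key
CREATE TABLE dbo.patient_encounters_new
WITH (
    DISTRIBUTION = HASH(patient_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM dbo.patient_encounters;

-- Swap names once row counts are verified
RENAME OBJECT dbo.patient_encounters TO patient_encounters_old;
RENAME OBJECT dbo.patient_encounters_new TO patient_encounters;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;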

&lt;h3&gt;
  
  
  The Jupyter Notebook Problem
&lt;/h3&gt;

&lt;p&gt;Data scientists in healthcare analytics environments are frequently the largest uncontrolled cost source on a Databricks platform — not because they're wasteful, but because the notebook interface abstracts away cost visibility entirely.&lt;/p&gt;

&lt;p&gt;A clinical data scientist analyzing a patient cohort for a care management program opens a notebook, attaches it to the shared analytics cluster, runs several exploratory queries against a 500 GB Delta table, gets pulled into a meeting, and leaves the notebook session open. The cluster stays alive. The interactive SQL query that scanned 180 GB of patient encounter data without partition filters runs twelve times as they refine their cohort definition. There's no indication anywhere in the notebook interface that this sequence of operations cost $340 in compute.&lt;/p&gt;

&lt;p&gt;This is a governance problem as much as a technical one. The teams handling it well are doing three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster policies&lt;/strong&gt; that enforce auto-terminate for all interactive clusters, no exceptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query tagging&lt;/strong&gt; that attributes Databricks spend to cost centers by notebook name and username, using the &lt;code&gt;custom_tags&lt;/code&gt; field in &lt;code&gt;system.billing.usage&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility at the notebook level&lt;/strong&gt; — surfacing DBU consumption per notebook session to the data scientist in near-real time, so the feedback loop is immediate rather than appearing on a monthly cloud invoice&lt;/li&gt;
&lt;/ol&gt;
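
&lt;p&gt;The tag-based attribution in point 2 can be sketched directly against the billing table. The &lt;code&gt;cost_center&lt;/code&gt; tag key is an assumption here; use whatever key your cluster policies actually enforce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Monthly DBU consumption by cost-center tag (untagged usage shows up as NULL)
SELECT
    custom_tags['cost_center'] AS cost_center,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date &amp;gt;= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1
ORDER BY 2 DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;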

&lt;p&gt;The chargeback piece is particularly important in healthcare organizations with distinct clinical, research, and operations functions. Clinical informatics and health economics teams often have separate budgets. Without tag-based cost allocation that maps Databricks workspace spending to those cost centers, the platform team ends up absorbing costs that should be attributable to specific programs — and they lose the ability to have informed conversations about compute allocation priorities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 2: BigQuery on GCP with ML Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;On the GCP side, BigQuery handles the analytical layer, and an increasing number of healthcare organizations are moving into BigQuery ML for the predictive modeling work that used to require standing up separate ML infrastructure. Readmission risk models. Drug interaction flags. Patient population stratification for value-based care contracts. Real-world evidence analytics for specialty pharmacy programs.&lt;/p&gt;

&lt;p&gt;BigQuery's billing model is deceptively simple on the surface — on-demand pricing at $6.25 per TB scanned — and deceptively expensive in practice. The reason: on-demand queries are billed for every column they reference, so a SELECT * on a 20-column table bills you for all 20 columns of scanned storage even when the analysis needs only three of them. And the WHERE clause does not reduce bytes billed unless it prunes partitions or clusters.&lt;/p&gt;
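&lt;p&gt;Before optimizing anything, it's worth knowing who is scanning what. &lt;code&gt;INFORMATION_SCHEMA.JOBS_BY_PROJECT&lt;/code&gt; makes the on-demand bill queryable — adjust the region qualifier to yours; the $6.25/TB multiplier assumes on-demand pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Bytes billed per user over the last 30 days, with an estimated cost
SELECT
    user_email,
    SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS est_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY user_email
ORDER BY tib_billed DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;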

&lt;h3&gt;
  
  
  Cost Trap 1: BQML Training Jobs Without Partition Pruning
&lt;/h3&gt;

&lt;p&gt;BigQuery ML training jobs are some of the most expensive queries I've ever seen in a production environment. The pattern in healthcare: a readmission risk model trains on three years of encounter history, and the training SQL looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What this query actually does to your BigQuery bill:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_risk_v3`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'logistic_reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_label_cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'readmitted_30d'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_demographics`&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks reasonable. The problem is that &lt;code&gt;patient_encounters&lt;/code&gt; is partitioned by &lt;code&gt;admission_date&lt;/code&gt; and clustered on &lt;code&gt;patient_id&lt;/code&gt;, but the BQML job runs as a full table scan because the query optimizer doesn't always push partition filters through the JOIN path in training queries. On a 2 TB encounters table, that's a $12.50 charge per training run. Run it weekly as the model retrains on new data, and you're at $650/year for a single model — and this organization has fourteen predictive models.&lt;/p&gt;

&lt;p&gt;The fix: build a partitioned, incrementally maintained feature table first, then train against that bounded dataset. This separates the scan cost from the training cost, makes the scan explicit and optimizable, and avoids the strict limitations and maintenance overhead of BigQuery materialized views with joins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Build a curated, partitioned feature table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_features`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;patient_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_demographics`&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
  &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Training now scans the curated feature table, not the raw history&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_risk_v3`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'logistic_reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_label_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'readmitted_30d'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_features`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single change reduced the weekly training cost for this particular model from $12.50 to under $2.00 — because the feature table pipeline only processes the partitions that have new data since the last refresh, and the model training only scans the curated features.&lt;/p&gt;
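&lt;p&gt;The incremental refresh can be as simple as replacing a trailing window of partitions on each run. The seven-day window below is an assumption about how late encounter data arrives; widen it to match your source systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Drop and reload only the partitions that may have changed
DELETE FROM `clinical.readmission_features`
WHERE admission_date &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

INSERT INTO `clinical.readmission_features`
SELECT
    e.admission_date,
    e.patient_id,
    p.age_at_admission,
    p.primary_diagnosis_code,
    p.discharge_disposition,
    e.prior_admissions_12mo,
    e.length_of_stay_days,
    e.readmitted_30d
FROM `clinical.patient_encounters` e
JOIN `clinical.patient_demographics` p
  USING (patient_id)
WHERE e.admission_date &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;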

&lt;h3&gt;
  
  
  Cost Trap 2: Partition Pruning Failures on Clinical Data
&lt;/h3&gt;

&lt;p&gt;The most common BigQuery anti-pattern I see in healthcare environments is partition pruning failure on date-partitioned clinical tables. It's particularly insidious because the query &lt;em&gt;looks&lt;/em&gt; correct.&lt;/p&gt;

&lt;p&gt;The encounter table is partitioned by &lt;code&gt;admission_date&lt;/code&gt;. The analyst writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;CAST()&lt;/code&gt; wrapper defeats partition pruning. BigQuery cannot evaluate the partition filter before scanning because the cast is applied per-row, not per-partition. The query scans the entire table. A small change — removing the redundant cast on a column that is already of type DATE — immediately activates partition elimination.&lt;/p&gt;
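
&lt;p&gt;For reference, the pruning-friendly version of the same query simply filters the DATE partition column directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    patient_id,
    primary_diagnosis_code,
    length_of_stay_days
FROM `clinical.patient_encounters`
WHERE admission_date &amp;gt;= '2024-01-01'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;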

&lt;p&gt;The same failure mode appears with date functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Defeats pruning:&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;

&lt;span class="c1"&gt;-- Activates pruning:&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-12-31'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 2 TB encounters table queried twenty times a day by clinical analysts, the difference between pruned and unpruned queries is approximately $225/day in scan costs. That's $82,000/year from a single table, and I've seen organizations with dozens of tables in this state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Trap 3: On-Demand vs. Slot Reservations
&lt;/h3&gt;

&lt;p&gt;Most healthcare analytics teams start on BigQuery's on-demand pricing because it requires no commitment. The total spend is low, the per-query cost is invisible, and the billing model seems straightforward.&lt;/p&gt;

&lt;p&gt;As the analytics workload grows — more users, more dashboards, more BQML training jobs — on-demand pricing starts to become expensive and unpredictable. The shift from on-demand to slot reservations (BigQuery Editions) is one of the highest-leverage cost decisions a healthcare data platform team can make, but it requires knowing your actual workload pattern well enough to size the commitment correctly.&lt;/p&gt;

&lt;p&gt;The analysis I run before making this recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;creation_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_slot_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;slot_hours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_bytes_billed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;on_demand_cost_usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`region-us`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFORMATION_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JOBS_BY_PROJECT&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;creation_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;statement_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'SCRIPT'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a per-hour view of slot consumption and equivalent on-demand cost. For most healthcare analytics organizations with 24-hour query activity driven by clinical operations in the US and research teams in other time zones, the breakeven point for a 100-slot Enterprise Edition reservation typically sits around $4,000–$6,000 in monthly on-demand spend. Below that threshold, stay on on-demand. Above it, model the commitment carefully.&lt;/p&gt;

&lt;p&gt;One nuance that matters in healthcare: ML training jobs consume slots in bursts that look very different from steady analytical query traffic. A slot reservation sized for the analytical baseline will run fine most of the time but queue aggressively when training jobs fire. The right answer is usually a baseline reservation for predictable analytical workloads plus on-demand overflow for training bursts — not a single reservation attempting to cover both.&lt;/p&gt;
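
&lt;p&gt;That baseline-plus-overflow split can be expressed in BigQuery's reservation DDL. This is a sketch only: the project and reservation names are placeholders, and the statements run in the reservation admin project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Baseline reservation for predictable analytical workloads
-- (all names here are placeholders).
CREATE RESERVATION `admin-proj.region-us.analytics-baseline`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

-- Pin the analytical project to the baseline reservation.
CREATE ASSIGNMENT `admin-proj.region-us.analytics-baseline.analytics-assign`
OPTIONS (
  assignee = 'projects/analytics-prod',
  job_type = 'QUERY');

-- Leave the ML training project unassigned so its bursts bill on-demand.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;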

&lt;h3&gt;
  
  
  Cost Attribution in Healthcare Environments
&lt;/h3&gt;

&lt;p&gt;The chargeback conversation in healthcare organizations is more complex than in most industries because the cost centers are more numerous and the political stakes of attribution are higher. Clinical informatics, research computing, population health management, specialty pharmacy analytics, and corporate finance analytics all have legitimate claims on the data platform, and they all have separate budget owners.&lt;/p&gt;

&lt;p&gt;The tooling to handle this correctly requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Label-based attribution&lt;/strong&gt; on every BigQuery job — &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;cost_center&lt;/code&gt;, &lt;code&gt;project_code&lt;/code&gt;, and &lt;code&gt;data_classification&lt;/code&gt; (the last one matters for HIPAA considerations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated label enforcement&lt;/strong&gt; — labels that have to be set manually will be forgotten; they need to be injected at the pipeline level or caught by a policy that fails jobs without required labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A reporting layer&lt;/strong&gt; that translates BigQuery billing export data into per-department spend, with drill-down to the job level for the month-end chargeback conversation&lt;/li&gt;
&lt;/ol&gt;
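
&lt;p&gt;A minimal sketch of the core query behind that reporting layer, assuming a standard detailed billing export (the table name is a placeholder for your own export dataset):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    (SELECT l.value FROM UNNEST(labels) AS l WHERE l.key = 'cost_center') AS cost_center,
    (SELECT l.value FROM UNNEST(labels) AS l WHERE l.key = 'team') AS team,
    SUM(cost) AS total_cost_usd
FROM `finops.billing.gcp_billing_export_v1`  -- placeholder table name
WHERE service.description = 'BigQuery'
  AND usage_start_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY total_cost_usd DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;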

&lt;p&gt;The &lt;code&gt;data_classification&lt;/code&gt; label deserves specific attention. Before you pipe BigQuery job metadata to any external cost reporting tool, you need to understand what's in the query text and execution metadata. For most analytical queries this is innocuous — aggregate counts, summary statistics. But for ad-hoc queries run by clinical analysts, query text can contain patient identifiers, cohort definitions referencing diagnosis codes, or other protected health information. Your cost attribution infrastructure needs to be evaluated under the same data governance framework as your clinical data infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dual Skillset Problem
&lt;/h2&gt;

&lt;p&gt;What makes all of this hard is that it genuinely requires two different types of expertise to do well, and they don't naturally coexist in the same person or the same team.&lt;/p&gt;

&lt;p&gt;Query performance tuning is a deeply technical skill. Understanding why a BigQuery partition filter is failing requires knowing how the query optimizer evaluates predicates against partition metadata. Understanding why a Databricks Spark job is spilling to remote storage requires reading the physical query plan and understanding how broadcast joins compare to sort-merge joins in terms of memory pressure. These are skills that develop over years of reading execution plans and tuning real production workloads.&lt;/p&gt;

&lt;p&gt;FinOps engineering is a different skill. It requires understanding billing models across multiple vendors, building cost attribution infrastructure, having difficult conversations with budget owners about chargeback policies, and designing governance frameworks that work in regulated industries without becoming so restrictive that they impede legitimate analytical work.&lt;/p&gt;

&lt;p&gt;The healthcare analytics market is increasingly recognizing that data platform teams need both. The roles being posted now aren't just "senior data engineer" in the traditional sense — they're asking for engineers who can optimize a Spark shuffle operation in the morning and design a tag-based chargeback policy in the afternoon.&lt;/p&gt;

&lt;p&gt;That combination is rare. And organizations that have it, or are building toward it, have a significant structural advantage: they fix the right thing first. They tune the queries, let utilization settle into its real baseline, and then right-size the infrastructure against that accurate signal. The teams that don't have this expertise tend to do it backwards — they see high infrastructure costs, provision more compute to alleviate the slowness, and end up with an even larger bill for an even more complicated system.&lt;/p&gt;

&lt;p&gt;Healthcare data is too important, and the compute costs are too large, to get that order wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;If your team is managing a multi-platform healthcare analytics environment and you're not sure whether your current spend reflects your actual workload requirements, here is where I'd begin:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Databricks on Azure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query &lt;code&gt;system.query.history&lt;/code&gt; for &lt;code&gt;spilled_remote_bytes &amp;gt; 1073741824&lt;/code&gt; — any query spilling more than 1 GB to remote storage is a high-priority tuning target&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;system.billing.usage&lt;/code&gt; for DBU consumption by &lt;code&gt;sku_name&lt;/code&gt; broken down by hour — are your peaks during scheduled pipeline windows or during interactive hours?&lt;/li&gt;
&lt;li&gt;Audit notebook cluster attachment — how many notebooks are attached to clusters that have been idle for more than 30 minutes?&lt;/li&gt;
&lt;/ul&gt;
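
&lt;p&gt;The first check, expressed as a query against the system table referenced above (treat the exact column names as a sketch and verify them against your workspace's schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    statement_id,
    executed_by,
    total_duration_ms,
    spilled_remote_bytes
FROM system.query.history
WHERE spilled_remote_bytes &amp;gt; 1073741824  -- queries spilling more than 1 GB to remote storage
ORDER BY spilled_remote_bytes DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;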

&lt;p&gt;&lt;strong&gt;For Azure Synapse SQL Pools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check &lt;code&gt;sys.dm_pdw_request_steps&lt;/code&gt; for &lt;code&gt;ShuffleMoveOperation&lt;/code&gt; and &lt;code&gt;BroadcastMoveOperation&lt;/code&gt; steps — if routine queries are moving more than 1 GB between distributions, your distribution key strategy needs revisiting&lt;/li&gt;
&lt;li&gt;Look at &lt;code&gt;columnstore_segments_scanned&lt;/code&gt; vs. &lt;code&gt;columnstore_segments_skipped&lt;/code&gt; — if segment elimination is below 50%, your ordered CCI configuration needs work&lt;/li&gt;
&lt;li&gt;Is your DWU tier aligned to your actual peak workload, or to the heaviest query you ever ran during initial setup?&lt;/li&gt;
&lt;/ul&gt;
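
&lt;p&gt;The data movement check can be sketched against the DMVs like this (the operation names are the standard dedicated SQL pool step types; adjust for your environment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find recent requests dominated by broadcast or shuffle data movement.
SELECT
    r.request_id,
    r.command,
    s.operation_type,
    s.total_elapsed_time,
    s.row_count
FROM sys.dm_pdw_exec_requests AS r
JOIN sys.dm_pdw_request_steps AS s
  ON r.request_id = s.request_id
WHERE s.operation_type IN ('BroadcastMoveOperation', 'ShuffleMoveOperation')
ORDER BY s.total_elapsed_time DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;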

&lt;p&gt;&lt;strong&gt;For BigQuery on GCP:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull &lt;code&gt;INFORMATION_SCHEMA.JOBS_BY_PROJECT&lt;/code&gt; and look for queries against partitioned tables whose &lt;code&gt;total_bytes_processed&lt;/code&gt; is far larger than their date filters should require — a full-table scan despite a date predicate is the signature of partition pruning failure&lt;/li&gt;
&lt;li&gt;Check your BQML training jobs for full table scans that could be curated feature table references instead&lt;/li&gt;
&lt;li&gt;Run the hourly slot consumption analysis above and compare against your monthly on-demand bill — the reservation breakeven may be closer than you think&lt;/li&gt;
&lt;/ul&gt;
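
&lt;p&gt;A quick way to surface candidates for the first check is to rank the heaviest scans over the last week:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    job_id,
    user_email,
    total_bytes_processed / POW(1024, 4) AS tib_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT'
ORDER BY total_bytes_processed DESC
LIMIT 25;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;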

&lt;p&gt;The underlying question in all of these cases is the same: are you paying for what you actually need, or are you paying for what a poorly optimized query decided to consume?&lt;/p&gt;

&lt;p&gt;In healthcare analytics, getting that answer right matters more than in most industries. The compute budget that's going to unnecessary BigQuery scans is compute budget that isn't going to expanding your real-world evidence analytics program, or improving your readmission prediction models, or accelerating the clinical trial data aggregation pipelines that your research teams are waiting on.&lt;/p&gt;

&lt;p&gt;The data you're working with is too important to leave money on the table like this. And the tools to find it are, for the most part, already built into the platforms you're running.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you navigated multi-platform cost optimization in a healthcare analytics environment? I'd be particularly interested in hearing how your teams are handling the HIPAA considerations around query metadata in cost monitoring tools — it's one of the most underappreciated constraints in this space.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: FinOps, DataEngineering, HealthcareAnalytics, Databricks, BigQuery, AzureSynapse, CloudCostOptimization, SQL, QueryTuning, HIPAA&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Snowflake Cost Optimization Starts With Workload Design, Not Auto-Suspend</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Mon, 16 Mar 2026 02:56:31 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/snowflake-cost-optimization-starts-with-workload-design-not-auto-suspend-3p8n</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/snowflake-cost-optimization-starts-with-workload-design-not-auto-suspend-3p8n</guid>
      <description>&lt;h3&gt;
  
  
  Snowflake Cost Optimization Starts With Workload Design, Not Auto-Suspend
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The biggest savings usually come from attribution, workload isolation, and choosing the right Snowflake-native optimization lever for the problem you actually have.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most Snowflake cost advice starts with the same recommendation: lower your auto-suspend timeout.&lt;/p&gt;

&lt;p&gt;That advice is not wrong. It is just too small.&lt;/p&gt;

&lt;p&gt;In smaller environments, aggressive auto-suspend can capture obvious idle time. In larger Snowflake estates, the biggest savings usually hide somewhere else: in mixed workloads sharing the same warehouse, in concurrency problems misread as sizing problems, in expensive scans treated as unavoidable, and in teams using billing data as a proxy for warehouse behavior.&lt;/p&gt;

&lt;p&gt;That is why the best Snowflake cost reviews do not start by asking whether a given warehouse should be smaller.&lt;/p&gt;

&lt;p&gt;They start with a better question: &lt;em&gt;Does this workload belong on this warehouse at all?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That framing changes everything. It pushes you to identify what is running, where it is running, and whether the warehouse design matches the workload. It also helps you choose the right Snowflake-native lever, whether that is workload isolation, a multi-cluster strategy, Snowpark Optimized Warehouses, Dynamic Tables, or simply better attribution.&lt;/p&gt;

&lt;p&gt;Here is the review sequence I trust most.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Start with attribution, not assumptions
&lt;/h4&gt;

&lt;p&gt;Before you tune a single warehouse, answer the most basic cost question in the account: &lt;em&gt;Who is consuming credits, and for what?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many Snowflake environments, one warehouse is serving everything at once: dbt model runs, BI dashboards, analyst ad hoc SQL, notebook exploration, and health checks.&lt;/p&gt;

&lt;p&gt;If those are not clearly identifiable, cost optimization becomes guesswork. The fastest improvement is usually operational: standardize your query tags, use warehouse names that encode the workload and environment, and separate service users from human users.&lt;/p&gt;

&lt;p&gt;Stop manually setting query tags for dbt. Instead of running session-level alter commands, use the native dbt-snowflake adapter configurations. Set your query tag in your project configuration file to automatically inject the model name, environment, and execution ID into the Snowflake query history.&lt;/p&gt;
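
&lt;p&gt;A minimal sketch of that configuration (the project name is a placeholder; injecting the model name and invocation ID on top of this additionally requires overriding the adapter's &lt;code&gt;set_query_tag&lt;/code&gt; macro):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# dbt_project.yml
models:
  analytics:                    # placeholder project name
    +query_tag: "dbt_analytics" # applied per-session by dbt-snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;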

&lt;p&gt;Then inspect query history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;QUERY_TAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;USER_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL_ELAPSED_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_elapsed_sec&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Stop treating billing data like utilization data
&lt;/h4&gt;

&lt;p&gt;This is one of the most common Snowflake mistakes. The metering history tells you what was billed, but it does not tell you whether a warehouse was saturated, queueing, or mostly idle between bursts. For that, you need the load history.&lt;/p&gt;

&lt;p&gt;Start with metering to see the cost, then compare it with load:&lt;br&gt;
&lt;/p&gt;
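
&lt;p&gt;One way to pull the metering side, hourly credits per warehouse:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    WAREHOUSE_NAME,
    DATE_TRUNC('hour', START_TIME) AS hour_start,
    SUM(CREDITS_USED) AS credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME &amp;gt;= DATEADD('day', -14, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And the load side:&lt;/p&gt;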

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_RUNNING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_LOAD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_PROVISIONING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_provisioning&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WAREHOUSE_LOAD_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters because the fixes are different. High credits with low queueing often indicate idle time, over-provisioning, or poor workload placement. High queueing often points to concurrency pressure. Long-running heavy queries with spill often reveal memory pressure, query shape issues, or poor pruning.&lt;/p&gt;

&lt;p&gt;If you use metering as a stand-in for everything, you will solve the wrong problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Find mixed-workload warehouses before you resize anything
&lt;/h4&gt;

&lt;p&gt;One of the most expensive Snowflake patterns is a warehouse serving workloads with completely different economics. Typical examples include heavy ELT transformations sharing compute with BI dashboards, or tiny monitoring queries running on a warehouse sized for large joins.&lt;/p&gt;

&lt;p&gt;Averages hide this, but distribution exposes it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p10_bytes_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p50_bytes_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p90_bytes_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your 10th percentile bytes scanned is tiny while your 90th percentile is massive, you have several workloads sharing one compute pool. That is your segmentation signal. Split them into a transform warehouse, an interactive warehouse, and a utility warehouse.&lt;/p&gt;
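
&lt;p&gt;The split itself is a few DDL statements. The warehouse names and sizes here are illustrative, and the multi-cluster settings require Snowflake Enterprise Edition or higher:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Heavy ELT transformations: sized for large joins, suspends quickly.
CREATE WAREHOUSE IF NOT EXISTS WH_TRANSFORM
  WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- BI dashboards and ad hoc SQL: smaller, scales out for concurrency.
CREATE WAREHOUSE IF NOT EXISTS WH_INTERACTIVE
  WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 120 AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;

-- Monitoring and health checks: tiny and cheap.
CREATE WAREHOUSE IF NOT EXISTS WH_UTILITY
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;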

&lt;h4&gt;
  
  
  4. Treat sizing and concurrency as different decisions
&lt;/h4&gt;

&lt;p&gt;Snowflake tuning is often reduced to going up one warehouse size. Sometimes that is right, but often it is incomplete.&lt;/p&gt;

&lt;p&gt;Larger warehouses generally improve single-query performance. But if the real issue is many users or dashboards hitting the warehouse at once, the better answer is a multi-cluster warehouse, not just a bigger size.&lt;/p&gt;

&lt;p&gt;That is why load history matters so much:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_RUNNING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_LOAD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_PROVISIONING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_provisioning&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WAREHOUSE_LOAD_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good rule of thumb is that if one heavy query is slow and spilling, you should think about sizing or query shape. If many small queries are waiting, you should think about your concurrency strategy. If both are happening on one warehouse, you must split the workload before deciding anything else.&lt;/p&gt;
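&lt;p&gt;Splitting can be as simple as giving each workload its own right-sized warehouse, so one noisy pipeline cannot queue everyone else. A sketch, with illustrative names and sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE WAREHOUSE IF NOT EXISTS etl_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS analyst_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;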

&lt;h4&gt;
  
  
  5. Use auto-suspend as a workload setting, not a religion
&lt;/h4&gt;

&lt;p&gt;Auto-suspend still matters, but it is not a universal answer. For intermittent interactive work, short auto-suspend settings often make sense, but assuming 60 seconds is always right is too absolute.&lt;/p&gt;

&lt;p&gt;Inspect current settings with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better default is to tailor the timeout to the use case. For interactive analyst warehouses, start around 60 to 300 seconds, then validate against actual usage. For scheduled batch warehouses, a tight 60-second auto-suspend lets the warehouse spin down shortly after the orchestration tool stops sending queries; do not rely solely on the orchestration tool issuing explicit suspend commands, because a tool crash can leave the warehouse running and burning credits until someone notices. For heavy, frequent batch pipelines, consider moving the workload to Dynamic Tables or Serverless Tasks, where Snowflake manages the compute and you pay only while work is actually running.&lt;/p&gt;
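&lt;p&gt;Adjusting the timeout is a one-line change per warehouse (the warehouse name here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- AUTO_SUSPEND is specified in seconds
ALTER WAREHOUSE analyst_wh SET AUTO_SUSPEND = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;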

&lt;h4&gt;
  
  
  6. Treat spill and scan inefficiency as diagnostic signals
&lt;/h4&gt;

&lt;p&gt;Spill is one of the most useful signals in Snowflake, especially for transformation-heavy workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;QUERY_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SPILLED_TO_LOCAL_STORAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;local_spill_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SPILLED_TO_REMOTE_STORAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;remote_spill_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL_ELAPSED_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_elapsed_sec&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;BYTES_SPILLED_TO_LOCAL_STORAGE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;BYTES_SPILLED_TO_REMOTE_STORAGE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;remote_spill_gb&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consistent remote spill is a strong signal, but it is not an automatic order to resize the standard warehouse.&lt;/p&gt;

&lt;p&gt;If a query is memory-bound and spilling heavily, sizing up a standard warehouse buys CPU cores you do not need. A more targeted Snowflake-native lever is a Snowpark-optimized warehouse of the same size, which provides roughly 16x the memory per node (at a higher credit rate) and can eliminate the spill without repeatedly doubling compute.&lt;/p&gt;
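&lt;p&gt;A sketch for a memory-bound transformation job (the warehouse name is illustrative; weigh the higher per-credit rate of Snowpark-optimized warehouses against the spill you eliminate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE WAREHOUSE transform_so_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;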

&lt;p&gt;The same principle applies to expensive scans. Large scans are not inherently bad, but if selective queries are scanning far more than they should, evaluate the right lever, which might include query rewrites, natural clustering versus explicit cluster keys, the Search Optimization Service, or Dynamic Tables for incremental processing.&lt;/p&gt;

&lt;p&gt;Do not jump straight from an expensive query to adding a cluster key.&lt;/p&gt;
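&lt;p&gt;As one example among those levers, the Search Optimization Service targets highly selective point lookups without reclustering the whole table (table and column names are illustrative, and the service carries its own serverless cost to weigh first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE claims ADD SEARCH OPTIMIZATION ON EQUALITY(member_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;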

&lt;h4&gt;
  
  
  The Takeaway
&lt;/h4&gt;

&lt;p&gt;Most Snowflake waste does not come from one catastrophic mistake. It comes from reasonable decisions that were never revisited as the platform grew.&lt;/p&gt;

&lt;p&gt;One warehouse becomes the default for everything. The system still works, so nobody redesigns it. Spend drifts upward, nothing looks broken, but plenty is inefficient.&lt;/p&gt;

&lt;p&gt;If there is one Snowflake-specific lesson worth keeping, it is this: &lt;strong&gt;Do not start by asking whether the warehouse should be smaller. Start by asking whether the workload belongs there at all.&lt;/strong&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>Framework for AWS Aurora PostgreSQL Performance Tuning</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Fri, 09 May 2025 17:29:03 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/framework-for-aws-aurora-postgresql-performance-tuning-1iap</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/framework-for-aws-aurora-postgresql-performance-tuning-1iap</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;This document outlines a structured framework for identifying, diagnosing, and resolving performance issues in AWS Aurora PostgreSQL. It covers methods to spot performance issues, tools used for analysis, and step-by-step processes for tuning queries, indexes, and parameters.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Spotting Performance Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Key Indicators
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long query runtimes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High CPU/IO usage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deadlocks or lock waits&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frequent connection timeouts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor concurrency handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Monitoring Tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  AWS Performance Insights
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access&lt;/strong&gt; via AWS Console &amp;gt; RDS &amp;gt; [DB Instance] &amp;gt; Performance Insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look For&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Top wait events (e.g., I/O:DataFileRead, CPU, LWLock)&lt;/li&gt;
&lt;li&gt;SQL queries consuming highest database load&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  pg_stat_statements Extension
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Enables tracking execution statistics of all SQL statements&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Datadog PostgreSQL Integration
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Custom dashboards to view query latency, slow query logs, CPU usage, connections&lt;/li&gt;
&lt;li&gt;Set alerts for unusual patterns or threshold breaches&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Identifying Slow Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 From Performance Insights
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to SQL tab and select the top query by average or total load&lt;/li&gt;
&lt;li&gt;Extract the full SQL text&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 From pg_stat_statements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run a query ordered by &lt;code&gt;mean_exec_time&lt;/code&gt; or &lt;code&gt;total_exec_time&lt;/code&gt; (&lt;code&gt;mean_time&lt;/code&gt;/&lt;code&gt;total_time&lt;/code&gt; on PostgreSQL 12 and earlier)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 From PostgreSQL Logs
&lt;/h3&gt;

&lt;p&gt;Ensure logging is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- in milliseconds&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then parse logs or use log analysis tools like &lt;code&gt;pgBadger&lt;/code&gt;&lt;/p&gt;
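&lt;p&gt;Alongside slow-query logging, the &lt;code&gt;auto_explain&lt;/code&gt; module can log the execution plan of any statement exceeding a threshold, which is often what you actually need from the log (the threshold shown is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Requires auto_explain in shared_preload_libraries (a parameter-group
-- change plus reboot on Aurora), or LOAD 'auto_explain' for one session
SET auto_explain.log_min_duration = 500;  -- milliseconds
SET auto_explain.log_analyze = true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;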




&lt;h2&gt;
  
  
  4. Analyzing Execution Plans
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Running EXPLAIN
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'value'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 Key Terms to Know
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seq Scan&lt;/strong&gt;: Full table scan, expensive for large tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Scan/Index Only Scan&lt;/strong&gt;: Indicates use of indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash Join / Nested Loop / Merge Join&lt;/strong&gt;: Type of join algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows Removed by Filter&lt;/strong&gt;: Indicates excessive post-processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual vs Estimated Rows&lt;/strong&gt;: Large discrepancies may suggest stats are stale&lt;/li&gt;
&lt;/ul&gt;
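&lt;p&gt;An illustrative (made-up) plan fragment showing two of these signals together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on events  (cost=0.00..45891.00 rows=120 width=64)
                    (actual time=0.04..312.50 rows=98000 loops=1)
  Filter: (facility_id = 42)
  Rows Removed by Filter: 1902000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the estimate of 120 rows versus 98,000 actual suggests stale statistics (run &lt;code&gt;ANALYZE&lt;/code&gt;), and the large Rows Removed by Filter points at a missing index on &lt;code&gt;facility_id&lt;/code&gt;.&lt;/p&gt;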

&lt;h3&gt;
  
  
  4.3 Common Fixes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Missing index&lt;/li&gt;
&lt;li&gt;Bad join order&lt;/li&gt;
&lt;li&gt;Redundant data processing&lt;/li&gt;
&lt;/ul&gt;
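&lt;p&gt;For the missing-index case, prefer &lt;code&gt;CONCURRENTLY&lt;/code&gt; in production so the build does not block writes (table and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_events_facility_id ON events (facility_id);
ANALYZE events;  -- refresh planner statistics afterwards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;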




&lt;h2&gt;
  
  
  5. Index Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Determine Candidate Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;hypopg&lt;/code&gt; extension to test hypothetical indexes
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;hypopg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hypopg_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'CREATE INDEX ON my_table(col1, col2)'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'xyz'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid indexes on low-cardinality columns&lt;/li&gt;
&lt;li&gt;Multicolumn indexes are used most effectively when the query filters on the leading column(s) of the index&lt;/li&gt;
&lt;li&gt;Use covering indexes where possible (PostgreSQL 11+ &lt;code&gt;INCLUDE&lt;/code&gt; clause) so queries can be answered by index-only scans&lt;/li&gt;
&lt;/ul&gt;
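&lt;p&gt;A covering-index sketch (PostgreSQL 11+, names illustrative): the included column is stored in the index, so the query below can be satisfied by an index-only scan without touching the heap, provided vacuuming keeps the visibility map current:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date)
    INCLUDE (total_amount);

SELECT order_date, total_amount
FROM orders
WHERE customer_id = 123;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;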




&lt;h2&gt;
  
  
  6. Materialized Views
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 When to Use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Costly aggregation or join logic&lt;/li&gt;
&lt;li&gt;Rarely changing data&lt;/li&gt;
&lt;li&gt;Queries with multiple joins and filters over large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6.2 Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;daily_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;facility_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;facility_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- To refresh:&lt;/span&gt;
&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;daily_summary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Parameter Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Key Parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;work_mem&lt;/strong&gt;: Memory per operation for sorting, hashing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;shared_buffers&lt;/strong&gt;: RAM used by PostgreSQL to cache data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;effective_cache_size&lt;/strong&gt;: Estimate of how much memory is available for disk caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_parallel_workers_per_gather&lt;/strong&gt;: Controls degree of parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.2 Determine Current Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'work_mem'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shared_buffers'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'effective_cache_size'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.3 Adjust Parameters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'64MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Test effect using query performance and explain plans&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Plan Stabilization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 When Needed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Queries using prepared statements behave inconsistently due to changing parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.2 Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;pg_hint_plan&lt;/code&gt; extension to guide planner&lt;/li&gt;
&lt;li&gt;On PostgreSQL 12+, set &lt;code&gt;plan_cache_mode = force_custom_plan&lt;/code&gt; to stop prepared statements from falling back to a generic plan&lt;/li&gt;
&lt;li&gt;Avoid generic plans by forcing constant values (e.g., inline SQL instead of prepared statements for high-variance queries)&lt;/li&gt;
&lt;/ul&gt;
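&lt;p&gt;&lt;code&gt;pg_hint_plan&lt;/code&gt; reads hints from a leading comment. A sketch, assuming the extension is installed (Aurora PostgreSQL supports it; table and index names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_hint_plan;

/*+ IndexScan(orders idx_orders_customer_date) */
SELECT * FROM orders WHERE customer_id = 123;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;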




&lt;h2&gt;
  
  
  9. Testing and Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 Testing Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run queries with and without optimization to compare&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9.2 A/B Test Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;enable_seqscan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- Force index usage for testing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  9.3 Regression Testing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensure optimized queries return correct data&lt;/li&gt;
&lt;li&gt;Use query result diffing in CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Change Management
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Always test changes in &lt;strong&gt;Dev&lt;/strong&gt; &amp;gt; &lt;strong&gt;Stage&lt;/strong&gt; &amp;gt; &lt;strong&gt;Prod&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Keep a rollback plan&lt;/li&gt;
&lt;li&gt;Track changes using version control and deployment scripts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Automation Scripts (Examples)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11.1 Find Unused Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%pkey%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.2 Monitor Query Latency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.3 Refresh Materialized Views in Batch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;view &lt;span class="k"&gt;in &lt;/span&gt;view1 view2 view3&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"REFRESH MATERIALIZED VIEW CONCURRENTLY &lt;/span&gt;&lt;span class="nv"&gt;$view&lt;/span&gt;&lt;span class="s2"&gt;;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;Tuning AWS Aurora PostgreSQL is a structured, ongoing process involving monitoring, diagnosis, experimentation, and validation. A combination of query optimizations, indexing, view management, parameter tuning, and careful change control ensures sustained performance improvements.&lt;/p&gt;




</description>
    </item>
  </channel>
</rss>
