<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sandeep Bhatia</title>
    <description>The latest articles on Forem by Sandeep Bhatia (@sandeep_bhatia_e5942a9ad6).</description>
    <link>https://forem.com/sandeep_bhatia_e5942a9ad6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2941240%2Fff2b44e8-babc-4ee5-8a2c-a194d041acb1.png</url>
      <title>Forem: Sandeep Bhatia</title>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sandeep_bhatia_e5942a9ad6"/>
    <language>en</language>
    <item>
      <title>Healthcare Analytics Has a Cloud Cost Problem. And It's Not the One You Think.</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Wed, 18 Mar 2026 06:21:31 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/healthcare-analytics-has-a-cloud-cost-problem-and-its-not-the-one-you-think-1l4a</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/healthcare-analytics-has-a-cloud-cost-problem-and-its-not-the-one-you-think-1l4a</guid>
      <description>&lt;p&gt;Healthcare analytics organizations are quietly building some of the most complex cloud data architectures in existence. Petabytes of EHR data. Claims streams arriving in near-real time. Clinical trial datasets that span multiple sites, multiple countries, and multiple consent regimes. Predictive models for patient readmission risk running on top of BigQuery ML. Databricks clusters processing Epic and Cerner extracts into curated Delta tables every four hours.&lt;/p&gt;

&lt;p&gt;And almost all of them are significantly overspending.&lt;/p&gt;

&lt;p&gt;Not because they chose the wrong vendors. Not because they over-built their infrastructure. But because the skills required to manage cost at this level of complexity demand something the industry hasn't fully articulated yet: you need to be both a FinOps engineer and a deep query performance specialist, simultaneously, across multiple platforms that each have completely different cost models.&lt;/p&gt;

&lt;p&gt;I want to walk through exactly what that looks like in practice — where the money goes, why it's so hard to catch, and what the engineers who are successfully tackling it are doing differently.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Stack That Healthcare Analytics Actually Runs On
&lt;/h2&gt;

&lt;p&gt;Before we get into cost traps, it's worth describing the architecture I see most often.&lt;/p&gt;

&lt;p&gt;On the Azure side: Databricks for data engineering and ML pipeline work, Azure Synapse Analytics SQL Pools for the analytical warehouse that clinical and operations teams query directly. Jupyter notebooks scattered across the Databricks workspace — some attached to long-running clusters, some firing up their own clusters, some nobody has opened in six months but the cluster is still warm. The orchestration is usually Azure Data Factory or Databricks Workflows, and the governance layer is Unity Catalog if the team is reasonably mature.&lt;/p&gt;

&lt;p&gt;On the GCP side — often run in parallel by the same organization — BigQuery for analytics, BigQuery ML for the predictive modeling work, and Cloud Composer (managed Airflow) to schedule everything. The BigQuery workload tends to be heavier on ad-hoc query traffic because the SQL interface is familiar to clinical analysts who don't want to touch PySpark.&lt;/p&gt;

&lt;p&gt;What drives this dual-cloud topology? Often it's acquisition history. A health system acquires a specialty pharmacy data company that runs on Azure. The core analytics platform was built on GCP three years ago. Nobody wants to migrate anything, and honestly they shouldn't — migration risk in healthcare data environments is real. So you end up with both.&lt;/p&gt;

&lt;p&gt;The result is a cost management problem that spans two major clouds, three or four distinct compute billing models, and a dozen different ways for a single query to silently consume thousands of dollars.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Healthcare Companies Are Now Requiring of Their Data Platform Engineers
&lt;/h2&gt;

&lt;p&gt;Something has shifted in the last eighteen months in the healthcare analytics job market. The job descriptions that used to read "strong SQL skills, experience with ETL pipelines" now require a genuinely different profile.&lt;/p&gt;

&lt;p&gt;Here is what leading healthcare analytics companies are now asking for in their job descriptions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6+ years of hands-on data engineering experience, with significant exposure to &lt;strong&gt;healthcare clinical data&lt;/strong&gt; — EHR systems, claims, clinical trials, population health datasets&lt;/li&gt;
&lt;li&gt;Proficiency with &lt;strong&gt;Databricks&lt;/strong&gt; for building and optimizing data pipelines, including Unity Catalog, Medallion architecture, and Delta Lake optimization&lt;/li&gt;
&lt;li&gt;Strong skills in &lt;strong&gt;Azure Synapse&lt;/strong&gt; and/or &lt;strong&gt;BigQuery&lt;/strong&gt; with an understanding of their respective cost models, not just their query interfaces&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost awareness as a first-class skill&lt;/strong&gt; — the ability to estimate, attribute, and reduce compute spend across platforms, not just as an afterthought but as part of pipeline design&lt;/li&gt;
&lt;li&gt;Experience with &lt;strong&gt;Jupyter Notebooks governance&lt;/strong&gt; — understanding which notebooks are driving cluster costs, which are idle, and how to build guardrails around ad-hoc compute consumption&lt;/li&gt;
&lt;li&gt;Familiarity with &lt;strong&gt;FinOps practices&lt;/strong&gt;: tag-based cost allocation, chargeback to clinical vs. research vs. operations teams, and the tooling required to make that visible&lt;/li&gt;
&lt;li&gt;Understanding of &lt;strong&gt;HIPAA-compliant approaches to cost monitoring&lt;/strong&gt;, which means you can't just pipe query logs to an external SaaS without thinking about what's in the metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point deserves its own paragraph. Cost monitoring tools that work cleanly for a retail analytics team create compliance headaches in healthcare environments. Query text can contain patient identifiers. Execution logs can reveal which clinical cohorts are being analyzed. Any tooling that touches query metadata needs to be evaluated not just for technical capability but for data governance suitability.&lt;/p&gt;

&lt;p&gt;The companies asking for all of this in a single engineer are essentially describing a new subspecialty that sits at the intersection of database performance engineering and FinOps. Let me show you what that looks like across the two most common platform combinations.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 1: Databricks on Azure + Synapse SQL Pools
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;A mid-sized health system I worked with had this setup: Databricks handling the heavy transformation work — ingesting Epic ADT feeds, normalizing Cerner lab result streams, building the curated patient encounter tables that downstream systems consumed. Synapse SQL Pools sat downstream, serving as the analytical warehouse for clinical operations and population health queries. Finance was getting charged a blended rate for "data platform." Nobody knew how much was Databricks versus Synapse, and nobody knew how much of either was actually necessary.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Trap 1: Databricks Cluster Scaling Gone Wrong
&lt;/h3&gt;

&lt;p&gt;Databricks auto-scaling is genuinely useful. It's also one of the most common sources of uncontrolled spend I've seen.&lt;/p&gt;

&lt;p&gt;The problem is that autoscaling responds to pending tasks, not to business priority. An EHR ingestion job that runs at 3 AM and a data scientist running an exploratory cohort analysis at 2 PM can both trigger a cluster scale-out event. But they have completely different cost profiles and completely different latency requirements. The ingestion job can run on a smaller cluster over a longer window. The analyst query needs to complete before the afternoon clinical review meeting.&lt;/p&gt;

&lt;p&gt;What I typically find when I dig into &lt;code&gt;system.billing.usage&lt;/code&gt; on these clusters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;sku_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage_quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dbus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;usage_quantity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;list_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;est_cost&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;billing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;usage&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;usage_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_DATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Almost universally, the top cost hours are not the scheduled pipeline windows. They are the hours immediately surrounding when data scientists opened notebooks and started exploring — and the clusters scaled out to accommodate the interactive workload, then didn't scale back in before the next scheduled job ran and kept the cluster alive.&lt;/p&gt;

&lt;p&gt;The fix is workload isolation by job type, not just by time of day. Scheduled EHR ingestion pipelines should run on dedicated job clusters with fixed sizing, sized for the actual task. Interactive exploration should happen on separate clusters with aggressive auto-terminate configured (ten minutes of inactivity is a reasonable starting point for clinical analytics environments where the analyst might step away for a call).&lt;/p&gt;
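&lt;p&gt;A quick way to see how spend splits between those two workload types is the SKU name in the billing table — all-purpose SKUs are the interactive clusters. A minimal sketch (SKU naming varies by tier and cloud, so treat the &lt;code&gt;LIKE&lt;/code&gt; patterns as a starting point, not a taxonomy):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rough split of DBU consumption: scheduled job compute vs. interactive compute
SELECT
    CASE
        WHEN sku_name LIKE '%ALL_PURPOSE%' THEN 'interactive'
        WHEN sku_name LIKE '%JOBS%' THEN 'scheduled'
        ELSE 'other'
    END AS workload_class,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date &amp;gt;= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1
ORDER BY 2 DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the interactive share dominates, the auto-terminate and isolation work above will pay for itself quickly.&lt;/p&gt;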

&lt;h3&gt;
  
  
  Cost Trap 2: Shuffle Spill in Clinical Data Joins
&lt;/h3&gt;

&lt;p&gt;The most expensive Spark anti-pattern I see in healthcare pipelines is the shuffle spill on patient-encounter joins. The reason it's endemic to healthcare specifically: encounter tables are enormous (millions of rows per patient-year for high-utilization populations), and the analytical question almost always requires joining them back to patient demographic tables, claims tables, and diagnosis code tables simultaneously.&lt;/p&gt;

&lt;p&gt;When Spark can't fit the shuffle exchange in memory, it spills to local disk and potentially to remote storage. By the time a single claims adjudication query has spilled 15 GB to remote storage, you're looking at significant I/O cost on top of the DBU cost — and the query is also dramatically slower than it needs to be.&lt;/p&gt;

&lt;p&gt;The signature I look for in &lt;code&gt;system.query.history&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;statement_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;statement_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;total_duration_ms&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spilled_local_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;read_bytes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1073741824&lt;/span&gt;  &lt;span class="c1"&gt;-- 1 GB&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;execution_status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'FINISHED'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;spilled_remote_bytes&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The remediation depends on the specific join. For joins where one side is a small reference table (ICD-10 code descriptions, formulary data, provider rosters), adding a &lt;code&gt;/*+ BROADCAST(small_table) */&lt;/code&gt; hint eliminates the shuffle entirely. For genuine large-to-large joins (patient demographics to encounter history), the answer is usually Z-ORDER on the join columns and ensuring partition pruning is firing. If the claims table is partitioned by &lt;code&gt;service_date&lt;/code&gt; and your join query doesn't include a date filter, you're scanning the entire table every time.&lt;/p&gt;
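&lt;p&gt;Concretely, the two remediations look something like this — the table and column names are illustrative, following the clinical examples above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Small reference table: broadcast it and skip the shuffle entirely
SELECT /*+ BROADCAST(d) */
    e.patient_id,
    d.code_description
FROM clinical.patient_encounters e
JOIN ref.icd10_codes d
  ON e.primary_diagnosis_code = d.icd10_code
WHERE e.service_date &amp;gt;= '2024-01-01';

-- Large-to-large join: co-locate the Delta files on the join key
OPTIMIZE clinical.patient_encounters
ZORDER BY (patient_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;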

&lt;h3&gt;
  
  
  Cost Trap 3: Synapse DWU Over-Provisioning
&lt;/h3&gt;

&lt;p&gt;Synapse dedicated SQL Pools bill by DWU-hour whether queries are running or not; the meter stops only when the pool is paused. DW300c, DW500c, DW1000c — each step up is a significant price increase, and performance does not scale linearly with price for every workload type.&lt;/p&gt;

&lt;p&gt;The pattern I see most often: the Synapse pool was provisioned at DW1000c for a large historical load that ran during initial deployment, and nobody ever downsized it. The ongoing operational workload — population health dashboard queries, clinical operations reports, care management cohort queries — is actually quite manageable at DW300c or DW500c. The expensive compute is idle most of the time.&lt;/p&gt;

&lt;p&gt;Query the DMVs to see what you're actually running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;command&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_class&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_elapsed_time&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;elapsed_sec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rows_returned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dm_pdw_exec_requests&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Completed'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GETDATE&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;submit_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the vast majority of completed requests show elapsed times under 30 seconds and row counts in the thousands rather than millions, you're running an enterprise warehouse at clinical-trial-data-load scale. That's an expensive mismatch.&lt;/p&gt;
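&lt;p&gt;When that's the case, resizing is a one-line operation run from the &lt;code&gt;master&lt;/code&gt; database (the pool name here is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Scale the dedicated SQL pool down to DW300c
ALTER DATABASE healthdw MODIFY (SERVICE_OBJECTIVE = 'DW300c');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Measure the operational workload at the lower tier for a week before committing; scaling back up is the same one-liner.&lt;/p&gt;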

&lt;p&gt;The more subtle version of this problem is the &lt;strong&gt;distribution key mismatch&lt;/strong&gt;. Synapse distributes data across 60 distributions. If your &lt;code&gt;patient_encounters&lt;/code&gt; table is hash-distributed on &lt;code&gt;encounter_id&lt;/code&gt; but your most common analytical query joins on &lt;code&gt;patient_id&lt;/code&gt;, every join requires a data movement operation — shuffling rows between distributions before the join can execute. When I pull &lt;code&gt;sys.dm_pdw_request_steps&lt;/code&gt; and see routine population health queries dominated by &lt;code&gt;ShuffleMoveOperation&lt;/code&gt; steps moving millions of rows, this is almost always the cause.&lt;/p&gt;
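&lt;p&gt;One way to surface those data movement steps (a sketch; remember these DMVs retain only the most recent 10,000 requests):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find the queries paying the data movement tax
SELECT
    s.request_id,
    s.operation_type,
    s.row_count,
    s.total_elapsed_time / 1000.0 AS elapsed_sec
FROM sys.dm_pdw_request_steps s
WHERE s.operation_type IN ('ShuffleMoveOperation', 'BroadcastMoveOperation')
ORDER BY s.row_count DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;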

&lt;p&gt;The fix requires a table rebuild with the correct distribution key. It's disruptive, but it's often the single highest-leverage Synapse optimization available.&lt;/p&gt;
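&lt;p&gt;The rebuild itself is a CTAS followed by a rename swap, sketched here with the table names from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Re-create the table distributed on the actual join key
CREATE TABLE dbo.patient_encounters_new
WITH (
    DISTRIBUTION = HASH(patient_id),
    CLUSTERED COLUMNSTORE INDEX
)
AS SELECT * FROM dbo.patient_encounters;

-- Swap names once row counts are verified
RENAME OBJECT dbo.patient_encounters TO patient_encounters_old;
RENAME OBJECT dbo.patient_encounters_new TO patient_encounters;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;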

&lt;h3&gt;
  
  
  The Jupyter Notebook Problem
&lt;/h3&gt;

&lt;p&gt;Data scientists in healthcare analytics environments are frequently the largest uncontrolled cost source on a Databricks platform — not because they're wasteful, but because the notebook interface abstracts away cost visibility entirely.&lt;/p&gt;

&lt;p&gt;A clinical data scientist analyzing a patient cohort for a care management program opens a notebook, attaches it to the shared analytics cluster, runs several exploratory queries against a 500 GB Delta table, gets pulled into a meeting, and leaves the notebook session open. The cluster stays alive. The interactive SQL query that scanned 180 GB of patient encounter data without partition filters runs twelve times as they refine their cohort definition. There's no indication anywhere in the notebook interface that this sequence of operations cost $340 in compute.&lt;/p&gt;

&lt;p&gt;This is a governance problem as much as a technical one. The teams handling it well are doing three things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cluster policies&lt;/strong&gt; that enforce auto-terminate for all interactive clusters, no exceptions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query tagging&lt;/strong&gt; that attributes Databricks spend to cost centers by notebook name and username, using the &lt;code&gt;custom_tags&lt;/code&gt; field in &lt;code&gt;system.billing.usage&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost visibility at the notebook level&lt;/strong&gt; — surfacing DBU consumption per notebook session to the data scientist in near-real time, so the feedback loop is immediate rather than appearing on a monthly cloud invoice&lt;/li&gt;
&lt;/ol&gt;
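
&lt;p&gt;The tag-based attribution in point 2 can be sketched directly against the billing table. The &lt;code&gt;cost_center&lt;/code&gt; tag key is an assumption here; use whatever key your cluster policies actually enforce:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Monthly DBU consumption by cost-center tag (untagged usage shows up as NULL)
SELECT
    custom_tags['cost_center'] AS cost_center,
    SUM(usage_quantity) AS dbus
FROM system.billing.usage
WHERE usage_date &amp;gt;= DATEADD(DAY, -30, CURRENT_DATE())
GROUP BY 1
ORDER BY 2 DESC
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;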

&lt;p&gt;The chargeback piece is particularly important in healthcare organizations with distinct clinical, research, and operations functions. Clinical informatics and health economics teams often have separate budgets. Without tag-based cost allocation that maps Databricks workspace spending to those cost centers, the platform team ends up absorbing costs that should be attributable to specific programs — and they lose the ability to have informed conversations about compute allocation priorities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 2: BigQuery on GCP with ML Capabilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;On the GCP side, BigQuery handles the analytical layer, and an increasing number of healthcare organizations are moving into BigQuery ML for the predictive modeling work that used to require standing up separate ML infrastructure. Readmission risk models. Drug interaction flags. Patient population stratification for value-based care contracts. Real-world evidence analytics for specialty pharmacy programs.&lt;/p&gt;

&lt;p&gt;BigQuery's billing model is deceptively simple on the surface — on-demand pricing at $6.25 per TB scanned — and deceptively expensive in practice. The reason: on-demand queries are billed for every column they reference, so a SELECT * on a 20-column table bills you for all 20 columns of scanned storage even when the analysis needs only three of them. And the WHERE clause does not reduce bytes billed unless it prunes partitions or clusters.&lt;/p&gt;
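&lt;p&gt;Before optimizing anything, it's worth knowing who is scanning what. &lt;code&gt;INFORMATION_SCHEMA.JOBS_BY_PROJECT&lt;/code&gt; makes the on-demand bill queryable — adjust the region qualifier to yours; the $6.25/TB multiplier assumes on-demand pricing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Bytes billed per user over the last 30 days, with an estimated cost
SELECT
    user_email,
    SUM(total_bytes_billed) / POW(1024, 4) AS tib_billed,
    ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS est_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY user_email
ORDER BY tib_billed DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;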

&lt;h3&gt;
  
  
  Cost Trap 1: BQML Training Jobs Without Partition Pruning
&lt;/h3&gt;

&lt;p&gt;BigQuery ML training jobs are some of the most expensive queries I've ever seen in a production environment. The pattern in healthcare: a readmission risk model trains on three years of encounter history, and the training SQL looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What this query actually does to your BigQuery bill:&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_risk_v3`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'logistic_reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_label_cols&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'readmitted_30d'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_demographics`&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looks reasonable. The problem is that &lt;code&gt;patient_encounters&lt;/code&gt; is partitioned by &lt;code&gt;admission_date&lt;/code&gt; and clustered on &lt;code&gt;patient_id&lt;/code&gt;, but the BQML job runs as a full table scan because the query optimizer doesn't always push partition filters through the JOIN path in training queries. On a 2 TB encounters table, that's a $12.50 charge per training run. Run it weekly as the model retrains on new data, and you're at $650/year for a single model — and this organization has fourteen predictive models.&lt;/p&gt;

&lt;p&gt;The fix: build a partitioned, incrementally maintained feature table first, then train against that bounded dataset. This separates the scan cost from the training cost, makes the scan explicit and optimizable, and avoids the strict limitations and maintenance overhead of BigQuery materialized views with joins.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Build a curated, partitioned feature table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_features`&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt;
&lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;patient_id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_demographics`&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;
  &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Training now scans the curated feature table, not the raw history&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_risk_v3`&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'logistic_reg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_label_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'readmitted_30d'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;age_at_admission&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;discharge_disposition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;prior_admissions_12mo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;readmitted_30d&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.readmission_features`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single change reduced the weekly training cost for this particular model from $12.50 to under $2.00 — because the feature table pipeline only processes the partitions that have new data since the last refresh, and the model training only scans the curated features.&lt;/p&gt;
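&lt;p&gt;The incremental refresh can be as simple as replacing a trailing window of partitions on each run. The seven-day window below is an assumption about how late encounter data arrives; widen it to match your source systems:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Drop and reload only the partitions that may have changed
DELETE FROM `clinical.readmission_features`
WHERE admission_date &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);

INSERT INTO `clinical.readmission_features`
SELECT
    e.admission_date,
    e.patient_id,
    p.age_at_admission,
    p.primary_diagnosis_code,
    p.discharge_disposition,
    e.prior_admissions_12mo,
    e.length_of_stay_days,
    e.readmitted_30d
FROM `clinical.patient_encounters` e
JOIN `clinical.patient_demographics` p
  USING (patient_id)
WHERE e.admission_date &amp;gt;= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;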

&lt;h3&gt;
  
  
  Cost Trap 2: Partition Pruning Failures on Clinical Data
&lt;/h3&gt;

&lt;p&gt;The most common BigQuery anti-pattern I see in healthcare environments is partition pruning failure on date-partitioned clinical tables. It's particularly insidious because the query &lt;em&gt;looks&lt;/em&gt; correct.&lt;/p&gt;

&lt;p&gt;The encounter table is partitioned by &lt;code&gt;admission_date&lt;/code&gt;. The analyst writes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;patient_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;primary_diagnosis_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;length_of_stay_days&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`clinical.patient_encounters`&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;CAST()&lt;/code&gt; wrapper defeats partition pruning. BigQuery cannot evaluate the partition filter before scanning because the cast is applied per-row, not per-partition. The query scans the entire table. A small change — removing the redundant cast on a column that is already of type DATE — immediately activates partition elimination.&lt;/p&gt;
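
&lt;p&gt;For reference, the pruning-friendly version of the same query simply filters the DATE partition column directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    patient_id,
    primary_diagnosis_code,
    length_of_stay_days
FROM `clinical.patient_encounters`
WHERE admission_date &amp;gt;= '2024-01-01'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;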

&lt;p&gt;The same failure mode appears with date functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Defeats pruning:&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;YEAR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2024&lt;/span&gt;

&lt;span class="c1"&gt;-- Activates pruning:&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;admission_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-12-31'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 2 TB encounters table queried twenty times a day by clinical analysts, the difference between pruned and unpruned queries is approximately $225/day in scan costs. That's $82,000/year from a single table, and I've seen organizations with dozens of tables in this state.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Trap 3: On-Demand vs. Slot Reservations
&lt;/h3&gt;

&lt;p&gt;Most healthcare analytics teams start on BigQuery's on-demand pricing because it requires no commitment. The total spend is low, the per-query cost is invisible, and the billing model seems straightforward.&lt;/p&gt;

&lt;p&gt;As the analytics workload grows — more users, more dashboards, more BQML training jobs — on-demand pricing starts to become expensive and unpredictable. The shift from on-demand to slot reservations (BigQuery Editions) is one of the highest-leverage cost decisions a healthcare data platform team can make, but it requires knowing your actual workload pattern well enough to size the commitment correctly.&lt;/p&gt;

&lt;p&gt;The analysis I run before making this recommendation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HOUR&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;creation_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_of_day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_slot_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;slot_hours&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;total_bytes_billed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POW&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;on_demand_cost_usd&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`region-us`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFORMATION_SCHEMA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JOBS_BY_PROJECT&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;creation_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMP_SUB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="k"&gt;DAY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;statement_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'SCRIPT'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you a per-hour view of slot consumption and equivalent on-demand cost. For most healthcare analytics organizations with 24-hour query activity driven by clinical operations in the US and research teams in other time zones, the breakeven point for a 100-slot Enterprise Edition reservation typically sits around $4,000–$6,000 in monthly on-demand spend. Below that threshold, stay on on-demand. Above it, model the commitment carefully.&lt;/p&gt;

&lt;p&gt;One nuance that matters in healthcare: ML training jobs consume slots in bursts that look very different from steady analytical query traffic. A slot reservation sized for the analytical baseline will run fine most of the time but queue aggressively when training jobs fire. The right answer is usually a baseline reservation for predictable analytical workloads plus on-demand overflow for training bursts — not a single reservation attempting to cover both.&lt;/p&gt;
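
&lt;p&gt;That baseline-plus-overflow split can be expressed in BigQuery's reservation DDL. This is a sketch only: the project and reservation names are placeholders, and the statements run in the reservation admin project:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Baseline reservation for predictable analytical workloads
-- (all names here are placeholders).
CREATE RESERVATION `admin-proj.region-us.analytics-baseline`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 100);

-- Pin the analytical project to the baseline reservation.
CREATE ASSIGNMENT `admin-proj.region-us.analytics-baseline.analytics-assign`
OPTIONS (
  assignee = 'projects/analytics-prod',
  job_type = 'QUERY');

-- Leave the ML training project unassigned so its bursts bill on-demand.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;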

&lt;h3&gt;
  
  
  Cost Attribution in Healthcare Environments
&lt;/h3&gt;

&lt;p&gt;The chargeback conversation in healthcare organizations is more complex than in most industries because the cost centers are more numerous and the political stakes of attribution are higher. Clinical informatics, research computing, population health management, specialty pharmacy analytics, and corporate finance analytics all have legitimate claims on the data platform, and they all have separate budget owners.&lt;/p&gt;

&lt;p&gt;The tooling to handle this correctly requires:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Label-based attribution&lt;/strong&gt; on every BigQuery job — &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;cost_center&lt;/code&gt;, &lt;code&gt;project_code&lt;/code&gt;, and &lt;code&gt;data_classification&lt;/code&gt; (the last one matters for HIPAA considerations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated label enforcement&lt;/strong&gt; — labels that have to be set manually will be forgotten; they need to be injected at the pipeline level or caught by a policy that fails jobs without required labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A reporting layer&lt;/strong&gt; that translates BigQuery billing export data into per-department spend, with drill-down to the job level for the month-end chargeback conversation&lt;/li&gt;
&lt;/ol&gt;
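
&lt;p&gt;A minimal sketch of the core query behind that reporting layer, assuming a standard detailed billing export (the table name is a placeholder for your own export dataset):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    (SELECT l.value FROM UNNEST(labels) AS l WHERE l.key = 'cost_center') AS cost_center,
    (SELECT l.value FROM UNNEST(labels) AS l WHERE l.key = 'team') AS team,
    SUM(cost) AS total_cost_usd
FROM `finops.billing.gcp_billing_export_v1`  -- placeholder table name
WHERE service.description = 'BigQuery'
  AND usage_start_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY total_cost_usd DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;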

&lt;p&gt;The &lt;code&gt;data_classification&lt;/code&gt; label deserves specific attention. Before you pipe BigQuery job metadata to any external cost reporting tool, you need to understand what's in the query text and execution metadata. For most analytical queries this is innocuous — aggregate counts, summary statistics. But for ad-hoc queries run by clinical analysts, query text can contain patient identifiers, cohort definitions referencing diagnosis codes, or other protected health information. Your cost attribution infrastructure needs to be evaluated under the same data governance framework as your clinical data infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dual Skillset Problem
&lt;/h2&gt;

&lt;p&gt;What makes all of this hard is that it genuinely requires two different types of expertise to do well, and they don't naturally coexist in the same person or the same team.&lt;/p&gt;

&lt;p&gt;Query performance tuning is a deeply technical skill. Understanding why a BigQuery partition filter is failing requires knowing how the query optimizer evaluates predicates against partition metadata. Understanding why a Databricks Spark job is spilling to remote storage requires reading the physical query plan and understanding how broadcast joins compare to sort-merge joins in terms of memory pressure. These are skills that develop over years of reading execution plans and tuning real production workloads.&lt;/p&gt;

&lt;p&gt;FinOps engineering is a different skill. It requires understanding billing models across multiple vendors, building cost attribution infrastructure, having difficult conversations with budget owners about chargeback policies, and designing governance frameworks that work in regulated industries without becoming so restrictive that they impede legitimate analytical work.&lt;/p&gt;

&lt;p&gt;The healthcare analytics market is increasingly recognizing that data platform teams need both. The roles being posted now aren't just "senior data engineer" in the traditional sense — they're asking for engineers who can optimize a Spark shuffle operation in the morning and design a tag-based chargeback policy in the afternoon.&lt;/p&gt;

&lt;p&gt;That combination is rare. And organizations that have it, or are building toward it, have a significant structural advantage: they fix the right thing first. They tune the queries, let utilization settle into its real baseline, and then right-size the infrastructure against that accurate signal. The teams that don't have this expertise tend to do it backwards — they see high infrastructure costs, provision more compute to alleviate the slowness, and end up with an even larger bill for an even more complicated system.&lt;/p&gt;

&lt;p&gt;Healthcare data is too important, and the compute costs are too large, to get that order wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;If your team is managing a multi-platform healthcare analytics environment and you're not sure whether your current spend reflects your actual workload requirements, here is where I'd begin:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Databricks on Azure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query &lt;code&gt;system.query.history&lt;/code&gt; for &lt;code&gt;spilled_remote_bytes &amp;gt; 1073741824&lt;/code&gt; — any query spilling more than 1 GB to remote storage is a high-priority tuning target&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;system.billing.usage&lt;/code&gt; for DBU consumption by &lt;code&gt;sku_name&lt;/code&gt; broken down by hour — are your peaks during scheduled pipeline windows or during interactive hours?&lt;/li&gt;
&lt;li&gt;Audit notebook cluster attachment — how many notebooks are attached to clusters that have been idle for more than 30 minutes?&lt;/li&gt;
&lt;/ul&gt;
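
&lt;p&gt;The first check, expressed as a query against the system table referenced above (treat the exact column names as a sketch and verify them against your workspace's schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    statement_id,
    executed_by,
    total_duration_ms,
    spilled_remote_bytes
FROM system.query.history
WHERE spilled_remote_bytes &amp;gt; 1073741824  -- queries spilling more than 1 GB to remote storage
ORDER BY spilled_remote_bytes DESC
LIMIT 50;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;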

&lt;p&gt;&lt;strong&gt;For Azure Synapse SQL Pools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check &lt;code&gt;sys.dm_pdw_request_steps&lt;/code&gt; for &lt;code&gt;ShuffleMoveOperation&lt;/code&gt; and &lt;code&gt;BroadcastMoveOperation&lt;/code&gt; steps — if routine queries are moving more than 1 GB between distributions, your distribution key strategy needs revisiting&lt;/li&gt;
&lt;li&gt;Look at &lt;code&gt;columnstore_segments_scanned&lt;/code&gt; vs. &lt;code&gt;columnstore_segments_skipped&lt;/code&gt; — if segment elimination is below 50%, your ordered CCI configuration needs work&lt;/li&gt;
&lt;li&gt;Is your DWU tier aligned to your actual peak workload, or to the heaviest query you ever ran during initial setup?&lt;/li&gt;
&lt;/ul&gt;
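
&lt;p&gt;The data movement check can be sketched against the DMVs like this (the operation names are the standard dedicated SQL pool step types; adjust for your environment):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Find recent requests dominated by broadcast or shuffle data movement.
SELECT
    r.request_id,
    r.command,
    s.operation_type,
    s.total_elapsed_time,
    s.row_count
FROM sys.dm_pdw_exec_requests AS r
JOIN sys.dm_pdw_request_steps AS s
  ON r.request_id = s.request_id
WHERE s.operation_type IN ('BroadcastMoveOperation', 'ShuffleMoveOperation')
ORDER BY s.total_elapsed_time DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;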

&lt;p&gt;&lt;strong&gt;For BigQuery on GCP:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pull &lt;code&gt;INFORMATION_SCHEMA.JOBS_BY_PROJECT&lt;/code&gt; and look for queries against partitioned tables whose &lt;code&gt;total_bytes_processed&lt;/code&gt; is far larger than their date filters should require — a full-table scan despite a date predicate is the signature of partition pruning failure&lt;/li&gt;
&lt;li&gt;Check your BQML training jobs for full table scans that could be curated feature table references instead&lt;/li&gt;
&lt;li&gt;Run the hourly slot consumption analysis above and compare against your monthly on-demand bill — the reservation breakeven may be closer than you think&lt;/li&gt;
&lt;/ul&gt;
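
&lt;p&gt;A quick way to surface candidates for the first check is to rank the heaviest scans over the last week:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    job_id,
    user_email,
    total_bytes_processed / POW(1024, 4) AS tib_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time &amp;gt;= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND statement_type != 'SCRIPT'
ORDER BY total_bytes_processed DESC
LIMIT 25;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;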

&lt;p&gt;The underlying question in all of these cases is the same: are you paying for what you actually need, or are you paying for what a poorly optimized query decided to consume?&lt;/p&gt;

&lt;p&gt;In healthcare analytics, getting that answer right matters more than in most industries. The compute budget that's going to unnecessary BigQuery scans is compute budget that isn't going to expanding your real-world evidence analytics program, or improving your readmission prediction models, or accelerating the clinical trial data aggregation pipelines that your research teams are waiting on.&lt;/p&gt;

&lt;p&gt;The data you're working with is too important to leave money on the table like this. And the tools to find it are, for the most part, already built into the platforms you're running.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you navigated multi-platform cost optimization in a healthcare analytics environment? I'd be particularly interested in hearing how your teams are handling the HIPAA considerations around query metadata in cost monitoring tools — it's one of the most underappreciated constraints in this space.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: FinOps, DataEngineering, HealthcareAnalytics, Databricks, BigQuery, AzureSynapse, CloudCostOptimization, SQL, QueryTuning, HIPAA&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Snowflake Cost Optimization Starts With Workload Design, Not Auto-Suspend</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Mon, 16 Mar 2026 02:56:31 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/snowflake-cost-optimization-starts-with-workload-design-not-auto-suspend-3p8n</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/snowflake-cost-optimization-starts-with-workload-design-not-auto-suspend-3p8n</guid>
      <description>&lt;h3&gt;
  
  
  Snowflake Cost Optimization Starts With Workload Design, Not Auto-Suspend
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The biggest savings usually come from attribution, workload isolation, and choosing the right Snowflake-native optimization lever for the problem you actually have.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most Snowflake cost advice starts with the same recommendation: lower your auto-suspend timeout.&lt;/p&gt;

&lt;p&gt;That advice is not wrong. It is just too small.&lt;/p&gt;

&lt;p&gt;In smaller environments, aggressive auto-suspend can capture obvious idle time. In larger Snowflake estates, the biggest savings usually hide somewhere else: in mixed workloads sharing the same warehouse, in concurrency problems misread as sizing problems, in expensive scans treated as unavoidable, and in teams using billing data as a proxy for warehouse behavior.&lt;/p&gt;

&lt;p&gt;That is why the best Snowflake cost reviews do not start by asking whether a given warehouse should be smaller.&lt;/p&gt;

&lt;p&gt;They start with a better question: &lt;em&gt;Does this workload belong on this warehouse at all?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That framing changes everything. It pushes you to identify what is running, where it is running, and whether the warehouse design matches the workload. It also helps you choose the right Snowflake-native lever, whether that is workload isolation, a multi-cluster strategy, Snowpark Optimized Warehouses, Dynamic Tables, or simply better attribution.&lt;/p&gt;

&lt;p&gt;Here is the review sequence I trust most.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Start with attribution, not assumptions
&lt;/h4&gt;

&lt;p&gt;Before you tune a single warehouse, answer the most basic cost question in the account: &lt;em&gt;Who is consuming credits, and for what?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In many Snowflake environments, one warehouse is serving everything at once: dbt model runs, BI dashboards, analyst ad hoc SQL, notebook exploration, and health checks.&lt;/p&gt;

&lt;p&gt;If those are not clearly identifiable, cost optimization becomes guesswork. The fastest improvement is usually operational: standardize your query tags, use warehouse names that encode the workload and environment, and separate service users from human users.&lt;/p&gt;

&lt;p&gt;Stop manually setting query tags for dbt. Instead of running session-level alter commands, use the native dbt-snowflake adapter configurations. Set your query tag in your project configuration file to automatically inject the model name, environment, and execution ID into the Snowflake query history.&lt;/p&gt;
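
&lt;p&gt;A minimal sketch of that configuration (the project name is a placeholder; injecting the model name and invocation ID on top of this additionally requires overriding the adapter's &lt;code&gt;set_query_tag&lt;/code&gt; macro):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# dbt_project.yml
models:
  analytics:                    # placeholder project name
    +query_tag: "dbt_analytics" # applied per-session by dbt-snowflake
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;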

&lt;p&gt;Then inspect query history:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;QUERY_TAG&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;USER_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_gb_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL_ELAPSED_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_elapsed_sec&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  2. Stop treating billing data like utilization data
&lt;/h4&gt;

&lt;p&gt;This is one of the most common Snowflake mistakes. The metering history tells you what was billed, but it does not tell you whether a warehouse was saturated, queueing, or mostly idle between bursts. For that, you need the load history.&lt;/p&gt;

&lt;p&gt;Start with metering to see the cost, then compare it with load:&lt;br&gt;
&lt;/p&gt;
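
&lt;p&gt;One way to pull the metering side, hourly credits per warehouse:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    WAREHOUSE_NAME,
    DATE_TRUNC('hour', START_TIME) AS hour_start,
    SUM(CREDITS_USED) AS credits_used
FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
WHERE START_TIME &amp;gt;= DATEADD('day', -14, CURRENT_TIMESTAMP())
GROUP BY 1, 2
ORDER BY 1, 2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And the load side:&lt;/p&gt;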

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;DATE_TRUNC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'hour'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;hour_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_RUNNING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_LOAD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_PROVISIONING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_provisioning&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WAREHOUSE_LOAD_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters because the fixes are different. High credits with low queueing often indicate idle time, over-provisioning, or poor workload placement. High queueing often points to concurrency pressure. Long-running heavy queries with spill often reveal memory pressure, query shape issues, or poor pruning.&lt;/p&gt;

&lt;p&gt;If you use metering as a stand-in for everything, you will solve the wrong problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Find mixed-workload warehouses before you resize anything
&lt;/h4&gt;

&lt;p&gt;One of the most expensive Snowflake patterns is a warehouse serving workloads with completely different economics. Typical examples include heavy ELT transformations sharing compute with BI dashboards, or tiny monitoring queries running on a warehouse sized for large joins.&lt;/p&gt;

&lt;p&gt;Averages hide this, but distribution exposes it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_SIZE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p10_bytes_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p50_bytes_scanned&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;PERCENTILE_CONT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;WITHIN&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;BYTES_SCANNED&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;p90_bytes_scanned&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your 10th percentile bytes scanned is tiny while your 90th percentile is massive, you have several workloads sharing one compute pool. That is your segmentation signal. Split them into a transform warehouse, an interactive warehouse, and a utility warehouse.&lt;/p&gt;
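
&lt;p&gt;The split itself is a few DDL statements. The warehouse names and sizes here are illustrative, and the multi-cluster settings require Snowflake Enterprise Edition or higher:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Heavy ELT transformations: sized for large joins, suspends quickly.
CREATE WAREHOUSE IF NOT EXISTS WH_TRANSFORM
  WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;

-- BI dashboards and ad hoc SQL: smaller, scales out for concurrency.
CREATE WAREHOUSE IF NOT EXISTS WH_INTERACTIVE
  WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 120 AUTO_RESUME = TRUE
  MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 3;

-- Monitoring and health checks: tiny and cheap.
CREATE WAREHOUSE IF NOT EXISTS WH_UTILITY
  WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;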

&lt;h4&gt;
  
  
  4. Treat sizing and concurrency as different decisions
&lt;/h4&gt;

&lt;p&gt;Snowflake tuning is often reduced to going up one warehouse size. Sometimes that is right, but often it is incomplete.&lt;/p&gt;

&lt;p&gt;Larger warehouses generally improve single-query performance. But if the real issue is many users or dashboards hitting the warehouse at once, the better answer is a multi-cluster warehouse, not just a bigger size.&lt;/p&gt;

&lt;p&gt;That is why load history matters so much:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_RUNNING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_running&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_LOAD&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVG_QUEUED_PROVISIONING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_queued_provisioning&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WAREHOUSE_LOAD_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_queued_load&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A good rule of thumb is that if one heavy query is slow and spilling, you should think about sizing or query shape. If many small queries are waiting, you should think about your concurrency strategy. If both are happening on one warehouse, you must split the workload before deciding anything else.&lt;/p&gt;
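&lt;p&gt;Splitting can be as simple as giving each workload its own right-sized warehouse, so one noisy pipeline cannot queue everyone else. A sketch, with illustrative names and sizes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE WAREHOUSE IF NOT EXISTS etl_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;

CREATE WAREHOUSE IF NOT EXISTS analyst_wh
    WAREHOUSE_SIZE = 'SMALL'
    AUTO_SUSPEND = 300
    AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;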

&lt;h4&gt;
  
  
  5. Use auto-suspend as a workload setting, not a religion
&lt;/h4&gt;

&lt;p&gt;Auto-suspend still matters, but it is not a universal answer. For intermittent interactive work, short auto-suspend settings often make sense, but assuming 60 seconds is always right is too absolute.&lt;/p&gt;

&lt;p&gt;Inspect current settings with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;WAREHOUSES&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A better default is to tailor the timeout to the use case. For interactive analyst warehouses, start around 60 to 300 seconds, then validate against actual usage. For scheduled batch warehouses, a tight 60-second auto-suspend lets the warehouse spin down shortly after the orchestration tool stops sending queries; do not rely solely on the orchestration tool issuing explicit suspend commands, because a tool crash can leave the warehouse running and burning credits until someone notices. For heavy, frequent batch pipelines, consider moving the workload to Dynamic Tables or Serverless Tasks, where Snowflake manages the compute and you pay only while work is actually running.&lt;/p&gt;
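&lt;p&gt;Adjusting the timeout is a one-line change per warehouse (the warehouse name here is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- AUTO_SUSPEND is specified in seconds
ALTER WAREHOUSE analyst_wh SET AUTO_SUSPEND = 300;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;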

&lt;h4&gt;
  
  
  6. Treat spill and scan inefficiency as diagnostic signals
&lt;/h4&gt;

&lt;p&gt;Spill is one of the most useful signals in Snowflake, especially for transformation-heavy workloads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;WAREHOUSE_NAME&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;QUERY_TYPE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;query_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SPILLED_TO_LOCAL_STORAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;local_spill_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BYTES_SPILLED_TO_REMOTE_STORAGE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;POWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;remote_spill_gb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TOTAL_ELAPSED_TIME&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;avg_elapsed_sec&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;SNOWFLAKE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ACCOUNT_USAGE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QUERY_HISTORY&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;START_TIME&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;DATEADD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CURRENT_TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EXECUTION_STATUS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="n"&gt;BYTES_SPILLED_TO_LOCAL_STORAGE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
      &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;BYTES_SPILLED_TO_REMOTE_STORAGE&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;remote_spill_gb&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consistent remote spill is a strong signal, but it is not an automatic order to resize the standard warehouse.&lt;/p&gt;

&lt;p&gt;If a query is memory-bound and spilling heavily, sizing up a standard warehouse buys CPU cores you do not need. A more targeted Snowflake-native lever is a Snowpark-optimized warehouse of the same size, which provides roughly 16x the memory per node (at a higher credit rate) and can eliminate the spill without repeatedly doubling compute.&lt;/p&gt;
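&lt;p&gt;A sketch for a memory-bound transformation job (the warehouse name is illustrative; weigh the higher per-credit rate of Snowpark-optimized warehouses against the spill you eliminate):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE WAREHOUSE transform_so_wh
    WAREHOUSE_SIZE = 'MEDIUM'
    WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;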

&lt;p&gt;The same principle applies to expensive scans. Large scans are not inherently bad, but if selective queries are scanning far more than they should, evaluate the right lever, which might include query rewrites, natural clustering versus explicit cluster keys, the Search Optimization Service, or Dynamic Tables for incremental processing.&lt;/p&gt;

&lt;p&gt;Do not jump straight from an expensive query to adding a cluster key.&lt;/p&gt;
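&lt;p&gt;As one example among those levers, the Search Optimization Service targets highly selective point lookups without reclustering the whole table (table and column names are illustrative, and the service carries its own serverless cost to weigh first):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE claims ADD SEARCH OPTIMIZATION ON EQUALITY(member_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;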

&lt;h4&gt;
  
  
  The Takeaway
&lt;/h4&gt;

&lt;p&gt;Most Snowflake waste does not come from one catastrophic mistake. It comes from reasonable decisions that were never revisited as the platform grew.&lt;/p&gt;

&lt;p&gt;One warehouse becomes the default for everything. The system still works, so nobody redesigns it. Spend drifts upward, nothing looks broken, but plenty is inefficient.&lt;/p&gt;

&lt;p&gt;If there is one Snowflake-specific lesson worth keeping, it is this: &lt;strong&gt;Do not start by asking whether the warehouse should be smaller. Start by asking whether the workload belongs there at all.&lt;/strong&gt;&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>architecture</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>Framework for AWS Aurora PostgreSQL Performance Tuning</title>
      <dc:creator>Sandeep Bhatia</dc:creator>
      <pubDate>Fri, 09 May 2025 17:29:03 +0000</pubDate>
      <link>https://forem.com/sandeep_bhatia_e5942a9ad6/framework-for-aws-aurora-postgresql-performance-tuning-1iap</link>
      <guid>https://forem.com/sandeep_bhatia_e5942a9ad6/framework-for-aws-aurora-postgresql-performance-tuning-1iap</guid>
      <description>&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;This document outlines a structured framework for identifying, diagnosing, and resolving performance issues in AWS Aurora PostgreSQL. It covers methods to spot performance issues, tools used for analysis, and step-by-step processes for tuning queries, indexes, and parameters.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Spotting Performance Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 Key Indicators
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Long query runtimes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;High CPU/IO usage&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deadlocks or lock waits&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Frequent connection timeouts&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Poor concurrency handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2.2 Monitoring Tools
&lt;/h3&gt;

&lt;h4&gt;
  
  
  AWS Performance Insights
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access&lt;/strong&gt; via AWS Console &amp;gt; RDS &amp;gt; [DB Instance] &amp;gt; Performance Insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Look For&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Top wait events (e.g., I/O:DataFileRead, CPU, LWLock)&lt;/li&gt;
&lt;li&gt;SQL queries consuming highest database load&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h4&gt;
  
  
  pg_stat_statements Extension
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;total_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Enables tracking execution statistics of all SQL statements&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Datadog PostgreSQL Integration
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Custom dashboards to view query latency, slow query logs, CPU usage, connections&lt;/li&gt;
&lt;li&gt;Set alerts for unusual patterns or threshold breaches&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Identifying Slow Queries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 From Performance Insights
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Navigate to SQL tab and select the top query by average or total load&lt;/li&gt;
&lt;li&gt;Extract the full SQL text&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 From pg_stat_statements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Run a query ordered by &lt;code&gt;mean_exec_time&lt;/code&gt; or &lt;code&gt;total_exec_time&lt;/code&gt; (&lt;code&gt;mean_time&lt;/code&gt;/&lt;code&gt;total_time&lt;/code&gt; on PostgreSQL 12 and earlier)
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.3 From PostgreSQL Logs
&lt;/h3&gt;

&lt;p&gt;Ensure logging is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;log_min_duration_statement&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- in milliseconds&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then parse logs or use log analysis tools like &lt;code&gt;pgBadger&lt;/code&gt;&lt;/p&gt;
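&lt;p&gt;Alongside slow-query logging, the &lt;code&gt;auto_explain&lt;/code&gt; module can log the execution plan of any statement exceeding a threshold, which is often what you actually need from the log (the threshold shown is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Requires auto_explain in shared_preload_libraries (a parameter-group
-- change plus reboot on Aurora), or LOAD 'auto_explain' for one session
SET auto_explain.log_min_duration = 500;  -- milliseconds
SET auto_explain.log_analyze = true;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;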




&lt;h2&gt;
  
  
  4. Analyzing Execution Plans
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Running EXPLAIN
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;ANALYZE&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'value'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4.2 Key Terms to Know
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seq Scan&lt;/strong&gt;: Full table scan, expensive for large tables&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Scan/Index Only Scan&lt;/strong&gt;: Indicates use of indexes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash Join / Nested Loop / Merge Join&lt;/strong&gt;: Type of join algorithm&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows Removed by Filter&lt;/strong&gt;: Indicates excessive post-processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual vs Estimated Rows&lt;/strong&gt;: Large discrepancies may suggest stats are stale&lt;/li&gt;
&lt;/ul&gt;
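&lt;p&gt;An illustrative (made-up) plan fragment showing two of these signals together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Seq Scan on events  (cost=0.00..45891.00 rows=120 width=64)
                    (actual time=0.04..312.50 rows=98000 loops=1)
  Filter: (facility_id = 42)
  Rows Removed by Filter: 1902000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the estimate of 120 rows versus 98,000 actual suggests stale statistics (run &lt;code&gt;ANALYZE&lt;/code&gt;), and the large Rows Removed by Filter points at a missing index on &lt;code&gt;facility_id&lt;/code&gt;.&lt;/p&gt;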

&lt;h3&gt;
  
  
  4.3 Common Fixes
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Missing index&lt;/li&gt;
&lt;li&gt;Bad join order&lt;/li&gt;
&lt;li&gt;Redundant data processing&lt;/li&gt;
&lt;/ul&gt;
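&lt;p&gt;For the missing-index case, prefer &lt;code&gt;CONCURRENTLY&lt;/code&gt; in production so the build does not block writes (table and column names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Cannot run inside a transaction block
CREATE INDEX CONCURRENTLY idx_events_facility_id ON events (facility_id);
ANALYZE events;  -- refresh planner statistics afterwards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;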




&lt;h2&gt;
  
  
  5. Index Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Determine Candidate Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;...;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;hypopg&lt;/code&gt; extension to test hypothetical indexes
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;hypopg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;hypopg_create_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'CREATE INDEX ON my_table(col1, col2)'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;my_table&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;col1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'abc'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;col2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'xyz'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5.2 Best Practices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid indexes on low-cardinality columns&lt;/li&gt;
&lt;li&gt;Multicolumn indexes are used most effectively when the query filters on the leading column(s) of the index&lt;/li&gt;
&lt;li&gt;Use covering indexes where possible (PostgreSQL 11+ &lt;code&gt;INCLUDE&lt;/code&gt; clause) so queries can be answered by index-only scans&lt;/li&gt;
&lt;/ul&gt;
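&lt;p&gt;A covering-index sketch (PostgreSQL 11+, names illustrative): the included column is stored in the index, so the query below can be satisfied by an index-only scan without touching the heap, provided vacuuming keeps the visibility map current:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date)
    INCLUDE (total_amount);

SELECT order_date, total_amount
FROM orders
WHERE customer_id = 123;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;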




&lt;h2&gt;
  
  
  6. Materialized Views
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 When to Use
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Costly aggregation or join logic&lt;/li&gt;
&lt;li&gt;Rarely changing data&lt;/li&gt;
&lt;li&gt;Queries with multiple joins and filters over large datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6.2 Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;daily_summary&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;facility_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;facility_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;day&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- To refresh:&lt;/span&gt;
&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;daily_summary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Parameter Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Key Parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;work_mem&lt;/strong&gt;: Memory per operation for sorting, hashing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;shared_buffers&lt;/strong&gt;: RAM used by PostgreSQL to cache data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;effective_cache_size&lt;/strong&gt;: Estimate of how much memory is available for disk caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_parallel_workers_per_gather&lt;/strong&gt;: Controls degree of parallelism&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.2 Determine Current Settings
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_settings&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'work_mem'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'shared_buffers'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'effective_cache_size'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7.3 Adjust Parameters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;SYSTEM&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;work_mem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'64MB'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_reload_conf&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Test effect using query performance and explain plans&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Plan Stabilization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 When Needed
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Queries using prepared statements behave inconsistently due to changing parameters&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.2 Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;pg_hint_plan&lt;/code&gt; extension to guide planner&lt;/li&gt;
&lt;li&gt;On PostgreSQL 12+, set &lt;code&gt;plan_cache_mode = force_custom_plan&lt;/code&gt; to stop prepared statements from falling back to a generic plan&lt;/li&gt;
&lt;li&gt;Avoid generic plans by forcing constant values (e.g., inline SQL instead of prepared statements for high-variance queries)&lt;/li&gt;
&lt;/ul&gt;
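&lt;p&gt;&lt;code&gt;pg_hint_plan&lt;/code&gt; reads hints from a leading comment. A sketch, assuming the extension is installed (Aurora PostgreSQL supports it; table and index names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS pg_hint_plan;

/*+ IndexScan(orders idx_orders_customer_date) */
SELECT * FROM orders WHERE customer_id = 123;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;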




&lt;h2&gt;
  
  
  9. Testing and Validation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 Testing Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Run queries with and without optimization to compare&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9.2 A/B Test Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;enable_seqscan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;OFF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;-- Force index usage for testing&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  9.3 Regression Testing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ensure optimized queries return correct data&lt;/li&gt;
&lt;li&gt;Use query result diffing in CI/CD pipelines&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Change Management
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Always test changes in &lt;strong&gt;Dev&lt;/strong&gt; &amp;gt; &lt;strong&gt;Stage&lt;/strong&gt; &amp;gt; &lt;strong&gt;Prod&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Keep a rollback plan&lt;/li&gt;
&lt;li&gt;Track changes using version control and deployment scripts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  11. Automation Scripts (Examples)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  11.1 Find Unused Indexes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_user_indexes&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;idx_scan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;indexrelname&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'%pkey%'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.2 Monitor Query Latency
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;calls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;total_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_time&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_statements&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;mean_time&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  11.3 Refresh Materialized Views in Batch
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;view &lt;span class="k"&gt;in &lt;/span&gt;view1 view2 view3&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;psql &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"REFRESH MATERIALIZED VIEW CONCURRENTLY &lt;/span&gt;&lt;span class="nv"&gt;$view&lt;/span&gt;&lt;span class="s2"&gt;;"&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  12. Conclusion
&lt;/h2&gt;

&lt;p&gt;Tuning AWS Aurora PostgreSQL is a structured, ongoing process involving monitoring, diagnosis, experimentation, and validation. A combination of query optimizations, indexing, view management, parameter tuning, and careful change control ensures sustained performance improvements.&lt;/p&gt;




</description>
    </item>
  </channel>
</rss>
