<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aaron Wiegel</title>
    <description>The latest articles on Forem by Aaron Wiegel (@aawiegel).</description>
    <link>https://forem.com/aawiegel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727268%2Fc402d4b0-e403-4044-9236-01555bd8182f.jpeg</url>
      <title>Forem: Aaron Wiegel</title>
      <link>https://forem.com/aawiegel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aawiegel"/>
    <language>en</language>
    <item>
      <title>Synthetic Data and the Privacy Problem: Beyond Alice and Bob</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 04 Mar 2026 20:11:40 +0000</pubDate>
      <link>https://forem.com/aawiegel/synthetic-data-and-the-privacy-problem-beyond-alice-and-bob-4j3k</link>
      <guid>https://forem.com/aawiegel/synthetic-data-and-the-privacy-problem-beyond-alice-and-bob-4j3k</guid>
      <description>&lt;p&gt;The fixtures in &lt;a href="https://dev.to/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5"&gt;this series&lt;/a&gt; have always been honest about what they were optimizing for. Posts 1 through 3 generated vendor CSV files designed to capture structural chaos: typos in column headers, shifting measurement packages, metadata rows that Spark reads with misplaced confidence, using the same tools we'll develop further here. The goal was a bronze layer that could absorb whatever shape a vendor file arrived in without requiring code changes. The fixture data itself, a collection of pH readings and copper concentrations, was never the point. Nobody's privacy interests are implicated by a synthetic soil sample.&lt;/p&gt;

&lt;p&gt;Customer records are a different matter entirely.&lt;/p&gt;

&lt;p&gt;A soil lab does not only process measurements. It processes submissions from real farms, research institutions, and agricultural businesses. Names, addresses, billing contacts, field histories. The moment a pipeline needs to model that relationship, "generate some plausible-looking data" stops being a casual decision and starts being a question worth taking seriously. What does realistic mean for sensitive data? How do you test against customer records when the real ones are legally and ethically off-limits? And if you cannot use real data, how do you know your fixtures are testing the right things?&lt;/p&gt;

&lt;p&gt;This post answers those questions by building two tools that address related but distinct problems. The customer generator produces realistic profiles derived deterministically from barcodes that already exist in the pipeline. The masking library addresses a separate situation entirely: when real production data needs to enter a development environment with appropriate controls applied. Both tools are useful. They are not interchangeable.&lt;/p&gt;

&lt;p&gt;The code for this post can be found &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Feel free to follow along or dig in if you want more details.&lt;/p&gt;

&lt;h2&gt;What the Customer Generator Produces&lt;/h2&gt;

&lt;p&gt;The names, addresses, and contact details come from numpy's random number generator seeded with a SHA-256 digest of each barcode. The same barcode produces the same customer profile every time, in every environment, without storing anything. Faker would have produced more linguistically convincing output and probably should have been the tool for this job. The fixtures work fine regardless. Realism of content was never the point. Stability of attachment was.&lt;/p&gt;

&lt;p&gt;The mechanism is SHA-256 used as a pure function into numpy's seed space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;derive_seed_from_barcode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;barcode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;barcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;byteorder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;big&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given any string, this returns a deterministic non-negative integer suitable for seeding &lt;code&gt;numpy.random.default_rng&lt;/code&gt;. The same string always produces the same seed, which means the same &lt;code&gt;customer_id&lt;/code&gt; always produces the same profile, regardless of when or how many times the generator runs. This is referential stability without a database.&lt;/p&gt;
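&lt;p&gt;A quick sketch makes the stability property concrete. The &lt;code&gt;derive_seed_from_barcode&lt;/code&gt; function is reproduced from above; the age draw is purely illustrative, not the generator's actual field logic:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib

import numpy as np


def derive_seed_from_barcode(barcode: str) -> int:
    # Reproduced from the post: hash the barcode, keep the first 8 bytes.
    digest = hashlib.sha256(barcode.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


# Two generators constructed independently from the same barcode produce
# identical draws, so any profile field derived from them matches.
rng_a = np.random.default_rng(derive_seed_from_barcode("LAB-001"))
rng_b = np.random.default_rng(derive_seed_from_barcode("LAB-001"))
assert rng_a.integers(18, 90) == rng_b.integers(18, 90)
```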

&lt;p&gt;The customer data divides into two tables. A &lt;code&gt;customers&lt;/code&gt; table holds one row per unique customer: name, address, contact info, date of birth, age, and a free-text notes field that sounds like it came from someone who has been doing soil testing long enough to have opinions about irrigation timing. A &lt;code&gt;customer_samples&lt;/code&gt; table holds one row per barcode, linking it to a customer with submission-level fields: &lt;code&gt;crop_type&lt;/code&gt; and &lt;code&gt;sample_date&lt;/code&gt;. The separation matters because customers are entities and samples are facts. A corn farmer submits many soil samples across many fields and seasons. Flattening that into a single table would denormalize the relationship in ways that create trouble the moment you want customer-level aggregation or need to update a profile field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;customers_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;forge_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_barcodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;customer_samples_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;forge_customer_sample_assignments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_barcodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The five-to-one ratio means most customers appear across multiple sample submissions, which reflects how actual lab relationships work. One detail worth mentioning: the address generator includes a Wisconsin PLSS coordinate format with about 20% probability. &lt;code&gt;N5024W3295&lt;/code&gt; is a real addressing convention from the Public Land Survey System, common in rural parts of the Midwest. It shows up verbatim in actual vendor lab reports (usually to the quiet dismay of whoever first encounters it), which means it shows up in the fixtures, which means any address parsing logic gets tested against it. That is the point of realistic fixtures.&lt;/p&gt;
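&lt;p&gt;A hedged sketch of what that address branch might look like. The function name, street list, and exact threshold are assumptions for illustration; the repository's generator is more involved:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np


def forge_street_address(rng: np.random.Generator) -> str:
    """Illustrative only: emit a Wisconsin PLSS-style grid address about
    20% of the time, a conventional street address otherwise."""
    if rng.random() > 0.8:
        # e.g. "N5024W3295": north/west grid coordinates, no street name
        return f"N{rng.integers(1, 10000)}W{rng.integers(1, 10000)}"
    number = rng.integers(100, 9999)
    street = rng.choice(["Hollow Trl", "Pasture Dr", "County Rd F"])
    return f"{number} {street}"
```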

&lt;h2&gt;Why Not Just Use the Generated Data Directly?&lt;/h2&gt;

&lt;p&gt;That raises a legitimate question, though. Randomly generated data is structurally valid but statistically hollow. The real distributions, the genuine quirks, the specific shenanigans your data gets up to when nobody is watching: all of that is gone, replaced by Alice and Bob. Those two are perfectly serviceable for unit testing logic in isolation. Based on the number of unit test failures I've seen with them, I would not trust them together at a bar on a Saturday night. Similarly, I would not trust them as a proxy for production data's full range of quirks. If the goal is integration testing, fixtures that never existed in the real world can only tell you so much.&lt;/p&gt;

&lt;p&gt;This is where masking enters as a separate tool for a separate problem.&lt;/p&gt;

&lt;h2&gt;The Masking Strategies&lt;/h2&gt;

&lt;p&gt;The alternative to generating synthetic data is masking real data. Synthetic generation works when you need data that never existed and want full control over its structure. Masking works when you need a development dataset derived from real production records, preserving real distributions, real anomalies, and real edge cases that a generator might miss. In practice, many teams need both: generated data for early development, masked production data for integration testing before launch.&lt;/p&gt;

&lt;p&gt;Four strategies cover the meaningful design space, although this list is not meant to be exhaustive, merely illustrative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt; permutes values within a column independently, which preserves the marginal distribution: if 30% of customers are in Wisconsin before shuffling, 30% still are afterward. Every value in the column is real. Every value passed format validation before masking and will pass it again after. Nothing looks wrong.&lt;/p&gt;

&lt;p&gt;The hazard is relational, not statistical. Shuffle severs the connection between a value and the row it belonged to. Sometimes that severance is the goal. The association between a particular farm and a particular set of samples can be as sensitive as the customer record itself. For end-to-end testing, whether barcode LAB-001 actually belongs to Sandra Hernandez is irrelevant to whether the pipeline processes it correctly, and severing that link adds a layer of protection if the development environment is ever compromised. Shuffle breaks the relationship deliberately and completely, which is occasionally exactly what a privacy requirement demands.&lt;/p&gt;

&lt;p&gt;Where shuffle becomes hazardous is when the relationship itself is load-bearing. A dataset released for external research might need customer histories intact but otherwise anonymized to be analytically meaningful. Shuffle would destroy that meaning while preserving the appearance of validity: joins complete, results look plausible, and the underlying analysis is quietly wrong. That use case is outside the scope of what we are building here, but it is worth understanding before reaching for shuffle by default.&lt;/p&gt;

&lt;p&gt;The strategy is not wrong. It requires knowing whether the relationship you are severing was one you needed to keep. Most teams discover this distinction at an inopportune time.&lt;/p&gt;
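&lt;p&gt;The masking module's shuffle implementation is not shown above, so here is a minimal sketch of what a &lt;code&gt;mask_shuffle&lt;/code&gt; might look like. The name is an assumption chosen to match the &lt;code&gt;mask_impute&lt;/code&gt; convention:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np
import pandas as pd


def mask_shuffle(df: pd.DataFrame, columns: list[str],
                 generator: np.random.Generator) -> pd.DataFrame:
    """Permute each listed column independently. Marginal distributions
    survive; row-level associations deliberately do not."""
    masked = df.copy()
    for column in columns:
        masked[column] = generator.permutation(masked[column].to_numpy())
    return masked
```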

&lt;p&gt;&lt;strong&gt;Imputation&lt;/strong&gt; replaces column values with output from a callable. The masking module takes an &lt;code&gt;imputers&lt;/code&gt; dictionary mapping column names to functions that accept a row count and return a list of replacement values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;imputers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anonymous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;masked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mask_impute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;imputers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every customer becomes Anonymous. No distribution is preserved, no format inference required. For fields where the actual value is irrelevant to what you are testing, that is perfectly sufficient.&lt;/p&gt;

&lt;p&gt;The limitation is the flip side of that explicitness. A dataset where every customer name is Anonymous is not anonymized in any meaningful sense; it is a dataset with a broken name column. If downstream logic does anything with name format, uniqueness, or even basic non-nullness, imputation will expose that dependency immediately. This is occasionally useful information. A pipeline that silently assumes customer names are unique has a latent bug that imputation will surface faster than any other strategy.&lt;/p&gt;

&lt;p&gt;The callable interface also means imputation can do more than substitute a constant. A more sophisticated imputer could draw from a list of plausible replacements, apply format rules, or generate values that satisfy downstream validation constraints. Anonymous is the simplest possible implementation. The mechanism supports considerably more nuance when the situation calls for it.&lt;/p&gt;
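&lt;p&gt;As a sketch of that nuance, here is an imputer that returns distinct, name-shaped replacements instead of a constant, so uniqueness and basic format checks still pass. Everything in it is hypothetical scaffolding rather than the module's code:&lt;br&gt;
&lt;/p&gt;

```python
def plausible_name_imputer(n: int) -> list[str]:
    """Return n name-shaped, pairwise-distinct replacement values
    (distinct for up to 25 rows in this toy version)."""
    first = ["Alex", "Jordan", "Sam", "Riley", "Casey"]
    last = ["Miller", "Nguyen", "Okafor", "Schmidt", "Alvarez"]
    # Cycle through first/last combinations deterministically so the
    # replacements stay unique without tracking state.
    return [
        f"{first[i % len(first)]} {last[(i // len(first)) % len(last)]}"
        for i in range(n)
    ]
```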

&lt;p&gt;&lt;strong&gt;Resampling&lt;/strong&gt; handles numeric columns by fitting a normal distribution to the existing values and drawing a fresh set of values clipped to the original range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;column_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;column_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column_std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The statistical shape of the column is preserved: similar mean, similar spread, values that fall within the original bounds. A data scientist working with a resampled age column will see a realistic distribution without seeing any real ages. For downstream logic that cares about aggregates rather than individual values, this is the most analytically faithful masking strategy available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hashing&lt;/strong&gt; applies SHA-256 and keeps the first eight hex characters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;salt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical property is determinism. The same input with the same salt always produces the same output, which means hashed values can be joined across tables. Hash &lt;code&gt;customer_id&lt;/code&gt; in both &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;customer_samples&lt;/code&gt; using the same salt and &lt;code&gt;CUST-CE8B39&lt;/code&gt; becomes &lt;code&gt;c2fd6c19&lt;/code&gt; in both places. The foreign key relationship survives. This is the only masking strategy where that is true.&lt;/p&gt;
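&lt;p&gt;A small demonstration of that property. The &lt;code&gt;hash_mask&lt;/code&gt; helper below just wraps the expression above, and the salt value is a placeholder:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib


def hash_mask(value: str, salt: str) -> str:
    """Salted SHA-256 truncated to eight hex characters, as above."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:8]


salt = "dev-environment-salt"  # placeholder; keep the real one in config
masked_customers = {hash_mask("CUST-CE8B39", salt): "profile row"}
masked_foreign_key = hash_mask("CUST-CE8B39", salt)

# Same salt, same input: the foreign key still resolves after masking.
assert masked_foreign_key in masked_customers
```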

&lt;p&gt;The tradeoffs are real and worth understanding. An eight-character hex string truncated from SHA-256 is not cryptographically robust, particularly for low-cardinality columns where brute force recovery is straightforward. A salt raises the cost of that attack without eliminating it. Hash masking is appropriate for development environments where the goal is protecting data from casual exposure, not for datasets approaching public distribution.&lt;/p&gt;
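&lt;p&gt;The brute-force point is easy to make concrete. If the salt ever leaks, a low-cardinality column inverts with a dictionary of candidate values; the helper and salt below are illustrative:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib


def hash_mask(value: str, salt: str) -> str:
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:8]


# With a known salt and a small value space, precompute a reverse table.
leaked_salt = "leaked-salt"
candidates = ["WI", "MN", "IA", "IL", "WV"]
reverse = {hash_mask(state, leaked_salt): state for state in candidates}

masked_value = hash_mask("WI", leaked_salt)
assert reverse[masked_value] == "WI"  # recovered in one lookup
```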

&lt;p&gt;The salt also creates an operational dependency worth naming explicitly. Lose the salt, and your masked datasets become unrelatable across tables and unreproducible from scratch. It belongs in environment configuration alongside your database credentials, not in the codebase.&lt;/p&gt;
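&lt;p&gt;In practice that can be as simple as reading the salt from the environment and refusing to run without it. A sketch, with the variable name borrowed from the post's examples:&lt;br&gt;
&lt;/p&gt;

```python
import os


def load_mask_salt() -> str:
    """Fail loudly when MASK_SALT is missing; a silent hardcoded default
    would quietly break reproducibility and cross-table joins."""
    salt = os.environ.get("MASK_SALT")
    if salt is None:
        raise RuntimeError("MASK_SALT is not set in this environment")
    return salt
```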

&lt;h2&gt;Combining the Strategies&lt;/h2&gt;

&lt;p&gt;The four strategies combine into a single &lt;code&gt;apply_masking&lt;/code&gt; call that dispatches per column based on a configuration dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mask_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;masked_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_masking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mask_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;imputers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anonymous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MASK_SALT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;customer_samples&lt;/code&gt; table requires the same salt on &lt;code&gt;customer_id&lt;/code&gt; to preserve the join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;masked_assignments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_masking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_samples_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crop_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MASK_SALT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fact that this requires explicit decisions about every column is a feature. It forces the question of what each field actually is before deciding how to treat it. That is a conversation worth having before the data enters a development environment, not after.&lt;/p&gt;

&lt;h2&gt;What the Masked Data Actually Looks Like&lt;/h2&gt;

&lt;p&gt;We can now compare &lt;code&gt;customers.csv&lt;/code&gt; and &lt;code&gt;customers_masked.csv&lt;/code&gt;. One row from each:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUST-CE8B39, Sandra Hernandez, 1975-09-01, 51, sandra.hernandez@protonmail.com,
(923) 867-3934, 4332 Hollow Trl, Oxford, WV, 27369, Comparison plot for trial program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Masked:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c2fd6c19, Anonymous, 1957-07-21, 72.89, mark.jones@aol.com,
(364) 820-3448, 4848 Pasture Dr, Dover, WV, 27369, Comparison plot for trial program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of this looks reasonable until you look carefully. The customer ID is hashed, the name is imputed, and the date of birth, phone, street address, and city have been shuffled to values that belonged to different customers.&lt;/p&gt;

&lt;p&gt;The age has been resampled to 72.89. The masked output is technically within range and statistically plausible in aggregate, but no human being has ever reported themselves as 72.89 years old. Any schema that enforces INTEGER on that column will reject it immediately. This is the kind of thing that looks fine in a test that checks whether a value exists and looks obviously wrong the moment anyone actually reads it.&lt;/p&gt;
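&lt;p&gt;One way to close that gap is to make resampling type-aware: round and cast back when the source column is integer-typed. A sketch of that extension, with a hypothetical function name:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np


def resample_numeric(values: np.ndarray,
                     generator: np.random.Generator) -> np.ndarray:
    """Resample from a fitted normal, clip to the original range, and
    restore integer dtypes so an age never comes back as 72.89."""
    resampled = generator.normal(values.mean(), values.std(), size=len(values))
    resampled = np.clip(resampled, values.min(), values.max())
    if np.issubdtype(values.dtype, np.integer):
        resampled = np.round(resampled).astype(values.dtype)
    return resampled
```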

&lt;p&gt;Then there is the email address.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mark.jones@aol.com&lt;/code&gt; is sitting in the masked output fully readable. Shuffling an email address does not protect it. It reassigns it. Sandra's email is now attached to someone else's row and Mark's is attached to Sandra's. Both are still completely exposed. If the goal was protecting contact information, the masking configuration failed quietly and completely.&lt;/p&gt;

&lt;p&gt;The city and zip code tell a similar story. Oxford shuffled to Dover, but the zip code stayed as 27369. Column-level masking applied independently has no awareness of relationships between columns. The result passes any single-column validation and fails the moment anything checks whether the address makes geographic sense.&lt;/p&gt;
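&lt;p&gt;The fix for correlated columns is to shuffle them as a unit: one shared permutation applied to the whole group, so city and zip code travel together. A sketch, with the function name an assumption:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np
import pandas as pd


def mask_shuffle_group(df: pd.DataFrame, columns: list[str],
                       generator: np.random.Generator) -> pd.DataFrame:
    """Apply a single shared permutation to a group of correlated
    columns so intra-row consistency survives the shuffle."""
    masked = df.copy()
    order = generator.permutation(len(masked))
    for column in columns:
        masked[column] = masked[column].to_numpy()[order]
    return masked
```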

&lt;p&gt;None of these are bugs in the masking implementation (OK, the age thing is). All of them are the correct output of the configuration we provided. That is precisely the point: the strategies do exactly what they are told, not what you meant.&lt;/p&gt;

&lt;h2&gt;The Silver Layer Extension and Unit Tests&lt;/h2&gt;

&lt;p&gt;The fixture infrastructure is only useful if something actually runs against it. Before loading masked data into a test schema and invoking &lt;code&gt;dbt build&lt;/code&gt;, it helps to know that the model logic itself is correct. The unit test handles that first.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_with_customers&lt;/code&gt; performs a two-hop join across three models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_standardized&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_standardized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;stg_customers&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stg_customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model adds &lt;code&gt;has_customer_assignment&lt;/code&gt;, a boolean that is &lt;code&gt;TRUE&lt;/code&gt; when a measurement's barcode has a matching row in &lt;code&gt;customer_samples&lt;/code&gt;, which makes unmatched measurements findable without requiring every downstream query to filter on NULL customer fields.&lt;/p&gt;

&lt;p&gt;The join has three distinct behavioral regimes worth testing explicitly: a barcode with a full match through both hops, a barcode with no customer assignment at all, and a barcode that matches &lt;code&gt;customer_samples&lt;/code&gt; but whose &lt;code&gt;customer_id&lt;/code&gt; has no corresponding row in &lt;code&gt;customers&lt;/code&gt;. That third case, an orphaned assignment, is realistic enough to justify its own fixture row. Anyone who has spent meaningful time with production data has encountered some version of this: a submission that arrived before its parent record, or a customer that got cleaned up while their samples stayed behind. A barcode submitted before a customer record existed, or after one was deleted, should produce &lt;code&gt;has_customer_assignment: true&lt;/code&gt; and &lt;code&gt;customer_name: null&lt;/code&gt;. Testing it explicitly verifies the full join chain rather than just the happy path.&lt;/p&gt;
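&lt;p&gt;The three regimes are easy to reproduce in miniature with a left merge. The frames below are illustrative stand-ins for the dbt models, with column names matching the post:&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

measurements = pd.DataFrame({"sample_barcode": ["LAB-001", "LAB-002", "LAB-003"]})
customer_samples = pd.DataFrame({
    "sample_barcode": ["LAB-001", "LAB-003"],
    "customer_id": ["CUST-A", "CUST-ORPHAN"],
})
customers = pd.DataFrame({
    "customer_id": ["CUST-A"],
    "customer_name": ["Sandra Hernandez"],
})

joined = (
    measurements
    .merge(customer_samples, on="sample_barcode", how="left")
    .merge(customers, on="customer_id", how="left")
)
joined["has_customer_assignment"] = joined["customer_id"].notna()

# LAB-001: full match. LAB-002: no assignment at all. LAB-003: orphaned
# assignment, so has_customer_assignment is True but customer_name is null.
```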

&lt;p&gt;Once the unit tests pass, the masked fixtures are ready to do their actual job. Load &lt;code&gt;masked_customers&lt;/code&gt; and &lt;code&gt;masked_assignments&lt;/code&gt; into a test schema, point &lt;code&gt;dbt build --select your_models --target test&lt;/code&gt; at it, and you have an integration test that exercises the pipeline against data with the structural properties and relational complexity of production records, without any of the liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The masking strategies covered here represent a design space, not a checklist. Shuffle preserves distributions but severs relationships, deliberately or otherwise. Imputation is explicit but analytically hollow. Resampling maintains statistical shape but loses type fidelity. Hashing preserves referential integrity at the cost of cryptographic robustness. No single strategy is correct in isolation. The configuration you choose reflects a set of tradeoffs that are worth making consciously rather than discovering later when something downstream produces confident, well-formatted, wrong answers.&lt;/p&gt;
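&lt;p&gt;Two ends of that design space are worth seeing side by side. The toy sketch below is not the labforge implementation; the email values are invented, and it only illustrates the shuffle-versus-hash tradeoff described above:&lt;/p&gt;

```python
import hashlib
import random

emails = ["alice@example.com", "bob@example.com", "carol@example.com"]

# Shuffle: the set of values (the distribution) is preserved, but any
# row-level relationship to other columns or tables is severed.
shuffled = list(emails)
random.Random(0).shuffle(shuffled)
assert sorted(shuffled) == sorted(emails)

# Hashing: deterministic, so the same value yields the same token in every
# table, preserving referential integrity. Anyone who can guess an input
# can confirm it, which is the cryptographic-robustness cost.
def mask(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

assert mask("alice@example.com") == mask("alice@example.com")
```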

&lt;p&gt;The fixture infrastructure this post builds serves a specific purpose: giving the pipeline realistic data to run against without the liability of real records. The masked customer dataset is not a privacy guarantee. It is a development tool, and like any development tool its value depends on using it for the right job. Getting the configuration right before you need it is substantially easier than explaining a masking decision you made at 4pm on a Thursday.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete working example:&lt;/strong&gt; The labforge customer and masking modules are in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/src/labforge" rel="noopener noreferrer"&gt;src/labforge/&lt;/a&gt;. The dbt models are in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/src/crucible/models/silver" rel="noopener noreferrer"&gt;src/crucible/models/silver/&lt;/a&gt;. Example data comparing raw and masked output is in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/example_data/" rel="noopener noreferrer"&gt;example_data/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>From Bronze to Silver: Staging, Intermediate, and the Art of the Trustworthy Join</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 25 Feb 2026 15:50:32 +0000</pubDate>
      <link>https://forem.com/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5</link>
      <guid>https://forem.com/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn"&gt;Part 3&lt;/a&gt; solved vendor schema chaos by treating column names as data. The bronze layer now holds one row per attribute per measurement: &lt;code&gt;lab_provided_attribute&lt;/code&gt; carries whatever the vendor typed as a column name, &lt;code&gt;lab_provided_value&lt;/code&gt; carries the cell value, and the table schema stays fixed and vendor-agnostic regardless of how creatively the vendor chose to name things. No fuzzy matching, no superset schemas, no vendor-specific parsing logic. The chaos was encoded as rows rather than fought through transformations.&lt;/p&gt;

&lt;p&gt;It was the right call for ingestion. It is a genuinely terrible format for answering questions.&lt;/p&gt;

&lt;p&gt;Ask "what was the average copper measurement for samples received in October?" against long-format EAV data and you're looking at a subquery to identify which &lt;code&gt;lab_provided_attribute&lt;/code&gt; values represent copper, a pivot to bring those values into a column, a join to find the collection date from another EAV row, and a filter on the date. Do that once, and it's manageable. Do it in every query anyone ever writes against this data, and you've simply moved the transformation burden from ingestion to analysis. The point of a multi-layer pipeline is to make that transformation happen once, correctly, in a place where it can be tested and maintained.&lt;/p&gt;

&lt;p&gt;Silver is where we make it happen. This post introduces the dbt models that carry bronze data from EAV chaos into an analytical schema, walks through the architectural reasoning behind each layer, and then runs the first unit test. It does not go well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why dbt for Silver Transformations
&lt;/h2&gt;

&lt;p&gt;Part 3 expressed the bronze layer in Python and PySpark. We could take the same approach for silver: write Spark SQL transformations directly, materialize staging and intermediate results as Delta tables, and wire them together with a notebook or a Databricks job. The transformation logic is the same regardless of the execution environment.&lt;/p&gt;

&lt;p&gt;The reason to use dbt here is not that the SQL would be different. It's that dbt enforces structural discipline that raw Spark SQL scripts require you to maintain yourself, through consistent conventions and sheer force of will. One of those scales better than the other.&lt;/p&gt;

&lt;p&gt;Each dbt model lives in its own file with an explicit dependency graph. When &lt;code&gt;int_lab_samples_joined&lt;/code&gt; references &lt;code&gt;ref('stg_lab_samples_unpivoted')&lt;/code&gt;, dbt knows to run the staging model first, tracks the lineage between models, and rebuilds downstream models when upstream ones change. Schema documentation lives alongside the models in YAML files and can be generated into browsable documentation. Tests run against the models directly. Materialization strategies (view, table, incremental) are model-level configuration, not something you manage separately in SQL DDL. None of this is impossible with raw Spark SQL scripts. It just requires the kind of ironclad consistency that engineering teams reliably maintain right up until they don't. dbt builds the discipline into the workflow rather than assuming it from the humans.&lt;/p&gt;

&lt;p&gt;For a transformation layer with multiple models in a dependency chain, that structure is worth the overhead of learning a new tool. The silver layer here has at least three models with an explicit sequence: staging must run before intermediate, and getting that sequence wrong produces silent incorrect results rather than an obvious error. dbt makes those dependencies explicit and enforces them automatically. And for what it's worth, dbt Core and Databricks Community Edition are both free, so the overhead is purely cognitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Silver Has Layers
&lt;/h2&gt;

&lt;p&gt;Here is something the dbt documentation will not tell you: Ralph Kimball described this architecture thirty years ago.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Data Warehouse Toolkit&lt;/em&gt; (first published 1996, revised 2013) describes a staging area whose job is to clean and conform source data before it enters the warehouse. The staging area strips away source-system idiosyncrasies. It normalizes formats, resolves encoding variations, and creates consistent representations that downstream logic can depend on. The integration layer then applies business logic: joining conformed source data to reference tables, enriching records with canonical meaning, building the integrated representation that analysts actually query. dbt's best-practices guides on project structure [1] use exactly this division: staging models are thin, source-cleaning layers; intermediate models are where complex joins and business logic live.&lt;/p&gt;

&lt;p&gt;The vocabulary has changed; the conceptual structure has not.&lt;/p&gt;

&lt;p&gt;This matters because it explains WHY the layers exist, not just what they happen to contain. Conventions without reasoning are just folklore. The staging and intermediate separation encodes a functional dependency: intermediate models depend on guarantees that staging models make. Staging exists to create a contract. Intermediate relies on that contract to do its job. When those responsibilities blur, with source-system cleaning mixed into join logic and business rules embedded in normalization models, the result is transformation logic that is hard to reason about, hard to test, and hard to change safely.&lt;/p&gt;

&lt;p&gt;With EAV bronze data, this layering matters more than usual. The staging layer has a specific, structural job to do before any downstream join is even possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Staging Model: Creating a Joinable Key
&lt;/h2&gt;

&lt;p&gt;The problem with bronze EAV data is that &lt;code&gt;lab_provided_attribute&lt;/code&gt; holds whatever the vendor typed. Vendor A sends &lt;code&gt;"Sample Concentration"&lt;/code&gt;. Vendor B sends &lt;code&gt;"sample_concentration "&lt;/code&gt; (trailing space included). A third vendor might send &lt;code&gt;"SAMPLE CONC."&lt;/code&gt;. These represent the same measurement. No string comparison will match them without normalization.&lt;/p&gt;

&lt;p&gt;The staging layer's job is to solve this problem once, correctly, in a place downstream models can depend on. Every join that comes after this model is only possible because this model did its job first. The model adds one column, &lt;code&gt;attribute_standardized&lt;/code&gt;, that normalizes &lt;code&gt;lab_provided_attribute&lt;/code&gt; into a consistent form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt;  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'bronze'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'lab_samples_unpivoted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;lab_samples_unpivoted_staged&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;TRANSLATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s1"&gt;'-$()#./ %@!'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s1"&gt;'___________'&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;attribute_standardized&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted_staged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cleaning chain does three things. &lt;code&gt;TRANSLATE&lt;/code&gt; replaces eleven special characters (&lt;code&gt;-$()#./ %@!&lt;/code&gt;) with underscores. &lt;code&gt;TRIM&lt;/code&gt; strips leading and trailing whitespace. &lt;code&gt;LOWER&lt;/code&gt; normalizes to lowercase so that capitalization differences don't produce false mismatches. With their powers combined, these three functions will help us later pivot back to wide format cleanly for specific analytical views.&lt;/p&gt;

&lt;p&gt;Applied to realistic vendor data: &lt;code&gt;"Sample Concentration"&lt;/code&gt; becomes &lt;code&gt;sample_concentration&lt;/code&gt;. &lt;code&gt;"-a$lot(of)weird#symbols.why/vendors%why@!"&lt;/code&gt; becomes &lt;code&gt;a_lot_of_weird_symbols_why_vendors_why&lt;/code&gt;, which is arguably an improvement on the original in more ways than one.&lt;/p&gt;
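&lt;p&gt;The normalization is easy to sanity-check outside Spark. A minimal Python sketch matching the examples above (&lt;code&gt;standardize_attribute&lt;/code&gt; is a hypothetical helper name, not part of the pipeline):&lt;/p&gt;

```python
# Python sketch of the staging normalization, for quick local checks only.
SPECIALS = "-$()#./ %@!"  # the eleven characters replaced with underscores

def standardize_attribute(raw):
    # Translate specials (including spaces) to underscores, strip any
    # leading/trailing underscores left behind, then lowercase.
    table = str.maketrans(SPECIALS, "_" * len(SPECIALS))
    return raw.translate(table).strip("_").lower()

assert standardize_attribute("Sample Concentration") == "sample_concentration"
assert standardize_attribute("sample_concentration ") == "sample_concentration"
```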

&lt;p&gt;Kimball called the underlying concept "conforming." Conformed dimensions bring disparate source systems into a shared vocabulary so they can be joined and analyzed together. You build a date dimension once, in a shared format, and every fact table that references dates joins to the same table. You don't maintain separate date representations for each source system. We call what the staging model does "attribute standardization." The goal is identical: create a canonical representation that downstream logic can join on without worrying about source-system variation. Same church, different pew.&lt;/p&gt;

&lt;p&gt;Notice what this model does not do. It doesn't join to any reference data, doesn't classify anything, doesn't apply a single business rule. It adds one column whose entire purpose is to make the next model's join possible. That constraint is deliberate, and violating it is how staging models become the thing nobody wants to touch six months later. dbt's best practices [2] describe staging models as thin layers that do the minimum necessary to make source data useful downstream: rename columns to consistent conventions, cast types, add computed identifiers. Business logic belongs in intermediate. The moment staging models start containing complex transformations, you've lost the separation that makes both layers maintainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intermediate Layer: Enrichment Through Joining
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;attribute_standardized&lt;/code&gt; available, the first intermediate model can do what staging made possible.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_joined&lt;/code&gt; performs two joins. Neither is complicated. What matters is why they happen in this order and what the LEFT JOIN choice is actually saying about our relationship with imperfect data.&lt;/p&gt;

&lt;p&gt;The first join connects staged measurements to a vendor column mapping table on &lt;code&gt;attribute_standardized&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt;. The mapping table is reference data that records the connection between vendor-specific attribute names and canonical column identifiers. The second join connects the mapping result to a canonical column definitions table, which carries authoritative metadata about each canonical column: its data type, its category (measurement, identifier, date), and whether it should be treated as a metadata column or a measurement column in downstream transformations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_lab_samples_unpivoted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'bronze'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'vendor_column_mapping'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'silver'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'canonical_column_definitions'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_mapping_notes&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attribute_standardized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The join condition includes both &lt;code&gt;attribute_standardized&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt;. Two vendors might use similar standardized attribute names for genuinely different measurements, or the canonical mapping might differ by vendor for domain-specific reasons. Joining on both columns preserves that specificity.&lt;/p&gt;
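&lt;p&gt;A tiny sketch of why the composite key matters, with invented vendor ids and attribute names:&lt;/p&gt;

```python
# Mapping keyed on (vendor_id, vendor_column_name): the same standardized
# attribute name can resolve to different canonical columns per vendor.
mapping = {
    ("vendor_a", "conc"): "copper_concentration_ppm",
    ("vendor_b", "conc"): "zinc_concentration_ppm",
}
# Joining on the attribute name alone would conflate the two measurements.
assert mapping[("vendor_a", "conc")] != mapping[("vendor_b", "conc")]
```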

&lt;p&gt;The LEFT JOIN choice is a deliberate QA decision, and it reflects a specific philosophy about what bronze-to-silver transformation should do with ambiguity. Rows that don't match the mapping table don't disappear. They survive with &lt;code&gt;canonical_column_name IS NULL&lt;/code&gt;, which is not a failure state. It's a diagnostic signal. The pipeline is saying "something arrived that I don't recognize" rather than quietly discarding evidence that something unexpected happened.&lt;/p&gt;

&lt;p&gt;Downstream QA can filter on that null to find unmapped attributes and determine whether the mapping table needs a new entry, whether the vendor introduced something unexpected, or whether an existing mapping has a standardization error. The silver layer doesn't resolve that ambiguity; it surfaces it in a way that makes it findable.&lt;/p&gt;
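&lt;p&gt;The survive-with-NULL behavior, sketched with Python dictionaries (attribute and vendor names invented for illustration):&lt;/p&gt;

```python
# LEFT JOIN sketch: unmapped attributes survive with canonical_column_name
# of None rather than disappearing; QA filters on that None to find them.
mapping = {("vendor_a", "cu_ppm"): "copper_concentration_ppm"}
staged = [("vendor_a", "cu_ppm"), ("vendor_a", "mystery_reading")]

joined = [
    {"vendor_id": vendor, "attribute_standardized": attr,
     "canonical_column_name": mapping.get((vendor, attr))}
    for vendor, attr in staged
]

unmapped = [r for r in joined if r["canonical_column_name"] is None]
assert [r["attribute_standardized"] for r in unmapped] == ["mystery_reading"]
```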

&lt;p&gt;This is Kimball's integration layer in practice [3], thirty years later, running on Delta Lake instead of whatever disk arrays cost a fortune in 1996. A vendor-specific attribute in EAV format becomes a row that carries canonical column name, data type, category, and enrichment notes. The enrichment is what makes downstream transformations tractable. Without it, every downstream model or query would need to join to the mapping table itself, with full knowledge of vendor-specific attribute naming. That knowledge belongs here, applied once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intermediate Layer: Completing the Schema Transformation
&lt;/h2&gt;

&lt;p&gt;With the enriched join in place, a second intermediate model can finish the transformation into an analyst-friendly shape.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_standardized&lt;/code&gt; handles this with conditional aggregation, which is the SQL equivalent of sorting your mail: same pile, separated by what actually matters.&lt;/p&gt;

&lt;p&gt;The challenge is structural. Bronze EAV format is gloriously egalitarian: sample barcodes, collection dates, copper measurements, and pH readings all get the same treatment, one row each, no hierarchy, no distinction. That democratic impulse was exactly right for ingestion. For analysis it is a nightmare, because analysts don't want to write a subquery just to find out what day a sample was collected.&lt;/p&gt;

&lt;p&gt;In the bronze EAV format, metadata columns (sample barcode, lab ID, collection date, analysis date) live in the same row structure as measurement columns (copper, zinc, pH). A typical row from a sample with three measurements and four metadata columns produces seven EAV rows: four metadata rows and three measurement rows, all sharing the same &lt;code&gt;row_index&lt;/code&gt;. Analysts want one record per measurement with the metadata attached as columns, not a set of EAV rows where metadata and measurements are interleaved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_lab_samples_joined'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;metadata_pivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sample_barcode'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sample_barcode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'lab_id'&lt;/span&gt;         &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lab_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'date_received'&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_received&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'date_analyzed'&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_analyzed&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="c1"&gt;-- All non-metadata rows, including unmapped ones (is_metadata_column IS NULL).&lt;/span&gt;
&lt;span class="c1"&gt;-- Unmapped rows are preserved for QA; filter on is_metadata_column IS NULL to find problem attributes.&lt;/span&gt;
&lt;span class="n"&gt;measurements&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_metadata_column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;standardized&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_received&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_analyzed&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;standardized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic splits the enriched EAV data into two paths. The &lt;code&gt;metadata_pivoted&lt;/code&gt; CTE filters for rows where &lt;code&gt;is_metadata_column = TRUE&lt;/code&gt; and collapses them into one record per &lt;code&gt;row_index&lt;/code&gt;, &lt;code&gt;vendor_id&lt;/code&gt;, and &lt;code&gt;file_name&lt;/code&gt;, using &lt;code&gt;MAX(CASE WHEN ...)&lt;/code&gt; to pull each metadata attribute into its own named column. The &lt;code&gt;measurements&lt;/code&gt; CTE takes the other path: rows where &lt;code&gt;is_metadata_column = FALSE&lt;/code&gt; (actual measurement attributes) and rows where &lt;code&gt;is_metadata_column IS NULL&lt;/code&gt; (attributes that didn't match the mapping table) both pass through. The final join recombines the two paths on all three columns.&lt;/p&gt;

&lt;p&gt;The three-column join condition deserves a note, because this is exactly the kind of thing that looks fine until it catastrophically isn't. &lt;code&gt;row_index&lt;/code&gt; is the position of a row within a single CSV file; it resets to zero at the start of each file. Joining on &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt; alone would incorrectly match row 5 from &lt;code&gt;file_a.csv&lt;/code&gt; with row 5 from &lt;code&gt;file_b.csv&lt;/code&gt; if both came from the same vendor. Including &lt;code&gt;file_name&lt;/code&gt; makes the join key specific to a row within a specific file from a specific vendor: the granularity we actually want.&lt;/p&gt;
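&lt;p&gt;To make the failure mode concrete, here is a small Python sketch with hypothetical rows standing in for the two CTEs (this is an illustration of the join cardinality, not the dbt models themselves):&lt;/p&gt;

```python
# Hypothetical rows: the same vendor sends two files, and row_index
# restarts at the top of each file, so both files contain a "row 5".
measurements = [
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_a.csv", "value": "6.5"},
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_b.csv", "value": "7.2"},
]
metadata = [
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_a.csv", "barcode": "ABC123"},
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_b.csv", "barcode": "DEF456"},
]

def join(left, right, keys):
    """Nested-loop inner join on the given key columns."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if all(lrow[key] == rrow[key] for key in keys)
    ]

# Two-column key: every measurement matches BOTH metadata rows (4 results),
# silently attaching file_b's barcode to file_a's measurement and vice versa.
assert len(join(measurements, metadata, ["row_index", "vendor_id"])) == 4

# Three-column key: exactly one match per measurement, with the right barcode.
correct = join(measurements, metadata, ["row_index", "vendor_id", "file_name"])
assert len(correct) == 2
assert correct[0]["barcode"] == "ABC123"
```

&lt;p&gt;The two-column version doesn't error; it just quietly doubles the rows with plausible-looking metadata, which is why the bug is easy to miss.&lt;/p&gt;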

&lt;p&gt;Two other design decisions in this model are worth making explicit.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;is_metadata_column IS NULL&lt;/code&gt; inclusion in the measurements CTE reflects the same philosophy as the LEFT JOIN in the previous model: preserve ambiguity rather than discard it. An unmapped row in silver is information. It says "something arrived that we haven't classified yet." Discarding it would make the gap invisible; surfacing it makes the gap findable.&lt;/p&gt;
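&lt;p&gt;As a toy illustration of that filter (plain Python stand-ins for the enriched EAV rows, with a hypothetical unmapped attribute), the measurements path keeps everything that is not positively classified as metadata:&lt;/p&gt;

```python
# Enriched EAV rows after the mapping join; None means the attribute
# did not match the mapping table (a hypothetical unmapped column).
rows = [
    {"attribute": "ph", "is_metadata_column": False},
    {"attribute": "sample_barcode", "is_metadata_column": True},
    {"attribute": "mystery_reading_x", "is_metadata_column": None},
]

# Mirrors the SQL filter: is_metadata_column = FALSE OR is_metadata_column IS NULL.
# The unmapped row survives into silver, where QA can find it.
measurements = [r for r in rows if r["is_metadata_column"] is not True]
assert [r["attribute"] for r in measurements] == ["ph", "mystery_reading_x"]
```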

&lt;p&gt;Date columns (&lt;code&gt;date_received&lt;/code&gt;, &lt;code&gt;date_analyzed&lt;/code&gt;) remain as raw strings through these intermediate models. This is not an oversight. Casting vendor date strings to an actual date type without knowing the vendor's format either fails loudly on unexpected input or silently coerces values into something plausible but wrong. Silent and wrong is the worst possible outcome in a data pipeline. Now that we have the data in a more usable format, we can start validating this with further intermediate models before it ends up in a gold table.&lt;/p&gt;
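&lt;p&gt;A quick demonstration of why premature casting is dangerous, using a hypothetical ambiguous vendor date string:&lt;/p&gt;

```python
from datetime import datetime

raw = "03/04/2026"  # hypothetical vendor value: March 4th? April 3rd?

# Parsing with an assumed format "succeeds" either way; both results look plausible.
us = datetime.strptime(raw, "%m/%d/%Y")  # March 4, 2026
eu = datetime.strptime(raw, "%d/%m/%Y")  # April 3, 2026
assert us != eu  # same string, two valid-looking dates: silent and possibly wrong

# A format that genuinely doesn't fit fails loudly instead, which is recoverable.
try:
    datetime.strptime(raw, "%Y-%m-%d")
except ValueError:
    pass  # the error surfaces immediately rather than corrupting silver
```

&lt;p&gt;Failing loudly is annoying; coercing silently is a data-quality incident with a delay timer. Keeping the raw string defers the decision until we know the vendor's format.&lt;/p&gt;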

&lt;h2&gt;
  
  
  Writing the Contract Down
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu8xjm5df0xqyn757uqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu8xjm5df0xqyn757uqc.png" alt="Schematic representation of pipeline" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three models form a coherent chain. Staging creates a joinable key. The first intermediate model uses that key to enrich with canonical meaning. The second pivots the result into an analytical schema while preserving unmapped rows for QA. From a design standpoint, the transformation logic is complete.&lt;/p&gt;

&lt;p&gt;From a validation standpoint, we've asserted that it works without demonstrating it. "Looks correct" is not a testing strategy. It is a feeling, and feelings have a well-documented history of being wrong about SQL.&lt;/p&gt;

&lt;p&gt;This is where test-driven thinking applies. The staging model makes a specific claim: given a &lt;code&gt;lab_provided_attribute&lt;/code&gt;, &lt;code&gt;attribute_standardized&lt;/code&gt; will be trimmed, lowercased, and have special characters replaced with underscores. That claim is precise enough to verify directly.&lt;/p&gt;

&lt;p&gt;dbt's unit testing feature [4], introduced in version 1.8, makes this claim machine-verifiable. You define mock input rows and expected output rows in the model's YAML configuration. dbt runs the model against your mock inputs in isolation, without touching production data, and tells you whether reality matches the specification. It is, in the most literal sense, writing the contract down and then checking whether anyone actually honored it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;unit_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_column_name_standardization&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;that&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;names&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;correctly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;trimmed,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;made&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lower&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;case,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;have&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;symbols&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;removed"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_lab_samples_unpivoted&lt;/span&gt;
    &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;source('bronze', 'lab_samples_unpivoted')&lt;/span&gt;
        &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;extra&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spaces&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CAPITALS'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-a$lot(of)weird#symbols.why/vendors%why@!'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extra_spaces'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;capitals'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a_lot_of_weird_symbols_why_vendors_why'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three test cases, three behaviors. The whitespace case verifies that leading and trailing spaces are removed. The capitalization case verifies that mixed or uppercase input produces lowercase output. The symbols case verifies that punctuation and special characters become underscores. Together, they define the contract that downstream models rely on when they join to the mapping table on &lt;code&gt;attribute_standardized&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tests Fail
&lt;/h2&gt;

&lt;p&gt;Two of the three cases fail. The test does not care about our feelings about this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failure in unit_test test_column_name_standardization

actual differs from expected:

@@ ,attribute_standardized
+++,_a_lot_of_weird_symbols_why_vendors_why__
+++,_extra_spaces_
---,a_lot_of_weird_symbols_why_vendors_why
   ,capitals
---,extra_spaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consider the whitespace case first. Input: &lt;code&gt;' extra spaces '&lt;/code&gt;. Expected: &lt;code&gt;'extra_spaces'&lt;/code&gt;. Actual: &lt;code&gt;'_extra_spaces_'&lt;/code&gt;. Something in that chain is executing in the wrong order. Let's trace it.&lt;/p&gt;

&lt;p&gt;The functions nest inside each other, so they execute from the innermost outward: &lt;code&gt;TRANSLATE&lt;/code&gt; runs first, &lt;code&gt;TRIM&lt;/code&gt; runs second, &lt;code&gt;LOWER&lt;/code&gt; runs third.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TRANSLATE&lt;/code&gt; replaces the eleven characters listed in its second argument with underscores. The character set is &lt;code&gt;'-$()#./ %@!'&lt;/code&gt;. Count carefully: there is a space between the forward slash and the percent sign. Every space in the input string becomes an underscore. &lt;code&gt;' extra spaces '&lt;/code&gt; becomes &lt;code&gt;'_extra_spaces_'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TRIM&lt;/code&gt; runs on the result of &lt;code&gt;TRANSLATE&lt;/code&gt;. &lt;code&gt;TRIM&lt;/code&gt; removes leading and trailing whitespace characters. There is no whitespace left; the leading and trailing spaces became underscores in the previous step. &lt;code&gt;TRIM&lt;/code&gt; finds nothing to remove. The string stays &lt;code&gt;'_extra_spaces_'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LOWER&lt;/code&gt; runs. No change; the string is already lowercase.&lt;/p&gt;

&lt;p&gt;The bug is an ordering problem, and it is the specific kind of ordering problem that looks completely reasonable until you trace one concrete input through the full execution sequence and watch it go wrong in slow motion. &lt;code&gt;TRIM&lt;/code&gt; must run before &lt;code&gt;TRANSLATE&lt;/code&gt;. Strip the whitespace first, then translate special characters, and the leading and trailing spaces are gone before &lt;code&gt;TRANSLATE&lt;/code&gt; ever encounters them. The fix is a one-line change to the nesting order.&lt;/p&gt;
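&lt;p&gt;The whitespace case can be reproduced outside the warehouse. Here is a standalone Python sketch of both nesting orders, using the same eleven-character special set from the model (note the space between the slash and the percent sign):&lt;/p&gt;

```python
SPECIAL = "-$()#./ %@!"  # the TRANSLATE character set, space included
TABLE = str.maketrans({ch: "_" for ch in SPECIAL})

def buggy(name: str) -> str:
    # LOWER(TRIM(TRANSLATE(...))): TRANSLATE runs first, so the leading and
    # trailing spaces become underscores before TRIM ever sees them.
    return name.translate(TABLE).strip().lower()

def fixed(name: str) -> str:
    # LOWER(TRANSLATE(TRIM(...))): strip whitespace first, then translate.
    return name.strip().translate(TABLE).lower()

assert buggy(" extra spaces ") == "_extra_spaces_"  # the failing behavior
assert fixed(" extra spaces ") == "extra_spaces"    # the contract
```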

&lt;p&gt;The symbols case fails for related reasons, and tracing through it is left as an exercise. The root cause is the same.&lt;/p&gt;

&lt;p&gt;Incidentally, I got this wrong the first time. The test told me.&lt;/p&gt;

&lt;p&gt;Here is the uncomfortable implication of that fact, and it applies regardless of how the models were written. As I was working through this post, I noticed that the join in &lt;code&gt;int_lab_samples_standardized&lt;/code&gt; was initially written on &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt; alone, which would produce incorrect metadata associations whenever a vendor sends more than one file, since &lt;code&gt;row_index&lt;/code&gt; resets to zero at the start of each file. I caught it by thinking carefully about what the columns actually represent. I would not have caught it by looking at the SQL and deciding it seemed reasonable.&lt;/p&gt;

&lt;p&gt;That bug would have existed in the code whether the models were written with an agent, generated from a Claude prompt and pasted in, or typed out manually by an engineer who had just had a very productive morning. The staging model's unit test demonstrates the alternative: make the claim explicit, make it machine-verifiable, and find out whether the claim is true before production data depends on it.&lt;/p&gt;

&lt;p&gt;The question of how to extend that discipline to the intermediate models, and to the outputs of the entire pipeline against real data, is where Part 5 begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Silver Looks Like Now, and What Comes Next
&lt;/h2&gt;

&lt;p&gt;The architecture is sound. Three models form a coherent transformation chain, each layer making the next one possible, each design decision traceable back to a specific problem it exists to solve. Kimball would recognize it. He would just be surprised it took thirty years to get decent dependency management.&lt;/p&gt;

&lt;p&gt;The first unit test does not pass. The intermediate models have no tests at all. We have built something that looks correct and demonstrated, scientifically, that looking correct is insufficient evidence.&lt;/p&gt;

&lt;p&gt;How do we know that actual vendor data flowing through this pipeline maps correctly to canonical columns? How do we validate that the mapping table covers the attributes vendors actually send, rather than the attributes we assumed they would send? How do we catch a new analysis package that introduces column names we have never seen, silently producing nulls in silver while everyone downstream wonders why the copper measurements disappeared?&lt;/p&gt;

&lt;p&gt;Those questions require validation against real data, at the boundaries where layers hand off to each other, against statistical expectations that reflect what vendors actually send rather than what we hope they send. Building that validation framework is the next post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete "working" example:&lt;/strong&gt; The dbt models described in this post are on &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/feature/dbt_databricks/src/crucible" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The staging model, both intermediate models, and the unit test definition are all in that directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;[1] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview" rel="noopener noreferrer"&gt;&lt;em&gt;How we structure our dbt projects&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging" rel="noopener noreferrer"&gt;&lt;em&gt;Staging: Preparing and cleaning source data&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Kimball, R., &amp;amp; Ross, M. (2013). &lt;em&gt;The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling&lt;/em&gt; (3rd ed.). Wiley.&lt;/p&gt;

&lt;p&gt;[4] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/docs/build/unit-tests" rel="noopener noreferrer"&gt;&lt;em&gt;Unit tests&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Zen of the Bronze Layer: Embracing Schema Chaos</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Fri, 06 Feb 2026 04:51:38 +0000</pubDate>
      <link>https://forem.com/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn</link>
      <guid>https://forem.com/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil"&gt;Part 1&lt;/a&gt;, we introduced the Medallion Architecture with clean, well-behaved vendor data. In &lt;a href="https://dev.to/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf"&gt;Part 2&lt;/a&gt;, we watched the bronze layer transform from "just land the data" into an eight-step ingestion pipeline with vendor-specific logic, fuzzy matching heuristics, header detection, and character sanitization. The final form barely resembles the simple bronze layer from Part 1. We ended by asking uncomfortable questions about whether we're still preserving "raw" data, and what happens when Vendor C arrives.&lt;/p&gt;

&lt;p&gt;This post answers the question we left hanging: What if we stop treating column names as schema and start treating them as data?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cognitive Shift
&lt;/h2&gt;

&lt;p&gt;Traditional thinking treats CSV column names as schema constraints. You design a bronze table with specific columns (ph, copper_ppm, zinc_ppm), and vendor data either fits that schema or requires transformation to match it. When Vendor B calls their pH column "acidity" instead, you write mapping logic. When schemas change between analysis packages, you build superset schemas that accommodate all possible columns. When typos appear, you add fuzzy matching.&lt;/p&gt;

&lt;p&gt;Each vendor variation becomes a code problem requiring a code solution. By the time you're writing &lt;code&gt;handle_vendor_c_special_case_for_march_exports()&lt;/code&gt;, you start questioning your life choices.&lt;/p&gt;

&lt;p&gt;Consider what happens as this approach scales. With two vendors and two analysis packages each, you manage four schema mappings. Add a third vendor with three packages, and you're at nine. The combination space grows faster than the vendor count. Combinatorial explosions are not, despite what you might think from our enthusiasm for complexity, actually a data engineer's favorite kind of surprise. Testing requires sample files for every vendor/package/quirk combination; the test matrix explodes combinatorially.&lt;/p&gt;

&lt;p&gt;The fundamental issue is treating column names as structural constraints when they're actually metadata about measurements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Raw" Actually Means
&lt;/h2&gt;

&lt;p&gt;When vendors send CSV files, they're dumping data from lab systems or Excel into whatever format was easiest to export. The wide CSV format (one column per measurement) is convenient for humans viewing spreadsheets. It's also convenient for data engineers who harbor the sweet, naive hope that vendor files arrive in thoughtful formats. We've all been there. But wide format creates a problem: semantics are encoded in structure.&lt;/p&gt;

&lt;p&gt;Consider what this means in practice. Column positions and names carry meaning; you need to know that the third column represents copper measurements before you can interpret the value 10.2. This encoding is precisely what created the brittleness we fought in Part 2.&lt;/p&gt;

&lt;p&gt;Here's a deceptively simple question: What is the "raw" form of vendor data? Is it the wide CSV they sent, or is it the atomic facts before they got serialized into columns?&lt;/p&gt;

&lt;p&gt;Consider this sample from Vendor A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,ph,copper_ppm,zinc_ppm
ABC123,6.5,10.2,15.3
DEF456,7.2,8.7,12.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "raw" facts are: Sample ABC123 has a pH of 6.5, copper of 10.2, and zinc of 15.3. The wide format is a presentation choice, not the essential structure. Column names are data labels that happen to be stored as structural metadata.&lt;/p&gt;

&lt;p&gt;What if we converted this into a different format?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,attribute,value,row_number
ABC123,ph,6.5,1
ABC123,copper_ppm,10.2,1
ABC123,zinc_ppm,15.3,1
DEF456,ph,7.2,2
DEF456,copper_ppm,8.7,2
DEF456,zinc_ppm,12.1,2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now column names are data in the &lt;code&gt;attribute&lt;/code&gt; field. The structure describes position and value; semantics come from the attribute values themselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49t5rkhu83d0jrvumps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49t5rkhu83d0jrvumps.png" alt="Unpivot/melt into wide format warts and all" width="743" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This transformation is called unpivoting (or melting). It converts wide format into long format by treating each cell as an individual record with explicit position tracking.&lt;/p&gt;
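&lt;p&gt;A minimal Python sketch of the melt makes the mechanics concrete. This is not the actual parser from the repo, just an illustration; the &lt;code&gt;vendor_id&lt;/code&gt; and &lt;code&gt;file_name&lt;/code&gt; arguments stand in for ingestion context:&lt;/p&gt;

```python
import csv
import io

WIDE = """sample_barcode,ph,copper_ppm,zinc_ppm
ABC123,6.5,10.2,15.3
DEF456,7.2,8.7,12.1
"""

def unpivot(text, vendor_id, file_name):
    """Melt a wide CSV into one attribute/value record per cell."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    return [
        {
            "row_index": row_index,
            "column_index": column_index,
            "lab_provided_attribute": header[column_index],
            "lab_provided_value": value,
            "vendor_id": vendor_id,
            "file_name": file_name,
        }
        for row_index, row in enumerate(rows[1:], start=1)
        for column_index, value in enumerate(row)
    ]

long_rows = unpivot(WIDE, "vendor_a", "example.csv")
assert len(long_rows) == 8  # 2 data rows x 4 columns = 8 cells
assert long_rows[0]["lab_provided_attribute"] == "sample_barcode"
assert long_rows[0]["lab_provided_value"] == "ABC123"
```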

&lt;h2&gt;
  
  
  The Power of Vendor-Agnostic Structure
&lt;/h2&gt;

&lt;p&gt;This pattern isn't novel; it's a variant of the Entity-Attribute-Value (EAV) model that's been used in database design for decades, particularly in healthcare and scientific domains where schemas are highly variable. In simpler terms, we're storing key-value pairs with position metadata. The statistical community calls this "long format" or "tidy data". The concept is well-established. Data engineers have been independently "discovering" EAV for decades, and each time we're absolutely CERTAIN our use case is special. It usually isn't, but the confidence is admirable. What's perhaps less common is applying it specifically to bronze layer ingestion as a solution to vendor schema chaos.&lt;/p&gt;

&lt;p&gt;Once data is in long format, the bronze table schema becomes fixed and vendor-agnostic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;row_index&lt;/code&gt;: Position in the original file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;column_index&lt;/code&gt;: Column position in the original file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lab_provided_attribute&lt;/code&gt;: The exact column name the vendor used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lab_provided_value&lt;/code&gt;: The measurement value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vendor_id&lt;/code&gt;: Which vendor sent this data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_name&lt;/code&gt;: Source file name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ingestion_timestamp&lt;/code&gt;: When we received it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every vendor file, regardless of its schema, gets transformed into this same structure. Vendor A sends pH as "ph"? That goes in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. Vendor B sends it as "acidity"? That also goes in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. A typo like "recieved_date"? Preserved exactly as received in &lt;code&gt;lab_provided_attribute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The schema chaos doesn't disappear; we've just stopped fighting it. Turns out the solution was acceptance all along. Very zen. Very therapy. The bronze layer no longer needs to know what "ph" or "acidity" mean. It just preserves attribute/value pairs with position metadata.&lt;/p&gt;

&lt;p&gt;This has profound implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No vendor-specific logic:&lt;/strong&gt; The same unpivot transformation works for every vendor. No if/elif branches based on vendor_name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No superset schemas:&lt;/strong&gt; The bronze table doesn't grow new columns when vendors add measurements. Additional measurements just create more rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No fuzzy matching:&lt;/strong&gt; Typos are preserved as-is in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. We're not making quality decisions about which variations are "close enough."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No header detection complexity:&lt;/strong&gt; While we still need to find the header row, once found, every column gets the same treatment. There's no domain knowledge about which columns are measurements versus metadata.&lt;/p&gt;

&lt;p&gt;The unpivot pattern trades structural rigidity for structural consistency. Instead of a brittle schema that breaks with variations, we have a flexible schema that accepts any variation and pushes semantic interpretation to the silver layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: The CSV Table Parser
&lt;/h2&gt;

&lt;p&gt;Having established why column names should be data, let's examine how to actually implement this transformation. The unpivot process has three steps: find the header row, clean up the structure, and transform wide to long.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/src/parse/base.py" rel="noopener noreferrer"&gt;CSVTableParser&lt;/a&gt; uses Python's standard &lt;code&gt;csv&lt;/code&gt; module rather than pandas or Spark for initial parsing. This matters: pandas and Spark make assumptions about data types and structure that interfere with preserving data exactly as received. The csv module gives us raw strings without interpretation.&lt;/p&gt;
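&lt;p&gt;The preservation property is easy to demonstrate. With hypothetical values that a type-inferring reader might "helpfully" coerce (a barcode with a leading zero, a measurement with a trailing decimal), the &lt;code&gt;csv&lt;/code&gt; module hands back the exact strings from the file:&lt;/p&gt;

```python
import csv
import io

# Hypothetical values prone to coercion: "0123" could lose its leading zero,
# "7.0" could come back reformatted as a float.
raw = "sample_barcode,ph\n0123,7.0\n"
rows = list(csv.reader(io.StringIO(raw)))

# csv.reader yields raw strings; nothing is reinterpreted.
assert rows[1] == ["0123", "7.0"]
```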

&lt;h3&gt;
  
  
  Finding the Header Row
&lt;/h3&gt;

&lt;p&gt;CSV files from vendors often have metadata rows above the actual data header. Vendor A might include "Lab Report Generated: 2024-01-15" in row 1, with the real header in row 3. Rather than hardcoding vendor-specific knowledge about metadata row patterns (which inevitably leads to a function called &lt;code&gt;detect_vendor_a_weird_header_thing()&lt;/code&gt;), the parser uses a simple heuristic: the header row is the first row with a sufficient number of non-null columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;min_found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Find the first row with enough non-null values to be considered the header
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;non_nulls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_nulls&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find header row.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;header_index&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold (&lt;code&gt;min_found&lt;/code&gt;) is configurable, not embedded in code. If Vendor A typically has 15 columns and Vendor B has 8, you can initialize the parser with different thresholds: &lt;code&gt;CSVTableParser({"header_detection_threshold": 15})&lt;/code&gt; for Vendor A and &lt;code&gt;CSVTableParser({"header_detection_threshold": 7})&lt;/code&gt; for Vendor B. This is configuration data, not branching logic. The algorithm remains the same; only the parameter changes.&lt;/p&gt;

&lt;p&gt;Configurable thresholds let you adapt to vendor differences without encoding vendor knowledge into the codebase. You still need to know the magic number; we've just given it a better address. The complexity didn't vanish, it got relocated.&lt;/p&gt;
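&lt;p&gt;To see the heuristic in action, here is a simplified standalone version of the same logic run against a hypothetical Vendor A file with a metadata row and a blank spacer row above the real header:&lt;/p&gt;

```python
def find_header_index(records, min_found=3):
    """Same heuristic as remove_header, as a standalone function for illustration."""
    for i, row in enumerate(records):
        if sum(item is not None for item in row) >= min_found:
            return i
    raise ValueError("Could not find header row.")

records = [
    ["Lab Report Generated: 2024-01-15", None, None, None],  # metadata row: 1 non-null
    [None, None, None, None],                                # blank spacer: 0 non-null
    ["sample_barcode", "ph", "copper_ppm", "zinc_ppm"],      # the real header
    ["ABC123", "6.5", "10.2", "15.3"],
]
assert find_header_index(records, min_found=3) == 2
```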

&lt;h3&gt;
  
  
  Cleaning Column Structure
&lt;/h3&gt;

&lt;p&gt;Vendors sometimes export CSVs with empty columns (extra commas creating phantom columns) or duplicate column names. The parser drops empty columns and deduplicates names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop empty columns and deduplicate column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Drop columns where header is None
&lt;/span&gt;    &lt;span class="n"&gt;cols_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols_to_drop&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Deduplicate column names by appending _1, _2, etc.
&lt;/span&gt;    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_dedupe_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a vendor's export includes duplicate column names (perhaps two "notes" columns), they become "notes" and "notes_1". The structure is preserved, not rejected.&lt;/p&gt;
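&lt;p&gt;The &lt;code&gt;_dedupe_columns&lt;/code&gt; helper itself isn't shown above. A minimal sketch of the behavior described in the comment ("appending _1, _2, etc."); this is an illustration, not necessarily the repository's implementation:&lt;/p&gt;

```python
def dedupe_columns(columns):
    """Append _1, _2, ... to repeated column names, preserving order."""
    counts = {}
    deduped = []
    for name in columns:
        if name in counts:
            counts[name] += 1
            deduped.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            deduped.append(name)
    return deduped
```

&lt;p&gt;So &lt;code&gt;["notes", "ph", "notes"]&lt;/code&gt; becomes &lt;code&gt;["notes", "ph", "notes_1"]&lt;/code&gt;: the first occurrence keeps its name, and repeats get numbered suffixes.&lt;/p&gt;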

&lt;h3&gt;
  
  
  The Unpivot Transformation
&lt;/h3&gt;

&lt;p&gt;This is where wide format becomes long format. Each cell in the original table becomes a row in the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unpivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transform wide format to long format with position tracking&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;column_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_provided_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop structure is straightforward: for each row (starting from row 1 after the header), for each column, create a record with the position (row_index, column_index), the attribute name, and the value. A 50-row CSV with 20 columns becomes 1,000 records (50 × 20).&lt;/p&gt;

&lt;p&gt;Position tracking matters. &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;column_index&lt;/code&gt; preserve the original structure. If an issue appears with a measurement, you can trace it back to the exact cell in the source file (row 42, column 7). This is critical for debugging and audit trails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Ingestion: One Loop for All Vendors
&lt;/h3&gt;

&lt;p&gt;With the CSVTableParser handling the transformation, bronze ingestion becomes remarkably simple. Here's the actual code from our &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/003_bronze_silver_unpivot_demo.py" rel="noopener noreferrer"&gt;demo notebook&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize parser with configuration
&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CSVTableParser&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header_detection_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Track ingestion timestamp
&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Collect all parsed records
&lt;/span&gt;&lt;span class="n"&gt;all_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Process each vendor file
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vendor_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Get file path
&lt;/span&gt;    &lt;span class="n"&gt;csv_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_csv_file_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;VOLUME_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse and unpivot
&lt;/span&gt;    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add metadata to each record
&lt;/span&gt;    &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_vendor_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;

    &lt;span class="n"&gt;all_records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create bronze table from all records
&lt;/span&gt;&lt;span class="n"&gt;spark_df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bronze_schema_def&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark_df_bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bronze_table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This same loop processes 11 different vendor files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor A: basic_clean, full_clean, messy_typos, messy_casing, messy_whitespace, excel_nightmare&lt;/li&gt;
&lt;li&gt;Vendor B: standard_clean, full_clean, messy_combo, excel_disaster, db_nightmare&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No vendor-specific branches. No analysis package logic. No special handling for typos or special characters. The parser treats every file identically; the bronze layer preserves everything as data.&lt;/p&gt;
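&lt;p&gt;The &lt;code&gt;extract_vendor_id&lt;/code&gt; helper in the loop is assumed rather than shown. One plausible sketch, under the assumption that file names are prefixed with the vendor identifier (e.g. &lt;code&gt;vendor_a_basic_clean.csv&lt;/code&gt;; this naming convention is a guess, not confirmed by the notebook):&lt;/p&gt;

```python
def extract_vendor_id(file_name):
    """Assume names like 'vendor_a_basic_clean.csv'; take the first two parts."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension
    parts = stem.split("_")
    return "_".join(parts[:2])  # e.g. "vendor_a"
```

&lt;p&gt;Note that even this tiny helper is metadata extraction, not parsing logic: the vendor identity travels alongside the records as data, which is what lets the silver mapping tables do their job later.&lt;/p&gt;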

&lt;p&gt;Compare this to the eight-step bronze function from Part 2 with its vendor-specific mappings, superset schema alignment, fuzzy matching, and character sanitization. The unpivot approach collapses all that complexity into a single generic transformation. What took eight carefully orchestrated steps now takes one aggressively indifferent loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Silver Layer: Standardization Through Data
&lt;/h2&gt;

&lt;p&gt;Bronze preserves chaos; silver brings order. The key insight is that standardization happens through data (mapping tables), not code (if/elif logic).&lt;/p&gt;

&lt;p&gt;The unpivoted bronze table contains &lt;code&gt;lab_provided_attribute&lt;/code&gt; values like "ph", "acidity", "copper_ppm", "cu_total", "sample_barcode", "sample_barcod" (typo), "lab_id". These need standardization regardless of whether they're measurements or metadata columns. The silver layer resolves this through a unified mapping approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vendor Column Mapping Table
&lt;/h3&gt;

&lt;p&gt;This mapping table connects vendor-specific column names to canonical column identifiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;vendor_id&lt;/th&gt;
&lt;th&gt;vendor_column_name&lt;/th&gt;
&lt;th&gt;canonical_column_id&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;Direct pH measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;copper_ppm&lt;/td&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;Copper in ppm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;acidity&lt;/td&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;Vendor B calls pH 'acidity'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;cu_total&lt;/td&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;Chemical symbol notation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;sample_barcode&lt;/td&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;Sample identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;sample_barcod&lt;/td&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;Typo preserved in bronze&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that "ph" and "acidity" both map to &lt;code&gt;col_ph&lt;/code&gt;. Similarly, "sample_barcode" and "sample_barcod" (typo) both map to &lt;code&gt;col_sample_id&lt;/code&gt;. The mapping handles both measurements and metadata columns uniformly.&lt;/p&gt;
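&lt;p&gt;Expressed in code, the mapping table is just rows keyed on (vendor_id, vendor_column_name). A minimal in-memory sketch of the lookup, using the rows from the table above:&lt;/p&gt;

```python
# Rows from the vendor column mapping table, keyed for lookup
VENDOR_COLUMN_MAPPING = {
    ("vendor_a", "ph"): "col_ph",
    ("vendor_b", "acidity"): "col_ph",
    ("vendor_a", "copper_ppm"): "col_copper",
    ("vendor_b", "cu_total"): "col_copper",
    ("vendor_a", "sample_barcode"): "col_sample_id",
    ("vendor_b", "sample_barcod"): "col_sample_id",  # vendor typo, preserved
}

def canonical_column(vendor_id, attribute):
    """Resolve a vendor attribute to its canonical column id; None if unmapped."""
    return VENDOR_COLUMN_MAPPING.get((vendor_id, attribute))
```

&lt;p&gt;Unrecognized attributes resolve to &lt;code&gt;None&lt;/code&gt;, which mirrors the left-join behavior in the silver SQL: unmapped columns pass through with NULL canonical information rather than breaking the pipeline.&lt;/p&gt;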

&lt;h3&gt;
  
  
  Canonical Column Definitions
&lt;/h3&gt;

&lt;p&gt;The silver staging area maintains simple canonical definitions for all columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;canonical_column_id&lt;/th&gt;
&lt;th&gt;canonical_column_name&lt;/th&gt;
&lt;th&gt;column_category&lt;/th&gt;
&lt;th&gt;data_type&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;measurement&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;Soil pH level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;copper_ppm&lt;/td&gt;
&lt;td&gt;measurement&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;Copper concentration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;sample_barcode&lt;/td&gt;
&lt;td&gt;sample_identifier&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Unique sample ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_lab_id&lt;/td&gt;
&lt;td&gt;lab_id&lt;/td&gt;
&lt;td&gt;lab_metadata&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Laboratory identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_date_received&lt;/td&gt;
&lt;td&gt;date_received&lt;/td&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;Sample receipt date&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a full dimensional model yet; it's standard staging-area column normalization. The gold layer builds actual star schema dimensions (analyte dimensions with units and valid ranges, sample dimensions with tracking metadata, etc.). Silver simply establishes canonical naming and basic categorization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcmtqri96ehz7bzl2jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcmtqri96ehz7bzl2jk.png" alt="Relation between bronze long format and table attributes" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Silver Transformation
&lt;/h3&gt;

&lt;p&gt;Joining bronze with these mapping tables produces standardized column names. Here's the SQL pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples_standardized&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="c1"&gt;-- Original bronze columns for lineage&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Standardized column information&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;canonical&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left join handles unmapped columns gracefully; anything not in the mapping table gets NULL for canonical information. You can filter for &lt;code&gt;canonical_column_id IS NOT NULL&lt;/code&gt; to get recognized columns, or keep everything for complete lineage.&lt;/p&gt;

&lt;p&gt;This approach treats all columns uniformly. Whether it's "ph" vs "acidity" (measurement) or "sample_barcode" vs "sample_barcod" (identifier), the pattern is the same: map vendor naming to canonical naming through configuration data.&lt;/p&gt;

&lt;p&gt;After this transformation, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Original context preserved:&lt;/strong&gt; The exact column name the vendor used (&lt;code&gt;lab_provided_attribute&lt;/code&gt;), plus vendor_id, file_name, and ingestion_timestamp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized semantics added:&lt;/strong&gt; Canonical column names, categories, and data types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-vendor comparability:&lt;/strong&gt; pH measurements from both vendors now share the same canonical_column_id and canonical_column_name&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What We've Achieved
&lt;/h3&gt;

&lt;p&gt;The silver layer demonstrates three architectural principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration over code:&lt;/strong&gt; Vendor differences are expressed as rows in mapping tables, not if/elif branches in functions. Adding a vendor means inserting rows; changing how a vendor names their columns means updating rows. Database operations, not deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separation of concerns:&lt;/strong&gt; Bronze handles structure preservation (unpivoting). Silver handles semantic interpretation (mapping). Each layer has a single, clear responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-driven evolution:&lt;/strong&gt; The mapping tables are versioned data that can be managed by data stewards, not just engineers. Domain experts can maintain vendor-to-column mappings without understanding the ingestion code.&lt;/p&gt;

&lt;p&gt;Vendor-specific knowledge still exists; we haven't eliminated the need to understand that "acidity" means pH or "sample_barcod" means sample_barcode. But we've moved that knowledge from code (brittle, requires deployments) to data (flexible, requires inserts/updates). The complexity didn't vanish; it got a new address and better management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on dimensional modeling:&lt;/strong&gt; This silver layer approach establishes canonical naming but doesn't build full star schema dimensions. That's the gold layer's job. Gold takes canonical columns and builds proper dimension tables (analyte dimensions with units and valid ranges, sample dimensions with tracking metadata, date dimensions, etc.). Silver is a staging area for standardization; gold is where dimensional modeling happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rethinking "Raw" Data
&lt;/h2&gt;

&lt;p&gt;The unpivot pattern raises deeper questions about data engineering philosophy. Part 2 ended by questioning whether the complex eight-step bronze layer was still preserving "raw" data. This pattern forces a more careful definition of what "raw" actually means.&lt;/p&gt;

&lt;p&gt;When vendors export CSVs, they're likely just dumping data from Excel or their lab information systems without much thought. Calling anything that survived Excel's date-parsing tendencies "raw" requires some philosophical flexibility. Excel thinks everything is a date. Not just the data type; the kind where it buys your sample IDs dinner and doesn't call the next day. Very unprofessional. We've had conversations. But here we are. The wide format (one column per measurement) is convenient for humans viewing spreadsheets, but it encodes domain knowledge into structure. To understand that the third column represents copper measurements, you need to read the header; the column position itself carries no semantic meaning.&lt;/p&gt;

&lt;p&gt;The unpivot transformation exposes the atomic facts hiding in this structure: this sample, this attribute, this value, at this position. Column names stop being structural constraints and become data values we can query, filter, and join against. Whether a vendor calls it "ph" or "pH_lvl", it's just a string value in &lt;code&gt;lab_provided_attribute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this sense, unpivoted bronze is closer to "raw" than wide bronze. The messiness doesn't disappear; it becomes explicit. Typos appear as queryable data in &lt;code&gt;lab_provided_attribute&lt;/code&gt; rather than as structural variations that break schema assumptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Paradox of Control
&lt;/h3&gt;

&lt;p&gt;The unpivot pattern presents a paradox: by giving up control (accepting any schema), we gain control (one ingestion pattern).&lt;/p&gt;

&lt;p&gt;Part 2's approach tried to control vendor chaos through transformation logic: detect headers, fix typos, sanitize characters, map column names, align to superset schemas. Each transformation attempted to force vendor data into expected structure. The harder we fought for control, the more brittle the system became, as systems do when you try to control chaos through sheer force of will. Around transformation seven, I started empathizing with King Canute trying to command the tide.&lt;/p&gt;

&lt;p&gt;In contrast, the unpivot approach accepts vendor chaos by treating it as data to preserve rather than problems to solve. Bronze doesn't validate column names or fix typos; those are silver layer concerns solved through mapping tables. When vendor schemas change, bronze doesn't break. It just creates different &lt;code&gt;lab_provided_attribute&lt;/code&gt; values, and the mapping tables handle semantic evolution without code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Schemas Are Data, Not Code
&lt;/h3&gt;

&lt;p&gt;Traditional databases treat schemas as structural constraints: the CREATE TABLE statement defines columns, and INSERT statements must conform. The unpivot pattern inverts this. The bronze schema (row_index, column_index, lab_provided_attribute, lab_provided_value) is fixed, but what constitutes valid data is open-ended. Any attribute name is acceptable.&lt;/p&gt;

&lt;p&gt;This inversion has significant implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding vendors becomes a data operation.&lt;/strong&gt; Add rows to vendor_column_mapping; the new vendor's data flows through unchanged code paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema changes become data operations.&lt;/strong&gt; Vendor renames "ph" to "pH_value"? Update vendor_column_mapping. No deployment required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiling schema drift becomes possible.&lt;/strong&gt; Create snapshots that show how the schema is changing over time. Bonus: you can measure just how often vendors send you zany Excel files and compare who sends higher quality data.&lt;/p&gt;
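&lt;p&gt;A sketch of what such profiling might look like over unpivoted records (illustrative only): count how often each (vendor, attribute) pair appears, and drift shows up as new keys between snapshots.&lt;/p&gt;

```python
from collections import Counter

def attribute_profile(records):
    """Count occurrences of each (vendor_id, attribute) pair in bronze records."""
    return Counter(
        (r["vendor_id"], r["lab_provided_attribute"]) for r in records
    )

snapshot = attribute_profile([
    {"vendor_id": "vendor_b", "lab_provided_attribute": "acidity"},
    {"vendor_id": "vendor_b", "lab_provided_attribute": "acidity"},
    {"vendor_id": "vendor_b", "lab_provided_attribute": "pH_value"},  # drift
])
```

&lt;p&gt;Comparing snapshots over time turns "the vendor changed their export again" from a pipeline failure into a diff between two counters.&lt;/p&gt;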

&lt;p&gt;&lt;strong&gt;Testing becomes systematic.&lt;/strong&gt; Test the generic unpivot transformation once; vendor differences are data fixtures, not code paths.&lt;/p&gt;

&lt;p&gt;The cognitive shift is recognizing that vendor quirks are metadata about their export process, not structural properties of the data itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming the Bronze Layer
&lt;/h3&gt;

&lt;p&gt;Part 2 asked: "Is this still a bronze layer?" after watching transformation logic accumulate. The unpivot pattern reclaims bronze simplicity by giving it one job: parse CSV structure and preserve information as position/value pairs. No quality decisions about typos. No business logic about semantics. Just structural transformation from wide to long format. Consequently, we can defer those concerns to the silver layer where they rightfully belong.&lt;/p&gt;

&lt;p&gt;This IS minimal transformation. Values aren't modified; only their organization changes. The transformation is generic (same code for all vendors) and reversible (pivot back using row_index and column_index).&lt;/p&gt;
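&lt;p&gt;To make the reversibility claim concrete, here is a sketch of rebuilding the wide form from long-format records using only &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;column_index&lt;/code&gt; (plain Python; &lt;code&gt;pivot_back&lt;/code&gt; is a hypothetical helper):&lt;/p&gt;

```python
def pivot_back(records):
    """Rebuild the original wide header and rows from long-format
    records; position indices mean no information was lost."""
    header = {}
    rows = {}
    for r in records:
        header[r["column_index"]] = r["lab_provided_attribute"]
        rows.setdefault(r["row_index"], {})[r["column_index"]] = r["lab_provided_value"]
    columns = [header[j] for j in sorted(header)]
    data = [[rows[i][j] for j in sorted(header)] for i in sorted(rows)]
    return columns, data

records = [
    {"row_index": 0, "column_index": 0,
     "lab_provided_attribute": "sample_barcode", "lab_provided_value": "S-001"},
    {"row_index": 0, "column_index": 1,
     "lab_provided_attribute": "ph", "lab_provided_value": "6.5"},
]
columns, data = pivot_back(records)
# columns == ["sample_barcode", "ph"]; data == [["S-001", "6.5"]]
```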

&lt;p&gt;Silver handles semantic complexity through mapping tables. Gold does analysis and aggregation. Each layer has clear boundaries and single responsibilities.&lt;/p&gt;
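&lt;p&gt;A sketch of how silver can consume long-format bronze through a mapping table (the mapping rows and the &lt;code&gt;to_silver&lt;/code&gt; helper are invented for illustration; in practice &lt;code&gt;vendor_analyte_mapping&lt;/code&gt; would be a managed reference table, not a hardcoded dict):&lt;/p&gt;

```python
# Reference data: (vendor, vendor's column name) maps to a standard analyte.
vendor_analyte_mapping = {
    ("vendor_a", "ph"): "ph",
    ("vendor_b", "acidity"): "ph",
    ("vendor_b", "cu_total"): "copper_ppm",
}

def to_silver(bronze_records):
    """Join bronze attributes against the mapping; unmapped names are
    surfaced for data stewards instead of silently dropped."""
    mapped, unmapped = [], []
    for r in bronze_records:
        key = (r["vendor"], r["lab_provided_attribute"])
        if key in vendor_analyte_mapping:
            mapped.append({**r, "analyte": vendor_analyte_mapping[key]})
        else:
            unmapped.append(r)
    return mapped, unmapped

bronze = [
    {"vendor": "vendor_b", "lab_provided_attribute": "acidity",
     "lab_provided_value": "6.5"},
    {"vendor": "vendor_b", "lab_provided_attribute": "zn_total",
     "lab_provided_value": "30"},
]
mapped, unmapped = to_silver(bronze)
# mapped[0]["analyte"] == "ph"; unmapped holds the unrecognized zn_total row
```

&lt;p&gt;Onboarding a new vendor or handling a rename means adding rows to the mapping, not editing this function.&lt;/p&gt;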

&lt;h2&gt;
  
  
  When NOT to Use This Pattern
&lt;/h2&gt;

&lt;p&gt;The unpivot pattern solves specific problems; it's not a universal solution. Recognizing when it adds unnecessary complexity is as important as knowing when it provides value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Important Caveat: This Is an Intermediate Form
&lt;/h3&gt;

&lt;p&gt;The unpivot pattern creates long-format data as an intermediate representation. You don't stop here; silver and gold layers transform this back into analyst-friendly structures (wide tables, aggregations, dimensional models). If you're not building a multi-stage transformation pipeline where bronze feeds silver feeds gold (or marts in other architectural patterns), unpivoting adds unnecessary complexity without delivering its benefits. The pattern makes sense when you have layer separation and further transformations; it's the wrong tool if you need a simple data landing zone for direct consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't Use Unpivot When Schemas Are Stable
&lt;/h3&gt;

&lt;p&gt;If you work with one or two vendors who provide stable schemas that rarely change, the unpivot pattern may be unnecessary overhead. Some people have stable vendor relationships. I hear they exist. When Vendor A's contract specifies that &lt;code&gt;copper_ppm&lt;/code&gt; won't change without notice and you haven't seen schema drift in two years, simpler approaches suffice.&lt;/p&gt;

&lt;p&gt;Similarly, at low volume (say, 500 samples per month from two stable vendors), the unpivot infrastructure might cost more to build and maintain than occasional vendor-specific adjustments. The pattern's benefits scale with schema chaos; if you don't have chaos, you don't need the solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Use Unpivot When Schema Chaos Is Real
&lt;/h3&gt;

&lt;p&gt;Conversely, use the pattern when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have multiple vendors with divergent naming conventions.&lt;/strong&gt; Three vendors calling pH by three different names; vendor-specific if/elif logic is already feeling brittle and hard to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schemas change frequently within vendors.&lt;/strong&gt; Same vendor sends different column sets based on analysis package ordered, or evolves their format quarterly without coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're building for extensibility.&lt;/strong&gt; You expect vendor count to grow, or you're building a platform where schema flexibility is a product feature rather than just a maintenance challenge (e.g., scientific data platforms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need complete provenance.&lt;/strong&gt; Regulatory requirements demand preserving exact column names as received, with full traceability to source files and cells. The unpivot pattern with position tracking provides audit-grade lineage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardization requires domain expertise.&lt;/strong&gt; Mapping between vendor terminologies involves domain knowledge that should be managed by data stewards as reference data, not hardcoded by engineers in transformation logic.&lt;/p&gt;
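&lt;p&gt;On the provenance requirement above: with position tracking, every value can cite the exact file, row, and header it came from. A sketch of that lineage for a single cell (field names like &lt;code&gt;source_file&lt;/code&gt; are assumptions, not the repo's exact schema):&lt;/p&gt;

```python
def lineage(record):
    """Format an audit trail for one bronze cell: which file, which
    row, and which original column header produced this value."""
    return (f"{record['lab_provided_value']} came from "
            f"{record['source_file']} row {record['row_index']}, "
            f"column {record['column_index']} "
            f"(header as received: {record['lab_provided_attribute']!r})")

record = {
    "source_file": "vendor_a_2026-01.csv", "row_index": 14,
    "column_index": 5, "lab_provided_attribute": "date_recieved",
    "lab_provided_value": "2026-01-12",
}
# lineage(record) preserves the typo'd header exactly as the lab sent it
```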

&lt;h3&gt;
  
  
  The Decision Framework
&lt;/h3&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many vendors do you have (current and expected in 2 years)?&lt;/li&gt;
&lt;li&gt;How often do schemas change, and how coordinated are those changes?&lt;/li&gt;
&lt;li&gt;Are you building a multi-stage transformation pipeline with proper layer separation?&lt;/li&gt;
&lt;li&gt;What are your provenance and audit requirements?&lt;/li&gt;
&lt;li&gt;What's the actual cost of schema-related maintenance today?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers suggest high vendor count, frequent uncoordinated schema changes, multi-stage architecture, strong audit needs, and meaningful current maintenance burden, the unpivot pattern likely pays dividends. If they point toward stability, predictability, and simple requirements, simpler approaches might suffice.&lt;/p&gt;

&lt;p&gt;There's no universal answer; architectural decisions require weighing specific context and constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zen of It
&lt;/h2&gt;

&lt;p&gt;In Part 1, we traced how medallion architecture evolved from Kimball's dimensional modeling framework—not replacement, but simplification. Part 2 ended with vendor-specific logic, superset schemas, fuzzy matching, and character sanitization accumulating until I questioned what 'bronze' even meant. The solution wasn't more sophisticated logic; it was remembering Kimball's staging area principle: preserve source structure before imposing semantics. The unpivoted bronze IS that staging area: source structure preserved, with vendor chaos encoded as data rather than fought through transformations.&lt;/p&gt;

&lt;p&gt;By treating column names as data instead of schema, we eliminated brittleness without eliminating complexity. Vendor chaos still exists, but it's no longer a code problem. Column name variations become rows in mapping tables. Schema evolution becomes data updates, not deployments. The complexity moves from scattered if/elif logic into structured dimension tables managed by people who understand vendor semantics.&lt;/p&gt;

&lt;p&gt;This reveals a broader principle: sometimes the elegant solution isn't solving the problem, it's reframing what the problem actually is. Schema chaos looked like a structural problem requiring sophisticated transformation logic. Reframed as a metadata problem, it becomes manageable through configuration.&lt;/p&gt;

&lt;p&gt;The paradoxes stack up: by giving up control (accepting any schema), we gain control (one ingestion path). By preserving more of what vendors send (typos included), we achieve better standardization (explicit mapping, not implicit assumptions). By doing less transformation in bronze, we enable cleaner layer separation.&lt;/p&gt;

&lt;p&gt;Data engineering is about finding the right abstraction level. Too concrete and you drown in special cases. Too abstract and you can't solve actual problems. The unpivot pattern finds the middle ground: generic enough to handle any vendor's wide CSV, specific enough to preserve structure and position.&lt;/p&gt;

&lt;p&gt;The code is simpler. The testing is systematic. The evolution path is clear. That's finding zen in apparent chaos.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete working example:&lt;/strong&gt; The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/003_bronze_silver_unpivot_demo.py" rel="noopener noreferrer"&gt;demo notebook&lt;/a&gt; processes 11 vendor files (clean, messy, typos, Excel nightmares) using the patterns described in this post. &lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
    </item>
    <item>
      <title>When Bronze Goes Rogue: Schema Chaos in the Wild</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 28 Jan 2026 05:04:14 +0000</pubDate>
      <link>https://forem.com/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf</link>
      <guid>https://forem.com/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil"&gt;Part 1&lt;/a&gt;, we explored the Medallion Architecture with clean, well-behaved vendor data. The bronze layer simply landed the raw CSV files. The silver layer standardized measurement names. The gold layer aggregated for analysis. Everything worked beautifully.&lt;/p&gt;

&lt;p&gt;Then reality arrived.&lt;/p&gt;

&lt;p&gt;This post demonstrates what happens when vendor CSV files exhibit the full spectrum of real-world data quality issues. We'll watch the bronze layer transform from "just land the data" into an increasingly complex series of transformations, vendor-specific logic, and fragile workarounds. By the end, we'll be asking uncomfortable questions about what "bronze" actually means.&lt;/p&gt;

&lt;p&gt;The code examples are adapted from a &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/pybay_presentation_2025-10.py" rel="noopener noreferrer"&gt;conference talk I gave at PyBay 2025&lt;/a&gt;, where I walked through this exact problem progression live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: Different Column Names for the Same Measurements
&lt;/h2&gt;

&lt;p&gt;Vendor A and Vendor B measure the same soil properties. Both provide pH, copper concentration, and zinc concentration. Their CSV files look like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor A schema:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vendor B schema:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,acidity,cu_total,zn_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same measurements. Different names. Vendor B calls pH "acidity." They use chemical symbols with &lt;code&gt;_total&lt;/code&gt; suffixes instead of element names with &lt;code&gt;_ppm&lt;/code&gt; suffixes.&lt;/p&gt;

&lt;p&gt;This is not a data quality problem. This is a legitimate difference in how two professional laboratories name their measurements. (Although pedantically you might wonder about a chemistry lab that thinks pH and acidity are the same thing.) Both schemas are internally consistent and well-documented. The challenge is ours: we need both vendors' data in the same bronze table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 1 (Add Vendor-Specific Column Mapping)
&lt;/h3&gt;

&lt;p&gt;We create a standardization function for each vendor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Vendor A column names are already standard&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Map Vendor B columns to standard names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;           &lt;span class="c1"&gt;# Different name
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cu_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Different name
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zn_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;# Different name
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze ingestion now includes a vendor detection step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with vendor-specific column mapping&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific standardization
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. We can now query both vendors' data using consistent column names. The bronze layer contains standardized schemas.&lt;/p&gt;

&lt;p&gt;But we just added vendor-specific business logic to what was supposed to be a raw data landing zone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: Schema Instability Within the Same Vendor
&lt;/h2&gt;

&lt;p&gt;The vendor-specific mapping holds up until Vendor A sends a new file. Our ingestion pipeline fails with a schema mismatch error. Examining the file reveals that Vendor A now includes additional analytes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor A - Basic package (what we had):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vendor A - Metals package (what we just received):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm,lead_ppm,iron_ppm,manganese_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema changes based on which analysis package the customer ordered. Sometimes they order basic soil testing. Sometimes they add heavy metals analysis. The vendor includes only the columns relevant to what was tested.&lt;/p&gt;

&lt;p&gt;This is also not a data quality problem. Including only requested measurements is reasonable and reduces file size. But it breaks our bronze table schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 2 (Create Superset Schema)
&lt;/h3&gt;

&lt;p&gt;The solution requires accommodating all possible variations. We create a superset schema containing all possible columns from all analysis packages. When ingesting files with fewer columns, we add NULL values for missing measurements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Add missing columns as NULL to match superset schema&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Define superset of all possible columns for this vendor
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;all_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# basic package
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iron_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manganese_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# metals package
&lt;/span&gt;            &lt;span class="c1"&gt;# ... more columns as we discover them
&lt;/span&gt;        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Add missing columns as NULL
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Reorder columns to match superset
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our bronze ingestion tracks package types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with superset schema alignment&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. But now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our bronze table is sparse (most columns are NULL for most rows)&lt;/li&gt;
&lt;li&gt;We must maintain a master list of all possible columns for each vendor&lt;/li&gt;
&lt;li&gt;Adding new analytes requires code changes&lt;/li&gt;
&lt;li&gt;We can't distinguish between "wasn't measured" and "measurement failed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bronze layer is accumulating knowledge about vendor schemas and business rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: Typos in Column Headers
&lt;/h2&gt;

&lt;p&gt;Our superset schema handles varying column sets, but the next issue reveals a different category of problem. A file from Vendor A fails to parse correctly. Examining the raw CSV, we find:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcod,lab_id,date_recieved,date_proccessed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three typos: &lt;code&gt;sample_barcod&lt;/code&gt; (missing 'e'), &lt;code&gt;date_recieved&lt;/code&gt; (i before e), &lt;code&gt;date_proccessed&lt;/code&gt; (double c). The vendor's export system mangles column names. Occasionally.&lt;/p&gt;

&lt;p&gt;These files are otherwise valid. The data values are correct. Only the header row has issues. Rejecting these files would delay processing by days while we contact the vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 3 (Add Fuzzy Column Matching)
&lt;/h3&gt;

&lt;p&gt;Rejecting files creates unacceptable delays, so we implement fuzzy matching to handle common typos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fix common typos in column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;column_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;col_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for common typos
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reciev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccessed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccessed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;

        &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_col&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
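&lt;p&gt;One wrinkle worth flagging: the membership checks above run against &lt;code&gt;col_lower&lt;/code&gt;, but the replacements run against the original-case &lt;code&gt;col&lt;/code&gt;, so a header like Vendor B's &lt;code&gt;Date_Proccessed&lt;/code&gt; passes the check yet comes back unchanged. A case-insensitive variant (a sketch, not the pipeline's actual helper) can lean on &lt;code&gt;re.sub&lt;/code&gt;:&lt;/p&gt;

```python
import re

# Hypothetical case-insensitive variant of the typo fixes above.
# Longer patterns come first so "proccessed" wins over "proccess";
# the lookahead mirrors the "barcod but not barcode" guard.
TYPO_PATTERNS = [
    (re.compile(r"recieved", re.IGNORECASE), "received"),
    (re.compile(r"proccessed", re.IGNORECASE), "processed"),
    (re.compile(r"proccess", re.IGNORECASE), "process"),
    (re.compile(r"barcod(?!e)", re.IGNORECASE), "barcode"),
]


def fix_typos_case_insensitive(name):
    """Apply each typo pattern regardless of the header's casing."""
    for pattern, replacement in TYPO_PATTERNS:
        name = pattern.sub(replacement, name)
    return name
```

&lt;p&gt;Note the replacements normalize the matched fragment to lowercase (&lt;code&gt;Date_Proccessed&lt;/code&gt; becomes &lt;code&gt;Date_processed&lt;/code&gt;), which is harmless here since the vendor-specific standardization downstream renames columns anyway.&lt;/p&gt;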



&lt;p&gt;Our bronze ingestion grows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with fuzzy column matching&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. But we're now making quality decisions about what constitutes an acceptable typo. We're interpreting intent. The bronze layer is no longer just landing raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: Excel Nightmares
&lt;/h2&gt;

&lt;p&gt;Vendor B sends a file that completely breaks our parser. Opening it in a text editor reveals the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Contact:,lab@testing.com,"","","","","","","","","","","","","","","","",""
Generated:,2024-10-15,"","","","","","","","","","","","","","","","",""
Lab Name:,Premium Soil Testing,"","","","","","","","","","","","","","","","",""
Sampl_Barcode,lab_id,DATE_RECEIVED,Date_Proccessed,acidity,cu_totl,ZN_TOTL,pb_total,fe_total,Mn_Totl,b_total,mo_totl,ec_ms_cm,Organic_Carbon_Pct,"","","","",""
PYB2475-266277,AT6480 68463,2024-05-12,2024-05-19,6.46,6.63,29.5,4.22,103.,3.56,0.759,0.186,1.44,0.30,"","","","",""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three metadata rows precede the actual header, and every row is padded with trailing empty fields. Both are telltale signs of an Excel export where someone navigated beyond the data range and accidentally pressed enter before saving.&lt;/p&gt;

&lt;p&gt;The actual data is fine. The measurements are valid. We just need to skip the metadata rows and ignore the empty columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 4 (Add Header Detection and Column Filtering)
&lt;/h3&gt;

&lt;p&gt;We implement header detection to skip metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find the row that looks like a header (most non-null values)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;max_non_null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;non_null_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_null_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_non_null&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;max_non_null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;non_null_count&lt;/span&gt;
            &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
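&lt;p&gt;The heuristic is easier to sanity-check outside Spark. The same most-non-empty-fields-wins logic in plain Python (a standalone sketch over a trimmed stand-in for the Vendor B file, not pipeline code) picks row index 3, because the strict comparison keeps the first row that reaches the maximum count:&lt;/p&gt;

```python
import csv
import io


def detect_header_index(csv_text, sample_rows=10):
    """Return the index of the first row with the most non-empty fields."""
    header_idx, max_non_empty = None, 0
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if i >= sample_rows:
            break
        non_empty = sum(1 for field in row if field.strip())
        if non_empty > max_non_empty:  # strict ">" keeps the earliest winner
            max_non_empty = non_empty
            header_idx = i
    return header_idx


# Trimmed stand-in for the Vendor B export: three metadata rows, then the header.
vendor_b_sample = (
    'Contact:,lab@testing.com,"",""\n'
    'Generated:,2024-10-15,"",""\n'
    'Lab Name:,Premium Soil Testing,"",""\n'
    'Sampl_Barcode,lab_id,DATE_RECEIVED,Date_Proccessed\n'
    'PYB2475-266277,AT6480 68463,2024-05-12,2024-05-19\n'
)
```

&lt;p&gt;The data row ties the header on field count, but the tie-break toward the earliest row is exactly what we want for this file shape.&lt;/p&gt;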



&lt;p&gt;We filter out empty columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove columns with empty names or all empty values&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cols_to_keep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  &lt;span class="c1"&gt;# Has a name
&lt;/span&gt;            &lt;span class="c1"&gt;# Check if column has any non-empty values
&lt;/span&gt;            &lt;span class="n"&gt;non_empty_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_empty_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;cols_to_keep&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cols_to_keep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And implement re-reading from the correct header position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Re-read CSV file with header at a specific row index&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read the entire file as text, skip to header row
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the header row
&lt;/span&gt;    &lt;span class="n"&gt;header_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;new_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header_row&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Skip rows before header and use header_idx row as column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Filter to rows after header
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;monotonically_increasing_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns using detected header
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;col_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
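&lt;p&gt;The two-pass idea is easier to see on a local file with the &lt;code&gt;csv&lt;/code&gt; module (a sketch with a trimmed stand-in file, not the pipeline's Spark path): pick the detected row as the header, drop the unnamed padding columns, and pair every later row against the surviving names.&lt;/p&gt;

```python
import csv
import io

# Trimmed stand-in for the Vendor B export (metadata rows, padded fields).
VENDOR_B_TEXT = (
    'Contact:,lab@testing.com,"",""\n'
    'Generated:,2024-10-15,"",""\n'
    'Lab Name:,Premium Soil Testing,"",""\n'
    'Sampl_Barcode,lab_id,acidity,""\n'
    'PYB2475-266277,AT6480 68463,6.46,""\n'
)


def reread_with_header(csv_text, header_idx):
    """Treat row header_idx as the header, keep only named columns,
    and return the remaining rows as dicts keyed by those names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [name for name in rows[header_idx] if name.strip()]
    # zip stops at the shorter sequence, so padding fields fall away
    return [dict(zip(header, row)) for row in rows[header_idx + 1:]]
```

&lt;p&gt;Because &lt;code&gt;zip&lt;/code&gt; truncates to the named columns, the trailing empty fields never make it into the records, which is the same effect the Spark version achieves with the rename loop plus &lt;code&gt;remove_empty_columns&lt;/code&gt;.&lt;/p&gt;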



&lt;p&gt;The bronze ingestion continues to grow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with header detection and column filtering&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV without assuming first row is header
&lt;/span&gt;    &lt;span class="n"&gt;df_peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect where the real header is
&lt;/span&gt;    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_peek&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Re-read with correct header
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove empty columns
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze layer now includes heuristics for detecting valid data. We're making educated guesses about file structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 5: Database-Hostile Column Names
&lt;/h2&gt;

&lt;p&gt;Vendor B's files sometimes include special characters in column names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#sample_id,lab_id,organic_matter%,cu-total,zn-total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;#&lt;/code&gt; prefix, &lt;code&gt;%&lt;/code&gt; suffix, and hyphens require backtick escaping in SQL queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;`#sample_id`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`organic_matter%`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`cu-total`&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vendor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vendor_b'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every analyst who touches this data must remember the escaping rules. Queries become brittle and harder to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 5 (Add Character Sanitization)
&lt;/h3&gt;

&lt;p&gt;We sanitize column names to be database-friendly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanitize_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove invalid database characters from column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;column_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Remove #, convert % to _pct, replace - with _
&lt;/span&gt;        &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_col&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
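&lt;p&gt;As a quick check, the same three substitutions in plain Python map Vendor B's hostile header to query-friendly names (a sketch only; the pipeline applies them through &lt;code&gt;withColumnRenamed&lt;/code&gt;):&lt;/p&gt;

```python
def sanitize_name(name):
    """Strip '#', turn '%' into '_pct', and turn '-' into '_'."""
    return name.replace("#", "").replace("%", "_pct").replace("-", "_")


# Vendor B's header from the example above
hostile = ["#sample_id", "lab_id", "organic_matter%", "cu-total", "zn-total"]
sanitized = [sanitize_name(name) for name in hostile]
```

&lt;p&gt;Every resulting name is a plain identifier, so the backticks disappear from downstream SQL.&lt;/p&gt;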



&lt;p&gt;The complete bronze ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer: The final form&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV to detect header
&lt;/span&gt;    &lt;span class="n"&gt;df_peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect and skip metadata rows
&lt;/span&gt;    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_peek&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Re-read with correct header position
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove empty columns
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sanitize database characters
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sanitize_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight transformation steps. Vendor-specific logic branches. Fuzzy matching heuristics. Schema knowledge. Quality decisions.&lt;/p&gt;

&lt;p&gt;This was supposed to be "just land the raw data."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Questions
&lt;/h2&gt;

&lt;p&gt;We started with a simple bronze layer that read CSV files and wrote them to a table. We now have a complex ingestion pipeline that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Applies business logic:&lt;/strong&gt; Vendor-specific column mapping encodes knowledge about what measurements mean across different schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes quality decisions:&lt;/strong&gt; Fuzzy matching determines which typos are acceptable and how to fix them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interprets structure:&lt;/strong&gt; Header detection guesses where real data begins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modifies data:&lt;/strong&gt; Character sanitization changes the raw column names we received&lt;/li&gt;
&lt;/ol&gt;
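&lt;p&gt;To make the second point concrete, here is a minimal sketch of the quality decision hiding inside fuzzy matching, using Python's standard-library &lt;code&gt;difflib&lt;/code&gt;. The canonical column list and the cutoff value are illustrative, not the exact values from the pipeline:&lt;/p&gt;

```python
import difflib

# Canonical column names the pipeline expects (illustrative subset)
CANONICAL_COLUMNS = [
    "sample_barcode", "lab_id", "date_received",
    "date_processed", "ph", "copper_ppm", "zinc_ppm",
]

def match_column(raw_name, cutoff=0.8):
    """Return the closest canonical name, or None if nothing clears the cutoff.

    The cutoff IS the quality decision: lower it and unrelated columns start
    matching; raise it and real typos slip through unmapped.
    """
    matches = difflib.get_close_matches(
        raw_name.lower(), CANONICAL_COLUMNS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

&lt;p&gt;A transposed character still maps cleanly: &lt;code&gt;match_column("sampel_barcode")&lt;/code&gt; returns &lt;code&gt;"sample_barcode"&lt;/code&gt;, while an unknown measurement like &lt;code&gt;"turbidity_ntu"&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; and has to be handled some other way. Somebody chose that threshold, and that choice is business logic.&lt;/p&gt;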

&lt;p&gt;Is this still a "bronze layer"? The Medallion Architecture describes bronze as raw data with minimal transformation. We're well beyond minimal.&lt;/p&gt;

&lt;p&gt;What happens when Vendor C arrives? We add more column mappings to the function, another branch in the if/elif chain, and hope their quirks don't conflict with the assumptions we've baked into our existing logic. And how do we decide what the canonical name for each column should be?&lt;/p&gt;

&lt;p&gt;How do we test this? We need sample files for every vendor, every analysis package, every combination of issues. The test matrix grows exponentially.&lt;/p&gt;
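&lt;p&gt;The growth is easy to see by enumerating the fixture dimensions. The specific quirk list here is illustrative:&lt;/p&gt;

```python
import itertools

# Illustrative fixture dimensions; each new vendor, package, or quirk
# multiplies the number of sample files we need to maintain
VENDORS = ["vendor_a", "vendor_b"]
PACKAGES = ["basic", "extended"]
QUIRKS = ["clean", "typo_headers", "metadata_rows", "empty_columns"]

FIXTURE_MATRIX = list(itertools.product(VENDORS, PACKAGES, QUIRKS))
# 2 vendors x 2 packages x 4 quirks = 16 fixture files already,
# before vendor C or any combination of quirks enters the picture
```

&lt;p&gt;With a test runner like pytest, each tuple would become a parametrized test case, and the suite grows every time a vendor invents a new quirk.&lt;/p&gt;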

&lt;p&gt;We haven't addressed date format differences, unit conversions, vendor-specific codes, or the dozens of other variations we'll encounter as more vendors join the system.&lt;/p&gt;

&lt;p&gt;The bronze layer has gotten away from us.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Would You Manage This Complexity?
&lt;/h2&gt;

&lt;p&gt;Before we explore solutions in the next post, consider how you would handle this problem in your own systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Would you continue adding transformation logic to the bronze layer until it handles every edge case?&lt;/li&gt;
&lt;li&gt;Would you reject files that don't conform to expected formats and force vendors to fix their exports?&lt;/li&gt;
&lt;li&gt;Would you build a configuration system where new vendor quirks can be added without code changes?&lt;/li&gt;
&lt;li&gt;Would you accept some data quality issues and handle them downstream in the silver layer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each approach has tradeoffs. Adding more transformations makes the bronze layer complex and fragile. Rejecting files delays processing and frustrates vendors and users of the data alike. Configuration systems add their own complexity. Pushing problems downstream just moves the pain to a different layer.&lt;/p&gt;
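&lt;p&gt;For a sense of what the configuration option might look like, here is a minimal sketch. The vendor column maps below are invented for illustration, not the mappings used in this series:&lt;/p&gt;

```python
# Hypothetical per-vendor configuration: onboarding vendor C means adding
# an entry here instead of another branch in an if/elif chain
VENDOR_COLUMN_MAPS = {
    "vendor_a": {"cu_ppm": "copper_ppm", "zn_ppm": "zinc_ppm"},
    "vendor_b": {"copper": "copper_ppm", "zinc": "zinc_ppm", "acidity": "ph"},
}

def standardize_columns(columns, vendor_name):
    """Rename raw column names via the vendor's config, passing unknowns through."""
    mapping = VENDOR_COLUMN_MAPS.get(vendor_name, {})
    return [mapping.get(col, col) for col in columns]
```

&lt;p&gt;The complexity hasn't disappeared, though; it has moved into configuration data that someone still has to maintain and validate.&lt;/p&gt;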

&lt;p&gt;What if the fundamental problem is that we're treating column names as schema when they should be treated as data?&lt;/p&gt;

&lt;p&gt;In the next post, we'll explore this alternative. Instead of fighting schema chaos with increasingly complex transformations, we'll embrace it. We'll examine how a single transformation applied to all vendors can replace vendor-specific logic, superset schemas, fuzzy matching, and header detection with something simpler and more robust.&lt;/p&gt;

&lt;p&gt;The solution involves questioning what "raw" actually means.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Coming Soon:&lt;/strong&gt; Part 3 - The Zen of the Bronze Layer: Embracing Schema Chaos&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; All examples are available in the &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;zen_bronze_data repository&lt;/a&gt;. The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/pybay_presentation_2025-10.py" rel="noopener noreferrer"&gt;PyBay presentation notebook&lt;/a&gt; contains runnable versions of these transformations.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
    </item>
    <item>
      <title>Medallion Architecture 101: Building Data Pipelines That Don't Fall Apart</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Fri, 23 Jan 2026 09:10:19 +0000</pubDate>
      <link>https://forem.com/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil</link>
      <guid>https://forem.com/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil</guid>
      <description>&lt;p&gt;Medallion architecture appears everywhere in modern data engineering. Bronze, Silver, Gold. Raw data, refined data, analytics-ready data. Every Databricks tutorial mentions it. Every lakehouse pitch deck includes the diagram. Every "modern data stack" blog post treats it as gospel.&lt;/p&gt;

&lt;p&gt;Here's the part most people skip: the core ideas trace back to Ralph Kimball's data warehouse methodology from the 1990s. Kimball advocated for staging areas that preserved raw data, integration zones for applying business logic, and dimensional models for analytics delivery. His framework included thirty-four subsystems covering everything from data extraction to audit dimension management. The medallion pattern distills these principles into something clearer and more actionable: three layers with distinct responsibilities.&lt;/p&gt;

&lt;p&gt;This evolution represents genuine improvement, not just rebranding. Kimball's warehouse architecture was comprehensive but complex, designed for batch ETL in relational databases. Medallion architecture preserves the wisdom about data quality and layer separation while adapting to lakehouse capabilities like schema evolution and streaming ingestion. We kept what worked and simplified what didn't. Bronze gets you on the podium. Silver refines your performance. Gold takes the championship.&lt;/p&gt;

&lt;p&gt;The pattern itself is straightforward. Bronze layers ingest raw data with minimal transformation, preserving source formats and tracking metadata about where data originated. Silver layers apply business logic, cleaning and standardizing data into queryable formats. Gold layers deliver analytics-ready datasets, pre-aggregated and optimized for dashboards and reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg777vfpx285y5ro3yzye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg777vfpx285y5ro3yzye.png" alt="Medallion diagram" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This series starts with the ideal case. Clean vendor files, stable schemas, cooperative data sources. We'll implement proper medallion architecture with metadata tracking and layer discipline. Then reality arrives. Post two explores what happens when vendors send chaos instead of clean CSVs: typos in headers, unstable schemas, Excel nightmares that make you question your career choices. Post three reveals an elegant solution that handles the chaos without drowning in vendor-specific transformation code.&lt;/p&gt;

&lt;p&gt;But first, the foundation. Let's build medallion architecture the right way.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;THE EVOLUTION&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kimball's Data Warehouse (1996) → Medallion Lakehouse (2020)

34 subsystems         → 3 layers (Bronze, Silver, Gold)
Staging Area          → Bronze (raw ingestion, preserve source)
Integration Zone      → Silver (cleaned, standardized, queryable)
Dimensional Delivery  → Gold (aggregated, analytics-ready)

Batch ETL             → Streaming + batch
Relational warehouses → Cloud lakehouses

Same core wisdom. Simpler execution.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Understanding the Layers
&lt;/h2&gt;

&lt;p&gt;Understanding medallion architecture requires understanding separation of concerns. Each layer serves one purpose. Bronze preserves raw data exactly as received, adding only metadata for tracking and auditability. Silver applies business rules and standardization, transforming raw inputs into consistent, queryable formats. Gold optimizes for specific analytical use cases, pre-aggregating and structuring data for dashboards, reports, and data science workflows.&lt;/p&gt;

&lt;p&gt;The discipline matters more than the metaphor. Resist the temptation to "just quickly clean this in bronze" or "add one small aggregation to silver." Each compromise weakens the architecture. Bronze becomes unpredictable when transformations creep in. Silver becomes cluttered when analytics logic appears. Gold loses focus when it tries to serve every possible use case. Maintain clear boundaries between layers, and the entire pipeline becomes easier to debug, test, and extend.&lt;/p&gt;

&lt;p&gt;Enough concepts. Let's implement this with actual code and actual data. We'll work with a clean vendor CSV file with stable schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case
&lt;/h2&gt;

&lt;p&gt;Biotech companies frequently work with contract research organizations to perform specialized laboratory measurements. These external labs analyze samples and return results as CSV or Excel files. Each vendor has its own file format, column naming conventions, and data delivery schedules. Our task is to ingest this vendor data and make it available for analysis while maintaining data quality and traceability. You can follow along with a Databricks notebook on GitHub &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/001_medallion_demo.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our scenario involves a soil chemistry lab that measures pH levels and metal concentrations. They send results as CSV files with one row per sample. Each file contains sample identifiers, lab batch information, collection and processing dates, and measurement results. For this first post, we're working with the ideal case: clean data, stable schema, consistent formatting.&lt;/p&gt;

&lt;p&gt;Here's what the vendor file looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
PYB6134-404166,PSL-73 72846,2024-02-02,2024-02-08,6.74,10.7,5.23
PYB8638-328304,PSL-77 74041,2024-10-11,2024-10-17,6.43,6.34,5.64
PYB7141-642256,PSL-82 22558,2024-08-28,2024-09-03,5.58,3.64,39.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's ingest this through our medallion architecture, tracking metadata at each stage and maintaining proper layer separation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bronze Layer: Raw Ingestion with Metadata
&lt;/h2&gt;

&lt;p&gt;The bronze layer preserves raw data with minimal transformation. The challenge lies in tracking provenance without corrupting the source data itself. Every record needs an audit trail: which file did this come from, when did it arrive, which vendor sent it. This metadata proves essential when data quality issues appear downstream or when business users question analytical results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Tracking Patterns
&lt;/h3&gt;

&lt;p&gt;There are several approaches to tracking ingestion metadata, each with different trade-offs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded metadata&lt;/strong&gt; adds provenance columns directly to the bronze table. Every row carries its source file path, ingestion timestamp, and vendor identifier. This approach is simple to implement and query since no joins are needed to trace a record back to its source. The trade-off is that file-level information repeats across every row. In practice, columnar storage formats like Parquet and Delta compress these repetitive string values efficiently, making the storage overhead minimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate ingestion table&lt;/strong&gt; maintains a dedicated &lt;code&gt;ingestion_metadata&lt;/code&gt; table with a surrogate key (like a &lt;code&gt;batch_id&lt;/code&gt;). Bronze rows reference this key, keeping the data table compact. Need to analyze ingestion patterns or identify missing vendor deliveries? Query the lightweight metadata table without scanning data. Need complete lineage for specific records? Join the tables. This pattern scales better for production systems with high ingestion volumes and complex monitoring requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data vault-style through table&lt;/strong&gt; uses a link or satellite table to associate ingestion batches with data records. This supports many-to-many relationships: a record appearing across multiple loads, or a single load touching multiple target tables. This is the most flexible pattern for complex lineage scenarios but adds architectural complexity.&lt;/p&gt;

&lt;p&gt;For this demonstration, we use embedded metadata. It keeps the focus on medallion layer mechanics without introducing additional tables. Production systems with high data volumes or complex ingestion monitoring should consider the separate table or data vault patterns.&lt;/p&gt;
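&lt;p&gt;As a sketch of the separate-table pattern, here is the surrogate-key idea in plain Python structures. In production these would be two Delta tables joined on &lt;code&gt;batch_id&lt;/code&gt;; the field names are illustrative:&lt;/p&gt;

```python
from datetime import datetime, timezone
from uuid import uuid4

def new_ingestion_batch(vendor, file_name):
    """One metadata record per ingestion batch, keyed by a surrogate batch_id."""
    return {
        "batch_id": str(uuid4()),
        "vendor": vendor,
        "source_file": file_name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

batch = new_ingestion_batch("vendor_a", "vendor_a_basic_clean.csv")

# Bronze rows carry only the surrogate key instead of repeating
# file-level metadata on every row
bronze_rows = [
    {"sample_barcode": "PYB6134-404166", "ph": "6.74", "batch_id": batch["batch_id"]},
    {"sample_barcode": "PYB8638-328304", "ph": "6.43", "batch_id": batch["batch_id"]},
]
```

&lt;p&gt;Monitoring queries hit only the small batch table; full lineage for a specific record is one join away.&lt;/p&gt;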

&lt;h3&gt;
  
  
  Ingesting with Databricks' &lt;code&gt;_metadata&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Databricks exposes file-level metadata automatically through a special &lt;code&gt;_metadata&lt;/code&gt; struct column. By including it in our select, we get the source file path, file name, and modification time without any manual tracking. We supplement this with business-level metadata: a unique ingestion ID, the data source identifier, a timestamp, and a row position for ordering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Read vendor CSV, including Databricks file metadata
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add ingestion metadata (the ONLY transformation in bronze)
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract from Databricks' _metadata struct
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_modified_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_modification_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Add business metadata
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_row_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonically_increasing_id&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze table now contains both the raw data and its provenance trail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode | lab_id       | date_received | ... | source_file_path              | source_file_name           | file_modified_at     | ingestion_id                          | data_source | ingested_at          | file_row_number
PYB6134-404166 | PSL-73 72846 | 2024-02-02   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 0
PYB8638-328304 | PSL-77 74041 | 2024-10-11   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 1
PYB7141-642256 | PSL-82 22558 | 2024-08-28   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what we did and didn't do. We loaded the data into a SQL-queryable table and added metadata so any record can be traced back to its source file and ingestion event. We did not convert data types (everything remains strings), rename columns, validate values, or apply business logic. Bronze is about preservation, not transformation. When someone's dashboard breaks downstream, the metadata lets us walk backward through the entire pipeline to find where problems originated.&lt;/p&gt;
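&lt;p&gt;For example, tracing a suspect record back to its delivery is a single filter on the embedded metadata columns. This is a hypothetical lookup against the table written above; the catalog and schema names are placeholders:&lt;/p&gt;

```python
# Placeholder names matching the earlier write; adjust to your workspace
catalog = "main"
bronze_schema = "bronze"
suspect_barcode = "PYB7141-642256"

# Which file delivered this sample, and when was it ingested?
lineage_query = f"""
SELECT sample_barcode, source_file_name, file_modified_at, ingestion_id, ingested_at
FROM {catalog}.{bronze_schema}.vendor_a_samples_raw
WHERE sample_barcode = '{suspect_barcode}'
"""
# On Databricks: spark.sql(lineage_query).show()
```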

&lt;h2&gt;
  
  
  Silver Layer: Cleaning and Standardization
&lt;/h2&gt;

&lt;p&gt;The silver layer applies business logic and standardization. Here we convert data types, apply semantic renaming, and transform raw vendor formats into consistent structures that downstream users can query reliably. Silver is where we enforce data contracts and catch quality issues before they propagate to analytics.&lt;/p&gt;

&lt;p&gt;Our vendor data needs several transformations. The date columns arrive as strings and need conversion to proper date types. The numeric measurements remain strings from CSV parsing and require casting to doubles. We also apply minor semantic renaming: &lt;code&gt;data_source&lt;/code&gt; becomes &lt;code&gt;vendor_name&lt;/code&gt; to better reflect its business meaning, and we rename the raw &lt;code&gt;_metadata&lt;/code&gt; struct to &lt;code&gt;databricks_ingestion_metadata&lt;/code&gt; for clarity. The original measurement column names (&lt;code&gt;ph&lt;/code&gt;, &lt;code&gt;copper_ppm&lt;/code&gt;, &lt;code&gt;zinc_ppm&lt;/code&gt;) are already clear and descriptive, so we keep them as-is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DateType&lt;/span&gt;

&lt;span class="c1"&gt;# Read from bronze
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply transformations
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_bronze&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnsRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks_ingestion_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse dates
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Cast measurement types
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="c1"&gt;# Add processing timestamp
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to silver
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_cleaned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The silver table now contains properly typed columns with semantic naming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode | lab_id       | date_received | date_processed | ph   | copper_ppm | zinc_ppm | ... | vendor_name | ingestion_id                          | file_row_number | silver_processed_at
PYB6134-404166 | PSL-73 72846 | 2024-02-02   | 2024-02-08     | 6.74 | 10.7       | 5.23     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 0               | 2024-12-16 10:24:15
PYB8638-328304 | PSL-77 74041 | 2024-10-11   | 2024-10-17     | 6.43 | 6.34       | 5.64     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 1               | 2024-12-16 10:24:15
PYB7141-642256 | PSL-82 22558 | 2024-08-28   | 2024-09-03     | 5.58 | 3.64       | 39.8     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 2               | 2024-12-16 10:24:15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Silver serves multiple purposes beyond type conversion. Production implementations often augment data through joins with reference tables, split datasets into separate dimension and fact tables following dimensional modeling patterns, apply complex business calculations, or enrich records with derived attributes. This simple example focuses on basic standardization to establish the pattern. Later posts will explore more sophisticated transformations as our architecture evolves to handle real-world complexity.&lt;/p&gt;

&lt;p&gt;Notice we preserved the ingestion identifier and file row number. These lineage columns allow us to trace any silver record back to its bronze source and ultimately to the original vendor file. This becomes essential when data quality issues emerge or when business users question specific values. We can walk backward through the entire pipeline to find where problems originated.&lt;/p&gt;
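&lt;p&gt;To make the trace-back concrete, here's a minimal sketch in plain Python, with dicts standing in for table rows. The &lt;code&gt;ingestion_id&lt;/code&gt; value echoes the hypothetical one in the silver output above, and the &lt;code&gt;source_file&lt;/code&gt; field is illustrative; in practice this lookup would be a join or filter against the bronze table on the same two columns:&lt;/p&gt;

```python
# Sketch: walk a silver record back to its bronze source row using the
# preserved lineage columns. Plain dicts stand in for table rows here;
# the source_file field is a hypothetical bronze provenance column.

bronze_rows = [
    {"ingestion_id": "a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e",
     "file_row_number": 2, "source_file": "vendor_a_2024.csv"},
]

def trace_to_bronze(silver_record: dict, bronze: list[dict]) -> dict:
    """Find the bronze row a silver record came from via its lineage columns."""
    return next(
        row for row in bronze
        if row["ingestion_id"] == silver_record["ingestion_id"]
        and row["file_row_number"] == silver_record["file_row_number"]
    )

# Say a business user questions that 39.8 zinc reading from the silver
# sample output; the lineage columns take us straight to its origin.
suspect = {"ingestion_id": "a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e",
           "file_row_number": 2, "zinc_ppm": 39.8}
print(trace_to_bronze(suspect, bronze_rows)["source_file"])  # vendor_a_2024.csv
```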

&lt;p&gt;The silver layer also serves as a natural place for data quality checks. We could add validation rules here: pH values should fall between 0 and 14, concentration values should be non-negative, processing dates should occur after received dates. For this simple example we're skipping validation, but production implementations would include these checks.&lt;/p&gt;
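&lt;p&gt;As a sketch of what those rules look like, here they are in plain Python. A production pipeline would express them as DataFrame filters or declarative expectations rather than row-by-row dict checks, but the logic is the same:&lt;/p&gt;

```python
# Minimal sketch of the validation rules described above, applied to one
# silver record at a time. Field names follow the silver schema from the
# examples in this post.
from datetime import date

def validate_sample(record: dict) -> list[str]:
    """Return a list of rule violations for one silver record."""
    errors = []
    if not 0 <= record["ph"] <= 14:
        errors.append("ph out of range [0, 14]")
    for field in ("copper_ppm", "zinc_ppm"):
        if record[field] < 0:
            errors.append(f"{field} is negative")
    if record["date_processed"] < record["date_received"]:
        errors.append("processed before received")
    return errors

sample = {
    "ph": 6.74,
    "copper_ppm": 10.7,
    "zinc_ppm": 5.23,
    "date_received": date(2024, 2, 2),
    "date_processed": date(2024, 2, 8),
}
print(validate_sample(sample))  # []
print(validate_sample({**sample, "ph": 15.2}))  # ['ph out of range [0, 14]']
```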

&lt;h2&gt;
  
  
  Gold Layer: Analytics-Ready Aggregations
&lt;/h2&gt;

&lt;p&gt;The gold layer delivers pre-aggregated datasets optimized for specific analytical use cases. Rather than forcing analysts to repeatedly write the same aggregation queries against silver, we materialize common patterns. This improves query performance and ensures consistent metric definitions across dashboards and reports.&lt;/p&gt;

&lt;p&gt;For our soil sample data, we'll create a monthly summary table that aggregates measurements by vendor. This serves common analytical questions: how many samples did each vendor deliver per month? What were the average pH and concentration levels? Did measurements vary significantly across time periods?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Read from silver
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_cleaned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create monthly summary aggregations
&lt;/span&gt;&lt;span class="n"&gt;df_gold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_silver&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stddev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stddev_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to gold
&lt;/span&gt;&lt;span class="n"&gt;df_gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gold_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.monthly_vendor_a_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gold table provides quick access to monthly summary statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;month_start | vendor_name | sample_count | avg_ph | stddev_ph | min_ph | max_ph | avg_copper_ppm | avg_zinc_ppm | gold_processed_at
2024-02-01  | vendor_a    | 5            | 6.38   | 0.53      | 5.46   | 7.06   | 5.37           | 8.31         | 2024-12-16 10:25:30
2024-03-01  | vendor_a    | 4            | 6.57   | 0.57      | 5.71   | 7.54   | 3.76           | 13.5         | 2024-12-16 10:25:30
2024-08-01  | vendor_a    | 5            | 6.21   | 0.42      | 5.58   | 6.74   | 5.89           | 18.6         | 2024-12-16 10:25:30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dashboards query this gold table instead of aggregating silver data repeatedly. Business intelligence tools connect directly to gold for their visualizations. Data scientists pull from gold for initial exploration before diving into silver for detailed analysis. Each layer serves its purpose: bronze preserves raw truth, silver provides clean queryable data, gold optimizes for consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation Is Set
&lt;/h2&gt;

&lt;p&gt;We've implemented medallion architecture in its ideal form. Bronze preserves raw vendor data with minimal transformation, tracking file lineage through embedded metadata columns drawn from Databricks' &lt;code&gt;_metadata&lt;/code&gt; struct and our own business-level provenance fields. Silver applies business logic and standardization, converting string columns to proper types and applying semantic naming. Gold delivers pre-aggregated monthly summaries optimized for analytical consumption. Each layer maintains clear boundaries and serves a distinct purpose.&lt;/p&gt;

&lt;p&gt;This discipline pays dividends when requirements change. Need to recalculate silver with different business rules? Bronze remains untouched as your source of truth. Analytics team wants new gold aggregations? Silver provides clean, typed data ready for transformation. Vendor changes their file format? Only bronze ingestion logic needs adjustment while downstream layers continue functioning.&lt;/p&gt;

&lt;p&gt;The architecture we built assumes cooperative vendors who send clean files with stable schemas. Our sample data arrived perfectly formatted with consistent column names, valid data types, and no surprises. This scenario establishes the pattern and demonstrates proper layer separation.&lt;/p&gt;

&lt;p&gt;Kimball's fingerprints are all over this design. His audit dimension tables inspired the separate ingestion table pattern we discussed earlier, which remains the better choice for production systems with high ingestion volumes or complex monitoring needs; for our demo, embedding metadata directly in bronze keeps the focus on layer mechanics. His staging areas become our bronze layer, his integration zones become our silver transformations, and his dimensional delivery layer maps to gold aggregations. The terminology evolved but the wisdom remained constant.&lt;/p&gt;

&lt;p&gt;Reality rarely cooperates this nicely. To see why, let's peek at what Vendor B sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Peek at Vendor B's file
&lt;/span&gt;&lt;span class="n"&gt;vendor_b_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;incoming_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/vendor_b_standard_clean.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df_vendor_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_b_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_vendor_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root
 |-- sample_barcode: string
 |-- lab_id: string
 |-- date_received: string
 |-- date_processed: string
 |-- acidity: string
 |-- cu_total: string
 |-- zn_total: string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait. &lt;code&gt;acidity&lt;/code&gt; instead of &lt;code&gt;ph&lt;/code&gt;? &lt;code&gt;cu_total&lt;/code&gt; instead of &lt;code&gt;copper_ppm&lt;/code&gt;? &lt;code&gt;zn_total&lt;/code&gt; instead of &lt;code&gt;zinc_ppm&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Same measurements. Different column names. How do we handle this without writing vendor-specific transformation code in our silver layer? Do we create separate tables for each vendor? Vendor-specific case statements? A config file with column mappings that grows longer with every new vendor?&lt;/p&gt;
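&lt;p&gt;To see why the config-file option gets uncomfortable, here's the smallest version of it: a per-vendor mapping from source column names to our canonical silver names. The Vendor B names come from the schema printout above; everything else is a sketch, and whether this scales is exactly the question we're about to wrestle with:&lt;/p&gt;

```python
# Sketch of a per-vendor column mapping config. Each new vendor adds an
# entry; unmapped columns pass through unchanged.

COLUMN_MAPPINGS = {
    "vendor_a": {},  # already matches the canonical silver schema
    "vendor_b": {
        "acidity": "ph",
        "cu_total": "copper_ppm",
        "zn_total": "zinc_ppm",
    },
}

def canonicalize(columns: list[str], vendor: str) -> list[str]:
    """Rename a vendor's columns to the canonical silver schema."""
    mapping = COLUMN_MAPPINGS[vendor]
    return [mapping.get(col, col) for col in columns]

print(canonicalize(["sample_barcode", "acidity", "cu_total"], "vendor_b"))
# -> ['sample_barcode', 'ph', 'copper_ppm']
```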

&lt;p&gt;The &lt;a href="https://dev.to/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf"&gt;next post&lt;/a&gt; explores what happens when vendors send chaos instead of clean CSVs (typos in column headers, unstable schemas, Excel nightmares) and how our clean three-layer architecture begins accumulating complexity. We'll discover an elegant solution that handles schema variations without drowning in vendor-specific code.&lt;/p&gt;

&lt;h2&gt;
  
  
  References and Further Reading
&lt;/h2&gt;

&lt;p&gt;Kimball, Ralph, and Margy Ross. &lt;em&gt;The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling&lt;/em&gt;, 3rd ed. Wiley, 2013. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chapter 2 provides an overview of Kimball's dimensional modeling techniques&lt;/li&gt;
&lt;li&gt;Chapter 8 covers audit dimensions and metadata tracking patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Kimball Group. "Dimensional Data Warehousing Resources." &lt;a href="https://www.kimballgroup.com/" rel="noopener noreferrer"&gt;https://www.kimballgroup.com/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Archive of articles, design tips, and dimensional modeling techniques from the originators of the methodology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databricks. "What is the medallion lakehouse architecture?" Databricks Documentation. &lt;a href="https://docs.databricks.com/lakehouse/medallion.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/lakehouse/medallion.html&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official documentation on medallion architecture patterns and best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete code examples and demo notebooks for this blog series are available at: &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;https://github.com/aawiegel/zen_bronze_data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>python</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
