<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Aaron Wiegel</title>
    <description>The latest articles on Forem by Aaron Wiegel (@aawiegel).</description>
    <link>https://forem.com/aawiegel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3727268%2Fc402d4b0-e403-4044-9236-01555bd8182f.jpeg</url>
      <title>Forem: Aaron Wiegel</title>
      <link>https://forem.com/aawiegel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/aawiegel"/>
    <language>en</language>
    <item>
      <title>Synthetic Data and the Privacy Problem: Beyond Alice and Bob</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 04 Mar 2026 20:11:40 +0000</pubDate>
      <link>https://forem.com/aawiegel/synthetic-data-and-the-privacy-problem-beyond-alice-and-bob-4j3k</link>
      <guid>https://forem.com/aawiegel/synthetic-data-and-the-privacy-problem-beyond-alice-and-bob-4j3k</guid>
      <description>&lt;p&gt;The fixtures in &lt;a href="https://dev.to/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5"&gt;this series&lt;/a&gt; have always been honest about what they were optimizing for. Posts 1 through 3 generated vendor CSV files designed to capture structural chaos: typos in column headers, shifting measurement packages, metadata rows that Spark reads with misplaced confidence, using the same tools we'll develop further here. The goal was a bronze layer that could absorb whatever shape a vendor file arrived in without requiring code changes. The fixture data itself, a collection of pH readings and copper concentrations, was never the point. Nobody's privacy interests are implicated by a synthetic soil sample.&lt;/p&gt;

&lt;p&gt;Customer records are a different matter entirely.&lt;/p&gt;

&lt;p&gt;A soil lab does not only process measurements. It processes submissions from real farms, research institutions, and agricultural businesses. Names, addresses, billing contacts, field histories. The moment a pipeline needs to model that relationship, "generate some plausible-looking data" stops being a casual decision and starts being a question worth taking seriously. What does realistic mean for sensitive data? How do you test against customer records when the real ones are legally and ethically off-limits? And if you cannot use real data, how do you know your fixtures are testing the right things?&lt;/p&gt;

&lt;p&gt;This post answers those questions by building two tools that address related but distinct problems. The customer generator produces realistic profiles derived deterministically from barcodes that already exist in the pipeline. The masking library addresses a separate situation entirely: when real production data needs to enter a development environment with appropriate controls applied. Both tools are useful. They are not interchangeable.&lt;/p&gt;

&lt;p&gt;The code for this post can be found &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Feel free to follow along or dig in if you want more details.&lt;/p&gt;

&lt;h2&gt;What the Customer Generator Produces&lt;/h2&gt;

&lt;p&gt;The names, addresses, and contact details come from numpy's random number generator seeded with a SHA-256 digest of each barcode. The same barcode produces the same customer profile every time, in every environment, without storing anything. Faker would have produced more linguistically convincing output and probably should have been the tool for this job. The fixtures work fine regardless. Realism of content was never the point. Stability of attachment was.&lt;/p&gt;

&lt;p&gt;The mechanism is SHA-256 used as a pure function into numpy's seed space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;derive_seed_from_barcode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;barcode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;barcode&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_bytes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;byteorder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;big&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given any string, this returns a deterministic non-negative integer suitable for seeding &lt;code&gt;numpy.random.default_rng&lt;/code&gt;. The same string always produces the same seed, which means the same &lt;code&gt;customer_id&lt;/code&gt; always produces the same profile, regardless of when or how many times the generator runs. This is referential stability without a database.&lt;/p&gt;
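&lt;p&gt;A quick sketch makes the stability property concrete. The &lt;code&gt;derive_seed_from_barcode&lt;/code&gt; function is reproduced from above; the age draw is purely illustrative, not the generator's actual field logic:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib

import numpy as np


def derive_seed_from_barcode(barcode: str) -> int:
    # Reproduced from the post: hash the barcode, keep the first 8 bytes.
    digest = hashlib.sha256(barcode.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big")


# Two generators constructed independently from the same barcode produce
# identical draws, so any profile field derived from them matches.
rng_a = np.random.default_rng(derive_seed_from_barcode("LAB-001"))
rng_b = np.random.default_rng(derive_seed_from_barcode("LAB-001"))
assert rng_a.integers(18, 90) == rng_b.integers(18, 90)
```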

&lt;p&gt;The customer data divides into two tables. A &lt;code&gt;customers&lt;/code&gt; table holds one row per unique customer: name, address, contact info, date of birth, age, and a free-text notes field that sounds like it came from someone who has been doing soil testing long enough to have opinions about irrigation timing. A &lt;code&gt;customer_samples&lt;/code&gt; table holds one row per barcode, linking it to a customer with submission-level fields: &lt;code&gt;crop_type&lt;/code&gt; and &lt;code&gt;sample_date&lt;/code&gt;. The separation matters because customers are entities and samples are facts. A corn farmer submits many soil samples across many fields and seasons. Flattening that into a single table would denormalize the relationship in ways that create trouble the moment you want customer-level aggregation or need to update a profile field.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;customers_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;forge_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_barcodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;customer_samples_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;forge_customer_sample_assignments&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_barcodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The five-to-one ratio means most customers appear across multiple sample submissions, which reflects how actual lab relationships work. One detail worth mentioning: the address generator includes a Wisconsin PLSS coordinate format with about 20% probability. &lt;code&gt;N5024W3295&lt;/code&gt; is a real addressing convention from the Public Land Survey System, common in rural parts of the Midwest. It shows up verbatim in actual vendor lab reports (usually to the quiet dismay of whoever first encounters it), which means it shows up in the fixtures, which means any address parsing logic gets tested against it. That is the point of realistic fixtures.&lt;/p&gt;
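&lt;p&gt;A hedged sketch of what that address branch might look like. The function name, street list, and exact threshold are assumptions for illustration; the repository's generator is more involved:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np


def forge_street_address(rng: np.random.Generator) -> str:
    """Illustrative only: emit a Wisconsin PLSS-style grid address about
    20% of the time, a conventional street address otherwise."""
    if rng.random() > 0.8:
        # e.g. "N5024W3295": north/west grid coordinates, no street name
        return f"N{rng.integers(1, 10000)}W{rng.integers(1, 10000)}"
    number = rng.integers(100, 9999)
    street = rng.choice(["Hollow Trl", "Pasture Dr", "County Rd F"])
    return f"{number} {street}"
```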

&lt;h2&gt;Why Not Just Use the Generated Data Directly?&lt;/h2&gt;

&lt;p&gt;That raises a legitimate question, though. Randomly generated data is structurally valid but statistically hollow. The real distributions, the genuine quirks, the specific shenanigans your data gets up to when nobody is watching: all of that is gone, replaced by Alice and Bob. Those two are perfectly serviceable for unit testing logic in isolation. Based on the number of unit test failures I've seen with them, I would not trust them together at a bar on a Saturday night. Similarly, I would not trust them as a proxy for production data's full range of quirks. If the goal is integration testing, fixtures that never existed in the real world can only tell you so much.&lt;/p&gt;

&lt;p&gt;This is where masking enters as a separate tool for a separate problem.&lt;/p&gt;

&lt;h2&gt;The Masking Strategies&lt;/h2&gt;

&lt;p&gt;The alternative to generating synthetic data is masking real data. Synthetic generation works when you need data that never existed and want full control over its structure. Masking works when you need a development dataset derived from real production records, preserving real distributions, real anomalies, and real edge cases that a generator might miss. In practice, many teams need both: generated data for early development, masked production data for integration testing before launch.&lt;/p&gt;

&lt;p&gt;Four strategies cover the meaningful design space, although this list is not meant to be exhaustive, merely illustrative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shuffle&lt;/strong&gt; permutes values within a column independently, which preserves the marginal distribution: if 30% of customers are in Wisconsin before shuffling, 30% still are afterward. Every value in the column is real. Every value passed format validation before masking and will pass it again after. Nothing looks wrong.&lt;/p&gt;

&lt;p&gt;The hazard is relational, not statistical. Shuffle severs the connection between a value and the row it belonged to. Sometimes that severance is the goal. The association between a particular farm and a particular set of samples can be as sensitive as the customer record itself. For end-to-end testing, whether barcode LAB-001 actually belongs to Sandra Hernandez is irrelevant to whether the pipeline processes it correctly, and severing that link adds a layer of protection if the development environment is ever compromised. Shuffle breaks the relationship deliberately and completely, which is occasionally exactly what a privacy requirement demands.&lt;/p&gt;

&lt;p&gt;Where shuffle becomes hazardous is when the relationship itself is load-bearing. A dataset released for external research might need customer histories intact but otherwise anonymized to be analytically meaningful. Shuffle would destroy that meaning while preserving the appearance of validity: joins complete, results look plausible, and the underlying analysis is quietly wrong. That use case is outside the scope of what we are building here, but it is worth understanding before reaching for shuffle by default.&lt;/p&gt;

&lt;p&gt;The strategy is not wrong. It requires knowing whether the relationship you are severing was one you needed to keep. Most teams discover this distinction at an inopportune time.&lt;/p&gt;
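&lt;p&gt;The masking module's shuffle implementation is not shown above, so here is a minimal sketch of what a &lt;code&gt;mask_shuffle&lt;/code&gt; might look like. The name is an assumption chosen to match the &lt;code&gt;mask_impute&lt;/code&gt; convention:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np
import pandas as pd


def mask_shuffle(df: pd.DataFrame, columns: list[str],
                 generator: np.random.Generator) -> pd.DataFrame:
    """Permute each listed column independently. Marginal distributions
    survive; row-level associations deliberately do not."""
    masked = df.copy()
    for column in columns:
        masked[column] = generator.permutation(masked[column].to_numpy())
    return masked
```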

&lt;p&gt;&lt;strong&gt;Imputation&lt;/strong&gt; replaces column values with output from a callable. The masking module takes an &lt;code&gt;imputers&lt;/code&gt; dictionary mapping column names to functions that accept a row count and return a list of replacement values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;imputers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anonymous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;masked&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mask_impute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;imputers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every customer becomes Anonymous. No distribution is preserved, no format inference required. For fields where the actual value is irrelevant to what you are testing, that is perfectly sufficient.&lt;/p&gt;

&lt;p&gt;The limitation is the flip side of that explicitness. A dataset where every customer name is Anonymous is not anonymized in any meaningful sense; it is a dataset with a broken name column. If downstream logic does anything with name format, uniqueness, or even basic non-nullness, imputation will expose that dependency immediately. This is occasionally useful information. A pipeline that silently assumes customer names are unique has a latent bug that imputation will surface faster than any other strategy.&lt;/p&gt;

&lt;p&gt;The callable interface also means imputation can do more than substitute a constant. A more sophisticated imputer could draw from a list of plausible replacements, apply format rules, or generate values that satisfy downstream validation constraints. Anonymous is the simplest possible implementation. The mechanism supports considerably more nuance when the situation calls for it.&lt;/p&gt;
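&lt;p&gt;As a sketch of that nuance, here is an imputer that returns distinct, name-shaped replacements instead of a constant, so uniqueness and basic format checks still pass. Everything in it is hypothetical scaffolding rather than the module's code:&lt;br&gt;
&lt;/p&gt;

```python
def plausible_name_imputer(n: int) -> list[str]:
    """Return n name-shaped, pairwise-distinct replacement values
    (distinct for up to 25 rows in this toy version)."""
    first = ["Alex", "Jordan", "Sam", "Riley", "Casey"]
    last = ["Miller", "Nguyen", "Okafor", "Schmidt", "Alvarez"]
    # Cycle through first/last combinations deterministically so the
    # replacements stay unique without tracking state.
    return [
        f"{first[i % len(first)]} {last[(i // len(first)) % len(last)]}"
        for i in range(n)
    ]
```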

&lt;p&gt;&lt;strong&gt;Resampling&lt;/strong&gt; handles numeric columns by fitting a normal distribution to the existing values and drawing a fresh set of values clipped to the original range:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;column_mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;column_std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;column_mean&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column_std&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;clip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;original_values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The statistical shape of the column is preserved: similar mean, similar spread, values that fall within the original bounds. A data scientist working with a resampled age column will see a realistic distribution without seeing any real ages. For downstream logic that cares about aggregates rather than individual values, this is the most analytically faithful masking strategy available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hashing&lt;/strong&gt; applies SHA-256 and keeps the first eight hex characters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;masked_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;salt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical property is determinism. The same input with the same salt always produces the same output, which means hashed values can be joined across tables. Hash &lt;code&gt;customer_id&lt;/code&gt; in both &lt;code&gt;customers&lt;/code&gt; and &lt;code&gt;customer_samples&lt;/code&gt; using the same salt and &lt;code&gt;CUST-CE8B39&lt;/code&gt; becomes &lt;code&gt;c2fd6c19&lt;/code&gt; in both places. The foreign key relationship survives. This is the only masking strategy where that is true.&lt;/p&gt;
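&lt;p&gt;A small demonstration of that property. The &lt;code&gt;hash_mask&lt;/code&gt; helper below just wraps the expression above, and the salt value is a placeholder:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib


def hash_mask(value: str, salt: str) -> str:
    """Salted SHA-256 truncated to eight hex characters, as above."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:8]


salt = "dev-environment-salt"  # placeholder; keep the real one in config
masked_customers = {hash_mask("CUST-CE8B39", salt): "profile row"}
masked_foreign_key = hash_mask("CUST-CE8B39", salt)

# Same salt, same input: the foreign key still resolves after masking.
assert masked_foreign_key in masked_customers
```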

&lt;p&gt;The tradeoffs are real and worth understanding. An eight-character hex string truncated from SHA-256 is not cryptographically robust, particularly for low-cardinality columns where brute force recovery is straightforward. A salt raises the cost of that attack without eliminating it. Hash masking is appropriate for development environments where the goal is protecting data from casual exposure, not for datasets approaching public distribution.&lt;/p&gt;
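&lt;p&gt;The brute-force point is easy to make concrete. If the salt ever leaks, a low-cardinality column inverts with a dictionary of candidate values; the helper and salt below are illustrative:&lt;br&gt;
&lt;/p&gt;

```python
import hashlib


def hash_mask(value: str, salt: str) -> str:
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:8]


# With a known salt and a small value space, precompute a reverse table.
leaked_salt = "leaked-salt"
candidates = ["WI", "MN", "IA", "IL", "WV"]
reverse = {hash_mask(state, leaked_salt): state for state in candidates}

masked_value = hash_mask("WI", leaked_salt)
assert reverse[masked_value] == "WI"  # recovered in one lookup
```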

&lt;p&gt;The salt also creates an operational dependency worth naming explicitly. Lose the salt, and your masked datasets become unrelatable across tables and unreproducible from scratch. It belongs in environment configuration alongside your database credentials, not in the codebase.&lt;/p&gt;
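&lt;p&gt;In practice that can be as simple as reading the salt from the environment and refusing to run without it. A sketch, with the variable name borrowed from the post's examples:&lt;br&gt;
&lt;/p&gt;

```python
import os


def load_mask_salt() -> str:
    """Fail loudly when MASK_SALT is missing; a silent hardcoded default
    would quietly break reproducibility and cross-table joins."""
    salt = os.environ.get("MASK_SALT")
    if salt is None:
        raise RuntimeError("MASK_SALT is not set in this environment")
    return salt
```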

&lt;h2&gt;Combining the Strategies&lt;/h2&gt;

&lt;p&gt;The four strategies combine into a single &lt;code&gt;apply_masking&lt;/code&gt; call that dispatches per column based on a configuration dictionary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;mask_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;impute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_of_birth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;street_address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resample&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;masked_customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_masking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customers_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mask_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;imputers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Anonymous&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MASK_SALT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;customer_samples&lt;/code&gt; table requires the same salt on &lt;code&gt;customer_id&lt;/code&gt; to preserve the join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;masked_assignments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;apply_masking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;customer_samples_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crop_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shuffle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;generator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;salt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MASK_SALT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fact that this requires explicit decisions about every column is a feature. It forces the question of what each field actually is before deciding how to treat it. That is a conversation worth having before the data enters a development environment, not after.&lt;/p&gt;

&lt;h2&gt;What the Masked Data Actually Looks Like&lt;/h2&gt;

&lt;p&gt;We can now compare &lt;code&gt;customers.csv&lt;/code&gt; and &lt;code&gt;customers_masked.csv&lt;/code&gt;. One row from each:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUST-CE8B39, Sandra Hernandez, 1975-09-01, 51, sandra.hernandez@protonmail.com,
(923) 867-3934, 4332 Hollow Trl, Oxford, WV, 27369, Comparison plot for trial program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Masked:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;c2fd6c19, Anonymous, 1957-07-21, 72.89, mark.jones@aol.com,
(364) 820-3448, 4848 Pasture Dr, Dover, WV, 27369, Comparison plot for trial program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of this looks reasonable until you look carefully. The customer ID is hashed, the name is imputed, and the date of birth, phone, street address, and city have been shuffled to values that belonged to different customers.&lt;/p&gt;

&lt;p&gt;The age has been resampled to 72.89. The masked output is technically within range and statistically plausible in aggregate, but no human being has ever reported themselves as 72.89 years old. Any schema that enforces INTEGER on that column will reject it immediately. This is the kind of thing that looks fine in a test that checks whether a value exists and looks obviously wrong the moment anyone actually reads it.&lt;/p&gt;
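&lt;p&gt;One way to close that gap is to make resampling type-aware: round and cast back when the source column is integer-typed. A sketch of that extension, with a hypothetical function name:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np


def resample_numeric(values: np.ndarray,
                     generator: np.random.Generator) -> np.ndarray:
    """Resample from a fitted normal, clip to the original range, and
    restore integer dtypes so an age never comes back as 72.89."""
    resampled = generator.normal(values.mean(), values.std(), size=len(values))
    resampled = np.clip(resampled, values.min(), values.max())
    if np.issubdtype(values.dtype, np.integer):
        resampled = np.round(resampled).astype(values.dtype)
    return resampled
```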

&lt;p&gt;Then there is the email address.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mark.jones@aol.com&lt;/code&gt; is sitting in the masked output fully readable. Shuffling an email address does not protect it. It reassigns it. Sandra's email is now attached to someone else's row and Mark's is attached to Sandra's. Both are still completely exposed. If the goal was protecting contact information, the masking configuration failed quietly and completely.&lt;/p&gt;

&lt;p&gt;The city and zip code tell a similar story. Oxford shuffled to Dover, but the zip code stayed as 27369. Column-level masking applied independently has no awareness of relationships between columns. The result passes any single-column validation and fails the moment anything checks whether the address makes geographic sense.&lt;/p&gt;
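&lt;p&gt;The fix for correlated columns is to shuffle them as a unit: one shared permutation applied to the whole group, so city and zip code travel together. A sketch, with the function name an assumption:&lt;br&gt;
&lt;/p&gt;

```python
import numpy as np
import pandas as pd


def mask_shuffle_group(df: pd.DataFrame, columns: list[str],
                       generator: np.random.Generator) -> pd.DataFrame:
    """Apply a single shared permutation to a group of correlated
    columns so intra-row consistency survives the shuffle."""
    masked = df.copy()
    order = generator.permutation(len(masked))
    for column in columns:
        masked[column] = masked[column].to_numpy()[order]
    return masked
```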

&lt;p&gt;None of these are bugs in the masking implementation (OK, the age thing is). All of them are the correct output of the configuration we provided. That is precisely the point: the strategies do exactly what they are told, not what you meant.&lt;/p&gt;

&lt;h2&gt;The Silver Layer Extension and Unit Tests&lt;/h2&gt;

&lt;p&gt;The fixture infrastructure is only useful if something actually runs against it. Before loading masked data into a test schema and invoking &lt;code&gt;dbt build&lt;/code&gt;, it helps to know that the model logic itself is correct. The unit test handles that first.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_with_customers&lt;/code&gt; performs a two-hop join across three models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_standardized&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_standardized&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;stg_customers&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;stg_customer_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stg_customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model adds &lt;code&gt;has_customer_assignment&lt;/code&gt;, a boolean that is &lt;code&gt;TRUE&lt;/code&gt; when a measurement's barcode has a matching row in &lt;code&gt;customer_samples&lt;/code&gt;, which makes unmatched measurements findable without requiring every downstream query to filter on NULL customer fields.&lt;/p&gt;

&lt;p&gt;The join has three distinct behavioral regimes worth testing explicitly: a barcode with a full match through both hops, a barcode with no customer assignment at all, and a barcode that matches &lt;code&gt;customer_samples&lt;/code&gt; but whose &lt;code&gt;customer_id&lt;/code&gt; has no corresponding row in &lt;code&gt;customers&lt;/code&gt;. That third case, an orphaned assignment, is realistic enough to justify its own fixture row. Anyone who has spent meaningful time with production data has encountered some version of this: a submission that arrived before its parent record, or a customer that got cleaned up while their samples stayed behind. A barcode submitted before a customer record existed, or after one was deleted, should produce &lt;code&gt;has_customer_assignment: true&lt;/code&gt; and &lt;code&gt;customer_name: null&lt;/code&gt;. Testing it explicitly verifies the full join chain rather than just the happy path.&lt;/p&gt;
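&lt;p&gt;The three regimes are easy to reproduce in miniature with a left merge. The frames below are illustrative stand-ins for the dbt models, with column names matching the post:&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

measurements = pd.DataFrame({"sample_barcode": ["LAB-001", "LAB-002", "LAB-003"]})
customer_samples = pd.DataFrame({
    "sample_barcode": ["LAB-001", "LAB-003"],
    "customer_id": ["CUST-A", "CUST-ORPHAN"],
})
customers = pd.DataFrame({
    "customer_id": ["CUST-A"],
    "customer_name": ["Sandra Hernandez"],
})

joined = (
    measurements
    .merge(customer_samples, on="sample_barcode", how="left")
    .merge(customers, on="customer_id", how="left")
)
joined["has_customer_assignment"] = joined["customer_id"].notna()

# LAB-001: full match. LAB-002: no assignment at all. LAB-003: orphaned
# assignment, so has_customer_assignment is True but customer_name is null.
```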

&lt;p&gt;Once the unit tests pass, the masked fixtures are ready to do their actual job. Load &lt;code&gt;masked_customers&lt;/code&gt; and &lt;code&gt;masked_assignments&lt;/code&gt; into a test schema, point &lt;code&gt;dbt build --select your_models --target test&lt;/code&gt; at it, and you have an integration test that exercises the pipeline against data with the structural properties and relational complexity of production records, without any of the liability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The masking strategies covered here represent a design space, not a checklist. Shuffle preserves distributions but severs relationships, deliberately or otherwise. Imputation is explicit but analytically hollow. Resampling maintains statistical shape but loses type fidelity. Hashing preserves referential integrity at the cost of cryptographic robustness. No single strategy is correct in isolation. The configuration you choose reflects a set of tradeoffs that are worth making consciously rather than discovering later when something downstream produces confident, well-formatted, wrong answers.&lt;/p&gt;
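&lt;p&gt;Two ends of that design space are worth seeing side by side. The toy sketch below is not the labforge implementation; the email values are invented, and it only illustrates the shuffle-versus-hash tradeoff described above:&lt;/p&gt;

```python
import hashlib
import random

emails = ["alice@example.com", "bob@example.com", "carol@example.com"]

# Shuffle: the set of values (the distribution) is preserved, but any
# row-level relationship to other columns or tables is severed.
shuffled = list(emails)
random.Random(0).shuffle(shuffled)
assert sorted(shuffled) == sorted(emails)

# Hashing: deterministic, so the same value yields the same token in every
# table, preserving referential integrity. Anyone who can guess an input
# can confirm it, which is the cryptographic-robustness cost.
def mask(value):
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

assert mask("alice@example.com") == mask("alice@example.com")
```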

&lt;p&gt;The fixture infrastructure this post builds serves a specific purpose: giving the pipeline realistic data to run against without the liability of real records. The masked customer dataset is not a privacy guarantee. It is a development tool, and like any development tool its value depends on using it for the right job. Getting the configuration right before you need it is substantially easier than explaining a masking decision you made at 4pm on a Thursday.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete working example:&lt;/strong&gt; The labforge customer and masking modules are in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/src/labforge" rel="noopener noreferrer"&gt;src/labforge/&lt;/a&gt;. The dbt models are in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/src/crucible/models/silver" rel="noopener noreferrer"&gt;src/crucible/models/silver/&lt;/a&gt;. Example data comparing raw and masked output is in &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/main/example_data/" rel="noopener noreferrer"&gt;example_data/&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>From Bronze to Silver: Staging, Intermediate, and the Art of the Trustworthy Join</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 25 Feb 2026 15:50:32 +0000</pubDate>
      <link>https://forem.com/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5</link>
      <guid>https://forem.com/aawiegel/from-bronze-to-silver-staging-intermediate-and-the-art-of-the-trustworthy-join-3ng5</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn"&gt;Part 3&lt;/a&gt; solved vendor schema chaos by treating column names as data. The bronze layer now holds one row per attribute per measurement: &lt;code&gt;lab_provided_attribute&lt;/code&gt; carries whatever the vendor typed as a column name, &lt;code&gt;lab_provided_value&lt;/code&gt; carries the cell value, and the table schema stays fixed and vendor-agnostic regardless of how creatively the vendor chose to name things. No fuzzy matching, no superset schemas, no vendor-specific parsing logic. The chaos was encoded as rows rather than fought through transformations.&lt;/p&gt;

&lt;p&gt;It was the right call for ingestion. It is a genuinely terrible format for answering questions.&lt;/p&gt;

&lt;p&gt;Ask "what was the average copper measurement for samples received in October?" against long-format EAV data and you're looking at a subquery to identify which &lt;code&gt;lab_provided_attribute&lt;/code&gt; values represent copper, a pivot to bring those values into a column, a join to find the collection date from another EAV row, and a filter on the date. Do that once, and it's manageable. Do it in every query anyone ever writes against this data, and you've simply moved the transformation burden from ingestion to analysis. The point of a multi-layer pipeline is to make that transformation happen once, correctly, in a place where it can be tested and maintained.&lt;/p&gt;

&lt;p&gt;Silver is where we make it happen. This post introduces the dbt models that carry bronze data from EAV chaos into an analytical schema, walks through the architectural reasoning behind each layer, and then runs the first unit test. It does not go well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why dbt for Silver Transformations
&lt;/h2&gt;

&lt;p&gt;Part 3 expressed the bronze layer in Python and PySpark. We could take the same approach for silver: write Spark SQL transformations directly, materialize staging and intermediate results as Delta tables, and wire them together with a notebook or a Databricks job. The transformation logic is the same regardless of the execution environment.&lt;/p&gt;

&lt;p&gt;The reason to use dbt here is not that the SQL would be different. It's that dbt enforces structural discipline that raw Spark SQL scripts require you to maintain yourself, through consistent conventions and sheer force of will. One of those scales better than the other.&lt;/p&gt;

&lt;p&gt;Each dbt model lives in its own file with an explicit dependency graph. When &lt;code&gt;int_lab_samples_joined&lt;/code&gt; references &lt;code&gt;ref('stg_lab_samples_unpivoted')&lt;/code&gt;, dbt knows to run the staging model first, tracks the lineage between models, and rebuilds downstream models when upstream ones change. Schema documentation lives alongside the models in YAML files and can be generated into browsable documentation. Tests run against the models directly. Materialization strategies (view, table, incremental) are model-level configuration, not something you manage separately in SQL DDL. None of this is impossible with raw Spark SQL scripts. It just requires the kind of ironclad consistency that engineering teams reliably maintain right up until they don't. dbt builds the discipline into the workflow rather than assuming it from the humans.&lt;/p&gt;

&lt;p&gt;For a transformation layer with multiple models in a dependency chain, that structure is worth the overhead of learning a new tool. The silver layer here has at least three models with an explicit sequence: staging must run before intermediate, and getting that sequence wrong produces silent incorrect results rather than an obvious error. dbt makes those dependencies explicit and enforces them automatically. And for what it's worth, dbt Core and Databricks Community Edition are both free, so the overhead is purely cognitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Silver Has Layers
&lt;/h2&gt;

&lt;p&gt;Here is something the dbt documentation will not tell you: Ralph Kimball described this architecture thirty years ago.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Data Warehouse Toolkit&lt;/em&gt; (first published 1996, revised 2013) describes a staging area whose job is to clean and conform source data before it enters the warehouse. The staging area strips away source-system idiosyncrasies. It normalizes formats, resolves encoding variations, and creates consistent representations that downstream logic can depend on. The integration layer then applies business logic: joining conformed source data to reference tables, enriching records with canonical meaning, building the integrated representation that analysts actually query. dbt's best-practices guides on project structure [1] use exactly this division: staging models are thin, source-cleaning layers; intermediate models are where complex joins and business logic live.&lt;/p&gt;

&lt;p&gt;The vocabulary has changed; the conceptual structure has not.&lt;/p&gt;

&lt;p&gt;This matters because it explains WHY the layers exist, not just what they happen to contain. Conventions without reasoning are just folklore. The staging and intermediate separation encodes a functional dependency: intermediate models depend on guarantees that staging models make. Staging exists to create a contract. Intermediate relies on that contract to do its job. When those responsibilities blur, with source-system cleaning mixed into join logic and business rules embedded in normalization models, the result is transformation logic that is hard to reason about, hard to test, and hard to change safely.&lt;/p&gt;

&lt;p&gt;With EAV bronze data, this layering matters more than usual. The staging layer has a specific, structural job to do before any downstream join is even possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Staging Model: Creating a Joinable Key
&lt;/h2&gt;

&lt;p&gt;The problem with bronze EAV data is that &lt;code&gt;lab_provided_attribute&lt;/code&gt; holds whatever the vendor typed. Vendor A sends &lt;code&gt;"Sample Concentration"&lt;/code&gt;. Vendor B sends &lt;code&gt;"sample_concentration "&lt;/code&gt; (trailing space included). A third vendor might send &lt;code&gt;"SAMPLE CONC."&lt;/code&gt;. These represent the same measurement. No string comparison will match them without normalization.&lt;/p&gt;

&lt;p&gt;The staging layer's job is to solve this problem once, correctly, in a place downstream models can depend on. Every join that comes after this model is only possible because this model did its job first. The model adds one column, &lt;code&gt;attribute_standardized&lt;/code&gt;, that normalizes &lt;code&gt;lab_provided_attribute&lt;/code&gt; into a consistent form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt;  &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'bronze'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'lab_samples_unpivoted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;lab_samples_unpivoted_staged&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;TRIM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="k"&gt;TRANSLATE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s1"&gt;'-$()#./ %@!'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s1"&gt;'___________'&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;attribute_standardized&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;lab_samples_unpivoted_staged&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cleaning chain does three things. &lt;code&gt;TRANSLATE&lt;/code&gt; replaces eleven special characters (&lt;code&gt;-$()#./ %@!&lt;/code&gt;) with underscores. &lt;code&gt;TRIM&lt;/code&gt; strips leading and trailing whitespace. &lt;code&gt;LOWER&lt;/code&gt; normalizes to lowercase so that capitalization differences don't produce false mismatches. With their powers combined, these three functions will help us later pivot back to wide format cleanly for specific analytical views.&lt;/p&gt;

&lt;p&gt;Applied to realistic vendor data: &lt;code&gt;"Sample Concentration"&lt;/code&gt; becomes &lt;code&gt;sample_concentration&lt;/code&gt;. &lt;code&gt;"-a$lot(of)weird#symbols.why/vendors%why@!"&lt;/code&gt; becomes &lt;code&gt;a_lot_of_weird_symbols_why_vendors_why&lt;/code&gt;, which is arguably an improvement on the original in more ways than one.&lt;/p&gt;
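&lt;p&gt;The normalization is easy to sanity-check outside Spark. A minimal Python sketch matching the examples above (&lt;code&gt;standardize_attribute&lt;/code&gt; is a hypothetical helper name, not part of the pipeline):&lt;/p&gt;

```python
# Python sketch of the staging normalization, for quick local checks only.
SPECIALS = "-$()#./ %@!"  # the eleven characters replaced with underscores

def standardize_attribute(raw):
    # Translate specials (including spaces) to underscores, strip any
    # leading/trailing underscores left behind, then lowercase.
    table = str.maketrans(SPECIALS, "_" * len(SPECIALS))
    return raw.translate(table).strip("_").lower()

assert standardize_attribute("Sample Concentration") == "sample_concentration"
assert standardize_attribute("sample_concentration ") == "sample_concentration"
```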

&lt;p&gt;Kimball called the underlying concept "conforming." Conformed dimensions bring disparate source systems into a shared vocabulary so they can be joined and analyzed together. You build a date dimension once, in a shared format, and every fact table that references dates joins to the same table. You don't maintain separate date representations for each source system. We call what the staging model does "attribute standardization." The goal is identical: create a canonical representation that downstream logic can join on without worrying about source-system variation. Same church, different pew.&lt;/p&gt;

&lt;p&gt;Notice what this model does not do. It doesn't join to any reference data, doesn't classify anything, doesn't apply a single business rule. It adds one column whose entire purpose is to make the next model's join possible. That constraint is deliberate, and violating it is how staging models become the thing nobody wants to touch six months later. dbt's best practices [2] describe staging models as thin layers that do the minimum necessary to make source data useful downstream: rename columns to consistent conventions, cast types, add computed identifiers. Business logic belongs in intermediate. The moment staging models start containing complex transformations, you've lost the separation that makes both layers maintainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intermediate Layer: Enrichment Through Joining
&lt;/h2&gt;

&lt;p&gt;With &lt;code&gt;attribute_standardized&lt;/code&gt; available, the first intermediate model can do what staging made possible.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_joined&lt;/code&gt; performs two joins. Neither is complicated. What matters is why they happen in this order and what the LEFT JOIN choice is actually saying about our relationship with imperfect data.&lt;/p&gt;

&lt;p&gt;The first join connects staged measurements to a vendor column mapping table on &lt;code&gt;attribute_standardized&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt;. The mapping table is reference data that records the connection between vendor-specific attribute names and canonical column identifiers. The second join connects the mapping result to a canonical column definitions table, which carries authoritative metadata about each canonical column: its data type, its category (measurement, identifier, date), and whether it should be treated as a metadata column or a measurement column in downstream transformations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'stg_lab_samples_unpivoted'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'bronze'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'vendor_column_mapping'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'silver'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'canonical_column_definitions'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;joined&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;notes&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;vendor_mapping_notes&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;attribute_standardized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;stg_lab_samples&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;joined&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The join condition includes both &lt;code&gt;attribute_standardized&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt;. Two vendors might use similar standardized attribute names for genuinely different measurements, or the canonical mapping might differ by vendor for domain-specific reasons. Joining on both columns preserves that specificity.&lt;/p&gt;
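&lt;p&gt;A tiny sketch of why the composite key matters, with invented vendor ids and attribute names:&lt;/p&gt;

```python
# Mapping keyed on (vendor_id, vendor_column_name): the same standardized
# attribute name can resolve to different canonical columns per vendor.
mapping = {
    ("vendor_a", "conc"): "copper_concentration_ppm",
    ("vendor_b", "conc"): "zinc_concentration_ppm",
}
# Joining on the attribute name alone would conflate the two measurements.
assert mapping[("vendor_a", "conc")] != mapping[("vendor_b", "conc")]
```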

&lt;p&gt;The LEFT JOIN choice is a deliberate QA decision, and it reflects a specific philosophy about what bronze-to-silver transformation should do with ambiguity. Rows that don't match the mapping table don't disappear. They survive with &lt;code&gt;canonical_column_name IS NULL&lt;/code&gt;, which is not a failure state. It's a diagnostic signal. The pipeline is saying "something arrived that I don't recognize" rather than quietly discarding evidence that something unexpected happened.&lt;/p&gt;

&lt;p&gt;Downstream QA can filter on that null to find unmapped attributes and determine whether the mapping table needs a new entry, whether the vendor introduced something unexpected, or whether an existing mapping has a standardization error. The silver layer doesn't resolve that ambiguity; it surfaces it in a way that makes it findable.&lt;/p&gt;
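&lt;p&gt;The survive-with-NULL behavior, sketched with Python dictionaries (attribute and vendor names invented for illustration):&lt;/p&gt;

```python
# LEFT JOIN sketch: unmapped attributes survive with canonical_column_name
# of None rather than disappearing; QA filters on that None to find them.
mapping = {("vendor_a", "cu_ppm"): "copper_concentration_ppm"}
staged = [("vendor_a", "cu_ppm"), ("vendor_a", "mystery_reading")]

joined = [
    {"vendor_id": vendor, "attribute_standardized": attr,
     "canonical_column_name": mapping.get((vendor, attr))}
    for vendor, attr in staged
]

unmapped = [r for r in joined if r["canonical_column_name"] is None]
assert [r["attribute_standardized"] for r in unmapped] == ["mystery_reading"]
```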

&lt;p&gt;This is Kimball's integration layer in practice [3], thirty years later, running on Delta Lake instead of whatever disk arrays cost a fortune in 1996. A vendor-specific attribute in EAV format becomes a row that carries canonical column name, data type, category, and enrichment notes. The enrichment is what makes downstream transformations tractable. Without it, every downstream model or query would need to join to the mapping table itself, with full knowledge of vendor-specific attribute naming. That knowledge belongs here, applied once.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Intermediate Layer: Completing the Schema Transformation
&lt;/h2&gt;

&lt;p&gt;With the enriched join in place, a second intermediate model can finish the transformation into an analyst-friendly shape.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;int_lab_samples_standardized&lt;/code&gt; handles this with conditional aggregation, which is the SQL equivalent of sorting your mail: same pile, separated by what actually matters.&lt;/p&gt;

&lt;p&gt;The challenge is structural. Bronze EAV format is gloriously egalitarian: sample barcodes, collection dates, copper measurements, and pH readings all get the same treatment, one row each, no hierarchy, no distinction. That democratic impulse was exactly right for ingestion. For analysis it is a nightmare, because analysts don't want to write a subquery just to find out what day a sample was collected.&lt;/p&gt;

&lt;p&gt;In the bronze EAV format, metadata columns (sample barcode, lab ID, collection date, analysis date) live in the same row structure as measurement columns (copper, zinc, pH). A typical row from a sample with three measurements and four metadata columns produces seven EAV rows: four metadata rows and three measurement rows, all sharing the same &lt;code&gt;row_index&lt;/code&gt;. Analysts want one record per measurement with the metadata attached as columns, not a set of EAV rows where metadata and measurements are interleaved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'view'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;{{&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'int_lab_samples_joined'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}}&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;metadata_pivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sample_barcode'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sample_barcode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'lab_id'&lt;/span&gt;         &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lab_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'date_received'&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_received&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;canonical_column_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'date_analyzed'&lt;/span&gt;  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="n"&gt;lab_provided_value&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_analyzed&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="c1"&gt;-- All non-metadata rows, including unmapped ones (is_metadata_column IS NULL).&lt;/span&gt;
&lt;span class="c1"&gt;-- Unmapped rows are preserved for QA; filter on is_metadata_column IS NULL to find problem attributes.&lt;/span&gt;
&lt;span class="n"&gt;measurements&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_metadata_column&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;int_lab_samples_joined&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;is_metadata_column&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;

&lt;span class="n"&gt;standardized&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;EXCEPT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sample_barcode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_received&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;date_analyzed&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;
    &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;
        &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
        &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;measurements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;metadata_pivoted&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;standardized&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The logic splits the enriched EAV data into two paths. The &lt;code&gt;metadata_pivoted&lt;/code&gt; CTE filters for rows where &lt;code&gt;is_metadata_column = TRUE&lt;/code&gt; and collapses them into one record per &lt;code&gt;row_index&lt;/code&gt;, &lt;code&gt;vendor_id&lt;/code&gt;, and &lt;code&gt;file_name&lt;/code&gt;, using &lt;code&gt;MAX(CASE WHEN ...)&lt;/code&gt; to pull each metadata attribute into its own named column. The &lt;code&gt;measurements&lt;/code&gt; CTE takes the other path: rows where &lt;code&gt;is_metadata_column = FALSE&lt;/code&gt; (actual measurement attributes) and rows where &lt;code&gt;is_metadata_column IS NULL&lt;/code&gt; (attributes that didn't match the mapping table) both pass through. The final join recombines the two paths on all three columns.&lt;/p&gt;

&lt;p&gt;The three-column join condition deserves a note, because this is exactly the kind of thing that looks fine until it catastrophically isn't. &lt;code&gt;row_index&lt;/code&gt; is the position of a row within a single CSV file; it resets to zero at the start of each file. Joining on &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt; alone would incorrectly match row 5 from &lt;code&gt;file_a.csv&lt;/code&gt; with row 5 from &lt;code&gt;file_b.csv&lt;/code&gt; if both came from the same vendor. Including &lt;code&gt;file_name&lt;/code&gt; makes the join key specific to a row within a specific file from a specific vendor: the granularity we actually want.&lt;/p&gt;
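&lt;p&gt;To make the failure mode concrete, here is a small Python sketch with hypothetical rows standing in for the two CTEs (this is an illustration of the join cardinality, not the dbt models themselves):&lt;/p&gt;

```python
# Hypothetical rows: the same vendor sends two files, and row_index
# restarts at the top of each file, so both files contain a "row 5".
measurements = [
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_a.csv", "value": "6.5"},
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_b.csv", "value": "7.2"},
]
metadata = [
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_a.csv", "barcode": "ABC123"},
    {"row_index": 5, "vendor_id": "vendor_a", "file_name": "file_b.csv", "barcode": "DEF456"},
]

def join(left, right, keys):
    """Nested-loop inner join on the given key columns."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if all(lrow[key] == rrow[key] for key in keys)
    ]

# Two-column key: every measurement matches BOTH metadata rows (4 results),
# silently attaching file_b's barcode to file_a's measurement and vice versa.
assert len(join(measurements, metadata, ["row_index", "vendor_id"])) == 4

# Three-column key: exactly one match per measurement, with the right barcode.
correct = join(measurements, metadata, ["row_index", "vendor_id", "file_name"])
assert len(correct) == 2
assert correct[0]["barcode"] == "ABC123"
```

&lt;p&gt;The two-column version doesn't error; it just quietly doubles the rows with plausible-looking metadata, which is why the bug is easy to miss.&lt;/p&gt;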

&lt;p&gt;Two other design decisions in this model are worth making explicit.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;is_metadata_column IS NULL&lt;/code&gt; inclusion in the measurements CTE reflects the same philosophy as the LEFT JOIN in the previous model: preserve ambiguity rather than discard it. An unmapped row in silver is information. It says "something arrived that we haven't classified yet." Discarding it would make the gap invisible; surfacing it makes the gap findable.&lt;/p&gt;
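&lt;p&gt;As a toy illustration of that filter (plain Python stand-ins for the enriched EAV rows, with a hypothetical unmapped attribute), the measurements path keeps everything that is not positively classified as metadata:&lt;/p&gt;

```python
# Enriched EAV rows after the mapping join; None means the attribute
# did not match the mapping table (a hypothetical unmapped column).
rows = [
    {"attribute": "ph", "is_metadata_column": False},
    {"attribute": "sample_barcode", "is_metadata_column": True},
    {"attribute": "mystery_reading_x", "is_metadata_column": None},
]

# Mirrors the SQL filter: is_metadata_column = FALSE OR is_metadata_column IS NULL.
# The unmapped row survives into silver, where QA can find it.
measurements = [r for r in rows if r["is_metadata_column"] is not True]
assert [r["attribute"] for r in measurements] == ["ph", "mystery_reading_x"]
```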

&lt;p&gt;Date columns (&lt;code&gt;date_received&lt;/code&gt;, &lt;code&gt;date_analyzed&lt;/code&gt;) remain as raw strings through these intermediate models. This is not an oversight. Casting vendor date strings to an actual date type without knowing the vendor's format either fails loudly on unexpected input or silently coerces values into something plausible but wrong. Silent and wrong is the worst possible outcome in a data pipeline. Now that we have the data in a more usable format, we can start validating this with further intermediate models before it ends up in a gold table.&lt;/p&gt;
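&lt;p&gt;A quick demonstration of why premature casting is dangerous, using a hypothetical ambiguous vendor date string:&lt;/p&gt;

```python
from datetime import datetime

raw = "03/04/2026"  # hypothetical vendor value: March 4th? April 3rd?

# Parsing with an assumed format "succeeds" either way; both results look plausible.
us = datetime.strptime(raw, "%m/%d/%Y")  # March 4, 2026
eu = datetime.strptime(raw, "%d/%m/%Y")  # April 3, 2026
assert us != eu  # same string, two valid-looking dates: silent and possibly wrong

# A format that genuinely doesn't fit fails loudly instead, which is recoverable.
try:
    datetime.strptime(raw, "%Y-%m-%d")
except ValueError:
    pass  # the error surfaces immediately rather than corrupting silver
```

&lt;p&gt;Failing loudly is annoying; coercing silently is a data-quality incident with a delay timer. Keeping the raw string defers the decision until we know the vendor's format.&lt;/p&gt;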

&lt;h2&gt;
  
  
  Writing the Contract Down
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu8xjm5df0xqyn757uqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhu8xjm5df0xqyn757uqc.png" alt="Schematic representation of pipeline" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three models form a coherent chain. Staging creates a joinable key. The first intermediate model uses that key to enrich with canonical meaning. The second pivots the result into an analytical schema while preserving unmapped rows for QA. From a design standpoint, the transformation logic is complete.&lt;/p&gt;

&lt;p&gt;From a validation standpoint, we've asserted that it works without demonstrating it. "Looks correct" is not a testing strategy. It is a feeling, and feelings have a well-documented history of being wrong about SQL.&lt;/p&gt;

&lt;p&gt;This is where test-driven thinking applies. The staging model makes a specific claim: given a &lt;code&gt;lab_provided_attribute&lt;/code&gt;, &lt;code&gt;attribute_standardized&lt;/code&gt; will be trimmed, lowercased, and have special characters replaced with underscores. That claim is precise enough to verify directly.&lt;/p&gt;

&lt;p&gt;dbt's unit testing feature [4], introduced in version 1.8, makes this claim machine-verifiable. You define mock input rows and expected output rows in the model's YAML configuration. dbt runs the model against your mock inputs in isolation, without touching production data, and tells you whether reality matches the specification. It is, in the most literal sense, writing the contract down and then checking whether anyone actually honored it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;unit_tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test_column_name_standardization&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Check&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;that&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;column&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;names&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;correctly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;trimmed,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;made&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lower&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;case,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;have&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;symbols&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;removed"&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stg_lab_samples_unpivoted&lt;/span&gt;
    &lt;span class="na"&gt;given&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;source('bronze', 'lab_samples_unpivoted')&lt;/span&gt;
        &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;extra&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;spaces&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CAPITALS'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-a$lot(of)weird#symbols.why/vendors%why@!'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;expect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;extra_spaces'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;capitals'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;attribute_standardized&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a_lot_of_weird_symbols_why_vendors_why'&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three test cases, three behaviors. The whitespace case verifies that leading and trailing spaces are removed. The capitalization case verifies that mixed or uppercase input produces lowercase output. The symbols case verifies that punctuation and special characters become underscores. Together, they define the contract that downstream models rely on when they join to the mapping table on &lt;code&gt;attribute_standardized&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let's run it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tests Fail
&lt;/h2&gt;

&lt;p&gt;Two of the three cases fail. The test does not care about our feelings about this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failure in unit_test test_column_name_standardization

actual differs from expected:

@@ ,attribute_standardized
+++,_a_lot_of_weird_symbols_why_vendors_why__
+++,_extra_spaces_
---,a_lot_of_weird_symbols_why_vendors_why
   ,capitals
---,extra_spaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consider the whitespace case first. Input: &lt;code&gt;' extra spaces '&lt;/code&gt;. Expected: &lt;code&gt;'extra_spaces'&lt;/code&gt;. Actual: &lt;code&gt;'_extra_spaces_'&lt;/code&gt;. Something in that chain is executing in the wrong order. Let's trace it.&lt;/p&gt;

&lt;p&gt;The functions nest inside each other, so they execute from the innermost outward: &lt;code&gt;TRANSLATE&lt;/code&gt; runs first, &lt;code&gt;TRIM&lt;/code&gt; runs second, &lt;code&gt;LOWER&lt;/code&gt; runs third.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TRANSLATE&lt;/code&gt; replaces the eleven characters listed in its second argument with underscores. The character set is &lt;code&gt;'-$()#./ %@!'&lt;/code&gt;. Count carefully: there is a space between the forward slash and the percent sign. Every space in the input string becomes an underscore. &lt;code&gt;' extra spaces '&lt;/code&gt; becomes &lt;code&gt;'_extra_spaces_'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;TRIM&lt;/code&gt; runs on the result of &lt;code&gt;TRANSLATE&lt;/code&gt;. &lt;code&gt;TRIM&lt;/code&gt; removes leading and trailing whitespace characters. There is no whitespace left; the leading and trailing spaces became underscores in the previous step. &lt;code&gt;TRIM&lt;/code&gt; finds nothing to remove. The string stays &lt;code&gt;'_extra_spaces_'&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;LOWER&lt;/code&gt; runs. No change; the string is already lowercase.&lt;/p&gt;

&lt;p&gt;The bug is an ordering problem, and it is the specific kind of ordering problem that looks completely reasonable until you trace one concrete input through the full execution sequence and watch it go wrong in slow motion. &lt;code&gt;TRIM&lt;/code&gt; must run before &lt;code&gt;TRANSLATE&lt;/code&gt;. Strip the whitespace first, then translate special characters, and the leading and trailing spaces are gone before &lt;code&gt;TRANSLATE&lt;/code&gt; ever encounters them. The fix is a one-line change to the nesting order.&lt;/p&gt;
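&lt;p&gt;The whitespace case can be reproduced outside the warehouse. Here is a standalone Python sketch of both nesting orders, using the same eleven-character special set from the model (note the space between the slash and the percent sign):&lt;/p&gt;

```python
SPECIAL = "-$()#./ %@!"  # the TRANSLATE character set, space included
TABLE = str.maketrans({ch: "_" for ch in SPECIAL})

def buggy(name: str) -> str:
    # LOWER(TRIM(TRANSLATE(...))): TRANSLATE runs first, so the leading and
    # trailing spaces become underscores before TRIM ever sees them.
    return name.translate(TABLE).strip().lower()

def fixed(name: str) -> str:
    # LOWER(TRANSLATE(TRIM(...))): strip whitespace first, then translate.
    return name.strip().translate(TABLE).lower()

assert buggy(" extra spaces ") == "_extra_spaces_"  # the failing behavior
assert fixed(" extra spaces ") == "extra_spaces"    # the contract
```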

&lt;p&gt;The symbols case fails for related reasons, and tracing through it is left as an exercise. The root cause is the same.&lt;/p&gt;

&lt;p&gt;Incidentally, I got this wrong the first time. The test told me.&lt;/p&gt;

&lt;p&gt;Here is the uncomfortable implication of that fact, and it applies regardless of how the models were written. As I was working through this post, I noticed that the join in &lt;code&gt;int_lab_samples_standardized&lt;/code&gt; was initially written on &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;vendor_id&lt;/code&gt; alone, which would produce incorrect metadata associations whenever a vendor sends more than one file, since &lt;code&gt;row_index&lt;/code&gt; resets to zero at the start of each file. I caught it by thinking carefully about what the columns actually represent. I would not have caught it by looking at the SQL and deciding it seemed reasonable.&lt;/p&gt;

&lt;p&gt;That bug would have existed in the code whether the models were written with an agent, generated from a Claude prompt and pasted in, or typed out manually by an engineer who had just had a very productive morning. The staging model's unit test demonstrates the alternative: make the claim explicit, make it machine-verifiable, and find out whether the claim is true before production data depends on it.&lt;/p&gt;

&lt;p&gt;The question of how to extend that discipline to the intermediate models, and to the outputs of the entire pipeline against real data, is where Part 5 begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Silver Looks Like Now, and What Comes Next
&lt;/h2&gt;

&lt;p&gt;The architecture is sound. Three models form a coherent transformation chain, each layer making the next one possible, each design decision traceable back to a specific problem it exists to solve. Kimball would recognize it. He would just be surprised it took thirty years to get decent dependency management.&lt;/p&gt;

&lt;p&gt;The first unit test does not pass. The intermediate models have no tests at all. We have built something that looks correct and demonstrated, scientifically, that looking correct is insufficient evidence.&lt;/p&gt;

&lt;p&gt;How do we know that actual vendor data flowing through this pipeline maps correctly to canonical columns? How do we validate that the mapping table covers the attributes vendors actually send, rather than the attributes we assumed they would send? How do we catch a new analysis package that introduces column names we have never seen, silently producing nulls in silver while everyone downstream wonders why the copper measurements disappeared?&lt;/p&gt;

&lt;p&gt;Those questions require validation against real data, at the boundaries where layers hand off to each other, against statistical expectations that reflect what vendors actually send rather than what we hope they send. Building that validation framework is the next post.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete "working" example:&lt;/strong&gt; The dbt models described in this post are on &lt;a href="https://github.com/aawiegel/zen_bronze_data/tree/feature/dbt_databricks/src/crucible" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. The staging model, both intermediate models, and the unit test definition are all in that directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;[1] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/1-guide-overview" rel="noopener noreferrer"&gt;&lt;em&gt;How we structure our dbt projects&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[2] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/guides/best-practices/how-we-structure/2-staging" rel="noopener noreferrer"&gt;&lt;em&gt;Staging: Preparing and cleaning source data&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[3] Kimball, R., &amp;amp; Ross, M. (2013). &lt;em&gt;The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling&lt;/em&gt; (3rd ed.). Wiley.&lt;/p&gt;

&lt;p&gt;[4] dbt Labs. (2026). &lt;a href="https://docs.getdbt.com/docs/build/unit-tests" rel="noopener noreferrer"&gt;&lt;em&gt;Unit tests&lt;/em&gt;.&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The Zen of the Bronze Layer: Embracing Schema Chaos</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Fri, 06 Feb 2026 04:51:38 +0000</pubDate>
      <link>https://forem.com/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn</link>
      <guid>https://forem.com/aawiegel/the-zen-of-the-bronze-layer-embracing-schema-chaos-7hn</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil"&gt;Part 1&lt;/a&gt;, we introduced the Medallion Architecture with clean, well-behaved vendor data. In &lt;a href="https://dev.to/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf"&gt;Part 2&lt;/a&gt;, we watched the bronze layer transform from "just land the data" into an eight-step ingestion pipeline with vendor-specific logic, fuzzy matching heuristics, header detection, and character sanitization. The final form barely resembles the simple bronze layer from Part 1. We ended by asking uncomfortable questions about whether we're still preserving "raw" data, and what happens when Vendor C arrives.&lt;/p&gt;

&lt;p&gt;This post answers the question we left hanging: What if we stop treating column names as schema and start treating them as data?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cognitive Shift
&lt;/h2&gt;

&lt;p&gt;Traditional thinking treats CSV column names as schema constraints. You design a bronze table with specific columns (ph, copper_ppm, zinc_ppm), and vendor data either fits that schema or requires transformation to match it. When Vendor B calls their pH column "acidity" instead, you write mapping logic. When schemas change between analysis packages, you build superset schemas that accommodate all possible columns. When typos appear, you add fuzzy matching.&lt;/p&gt;

&lt;p&gt;Each vendor variation becomes a code problem requiring a code solution. By the time you're writing &lt;code&gt;handle_vendor_c_special_case_for_march_exports()&lt;/code&gt;, you start questioning your life choices.&lt;/p&gt;

&lt;p&gt;Consider what happens as this approach scales. With two vendors and two analysis packages each, you manage four schema mappings. Add a third vendor with three packages, and you're at nine. The combination space grows faster than the vendor count. Combinatorial explosions are not, despite what you might think from our enthusiasm for complexity, actually a data engineer's favorite kind of surprise. Testing requires sample files for every vendor/package/quirk combination; the test matrix explodes combinatorially.&lt;/p&gt;

&lt;p&gt;The fundamental issue is treating column names as structural constraints when they're actually metadata about measurements.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "Raw" Actually Means
&lt;/h2&gt;

&lt;p&gt;When vendors send CSV files, they're dumping data from lab systems or Excel into whatever format was easiest to export. The wide CSV format (one column per measurement) is convenient for humans viewing spreadsheets. It's also convenient for data engineers who harbor the sweet, naive hope that vendor files arrive in thoughtful formats. We've all been there. But wide format creates a problem: semantics are encoded in structure.&lt;/p&gt;

&lt;p&gt;Consider what this means in practice. Column positions and names carry meaning; you need to know that the third column represents copper measurements before you can interpret the value 10.2. This encoding is precisely what created the brittleness we fought in Part 2.&lt;/p&gt;

&lt;p&gt;Here's a deceptively simple question: What is the "raw" form of vendor data? Is it the wide CSV they sent, or is it the atomic facts before they got serialized into columns?&lt;/p&gt;

&lt;p&gt;Consider this sample from Vendor A:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,ph,copper_ppm,zinc_ppm
ABC123,6.5,10.2,15.3
DEF456,7.2,8.7,12.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The "raw" facts are: Sample ABC123 has a pH of 6.5, copper of 10.2, and zinc of 15.3. The wide format is a presentation choice, not the essential structure. Column names are data labels that happen to be stored as structural metadata.&lt;/p&gt;

&lt;p&gt;What if we converted this into a different format?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,attribute,value,row_number
ABC123,ph,6.5,1
ABC123,copper_ppm,10.2,1
ABC123,zinc_ppm,15.3,1
DEF456,ph,7.2,2
DEF456,copper_ppm,8.7,2
DEF456,zinc_ppm,12.1,2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now column names are data in the &lt;code&gt;attribute&lt;/code&gt; field. The structure describes position and value; semantics come from the attribute values themselves.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49t5rkhu83d0jrvumps.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn49t5rkhu83d0jrvumps.png" alt="Unpivot/melt into wide format warts and all" width="743" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This transformation is called unpivoting (or melting). It converts wide format into long format by treating each cell as an individual record with explicit position tracking.&lt;/p&gt;
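&lt;p&gt;A minimal Python sketch of the melt makes the mechanics concrete. This is not the actual parser from the repo, just an illustration; the &lt;code&gt;vendor_id&lt;/code&gt; and &lt;code&gt;file_name&lt;/code&gt; arguments stand in for ingestion context:&lt;/p&gt;

```python
import csv
import io

WIDE = """sample_barcode,ph,copper_ppm,zinc_ppm
ABC123,6.5,10.2,15.3
DEF456,7.2,8.7,12.1
"""

def unpivot(text, vendor_id, file_name):
    """Melt a wide CSV into one attribute/value record per cell."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    return [
        {
            "row_index": row_index,
            "column_index": column_index,
            "lab_provided_attribute": header[column_index],
            "lab_provided_value": value,
            "vendor_id": vendor_id,
            "file_name": file_name,
        }
        for row_index, row in enumerate(rows[1:], start=1)
        for column_index, value in enumerate(row)
    ]

long_rows = unpivot(WIDE, "vendor_a", "example.csv")
assert len(long_rows) == 8  # 2 data rows x 4 columns = 8 cells
assert long_rows[0]["lab_provided_attribute"] == "sample_barcode"
assert long_rows[0]["lab_provided_value"] == "ABC123"
```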

&lt;h2&gt;
  
  
  The Power of Vendor-Agnostic Structure
&lt;/h2&gt;

&lt;p&gt;This pattern isn't novel; it's a variant of the Entity-Attribute-Value (EAV) model that's been used in database design for decades, particularly in healthcare and scientific domains where schemas are highly variable. In simpler terms, we're storing key-value pairs with position metadata. The statistical community calls this "long format" or "tidy data". The concept is well-established. Data engineers have been independently "discovering" EAV for decades, and each time we're absolutely CERTAIN our use case is special. It usually isn't, but the confidence is admirable. What's perhaps less common is applying it specifically to bronze layer ingestion as a solution to vendor schema chaos.&lt;/p&gt;

&lt;p&gt;Once data is in long format, the bronze table schema becomes fixed and vendor-agnostic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;row_index&lt;/code&gt;: Position in the original file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;column_index&lt;/code&gt;: Column position in the original file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lab_provided_attribute&lt;/code&gt;: The exact column name the vendor used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lab_provided_value&lt;/code&gt;: The measurement value&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vendor_id&lt;/code&gt;: Which vendor sent this data&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;file_name&lt;/code&gt;: Source file name&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ingestion_timestamp&lt;/code&gt;: When we received it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every vendor file, regardless of its schema, gets transformed into this same structure. Vendor A sends pH as "ph"? That goes in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. Vendor B sends it as "acidity"? That also goes in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. A typo like "recieved_date"? Preserved exactly as received in &lt;code&gt;lab_provided_attribute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The schema chaos doesn't disappear; we've just stopped fighting it. Turns out the solution was acceptance all along. Very zen. Very therapy. The bronze layer no longer needs to know what "ph" or "acidity" mean. It just preserves attribute/value pairs with position metadata.&lt;/p&gt;

&lt;p&gt;This has profound implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No vendor-specific logic:&lt;/strong&gt; The same unpivot transformation works for every vendor. No if/elif branches based on vendor_name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No superset schemas:&lt;/strong&gt; The bronze table doesn't grow new columns when vendors add measurements. Additional measurements just create more rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No fuzzy matching:&lt;/strong&gt; Typos are preserved as-is in &lt;code&gt;lab_provided_attribute&lt;/code&gt;. We're not making quality decisions about which variations are "close enough."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No header detection complexity:&lt;/strong&gt; While we still need to find the header row, once found, every column gets the same treatment. There's no domain knowledge about which columns are measurements versus metadata.&lt;/p&gt;

&lt;p&gt;The unpivot pattern trades structural rigidity for structural consistency. Instead of a brittle schema that breaks with variations, we have a flexible schema that accepts any variation and pushes semantic interpretation to the silver layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: The CSV Table Parser
&lt;/h2&gt;

&lt;p&gt;Having established why column names should be data, let's examine how to actually implement this transformation. The unpivot process has three steps: find the header row, clean up the structure, and transform wide to long.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/src/parse/base.py" rel="noopener noreferrer"&gt;CSVTableParser&lt;/a&gt; uses Python's standard &lt;code&gt;csv&lt;/code&gt; module rather than pandas or Spark for initial parsing. This matters: pandas and Spark make assumptions about data types and structure that interfere with preserving data exactly as received. The csv module gives us raw strings without interpretation.&lt;/p&gt;
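&lt;p&gt;The preservation property is easy to demonstrate. With hypothetical values that a type-inferring reader might "helpfully" coerce (a barcode with a leading zero, a measurement with a trailing decimal), the &lt;code&gt;csv&lt;/code&gt; module hands back the exact strings from the file:&lt;/p&gt;

```python
import csv
import io

# Hypothetical values prone to coercion: "0123" could lose its leading zero,
# "7.0" could come back reformatted as a float.
raw = "sample_barcode,ph\n0123,7.0\n"
rows = list(csv.reader(io.StringIO(raw)))

# csv.reader yields raw strings; nothing is reinterpreted.
assert rows[1] == ["0123", "7.0"]
```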

&lt;h3&gt;
  
  
  Finding the Header Row
&lt;/h3&gt;

&lt;p&gt;CSV files from vendors often have metadata rows above the actual data header. Vendor A might include "Lab Report Generated: 2024-01-15" in row 1, with the real header in row 3. Rather than hardcoding vendor-specific knowledge about metadata row patterns (which inevitably leads to a function called &lt;code&gt;detect_vendor_a_weird_header_thing()&lt;/code&gt;), the parser uses a simple heuristic: the header row is the first row with a sufficient number of non-null columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;min_found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Find the first row with enough non-null values to be considered the header
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;non_nulls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_nulls&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_found&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;header_index&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Could not find header row.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;header_index&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The threshold (&lt;code&gt;min_found&lt;/code&gt;) is configurable, not embedded in code. If Vendor A typically has 15 columns and Vendor B has 8, you can initialize the parser with different thresholds: &lt;code&gt;CSVTableParser({"header_detection_threshold": 15})&lt;/code&gt; for Vendor A and &lt;code&gt;CSVTableParser({"header_detection_threshold": 7})&lt;/code&gt; for Vendor B. This is configuration data, not branching logic. The algorithm remains the same; only the parameter changes.&lt;/p&gt;

&lt;p&gt;Configurable thresholds let you adapt to vendor differences without encoding vendor knowledge into the codebase. You still need to know the magic number; we've just given it a better address. The complexity didn't vanish, it got relocated.&lt;/p&gt;
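&lt;p&gt;To see the heuristic in action, here is a simplified standalone version of the same logic run against a hypothetical Vendor A file with a metadata row and a blank spacer row above the real header:&lt;/p&gt;

```python
def find_header_index(records, min_found=3):
    """Same heuristic as remove_header, as a standalone function for illustration."""
    for i, row in enumerate(records):
        if sum(item is not None for item in row) >= min_found:
            return i
    raise ValueError("Could not find header row.")

records = [
    ["Lab Report Generated: 2024-01-15", None, None, None],  # metadata row: 1 non-null
    [None, None, None, None],                                # blank spacer: 0 non-null
    ["sample_barcode", "ph", "copper_ppm", "zinc_ppm"],      # the real header
    ["ABC123", "6.5", "10.2", "15.3"],
]
assert find_header_index(records, min_found=3) == 2
```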

&lt;h3&gt;
  
  
  Cleaning Column Structure
&lt;/h3&gt;

&lt;p&gt;Vendors sometimes export CSVs with empty columns (extra commas creating phantom columns) or duplicate column names. The parser drops empty columns and deduplicates names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;clean_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Drop empty columns and deduplicate column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Drop columns where header is None
&lt;/span&gt;    &lt;span class="n"&gt;cols_to_drop&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cols_to_drop&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Deduplicate column names by appending _1, _2, etc.
&lt;/span&gt;    &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_dedupe_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a vendor's export includes duplicate column names (perhaps two "notes" columns), they become "notes" and "notes_1". The structure is preserved, not rejected.&lt;/p&gt;
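&lt;p&gt;The &lt;code&gt;_dedupe_columns&lt;/code&gt; helper itself isn't shown above. A minimal sketch of the behavior described in the comment ("appending _1, _2, etc."); this is an illustration, not necessarily the repository's implementation:&lt;/p&gt;

```python
def dedupe_columns(columns):
    """Append _1, _2, ... to repeated column names, preserving order."""
    counts = {}
    deduped = []
    for name in columns:
        if name in counts:
            counts[name] += 1
            deduped.append(f"{name}_{counts[name]}")
        else:
            counts[name] = 0
            deduped.append(name)
    return deduped
```

&lt;p&gt;So &lt;code&gt;["notes", "ph", "notes"]&lt;/code&gt; becomes &lt;code&gt;["notes", "ph", "notes_1"]&lt;/code&gt;: the first occurrence keeps its name, and repeats get numbered suffixes.&lt;/p&gt;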

&lt;h3&gt;
  
  
  The Unpivot Transformation
&lt;/h3&gt;

&lt;p&gt;This is where wide format becomes long format. Each cell in the original table becomes a row in the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unpivot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Transform wide format to long format with position tracking&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;attributes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:],&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nf"&gt;zip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;attributes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;row_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;column_index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_provided_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The loop structure is straightforward: for each row (starting from row 1 after the header), for each column, create a record with the position (row_index, column_index), the attribute name, and the value. A 50-row CSV with 20 columns becomes 1,000 records (50 × 20).&lt;/p&gt;

&lt;p&gt;Position tracking matters. &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;column_index&lt;/code&gt; preserve the original structure. If an issue appears with a measurement, you can trace it back to the exact cell in the source file (row 42, column 7). This is critical for debugging and audit trails.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Ingestion: One Loop for All Vendors
&lt;/h3&gt;

&lt;p&gt;With the CSVTableParser handling the transformation, bronze ingestion becomes remarkably simple. Here's the actual code from our &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/003_bronze_silver_unpivot_demo.py" rel="noopener noreferrer"&gt;demo notebook&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Initialize parser with configuration
&lt;/span&gt;&lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CSVTableParser&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header_detection_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# Track ingestion timestamp
&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Collect all parsed records
&lt;/span&gt;&lt;span class="n"&gt;all_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Process each vendor file
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;vendor_files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Get file path
&lt;/span&gt;    &lt;span class="n"&gt;csv_file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_csv_file_path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;VOLUME_PATH&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Parse and unpivot
&lt;/span&gt;    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;csv_file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Add metadata to each record
&lt;/span&gt;    &lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_vendor_id&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_id&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;file_name&lt;/span&gt;
        &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;

    &lt;span class="n"&gt;all_records&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create bronze table from all records
&lt;/span&gt;&lt;span class="n"&gt;spark_df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bronze_schema_def&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark_df_bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bronze_table_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This same loop processes 11 different vendor files:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vendor A: basic_clean, full_clean, messy_typos, messy_casing, messy_whitespace, excel_nightmare&lt;/li&gt;
&lt;li&gt;Vendor B: standard_clean, full_clean, messy_combo, excel_disaster, db_nightmare&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No vendor-specific branches. No analysis package logic. No special handling for typos or special characters. The parser treats every file identically; the bronze layer preserves everything as data.&lt;/p&gt;
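&lt;p&gt;The &lt;code&gt;extract_vendor_id&lt;/code&gt; helper in the loop is assumed rather than shown. One plausible sketch, under the assumption that file names are prefixed with the vendor identifier (e.g. &lt;code&gt;vendor_a_basic_clean.csv&lt;/code&gt;; this naming convention is a guess, not confirmed by the notebook):&lt;/p&gt;

```python
def extract_vendor_id(file_name):
    """Assume names like 'vendor_a_basic_clean.csv'; take the first two parts."""
    stem = file_name.rsplit(".", 1)[0]  # drop the extension
    parts = stem.split("_")
    return "_".join(parts[:2])  # e.g. "vendor_a"
```

&lt;p&gt;Note that even this tiny helper is metadata extraction, not parsing logic: the vendor identity travels alongside the records as data, which is what lets the silver mapping tables do their job later.&lt;/p&gt;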

&lt;p&gt;Compare this to the eight-step bronze function from Part 2 with its vendor-specific mappings, superset schema alignment, fuzzy matching, and character sanitization. The unpivot approach collapses all that complexity into a single generic transformation. What took eight carefully orchestrated steps now takes one aggressively indifferent loop.&lt;/p&gt;

&lt;h2&gt;
  
  
  Silver Layer: Standardization Through Data
&lt;/h2&gt;

&lt;p&gt;Bronze preserves chaos; silver brings order. The key insight is that standardization happens through data (mapping tables), not code (if/elif logic).&lt;/p&gt;

&lt;p&gt;The unpivoted bronze table contains &lt;code&gt;lab_provided_attribute&lt;/code&gt; values like "ph", "acidity", "copper_ppm", "cu_total", "sample_barcode", "sample_barcod" (typo), "lab_id". These need standardization regardless of whether they're measurements or metadata columns. The silver layer resolves this through a unified mapping approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vendor Column Mapping Table
&lt;/h3&gt;

&lt;p&gt;This mapping table connects vendor-specific column names to canonical column identifiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;vendor_id&lt;/th&gt;
&lt;th&gt;vendor_column_name&lt;/th&gt;
&lt;th&gt;canonical_column_id&lt;/th&gt;
&lt;th&gt;notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;Direct pH measurement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;copper_ppm&lt;/td&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;Copper in ppm&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;acidity&lt;/td&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;Vendor B calls pH 'acidity'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;cu_total&lt;/td&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;Chemical symbol notation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_a&lt;/td&gt;
&lt;td&gt;sample_barcode&lt;/td&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;Sample identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;vendor_b&lt;/td&gt;
&lt;td&gt;sample_barcod&lt;/td&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;Typo preserved in bronze&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that "ph" and "acidity" both map to &lt;code&gt;col_ph&lt;/code&gt;. Similarly, "sample_barcode" and "sample_barcod" (typo) both map to &lt;code&gt;col_sample_id&lt;/code&gt;. The mapping handles both measurements and metadata columns uniformly.&lt;/p&gt;
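&lt;p&gt;Expressed in code, the mapping table is just rows keyed on (vendor_id, vendor_column_name). A minimal in-memory sketch of the lookup, using the rows from the table above:&lt;/p&gt;

```python
# Rows from the vendor column mapping table, keyed for lookup
VENDOR_COLUMN_MAPPING = {
    ("vendor_a", "ph"): "col_ph",
    ("vendor_b", "acidity"): "col_ph",
    ("vendor_a", "copper_ppm"): "col_copper",
    ("vendor_b", "cu_total"): "col_copper",
    ("vendor_a", "sample_barcode"): "col_sample_id",
    ("vendor_b", "sample_barcod"): "col_sample_id",  # vendor typo, preserved
}

def canonical_column(vendor_id, attribute):
    """Resolve a vendor attribute to its canonical column id; None if unmapped."""
    return VENDOR_COLUMN_MAPPING.get((vendor_id, attribute))
```

&lt;p&gt;Unrecognized attributes resolve to &lt;code&gt;None&lt;/code&gt;, which mirrors the left-join behavior in the silver SQL: unmapped columns pass through with NULL canonical information rather than breaking the pipeline.&lt;/p&gt;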

&lt;h3&gt;
  
  
  Canonical Column Definitions
&lt;/h3&gt;

&lt;p&gt;The silver staging area maintains simple canonical definitions for all columns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;canonical_column_id&lt;/th&gt;
&lt;th&gt;canonical_column_name&lt;/th&gt;
&lt;th&gt;column_category&lt;/th&gt;
&lt;th&gt;data_type&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;col_ph&lt;/td&gt;
&lt;td&gt;ph&lt;/td&gt;
&lt;td&gt;measurement&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;Soil pH level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_copper&lt;/td&gt;
&lt;td&gt;copper_ppm&lt;/td&gt;
&lt;td&gt;measurement&lt;/td&gt;
&lt;td&gt;numeric&lt;/td&gt;
&lt;td&gt;Copper concentration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_sample_id&lt;/td&gt;
&lt;td&gt;sample_barcode&lt;/td&gt;
&lt;td&gt;sample_identifier&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Unique sample ID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_lab_id&lt;/td&gt;
&lt;td&gt;lab_id&lt;/td&gt;
&lt;td&gt;lab_metadata&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Laboratory identifier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;col_date_received&lt;/td&gt;
&lt;td&gt;date_received&lt;/td&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;date&lt;/td&gt;
&lt;td&gt;Sample receipt date&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This isn't a full dimensional model yet; it's standard staging-area column normalization. The gold layer builds actual star schema dimensions (analyte dimensions with units and valid ranges, sample dimensions with tracking metadata, etc.). Silver simply establishes canonical naming and basic categorization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcmtqri96ehz7bzl2jk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcmtqri96ehz7bzl2jk.png" alt="Relation between bronze long format and table attributes" width="800" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Silver Transformation
&lt;/h3&gt;

&lt;p&gt;Joining bronze with these mapping tables produces standardized column names. Here's the SQL pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples_standardized&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="c1"&gt;-- Original bronze columns for lineage&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;row_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_value&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;file_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ingestion_timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;-- Standardized column information&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples_unpivoted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_mapping&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_provided_attribute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_column_name&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vendor_id&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_definitions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;canonical&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;canonical&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;canonical_column_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The left join handles unmapped columns gracefully; anything not in the mapping table gets NULL for canonical information. You can filter for &lt;code&gt;canonical_column_id IS NOT NULL&lt;/code&gt; to get recognized columns, or keep everything for complete lineage.&lt;/p&gt;

&lt;p&gt;This approach treats all columns uniformly. Whether it's "ph" vs "acidity" (measurement) or "sample_barcode" vs "sample_barcod" (identifier), the pattern is the same: map vendor naming to canonical naming through configuration data.&lt;/p&gt;

&lt;p&gt;After this transformation, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Original context preserved:&lt;/strong&gt; The exact column name the vendor used (&lt;code&gt;lab_provided_attribute&lt;/code&gt;), plus vendor_id, file_name, and ingestion_timestamp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standardized semantics added:&lt;/strong&gt; Canonical column names, categories, and data types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-vendor comparability:&lt;/strong&gt; pH measurements from both vendors now share the same canonical_column_id and canonical_column_name&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What We've Achieved
&lt;/h3&gt;

&lt;p&gt;The silver layer demonstrates three architectural principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configuration over code:&lt;/strong&gt; Vendor differences are expressed as rows in mapping tables, not if/elif branches in functions. Adding a vendor means inserting rows; changing how a vendor names their columns means updating rows. Database operations, not deployments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separation of concerns:&lt;/strong&gt; Bronze handles structure preservation (unpivoting). Silver handles semantic interpretation (mapping). Each layer has a single, clear responsibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data-driven evolution:&lt;/strong&gt; The mapping tables are versioned data that can be managed by data stewards, not just engineers. Domain experts can maintain vendor-to-column mappings without understanding the ingestion code.&lt;/p&gt;

&lt;p&gt;Vendor-specific knowledge still exists; we haven't eliminated the need to understand that "acidity" means pH or "sample_barcod" means sample_barcode. But we've moved that knowledge from code (brittle, requires deployments) to data (flexible, requires inserts/updates). The complexity didn't vanish; it got a new address and better management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A note on dimensional modeling:&lt;/strong&gt; This silver layer approach establishes canonical naming but doesn't build full star schema dimensions. That's the gold layer's job. Gold takes canonical columns and builds proper dimension tables (analyte dimensions with units and valid ranges, sample dimensions with tracking metadata, date dimensions, etc.). Silver is a staging area for standardization; gold is where dimensional modeling happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rethinking "Raw" Data
&lt;/h2&gt;

&lt;p&gt;The unpivot pattern raises deeper questions about data engineering philosophy. Part 2 ended by questioning whether the complex eight-step bronze layer was still preserving "raw" data. This pattern forces a more careful definition of what "raw" actually means.&lt;/p&gt;

&lt;p&gt;When vendors export CSVs, they're likely just dumping data from Excel or their lab information systems without much thought. Calling anything that survived Excel's date-parsing tendencies "raw" requires some philosophical flexibility. Excel thinks everything is a date. Not just the data type; the kind where it buys your sample IDs dinner and doesn't call the next day. Very unprofessional. We've had conversations. But here we are. The wide format (one column per measurement) is convenient for humans viewing spreadsheets, but it encodes domain knowledge into structure. To understand that the third column represents copper measurements, you need to read the header; the column position itself carries no semantic meaning.&lt;/p&gt;

&lt;p&gt;The unpivot transformation exposes the atomic facts hiding in this structure: this sample, this attribute, this value, at this position. Column names stop being structural constraints and become data values we can query, filter, and join against. Whether a vendor calls it "ph" or "pH_lvl", it's just a string value in &lt;code&gt;lab_provided_attribute&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In this sense, unpivoted bronze is closer to "raw" than wide bronze. The messiness doesn't disappear; it becomes explicit. Typos appear as queryable data in &lt;code&gt;lab_provided_attribute&lt;/code&gt; rather than as structural variations that break schema assumptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Paradox of Control
&lt;/h3&gt;

&lt;p&gt;The unpivot pattern presents a paradox: by giving up control (accepting any schema), we gain control (one ingestion pattern).&lt;/p&gt;

&lt;p&gt;Part 2's approach tried to control vendor chaos through transformation logic: detect headers, fix typos, sanitize characters, map column names, align to superset schemas. Each transformation attempted to force vendor data into expected structure. The harder we fought for control, the more brittle the system became, as systems do when you try to control chaos through sheer force of will. Around transformation seven, I started empathizing with King Canute trying to command the tide.&lt;/p&gt;

&lt;p&gt;In contrast, the unpivot approach accepts vendor chaos by treating it as data to preserve rather than problems to solve. Bronze doesn't validate column names or fix typos; those are silver layer concerns solved through mapping tables. When vendor schemas change, bronze doesn't break. It just creates different &lt;code&gt;lab_provided_attribute&lt;/code&gt; values, and the mapping tables handle semantic evolution without code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Schemas Are Data, Not Code
&lt;/h3&gt;

&lt;p&gt;Traditional databases treat schemas as structural constraints: the CREATE TABLE statement defines columns, and INSERT statements must conform. The unpivot pattern inverts this. The bronze schema (row_index, column_index, lab_provided_attribute, lab_provided_value) is fixed, but what constitutes valid data is open-ended. Any attribute name is acceptable.&lt;/p&gt;

&lt;p&gt;This inversion has significant implications:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding vendors becomes a data operation.&lt;/strong&gt; Add rows to vendor_column_mapping; the new vendor's data flows through unchanged code paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema changes become data operations.&lt;/strong&gt; Vendor renames "ph" to "pH_value"? Update vendor_column_mapping. No deployment required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profiling schema drift becomes possible.&lt;/strong&gt; Create snapshots that show how the schema is changing over time. Bonus: you can measure just how often vendors send you zany Excel files and compare who sends higher quality data.&lt;/p&gt;
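&lt;p&gt;A sketch of what such profiling might look like over unpivoted records (illustrative only): count how often each (vendor, attribute) pair appears, and drift shows up as new keys between snapshots.&lt;/p&gt;

```python
from collections import Counter

def attribute_profile(records):
    """Count occurrences of each (vendor_id, attribute) pair in bronze records."""
    return Counter(
        (r["vendor_id"], r["lab_provided_attribute"]) for r in records
    )

snapshot = attribute_profile([
    {"vendor_id": "vendor_b", "lab_provided_attribute": "acidity"},
    {"vendor_id": "vendor_b", "lab_provided_attribute": "acidity"},
    {"vendor_id": "vendor_b", "lab_provided_attribute": "pH_value"},  # drift
])
```

&lt;p&gt;Comparing snapshots over time turns "the vendor changed their export again" from a pipeline failure into a diff between two counters.&lt;/p&gt;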

&lt;p&gt;&lt;strong&gt;Testing becomes systematic.&lt;/strong&gt; Test the generic unpivot transformation once; vendor differences are data fixtures, not code paths.&lt;/p&gt;

&lt;p&gt;The cognitive shift is recognizing that vendor quirks are metadata about their export process, not structural properties of the data itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reclaiming the Bronze Layer
&lt;/h3&gt;

&lt;p&gt;Part 2 asked: "Is this still a bronze layer?" after watching transformation logic accumulate. The unpivot pattern reclaims bronze simplicity by giving it one job: parse CSV structure and preserve information as position/value pairs. No quality decisions about typos. No business logic about semantics. Just structural transformation from wide to long format. Consequently, we can defer those concerns to the silver layer where they rightfully belong.&lt;/p&gt;

&lt;p&gt;This IS minimal transformation. Values aren't modified; only their organization changes. The transformation is generic (same code for all vendors) and reversible (pivot back using row_index and column_index).&lt;/p&gt;
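&lt;p&gt;To make the reversibility claim concrete, here is a sketch of rebuilding the wide form from long-format records using only &lt;code&gt;row_index&lt;/code&gt; and &lt;code&gt;column_index&lt;/code&gt; (plain Python; &lt;code&gt;pivot_back&lt;/code&gt; is a hypothetical helper):&lt;/p&gt;

```python
def pivot_back(records):
    """Rebuild the original wide header and rows from long-format
    records; position indices mean no information was lost."""
    header = {}
    rows = {}
    for r in records:
        header[r["column_index"]] = r["lab_provided_attribute"]
        rows.setdefault(r["row_index"], {})[r["column_index"]] = r["lab_provided_value"]
    columns = [header[j] for j in sorted(header)]
    data = [[rows[i][j] for j in sorted(header)] for i in sorted(rows)]
    return columns, data

records = [
    {"row_index": 0, "column_index": 0,
     "lab_provided_attribute": "sample_barcode", "lab_provided_value": "S-001"},
    {"row_index": 0, "column_index": 1,
     "lab_provided_attribute": "ph", "lab_provided_value": "6.5"},
]
columns, data = pivot_back(records)
# columns == ["sample_barcode", "ph"]; data == [["S-001", "6.5"]]
```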

&lt;p&gt;Silver handles semantic complexity through mapping tables. Gold does analysis and aggregation. Each layer has clear boundaries and single responsibilities.&lt;/p&gt;
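&lt;p&gt;A sketch of how silver can consume long-format bronze through a mapping table (the mapping rows and the &lt;code&gt;to_silver&lt;/code&gt; helper are invented for illustration; in practice &lt;code&gt;vendor_analyte_mapping&lt;/code&gt; would be a managed reference table, not a hardcoded dict):&lt;/p&gt;

```python
# Reference data: (vendor, vendor's column name) maps to a standard analyte.
vendor_analyte_mapping = {
    ("vendor_a", "ph"): "ph",
    ("vendor_b", "acidity"): "ph",
    ("vendor_b", "cu_total"): "copper_ppm",
}

def to_silver(bronze_records):
    """Join bronze attributes against the mapping; unmapped names are
    surfaced for data stewards instead of silently dropped."""
    mapped, unmapped = [], []
    for r in bronze_records:
        key = (r["vendor"], r["lab_provided_attribute"])
        if key in vendor_analyte_mapping:
            mapped.append({**r, "analyte": vendor_analyte_mapping[key]})
        else:
            unmapped.append(r)
    return mapped, unmapped

bronze = [
    {"vendor": "vendor_b", "lab_provided_attribute": "acidity",
     "lab_provided_value": "6.5"},
    {"vendor": "vendor_b", "lab_provided_attribute": "zn_total",
     "lab_provided_value": "30"},
]
mapped, unmapped = to_silver(bronze)
# mapped[0]["analyte"] == "ph"; unmapped holds the unrecognized zn_total row
```

&lt;p&gt;Onboarding a new vendor or handling a rename means adding rows to the mapping, not editing this function.&lt;/p&gt;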

&lt;h2&gt;
  
  
  When NOT to Use This Pattern
&lt;/h2&gt;

&lt;p&gt;The unpivot pattern solves specific problems; it's not a universal solution. Recognizing when it adds unnecessary complexity is as important as knowing when it provides value.&lt;/p&gt;

&lt;h3&gt;
  
  
  Important Caveat: This Is an Intermediate Form
&lt;/h3&gt;

&lt;p&gt;The unpivot pattern creates long-format data as an intermediate representation. You don't stop here; silver and gold layers transform this back into analyst-friendly structures (wide tables, aggregations, dimensional models). If you're not building a multi-stage transformation pipeline where bronze feeds silver feeds gold (or marts in other architectural patterns), unpivoting adds unnecessary complexity without delivering its benefits. The pattern makes sense when you have layer separation and further transformations; it's the wrong tool if you need a simple data landing zone for direct consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't Use Unpivot When Schemas Are Stable
&lt;/h3&gt;

&lt;p&gt;If you work with one or two vendors who provide stable schemas that rarely change, the unpivot pattern may be unnecessary overhead. Some people have stable vendor relationships. I hear they exist. When Vendor A's contract specifies that &lt;code&gt;copper_ppm&lt;/code&gt; won't change without notice and you haven't seen schema drift in two years, simpler approaches suffice.&lt;/p&gt;

&lt;p&gt;Similarly, at low volume (say, 500 samples per month from two stable vendors), the unpivot infrastructure might cost more to build and maintain than occasional vendor-specific adjustments. The pattern's benefits scale with schema chaos; if you don't have chaos, you don't need the solution.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do Use Unpivot When Schema Chaos Is Real
&lt;/h3&gt;

&lt;p&gt;Conversely, use the pattern when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have multiple vendors with divergent naming conventions.&lt;/strong&gt; Three vendors calling pH by three different names; vendor-specific if/elif logic is already feeling brittle and hard to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schemas change frequently within vendors.&lt;/strong&gt; Same vendor sends different column sets based on analysis package ordered, or evolves their format quarterly without coordination.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You're building for extensibility.&lt;/strong&gt; You expect vendor count to grow, or you're building a platform where schema flexibility is a product feature rather than just a maintenance challenge (e.g., scientific data platforms).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need complete provenance.&lt;/strong&gt; Regulatory requirements demand preserving exact column names as received, with full traceability to source files and cells. The unpivot pattern with position tracking provides audit-grade lineage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standardization requires domain expertise.&lt;/strong&gt; Mapping between vendor terminologies involves domain knowledge that should be managed by data stewards as reference data, not hardcoded by engineers in transformation logic.&lt;/p&gt;
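&lt;p&gt;On the provenance requirement above: with position tracking, every value can cite the exact file, row, and header it came from. A sketch of that lineage for a single cell (field names like &lt;code&gt;source_file&lt;/code&gt; are assumptions, not the repo's exact schema):&lt;/p&gt;

```python
def lineage(record):
    """Format an audit trail for one bronze cell: which file, which
    row, and which original column header produced this value."""
    return (f"{record['lab_provided_value']} came from "
            f"{record['source_file']} row {record['row_index']}, "
            f"column {record['column_index']} "
            f"(header as received: {record['lab_provided_attribute']!r})")

record = {
    "source_file": "vendor_a_2026-01.csv", "row_index": 14,
    "column_index": 5, "lab_provided_attribute": "date_recieved",
    "lab_provided_value": "2026-01-12",
}
# lineage(record) preserves the typo'd header exactly as the lab sent it
```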

&lt;h3&gt;
  
  
  The Decision Framework
&lt;/h3&gt;

&lt;p&gt;Ask yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many vendors do you have (current and expected in 2 years)?&lt;/li&gt;
&lt;li&gt;How often do schemas change, and how coordinated are those changes?&lt;/li&gt;
&lt;li&gt;Are you building a multi-stage transformation pipeline with proper layer separation?&lt;/li&gt;
&lt;li&gt;What are your provenance and audit requirements?&lt;/li&gt;
&lt;li&gt;What's the actual cost of schema-related maintenance today?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the answers suggest high vendor count, frequent uncoordinated schema changes, multi-stage architecture, strong audit needs, and meaningful current maintenance burden, the unpivot pattern likely pays dividends. If they point toward stability, predictability, and simple requirements, simpler approaches might suffice.&lt;/p&gt;

&lt;p&gt;There's no universal answer; architectural decisions require weighing specific context and constraints.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Zen of It
&lt;/h2&gt;

&lt;p&gt;In Part 1, we traced how medallion architecture evolved from Kimball's dimensional modeling framework—not replacement, but simplification. Part 2 ended with vendor-specific logic, superset schemas, fuzzy matching, and character sanitization accumulating until I questioned what 'bronze' even meant. The solution wasn't more sophisticated logic; it was remembering Kimball's staging area principle: preserve source structure before imposing semantics. The unpivoted bronze IS that staging area: source structure preserved, with vendor chaos encoded as data rather than fought through transformations.&lt;/p&gt;

&lt;p&gt;By treating column names as data instead of schema, we eliminated brittleness without eliminating complexity. Vendor chaos still exists, but it's no longer a code problem. Column name variations become rows in mapping tables. Schema evolution becomes data updates, not deployments. The complexity moves from scattered if/elif logic into structured dimension tables managed by people who understand vendor semantics.&lt;/p&gt;

&lt;p&gt;This reveals a broader principle: sometimes the elegant solution isn't solving the problem, it's reframing what the problem actually is. Schema chaos looked like a structural problem requiring sophisticated transformation logic. Reframed as a metadata problem, it becomes manageable through configuration.&lt;/p&gt;

&lt;p&gt;The paradoxes stack up: by giving up control (accepting any schema), we gain control (one ingestion path). By preserving more of what vendors send (typos included), we achieve better standardization (explicit mapping, not implicit assumptions). By doing less transformation in bronze, we enable cleaner layer separation.&lt;/p&gt;

&lt;p&gt;Data engineering is about finding the right abstraction level. Too concrete and you drown in special cases. Too abstract and you can't solve actual problems. The unpivot pattern finds the middle ground: generic enough to handle any vendor's wide CSV, specific enough to preserve structure and position.&lt;/p&gt;

&lt;p&gt;The code is simpler. The testing is systematic. The evolution path is clear. That's finding zen in apparent chaos.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Complete working example:&lt;/strong&gt; The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/003_bronze_silver_unpivot_demo.py" rel="noopener noreferrer"&gt;demo notebook&lt;/a&gt; processes 11 vendor files (clean, messy, typos, Excel nightmares) using the patterns described in this post. &lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
    </item>
    <item>
      <title>When Bronze Goes Rogue: Schema Chaos in the Wild</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Wed, 28 Jan 2026 05:04:14 +0000</pubDate>
      <link>https://forem.com/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf</link>
      <guid>https://forem.com/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil"&gt;Part 1&lt;/a&gt;, we explored the Medallion Architecture with clean, well-behaved vendor data. The bronze layer simply landed the raw CSV files. The silver layer standardized measurement names. The gold layer aggregated for analysis. Everything worked beautifully.&lt;/p&gt;

&lt;p&gt;Then reality arrived.&lt;/p&gt;

&lt;p&gt;This post demonstrates what happens when vendor CSV files exhibit the full spectrum of real-world data quality issues. We'll watch the bronze layer transform from "just land the data" into an increasingly complex series of transformations, vendor-specific logic, and fragile workarounds. By the end, we'll be asking uncomfortable questions about what "bronze" actually means.&lt;/p&gt;

&lt;p&gt;The code examples are adapted from a &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/pybay_presentation_2025-10.py" rel="noopener noreferrer"&gt;conference talk I gave at PyBay 2025&lt;/a&gt;, where I walked through this exact problem progression live.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: Different Column Names for the Same Measurements
&lt;/h2&gt;

&lt;p&gt;Vendor A and Vendor B measure the same soil properties. Both provide pH, copper concentration, and zinc concentration. Their CSV files look like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor A schema:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vendor B schema:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,acidity,cu_total,zn_total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same measurements. Different names. Vendor B calls pH "acidity." They use chemical symbols with &lt;code&gt;_total&lt;/code&gt; suffixes instead of element names with &lt;code&gt;_ppm&lt;/code&gt; suffixes.&lt;/p&gt;

&lt;p&gt;This is not a data quality problem. This is a legitimate difference in how two professional laboratories name their measurements. (Although pedantically you might wonder about a chemistry lab that thinks pH and acidity are the same thing.) Both schemas are internally consistent and well-documented. The challenge is ours: we need both vendors' data in the same bronze table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 1 (Add Vendor-Specific Column Mapping)
&lt;/h3&gt;

&lt;p&gt;We create a standardization function for each vendor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Vendor A column names are already standard&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Map Vendor B columns to standard names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;acidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;           &lt;span class="c1"&gt;# Different name
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cu_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;# Different name
&lt;/span&gt;        &lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zn_total&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;    &lt;span class="c1"&gt;# Different name
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze ingestion now includes a vendor detection step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with vendor-specific column mapping&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific standardization
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. We can now query both vendors' data using consistent column names. The bronze layer contains standardized schemas.&lt;/p&gt;

&lt;p&gt;But we just added vendor-specific business logic to what was supposed to be a raw data landing zone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 2: Schema Instability Within the Same Vendor
&lt;/h2&gt;

&lt;p&gt;The vendor-specific mapping holds up until Vendor A sends a new file. Our ingestion pipeline fails with a schema mismatch error. Examining the file reveals that Vendor A now includes additional analytes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendor A - Basic package (what we had):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Vendor A - Metals package (what we just received):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm,lead_ppm,iron_ppm,manganese_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The schema changes based on which analysis package the customer ordered. Sometimes they order basic soil testing. Sometimes they add heavy metals analysis. The vendor includes only the columns relevant to what was tested.&lt;/p&gt;

&lt;p&gt;This is also not a data quality problem. Including only requested measurements is reasonable and reduces file size. But it breaks our bronze table schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 2 (Create Superset Schema)
&lt;/h3&gt;

&lt;p&gt;The solution requires accommodating all possible variations. We create a superset schema containing all possible columns from all analysis packages. When ingesting files with fewer columns, we add NULL values for missing measurements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Add missing columns as NULL to match superset schema&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Define superset of all possible columns for this vendor
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;all_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lab_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# basic package
&lt;/span&gt;            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iron_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;manganese_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# metals package
&lt;/span&gt;            &lt;span class="c1"&gt;# ... more columns as we discover them
&lt;/span&gt;        &lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Add missing columns as NULL
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Reorder columns to match superset
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our bronze ingestion tracks package types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with superset schema alignment&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. But now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Our bronze table is sparse (most columns are NULL for most rows)&lt;/li&gt;
&lt;li&gt;We must maintain a master list of all possible columns for each vendor&lt;/li&gt;
&lt;li&gt;Adding new analytes requires code changes&lt;/li&gt;
&lt;li&gt;We can't distinguish between "wasn't measured" and "measurement failed"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The bronze layer is accumulating knowledge about vendor schemas and business rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 3: Typos in Column Headers
&lt;/h2&gt;

&lt;p&gt;Our superset schema handles varying column sets, but the next issue reveals a different category of problem. A file from Vendor A fails to parse correctly. Examining the raw CSV, we find:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcod,lab_id,date_recieved,date_proccessed,ph,copper_ppm,zinc_ppm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three typos: &lt;code&gt;sample_barcod&lt;/code&gt; (missing 'e'), &lt;code&gt;date_recieved&lt;/code&gt; (i before e), &lt;code&gt;date_proccessed&lt;/code&gt; (double c). The vendor's export system mangles column names. Occasionally.&lt;/p&gt;

&lt;p&gt;These files are otherwise valid. The data values are correct. Only the header row has issues. Rejecting these files would delay processing by days while we contact the vendor.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 3 (Add Fuzzy Column Matching)
&lt;/h3&gt;

&lt;p&gt;Rejecting files creates unacceptable delays, so we implement fuzzy matching to handle common typos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fix common typos in column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;column_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;col_lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Check for common typos
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recieved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reciev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccessed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccessed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proccess&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;col_lower&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;

        &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_col&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
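&lt;p&gt;One wrinkle worth flagging: the membership checks above run against &lt;code&gt;col_lower&lt;/code&gt;, but the replacements run against the original-case &lt;code&gt;col&lt;/code&gt;, so a header like Vendor B's &lt;code&gt;Date_Proccessed&lt;/code&gt; passes the check yet comes back unchanged. A case-insensitive variant (a sketch, not the pipeline's actual helper) can lean on &lt;code&gt;re.sub&lt;/code&gt;:&lt;/p&gt;

```python
import re

# Hypothetical case-insensitive variant of the typo fixes above.
# Longer patterns come first so "proccessed" wins over "proccess";
# the lookahead mirrors the "barcod but not barcode" guard.
TYPO_PATTERNS = [
    (re.compile(r"recieved", re.IGNORECASE), "received"),
    (re.compile(r"proccessed", re.IGNORECASE), "processed"),
    (re.compile(r"proccess", re.IGNORECASE), "process"),
    (re.compile(r"barcod(?!e)", re.IGNORECASE), "barcode"),
]


def fix_typos_case_insensitive(name):
    """Apply each typo pattern regardless of the header's casing."""
    for pattern, replacement in TYPO_PATTERNS:
        name = pattern.sub(replacement, name)
    return name
```

&lt;p&gt;Note the replacements normalize the matched fragment to lowercase (&lt;code&gt;Date_Proccessed&lt;/code&gt; becomes &lt;code&gt;Date_processed&lt;/code&gt;), which is harmless here since the vendor-specific standardization downstream renames columns anyway.&lt;/p&gt;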



&lt;p&gt;Our bronze ingestion grows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with fuzzy column matching&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works. But we're now making quality decisions about what constitutes an acceptable typo. We're interpreting intent. The bronze layer is no longer just landing raw data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 4: Excel Nightmares
&lt;/h2&gt;

&lt;p&gt;Vendor B sends a file that completely breaks our parser. Opening it in a text editor reveals the structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Contact:,lab@testing.com,"","","","","","","","","","","","","","","","",""
Generated:,2024-10-15,"","","","","","","","","","","","","","","","",""
Lab Name:,Premium Soil Testing,"","","","","","","","","","","","","","","","",""
Sampl_Barcode,lab_id,DATE_RECEIVED,Date_Proccessed,acidity,cu_totl,ZN_TOTL,pb_total,fe_total,Mn_Totl,b_total,mo_totl,ec_ms_cm,Organic_Carbon_Pct,"","","","",""
PYB2475-266277,AT6480 68463,2024-05-12,2024-05-19,6.46,6.63,29.5,4.22,103.,3.56,0.759,0.186,1.44,0.30,"","","","",""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three metadata rows precede the actual header, and every row is padded with trailing empty fields. Both are telltale signs of an Excel export where someone navigated beyond the data range and accidentally pressed enter before saving.&lt;/p&gt;

&lt;p&gt;The actual data is fine. The measurements are valid. We just need to skip the metadata rows and ignore the empty columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 4 (Add Header Detection and Column Filtering)
&lt;/h3&gt;

&lt;p&gt;We implement header detection to skip metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Find the row that looks like a header (most non-null values)&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;max_non_null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;non_null_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_null_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;max_non_null&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;max_non_null&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;non_null_count&lt;/span&gt;
            &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
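&lt;p&gt;The heuristic is easier to sanity-check outside Spark. The same most-non-empty-fields-wins logic in plain Python (a standalone sketch over a trimmed stand-in for the Vendor B file, not pipeline code) picks row index 3, because the strict comparison keeps the first row that reaches the maximum count:&lt;/p&gt;

```python
import csv
import io


def detect_header_index(csv_text, sample_rows=10):
    """Return the index of the first row with the most non-empty fields."""
    header_idx, max_non_empty = None, 0
    for i, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if i >= sample_rows:
            break
        non_empty = sum(1 for field in row if field.strip())
        if non_empty > max_non_empty:  # strict ">" keeps the earliest winner
            max_non_empty = non_empty
            header_idx = i
    return header_idx


# Trimmed stand-in for the Vendor B export: three metadata rows, then the header.
vendor_b_sample = (
    'Contact:,lab@testing.com,"",""\n'
    'Generated:,2024-10-15,"",""\n'
    'Lab Name:,Premium Soil Testing,"",""\n'
    'Sampl_Barcode,lab_id,DATE_RECEIVED,Date_Proccessed\n'
    'PYB2475-266277,AT6480 68463,2024-05-12,2024-05-19\n'
)
```

&lt;p&gt;The data row ties the header on field count, but the tie-break toward the earliest row is exactly what we want for this file shape.&lt;/p&gt;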



&lt;p&gt;We filter out empty columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove columns with empty names or all empty values&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;cols_to_keep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;  &lt;span class="c1"&gt;# Has a name
&lt;/span&gt;            &lt;span class="c1"&gt;# Check if column has any non-empty values
&lt;/span&gt;            &lt;span class="n"&gt;non_empty_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isNotNull&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_empty_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;cols_to_keep&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cols_to_keep&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And implement re-reading from the correct header position:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Re-read CSV file with header at a specific row index&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read the entire file as text, skip to header row
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract the header row
&lt;/span&gt;    &lt;span class="n"&gt;header_row&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;new_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;header_row&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;val&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="c1"&gt;# Skip rows before header and use header_idx row as column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;# Filter to rows after header
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;monotonically_increasing_id&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns using detected header
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;col_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_columns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;col_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
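&lt;p&gt;The two-pass idea is easier to see on a local file with the &lt;code&gt;csv&lt;/code&gt; module (a sketch with a trimmed stand-in file, not the pipeline's Spark path): pick the detected row as the header, drop the unnamed padding columns, and pair every later row against the surviving names.&lt;/p&gt;

```python
import csv
import io

# Trimmed stand-in for the Vendor B export (metadata rows, padded fields).
VENDOR_B_TEXT = (
    'Contact:,lab@testing.com,"",""\n'
    'Generated:,2024-10-15,"",""\n'
    'Lab Name:,Premium Soil Testing,"",""\n'
    'Sampl_Barcode,lab_id,acidity,""\n'
    'PYB2475-266277,AT6480 68463,6.46,""\n'
)


def reread_with_header(csv_text, header_idx):
    """Treat row header_idx as the header, keep only named columns,
    and return the remaining rows as dicts keyed by those names."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [name for name in rows[header_idx] if name.strip()]
    # zip stops at the shorter sequence, so padding fields fall away
    return [dict(zip(header, row)) for row in rows[header_idx + 1:]]
```

&lt;p&gt;Because &lt;code&gt;zip&lt;/code&gt; truncates to the named columns, the trailing empty fields never make it into the records, which is the same effect the Spark version achieves with the rename loop plus &lt;code&gt;remove_empty_columns&lt;/code&gt;.&lt;/p&gt;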



&lt;p&gt;The bronze ingestion continues to grow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer with header detection and column filtering&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV without assuming first row is header
&lt;/span&gt;    &lt;span class="n"&gt;df_peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect where the real header is
&lt;/span&gt;    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_peek&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Re-read with correct header
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove empty columns
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze layer now includes heuristics for detecting valid data. We're making educated guesses about file structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 5: Database-Hostile Column Names
&lt;/h2&gt;

&lt;p&gt;Vendor B's files sometimes include special characters in column names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#sample_id,lab_id,organic_matter%,cu-total,zn-total
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;#&lt;/code&gt; prefix, &lt;code&gt;%&lt;/code&gt; suffix, and hyphens require backtick escaping in SQL queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="nv"&gt;`#sample_id`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`organic_matter%`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;`cu-total`&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lab_samples&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;vendor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'vendor_b'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every analyst who touches this data must remember the escaping rules. Queries become brittle and harder to read.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bronze Layer: Approach 5 (Add Character Sanitization)
&lt;/h3&gt;

&lt;p&gt;We sanitize column names to be database-friendly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;sanitize_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove invalid database characters from column names&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;column_mapping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Remove #, convert % to _pct, replace - with _
&lt;/span&gt;        &lt;span class="n"&gt;new_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_pct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_col&lt;/span&gt;

    &lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;column_mapping&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;old_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
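&lt;p&gt;As a quick check, the same three substitutions in plain Python map Vendor B's hostile header to query-friendly names (a sketch only; the pipeline applies them through &lt;code&gt;withColumnRenamed&lt;/code&gt;):&lt;/p&gt;

```python
def sanitize_name(name):
    """Strip '#', turn '%' into '_pct', and turn '-' into '_'."""
    return name.replace("#", "").replace("%", "_pct").replace("-", "_")


# Vendor B's header from the example above
hostile = ["#sample_id", "lab_id", "organic_matter%", "cu-total", "zn-total"]
sanitized = [sanitize_name(name) for name in hostile]
```

&lt;p&gt;Every resulting name is a plain identifier, so the backticks disappear from downstream SQL.&lt;/p&gt;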



&lt;p&gt;The complete bronze ingestion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_vendor_to_bronze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Bronze layer: The final form&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Read CSV to detect header
&lt;/span&gt;    &lt;span class="n"&gt;df_peek&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Detect and skip metadata rows
&lt;/span&gt;    &lt;span class="n"&gt;header_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_header_row&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_peek&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Re-read with correct header position
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reread_with_header_at_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;header_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove empty columns
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;remove_empty_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fix typos in column names
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fix_column_typos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Sanitize database characters
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sanitize_column_names&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Apply vendor-specific column mapping
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_a_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;standardize_vendor_b_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Align to superset schema
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;align_to_superset_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vendor_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_package&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mergeSchema&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bronze.lab_samples&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Eight transformation steps. Vendor-specific logic branches. Fuzzy matching heuristics. Schema knowledge. Quality decisions.&lt;/p&gt;

&lt;p&gt;This was supposed to be "just land the raw data."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Uncomfortable Questions
&lt;/h2&gt;

&lt;p&gt;We started with a simple bronze layer that read CSV files and wrote them to a table. We now have a complex ingestion pipeline that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Applies business logic:&lt;/strong&gt; Vendor-specific column mapping encodes knowledge about what measurements mean across different schemas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes quality decisions:&lt;/strong&gt; Fuzzy matching determines which typos are acceptable and how to fix them&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interprets structure:&lt;/strong&gt; Header detection guesses where real data begins&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modifies data:&lt;/strong&gt; Character sanitization changes the raw column names we received&lt;/li&gt;
&lt;/ol&gt;
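&lt;p&gt;To make the second point concrete, here is a minimal sketch of the quality decision hiding inside fuzzy matching, using Python's standard-library &lt;code&gt;difflib&lt;/code&gt;. The canonical column list and the cutoff value are illustrative, not the exact values from the pipeline:&lt;/p&gt;

```python
import difflib

# Canonical column names the pipeline expects (illustrative subset)
CANONICAL_COLUMNS = [
    "sample_barcode", "lab_id", "date_received",
    "date_processed", "ph", "copper_ppm", "zinc_ppm",
]

def match_column(raw_name, cutoff=0.8):
    """Return the closest canonical name, or None if nothing clears the cutoff.

    The cutoff IS the quality decision: lower it and unrelated columns start
    matching; raise it and real typos slip through unmapped.
    """
    matches = difflib.get_close_matches(
        raw_name.lower(), CANONICAL_COLUMNS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

&lt;p&gt;A transposed character still maps cleanly: &lt;code&gt;match_column("sampel_barcode")&lt;/code&gt; returns &lt;code&gt;"sample_barcode"&lt;/code&gt;, while an unknown measurement like &lt;code&gt;"turbidity_ntu"&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; and has to be handled some other way. Somebody chose that threshold, and that choice is business logic.&lt;/p&gt;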

&lt;p&gt;Is this still a "bronze layer"? The Medallion Architecture describes bronze as raw data with minimal transformation. We're well beyond minimal.&lt;/p&gt;

&lt;p&gt;What happens when Vendor C arrives? We add more column mappings to the function, another branch in the if/elif chain, and hope their quirks don't conflict with the assumptions we've baked into our existing logic. And how do we decide what the canonical name for each column should be?&lt;/p&gt;

&lt;p&gt;How do we test this? We need sample files for every vendor, every analysis package, every combination of issues. The test matrix grows exponentially.&lt;/p&gt;
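&lt;p&gt;The growth is easy to see by enumerating the fixture dimensions. The specific quirk list here is illustrative:&lt;/p&gt;

```python
import itertools

# Illustrative fixture dimensions; each new vendor, package, or quirk
# multiplies the number of sample files we need to maintain
VENDORS = ["vendor_a", "vendor_b"]
PACKAGES = ["basic", "extended"]
QUIRKS = ["clean", "typo_headers", "metadata_rows", "empty_columns"]

FIXTURE_MATRIX = list(itertools.product(VENDORS, PACKAGES, QUIRKS))
# 2 vendors x 2 packages x 4 quirks = 16 fixture files already,
# before vendor C or any combination of quirks enters the picture
```

&lt;p&gt;With a test runner like pytest, each tuple would become a parametrized test case, and the suite grows every time a vendor invents a new quirk.&lt;/p&gt;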

&lt;p&gt;We haven't addressed date format differences, unit conversions, vendor-specific codes, or the dozens of other variations we'll encounter as more vendors join the system.&lt;/p&gt;

&lt;p&gt;The bronze layer has gotten away from us.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Would You Manage This Complexity?
&lt;/h2&gt;

&lt;p&gt;Before we explore solutions in the next post, consider how you would handle this problem in your own systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Would you continue adding transformation logic to the bronze layer until it handles every edge case?&lt;/li&gt;
&lt;li&gt;Would you reject files that don't conform to expected formats and force vendors to fix their exports?&lt;/li&gt;
&lt;li&gt;Would you build a configuration system where new vendor quirks can be added without code changes?&lt;/li&gt;
&lt;li&gt;Would you accept some data quality issues and handle them downstream in the silver layer?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each approach has tradeoffs. Adding more transformations makes the bronze layer complex and fragile. Rejecting files delays processing and frustrates vendors and users of the data alike. Configuration systems add their own complexity. Pushing problems downstream just moves the pain to a different layer.&lt;/p&gt;
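&lt;p&gt;For a sense of what the configuration option might look like, here is a minimal sketch. The vendor column maps below are invented for illustration, not the mappings used in this series:&lt;/p&gt;

```python
# Hypothetical per-vendor configuration: onboarding vendor C means adding
# an entry here instead of another branch in an if/elif chain
VENDOR_COLUMN_MAPS = {
    "vendor_a": {"cu_ppm": "copper_ppm", "zn_ppm": "zinc_ppm"},
    "vendor_b": {"copper": "copper_ppm", "zinc": "zinc_ppm", "acidity": "ph"},
}

def standardize_columns(columns, vendor_name):
    """Rename raw column names via the vendor's config, passing unknowns through."""
    mapping = VENDOR_COLUMN_MAPS.get(vendor_name, {})
    return [mapping.get(col, col) for col in columns]
```

&lt;p&gt;The complexity hasn't disappeared, though; it has moved into configuration data that someone still has to maintain and validate.&lt;/p&gt;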

&lt;p&gt;What if the fundamental problem is that we're treating column names as schema when they should be treated as data?&lt;/p&gt;

&lt;p&gt;In the next post, we'll explore this alternative. Instead of fighting schema chaos with increasingly complex transformations, we'll embrace it. We'll examine how a single transformation applied to all vendors can replace vendor-specific logic, superset schemas, fuzzy matching, and header detection with something simpler and more robust.&lt;/p&gt;

&lt;p&gt;The solution involves questioning what "raw" actually means.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Coming Soon:&lt;/strong&gt; Part 3 - The Zen of the Bronze Layer: Embracing Schema Chaos&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; All examples are available in the &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;zen_bronze_data repository&lt;/a&gt;. The &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/pybay_presentation_2025-10.py" rel="noopener noreferrer"&gt;PyBay presentation notebook&lt;/a&gt; contains runnable versions of these transformations.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>spark</category>
    </item>
    <item>
      <title>Medallion Architecture 101: Building Data Pipelines That Don't Fall Apart</title>
      <dc:creator>Aaron Wiegel</dc:creator>
      <pubDate>Fri, 23 Jan 2026 09:10:19 +0000</pubDate>
      <link>https://forem.com/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil</link>
      <guid>https://forem.com/aawiegel/medallion-architecture-101-building-data-pipelines-that-dont-fall-apart-1gil</guid>
      <description>&lt;p&gt;Medallion architecture appears everywhere in modern data engineering. Bronze, Silver, Gold. Raw data, refined data, analytics-ready data. Every Databricks tutorial mentions it. Every lakehouse pitch deck includes the diagram. Every "modern data stack" blog post treats it as gospel.&lt;/p&gt;

&lt;p&gt;Here's the part most people skip: the core ideas trace back to Ralph Kimball's data warehouse methodology from the 1990s. Kimball advocated for staging areas that preserved raw data, integration zones for applying business logic, and dimensional models for analytics delivery. His framework included thirty-four subsystems covering everything from data extraction to audit dimension management. The medallion pattern distills these principles into something clearer and more actionable: three layers with distinct responsibilities.&lt;/p&gt;

&lt;p&gt;This evolution represents genuine improvement, not just rebranding. Kimball's warehouse architecture was comprehensive but complex, designed for batch ETL in relational databases. Medallion architecture preserves the wisdom about data quality and layer separation while adapting to lakehouse capabilities like schema evolution and streaming ingestion. We kept what worked and simplified what didn't. Bronze gets you on the podium. Silver refines your performance. Gold takes the championship.&lt;/p&gt;

&lt;p&gt;The pattern itself is straightforward. Bronze layers ingest raw data with minimal transformation, preserving source formats and tracking metadata about where data originated. Silver layers apply business logic, cleaning and standardizing data into queryable formats. Gold layers deliver analytics-ready datasets, pre-aggregated and optimized for dashboards and reports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg777vfpx285y5ro3yzye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg777vfpx285y5ro3yzye.png" alt="Medallion diagram" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This series starts with the ideal case. Clean vendor files, stable schemas, cooperative data sources. We'll implement proper medallion architecture with metadata tracking and layer discipline. Then reality arrives. Post two explores what happens when vendors send chaos instead of clean CSVs: typos in headers, unstable schemas, Excel nightmares that make you question your career choices. Post three reveals an elegant solution that handles the chaos without drowning in vendor-specific transformation code.&lt;/p&gt;

&lt;p&gt;But first, the foundation. Let's build medallion architecture the right way.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;THE EVOLUTION&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kimball's Data Warehouse (1996) → Medallion Lakehouse (2020)

34 subsystems         → 3 layers (Bronze, Silver, Gold)
Staging Area          → Bronze (raw ingestion, preserve source)
Integration Zone      → Silver (cleaned, standardized, queryable)
Dimensional Delivery  → Gold (aggregated, analytics-ready)

Batch ETL             → Streaming + batch
Relational warehouses → Cloud lakehouses

Same core wisdom. Simpler execution.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Understanding the Layers
&lt;/h2&gt;

&lt;p&gt;Understanding medallion architecture requires understanding separation of concerns. Each layer serves one purpose. Bronze preserves raw data exactly as received, adding only metadata for tracking and auditability. Silver applies business rules and standardization, transforming raw inputs into consistent, queryable formats. Gold optimizes for specific analytical use cases, pre-aggregating and structuring data for dashboards, reports, and data science workflows.&lt;/p&gt;

&lt;p&gt;The discipline matters more than the metaphor. Resist the temptation to "just quickly clean this in bronze" or "add one small aggregation to silver." Each compromise weakens the architecture. Bronze becomes unpredictable when transformations creep in. Silver becomes cluttered when analytics logic appears. Gold loses focus when it tries to serve every possible use case. Maintain clear boundaries between layers, and the entire pipeline becomes easier to debug, test, and extend.&lt;/p&gt;

&lt;p&gt;Enough concepts. Let's implement this with actual code and actual data. We'll work with a clean vendor CSV file with stable schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Use Case
&lt;/h2&gt;

&lt;p&gt;Biotech companies frequently work with contract research organizations to perform specialized laboratory measurements. These external labs analyze samples and return results as CSV or Excel files. Each vendor has its own file format, column naming conventions, and data delivery schedules. Our task is to ingest this vendor data and make it available for analysis while maintaining data quality and traceability. You can follow along with a Databricks notebook on GitHub &lt;a href="https://github.com/aawiegel/zen_bronze_data/blob/main/notebooks/001_medallion_demo.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our scenario involves a soil chemistry lab that measures pH levels and metal concentrations. They send results as CSV files with one row per sample. Each file contains sample identifiers, lab batch information, collection and processing dates, and measurement results. For this first post, we're working with the ideal case: clean data, stable schema, consistent formatting.&lt;/p&gt;

&lt;p&gt;Here's what the vendor file looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode,lab_id,date_received,date_processed,ph,copper_ppm,zinc_ppm
PYB6134-404166,PSL-73 72846,2024-02-02,2024-02-08,6.74,10.7,5.23
PYB8638-328304,PSL-77 74041,2024-10-11,2024-10-17,6.43,6.34,5.64
PYB7141-642256,PSL-82 22558,2024-08-28,2024-09-03,5.58,3.64,39.8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's ingest this through our medallion architecture, tracking metadata at each stage and maintaining proper layer separation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bronze Layer: Raw Ingestion with Metadata
&lt;/h2&gt;

&lt;p&gt;The bronze layer preserves raw data with minimal transformation. The challenge lies in tracking provenance without corrupting the source data itself. Every record needs an audit trail: which file did this come from, when did it arrive, which vendor sent it. This metadata proves essential when data quality issues appear downstream or when business users question analytical results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metadata Tracking Patterns
&lt;/h3&gt;

&lt;p&gt;There are several approaches to tracking ingestion metadata, each with different trade-offs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedded metadata&lt;/strong&gt; adds provenance columns directly to the bronze table. Every row carries its source file path, ingestion timestamp, and vendor identifier. This approach is simple to implement and query since no joins are needed to trace a record back to its source. The trade-off is that file-level information repeats across every row. In practice, columnar storage formats like Parquet and Delta compress these repetitive string values efficiently, making the storage overhead minimal.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate ingestion table&lt;/strong&gt; maintains a dedicated &lt;code&gt;ingestion_metadata&lt;/code&gt; table with a surrogate key (like a &lt;code&gt;batch_id&lt;/code&gt;). Bronze rows reference this key, keeping the data table compact. Need to analyze ingestion patterns or identify missing vendor deliveries? Query the lightweight metadata table without scanning data. Need complete lineage for specific records? Join the tables. This pattern scales better for production systems with high ingestion volumes and complex monitoring requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data vault-style through table&lt;/strong&gt; uses a link or satellite table to associate ingestion batches with data records. This supports many-to-many relationships: a record appearing across multiple loads, or a single load touching multiple target tables. This is the most flexible pattern for complex lineage scenarios but adds architectural complexity.&lt;/p&gt;

&lt;p&gt;For this demonstration, we use embedded metadata. It keeps the focus on medallion layer mechanics without introducing additional tables. Production systems with high data volumes or complex ingestion monitoring should consider the separate table or data vault patterns.&lt;/p&gt;
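&lt;p&gt;As a sketch of the separate-table pattern, here is the surrogate-key idea in plain Python structures. In production these would be two Delta tables joined on &lt;code&gt;batch_id&lt;/code&gt;; the field names are illustrative:&lt;/p&gt;

```python
from datetime import datetime, timezone
from uuid import uuid4

def new_ingestion_batch(vendor, file_name):
    """One metadata record per ingestion batch, keyed by a surrogate batch_id."""
    return {
        "batch_id": str(uuid4()),
        "vendor": vendor,
        "source_file": file_name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }

batch = new_ingestion_batch("vendor_a", "vendor_a_basic_clean.csv")

# Bronze rows carry only the surrogate key instead of repeating
# file-level metadata on every row
bronze_rows = [
    {"sample_barcode": "PYB6134-404166", "ph": "6.74", "batch_id": batch["batch_id"]},
    {"sample_barcode": "PYB8638-328304", "ph": "6.43", "batch_id": batch["batch_id"]},
]
```

&lt;p&gt;Monitoring queries hit only the small batch table; full lineage for a specific record is one join away.&lt;/p&gt;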

&lt;h3&gt;
  
  
  Ingesting with Databricks' &lt;code&gt;_metadata&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Databricks exposes file-level metadata automatically through a special &lt;code&gt;_metadata&lt;/code&gt; struct column. By including it in our select, we get the source file path, file name, and modification time without any manual tracking. We supplement this with business-level metadata: a unique ingestion ID, the data source identifier, a timestamp, and a row position for ordering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid4&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Read vendor CSV, including Databricks file metadata
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Add ingestion metadata (the ONLY transformation in bronze)
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;
    &lt;span class="c1"&gt;# Extract from Databricks' _metadata struct
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_modified_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata.file_modification_time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Add business metadata
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingestion_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ingested_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file_row_number&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonically_increasing_id&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to bronze
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The bronze table now contains both the raw data and its provenance trail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode | lab_id       | date_received | ... | source_file_path              | source_file_name           | file_modified_at     | ingestion_id                          | data_source | ingested_at          | file_row_number
PYB6134-404166 | PSL-73 72846 | 2024-02-02   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 0
PYB8638-328304 | PSL-77 74041 | 2024-10-11   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 1
PYB7141-642256 | PSL-82 22558 | 2024-08-28   | ... | /Volumes/.../incoming/ven...  | vendor_a_basic_clean.csv   | 2024-12-16 09:00:00 | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | vendor_a    | 2024-12-16 10:23:45 | 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what we did and didn't do. We loaded the data into a SQL-queryable table and added metadata so any record can be traced back to its source file and ingestion event. We did not convert data types (everything remains strings), rename columns, validate values, or apply business logic. Bronze is about preservation, not transformation. When someone's dashboard breaks downstream, the metadata lets us walk backward through the entire pipeline to find where problems originated.&lt;/p&gt;
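&lt;p&gt;For example, tracing a suspect record back to its delivery is a single filter on the embedded metadata columns. This is a hypothetical lookup against the table written above; the catalog and schema names are placeholders:&lt;/p&gt;

```python
# Placeholder names matching the earlier write; adjust to your workspace
catalog = "main"
bronze_schema = "bronze"
suspect_barcode = "PYB7141-642256"

# Which file delivered this sample, and when was it ingested?
lineage_query = f"""
SELECT sample_barcode, source_file_name, file_modified_at, ingestion_id, ingested_at
FROM {catalog}.{bronze_schema}.vendor_a_samples_raw
WHERE sample_barcode = '{suspect_barcode}'
"""
# On Databricks: spark.sql(lineage_query).show()
```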

&lt;h2&gt;
  
  
  Silver Layer: Cleaning and Standardization
&lt;/h2&gt;

&lt;p&gt;The silver layer applies business logic and standardization. Here we convert data types, apply semantic renaming, and transform raw vendor formats into consistent structures that downstream users can query reliably. Silver is where we enforce data contracts and catch quality issues before they propagate to analytics.&lt;/p&gt;

&lt;p&gt;Our vendor data needs several transformations. The date columns arrive as strings and need conversion to proper date types. The numeric measurements remain strings from CSV parsing and require casting to doubles. We also apply minor semantic renaming: &lt;code&gt;data_source&lt;/code&gt; becomes &lt;code&gt;vendor_name&lt;/code&gt; to better reflect its business meaning, and we rename the raw &lt;code&gt;_metadata&lt;/code&gt; struct to &lt;code&gt;databricks_ingestion_metadata&lt;/code&gt; for clarity. The original measurement column names (&lt;code&gt;ph&lt;/code&gt;, &lt;code&gt;copper_ppm&lt;/code&gt;, &lt;code&gt;zinc_ppm&lt;/code&gt;) are already clear and descriptive, so we keep them as-is.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;DateType&lt;/span&gt;

&lt;span class="c1"&gt;# Read from bronze
&lt;/span&gt;&lt;span class="n"&gt;df_bronze&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_raw&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Apply transformations
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_bronze&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumnsRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;databricks_ingestion_metadata&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Parse dates
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yyyy-MM-dd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="c1"&gt;# Cast measurement types
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DoubleType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
    &lt;span class="c1"&gt;# Add processing timestamp
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silver_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to silver
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_cleaned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The silver table now contains properly typed columns with semantic naming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sample_barcode | lab_id       | date_received | date_processed | ph   | copper_ppm | zinc_ppm | ... | vendor_name | ingestion_id                          | file_row_number | silver_processed_at
PYB6134-404166 | PSL-73 72846 | 2024-02-02   | 2024-02-08     | 6.74 | 10.7       | 5.23     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 0               | 2024-12-16 10:24:15
PYB8638-328304 | PSL-77 74041 | 2024-10-11   | 2024-10-17     | 6.43 | 6.34       | 5.64     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 1               | 2024-12-16 10:24:15
PYB7141-642256 | PSL-82 22558 | 2024-08-28   | 2024-09-03     | 5.58 | 3.64       | 39.8     | ... | vendor_a    | a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e | 2               | 2024-12-16 10:24:15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Silver serves multiple purposes beyond type conversion. Production implementations often augment data through joins with reference tables, split datasets into separate dimension and fact tables following dimensional modeling patterns, apply complex business calculations, or enrich records with derived attributes. This simple example focuses on basic standardization to establish the pattern. Later posts will explore more sophisticated transformations as our architecture evolves to handle real-world complexity.&lt;/p&gt;

&lt;p&gt;Notice we preserved the ingestion identifier and file row number. These lineage columns allow us to trace any silver record back to its bronze source and ultimately to the original vendor file. This becomes essential when data quality issues emerge or when business users question specific values. We can walk backward through the entire pipeline to find where problems originated.&lt;/p&gt;
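&lt;p&gt;To make the trace-back concrete, here's a minimal sketch in plain Python, with dicts standing in for table rows. The &lt;code&gt;ingestion_id&lt;/code&gt; value echoes the hypothetical one in the silver output above, and the &lt;code&gt;source_file&lt;/code&gt; field is illustrative; in practice this lookup would be a join or filter against the bronze table on the same two columns:&lt;/p&gt;

```python
# Sketch: walk a silver record back to its bronze source row using the
# preserved lineage columns. Plain dicts stand in for table rows here;
# the source_file field is a hypothetical bronze provenance column.

bronze_rows = [
    {"ingestion_id": "a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e",
     "file_row_number": 2, "source_file": "vendor_a_2024.csv"},
]

def trace_to_bronze(silver_record: dict, bronze: list[dict]) -> dict:
    """Find the bronze row a silver record came from via its lineage columns."""
    return next(
        row for row in bronze
        if row["ingestion_id"] == silver_record["ingestion_id"]
        and row["file_row_number"] == silver_record["file_row_number"]
    )

# Say a business user questions that 39.8 zinc reading from the silver
# sample output; the lineage columns take us straight to its origin.
suspect = {"ingestion_id": "a7f3c891-4e2b-4d1a-9c8f-1b5e6a7c8d9e",
           "file_row_number": 2, "zinc_ppm": 39.8}
print(trace_to_bronze(suspect, bronze_rows)["source_file"])  # vendor_a_2024.csv
```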

&lt;p&gt;The silver layer also serves as a natural place for data quality checks. We could add validation rules here: pH values should fall between 0 and 14, concentration values should be non-negative, processing dates should occur after received dates. For this simple example we're skipping validation, but production implementations would include these checks.&lt;/p&gt;
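&lt;p&gt;As a sketch of what those rules look like, here they are in plain Python. A production pipeline would express them as DataFrame filters or declarative expectations rather than row-by-row dict checks, but the logic is the same:&lt;/p&gt;

```python
# Minimal sketch of the validation rules described above, applied to one
# silver record at a time. Field names follow the silver schema from the
# examples in this post.
from datetime import date

def validate_sample(record: dict) -> list[str]:
    """Return a list of rule violations for one silver record."""
    errors = []
    if not 0 <= record["ph"] <= 14:
        errors.append("ph out of range [0, 14]")
    for field in ("copper_ppm", "zinc_ppm"):
        if record[field] < 0:
            errors.append(f"{field} is negative")
    if record["date_processed"] < record["date_received"]:
        errors.append("processed before received")
    return errors

sample = {
    "ph": 6.74,
    "copper_ppm": 10.7,
    "zinc_ppm": 5.23,
    "date_received": date(2024, 2, 2),
    "date_processed": date(2024, 2, 8),
}
print(validate_sample(sample))  # []
print(validate_sample({**sample, "ph": 15.2}))  # ['ph out of range [0, 14]']
```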

&lt;h2&gt;
  
  
  Gold Layer: Analytics-Ready Aggregations
&lt;/h2&gt;

&lt;p&gt;The gold layer delivers pre-aggregated datasets optimized for specific analytical use cases. Rather than forcing analysts to repeatedly write the same aggregation queries against silver, we materialize common patterns. This improves query performance and ensures consistent metric definitions across dashboards and reports.&lt;/p&gt;

&lt;p&gt;For our soil sample data, we'll create a monthly summary table that aggregates measurements by vendor. This serves common analytical questions: how many samples did each vendor deliver per month? What were the average pH and concentration levels? Did measurements vary significantly across time periods?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;

&lt;span class="c1"&gt;# Read from silver
&lt;/span&gt;&lt;span class="n"&gt;df_silver&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;table&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;silver_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.vendor_a_samples_cleaned&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create monthly summary aggregations
&lt;/span&gt;&lt;span class="n"&gt;df_gold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;df_silver&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date_received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;month_start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vendor_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_barcode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sample_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stddev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stddev_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;min_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_ph&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_copper_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;avg_zinc_ppm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gold_processed_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;current_timestamp&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write to gold
&lt;/span&gt;&lt;span class="n"&gt;df_gold&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;overwrite&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;saveAsTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gold_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.monthly_vendor_a_summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gold table provides quick access to monthly summary statistics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;month_start | vendor_name | sample_count | avg_ph | stddev_ph | min_ph | max_ph | avg_copper_ppm | avg_zinc_ppm | gold_processed_at
2024-02-01  | vendor_a    | 5            | 6.38   | 0.53      | 5.46   | 7.06   | 5.37           | 8.31         | 2024-12-16 10:25:30
2024-03-01  | vendor_a    | 4            | 6.57   | 0.57      | 5.71   | 7.54   | 3.76           | 13.5         | 2024-12-16 10:25:30
2024-08-01  | vendor_a    | 5            | 6.21   | 0.42      | 5.58   | 6.74   | 5.89           | 18.6         | 2024-12-16 10:25:30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Dashboards query this gold table instead of aggregating silver data repeatedly. Business intelligence tools connect directly to gold for their visualizations. Data scientists pull from gold for initial exploration before diving into silver for detailed analysis. Each layer serves its purpose: bronze preserves raw truth, silver provides clean queryable data, gold optimizes for consumption.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Foundation Is Set
&lt;/h2&gt;

&lt;p&gt;We've implemented medallion architecture in its ideal form. Bronze preserves raw vendor data with minimal transformation, tracking file lineage through embedded metadata columns drawn from Databricks' &lt;code&gt;_metadata&lt;/code&gt; struct and our own business-level provenance fields. Silver applies business logic and standardization, converting string columns to proper types and applying semantic naming. Gold delivers pre-aggregated monthly summaries optimized for analytical consumption. Each layer maintains clear boundaries and serves a distinct purpose.&lt;/p&gt;

&lt;p&gt;This discipline pays dividends when requirements change. Need to recalculate silver with different business rules? Bronze remains untouched as your source of truth. Analytics team wants new gold aggregations? Silver provides clean, typed data ready for transformation. Vendor changes their file format? Only bronze ingestion logic needs adjustment while downstream layers continue functioning.&lt;/p&gt;

&lt;p&gt;The architecture we built assumes cooperative vendors who send clean files with stable schemas. Our sample data arrived perfectly formatted with consistent column names, valid data types, and no surprises. This scenario establishes the pattern and demonstrates proper layer separation.&lt;/p&gt;

&lt;p&gt;Kimball's fingerprints are all over this design. His audit dimension tables inspired the separate ingestion table pattern we discussed earlier, which remains the better choice for production systems with high ingestion volumes or complex monitoring needs; for our demo, embedding metadata directly in bronze keeps the focus on layer mechanics. His staging areas become our bronze layer, his integration zones become our silver transformations, and his dimensional delivery layer maps to gold aggregations. The terminology evolved but the wisdom remained constant.&lt;/p&gt;

&lt;p&gt;Reality rarely cooperates this nicely. To see why, let's peek at what Vendor B sends:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Peek at Vendor B's file
&lt;/span&gt;&lt;span class="n"&gt;vendor_b_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/Volumes/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;bronze_schema&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;incoming_volume&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/vendor_b_standard_clean.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df_vendor_b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;header&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_b_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_vendor_b&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root
 |-- sample_barcode: string
 |-- lab_id: string
 |-- date_received: string
 |-- date_processed: string
 |-- acidity: string
 |-- cu_total: string
 |-- zn_total: string
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait. &lt;code&gt;acidity&lt;/code&gt; instead of &lt;code&gt;ph&lt;/code&gt;? &lt;code&gt;cu_total&lt;/code&gt; instead of &lt;code&gt;copper_ppm&lt;/code&gt;? &lt;code&gt;zn_total&lt;/code&gt; instead of &lt;code&gt;zinc_ppm&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Same measurements. Different column names. How do we handle this without writing vendor-specific transformation code in our silver layer? Do we create separate tables for each vendor? Vendor-specific case statements? A config file with column mappings that grows longer with every new vendor?&lt;/p&gt;
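&lt;p&gt;To see why the config-file option gets uncomfortable, here's the smallest version of it: a per-vendor mapping from source column names to our canonical silver names. The Vendor B names come from the schema printout above; everything else is a sketch, and whether this scales is exactly the question we're about to wrestle with:&lt;/p&gt;

```python
# Sketch of a per-vendor column mapping config. Each new vendor adds an
# entry; unmapped columns pass through unchanged.

COLUMN_MAPPINGS = {
    "vendor_a": {},  # already matches the canonical silver schema
    "vendor_b": {
        "acidity": "ph",
        "cu_total": "copper_ppm",
        "zn_total": "zinc_ppm",
    },
}

def canonicalize(columns: list[str], vendor: str) -> list[str]:
    """Rename a vendor's columns to the canonical silver schema."""
    mapping = COLUMN_MAPPINGS[vendor]
    return [mapping.get(col, col) for col in columns]

print(canonicalize(["sample_barcode", "acidity", "cu_total"], "vendor_b"))
# -> ['sample_barcode', 'ph', 'copper_ppm']
```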

&lt;p&gt;The &lt;a href="https://dev.to/aawiegel/when-bronze-goes-rogue-schema-chaos-in-the-wild-16kf"&gt;next post&lt;/a&gt; explores what happens when vendors send chaos instead of clean CSVs (typos in column headers, unstable schemas, Excel nightmares) and how our clean three-layer architecture begins accumulating complexity. We'll discover an elegant solution that handles schema variations without drowning in vendor-specific code.&lt;/p&gt;

&lt;h2&gt;
  
  
  References and Further Reading
&lt;/h2&gt;

&lt;p&gt;Kimball, Ralph, and Margy Ross. &lt;em&gt;The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling&lt;/em&gt;, 3rd ed. Wiley, 2013. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chapter 2 provides an overview of Kimball's dimensional modeling techniques&lt;/li&gt;
&lt;li&gt;Chapter 8 covers audit dimensions and metadata tracking patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Kimball Group. "Dimensional Data Warehousing Resources." &lt;a href="https://www.kimballgroup.com/" rel="noopener noreferrer"&gt;https://www.kimballgroup.com/&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Archive of articles, design tips, and dimensional modeling techniques from the originators of the methodology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Databricks. "What is the medallion lakehouse architecture?" Databricks Documentation. &lt;a href="https://docs.databricks.com/lakehouse/medallion.html" rel="noopener noreferrer"&gt;https://docs.databricks.com/lakehouse/medallion.html&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official documentation on medallion architecture patterns and best practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The complete code examples and demo notebooks for this blog series are available at: &lt;a href="https://github.com/aawiegel/zen_bronze_data" rel="noopener noreferrer"&gt;https://github.com/aawiegel/zen_bronze_data&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>database</category>
      <category>python</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
