<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Martin</title>
    <description>The latest articles on Forem by Martin (@martiniflap).</description>
    <link>https://forem.com/martiniflap</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3798890%2F8d7d7635-4496-49a9-99e0-ee6d8d9ea5a4.jpeg</url>
      <title>Forem: Martin</title>
      <link>https://forem.com/martiniflap</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/martiniflap"/>
    <language>en</language>
    <item>
      <title>What Git Doesn’t Solve: Designing a Transform Command for Data</title>
      <dc:creator>Martin</dc:creator>
      <pubDate>Mon, 30 Mar 2026 18:08:10 +0000</pubDate>
      <link>https://forem.com/martiniflap/what-git-doesnt-solve-designing-a-transform-command-for-data-1io4</link>
      <guid>https://forem.com/martiniflap/what-git-doesnt-solve-designing-a-transform-command-for-data-1io4</guid>
      <description>&lt;p&gt;In my previous post, I wrote about DataTracker's storage architecture (hashes, objects, and SQLite metadata). This follow-up is about what I think is the most technically interesting command: &lt;code&gt;transform&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you have not read the first post, quick context: &lt;a href="https://github.com/martin-iflap/DataTracker" rel="noopener noreferrer"&gt;DataTracker&lt;/a&gt; is a local CLI tool for versioning datasets (files or directories) with git-like commands (&lt;code&gt;add&lt;/code&gt;, &lt;code&gt;update&lt;/code&gt;, &lt;code&gt;history&lt;/code&gt;, &lt;code&gt;compare&lt;/code&gt;, &lt;code&gt;diff&lt;/code&gt;, &lt;code&gt;export&lt;/code&gt;, etc.).&lt;/p&gt;

&lt;p&gt;This article focuses on one question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do you run a data transformation in Docker and still keep version history useful instead of chaotic?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;transform&lt;/code&gt; Exists at All
&lt;/h2&gt;

&lt;p&gt;Most data versioning tools stop at "store versions". That is useful, but in real workflows the interesting part is what happens &lt;em&gt;between&lt;/em&gt; versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cleaning&lt;/li&gt;
&lt;li&gt;reshaping&lt;/li&gt;
&lt;li&gt;converting formats&lt;/li&gt;
&lt;li&gt;running scripts in reproducible environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted this to be one command, not four manual steps each time.&lt;/p&gt;

&lt;p&gt;Without &lt;code&gt;dt transform&lt;/code&gt;, the flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run some custom Docker command manually&lt;/li&gt;
&lt;li&gt;Hope the output is written where you expected&lt;/li&gt;
&lt;li&gt;Remember to call &lt;code&gt;dt update&lt;/code&gt; afterward&lt;/li&gt;
&lt;li&gt;Pick a version number and message&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That works until you forget step 3 once and history becomes incomplete.&lt;/p&gt;

&lt;p&gt;So the design target became: &lt;strong&gt;run transform + apply versioning rules in one place&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Command Surface
&lt;/h2&gt;

&lt;p&gt;At a high level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dt transform &lt;span class="nt"&gt;--input-data&lt;/span&gt; &amp;lt;path&amp;gt; &lt;span class="nt"&gt;--output-data&lt;/span&gt; &amp;lt;path&amp;gt; &lt;span class="o"&gt;[&lt;/span&gt;options]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key options are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--image&lt;/code&gt;, &lt;code&gt;--command&lt;/code&gt; (required unless using a preset)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--auto-track&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--no-track&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--dataset-id&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--version&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--message&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--preset&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--force&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The command validates that Docker is available and the tracker is initialized, resolves paths, decides whether the output should be versioned, runs the transformation in a container, and then versions the output, possibly even creating a new dataset.&lt;/p&gt;
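&lt;p&gt;The validation pipeline can be sketched roughly like this. Names and structure here are illustrative, not DataTracker's actual internals; &lt;code&gt;run_container&lt;/code&gt; stands in for the Docker layer:&lt;/p&gt;

```python
import os

class TransformError(Exception):
    pass

def run_transform(input_path, output_path, run_container,
                  docker_available=True, tracker_initialized=True):
    # Fail fast on environment problems before touching Docker
    if not docker_available:
        raise TransformError("Docker is not installed or not running")
    if not tracker_initialized:
        raise TransformError("tracker not initialized (run `dt init` first)")
    if not os.path.exists(input_path):
        raise TransformError("input path does not exist: " + input_path)
    os.makedirs(output_path, exist_ok=True)
    # Delegate container execution (docker_manager territory)
    run_container(os.path.abspath(input_path), os.path.abspath(output_path))
    # A transform that wrote nothing back is almost always a mistake
    if not os.listdir(output_path):
        raise TransformError("transform succeeded but produced no output")
    return "ok"
```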




&lt;h2&gt;
  
  
  Design Principle: Separate "Run" from "Track Decision"
&lt;/h2&gt;

&lt;p&gt;One thing I changed early was separating concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker execution stays in &lt;code&gt;docker_manager.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;tracking/versioning policy stays in &lt;code&gt;transform.py&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Click argument parsing stays in &lt;code&gt;commands.py&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That split made the logic easier to test and refactor. It also prevented the CLI layer from becoming an unmaintainable if-else block.&lt;/p&gt;




&lt;h2&gt;
  
  
  Mount Contract and Safety Checks
&lt;/h2&gt;

&lt;p&gt;Internally the command mounts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;input at &lt;code&gt;/input&lt;/code&gt; (read-only)&lt;/li&gt;
&lt;li&gt;output at &lt;code&gt;/output&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and runs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &amp;lt;input&amp;gt;:/input:ro &lt;span class="nt"&gt;-v&lt;/span&gt; &amp;lt;output&amp;gt;:/output &amp;lt;image&amp;gt; /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&amp;lt;command&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, I validate that the command references both &lt;code&gt;/input&lt;/code&gt; and &lt;code&gt;/output&lt;/code&gt;. This catches a surprisingly common user error: the transformation technically succeeds, but writes data to a path that is not mounted back to the host.&lt;/p&gt;

&lt;p&gt;If someone knows exactly what they are doing and wants to bypass this, &lt;code&gt;--force&lt;/code&gt; disables that check.&lt;/p&gt;

&lt;p&gt;This is one of the recurring themes in the CLI design: &lt;strong&gt;safe default, escape hatch available&lt;/strong&gt;.&lt;/p&gt;
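&lt;p&gt;A minimal sketch of that check (illustrative; the real validation in &lt;code&gt;transform.py&lt;/code&gt; may differ in detail):&lt;/p&gt;

```python
def validate_command_paths(command, force=False):
    # Safe default: the container command should mention both mount
    # points, otherwise output likely lands outside the host mount.
    if force:
        return True  # escape hatch: the user says they know better
    missing = [p for p in ("/input", "/output") if p not in command]
    if missing:
        raise ValueError(
            "command does not reference "
            + " or ".join(missing)
            + " (pass --force to bypass this check)"
        )
    return True
```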




&lt;h2&gt;
  
  
  Auto-Versioning: The Real Core
&lt;/h2&gt;

&lt;p&gt;After Docker runs, the command decides what to do with output history.&lt;/p&gt;

&lt;p&gt;The policy is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If &lt;code&gt;--no-track&lt;/code&gt; is set: do not version output.&lt;/li&gt;
&lt;li&gt;Else if input path matches a tracked dataset: version output into that dataset.&lt;/li&gt;
&lt;li&gt;Else if input is untracked and &lt;code&gt;--auto-track&lt;/code&gt; is set:

&lt;ul&gt;
&lt;li&gt;add input as a new dataset&lt;/li&gt;
&lt;li&gt;then version output into that new dataset&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Else: run transform only, no versioning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This gives predictable behavior for both cautious users and exploratory users.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision table
&lt;/h3&gt;

&lt;p&gt;Once I had the rules written down, the behavior became much easier to reason about. In practice, the command reduces to this table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Input already tracked?&lt;/th&gt;
&lt;th&gt;&lt;code&gt;--auto-track&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;&lt;code&gt;--no-track&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Run transform and version output into the existing dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Same result as above — input is already tracked, so &lt;code&gt;--auto-track&lt;/code&gt; has nothing extra to do&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Run transform only, do not version output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Run transform only, do not version output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Add input as a new dataset, then version output into that dataset&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Run transform only, do not version output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Invalid combination, exit with usage error&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why path-based dataset lookup?
&lt;/h3&gt;

&lt;p&gt;I use original dataset paths to infer identity when &lt;code&gt;--dataset-id&lt;/code&gt; is not provided. It keeps the command convenient for everyday usage, while still allowing explicit control when needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Version increment behavior
&lt;/h3&gt;

&lt;p&gt;For transform-generated versions, I intentionally use fractional increments (&lt;code&gt;+0.1&lt;/code&gt; by default) rather than forcing integer bumps.&lt;/p&gt;

&lt;p&gt;Reasoning: many transforms are intermediate processing steps, not "major new source snapshot" events. Keeping the increments smaller prevents version numbers from ballooning unnecessarily and makes the history easier to scan.&lt;/p&gt;
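&lt;p&gt;A sketch of what a fractional bump looks like, assuming float version numbers (the real numbering scheme may differ):&lt;/p&gt;

```python
def next_transform_version(current, step=0.1):
    # Round to one decimal so 1.1 + 0.1 yields 1.2,
    # not 1.2000000000000002 from float arithmetic.
    return round(current + step, 1)
```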




&lt;h2&gt;
  
  
  Conflict Rules and Flags
&lt;/h2&gt;

&lt;p&gt;Two flags are mutually exclusive by design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--auto-track&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--no-track&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If both are provided, the command exits with a usage error.&lt;/p&gt;

&lt;p&gt;This might look obvious, but explicit validation here matters because these conflicting options can otherwise produce silent, confusing behavior.&lt;/p&gt;

&lt;p&gt;I would rather fail fast than invent precedence rules users have to memorize.&lt;/p&gt;




&lt;h2&gt;
  
  
  Presets: Turning Repetition into Reuse
&lt;/h2&gt;

&lt;p&gt;The obviously annoying part of &lt;code&gt;transform&lt;/code&gt; is the length of the command and the repetition: the image, command, and tracking flags are very often the same.&lt;/p&gt;

&lt;p&gt;That led to transform presets in &lt;code&gt;.data_tracker/presets_config.json&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A preset stores things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;image&lt;/li&gt;
&lt;li&gt;command&lt;/li&gt;
&lt;li&gt;auto-track/no-track&lt;/li&gt;
&lt;li&gt;force&lt;/li&gt;
&lt;li&gt;message&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you can run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dt transform &lt;span class="nt"&gt;--input-data&lt;/span&gt; ./raw &lt;span class="nt"&gt;--output-data&lt;/span&gt; ./processed &lt;span class="nt"&gt;--preset&lt;/span&gt; clean-sales
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and optionally override any preset field from the CLI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Override hierarchy
&lt;/h3&gt;

&lt;p&gt;The rule is simple and explicit:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI value &amp;gt; preset value &amp;gt; default value&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That gives reusable defaults while keeping one-off runs easy.&lt;/p&gt;
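&lt;p&gt;A sketch of that resolution (key names are illustrative):&lt;/p&gt;

```python
DEFAULTS = {"auto_track": False, "force": False, "message": ""}

def resolve_options(cli_opts, preset_opts):
    # CLI value > preset value > default value.
    # Only options the user actually passed (non-None) override the preset.
    resolved = dict(DEFAULTS)
    resolved.update(preset_opts)
    resolved.update({k: v for k, v in cli_opts.items() if v is not None})
    return resolved
```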

&lt;h3&gt;
  
  
  Preset management commands
&lt;/h3&gt;

&lt;p&gt;I added a small CRUD interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# list presets&lt;/span&gt;
dt preset &lt;span class="nb"&gt;ls
&lt;/span&gt;dt preset &lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;--detailed&lt;/span&gt;
&lt;span class="c"&gt;# add/remove presets&lt;/span&gt;
dt preset add &amp;lt;name&amp;gt; &lt;span class="nt"&gt;--image&lt;/span&gt; ... &lt;span class="nt"&gt;--command&lt;/span&gt; ... &lt;span class="o"&gt;[&lt;/span&gt;flags]
dt preset remove &amp;lt;name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The detailed listing is intentionally human-readable, so you can quickly review what a preset actually does before using it.&lt;/p&gt;
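&lt;p&gt;For reference, a plausible shape for a preset entry and its loader. The exact JSON keys are my assumption here, not the documented format:&lt;/p&gt;

```python
import json

# Hypothetical contents of .data_tracker/presets_config.json
example_presets = {
    "clean-sales": {
        "image": "python:3.11-slim",
        "command": "python /input/clean.py --output /output/clean.csv",
        "auto_track": True,
        "message": "clean sales data",
    }
}

def load_preset(path, name):
    with open(path) as f:
        presets = json.load(f)
    if name not in presets:
        raise KeyError("preset not found: " + name)
    return presets[name]
```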




&lt;h2&gt;
  
  
  Failure Cases I Handled Explicitly
&lt;/h2&gt;

&lt;p&gt;The most important ones:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Docker not installed&lt;/li&gt;
&lt;li&gt;tracker not initialized&lt;/li&gt;
&lt;li&gt;input path does not exist&lt;/li&gt;
&lt;li&gt;transform command succeeds but output directory is empty&lt;/li&gt;
&lt;li&gt;preset missing or malformed&lt;/li&gt;
&lt;li&gt;tracking/version update fails after successful transform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also added rollback behavior for one edge case: if &lt;code&gt;--auto-track&lt;/code&gt; adds a dataset but the transform fails immediately after, the command removes the auto-added dataset to avoid leaving junk history.&lt;/p&gt;

&lt;p&gt;This is not truly transactional, but it keeps state cleaner than a naive implementation.&lt;/p&gt;
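&lt;p&gt;The rollback amounts to a try/except around the container run. A sketch, with callbacks standing in for the real tracker calls:&lt;/p&gt;

```python
def transform_with_rollback(run_container, add_dataset, remove_dataset, auto_track):
    # If --auto-track just created the dataset and the transform fails,
    # undo the add so history is not left with an empty entry.
    dataset_id = add_dataset() if auto_track else None
    try:
        run_container()
    except Exception:
        if dataset_id is not None:
            remove_dataset(dataset_id)  # best-effort cleanup, not a transaction
        raise
    return dataset_id
```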




&lt;h2&gt;
  
  
  Example Workflow
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# initialize&lt;/span&gt;
dt init

&lt;span class="c"&gt;# track raw input&lt;/span&gt;
dt add ./data/raw.csv &lt;span class="nt"&gt;--title&lt;/span&gt; sales &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"raw export"&lt;/span&gt;

&lt;span class="c"&gt;# run transform and auto-version output&lt;/span&gt;
dt transform &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-data&lt;/span&gt; ./data/raw.csv &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-data&lt;/span&gt; ./data/cleaned &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--image&lt;/span&gt; python:3.11-slim &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--command&lt;/span&gt; &lt;span class="s2"&gt;"python /input/clean.py --output /output/clean.csv"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--message&lt;/span&gt; &lt;span class="s2"&gt;"normalize + remove nulls"&lt;/span&gt;

&lt;span class="c"&gt;# inspect what changed&lt;/span&gt;
dt &lt;span class="nb"&gt;history&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; sales
dt compare &lt;span class="nt"&gt;--name&lt;/span&gt; sales &lt;span class="c"&gt;# auto-compares latest two versions by default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What Is Still Imperfect
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;transform&lt;/code&gt; works well for my scope, but a few things are intentionally out of scope for now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;no remote execution (local Docker only)&lt;/li&gt;
&lt;li&gt;no pipeline DAG orchestration (single command execution)&lt;/li&gt;
&lt;li&gt;no built-in preset edit command (remove + add is currently enough)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I prefer this to overbuilding features before there are real users.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned from Building It
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Command design is mostly policy design.&lt;/strong&gt; The hard part is not running Docker; it is defining clear, deterministic rules for when to version and how.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety checks are worth a few extra lines.&lt;/strong&gt; Validation around mount paths and conflicting flags prevented multiple confusing runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defaults should be opinionated, not rigid.&lt;/strong&gt; Good defaults (&lt;code&gt;/input&lt;/code&gt;, &lt;code&gt;/output&lt;/code&gt;, auto behavior) plus escape hatches (&lt;code&gt;--force&lt;/code&gt;, explicit &lt;code&gt;--dataset-id&lt;/code&gt;, custom &lt;code&gt;--version&lt;/code&gt;) make the tool usable for both normal and advanced scenarios.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Repo
&lt;/h2&gt;

&lt;p&gt;Source code: &lt;a href="https://github.com/martin-iflap/DataTracker" rel="noopener noreferrer"&gt;github.com/martin-iflap/DataTracker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project is open source under the MIT License. Contributions and feedback are welcome.&lt;/p&gt;

&lt;p&gt;If you have ideas for improving &lt;code&gt;transform&lt;/code&gt; (or the overall CLI design), feel free to open an issue or submit a pull request. The main goal of this project for me is to learn more about CLI design and data versioning, so I am very open to suggestions.&lt;/p&gt;

</description>
      <category>git</category>
      <category>docker</category>
      <category>python</category>
      <category>cli</category>
    </item>
    <item>
      <title>I Built a Dataset Version Control Tool — and Accidentally Reimplemented Git's Core</title>
      <dc:creator>Martin</dc:creator>
      <pubDate>Wed, 04 Mar 2026 19:40:32 +0000</pubDate>
      <link>https://forem.com/martiniflap/i-built-a-dataset-version-control-tool-and-accidentally-reimplemented-gits-core-3ig5</link>
      <guid>https://forem.com/martiniflap/i-built-a-dataset-version-control-tool-and-accidentally-reimplemented-gits-core-3ig5</guid>
      <description>&lt;p&gt;A few months ago I started a side project mostly to get hands-on experience with three things I hadn't used seriously before: Docker, SQLite, and Python CLI tools. The plan was to build something small for educational purposes. But the project turned into &lt;a href="https://github.com/martin-iflap/DataTracker" rel="noopener noreferrer"&gt;DataTracker&lt;/a&gt; — a local version control system for data files.&lt;/p&gt;

&lt;p&gt;This article is about the architecture, specifically the part that surprised me most: once I started designing how to actually store versioned files, I kept arriving at the same solutions git already uses. Not because I copied them, but because they're probably the best answers to the problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;The use case is simple. You have a CSV, a set of images, or any data file really. You run some processing, the file changes. Later you want to know what it looked like before, compare the two versions, etc. You want this without manually copying files into &lt;code&gt;data_v1/&lt;/code&gt;, &lt;code&gt;data_v2/&lt;/code&gt;, &lt;code&gt;data_final/&lt;/code&gt;, &lt;code&gt;data_final_REAL/&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Git solves this for source code. It does not solve it well for binary files or large datasets, and it was never designed to. So the question becomes: what does a minimal version of git look like if you build it specifically for data files?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Idea: Content-Addressed Storage
&lt;/h2&gt;

&lt;p&gt;The most important design decision, and the one that everything else follows from, is how you store files.&lt;/p&gt;

&lt;p&gt;The naive approach is to copy the file into some storage directory and name it after the dataset and version: &lt;code&gt;sales-data-v1.csv&lt;/code&gt;, &lt;code&gt;sales-data-v2.csv&lt;/code&gt;, and so on. This works until two versions of a dataset happen to contain identical data — then you've stored the same bytes twice for no reason.&lt;/p&gt;

&lt;p&gt;Git's answer to this is content-addressed storage: &lt;strong&gt;don't name files after what they are, name them after what they contain&lt;/strong&gt;. Specifically, hash the file contents and use the hash as the filename.&lt;/p&gt;

&lt;p&gt;DataTracker does exactly this. When you run &lt;code&gt;dt add ./sales.csv&lt;/code&gt;, this happens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file_utils.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sha256_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;sha256_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sha256_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;copy_file_to_objects&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracker_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;save_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracker_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;objects&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;shutil&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The file ends up stored as &lt;code&gt;.data_tracker/objects/a3f8c2...&lt;/code&gt; — a 64-character hex string. The original filename is recorded separately in the database. The storage layer has no concept of names at all.&lt;/p&gt;

&lt;p&gt;Git calls these "blob objects". The structure is identical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git:          .git/objects/a3/f8c2...   ← raw file contents
DataTracker:  .data_tracker/objects/a3f8c2...  ← raw file contents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The immediate practical benefit is &lt;strong&gt;automatic deduplication&lt;/strong&gt;. If you add two datasets with identical contents, or update a dataset without actually changing the file, the hash is the same, the object file already exists, and no second copy is written. This is handled by a single database check before the copy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# db_manager.py — INSERT OR IGNORE means a collision is silently a no-op
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INSERT OR IGNORE INTO objects (hash, size) VALUES (?, ?)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No other deduplication logic is needed.&lt;/p&gt;

&lt;p&gt;One important thing to note: duplicates are allowed in DataTracker. You can add new versions with identical contents if you want; DataTracker will only warn you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Database as the Index
&lt;/h2&gt;

&lt;p&gt;Storing files by hash solves the storage problem, but it creates a new one: how do you know which hash belongs to which dataset version? Git uses a combination of tree objects and refs (branches, tags) stored as small files in &lt;code&gt;.git/&lt;/code&gt;. I used SQLite, which gives you the same thing with foreign keys and transactions.&lt;/p&gt;

&lt;p&gt;The database schema has four tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;datasets      — one row per tracked dataset (id, name, message, created_at)
    │
    └── versions  — one row per version (dataset_id → datasets, object_hash, version number, original_path)
            │
            └── files     — one row per file in a version (version_id → versions, object_hash → objects, relative_path)
                    │
objects       — one row per unique file stored (hash, size)  ←──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;files&lt;/code&gt; table is the key one. It's what lets DataTracker reconstruct a full directory version: given a &lt;code&gt;version_id&lt;/code&gt;, you get back every file's hash and its relative path within the original directory. That's enough to recreate the original structure anywhere.&lt;/p&gt;
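&lt;p&gt;Restoring a directory version is then just a walk over those rows. A sketch (the real export logic lives in the repo):&lt;/p&gt;

```python
import os
import shutil

def restore_version(file_rows, objects_dir, dest):
    # file_rows: (object_hash, relative_path) pairs from the `files` table
    for object_hash, rel_path in file_rows:
        target = os.path.join(dest, rel_path)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copy2(os.path.join(objects_dir, object_hash), target)
```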

&lt;p&gt;&lt;code&gt;datasets&lt;/code&gt; and &lt;code&gt;versions&lt;/code&gt; are roughly equivalent to git's refs and commits. &lt;code&gt;objects&lt;/code&gt; is the object store manifest — it tracks sizes and prevents orphaned files, but the actual content lives in the filesystem.&lt;/p&gt;

&lt;p&gt;Why SQLite over a JSON file? Three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Atomic transactions.&lt;/strong&gt; When adding a dataset with 50 files, either all 50 succeed or none do. A JSON file gives you no such guarantee.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Foreign keys.&lt;/strong&gt; They enforce referential integrity — you can't have a version row pointing to a non-existent dataset. SQLite has them, but notably they're off by default. You have to enable them per connection: &lt;code&gt;conn.execute("PRAGMA foreign_keys = ON")&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Queries.&lt;/strong&gt; Finding the latest version of a dataset, checking for hash collisions, listing all datasets — these are one-liners in SQL and would be loops with a JSON file.&lt;/li&gt;
&lt;/ol&gt;
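&lt;p&gt;The whole schema fits in one &lt;code&gt;executescript&lt;/code&gt; call. A sketch with column names approximated from the diagram above (the real schema lives in &lt;code&gt;db_manager.py&lt;/code&gt;):&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default, must be set per connection
conn.executescript("""
CREATE TABLE datasets (id INTEGER PRIMARY KEY, name TEXT, message TEXT, created_at TEXT);
CREATE TABLE objects  (hash TEXT PRIMARY KEY, size INTEGER);
CREATE TABLE versions (id INTEGER PRIMARY KEY,
                       dataset_id INTEGER NOT NULL REFERENCES datasets(id),
                       version REAL, original_path TEXT);
CREATE TABLE files    (id INTEGER PRIMARY KEY,
                       version_id INTEGER NOT NULL REFERENCES versions(id),
                       object_hash TEXT NOT NULL REFERENCES objects(hash),
                       relative_path TEXT);
""")
```

&lt;p&gt;With the pragma on, a version row pointing at a non-existent dataset is rejected at insert time instead of silently corrupting history.&lt;/p&gt;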




&lt;h2&gt;
  
  
  Directory Versioning: Where It Diverges from Git
&lt;/h2&gt;

&lt;p&gt;Git's unit of storage is always a single file (a blob). Directory structure is captured separately as tree objects that reference blobs. DataTracker takes a slightly different approach because the use case is different.&lt;/p&gt;

&lt;p&gt;When you add a directory, DataTracker stores each file individually in &lt;code&gt;objects/&lt;/code&gt; (same as git), but it also computes a &lt;strong&gt;single primary hash for the whole directory&lt;/strong&gt;. This primary hash is used for one specific purpose: duplicate detection at the version level, so the tool can warn when a new version is identical to an existing one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# file_utils.py
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;hash_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;sha256_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sha256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dirs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;dirs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;root&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;rel_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;relpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dir_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;sha256_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rel_path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;# include structure
&lt;/span&gt;            &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;iter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;sha256_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sha256_hash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The directory hash covers both file contents and relative paths — so renaming a file in the directory produces a different hash even if the contents are unchanged.&lt;/p&gt;
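&lt;p&gt;You can check that property directly. Below is a condensed copy of the function above (it reads each file whole instead of in 8&amp;nbsp;KB chunks, so it's for illustration only, not large files):&lt;/p&gt;

```python
# Condensed directory hash: a rename alone changes the digest,
# even when the file bytes are identical.
import hashlib
import os
import tempfile
from pathlib import Path

def hash_directory(dir_path):
    h = hashlib.sha256()
    for root, dirs, files in os.walk(dir_path):
        dirs.sort()   # make traversal order deterministic
        files.sort()
        for filename in files:
            filepath = os.path.join(root, filename)
            rel_path = os.path.relpath(filepath, dir_path)
            h.update(rel_path.encode('utf-8'))  # the path is part of the identity
            h.update(Path(filepath).read_bytes())
    return h.hexdigest()

with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
    Path(a, 'train.csv').write_bytes(b'1,2,3\n')
    Path(b, 'renamed.csv').write_bytes(b'1,2,3\n')   # same bytes, new name
    assert hash_directory(a) != hash_directory(b)
```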

&lt;p&gt;So the two-hash approach is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Directory hash&lt;/strong&gt; → stored in &lt;code&gt;versions.object_hash&lt;/code&gt;, used to warn about duplicate versions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Individual file hashes&lt;/strong&gt; → stored in &lt;code&gt;objects&lt;/code&gt; and &lt;code&gt;files&lt;/code&gt;, used for actual storage and retrieval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Git actually solves the same problem with tree objects: a tree's hash covers the names and hashes of everything under it, so git also has a directory-level identity. DataTracker just keeps the two kinds of hash explicit, because its primary user-facing unit is a dataset version, not a file.&lt;/p&gt;
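&lt;p&gt;The dedup half of that table is a single SQL statement. A minimal sketch, assuming an illustrative &lt;code&gt;objects&lt;/code&gt; schema rather than DataTracker's exact one:&lt;/p&gt;

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (hash TEXT PRIMARY KEY, size INTEGER)")

# Storing the same content twice is a no-op: the hash IS the row's identity.
for _ in range(2):
    conn.execute("INSERT OR IGNORE INTO objects VALUES (?, ?)", ("ab12cd", 42))

count = conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
assert count == 1  # the duplicate insert was silently ignored
```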




&lt;h2&gt;
  
  
  A Concrete Walk-Through
&lt;/h2&gt;

&lt;p&gt;Here's what happens end to end when you run &lt;code&gt;dt add ./data/ --title "experiment-1"&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Walk the directory, collect all file paths
2. For each file:
   a. SHA-256 hash the contents
   b. Copy to .data_tracker/objects/&amp;lt;hash&amp;gt;  (INSERT OR IGNORE — dedup is free)
   c. Record (version_id, hash, relative_path) in the files table
3. SHA-256 hash the entire directory → primary_hash
4. Check if primary_hash already exists in versions → warn if so
5. Insert a row into datasets (or reuse existing for dt update)
6. Insert a row into versions (dataset_id, primary_hash, version number, original_path)
7. conn.commit() — all of the above is one transaction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 7 is important. If anything fails between steps 1 and 6, the commit never happens and the database sees none of it. The object files that were already copied to disk are then cleaned up explicitly. This is the same reason git's staging area exists — operations on the object store and operations on the index need to be kept consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Missing Compared to git
&lt;/h2&gt;

&lt;p&gt;Pointing out the gaps is more useful than pretending the tool is complete.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No delta storage.&lt;/strong&gt; Git also stores full snapshots, but it delta-compresses objects inside packfiles, which is why a repository with hundreds of commits doesn't grow linearly with the size of its history. DataTracker deduplicates identical files, but any file that changes at all is stored again in full. For small datasets this is fine. For large, frequently edited ones it becomes expensive quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No garbage collection.&lt;/strong&gt; Git has &lt;code&gt;git gc&lt;/code&gt;. DataTracker cleans up orphaned objects when you remove a dataset or version, but there's no general-purpose GC pass. If something goes wrong mid-operation and leaves orphaned objects, they stay there until you notice the &lt;code&gt;dt storage&lt;/code&gt; numbers look wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Linear history only.&lt;/strong&gt; Git has branching. DataTracker has a single version number per dataset, incrementing linearly. There's no concept of parallel versions or merging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No remote.&lt;/strong&gt; Everything is local. There's no push/pull, no sharing between machines.&lt;/p&gt;

&lt;p&gt;Some of these are limitations by design — the tool is meant to stay simple. Others, delta storage in particular, I'd like to add eventually, but they bring enough new logic and complexity that I deliberately left them out of the first version.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Lesson I Didn't Expect
&lt;/h2&gt;

&lt;p&gt;I started this project thinking Docker and SQL would be the only interesting parts. The Docker integration (the &lt;code&gt;dt transform&lt;/code&gt; command, which runs a transformation inside a container and auto-versions the output) did turn out to be very interesting. But the part that surprised me the most was how much of git's core architecture I ended up reimplementing without even trying — and thinking about it turned out to be a great way to understand git itself better.&lt;/p&gt;

&lt;p&gt;Content-addressed storage is a 30-year-old idea. It shows up in git, in IPFS, in container image layers, in package managers. The reason it keeps appearing is that it solves a hard problem — identity and deduplication — with almost no code. The hash &lt;em&gt;is&lt;/em&gt; the identity. Two files that are identical are automatically the same object. You don't have to write that logic; the data model expresses it.&lt;/p&gt;
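&lt;p&gt;The whole idea fits in three lines:&lt;/p&gt;

```python
# Content addressing in miniature: byte-identical content maps to the
# same object key, with no dedup logic written anywhere.
import hashlib

a = hashlib.sha256(b"col1,col2\n1,2\n").hexdigest()
b = hashlib.sha256(b"col1,col2\n1,2\n").hexdigest()
c = hashlib.sha256(b"col1,col2\n1,3\n").hexdigest()
assert a == b  # identical content, same object
assert a != c  # one byte changed, entirely new object
```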

&lt;p&gt;A lot of developers use git every day without thinking much about how it actually works under the hood. I think building something similar, even something much simpler, is one of the better ways to change that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install&lt;/span&gt;
git clone https://github.com/martin-iflap/DataTracker.git
&lt;span class="nb"&gt;cd &lt;/span&gt;DataTracker
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Track a dataset&lt;/span&gt;
dt init
dt add ./data.csv &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s2"&gt;"sales"&lt;/span&gt; &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Raw export"&lt;/span&gt;

&lt;span class="c"&gt;# Update it&lt;/span&gt;
dt update ./data_cleaned.csv &lt;span class="nt"&gt;--name&lt;/span&gt; sales &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Removed nulls"&lt;/span&gt;

&lt;span class="c"&gt;# See what changed&lt;/span&gt;
dt compare 1.0 2.0 &lt;span class="nt"&gt;--name&lt;/span&gt; sales

&lt;span class="c"&gt;# Go back to v1&lt;/span&gt;
dt &lt;span class="nb"&gt;export&lt;/span&gt; ./recovered &lt;span class="nt"&gt;--name&lt;/span&gt; sales &lt;span class="nt"&gt;-v&lt;/span&gt; 1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output from &lt;code&gt;dt compare&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Comparison between version 1.0 and version 2.0:
...
Modified files:
  ~ data.csv | Size: 48.20 KB → 45.10 KB = -3.10 KB
    Similarity: 94.30%
    Lines added: 0, Lines removed: 47
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Repo
&lt;/h2&gt;

&lt;p&gt;The full source is at &lt;a href="https://github.com/martin-iflap/DataTracker" rel="noopener noreferrer"&gt;github.com/martin-iflap/DataTracker&lt;/a&gt;. The project is still actively developed — the transform/presets system is not finished, there's no GC command, and no status command yet. But I plan to add them in the coming weeks.&lt;/p&gt;

&lt;p&gt;If you've built something similar, run into one of the same problems, or have a strong opinion about delta storage — I'd genuinely like to hear about it in the comments.&lt;/p&gt;

</description>
      <category>git</category>
      <category>python</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
