<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vadym Stupakov</title>
    <description>The latest articles on Forem by Vadym Stupakov (@redeyed).</description>
    <link>https://forem.com/redeyed</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F841357%2F9203c46e-f46c-4026-837d-2d8fb8c455b7.jpeg</url>
      <title>Forem: Vadym Stupakov</title>
      <link>https://forem.com/redeyed</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/redeyed"/>
    <language>en</language>
    <item>
      <title>git-sfs: Large File Storage Without the LFS Server</title>
      <dc:creator>Vadym Stupakov</dc:creator>
      <pubDate>Thu, 07 May 2026 20:42:25 +0000</pubDate>
      <link>https://forem.com/redeyed/git-sfs-large-file-storage-without-the-lfs-server-5cco</link>
      <guid>https://forem.com/redeyed/git-sfs-large-file-storage-without-the-lfs-server-5cco</guid>
      <description>&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;You added a 2 GB dataset to a repo. Now &lt;code&gt;git clone&lt;/code&gt; takes 10 minutes, CI downloads the full history on every run, and GitHub is billing you per GB transferred. You switch to Git LFS - and now you need a server, a token, and a storage plan. You try DVC - and now you need Python, a pipeline config, and lock files that conflict on every PR.&lt;/p&gt;

&lt;p&gt;None of this is the actual problem. The actual problem is: large files don't belong in Git objects. Everything else is overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Meet git-sfs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SFS&lt;/strong&gt; stands for &lt;strong&gt;Symbolic File Storage&lt;/strong&gt;. The name is deliberate - it's Git LFS with the &lt;code&gt;L&lt;/code&gt; swapped for &lt;code&gt;S&lt;/code&gt;. Git LFS replaces large files with opaque pointer files and routes bytes through a proprietary server protocol. &lt;code&gt;git-sfs&lt;/code&gt; replaces large files with plain symlinks that Git already understands natively, and routes bytes through &lt;code&gt;rclone&lt;/code&gt; to any remote you already have.&lt;/p&gt;

&lt;p&gt;No new protocol. No server. No pointer file format to decode. Just symlinks.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;The model is three sentences:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Git tracks symlinks.
git-sfs stores file bytes.
rclone moves files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;git-sfs add data/train-000.tar.zst&lt;/code&gt;, here's what happens:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The file is hashed with SHA-256&lt;/li&gt;
&lt;li&gt;Bytes are stored in &lt;code&gt;&amp;lt;cache&amp;gt;/files/sha256/ab/&amp;lt;hash&amp;gt;&lt;/code&gt; - write-once, read-only&lt;/li&gt;
&lt;li&gt;The original path becomes a relative Git symlink pointing into the cache
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data/train-000.tar.zst -&amp;gt; ../.git-sfs/cache/files/sha256/ab/&amp;lt;hash&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;data/train-000.tar.zst&lt;/code&gt; opens normally. Git commits a 70-byte symlink. Your repo stays fast.&lt;/p&gt;
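
&lt;p&gt;The three steps above can be sketched with plain shell commands - an illustration of the model, not the tool's actual implementation. The two-character shard directory being the first two hex characters of the hash is an assumption read off the example path:&lt;/p&gt;

```shell
# Illustrative sketch of the three add steps (not git-sfs itself; the real
# tool also handles locking, verification, and Git integration).
mkdir -p data
printf 'big dataset bytes' > data/train-000.tar.zst

# Step 1: hash the file with SHA-256
hash=$(sha256sum data/train-000.tar.zst | awk '{print $1}')

# Step 2: store the bytes in the cache, sharded by the first two hash
# characters (assumed from the example path), then write-protect them
shard=$(printf '%s' "$hash" | cut -c1-2)
mkdir -p ".git-sfs/cache/files/sha256/$shard"
mv data/train-000.tar.zst ".git-sfs/cache/files/sha256/$shard/$hash"
chmod a-w ".git-sfs/cache/files/sha256/$shard/$hash"

# Step 3: replace the original path with a relative symlink into the cache
ln -s "../.git-sfs/cache/files/sha256/$shard/$hash" data/train-000.tar.zst

cat data/train-000.tar.zst   # opens normally through the symlink
```

&lt;p&gt;Because the cache path is derived only from the content hash, adding the same file twice stores its bytes exactly once.&lt;/p&gt;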

&lt;h2&gt;
  
  
  Why Not Git LFS?
&lt;/h2&gt;

&lt;p&gt;Git LFS solves the storage problem but adds a &lt;em&gt;server&lt;/em&gt; problem. You need an LFS endpoint, you pay per-GB transfer fees on GitHub, and the pointer-file format is an opaque internal detail.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git-sfs&lt;/code&gt; remotes are plain &lt;code&gt;rclone&lt;/code&gt; destinations - S3, GCS, Azure Blob, Backblaze B2, SFTP, a local path, anything &lt;code&gt;rclone&lt;/code&gt; supports. You can &lt;code&gt;rclone ls&lt;/code&gt; your remote and see exactly what's there. No magic.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Binary, No Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;git-sfs&lt;/code&gt; is written in Go. It ships as a single static binary - no Python environment, no runtime, no version conflicts. Drop it on any machine and it runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install (macOS/Linux, arm64 and x86_64)&lt;/span&gt;
curl &lt;span class="nt"&gt;-LsSf&lt;/span&gt; https://github.com/Red-Eyed/git-sfs/releases/latest/download/install.sh | sh

&lt;span class="c"&gt;# Init and configure&lt;/span&gt;
git-sfs init        &lt;span class="c"&gt;# creates .git-sfs/config.toml&lt;/span&gt;
&lt;span class="c"&gt;# edit config.toml: set remote backend, path, rclone config&lt;/span&gt;
git-sfs setup       &lt;span class="c"&gt;# bind local cache&lt;/span&gt;

&lt;span class="c"&gt;# Add files&lt;/span&gt;
git-sfs add data/
git add .git-sfs/config.toml data/
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"track datasets"&lt;/span&gt;

&lt;span class="c"&gt;# Sync to remote&lt;/span&gt;
git-sfs push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
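
&lt;p&gt;The &lt;code&gt;edit config.toml&lt;/code&gt; step is where the remote gets wired up. A hypothetical sketch - the key names under &lt;code&gt;[remote]&lt;/code&gt; are illustrative assumptions, not the tool's documented schema (&lt;code&gt;git-sfs init&lt;/code&gt; writes the real template); only the &lt;code&gt;[settings]&lt;/code&gt; table shown later in this post is confirmed:&lt;/p&gt;

```toml
# .git-sfs/config.toml - hypothetical example, not the documented schema
[remote]
backend = "s3-mydata"          # name of an existing rclone remote (assumed key)
path = "datasets/my-project"   # object prefix inside that remote (assumed key)

[settings]
n_jobs = 8                     # parallel workers (confirmed later in this post)
```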



&lt;p&gt;On another machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;repo&amp;gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; &amp;lt;repo&amp;gt;
git-sfs setup
git-sfs pull        &lt;span class="c"&gt;# download only what you need&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also pull a subset - useful when one machine only needs validation data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git-sfs pull data/validation/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Safety by Design
&lt;/h2&gt;

&lt;p&gt;Data loss in a dataset management tool is unacceptable, so &lt;code&gt;git-sfs&lt;/code&gt; has a few hard rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hash-verify at every boundary&lt;/strong&gt; - after hashing, after download, after copy. A corrupted file is rejected, not silently accepted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic writes&lt;/strong&gt; - temp file + rename everywhere. An interrupted push or pull never leaves a partial file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache files are immutable&lt;/strong&gt; - write-once, then write-protected. Accidental overwrites are impossible.&lt;/li&gt;
&lt;/ul&gt;
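
&lt;p&gt;The first two rules combine into one pattern: write to a temp file, verify the hash, and only then rename into place. A sketch of the idea in plain shell - illustrative, not the git-sfs source:&lt;/p&gt;

```shell
# Hash-verify + atomic-write sketch: a corrupted transfer is rejected,
# and an interrupted one never leaves a partial file at the final path.
expected=$(printf 'payload' | sha256sum | awk '{print $1}')

tmp=$(mktemp download.XXXXXX)
printf 'payload' > "$tmp"          # stand-in for the actual download

actual=$(sha256sum "$tmp" | awk '{print $1}')
if [ "$actual" = "$expected" ]; then
    mv "$tmp" payload.bin          # rename is atomic on POSIX filesystems
else
    rm -f "$tmp"                   # bad bytes: reject, never install
fi
```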

&lt;p&gt;The &lt;code&gt;verify&lt;/code&gt; command is designed for CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git-sfs verify              &lt;span class="c"&gt;# presence check (fast)&lt;/span&gt;
git-sfs verify &lt;span class="nt"&gt;--with-integrity&lt;/span&gt; data/  &lt;span class="c"&gt;# rehash cached + remote files (thorough)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And &lt;code&gt;doctor&lt;/code&gt; checks your entire setup - config, cache, rclone binary, remote connectivity - in one shot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git-sfs doctor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's Different from DVC, git-annex, etc.?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Requires&lt;/th&gt;
&lt;th&gt;Remote&lt;/th&gt;
&lt;th&gt;PR-friendly?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Git LFS&lt;/td&gt;
&lt;td&gt;LFS server&lt;/td&gt;
&lt;td&gt;proprietary protocol&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DVC&lt;/td&gt;
&lt;td&gt;Python, pipelines&lt;/td&gt;
&lt;td&gt;S3/GCS/etc via SDK&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;git-annex&lt;/td&gt;
&lt;td&gt;Haskell runtime&lt;/td&gt;
&lt;td&gt;many backends&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;git-sfs&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;rclone&lt;/td&gt;
&lt;td&gt;anything rclone supports&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;DVC&lt;/strong&gt; stores large-file metadata in &lt;code&gt;.dvc&lt;/code&gt; sidecar files and a &lt;code&gt;dvc.lock&lt;/code&gt; file that encodes pipeline state. When two branches touch the same dataset, merging those files creates conflicts with no meaningful resolution in a pull request - reviewers see &lt;code&gt;.dvc&lt;/code&gt; diffs, not data diffs, and the merge problem is fundamentally DVC's, not Git's.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;git-annex&lt;/strong&gt; stores its metadata in a separate orphan branch (&lt;code&gt;git-annex&lt;/code&gt;). That branch never appears in a normal &lt;code&gt;git log&lt;/code&gt; or PR diff, so reviewers have no visibility into what large files changed or why. Merging the annex branch is a separate out-of-band operation that GitHub PRs don't surface at all.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git-sfs&lt;/code&gt; tracks only plain relative symlinks. A PR diff shows exactly which files were added, removed, or renamed - the same way any other file change looks in Git. Reviewers can approve or reject dataset changes with full context, and there are no sidecar files or hidden branches to reconcile.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;git-sfs&lt;/code&gt; has no pipelines, no Python, no manifests, no committed lock files. The Git tree is the file list. The cache is a plain directory. The remote is whatever &lt;code&gt;rclone&lt;/code&gt; can reach.&lt;/p&gt;
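
&lt;p&gt;You can see exactly what Git stores for a tracked file with stock Git commands: a mode-120000 index entry whose blob holds nothing but the target path. A demo in a throwaway repo, no git-sfs required:&lt;/p&gt;

```shell
# Demonstrate how Git records a git-sfs-style symlink
git init -q demo
cd demo
mkdir -p data
ln -s ../.git-sfs/cache/files/sha256/ab/0123abcd data/train-000.tar.zst
git add data/train-000.tar.zst            # Git stages the symlink itself
git ls-files -s data/train-000.tar.zst    # mode 120000 marks a symlink entry
```

&lt;p&gt;The blob behind that entry contains only the target path, which is why the committed object stays tiny no matter how large the dataset is.&lt;/p&gt;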

&lt;h2&gt;
  
  
  Concurrency and Partial Pulls
&lt;/h2&gt;

&lt;p&gt;Large dataset workflows often mean hundreds, sometimes millions, of files. &lt;code&gt;git-sfs&lt;/code&gt; runs add, push, and pull with a configurable worker pool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# .git-sfs/config.toml&lt;/span&gt;
&lt;span class="nn"&gt;[settings]&lt;/span&gt;
&lt;span class="py"&gt;n_jobs&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
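
&lt;p&gt;For intuition, the same fan-out can be approximated with stock &lt;code&gt;xargs&lt;/code&gt; - a sketch of the concurrency model, not how git-sfs is implemented:&lt;/p&gt;

```shell
# Create some sample files, then hash them with up to 8 concurrent workers
mkdir -p data
for i in $(seq 1 20); do printf 'chunk %s' "$i" > "data/part-$i"; done

# -P 8 caps the worker pool at 8 processes; -n 1 gives each worker one file
find data -type f -print0 | xargs -0 -P 8 -n 1 sha256sum | wc -l
```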



&lt;p&gt;And partial pulls let teams share a repo where different machines only materialize what they actually use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git-sfs pull data/train/         &lt;span class="c"&gt;# only training split&lt;/span&gt;
git-sfs pull data/checkpoints/   &lt;span class="c"&gt;# only model weights&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Source
&lt;/h2&gt;

&lt;p&gt;The project is open source under MIT: &lt;a href="https://github.com/Red-Eyed/git-sfs" rel="noopener noreferrer"&gt;github.com/Red-Eyed/git-sfs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feedback and issues welcome. If you're managing large files in Git and the LFS server tax is getting old, give it a try.&lt;/p&gt;

</description>
      <category>git</category>
      <category>devops</category>
      <category>go</category>
      <category>lfs</category>
    </item>
  </channel>
</rss>
