<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Taskworld</title>
    <description>The latest articles on Forem by Taskworld (@taskworld).</description>
    <link>https://forem.com/taskworld</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2230%2Fa7c00db7-b9cc-414e-9196-8ec4ae0cd44a.png</url>
      <title>Forem: Taskworld</title>
      <link>https://forem.com/taskworld</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/taskworld"/>
    <language>en</language>
    <item>
      <title>Save the precious build minutes! Reusing build outputs with Git Tree Hash 🌳</title>
      <dc:creator>Thai Pangsakulyanont</dc:creator>
      <pubDate>Fri, 01 May 2020 08:16:03 +0000</pubDate>
      <link>https://forem.com/taskworld/save-the-precious-build-minutes-reusing-build-outputs-with-git-tree-hash-k61</link>
      <guid>https://forem.com/taskworld/save-the-precious-build-minutes-reusing-build-outputs-with-git-tree-hash-k61</guid>
      <description>&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; Do you spend a lot of time (and perhaps money) waiting for builds on the &lt;code&gt;master&lt;/code&gt; branch, even though its contents are identical to the commit before you clicked that &lt;em&gt;Merge Pull Request&lt;/em&gt; button? You may save time by caching your build output, using a Git &lt;em&gt;Tree Hash&lt;/em&gt; (not to be confused with &lt;em&gt;Commit Hash&lt;/em&gt;) as a cache key.&lt;/p&gt;

&lt;p&gt;This is the second installment of a series about build performance optimization.&lt;/p&gt;

&lt;p&gt;For this one, I discovered the technique while optimizing &lt;a href="https://taskworld.com/"&gt;Taskworld&lt;/a&gt;’s frontend build pipeline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Context: The Netlify Build Workflow
&lt;/h2&gt;

&lt;p&gt;At Taskworld, we’ve been using a build workflow that has now been popularized as “&lt;a href="https://www.netlify.com/products/build/"&gt;Netlify Build&lt;/a&gt;.” It is a Git workflow for developing, testing, and delivering &lt;a href="https://jamstack.org/"&gt;JAMStack&lt;/a&gt; sites to production &lt;em&gt;continuously&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While you work on feature branches, an automated system builds and deploys your code to a deployment preview environment, so people can review and test the app without having to do a &lt;code&gt;git checkout &amp;amp;&amp;amp; yarn &amp;amp;&amp;amp; yarn dev&lt;/code&gt; by themselves.&lt;/li&gt;
&lt;li&gt;When your change is merged to &lt;code&gt;master&lt;/code&gt;, an automated system builds and deploys your code to a production environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think this is very similar to &lt;a href="https://guides.github.com/introduction/flow/"&gt;the GitHub flow&lt;/a&gt;… but with the GitHub flow, they deploy stuff to production (in order to verify the changes on production) &lt;em&gt;before&lt;/em&gt; merging to master. Netlify’s &lt;em&gt;default&lt;/em&gt; build workflow, in contrast, deploys to production &lt;em&gt;after&lt;/em&gt; merging to master (it can be customized though).&lt;/p&gt;




&lt;h2&gt;
  
  
  😩 Why build twice?
&lt;/h2&gt;

&lt;p&gt;If your master branch is protected, and you also require feature branches to be up-to-date with master before they can be merged, you’ll find that &lt;strong&gt;the commit generated when merging to master will have the exact same contents as the commit just before the merge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xrozaYwk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4fvwut9elue1p2v93rbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xrozaYwk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/4fvwut9elue1p2v93rbm.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But still, &lt;strong&gt;most CI pipelines, by default, will build your code from scratch&lt;/strong&gt; in such a situation.&lt;/p&gt;

&lt;p&gt;Now this is fine for a small project. But as our app grows, what used to take &lt;em&gt;seconds&lt;/em&gt; now takes &lt;em&gt;minutes&lt;/em&gt;. We start to see some redundancy in building the exact same code over and over again.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CIezurVm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zbdbp34pv6hitphhdyx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CIezurVm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zbdbp34pv6hitphhdyx5.png" alt="A Git graph, showing merge requests getting built twice."&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌳 Introducing the Git Tree Hash
&lt;/h2&gt;

&lt;p&gt;When you run &lt;code&gt;git rev-parse HEAD&lt;/code&gt;, you get the hash of the HEAD commit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git rev-parse HEAD
10684e38090ed90d2d58d3ff3c81ace99ce658fe
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;When you run &lt;code&gt;git rev-parse HEAD:&lt;/code&gt; (note the colon at the end), you get the hash of the &lt;em&gt;contents&lt;/em&gt; that the HEAD commit represents. This is a tree hash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git rev-parse HEAD:
29aa6872b8bfa8a911995c6a6b206fdd158339e3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;🤔 Now, what’s the difference?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A commit hash is calculated based on its contents &lt;em&gt;and&lt;/em&gt; metadata.&lt;/strong&gt; Metadata includes the commit message, committer and author’s information (like name, email, and date), as well as the parent commits’ hashes. That’s why each time you create a commit you always get a completely different hash.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A tree hash, on the other hand, captures &lt;em&gt;only&lt;/em&gt; the state of the files within.&lt;/strong&gt; It’s as if you take a snapshot of the repository at that commit and hash its contents. If the contents are the same, the tree hash will be the same, regardless of the commit history.&lt;/li&gt;
&lt;/ul&gt;
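
&lt;p&gt;To see the difference for yourself, here is a minimal sketch that creates a throwaway repository and makes two commits with identical files. The file name, the commit messages, and the committer identity are made up for this demo:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: same files, different commits. Every name here is illustrative.
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Demo identity, so the commits work on a machine without Git config.
commit() {
  git -c user.name=Demo -c user.email=demo@example.com commit -q "$@"
}

touch index.html
git add index.html
commit -m "Add index.html"

# An empty commit changes the metadata (message, parent) but not the files.
commit --allow-empty -m "Same files, new commit"

first_commit=$(git rev-parse HEAD~1)
second_commit=$(git rev-parse HEAD)
first_tree=$(git rev-parse HEAD~1:)
second_tree=$(git rev-parse HEAD:)

if [ "$first_commit" != "$second_commit" ]; then echo "commit hashes differ"; fi
if [ "$first_tree" = "$second_tree" ]; then echo "tree hashes match"; fi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Both messages should be printed: the empty commit gets a new commit hash, but it still points to the same tree.&lt;/p&gt;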

&lt;p&gt;You can also get a tree hash of a subdirectory. As you might have guessed, the subdirectory’s tree hash will stay the same if you don’t modify the contents of that subdirectory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;git rev-parse HEAD:docs
1c771ff483992f38b268f08e9c015b613aa51e0a
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you are interested in learning more about this concept, I really recommend the article &lt;a href="https://blog.thoughtram.io/git/2014/11/18/the-anatomy-of-a-git-commit.html"&gt;“The anatomy of a Git commit” by Christoph Burgdorf&lt;/a&gt; where Git blobs, trees, and commits are visualized using an easy-to-understand diagram.&lt;/p&gt;




&lt;h2&gt;
  
  
  💡 Reusing build outputs with Git Tree Hash
&lt;/h2&gt;

&lt;p&gt;Now, how can we put this to use?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We can use Git Tree Hash as a cache key for our build outputs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of running your build script unconditionally, you can make your build process try to restore the build output from the cache first, and then run the build script only when the cache doesn't exist.&lt;/p&gt;

&lt;p&gt;Here’s an example CircleCI configuration. The same concept can also be applied to other CI systems.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight diff"&gt;&lt;code&gt;&lt;span class="gi"&gt;+      - run:
+          name: obtain tree hash
+          command: |
+            git rev-parse HEAD: | tee /tmp/tree.hash
+      - restore_cache:
+          keys:
+            - v1-tree-{{ checksum "/tmp/tree.hash" }}
&lt;/span&gt;       - run:
           name: build
           command: |
&lt;span class="gd"&gt;-            yarn build
&lt;/span&gt;&lt;span class="gi"&gt;+            test -f build/index.html || yarn build
+      - save_cache:
+          key: v1-tree-{{ checksum "/tmp/tree.hash" }}
+          paths:
+            - build
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On CI systems that don’t provide cache storage, you can also use a cloud storage service, like Amazon S3 or Google Cloud Storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Some safety caveats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ensure that the &lt;code&gt;build&lt;/code&gt; directory does not exist before restoring the build output from the cache.&lt;/strong&gt; Build systems that give you a pristine build environment (e.g. GitHub Actions and CircleCI) will not have this problem. However, if you use a build system where the workspace directory can be reused (e.g. Jenkins), please be aware of this caveat.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make sure to verify the validity of the restored cache.&lt;/strong&gt; A corrupted cache can lead to a corrupted build. CircleCI already verifies the integrity of the restored cache, so with CircleCI we get this for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Make sure to never save the build output to the cache if the build failed.&lt;/strong&gt; It may lead to a corrupted cache.&lt;/li&gt;
&lt;/ul&gt;
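
&lt;p&gt;For a CI system without a built-in cache, the same idea and the caveats above can be sketched as a small shell function. This is only a sketch under a few assumptions: &lt;code&gt;build_with_cache&lt;/code&gt; and &lt;code&gt;CACHE_DIR&lt;/code&gt; are hypothetical names, a local directory stands in for the cloud storage bucket, and the build is assumed to produce &lt;code&gt;build/index.html&lt;/code&gt; as in the CircleCI example:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: reuse ./build when this tree hash has been built before.
# build_with_cache and CACHE_DIR are hypothetical names; a local
# directory stands in for S3 or Google Cloud Storage here.
build_with_cache() {
  cache_dir=${CACHE_DIR:-/tmp/build-cache}
  key="v1-tree-$(git rev-parse HEAD:)" || return 1
  archive="$cache_dir/$key.tar.gz"

  # Never restore on top of a leftover build directory.
  rm -rf build

  if [ -f "$archive" ]; then
    echo "cache hit: $key"
    tar -xzf "$archive" || return 1
  else
    echo "cache miss: $key"
    # Run the build command given as arguments; stop if it fails,
    # so a failed build is never saved to the cache.
    "$@" || return 1
    mkdir -p "$cache_dir"
    # Write to a temporary name first, so a half-written archive
    # is never picked up as a cache hit.
    tar -czf "$archive.tmp" build || return 1
    mv "$archive.tmp" "$archive"
  fi

  # Check the restored or freshly built output before it is deployed.
  test -f build/index.html
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;In a build script, this would be called from the repository root as &lt;code&gt;build_with_cache yarn build&lt;/code&gt;.&lt;/p&gt;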




&lt;h2&gt;
  
  
  🤩 Use a more fine-grained tree hash
&lt;/h2&gt;

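&lt;p&gt;Instead of hashing the whole repository, you can hash only the build inputs. This requires you to list everything that may affect the build output, but it increases the chance of a cache hit, because changes elsewhere in the repository no longer invalidate the cache.&lt;br&gt;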
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight diff"&gt;&lt;code&gt;       - run:
           name: obtain tree hash
           command: |
&lt;span class="gd"&gt;-            git rev-parse HEAD: | tee /tmp/tree.hash
&lt;/span&gt;&lt;span class="gi"&gt;+            git rev-parse HEAD:src | tee /tmp/tree.hash
+            git rev-parse HEAD:public | tee -a /tmp/tree.hash
+            git rev-parse HEAD:.babelrc | tee -a /tmp/tree.hash
+            git rev-parse HEAD:postcss.config.js | tee -a /tmp/tree.hash
+            git rev-parse HEAD:tsconfig.json | tee -a /tmp/tree.hash
+            git rev-parse HEAD:yarn.lock | tee -a /tmp/tree.hash
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;⚠️ Some safety caveats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you forget to include a dependency here, a developer may change the configuration only to find that the build output remains unchanged. Tracking this down can be a painful experience. So, when caching strategies like this are involved, make it clear to the developers what goes into determining whether to reuse the build output, and also provide instructions on how to force-invalidate the cache.&lt;/li&gt;
&lt;/ul&gt;
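
&lt;p&gt;A possible variation, as a sketch: instead of appending one hash per line, you can ask Git for the tree entries of all the listed paths at once and hash that listing. The helper name &lt;code&gt;inputs_key&lt;/code&gt; is made up for this example; &lt;code&gt;git ls-tree&lt;/code&gt; and &lt;code&gt;git hash-object --stdin&lt;/code&gt; are standard Git commands. The safety caveat still applies, because a path that does not exist in the commit simply produces no entry:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: one cache key for a list of build inputs.
# inputs_key is a hypothetical helper; run it from the repository root.
inputs_key() {
  # ls-tree prints mode, type, object hash, and name for each listed path.
  # Hashing that listing gives a key that changes when any input changes.
  git ls-tree HEAD "$@" | git hash-object --stdin
}

# Example with the same inputs as the configuration above:
# inputs_key src public .babelrc postcss.config.js tsconfig.json yarn.lock | tee /tmp/tree.hash
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;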




&lt;h2&gt;
  
  
  🏘 Use with subprojects
&lt;/h2&gt;

&lt;p&gt;Sometimes a project contains both the main application and a documentation site in the same repository. Building both on every commit can take time.&lt;/p&gt;

&lt;p&gt;Since you can obtain a tree hash of a subdirectory, you can, for example, cache the documentation site, and only rebuild it when the documentation contents have changed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⚠️ Some safety caveats:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You may use this technique with a monorepo project as well, but the contents of the same-repository dependencies must also go into the cache key of the dependent sub-project. I would recommend using tools designed for monorepos such as &lt;a href="https://nx.dev"&gt;Nx&lt;/a&gt; or &lt;a href="https://bazel.build/"&gt;Bazel&lt;/a&gt; instead of a makeshift solution like this.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ✅ Use with tests
&lt;/h2&gt;

&lt;p&gt;If your tests are taking a long time to run, you can also cache the &lt;code&gt;junit.xml&lt;/code&gt; file or something similar when tests are passing, so you don't have to re-run your tests if you already know that they are passing.&lt;/p&gt;




&lt;h2&gt;
  
  
  ℹ️ Including the commit hash in the build output
&lt;/h2&gt;

&lt;p&gt;Including a commit hash in the built application can make it easier to track down bugs, e.g. in &lt;a href="https://docs.sentry.io/workflow/releases/"&gt;Sentry&lt;/a&gt;. Usually this is accomplished by providing a build-time environment variable. &lt;a href="https://stackoverflow.com/questions/48391897/add-git-information-to-create-react-app"&gt;For example&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;env &lt;/span&gt;&lt;span class="nv"&gt;REACT_APP_GIT_SHA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sb"&gt;`&lt;/span&gt;git rev-parse &lt;span class="nt"&gt;--short&lt;/span&gt; HEAD&lt;span class="sb"&gt;`&lt;/span&gt; yarn build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This might not work well with build output caching. When a build output from a previous commit is reused, the embedded commit ID may not match the commit that is being deployed.&lt;/p&gt;

&lt;p&gt;We can resolve that by leaving the commit hash out of the build and injecting it into the application package at deployment time instead (e.g. by injecting a &lt;code&gt;&amp;lt;script&amp;gt;APP_VERSION='…'&amp;lt;/script&amp;gt;&lt;/code&gt; tag or a manifest JSON file), and having the built application read it at runtime.&lt;/p&gt;
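
&lt;p&gt;As a rough sketch of the manifest approach, a deployment script could write the commit hash into the build output right before uploading it. The helper name &lt;code&gt;write_version_manifest&lt;/code&gt; and the file name &lt;code&gt;version.json&lt;/code&gt; are assumptions made for this example, not part of any tool:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Sketch: stamp a (possibly cached) build output with the commit being deployed.
# write_version_manifest and version.json are hypothetical names.
write_version_manifest() {
  output_dir=$1
  commit=$(git rev-parse --short HEAD) || return 1
  # tee writes the manifest and also echoes it into the deployment log.
  printf '{"commit":"%s"}\n' "$commit" | tee "$output_dir/version.json"
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This would run at deployment time, after the build output has been restored or built, for example as &lt;code&gt;write_version_manifest build&lt;/code&gt;. The application can then fetch &lt;code&gt;version.json&lt;/code&gt; at runtime.&lt;/p&gt;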




&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;By learning how Git works under the hood, and exploring how data is stored inside a Git repository, we can come up with ways to improve our build performance.&lt;/p&gt;

&lt;p&gt;At Taskworld, we saw 3 minutes of savings in build time whenever we could reuse a build output from a previous commit.&lt;/p&gt;

&lt;p&gt;As a result, the build pipeline becomes faster in merging situations. We also get feedback from the CI system much faster when we change parts of the repository other than the main app (such as end-to-end tests and build scripts).&lt;/p&gt;

&lt;p&gt;Hope you’ll find this useful, and thanks for reading!&lt;/p&gt;

</description>
      <category>github</category>
      <category>circleci</category>
      <category>performance</category>
      <category>optimization</category>
    </item>
  </channel>
</rss>
