<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BenBirt</title>
    <description>The latest articles on Forem by BenBirt (@benbirt).</description>
    <link>https://forem.com/benbirt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F187074%2Fb6d8c644-f216-4259-b070-2c6cf673fad0.jpeg</url>
      <title>Forem: BenBirt</title>
      <link>https://forem.com/benbirt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/benbirt"/>
    <language>en</language>
    <item>
      <title>Cut data warehouse costs with run caching</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 24 Sep 2020 12:20:29 +0000</pubDate>
      <link>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</link>
      <guid>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</guid>
      <description>&lt;p&gt;As we've mentioned before, one of the core design goals of Dataform is to make project compilation &lt;strong&gt;hermetic&lt;/strong&gt;. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).&lt;/p&gt;

&lt;p&gt;Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;"run caching" feature&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Don't waste time and money re-computing the same data&lt;/h2&gt;

&lt;p&gt;Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.&lt;/p&gt;

&lt;p&gt;Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.&lt;/p&gt;

&lt;p&gt;Instead, we believe that the pipeline should automatically detect when an execution will not change the output data - and skip the affected stage(s), saving those resources.&lt;/p&gt;

&lt;p&gt;We've built this feature into Dataform.&lt;/p&gt;

&lt;h2&gt;Run caching in Dataform&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.dataform.co/#/5682186490478592/file/definitions%2Fincidents_by_date.sqlx?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;&lt;strong&gt;Try out an example project with run caching here!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can turn run caching on in your project with a few small changes which are described &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;here&lt;/a&gt;. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.&lt;/p&gt;

&lt;p&gt;For example, consider the following SQLX file, which configures Dataform to publish a table &lt;code&gt;age_count&lt;/code&gt; containing the transformed results of a query reading a &lt;code&gt;people&lt;/code&gt; relation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Dataform only needs to (re-)publish this table if any of the following conditions are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; doesn't exist&lt;/li&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)&lt;/li&gt;
&lt;li&gt;The query has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published&lt;/li&gt;
&lt;li&gt;The input table &lt;code&gt;people&lt;/code&gt; has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published (or, if &lt;code&gt;people&lt;/code&gt; is a view, then if any of the input(s) to &lt;code&gt;people&lt;/code&gt; have changed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dataform uses these rules to decide whether or not to publish the table. If none of the conditions hold - i.e. re-publishing the table would produce exactly the same output - then the action is skipped.&lt;/p&gt;
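&lt;p&gt;A rough sketch of that decision logic, in Python (illustrative only - the function names and the fingerprinting scheme here are assumptions, not Dataform's actual implementation):&lt;/p&gt;

```python
import hashlib

def fingerprint(*parts):
    # Stable hash over the query text and the state of each input table.
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
    return h.hexdigest()

def should_publish(output_exists, output_modified_externally,
                   query_text, input_states, last_run_fingerprint):
    """Return True if the table must be (re-)published."""
    if not output_exists:
        return True
    if output_modified_externally:
        return True
    # If the query or any input changed, the stored fingerprint won't match.
    return fingerprint(query_text, *input_states) != last_run_fingerprint

query = "select age, count(1) from people group by age"
cached = fingerprint(query, "people@version1")

print(should_publish(True, False, query, ["people@version1"], cached))  # False: skip
print(should_publish(True, False, query, ["people@version2"], cached))  # True: input changed
```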

&lt;h2&gt;Building in intelligence so you don't have to&lt;/h2&gt;

&lt;p&gt;At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.&lt;/p&gt;

&lt;p&gt;This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.&lt;/p&gt;

&lt;p&gt;If you'd like to learn more, the Dataform framework documentation is &lt;a href="https://docs.dataform.co/?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;here&lt;/a&gt;. Join us on &lt;a href="http://dataform-users.slack.com/"&gt;Slack&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>elt</category>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>etl</category>
    </item>
    <item>
      <title>CI/CD for ETL/ELT pipelines</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 08 Jun 2020 12:14:52 +0000</pubDate>
      <link>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</link>
      <guid>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</guid>
      <description>&lt;p&gt;One of Dataform’s key motivations has been to bring software engineering best practices to teams building ETL/ELT pipelines. To further that goal, we recently launched support for you to run Continuous Integration (CI) checks against your Dataform projects.&lt;/p&gt;

&lt;h2&gt;What is CI/CD?&lt;/h2&gt;

&lt;p&gt;CI/CD is a set of processes which aim to help teams ship software quickly and reliably.&lt;/p&gt;

&lt;p&gt;Continuous integration (CI) checks automatically verify that all changes to your code work as expected, and typically run before the change is merged into your Git master branch. This ensures that the version of the code on the master branch always works correctly.&lt;/p&gt;

&lt;p&gt;Continuous deployment (CD) tools automatically (and frequently) deploy the latest version of your code to production. This is intended to minimize the time it takes for new features or bugfixes to be available in production.&lt;/p&gt;

&lt;h2&gt;CI/CD for Dataform projects&lt;/h2&gt;

&lt;p&gt;Dataform already does most of the CD gruntwork for you. By default, all code committed to the master branch is automatically deployed. For more advanced use cases, you can configure exactly what gets deployed, and when, using &lt;a href="https://docs.dataform.co/dataform-web/scheduling/environments?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;environments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;CI checks, however, are typically configured as part of your Git repository (usually hosted on GitHub, though Dataform supports other Git hosting providers).&lt;/p&gt;

&lt;h2&gt;How to configure CI checks&lt;/h2&gt;

&lt;p&gt;Dataform distributes a &lt;a href="https://hub.docker.com/r/dataformco/dataform"&gt;Docker image&lt;/a&gt; which can be used to run the equivalent of &lt;a href="https://docs.dataform.co/dataform-cli?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;Dataform CLI&lt;/a&gt; commands. For most CI tools, this Docker image is what you'll use to run your automated checks.&lt;/p&gt;

&lt;p&gt;If you host your Dataform Git repository on GitHub, you can use &lt;a href="https://help.github.com/en/actions/configuring-and-managing-workflows/configuring-a-workflow"&gt;GitHub Actions&lt;/a&gt; to run CI workflows. This post assumes you’re using GitHub Actions, but other CI tools are configured in a similar way.&lt;/p&gt;

&lt;p&gt;Here’s a simple example of a GitHub Actions workflow for a Dataform project. Once you put this in a &lt;code&gt;.github/workflows/&amp;lt;some filename&amp;gt;.yaml&lt;/code&gt; file, GitHub will run the workflow on each pull request and commit to your master branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code into workspace directory&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install project dependencies&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run dataform compile&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compile&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This workflow runs &lt;code&gt;dataform compile&lt;/code&gt;: if the project fails to compile, the workflow fails, and that failure is reflected in the GitHub UI.&lt;/p&gt;

&lt;p&gt;Note that it’s possible to run any &lt;code&gt;dataform&lt;/code&gt; CLI command in a CI workflow. However, some commands do need credentials in order to run queries against your data warehouse. In these circumstances, you should encrypt those credentials and commit the encrypted file to your Git repository. Then, in your CI workflow, you decrypt the credentials so that the Dataform CLI can use them.&lt;/p&gt;
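&lt;p&gt;For example, with GitHub Actions, decryption could be an extra step in the workflow above. This is a hypothetical sketch - the file names, the secret name, and the use of symmetric &lt;code&gt;gpg&lt;/code&gt; encryption are illustrative choices, not requirements:&lt;/p&gt;

```yaml
      - name: Decrypt warehouse credentials
        run: >
          gpg --quiet --batch --yes --decrypt
          --passphrase="$CREDENTIALS_PASSPHRASE"
          --output .df-credentials.json df-credentials.json.gpg
        env:
          CREDENTIALS_PASSPHRASE: ${{ secrets.CREDENTIALS_PASSPHRASE }}
      - name: Run dataform test
        uses: docker://dataformco/dataform:1.6.11
        with:
          args: test
```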

&lt;p&gt;For further details on configuring CI/CD for your Dataform projects, please see our &lt;a href="https://docs.dataform.co/guides/ci-cd?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;docs&lt;/a&gt;. As always, if you have any questions, or would like to get in touch with us, please send us a message on &lt;a href="https://join.slack.com/t/dataform-users/shared_invite/zt-dark6b7k-r5~12LjYL1a17Vgma2ru2A"&gt;Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>etl</category>
      <category>elt</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The right way to install Helm charts</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Sun, 05 Apr 2020 13:06:57 +0000</pubDate>
      <link>https://forem.com/benbirt/the-right-way-to-install-helm-charts-4mjp</link>
      <guid>https://forem.com/benbirt/the-right-way-to-install-helm-charts-4mjp</guid>
      <description>&lt;p&gt;&lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt; has become the de facto package management tool for Kubernetes resources. As an example, take a look at these installation instructions for &lt;a href="https://istio.io/docs/setup/install/helm/"&gt;Istio&lt;/a&gt; (a Kubernetes service mesh and observability tool).&lt;/p&gt;

&lt;p&gt;Many common Helm chart installation instructions encourage you to run a single simple command (&lt;code&gt;helm install &amp;lt;chart&amp;gt;&lt;/code&gt;) and - hey presto - some new software is running in your Kubernetes cluster. I think this workflow should generally (if not always) be avoided.&lt;/p&gt;

&lt;p&gt;The big disadvantage of this workflow is that you sacrifice repeatability.&lt;/p&gt;

&lt;h1&gt;Repeatability is critical&lt;/h1&gt;

&lt;p&gt;Consider the scenario when you need to reinstall your Helm charts.&lt;/p&gt;

&lt;p&gt;Say, for example, you need to migrate to a new Kubernetes cluster. You &lt;em&gt;can&lt;/em&gt; run &lt;code&gt;helm ls&lt;/code&gt; to determine all currently installed charts and their versions, and then install all of those on the new cluster - but this is significant manual work, and it only applies if you still have a functioning cluster from which to 'copy' your Helm charts.&lt;/p&gt;

&lt;p&gt;If for some reason your cluster is sufficiently broken - or, perhaps, accidentally deleted - you've now lost the accurate record of which Helm charts you had installed, and at what versions.&lt;/p&gt;

&lt;h1&gt;&lt;code&gt;helm install&lt;/code&gt; is anti-GitOps&lt;/h1&gt;

&lt;p&gt;The GitOps model of managing Kubernetes resources - where a Git repo is treated as the source of truth for what should be running in the cluster - is precisely the solution to the repeatability problem. Somebody made a manual change to the cluster that broke something? No problem: just roll back to what's in Git. (Indeed, preferably some automation will detect the change and do it for you!)&lt;/p&gt;

&lt;p&gt;On the other hand, manually running &lt;code&gt;helm install&lt;/code&gt; commands completely breaks this model, because your Git repo no longer completely encapsulates a description of what should be running in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;To be fair to Helm, it is possible to work around this problem. As long as you're willing to package up all of your Kubernetes resources into Helm chart(s), with all dependencies listed (managed using semver), you can continue using the GitOps deployment model. But this forces you to use Helm for everything.&lt;/p&gt;

&lt;h1&gt;What you should do instead&lt;/h1&gt;

&lt;p&gt;If you want to install Helm charts to your Kubernetes cluster, I strongly recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;vendoring the chart into your Git repo (or otherwise fully specifying the precise version of the chart in source control)&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;helm template&lt;/code&gt; on the chart to render it as Kubernetes YAML&lt;/li&gt;
&lt;li&gt;running plain-old &lt;code&gt;kubectl apply&lt;/code&gt; on the result&lt;/li&gt;
&lt;/ol&gt;
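&lt;p&gt;Here's a minimal sketch of steps 2 and 3 as a Python wrapper. The chart path, release name, and values file are illustrative, and the &lt;code&gt;helm template&lt;/code&gt; invocation assumes Helm 3 syntax:&lt;/p&gt;

```python
import subprocess

def render_and_apply(chart_dir, release, namespace, values_file, dry_run=False):
    """Render a vendored chart to plain YAML, then kubectl-apply the result."""
    template_cmd = ["helm", "template", release, chart_dir,
                    "--namespace", namespace, "--values", values_file]
    apply_cmd = ["kubectl", "apply", "--namespace", namespace, "-f", "-"]
    if dry_run:
        # Just report what would run, without touching a cluster.
        return template_cmd, apply_cmd
    rendered = subprocess.run(template_cmd, check=True, capture_output=True).stdout
    subprocess.run(apply_cmd, input=rendered, check=True)
    return template_cmd, apply_cmd

# Dry run: show the two commands that would be executed.
tmpl, apply_ = render_and_apply("third_party/charts/istio", "istio",
                                "istio-system", "istio-values.yaml", dry_run=True)
print(" ".join(tmpl))
print(" ".join(apply_))
```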

&lt;p&gt;This makes your Kubernetes cluster's configuration fully repeatable. It also means that all chart installations, and version upgrades, are fully auditable in source control.&lt;/p&gt;

&lt;p&gt;In fact - if you take another look at the Istio installation instructions - you'll see that this is exactly the recommended workflow for installing Istio using Helm!&lt;/p&gt;

&lt;p&gt;We do all of this using our build system - &lt;a href="https://bazel.build/"&gt;Bazel&lt;/a&gt;. We've written some custom Bazel rules that help us achieve this workflow - &lt;a href="https://github.com/dataform-co/dataform/tree/de1eb66e558fbd349092d9519a8d5a1edefba94f/tools/helm"&gt;feel free to use them&lt;/a&gt;, if you like.&lt;/p&gt;

&lt;p&gt;Bringing a new chart into the build system is simple. In your Bazel &lt;code&gt;WORKSPACE&lt;/code&gt; file, you'll need to include the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add the 'dataform' repository as a dependency.
&lt;/span&gt;&lt;span class="n"&gt;git_repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"df"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"de1eb66e558fbd349092d9519a8d5a1edefba94f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://github.com/dataform-co/dataform.git"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the Helm repository rules.
&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"@dataform//tools/helm:repository_rules.bzl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"helm_chart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"helm_tool"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download the 'helm' tool.
&lt;/span&gt;&lt;span class="n"&gt;helm_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"helm_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To add a single new chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Download the 'istio' Helm chart.
&lt;/span&gt;&lt;span class="n"&gt;helm_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chartname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;repo_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://storage.googleapis.com/istio-release/releases/1.4.0/charts/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"v1.4.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, when you want to template it, add a &lt;code&gt;BUILD&lt;/code&gt; rule somewhere, looking something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;helm_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chart_tar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@istio//:chart.tgz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio-system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The output of this rule is plain Kubernetes YAML, ready for you to deploy to your cluster however you wish. (We use the standard Bazel Kubernetes &lt;a href="https://github.com/bazelbuild/rules_k8s"&gt;rules&lt;/a&gt;.)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>helm</category>
      <category>devops</category>
      <category>gitops</category>
    </item>
    <item>
      <title>How we store protobufs in MongoDB</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 09 Jan 2020 12:09:42 +0000</pubDate>
      <link>https://forem.com/dataform/mongodb-protobuf-codec-2168</link>
      <guid>https://forem.com/dataform/mongodb-protobuf-codec-2168</guid>
      <description>&lt;p&gt;At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.&lt;/p&gt;

&lt;p&gt;We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.&lt;/p&gt;

&lt;p&gt;Here's a simple example of a protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;birth_timestamp_millis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the &lt;code&gt;.proto&lt;/code&gt; file, and it's (usually, with some caveats) safe to change a field's type, and so on - whereas renaming a 'field' (column) in a traditional SQL-like table is usually a lot of work, involving some amount of DB migration.&lt;/p&gt;

&lt;p&gt;However, safely making arbitrary changes to a protobuf definition requires the data at rest to actually be stored in binary protobuf format - which would make it impossible to query directly, since the database engine doesn't speak protobuf.&lt;/p&gt;

&lt;p&gt;One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of &lt;code&gt;Person&lt;/code&gt; (as defined above) in JSON format, but then renamed &lt;code&gt;birth_timestamp_millis&lt;/code&gt; to &lt;code&gt;birthday_timestamp&lt;/code&gt; - the previously stored &lt;code&gt;Person&lt;/code&gt; would now have an undecodable &lt;code&gt;birthTimestampMillis&lt;/code&gt; field, and would be missing a value for &lt;code&gt;birthdayTimestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.&lt;/p&gt;

&lt;p&gt;Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded/decoded as it is stored/retrieved from the database, using custom, user-defined codecs.&lt;/p&gt;

&lt;p&gt;We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.&lt;/p&gt;
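&lt;p&gt;The core idea can be sketched in a few lines of Python (the real codec is written in Go and works on compiled protobuf descriptors; the hand-written tag maps below are purely illustrative):&lt;/p&gt;

```python
def to_tag_keyed(msg, field_tags):
    """Re-key a field-name-keyed dict by protobuf tag number.

    `field_tags` maps each field name to (tag, nested_map_or_None);
    nested messages are re-keyed recursively.
    """
    out = {}
    for name, value in msg.items():
        tag, nested = field_tags[name]
        out[str(tag)] = to_tag_keyed(value, nested) if nested else value
    return out

nested_tags = {"nestedStringField": (2, None), "nestedInt32Field": (1, None)}
example_tags = {
    "stringField": (3, None),
    "enumField": (10, None),
    "int64Field": (33, None),
    "nestedMessage": (107, nested_tags),
}

doc = to_tag_keyed(
    {"stringField": "foo", "enumField": 1, "int64Field": 123456789,
     "nestedMessage": {"nestedStringField": "bar", "nestedInt32Field": 12}},
    example_tags,
)
print(doc)  # {'3': 'foo', '10': 1, '33': 123456789, '107': {'2': 'bar', '1': 12}}
```

&lt;p&gt;Renaming a field only changes the keys of the map, never the stored document - which is exactly the property we want.&lt;/p&gt;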

&lt;p&gt;For example, given the following protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Example&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="na"&gt;enum_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;oneof&lt;/span&gt; &lt;span class="n"&gt;example_oneof&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;int64_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;NestedMessage&lt;/span&gt; &lt;span class="na"&gt;nested_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;107&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;VAL_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="na"&gt;VAL_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;573&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;NestedMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;nested_string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;nested_int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the following instance of &lt;code&gt;Example&lt;/code&gt;, in canonical JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Our MongoDB codec will encode the instance of &lt;code&gt;Example&lt;/code&gt; as the following Mongo BSON document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"33"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123456789&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"107"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With this encoding, if we change the name of &lt;code&gt;nested_string_field&lt;/code&gt; to &lt;code&gt;something_else&lt;/code&gt;, or the enum value &lt;code&gt;VAL_0&lt;/code&gt; to &lt;code&gt;BETTER_ENUM_VALUE_NAME&lt;/code&gt;, we'll still be able to decode the document, without any loss of data.&lt;/p&gt;

&lt;p&gt;This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.&lt;/p&gt;
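&lt;p&gt;The core idea can be sketched in a few lines of Python (a hypothetical illustration of the technique, not the Go codec linked below). Given a mapping from field names to field numbers - which the real codec derives from the protobuf descriptor - documents are stored keyed by number, so renaming a field only changes the mapping, never the stored data. The names &lt;code&gt;stringField&lt;/code&gt; and &lt;code&gt;enumField&lt;/code&gt; are assumed names for fields 3 and 10 in the example above.&lt;/p&gt;

```python
# Hypothetical sketch: encode/decode documents keyed by protobuf field
# number rather than field name, so field renames are safe.

# Name-to-number mappings for the example message; the real codec
# derives these from the protobuf descriptor.
FIELD_NUMBERS = {"stringField": 3, "enumField": 10, "int64Field": 33,
                 "nestedMessage": 107}
NESTED_NUMBERS = {"nestedStringField": 2, "nestedInt32Field": 1}

def encode(msg, numbers):
    # Replace each field name with its (stringified) field number.
    out = {}
    for name, value in msg.items():
        key = str(numbers[name])
        out[key] = encode(value, NESTED_NUMBERS) if isinstance(value, dict) else value
    return out

def decode(doc, numbers):
    # Invert the mapping to recover whatever the fields are named today.
    names = {str(num): name for name, num in numbers.items()}
    out = {}
    for key, value in doc.items():
        name = names[key]
        out[name] = decode(value, NESTED_NUMBERS) if isinstance(value, dict) else value
    return out

example = {"stringField": "foo", "enumField": 1,
           "int64Field": 123456789,
           "nestedMessage": {"nestedStringField": "bar",
                             "nestedInt32Field": 12}}
encoded = encode(example, FIELD_NUMBERS)
```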

&lt;p&gt;The code is open-sourced &lt;a href="https://github.com/dataform-co/dataform/tree/master/protomongo"&gt;here&lt;/a&gt; (&lt;a href="https://godoc.org/github.com/dataform-co/dataform/protomongo"&gt;godoc&lt;/a&gt;). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>protobuf</category>
      <category>database</category>
    </item>
    <item>
      <title>Thinking about technical debt</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Sat, 20 Jul 2019 13:09:08 +0000</pubDate>
      <link>https://forem.com/benbirt/thinking-about-technical-debt-b72</link>
      <guid>https://forem.com/benbirt/thinking-about-technical-debt-b72</guid>
      <description>&lt;p&gt;Technical debt is a common concern among software engineers. But we don't often think about it in the right terms.&lt;/p&gt;

&lt;h1&gt;
  
  
  Monetary debt
&lt;/h1&gt;

&lt;p&gt;After you take out a loan, you (usually) have to make interest payments.&lt;/p&gt;

&lt;p&gt;If you cannot make interest payments in full, in the short term, your debt grows. If you continue to be unable to repay interest, you may have to declare bankruptcy. This means that you may lose your assets to pay down the debt.&lt;/p&gt;

&lt;p&gt;If circumstances are good, you will be able to both make interest payments and pay down the debt. After enough of these payments, you will be debt-free.&lt;/p&gt;

&lt;h1&gt;
  
  
  Technical debt as monetary debt
&lt;/h1&gt;

&lt;p&gt;Introducing technical debt to a codebase is equivalent to taking out a loan.&lt;/p&gt;

&lt;p&gt;You must now make continuous interest payments on the loan. Interest can come in various forms. One is engineering time spent on fixing bugs or regressions. Another is an increase in operational load from running the system in production.&lt;/p&gt;

&lt;p&gt;Technical debt will grow if you do not have enough income (engineering time) to meet its costs. Lacking the time to make proper bug fixes, engineers will introduce yet more debt to the codebase.&lt;/p&gt;

&lt;p&gt;If this situation remains unchecked, you may have to declare technical bankruptcy. This is a terrible situation: you must now drop all other work to fix the technical debt. In the worst case, you may have to drop all support for the system while you rewrite it completely from scratch.&lt;/p&gt;

&lt;h1&gt;
  
  
  We should not be afraid of technical debt
&lt;/h1&gt;

&lt;p&gt;There are often good reasons for taking out a loan. For example, it might enable us to make an investment that has a large payoff in the future.&lt;/p&gt;

&lt;p&gt;As with loans, so too with technical debt. We must make tradeoffs between technical debt and investment in new features. Those features may pay off in the form of more profits, enabling us to hire more engineers. Thus paying down technical debt becomes much easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  Manage technical debt as you would a loan
&lt;/h1&gt;

&lt;p&gt;Just as with real debt, technical debt requires careful management.&lt;/p&gt;

&lt;p&gt;We should not be afraid of technical debt for its own sake. What should concern us is poor planning.&lt;/p&gt;

&lt;p&gt;If you were going to use a loan to make an investment, it would be prudent to make a plan. This plan would model both the loan's value and the investment's value over time.&lt;/p&gt;

&lt;p&gt;Technical debt is harder to model. But we should carefully consider both its cost and the opportunity cost of not taking it on. We should also keep track of our current debt, what it is costing us, and what headroom is available to take on more.&lt;/p&gt;

</description>
      <category>codequality</category>
      <category>techdebt</category>
      <category>management</category>
    </item>
    <item>
      <title>How to write unit tests for your SQL queries</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 15 Jul 2019 14:31:00 +0000</pubDate>
      <link>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</link>
      <guid>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</guid>
      <description>&lt;p&gt;I’ve previously &lt;a href="https://dev.to/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg"&gt;written&lt;/a&gt; about how I think we should prefer writing processing pipelines in pure SQL. However, a big difference between SQL and more widely-used languages is that those other languages generally have a strong tradition of unit testing.&lt;/p&gt;

&lt;p&gt;Usually, when we talk about ‘tests’ in the context of SQL, we don’t actually mean unit tests. Instead, the term generally refers to data tests, which are really &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;assertions&lt;/a&gt; that the data itself conforms to some test criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit tests are not assertions.&lt;/strong&gt; Unit tests verify the logic of a SQL query by running that query on some fixed set of inputs. Assertions necessarily depend upon the real datasets which they validate, while unit tests should never depend on any real data.&lt;/p&gt;
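&lt;p&gt;The distinction is easy to see in code. An assertion is a query over the real dataset that returns the rows violating some requirement, and it passes only if no such rows exist. A minimal sketch (using Python's built-in sqlite3 in place of a warehouse; the table and column names are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# Sketch of a data assertion: a query over the (real) dataset that
# selects rows violating a requirement. The assertion passes only if
# the query returns zero rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

violations = conn.execute(
    "SELECT id FROM users WHERE email IS NULL"
).fetchall()
assertion_passed = len(violations) == 0
conn.close()
```

Note that this depends entirely on whatever data happens to be in the table - which is exactly why an assertion cannot verify the logic of a query the way a unit test can.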

&lt;h3&gt;
  
  
  The benefits of unit tests
&lt;/h3&gt;

&lt;p&gt;Unit testing is a standard practice in software engineering. Unit tests help ensure that difficult pieces of logic or complex interactions between components work as expected - and continue to work as expected as the surrounding code changes.&lt;/p&gt;

&lt;p&gt;Unit tests should not have any external dependencies; tests run the code in question on some faked inputs, ensuring that changes outside of that unit of code do not affect the test. This means that the success or failure of the test comes down purely to the code’s logic. Thus, if the test fails, you know exactly where to start debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL unit testing widespread?
&lt;/h3&gt;

&lt;p&gt;In standard languages, a unit test typically consists of injecting fake input into the code under test and checking that the output matches some expected result. However, SQL scripts don’t label their input datasets - typically, they’re just defined statically inline in a FROM clause. This makes it difficult to inject fake input test data into your SQL code.&lt;/p&gt;

&lt;p&gt;The result of this is that most SQL code goes untested.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Various SQL frameworks let you define layers of indirection between your SQL and its input(s); i.e. you declare and label the input datasets upon which a query depends. Unit testing frameworks can use this indirection to replace real input data with faked versions.&lt;/p&gt;

&lt;p&gt;We can then run the code under test, using some faked input, and compare the output result rows against a set of expected outputs. If the actual output of the code under test matches the expected output, the test passes; if not, it fails.&lt;/p&gt;

&lt;p&gt;This technique is simple and gives you real power to verify that a SQL script does what you think it does. You can pass faked inputs to your SQL that your real data may not currently contain, giving you confidence that it can robustly handle a wide range of data.&lt;/p&gt;
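&lt;p&gt;To make this concrete, here is a minimal sketch of the technique, again using Python's built-in sqlite3 rather than a cloud warehouse (the &lt;code&gt;orders&lt;/code&gt; table and revenue query are hypothetical): load the faked input into a table, run the query under test against it, and compare the result rows with the expected rows.&lt;/p&gt;

```python
import sqlite3

# Minimal sketch of a SQL unit test. The query under test would
# normally read a real "orders" table; the test injects a faked
# version instead.
QUERY_UNDER_TEST = """
    SELECT country, SUM(amount) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY country
"""

def run_unit_test(fake_input_rows, expected_rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (country TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", fake_input_rows)
    actual = conn.execute(QUERY_UNDER_TEST).fetchall()
    conn.close()
    # The test passes only if the query's logic produces exactly the
    # expected output from the faked input.
    return actual == expected_rows

passed = run_unit_test(
    fake_input_rows=[("DE", 10), ("DE", 5), ("US", 7)],
    expected_rows=[("DE", 15), ("US", 7)],
)
```

Because the input is fixed, a failure here can only mean the query's logic changed - there is no real data involved to muddy the diagnosis.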

&lt;h3&gt;
  
  
  Test case support in Dataform
&lt;/h3&gt;

&lt;p&gt;When using Dataform’s &lt;a href="https://docs.dataform.co/guides/datasets/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;enriched SQL&lt;/a&gt;, you reference input datasets using either the &lt;code&gt;ref()&lt;/code&gt; or &lt;code&gt;resolve()&lt;/code&gt; function. This functionality gives us an easy way to inject fake input datasets into a script, thus enabling users to write unit tests.&lt;/p&gt;

&lt;p&gt;We have defined a new type of Dataform script: &lt;code&gt;test&lt;/code&gt;. In a &lt;code&gt;test&lt;/code&gt; query, you specify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query you’re testing&lt;/li&gt;
&lt;li&gt;The faked inputs, each labeled with their referenced name&lt;/li&gt;
&lt;li&gt;The expected output of running the query on the faked inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes, when you run the test, we dynamically replace the inputs to the &lt;code&gt;dataset&lt;/code&gt; query with your faked input data. We then run the &lt;code&gt;dataset&lt;/code&gt; query, along with the query that defines your expected output, and check that the resulting rows match. Simple!&lt;/p&gt;

&lt;h3&gt;
  
  
  An example
&lt;/h3&gt;

&lt;p&gt;Here’s a worked example written using Dataform’s JavaScript API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First, define a dataset - we’ll follow this up with the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT
      FLOOR(age / 5) * 5 AS age_group,
      COUNT(1) AS user_count
      FROM &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;
      GROUP BY age_group
    `&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Now, define the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test_age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Specify the name of the dataset under test.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the fake input “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age UNION ALL
      SELECT 21 AS age UNION ALL
      SELECT 24 AS age UNION ALL
      SELECT 34 AS age
    `&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the expected output of running “age_groups” on the “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age_group, 1 AS user_count UNION ALL
      SELECT 20 AS age_group, 2 AS user_count UNION ALL
      SELECT 30 AS age_group, 1 AS user_count
    `&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, if you prefer to use Dataform’s enriched SQL, the unit test would look as follows (note that publishing the dataset is elided for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For more details, see our &lt;a href="https://docs.dataform.co/guides/tests/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’ve released this functionality as part of the v1.0.0 release of our &lt;a href="https://www.npmjs.com/package/@dataform/cli"&gt;@dataform NPM packages&lt;/a&gt;. Dataform Web will soon support test cases, too. Let us know what you think!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>testing</category>
    </item>
    <item>
      <title>Consider SQL when writing your next processing pipeline</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 27 Jun 2019 11:03:34 +0000</pubDate>
      <link>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</link>
      <guid>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</guid>
      <description>&lt;p&gt;Once a team or organization has some data to manage - customer data, events to be fed into some machine learning system, or whatever else - they almost immediately find themselves writing, running, and maintaining processing pipelines.&lt;/p&gt;

&lt;p&gt;The outputs of these pipelines are many and varied - customer and market analysis, data cleaning, and so on - and such pipelines seem to pop up more often and more quickly than one expects.&lt;/p&gt;

&lt;p&gt;Today, most non-trivial data processing is done using some pipelining technology, for example Dataflow / Apache Beam, with user code typically written in languages such as Java, Python, or perhaps Go.&lt;/p&gt;

&lt;h4&gt;
  
  
  My experience
&lt;/h4&gt;

&lt;p&gt;I worked as a software engineer at Google for several years, during which I led multiple teams and projects which required writing, managing, and maintaining various types of processing pipelines. During that time I became convinced that - for the majority of use-cases - &lt;strong&gt;expressing these pipelines in SQL is simpler, cheaper, and easier than the alternatives, with few disadvantages&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For what it’s worth, I’ll note that I’m actually a big fan of these pipelining technologies. While at Google, I was a cheerleader for the internal version of Cloud Dataflow (Flume) for both batch and streaming use-cases. However, I think that the reasons for using them - broadly - no longer apply to today’s world of highly scalable cloud warehouses and query engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL the de facto processing pipeline language today?
&lt;/h3&gt;

&lt;p&gt;SQL wasn’t really a scalable option for processing data before we had widely available cloud data warehouses such as BigQuery and Redshift. Without these highly-scalable query engines, the only reasonable choice was to perform any significant data processing outside of the data warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scalable processing
&lt;/h4&gt;

&lt;p&gt;The first truly scalable data processing solution was probably something like Google MapReduce. It quickly became obvious that chaining MapReduce-like processing steps into a full pipeline using some higher-level API could produce very powerful pipelining systems, and frameworks such as Hadoop, Apache Spark, and Google Cloud Dataflow were born. These systems enabled users to process terabytes of data (or more, with some tuning) quickly and scalably, which was often simply impossible using SQL query engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, cloud data warehouse systems have evolved dramatically over the past 5 years.&lt;/strong&gt; SQL queries running on BigQuery’s query engine will generally run much more quickly than the alternative, which requires reading all of the relevant data out of the warehouse, processing it, then writing the result back to some other table. It’s also much easier to run in production; there’s no need to manage temporary state, queries are optimized automatically, etc. All of these concerns are pushed to the query engine, and the user doesn’t have to care about them.&lt;/p&gt;

&lt;p&gt;The query engine is the best place to optimize the pipeline since it has access to the most metadata about what data is being processed; as a result it’s much easier to manage the pipeline operationally in production. This is much better than the alternative - I can’t tell you how many hours (or days, or even weeks) my teams and I have spent debugging scalability issues and poor optimization choices in Java pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Existing bias towards imperative languages
&lt;/h4&gt;

&lt;p&gt;I think there is an understandable cultural bias in software engineering teams towards using standard imperative programming languages to implement processing pipelines, and until very recently it wasn’t really possible to mix and match SQL and non-SQL (see below for more on this).&lt;/p&gt;

&lt;p&gt;Engineers are much more familiar with configuring jobs written in these languages in production, but happily, modern SaaS options obviate this problem for SQL pipelines by taking responsibility for scheduling and running the user’s code, so that &lt;strong&gt;the user needs to do very little productionization&lt;/strong&gt; at all.&lt;/p&gt;

&lt;p&gt;Additionally, SQL scripts have sometimes been treated as a second class citizen versus other languages. Some tools used for SQL script development haven’t supported standard software engineering techniques such as version control or code review. However, this too has changed, with modern toolchain options supporting these practices as first-class features.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL has distinct advantages over the alternatives
&lt;/h3&gt;

&lt;p&gt;SQL is a language built and designed to support exactly what you want to do when you’re processing data: joining, filtering, aggregating, and transforming data. Thus, it’s usually &lt;strong&gt;much simpler and easier to express your pipeline in SQL&lt;/strong&gt; than it is in some other pipelining technology. (If you’d like to see an example of just how powerful SQL can be, take a look at &lt;a href="https://towardsdatascience.com/deep-neural-network-implemented-in-pure-sql-over-bigquery-f3ed245814d3"&gt;this article&lt;/a&gt; in which a deep neural network is implemented with it!)&lt;/p&gt;

&lt;h4&gt;
  
  
  A common language
&lt;/h4&gt;

&lt;p&gt;The biggest advantage of implementing your pipeline in SQL is that it’s likely to be the same language that you or your data team use to actually perform final analysis on the output of the pipeline.&lt;/p&gt;

&lt;p&gt;This means that the data team don’t need support from engineers to make changes to the pipeline. Instead, they’re empowered to make the changes themselves.&lt;/p&gt;

&lt;h4&gt;
  
  
  Debugging
&lt;/h4&gt;

&lt;p&gt;When something goes wrong, SQL pipelines are usually much easier to introspect than the alternative. If you want to check exactly what data is being output by any given processing stage of a SQL pipeline, you can simply pull out those results into some relevant SELECT query.&lt;/p&gt;

&lt;p&gt;Doing the same using a pipelining system can be a real pain, involving making significant code changes (just to add enough instrumentation to enable debugging) and re-deploying the pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster development
&lt;/h4&gt;

&lt;p&gt;During development of a SQL-based pipeline, the iteration cycle is significantly faster because the feedback loop is much quicker - edit your queries, re-run the pipeline, and immediately see new results.&lt;/p&gt;

&lt;p&gt;If the pipeline processes so much data that it takes more than a minute or two to execute, it’s trivial to process a fraction of the data (to get results more quickly) by adding a LIMIT to your query (or subqueries), or by only selecting rows belonging to a subset of the input dataset.&lt;/p&gt;

&lt;p&gt;When writing Java pipelines from scratch, I would often find that testing out a single bugfix would take hours - not so with SQL. In fact, I often found myself writing a SQL script to validate the output of some productionized Java pipeline, only to belatedly realize that I had essentially re-implemented the Java pipeline in SQL - in far fewer lines of code, with far greater readability and significantly less complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL’s disadvantages
&lt;/h3&gt;

&lt;p&gt;In my experience there are two distinct domains where other languages have an edge on pure SQL: (1) unit testing and (2) the readability of particularly complex data transformations.&lt;/p&gt;

&lt;p&gt;Some SQL queries can be fairly complex, especially if they use powerful features such as BigQuery’s &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts"&gt;analytic functions&lt;/a&gt;. I’d like to be able to write unit tests for these SQL queries, statically defining sets of input rows and expected output rows, asserting that the query does exactly what it’s supposed to. We’re working on implementing this feature within Dataform, and expect to have basic unit test support out soon. In the meantime, a useful tool which can help here is &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=consider_sql"&gt;data assertions&lt;/a&gt;, with which you can express requirements of your input data - for example, correctness checks - before continuing to run your processing pipeline.&lt;/p&gt;

&lt;p&gt;Occasionally, you will want to run some particularly complicated data transformation logic. (For one interesting - if slightly insane - example, check out &lt;a href="https://medium.com/@urish/yes-i-compiled-1-000-000-typescript-files-in-under-40-seconds-this-is-how-6429a665999c"&gt;this&lt;/a&gt; Medium post.) Sometimes, when expressed in SQL, this can become difficult to read and/or maintain due to its complexity. However, there exists a nice solution to this problem: User-Defined Functions (UDFs). UDFs allow you to break out of SQL and use JavaScript or Python (depending on the warehouse) when you need the power of a full imperative programming language to implement your own function.&lt;/p&gt;
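&lt;p&gt;The same escape hatch can be sketched with SQLite's Python bindings, which let you register an imperative function and then call it by name from SQL. The email-normalization logic here is a made-up example of something awkward to express in pure SQL; warehouse UDFs work analogously, though their registration syntax differs.&lt;/p&gt;

```python
import sqlite3

# Sketch of the UDF idea using SQLite's Python bindings; warehouses
# offer the same escape hatch via their own CREATE FUNCTION syntax.
def normalize_email(email):
    # Imperative logic that would be awkward in pure SQL: strip dots
    # and "+suffix" from the local part, then lowercase everything.
    local, _, domain = email.partition("@")
    return local.split("+")[0].replace(".", "").lower() + "@" + domain.lower()

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("normalize_email", 1, normalize_email)

rows = conn.execute(
    "SELECT normalize_email('First.Last+promo@Example.COM')"
).fetchall()
```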

&lt;h3&gt;
  
  
  The future
&lt;/h3&gt;

&lt;p&gt;We’re seeing a general move towards expressing pipelines in plain SQL. Indeed, Apache Beam recently launched &lt;a href="https://beam.apache.org/documentation/dsls/sql/overview/"&gt;support&lt;/a&gt; for Beam SQL, allowing Java users to express transformations using inline SQL. I expect that as time goes on, we’ll see fewer and fewer processing pipelines expressed using Java/Python/Go, and much more work being done inside data warehouses using simple SQL, for all of the reasons discussed above.&lt;/p&gt;

</description>
      <category>sql</category>
    </item>
  </channel>
</rss>
