<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Dataform</title>
    <description>The latest articles on Forem by Dataform (@dataform).</description>
    <link>https://forem.com/dataform</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F882%2F988bc05f-0649-4272-a8af-c78c03d61c7e.png</url>
      <title>Forem: Dataform</title>
      <link>https://forem.com/dataform</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/dataform"/>
    <language>en</language>
    <item>
      <title>Cut data warehouse costs with run caching</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 24 Sep 2020 12:20:29 +0000</pubDate>
      <link>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</link>
      <guid>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</guid>
      <description>&lt;p&gt;As we've mentioned before, one of the core design goals of Dataform is to make project compilation &lt;strong&gt;hermetic&lt;/strong&gt;. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).&lt;/p&gt;

&lt;p&gt;Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;"run caching" feature&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't waste time and money re-computing the same data
&lt;/h2&gt;

&lt;p&gt;Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.&lt;/p&gt;

&lt;p&gt;Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.&lt;/p&gt;

&lt;p&gt;Instead, we believe that the pipeline should automatically detect if it's not going to change the output data - and if so, then the affected stage(s) should be skipped, saving those resources.&lt;/p&gt;

&lt;p&gt;We've built this feature into Dataform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run caching in Dataform
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.dataform.co/#/5682186490478592/file/definitions%2Fincidents_by_date.sqlx?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;&lt;strong&gt;Try out an example project with run caching here!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can turn run caching on in your project with a few small changes which are described &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;here&lt;/a&gt;. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.&lt;/p&gt;

&lt;p&gt;For example, consider the following SQLX file, which configures Dataform to publish a table &lt;code&gt;age_count&lt;/code&gt; containing the transformed results of a query reading a &lt;code&gt;people&lt;/code&gt; relation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Dataform only needs to (re-)publish this table if any of the following conditions are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; doesn't exist&lt;/li&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)&lt;/li&gt;
&lt;li&gt;The query has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published&lt;/li&gt;
&lt;li&gt;The input table &lt;code&gt;people&lt;/code&gt; has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published (or, if &lt;code&gt;people&lt;/code&gt; is a view, then if any of the input(s) to &lt;code&gt;people&lt;/code&gt; have changed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dataform uses these rules to decide whether or not to publish the table. If all of the tests fail, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building in intelligence so you don't have to
&lt;/h2&gt;

&lt;p&gt;At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.&lt;/p&gt;

&lt;p&gt;This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.&lt;/p&gt;

&lt;p&gt;If you'd like to learn more, the Dataform framework documentation is &lt;a href="https://docs.dataform.co/?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;here&lt;/a&gt;. Join us on &lt;a href="http://dataform-users.slack.com/"&gt;Slack&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>elt</category>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>etl</category>
    </item>
    <item>
      <title>CI/CD for ETL/ELT pipelines</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 08 Jun 2020 12:14:52 +0000</pubDate>
      <link>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</link>
      <guid>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</guid>
      <description>&lt;p&gt;One of Dataform’s key motivations has been to bring software engineering best practices to teams building ETL/ELT pipelines. To further that goal, we recently launched support for you to run Continuous Integration (CI) checks against your Dataform projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CI/CD?
&lt;/h2&gt;

&lt;p&gt;CI/CD is a set of processes which aim to help teams ship software quickly and reliably.&lt;/p&gt;

&lt;p&gt;Continuous integration (CI) checks automatically verify that all changes to your code work as expected, and typically run before the change is merged into your Git master branch. This ensures that the version of the code on the master branch always works correctly.&lt;/p&gt;

&lt;p&gt;Continuous deployment (CD) tools automatically (and frequently) deploy the latest version of your code to production. This is intended to minimize the time it takes for new features or bugfixes to be available in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  CI/CD for Dataform projects
&lt;/h2&gt;

&lt;p&gt;Dataform already does most of the CD gruntwork for you. By default, all code committed to the master branch is automatically deployed. For more advanced use cases, you can configure exactly what you want to be deployed and when using &lt;a href="https://docs.dataform.co/dataform-web/scheduling/environments?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;environments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;CI checks, however, are usually configured as part of your Git repository (usually hosted on GitHub, though Dataform supports other Git hosting providers).&lt;/p&gt;

&lt;h2&gt;
  
  
  How to configure CI checks
&lt;/h2&gt;

&lt;p&gt;Dataform distributes a &lt;a href="https://hub.docker.com/r/dataformco/dataform"&gt;Docker image&lt;/a&gt; which can be used to run the equivalent of &lt;a href="https://docs.dataform.co/dataform-cli?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;Dataform CLI&lt;/a&gt; commands. For most CI tools, this Docker image is what you'll use to run your automated checks.&lt;/p&gt;

&lt;p&gt;If you host your Dataform Git repository on GitHub, you can use &lt;a href="https://help.github.com/en/actions/configuring-and-managing-workflows/configuring-a-workflow"&gt;GitHub Actions&lt;/a&gt; to run CI workflows. This post assumes you’re using GitHub Actions, but other CI tools are configured in a similar way.&lt;/p&gt;

&lt;p&gt;Here’s a simple example of a GitHub Actions workflow for a Dataform project. Once you put this in a &lt;code&gt;.github/workflows/&amp;lt;some filename&amp;gt;.yaml&lt;/code&gt; file, GitHub will run the workflow on each pull request and commit to your master branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code into workspace directory&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install project dependencies&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run dataform compile&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compile&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This workflow runs &lt;code&gt;dataform compile&lt;/code&gt; - this means that if the project fails to compile, the workflow will fail, and this will be reflected in the GitHub UI.&lt;/p&gt;

&lt;p&gt;Note that it’s possible to run any &lt;code&gt;dataform&lt;/code&gt; CLI command in a CI workflow. However, some commands do need credentials in order to run queries against your data warehouse. In these circumstances, you should encrypt those credentials and commit the encrypted file to your Git repository. Then, in your CI workflow, you decrypt the credentials so that the Dataform CLI can use them.&lt;/p&gt;

&lt;p&gt;For further details on configuring CI/CD for your Dataform projects, please see our &lt;a href="https://docs.dataform.co/guides/ci-cd?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;docs&lt;/a&gt;. As always, if you have any questions, or would like to get in touch with us, please send us a message on &lt;a href="https://join.slack.com/t/dataform-users/shared_invite/zt-dark6b7k-r5~12LjYL1a17Vgma2ru2A"&gt;Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>etl</category>
      <category>elt</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Building an end to end Machine Learning Pipeline in Bigquery</title>
      <dc:creator>Ahmad Faiyaz</dc:creator>
      <pubDate>Fri, 13 Mar 2020 10:41:16 +0000</pubDate>
      <link>https://forem.com/dataform/building-an-end-to-end-machine-learning-pipeline-in-bigquery-3n1g</link>
      <guid>https://forem.com/dataform/building-an-end-to-end-machine-learning-pipeline-in-bigquery-3n1g</guid>
      <description>&lt;p&gt;Google BigQuery is one of the more advanced data warehouses in the market, and has out of the box support for building and training ML models using SQL like statements without requiring any code. This is extremely powerful, however managing end to end ML pipelines in this way can be fragile and requires manual steps to updating training and prediction.&lt;/p&gt;

&lt;p&gt;In this article we walk through building a simple end to end BigQuery ML pipeline using the open-source framework &lt;a href="https://dataform.co/?utm_medium=organic&amp;amp;utm_source=dataform_blog&amp;amp;utm_campaign=bigquery-ml-pipeline"&gt;Dataform&lt;/a&gt; to help us manage the end to end process of data preparation, training and prediction.&lt;/p&gt;

&lt;p&gt;Google BigQuery provides some Machine Learning algorithms such as Linear regression, Binary logistic regression etc. To find out more about the models that BigQuery supports, check out the &lt;a href="https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro#supported_models_in"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A typical workflow for building a machine learning model looks like:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data exploration&lt;/li&gt;
&lt;li&gt;Data pre-processing (data transformation)&lt;/li&gt;
&lt;li&gt;Model training&lt;/li&gt;
&lt;li&gt;Model evaluation on test dataset&lt;/li&gt;
&lt;li&gt;Prediction/Inference on real dataset&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In this article I am going to follow the &lt;a href="https://cloud.google.com/bigquery-ml/docs/logistic-regression-prediction"&gt;tutorial&lt;/a&gt; from Google Cloud documentation to create a machine learning model with Google BigQuery, please read the official documentation for understanding the technical details.&lt;/p&gt;

&lt;p&gt;For managing our end to end pipeline, we are going to use Dataform to help us version control our queries and manage pipeline execution order. Dataform makes it easy for us to version control our BigQuery code and execute complex pipelines with just a few commands.&lt;/p&gt;

&lt;p&gt;All of the code for this example is available in the demo repository &lt;a href="https://github.com/dataform-co/bigquery-ml-pipeline"&gt;here&lt;/a&gt; for you to follow along with. To run this example in your own BigQuery project you perform the following steps.&lt;/p&gt;

&lt;p&gt;Clone the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/dataform-co/bigquery-ml-pipeline.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Install the Dataform CLI using npm or yarn:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;npm i &lt;span class="nt"&gt;-g&lt;/span&gt; @dataform/cli
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Set up the Dataform project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bigquery-ml-pipeline &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; dataform &lt;span class="nb"&gt;install&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For instructions on authenticating BigQuery so that you can run the queries, you can follow the Dataform documentation &lt;a href="https://docs.dataform.co/dataform-cli?utm_medium=organic&amp;amp;utm_source=dataform_blog&amp;amp;utm_campaign=bigquery-ml-pipeline"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data exploration
&lt;/h2&gt;

&lt;p&gt;Data exploration is usually done in jupyter notebooks or some dashboard solution: for example Looker or Google Data Studio. In this step one needs to find out which datasets are required, and which columns should be used as features for model training. Following the Google Cloud tutorial, I am going to use the public dataset named Census Adult Income (&lt;code&gt;bigquery-public-data.ml_datasets.census_adult_income&lt;/code&gt;), which contains these columns:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tr6otQdD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://assets.dataform.co/blog/bigquery-ml-pipeline/dataset-schema.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tr6otQdD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://assets.dataform.co/blog/bigquery-ml-pipeline/dataset-schema.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data pre processing
&lt;/h2&gt;

&lt;p&gt;I need to split the dataset into three sections: training, evaluation and prediction. To do this using Dataform, I will create a new sqlx file with the code block below. This query extracts data on census respondents, including &lt;code&gt;education_num&lt;/code&gt;, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. This query excludes several categories that duplicate data: for example, the columns &lt;code&gt;education&lt;/code&gt; and &lt;code&gt;education_num&lt;/code&gt; in the census_adult_income table express the same data in different formats, so this query excludes the education column. The dataframe column uses the excluded &lt;code&gt;functional_weight&lt;/code&gt; column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a table containing these columns, so that I can use them to perform training and prediction later.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;census_input.sqlx&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;"table"&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;age&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;workclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;native_country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;marital_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;education_num&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;occupation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;race&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;hours_per_week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;income_bracket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;CASE&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functional_weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'training'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functional_weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'evaluation'&lt;/span&gt;
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="k"&gt;MOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;functional_weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'prediction'&lt;/span&gt;
  &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="nv"&gt;`bigquery-public-data.ml_datasets.census_adult_income`&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Model training
&lt;/h2&gt;

&lt;p&gt;Now it's time to create a model using the training dataset. I am creating a new file for this model creation. You can find more about the model options from &lt;a href="https://cloud.google.com/bigquery-ml/docs/logistic-regression-prediction#query_details_2"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;model_train.sqlx&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'operations'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;hasOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"training"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;REPLACE&lt;/span&gt; &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt; &lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;model_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'LOGISTIC_REG'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;auto_class_weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;input_label_cols&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'income_bracket'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'census_input'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
  &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'training'&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;You’ll notice a few non-SQL constructs above, these are Dataform features. Our configuration block at the top tells Dataform that this is an operation (a custom SQL statement) that generates an output relation. We can use the &lt;code&gt;self()&lt;/code&gt; function in dataform to get the fully qualified name of this relation that we should create, and we can use the &lt;code&gt;ref()&lt;/code&gt; function to select from the preprocessed dataset that we created in the previous step.&lt;/p&gt;

&lt;p&gt;I have also added a tag, which will be useful to run only this step with or without dependencies using the command: &lt;code&gt;dataform run --tags training --include-deps&lt;/code&gt; or on Dataform web. Check the Dataform documentation on &lt;a href="https://docs.dataform.co/guides/tags?utm_medium=organic&amp;amp;utm_source=dataform_blog&amp;amp;utm_campaign=bigquery-ml-pipeline"&gt;tags&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model evaluation
&lt;/h2&gt;

&lt;p&gt;Continuing with the standard ML workflow, I will evaluate my model with the evaluation dataset.&lt;br&gt;
As I am creating a ML model pipeline, it is important not to run the prediction step if my model’s accuracy (for this model, I will consider accuracy less than .8 is bad) is not good enough. To solve this in the pipeline, I will use Dataform’s &lt;a href="https://docs.dataform.co/guides/assertions?utm_medium=organic&amp;amp;utm_source=dataform_blog&amp;amp;utm_campaign=bigquery-ml-pipeline"&gt;assertion&lt;/a&gt; feature.&lt;/p&gt;

&lt;p&gt;Bigquery’s &lt;code&gt;ML.EVALUATE&lt;/code&gt; function returns a row with column &lt;code&gt;accuracy&lt;/code&gt; on which I can run the assertion.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;model_evaluate.sqlx&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'assertion'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"evaluate"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EVALUATE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'model_train'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt;
        &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'census_input'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt;
        &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'evaluation'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To train the model and run the accuracy test, use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;dataform run &lt;span class="nt"&gt;--actions&lt;/span&gt; model_evaluate &lt;span class="nt"&gt;--include-deps&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Run prediction
&lt;/h2&gt;

&lt;p&gt;As I have created the model and evaluated, now I want to run prediction on a real datasets. Let's do it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;predict.sqlx&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="err"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'model_evaluate'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;"predict"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;ML&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PREDICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;MODEL&lt;/span&gt; &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'model_train'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;
      &lt;span class="k"&gt;FROM&lt;/span&gt;
        &lt;span class="err"&gt;${&lt;/span&gt;&lt;span class="k"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'census_input'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;WHERE&lt;/span&gt;
        &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'prediction'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This prediction step depends on the evaluation step, so I have added it as a dependency.&lt;/p&gt;

&lt;p&gt;So let's take a look at the dependency graph:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G5gNZr1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://assets.dataform.co/blog/bigquery-ml-pipeline/dependency-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G5gNZr1j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://assets.dataform.co/blog/bigquery-ml-pipeline/dependency-graph.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now I can run the whole ML pipeline as a schedule within Dataform, written entirely in SQL and executed on BigQuery’s CPUs.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>sql</category>
    </item>
    <item>
      <title>How we store protobufs in MongoDB</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 09 Jan 2020 12:09:42 +0000</pubDate>
      <link>https://forem.com/dataform/mongodb-protobuf-codec-2168</link>
      <guid>https://forem.com/dataform/mongodb-protobuf-codec-2168</guid>
      <description>&lt;p&gt;At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.&lt;/p&gt;

&lt;p&gt;We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.&lt;/p&gt;

&lt;p&gt;Here's a simple example of a protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;birth_timestamp_millis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the &lt;code&gt;.proto&lt;/code&gt; file, and it's (usually, with some caveats) safe to change a field's type, etc, whereas renaming a 'field' (column) in a traditional SQL-like table is usually a lot of work, involving some amount of DB migration.&lt;/p&gt;

&lt;p&gt;However, safely making changes to a protobuf definition requires the data at rest to actually be stored in protobuf format, which would make it impossible to query, since the database engine doesn't speak protobuf.&lt;/p&gt;

&lt;p&gt;One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of &lt;code&gt;Person&lt;/code&gt; (as defined above) in JSON format, but then renamed &lt;code&gt;birth_timestamp_millis&lt;/code&gt; to &lt;code&gt;birthday_timestamp&lt;/code&gt; - the previously stored &lt;code&gt;Person&lt;/code&gt; would now have an undecodeable &lt;code&gt;birthTimestampMillis&lt;/code&gt; field, and would be missing a value for &lt;code&gt;birthdayTimestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.&lt;/p&gt;

&lt;p&gt;Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded/decoded as it is stored/retrieved from the database, using custom, user-defined codecs.&lt;/p&gt;

&lt;p&gt;We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.&lt;/p&gt;

&lt;p&gt;For example, given the following protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Example&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="na"&gt;enum_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;oneof&lt;/span&gt; &lt;span class="n"&gt;example_oneof&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;int64_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;NestedMessage&lt;/span&gt; &lt;span class="na"&gt;nested_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;107&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;VAL_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="na"&gt;VAL_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;573&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;NestedMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;nested_string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;nested_int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the following instance of &lt;code&gt;Example&lt;/code&gt;, in canonical JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stringField"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"enumField"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VAL_0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Note&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;that&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;this&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;represented&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;because&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;the&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;JavaScript&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;type&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;smaller&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;than&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;int&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"int64Field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nestedMessage"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nestedStringField"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"nestedInt32Field"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Our MongoDB codec will encode the instance of &lt;code&gt;Example&lt;/code&gt; as the following Mongo BSON document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"33"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123456789&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"107"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With this encoding, if we change the name of &lt;code&gt;nested_string_field&lt;/code&gt; to &lt;code&gt;something_else&lt;/code&gt;, or the enum value &lt;code&gt;VAL_0&lt;/code&gt; to &lt;code&gt;BETTER_ENUM_VALUE_NAME&lt;/code&gt;, we'll still be able to decode the document, without any loss of data.&lt;/p&gt;

&lt;p&gt;This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.&lt;/p&gt;

&lt;p&gt;The code is open-sourced &lt;a href="https://github.com/dataform-co/dataform/tree/master/protomongo"&gt;here&lt;/a&gt; (&lt;a href="https://godoc.org/github.com/dataform-co/dataform/protomongo"&gt;godoc&lt;/a&gt;). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>protobuf</category>
      <category>database</category>
    </item>
    <item>
      <title>How we use MobX at Dataform to solve our frontend application state problems</title>
      <dc:creator>Ahmad Faiyaz</dc:creator>
      <pubDate>Wed, 16 Oct 2019 11:40:51 +0000</pubDate>
      <link>https://forem.com/dataform/how-we-use-mobx-at-dataform-to-solve-our-frontend-application-state-problems-4d88</link>
      <guid>https://forem.com/dataform/how-we-use-mobx-at-dataform-to-solve-our-frontend-application-state-problems-4d88</guid>
      <description>&lt;p&gt;Having a state management library on a React based single page application is quite useful, especially if the application is complex in nature, for example, if we want to share states between two react components which are neither siblings nor child. But even if you use a state management library, it might not solve the application state in a clean and expected way.&lt;/p&gt;

&lt;h3&gt;
  
  
  What library did we use before?
&lt;/h3&gt;

&lt;p&gt;We initially used our in-house developed state management tool, which I will refer to as Goggle Store in this whole article. Goggle Store follows object oriented style, where you need to create state entity and state entities have a flat structure. And the store implementation was type-safe.&lt;/p&gt;

&lt;h3&gt;
  
  
  What problems did we face with Goggle Store?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;As an early stage startup, we couldn’t invest a lot of development time on this in house Goggle store. So we have little to no documentation for the store.&lt;/li&gt;
&lt;li&gt;Goggle store uses React’s “&lt;a href="https://reactjs.org/docs/react-component.html#forceupdate"&gt;forceUpdate&lt;/a&gt;” method to re-render react components on state change, which made our React app rendering kinda inefficient. Also forceUpdate usage is discouraged on React’s &lt;a href="https://reactjs.org/docs/react-component.html#forceupdate"&gt;documentation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We have to do “console.log” based debugging to check current state of the application with Goggle store.&lt;/li&gt;
&lt;li&gt;Not having control over mutating the state on Goggle store, means one can set values in any component by directly calling entity.set(x) which makes hard to keep track of where state is mutated. We had to search the whole code base to find out where set method is being called.&lt;/li&gt;
&lt;li&gt;Goggle Store doesn’t have caching mechanism for some state combination. For example, on our &lt;a href="https://app.dataform.co"&gt;Dataform web application&lt;/a&gt;, you can switch git branches, so if you open some directories on Branch A, then switch to Branch B open some other directories, then move to Branch A again, we couldn’t show the directories you opened last time due to lack of scoped state caching mechanism.&lt;/li&gt;
&lt;li&gt;Goggle Store code structure doesn’t enforce state dependency, so one can add a state entity to the store and make it independent even though it is supposed to be dependent on other state(s). We found many bugs related to this issue, as the developer forgot to reset value on some state changes, which led to inconsistent information on the UI.
After having all those issues above, we finally decided to move from Goggle store to another store library, which should solve the above problems and make our life easier.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  We chose MobX
&lt;/h3&gt;

&lt;p&gt;We did some R&amp;amp;D with two state management libraries named &lt;a href="https://redux.js.org/"&gt;Redux&lt;/a&gt; and &lt;a href="https://mobx.js.org/README.html"&gt;MobX&lt;/a&gt;. With Redux, we couldn’t have an object oriented structure: it seems best practice for Redux is to have flat store structure. Another thing about Redux is that it requires lots of boilerplate code to work with React, which seems annoying. And last but not least, we couldn’t find a solution to our caching and state dependency problem with Redux.&lt;br&gt;
As a result, we decided on using MobX for our application because of its &lt;a href="https://mobx.js.org/intro/concepts.html#2-derivations"&gt;derivation&lt;/a&gt; feature, such as computed values and reactions. Also with MobX we can follow object oriented paradigm and it requires less boilerplate code to work with React. We turned on enforceActions flag so that one can mutate state only inside an action. We have turned mobx-logger on so that one can see how state changes. But MobX didn’t solve our caching and state dependency enforcement issue. To solve those issues we have introduced a state dependency tree.&lt;/p&gt;
&lt;h3&gt;
  
  
  State Dependency Tree
&lt;/h3&gt;

&lt;p&gt;We grouped our state entities in a store, and created a dependency tree. Our entity structure with Goggle Store (simplified) is like this:&lt;/p&gt;


&lt;center&gt;
&lt;br&gt;
  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ksXuTVxU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dataform.co/_next/static/images/goggle_store-e90f0dcbe4c38daca1bf7a09b58eb5fe.png"&gt;&lt;br&gt;
&lt;/center&gt;

&lt;p&gt;We converted the state like a tree on MobX below:&lt;/p&gt;


&lt;center&gt;
&lt;br&gt;
  &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ObFMi2DM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dataform.co/_next/static/images/mobx_store-7fb81f4ec49c115de8056ee1b7069020.png"&gt;&lt;br&gt;
&lt;/center&gt;

&lt;p&gt;So the code implementation looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;observable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;runInAction&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mobx&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Loadable&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;./loadable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;Loadable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// our state entity class&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Loadable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;private&lt;/span&gt; &lt;span class="na"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;isLoading&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;setLoading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="na"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;loading&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;IProject&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;projectName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;RootStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;currentProjectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;projectsList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Loadable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IProject&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;projectStoreMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ProjectStore&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;projectStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectsList&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Project not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;ProjectStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;computed&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;get&lt;/span&gt; &lt;span class="nx"&gt;currentProjectStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentProjectId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;action&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;setCurrentProjectId&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentProjectId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bound&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;fetchProjectsList&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectsList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setLoading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ApiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;projectList&lt;/span&gt;&lt;span class="p"&gt;({});&lt;/span&gt;
    &lt;span class="nx"&gt;runInAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fetchProjectsListSuccess&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectsList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projects&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;IBranch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nx"&gt;ProjectStore&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;currentProject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IProject&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;branchList&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Loadable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;create&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;IBranch&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;observable&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;currentBranchName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;readonly&lt;/span&gt; &lt;span class="nx"&gt;branchStoreMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;BranchStore&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IProject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentProject&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;project&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;branchStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;branch&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchList&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branch&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Branch not found&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;BranchStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branch&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchStoreMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;computed&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;get&lt;/span&gt; &lt;span class="nx"&gt;currentBranchStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchStore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentBranchName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;action&lt;/span&gt; &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="nx"&gt;setCurrentBranchName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentBranchName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;branchName&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;@&lt;/span&gt;&lt;span class="nd"&gt;action&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bound&lt;/span&gt;
  &lt;span class="kr"&gt;public&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nx"&gt;fetchBranchList&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;setLoading&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;ApiService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;get&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;branchList&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currentProject&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;projectId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="nx"&gt;runInAction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fetchBranchListSuccess&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branchList&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="kd"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;branches&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rootStore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;RootStore&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We have utilized the computed value feature to add state dependency. So the developer doesn’t need to know which state entity they need to change. And as we have grouped entities together in a domain based store object, we can now cache the states for which we are using ES6 map, please take a look at &lt;a href="https://gist.github.com/faiyaz26/a3021a31c957e35485fd694ded557d27#file-mobx_store-ts-L46-L57"&gt;line 46-57&lt;/a&gt; for further understanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In software development world, no library is good at everything, which is also true for MobX. For example: its documentation, dev-tools are not rich like Redux but so far it is solving our problems. Many people don’t know about MobX as Redux is quite popular in react world. But I think, MobX can also be a great state management solution for many react developers.&lt;/p&gt;

</description>
      <category>mobx</category>
      <category>react</category>
      <category>mobxreact</category>
    </item>
    <item>
      <title>How to write unit tests for your SQL queries</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 15 Jul 2019 14:31:00 +0000</pubDate>
      <link>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</link>
      <guid>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</guid>
      <description>&lt;p&gt;I’ve previously &lt;a href="https://dev.to/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg"&gt;written&lt;/a&gt; about how I think we should prefer writing processing pipelines in pure SQL. However, a big difference between SQL and more widely-used languages is that those other languages generally have a strong tradition of unit testing.&lt;/p&gt;

&lt;p&gt;Usually, when we talk about ‘tests’ in the context of SQL, we don’t actually mean unit tests. Instead, the term generally refers to data tests, which are really &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;assertions&lt;/a&gt; that the data itself conforms to some test criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit tests are not assertions.&lt;/strong&gt; Unit tests verify the logic of a SQL query by running that query on some fixed set of inputs. Assertions necessarily depend upon the real datasets which they validate, while unit tests should never depend on any real data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The benefits of unit tests
&lt;/h3&gt;

&lt;p&gt;Unit testing is a standard practice in software engineering. Unit tests help ensure that difficult pieces of logic or complex interactions between components work as expected - and continue to work as expected as the surrounding code changes.&lt;/p&gt;

&lt;p&gt;Unit tests should not have any external dependencies; tests run the code in question on some faked inputs, ensuring that changes outside of that unit of code do not affect the test. This means that the success or failure of the test comes down purely to the code’s logic. Thus, if the test fails, you know exactly where to start debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL unit testing widespread?
&lt;/h3&gt;

&lt;p&gt;In standard languages, a unit test typically consists of injecting fake input into the code under test and checking that the output matches some expected result. However, SQL scripts don’t label their input datasets - typically, they’re just defined statically inline in a FROM clause. This makes it difficult to inject fake input test data into your SQL code.&lt;/p&gt;

&lt;p&gt;The result of this is that most SQL code goes untested.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Various SQL frameworks let you define layers of indirection between your SQL and its input(s); i.e. you declare and label the input datasets upon which a query depends. Unit testing frameworks can use this indirection to replace real input data with faked versions.&lt;/p&gt;

&lt;p&gt;We can then run the code under test, using some faked input, and compare the output result rows against a set of expected outputs. If the actual output of the code under test matches the expected output, the test passes; if not, it fails.&lt;/p&gt;

&lt;p&gt;This technique is simple and gives you real power to verify that a SQL script does what you think it does. You can pass faked inputs to your SQL that your real data may not currently contain, giving you confidence that it can robustly handle a wide range of data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test case support in Dataform
&lt;/h3&gt;

&lt;p&gt;When using Dataform’s &lt;a href="https://docs.dataform.co/guides/datasets/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;enriched SQL&lt;/a&gt;, you reference input datasets using either the &lt;code&gt;ref()&lt;/code&gt; or &lt;code&gt;resolve()&lt;/code&gt; function. This functionality gives us an easy way to inject fake input datasets into a script, thus enabling users to write unit tests.&lt;/p&gt;

&lt;p&gt;We have defined a new type of Dataform script: &lt;code&gt;test&lt;/code&gt;. In a &lt;code&gt;test&lt;/code&gt; query, you specify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query you’re testing&lt;/li&gt;
&lt;li&gt;The faked inputs, each labeled with their referenced name&lt;/li&gt;
&lt;li&gt;The expected output of running the query on the faked inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes, when you run the test, we dynamically replace the inputs to the &lt;code&gt;dataset&lt;/code&gt; query with your faked input data. We then run the &lt;code&gt;dataset&lt;/code&gt; query, along with the query that defines your expected output, and check that the resulting rows match. Simple!&lt;/p&gt;

&lt;h3&gt;
  
  
  An example
&lt;/h3&gt;

&lt;p&gt;Here’s a worked example written using Dataform’s JavaScript API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First, define a dataset - we’ll follow this up with the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT
      FLOOR(age / 5) * 5 AS age_group,
      COUNT(1) AS user_count
      FROM &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;
      GROUP BY age_group
    `&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Now, define the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test_age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Specify the name of the dataset under test.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the fake input “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age UNION ALL
      SELECT 21 AS age UNION ALL
      SELECT 24 AS age UNION ALL
      SELECT 34 AS age
    `&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the expected output of running “age_groups” on the “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age_group, 1 AS user_count UNION ALL
      SELECT 20 AS age_group, 2 AS user_count UNION ALL
      SELECT 30 AS age_group, 1 AS user_count
    `&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, if you prefer to use Dataform’s enriched SQL, the unit test would look as follows (note that publishing the dataset is elided for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For more details, see our &lt;a href="https://docs.dataform.co/guides/tests/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’ve released this functionality as part of the v1.0.0 release of our &lt;a href="https://www.npmjs.com/package/@dataform/cli"&gt;@dataform NPM packages&lt;/a&gt;. Dataform Web will soon support test cases, too. Let us know what you think!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>testing</category>
    </item>
    <item>
      <title>Consider SQL when writing your next processing pipeline</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 27 Jun 2019 11:03:34 +0000</pubDate>
      <link>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</link>
      <guid>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</guid>
      <description>&lt;p&gt;Once a team or organization has some data to manage - customer data, events to be fed into some machine learning system, or whatever else - they almost immediately find themselves writing, running, and maintaining processing pipelines.&lt;/p&gt;

&lt;p&gt;Outputs of these pipelines are many and varied, including customer / market analysis, data cleaning, etc, but such pipelines seem to pop up more often and more quickly than one expects.&lt;/p&gt;

&lt;p&gt;Today, most non-trivial data processing is done using some pipelining technology, for example Dataflow / Apache Beam, with user code typically written in languages such as Java, Python, or perhaps Go.&lt;/p&gt;

&lt;h4&gt;
  
  
  My experience
&lt;/h4&gt;

&lt;p&gt;I worked as a software engineer at Google for several years, during which I led multiple teams and projects which required writing, managing, and maintaining various types of processing pipelines. During that time I became convinced that - for the majority of use-cases - &lt;strong&gt;expressing these pipelines in SQL is simpler, cheaper, and easier than the alternatives, with few disadvantages&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For what it’s worth, I’ll note that I’m actually a big fan of these pipelining technologies. While at Google, I was a cheerleader for the internal version of Cloud Dataflow (Flume) for both batch and streaming use-cases. However, I think that the reasons for using them - broadly - no longer apply to today’s world of highly scalable cloud warehouses and query engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL the de facto processing pipeline language today?
&lt;/h3&gt;

&lt;p&gt;SQL wasn’t really a scalable option for processing data before we had widely available cloud data warehouses such as BigQuery and Redshift. Without these highly-scalable query engines, the only reasonable choice was to perform any significant data processing outside of the data warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scalable processing
&lt;/h4&gt;

&lt;p&gt;The first truly scalable data processing solution was probably something like Google MapReduce. It then quickly became obvious that chaining MapReduce-like processing steps into a full pipeline using some higher-level API can produce very powerful pipelining systems, and frameworks such as Hadoop, Apache Spark, and Google Cloud Dataflow were born. These systems enabled users to process terabytes of data (or more, with some tuning) quickly and scalably, which was often simply impossible using SQL query engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, cloud data warehouse systems have evolved dramatically over the past 5 years.&lt;/strong&gt; SQL queries running on BigQuery’s query engine will generally run much more quickly than the alternative, which requires reading all of the relevant data out of the warehouse, processing it, then writing the result back to some other table. It’s also much easier to run in production; there’s no need to manage temporary state, queries are optimized automatically, etc. All of these concerns are pushed to the query engine, and the user doesn’t have to care about them.&lt;/p&gt;

&lt;p&gt;The query engine is the best place to optimize the pipeline since it has access to the most metadata about what data is being processed; as a result it’s much easier to manage the pipeline operationally in production. This is much better than the alternative - I can’t tell you how many hours (or days, or even weeks) my teams and I have spent debugging scalability issues and poor optimization choices in Java pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Existing bias towards imperative languages
&lt;/h4&gt;

&lt;p&gt;I think there is an understandable cultural bias in software engineering teams towards using standard imperative programming languages to implement processing pipelines, and until very recently it wasn’t really possible to mix and match SQL and non-SQL (see below for more on this).&lt;/p&gt;

&lt;p&gt;Engineers are much more familiar with configuring jobs written in these languages in production, but happily, modern SaaS options obviate this problem for SQL pipelines by taking responsibility for scheduling and running the user’s code, so that &lt;strong&gt;the user needs to do very little productionization&lt;/strong&gt; at all.&lt;/p&gt;

&lt;p&gt;Additionally, SQL scripts have sometimes been treated as a second class citizen versus other languages. Some tools used for SQL script development haven’t supported standard software engineering techniques such as version control or code review. However, this too has changed, with modern toolchain options supporting these practices as first-class features.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL has distinct advantages over the alternatives
&lt;/h3&gt;

&lt;p&gt;SQL is a language built and designed to support exactly what you want to do when you’re processing data: joining, filtering, aggregating, and transforming data. Thus, it’s usually &lt;strong&gt;much simpler and easier to express your pipeline in SQL&lt;/strong&gt; than it is in some other pipelining technology. (If you’d like to see an example of just how powerful SQL can be, take a look at &lt;a href="https://towardsdatascience.com/deep-neural-network-implemented-in-pure-sql-over-bigquery-f3ed245814d3"&gt;this article&lt;/a&gt; in which a deep neural network is implemented with it!)&lt;/p&gt;

&lt;h4&gt;
  
  
  A common language
&lt;/h4&gt;

&lt;p&gt;The biggest advantage of implementing your pipeline in SQL is that it’s likely to be the same language that you or your data team use to actually perform final analysis on the output of the pipeline.&lt;/p&gt;

&lt;p&gt;This means that the data team don’t need support from engineers to make changes to the pipeline. Instead, they’re empowered to make the changes themselves.&lt;/p&gt;

&lt;h4&gt;
  
  
  Debugging
&lt;/h4&gt;

&lt;p&gt;When something goes wrong, SQL pipelines are usually much easier to introspect than the alternative. If you want to check exactly what data is being output by any given processing stage of a SQL pipeline, you can simply pull out those results into some relevant SELECT query.&lt;/p&gt;

&lt;p&gt;Doing the same using a pipelining system can be a real pain, involving making significant code changes (just to add enough instrumentation to enable debugging) and re-deploying the pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster development
&lt;/h4&gt;

&lt;p&gt;During development of a SQL-based pipeline, the iteration cycle is significantly faster. This is because the feedback loop is much quicker - make an edit to your query(s), re-run the pipeline, and immediately get new results.&lt;/p&gt;

&lt;p&gt;If the pipeline processes so much data that it takes more than a minute or two to execute, it’s trivial to process a fraction of the data (to get results more quickly) by adding a LIMIT to your query (or subqueries), or by only selecting rows belonging to a subset of the input dataset.&lt;/p&gt;

&lt;p&gt;When writing Java pipelines from scratch, I would often find that testing out a single bugfix would take hours - not so with SQL. I actually often found myself writing a SQL script to validate the output of some productionized Java pipeline, only to belatedly realize that I had essentially re-implemented the Java pipeline in SQL - in much fewer lines of code, with much more readability, and significantly less complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL’s disadvantages
&lt;/h3&gt;

&lt;p&gt;In my experience there are two distinct domains where other languages have an edge on pure SQL: (1) unit testing and (2) the readability of particularly complex data transformations.&lt;/p&gt;

&lt;p&gt;Some SQL queries can be fairly complex, especially if they use powerful features such as BigQuery’s &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts"&gt;analytic functions&lt;/a&gt;. I’d like to be able to write unit tests for these SQL queries, statically defining sets of input rows and expected output rows, asserting that the query does exactly what it’s supposed to. We’re working on implementing this feature within Dataform, and expect to have basic unit test support out soon. However, a useful tool which can help out here is &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=consider_sql"&gt;data assertions&lt;/a&gt;, using which you can express requirements of your input data, for example to check for correctness, before continuing to run your processing pipeline.&lt;/p&gt;

&lt;p&gt;Occasionally, you will want to run some particularly complicated data transformation logic. (For one interesting - if slightly insane - example, check out &lt;a href="https://medium.com/@urish/yes-i-compiled-1-000-000-typescript-files-in-under-40-seconds-this-is-how-6429a665999c"&gt;this&lt;/a&gt; Medium post.) Sometimes, when expressed in SQL, this can become difficult to read and/or maintain due to its complexity. However, there exists a nice solution to this problem: User-Defined Functions (UDFs). UDFs allow you to break out of SQL and use JavaScript or Python (depending on the warehouse) when you need the power of a full imperative programming language to implement your own function.&lt;/p&gt;

&lt;h3&gt;
  
  
  The future
&lt;/h3&gt;

&lt;p&gt;We’re seeing a general move towards expressing pipelines in plain SQL. Indeed, Apache Beam recently launched &lt;a href="https://beam.apache.org/documentation/dsls/sql/overview/"&gt;support&lt;/a&gt; for Beam SQL, allowing Java users to express transformations using inline SQL. I expect that as time goes on, we’ll see fewer and fewer processing pipelines expressed using Java/Python/Go, and much more work being done inside data warehouses using simple SQL, for all of the reasons discussed above.&lt;/p&gt;

</description>
      <category>sql</category>
    </item>
    <item>
      <title>Testing data quality with SQL assertions</title>
      <dc:creator>Lewis Hemens</dc:creator>
      <pubDate>Wed, 26 Jun 2019 10:51:09 +0000</pubDate>
      <link>https://forem.com/dataform/testing-data-quality-with-sql-assertions-248g</link>
      <guid>https://forem.com/dataform/testing-data-quality-with-sql-assertions-248g</guid>
      <description>&lt;p&gt;Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.&lt;/p&gt;

&lt;p&gt;Often, &lt;strong&gt;input data sources are missing rows, contain duplicates, or include just plain invalid data&lt;/strong&gt;. Over time, changes to business definitions or the underlying software which produces input data can cause drift in the meaning of columns - or even the overall structure of tables. Addressing these issues is critical to creating a successful data team and generating valuable, correct insights.&lt;/p&gt;

&lt;p&gt;In this article we explain the concept of a SQL data assertion, look at some common data quality problems, how to detect them, and - most importantly - &lt;strong&gt;how to fix them in a way that persists for all data consumers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a data assertion?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A data assertion is a query that looks for problems in a dataset&lt;/strong&gt;. If the query returns any rows then the assertion fails.&lt;/p&gt;

&lt;center&gt;
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.dataform.co%2Fimages%2Fblog%2Fassertions_illustration.png"&gt;
&lt;/center&gt;

&lt;p&gt;Data assertions are defined this way because it’s much easier to look for problems rather than the absence of them. It also means that assertion queries can themselves be used to quickly inspect the data causing the assertion to fail - making it easy to diagnose and fix the problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  Checking field values
&lt;/h4&gt;

&lt;p&gt;Let’s take a look at a simple example.&lt;/p&gt;

&lt;p&gt;Assume that there is a &lt;code&gt;database.customers&lt;/code&gt; table containing information about customers in the database.&lt;br&gt;
Some checks that we might want to verify on the table’s contents include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The field &lt;code&gt;email_address&lt;/code&gt; is always set&lt;/li&gt;
&lt;li&gt;The field &lt;code&gt;customer_type&lt;/code&gt; is one of &lt;code&gt;“business”&lt;/code&gt; or &lt;code&gt;“individual”&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following simple query will return any rows violating these rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;email_address&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;business&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;individual&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Checking for unique fields
&lt;/h4&gt;

&lt;p&gt;We may also want to run checks across more than one row. For example, we might want to verify that the &lt;code&gt;customer_id&lt;/code&gt; field is unique. A query like the following will return any duplicate &lt;code&gt;customer_id&lt;/code&gt; values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Combinining multiple assertions into a single query
&lt;/h4&gt;

&lt;p&gt;We can combine all of the above into a single query to quickly find any &lt;code&gt;customer_id&lt;/code&gt; value violating one of our rules using &lt;code&gt;UNION ALL&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;missing_email&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;email_address&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;

&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;invalid_customer_type&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;business&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;individual&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;

&lt;span class="k"&gt;UNION&lt;/span&gt; &lt;span class="k"&gt;ALL&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;duplicate_id&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have one query we can run to detect any problems in our table, and we can easily add another unioned &lt;code&gt;SELECT&lt;/code&gt; statement if we want to add new conditions in the future.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating clean datasets
&lt;/h3&gt;

&lt;p&gt;Now that we’ve detected the issues in our data, we need to clean them up. Ultimately choosing how to handle data quality issues depends on your business use case.&lt;/p&gt;

&lt;p&gt;In this example we will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Remove any rows that are missing the &lt;code&gt;email_address&lt;/code&gt; field&lt;/li&gt;
&lt;li&gt;Set a default customer type if it’s invalid&lt;/li&gt;
&lt;li&gt;Remove rows with duplicate &lt;code&gt;customer_id&lt;/code&gt; fields, retaining one row per &lt;code&gt;customer_id&lt;/code&gt; value (we don’t care which one)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rather than editing the dataset directly, we can create a new clean copy of the dataset&lt;/strong&gt; - this gives us freedom to change or add rules in the future and avoids deleting any data.&lt;/p&gt;

&lt;p&gt;The following SQL query defines a view of our &lt;code&gt;database.customers&lt;/code&gt; table in which invalid rows are removed, default customer types are set, and duplicate rows for the same &lt;code&gt;customer_id&lt;/code&gt; are removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ANY_VALUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;email_address&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;email_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;ANY_VALUE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;individual&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;individual&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
        &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;business&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;   &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;business&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
        &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="k"&gt;unknown&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;
    &lt;span class="k"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;customer_type&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="n"&gt;email_address&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query can be used to create either a view or a table in our cloud data warehouse, perhaps called &lt;code&gt;database_clean.customers&lt;/code&gt;, which can be consumed in dashboards or by analysts who want to query the data.&lt;/p&gt;

&lt;p&gt;Now we've fixed the problem, we can check that the above query has correctly fixed the problems by re-running the original assertion on the new dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Continuous data quality testing
&lt;/h3&gt;

&lt;p&gt;Assertions should be run as part of any data pipelines to make sure breaking changes are picked up the moment they happen.&lt;/p&gt;

&lt;p&gt;If an assertion returns any rows, future steps in a pipeline should either fail, or a notification delivered to the data owner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dataform.co/developers?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=sql_assertions" rel="noopener noreferrer"&gt;Dataform&lt;/a&gt; has built in support for data assertions, and provides a way to run them as part of a larger SQL pipeline.&lt;/p&gt;

&lt;p&gt;These can be run at any frequency, and if an assertion fails an email will be sent to notify you of the problem. Dataform also provides a way to easily create new datasets in your warehouse, making managing the process of cleaning and testing your data extremely straightforward.&lt;/p&gt;

&lt;center&gt;
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fstatic.dataform.co%2Fimages%2Fdataform_sdk_schema.svg"&gt;
&lt;/center&gt;

&lt;p&gt;For more information on how to start writing data assertions with Dataform, check out the &lt;a href="https://docs.dataform.co/guides/assertions?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=sql_assertions" rel="noopener noreferrer"&gt;assertions documentation&lt;/a&gt; guide for Dataform’s &lt;a href="https://docs.dataform.co/" rel="noopener noreferrer"&gt;open-source framework&lt;/a&gt;, or &lt;a href="https://dataform.co/signup?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=sql_assertions" rel="noopener noreferrer"&gt;create an account for free&lt;/a&gt; and start using Dataform's fully managed Web platform.&lt;/p&gt;

</description>
      <category>sql</category>
    </item>
  </channel>
</rss>
