Forem: Dataform

Cut data warehouse costs with run caching

BenBirt — Thu, 24 Sep 2020 12:20:29 +0000

As we've mentioned before, one of the core design goals of Dataform is to make project compilation hermetic. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).

Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our "run caching" feature.

Don't waste time and money re-computing the same data

Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.

Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.

Instead, we believe that the pipeline should automatically detect if it's not going to change the output data - and if so, then the affected stage(s) should be skipped, saving those resources.

We've built this feature into Dataform.

Run caching in Dataform

Try out an example project with run caching here!

You can turn run caching on in your project with a few small changes which are described here. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.

For example, consider the following SQLX file, which configures Dataform to publish a table age_count containing the transformed results of a query reading a people relation:

config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age

Dataform only needs to (re-)publish this table if any of the following conditions are true:

The output table age_count doesn't exist
The output table age_count has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)
The query has changed since the last time the age_count table was published
The input table people has changed since the last time the age_count table was published (or, if people is a view, then if any of the input(s) to people have changed)

Dataform uses these rules to decide whether or not to publish the table. If none of the conditions holds, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.
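The four conditions above can be sketched as a single predicate. This is only an illustration under stated assumptions, not Dataform's implementation: the TableState shape, its field names, and the shouldRepublish function are all hypothetical, and the sketch assumes the warehouse exposes last-modified timestamps and that the compiled query can be hashed.

```typescript
// Hypothetical snapshot of what is known about one table action.
// None of these names come from Dataform's API.
interface TableState {
  outputExists: boolean;
  // Last-modified time of the output table, as reported by the warehouse.
  outputLastModifiedMs: number;
  // Last-modified time recorded immediately after the previous publish.
  outputLastPublishedMs: number;
  // Hash of the compiled query now, and at the previous publish.
  queryHash: string;
  lastPublishedQueryHash: string;
  // Last-modified times of the (transitive) inputs, now and at the previous publish.
  inputLastModifiedMs: Record<string, number>;
  inputLastModifiedAtPublishMs: Record<string, number>;
}

function shouldRepublish(s: TableState): boolean {
  // 1. The output table doesn't exist.
  if (!s.outputExists) return true;
  // 2. The output table was modified by something other than the last publish.
  if (s.outputLastModifiedMs !== s.outputLastPublishedMs) return true;
  // 3. The query has changed.
  if (s.queryHash !== s.lastPublishedQueryHash) return true;
  // 4. Any input has changed.
  return Object.keys(s.inputLastModifiedMs).some(
    input => s.inputLastModifiedAtPublishMs[input] !== s.inputLastModifiedMs[input]
  );
}

const ageCountUnchanged: TableState = {
  outputExists: true,
  outputLastModifiedMs: 1000,
  outputLastPublishedMs: 1000,
  queryHash: "abc",
  lastPublishedQueryHash: "abc",
  inputLastModifiedMs: { people: 900 },
  inputLastModifiedAtPublishMs: { people: 900 },
};

// Nothing changed, so the action can be skipped.
const skipResult = shouldRepublish(ageCountUnchanged); // false

// The people table was written to since the last publish, so age_count must be rebuilt.
const rebuildResult = shouldRepublish({
  ...ageCountUnchanged,
  inputLastModifiedMs: { people: 2000 },
}); // true
```

Only when every check comes back false is the action skipped, which matches the rule stated above.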

Building in intelligence so you don't have to

At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.

This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.

If you'd like to learn more, the Dataform framework documentation is here. Join us on Slack and let us know what you think!

CI/CD for ETL/ELT pipelines

BenBirt — Mon, 08 Jun 2020 12:14:52 +0000

One of Dataform’s key motivations has been to bring software engineering best practices to teams building ETL/ELT pipelines. To further that goal, we recently launched support for you to run Continuous Integration (CI) checks against your Dataform projects.

What is CI/CD?

CI/CD is a set of processes which aim to help teams ship software quickly and reliably.

Continuous integration (CI) checks automatically verify that all changes to your code work as expected, and typically run before the change is merged into your Git master branch. This ensures that the version of the code on the master branch always works correctly.

Continuous deployment (CD) tools automatically (and frequently) deploy the latest version of your code to production. This is intended to minimize the time it takes for new features or bugfixes to be available in production.

CI/CD for Dataform projects

Dataform already does most of the CD gruntwork for you. By default, all code committed to the master branch is automatically deployed. For more advanced use cases, you can configure exactly what you want to be deployed and when using environments.

CI checks, however, are usually configured as part of your Git repository (usually hosted on GitHub, though Dataform supports other Git hosting providers).

How to configure CI checks

Dataform distributes a Docker image which can be used to run the equivalent of Dataform CLI commands. For most CI tools, this Docker image is what you'll use to run your automated checks.

If you host your Dataform Git repository on GitHub, you can use GitHub Actions to run CI workflows. This post assumes you’re using GitHub Actions, but other CI tools are configured in a similar way.

Here’s a simple example of a GitHub Actions workflow for a Dataform project. Once you put this in a .github/workflows/<some filename>.yaml file, GitHub will run the workflow on each pull request and commit to your master branch.

name: CI

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  compile:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code into workspace directory
        uses: actions/checkout@v2
      - name: Install project dependencies
        uses: docker://dataformco/dataform:1.6.11
        with:
          args: install
      - name: Run dataform compile
        uses: docker://dataformco/dataform:1.6.11
        with:
          args: compile

This workflow runs dataform compile - this means that if the project fails to compile, the workflow will fail, and this will be reflected in the GitHub UI.

Note that it’s possible to run any dataform CLI command in a CI workflow. However, some commands do need credentials in order to run queries against your data warehouse. In these circumstances, you should encrypt those credentials and commit the encrypted file to your Git repository. Then, in your CI workflow, you decrypt the credentials so that the Dataform CLI can use them.

For further details on configuring CI/CD for your Dataform projects, please see our docs. As always, if you have any questions, or would like to get in touch with us, please send us a message on Slack!

Building an end to end Machine Learning Pipeline in BigQuery

Ahmad Faiyaz — Fri, 13 Mar 2020 10:41:16 +0000

Google BigQuery is one of the more advanced data warehouses on the market, and has out-of-the-box support for building and training ML models using SQL-like statements, without requiring any other code. This is extremely powerful; however, managing end-to-end ML pipelines in this way can be fragile, and requires manual steps to update training and prediction.

In this article we walk through building a simple end-to-end BigQuery ML pipeline, using the open-source framework Dataform to manage data preparation, training and prediction.

Google BigQuery provides some Machine Learning algorithms such as Linear regression, Binary logistic regression etc. To find out more about the models that BigQuery supports, check out the documentation.

A typical workflow for building a machine learning model looks like:

Data exploration
Data pre-processing (data transformation)
Model training
Model evaluation on test dataset
Prediction/Inference on real dataset

In this article I am going to follow the tutorial from the Google Cloud documentation to create a machine learning model with Google BigQuery. Please read the official documentation to understand the technical details.

For managing our end to end pipeline, we are going to use Dataform to help us version control our queries and manage pipeline execution order. Dataform makes it easy for us to version control our BigQuery code and execute complex pipelines with just a few commands.

All of the code for this example is available in the demo repository here for you to follow along with. To run this example in your own BigQuery project, perform the following steps.

Clone the repo:

git clone https://github.com/dataform-co/bigquery-ml-pipeline.git

Install the Dataform CLI using npm or yarn:

npm i -g @dataform/cli

Set up the Dataform project:

cd bigquery-ml-pipeline && dataform install

For instructions on authenticating BigQuery so that you can run the queries, you can follow the Dataform documentation here.

Data exploration

Data exploration is usually done in Jupyter notebooks or a dashboard solution such as Looker or Google Data Studio. In this step one needs to find out which datasets are required, and which columns should be used as features for model training. Following the Google Cloud tutorial, I am going to use the public dataset named Census Adult Income (bigquery-public-data.ml_datasets.census_adult_income), whose columns include the ones selected in the pre-processing query below.

Data pre-processing

I need to split the dataset into three sections: training, evaluation and prediction. To do this using Dataform, I will create a new sqlx file with the code block below. This query extracts data on census respondents, including education_num, which represents the respondent's level of education, and workclass, which represents the type of work the respondent performs. This query excludes several categories that duplicate data: for example, the columns education and education_num in the census_adult_income table express the same data in different formats, so this query excludes the education column. The dataframe column uses the excluded functional_weight column to label 80% of the data source for training, and reserves the remaining data for evaluation and prediction. The query creates a table containing these columns, so that I can use them to perform training and prediction later.

census_input.sqlx

config {
  type: "table"
}

SELECT
  age,
  workclass,
  native_country,
  marital_status,
  education_num,
  occupation,
  race,
  hours_per_week,
  income_bracket,
  CASE
    WHEN MOD(functional_weight, 10) < 8 THEN 'training'
    WHEN MOD(functional_weight, 10) = 8 THEN 'evaluation'
    WHEN MOD(functional_weight, 10) = 9 THEN 'prediction'
  END AS dataframe
FROM
  `bigquery-public-data.ml_datasets.census_adult_income`
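To make the split arithmetic concrete, here is a small sketch of the same CASE logic outside SQL. The function name labelDataframe is hypothetical, and the sketch assumes functional_weight is a non-negative integer. It also shows why the split is 80/10/10 only if the last digits of functional_weight are spread roughly evenly.

```typescript
type Dataframe = "training" | "evaluation" | "prediction";

// Mirrors the CASE expression in census_input.sqlx (non-negative integer weights assumed).
function labelDataframe(functionalWeight: number): Dataframe {
  const bucket = functionalWeight % 10;
  if (bucket < 8) return "training";
  if (bucket === 8) return "evaluation";
  return "prediction";
}

// Count how many of the ten possible last digits land in each split.
const splitCounts: Record<Dataframe, number> = {
  training: 0,
  evaluation: 0,
  prediction: 0,
};
for (let digit = 0; digit < 10; digit++) {
  splitCounts[labelDataframe(digit)] += 1;
}
// splitCounts is { training: 8, evaluation: 1, prediction: 1 }
```

Eight of the ten possible last digits map to training, and one each to evaluation and prediction.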

Model training

Now it's time to create a model using the training dataset. I am creating a new file for this model creation. You can find more about the model options from here.

model_train.sqlx

config {
  type: 'operations',
  hasOutput: true,
  tags: ["training"]
}

CREATE
OR REPLACE MODEL ${self()} OPTIONS (
  model_type = 'LOGISTIC_REG',
  auto_class_weights = TRUE,
  input_label_cols = ['income_bracket']
) AS
SELECT
  *
FROM
  ${ref('census_input')}
WHERE
  dataframe = 'training'

You’ll notice a few non-SQL constructs above; these are Dataform features. Our configuration block at the top tells Dataform that this is an operation (a custom SQL statement) that generates an output relation. We can use the self() function in Dataform to get the fully qualified name of the relation that we should create, and we can use the ref() function to select from the preprocessed dataset that we created in the previous step.

I have also added a tag, which will be useful to run only this step with or without dependencies using the command: dataform run --tags training --include-deps or on Dataform web. Check the Dataform documentation on tags.

Model evaluation

Continuing with the standard ML workflow, I will evaluate my model with the evaluation dataset.
As I am creating an ML model pipeline, it is important not to run the prediction step if my model’s accuracy is not good enough (for this model, I will treat an accuracy below .8 as bad). To enforce this in the pipeline, I will use Dataform’s assertion feature.

BigQuery’s ML.EVALUATE function returns a row with an accuracy column, on which I can run the assertion.

model_evaluate.sqlx

config {
  type: 'assertion',
  tags: ["evaluate"]
}

SELECT
  *
FROM
  ML.EVALUATE (
    MODEL ${ref('model_train')},
    (
      SELECT
        *
      FROM
        ${ref('census_input')}
      WHERE
        dataframe = 'evaluation'
    )
  )
WHERE accuracy < .8

To train the model and run the accuracy test, use the following command:

dataform run --actions model_evaluate --include-deps
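The assertion above works because an assertion passes only when its query returns no rows, so the WHERE clause selects exactly the case that should fail the pipeline. A minimal sketch of that rule follows; the names EvaluationRow, failingRows and assertionPasses are hypothetical and only illustrate the semantics.

```typescript
interface EvaluationRow {
  accuracy: number;
}

// The rows the assertion query would return: evaluation rows below the threshold.
function failingRows(rows: EvaluationRow[], minAccuracy: number): EvaluationRow[] {
  return rows.filter(row => row.accuracy < minAccuracy);
}

// An assertion passes only when its query returns no rows at all.
function assertionPasses(rows: EvaluationRow[], minAccuracy: number): boolean {
  return failingRows(rows, minAccuracy).length === 0;
}

const goodModelPasses = assertionPasses([{ accuracy: 0.85 }], 0.8); // true
const badModelPasses = assertionPasses([{ accuracy: 0.72 }], 0.8); // false
```

Note that an accuracy of exactly .8 passes, because the query only selects rows with accuracy strictly below .8.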

Run prediction

Now that I have created and evaluated the model, I want to run prediction on a real dataset. Let's do it.

predict.sqlx

config {
  type: 'table',
  dependencies: ['model_evaluate'],
  tags: ["predict"]
}

SELECT
  *
FROM
  ML.PREDICT (
    MODEL ${ref('model_train')},
    (
      SELECT
        *
      FROM
        ${ref('census_input')}
      WHERE
        dataframe = 'prediction'
    )
  )

This prediction step depends on the evaluation step, so I have added it as a dependency.

So let's take a look at the dependency graph:
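The edges of the graph come from the ref() calls and the dependencies entry in the four files above. As a rough sketch, they can be written as an adjacency map and sorted so that each action runs after everything it depends on. The executionOrder function is hypothetical and is not how Dataform itself schedules work; it also does no cycle detection.

```typescript
// Edges taken from the ref() calls and the `dependencies` entry in the files above.
const pipelineDeps: Record<string, string[]> = {
  census_input: [],
  model_train: ["census_input"],
  model_evaluate: ["model_train", "census_input"],
  predict: ["model_train", "census_input", "model_evaluate"],
};

// Depth-first topological sort: every action is emitted after its dependencies.
function executionOrder(deps: Record<string, string[]>): string[] {
  const order: string[] = [];
  const visited: Record<string, boolean> = {};
  const visit = (action: string): void => {
    if (visited[action]) return;
    visited[action] = true;
    for (const dep of deps[action] || []) visit(dep);
    order.push(action);
  };
  Object.keys(deps).forEach(action => visit(action));
  return order;
}

const pipelineOrder = executionOrder(pipelineDeps);
// pipelineOrder is ["census_input", "model_train", "model_evaluate", "predict"]
```

Because predict lists model_evaluate as a dependency, it is always ordered after the accuracy check.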

Now I can run the whole ML pipeline as a schedule within Dataform, written entirely in SQL and executed on BigQuery’s CPUs.

How we store protobufs in MongoDB

BenBirt — Thu, 09 Jan 2020 12:09:42 +0000

At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.

We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.

Here's a simple example of a protobuf definition:

message Person {
  string first_name = 1;
  string last_name = 2;
  int64 birth_timestamp_millis = 3;
}

One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the .proto file, and it's (usually, with some caveats) safe to change a field's type, etc, whereas renaming a 'field' (column) in a traditional SQL-like table is usually a lot of work, involving some amount of DB migration.

However, safely making changes to a protobuf definition requires the data at rest to actually be stored in protobuf format, which would make it impossible to query, since the database engine doesn't speak protobuf.

One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of Person (as defined above) in JSON format, but then renamed birth_timestamp_millis to birthday_timestamp - the previously stored Person would now have an undecodeable birthTimestampMillis field, and would be missing a value for birthdayTimestamp.

What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.

Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded/decoded as it is stored/retrieved from the database, using custom, user-defined codecs.

We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.

For example, given the following protobuf definition:

message Example {
  string string_field = 3;
  ExampleEnum enum_field = 10;
  oneof example_oneof {
    int32 int32_field = 78;
    int64 int64_field = 33;
  }
  NestedMessage nested_message = 107;
}

enum ExampleEnum {
  VAL_0 = 1;
  VAL_1 = 573;
}

message NestedMessage {
  string nested_string_field = 2;
  int32 nested_int32_field = 1;
}

And the following instance of Example, in canonical JSON format:

{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}

Our MongoDB codec will encode the instance of Example as the following Mongo BSON document:

{
  "3": "foo",
  "10": 1,
  "33": 123456789,
  "107": {
    "2": "bar",
    "1": 12
  }
}

With this encoding, if we change the name of nested_string_field to something_else, or the enum value VAL_0 to BETTER_ENUM_VALUE_NAME, we'll still be able to decode the document, without any loss of data.
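The idea can be sketched in a few lines. The real codec is written in Go and handles nested messages, enums and oneofs; the version below is a simplified illustration that covers flat messages only, and the names FieldTags, encodeByTag and decodeByTag are hypothetical.

```typescript
// Hypothetical, simplified field table: JSON field name -> protobuf tag number.
type FieldTags = Record<string, number>;
type PlainDoc = Record<string, unknown>;

// Re-key a flat message by tag number, so stored keys no longer depend on field names.
function encodeByTag(message: PlainDoc, tags: FieldTags): PlainDoc {
  const doc: PlainDoc = {};
  for (const fieldName of Object.keys(message)) {
    const tag = tags[fieldName];
    if (tag === undefined) throw new Error("Unknown field: " + fieldName);
    doc[String(tag)] = message[fieldName];
  }
  return doc;
}

// Re-key a stored document using whatever the fields are called today.
function decodeByTag(doc: PlainDoc, tags: FieldTags): PlainDoc {
  const message: PlainDoc = {};
  for (const fieldName of Object.keys(tags)) {
    const key = String(tags[fieldName]);
    if (key in doc) message[fieldName] = doc[key];
  }
  return message;
}

// NestedMessage as originally defined.
const nestedTagsV1: FieldTags = { nestedStringField: 2, nestedInt32Field: 1 };
const storedDoc = encodeByTag(
  { nestedStringField: "bar", nestedInt32Field: 12 },
  nestedTagsV1
);
// storedDoc["2"] is "bar" and storedDoc["1"] is 12

// After renaming nested_string_field to something_else, tag 2 is unchanged.
const nestedTagsV2: FieldTags = { somethingElse: 2, nestedInt32Field: 1 };
const decodedMessage = decodeByTag(storedDoc, nestedTagsV2);
// decodedMessage.somethingElse is "bar" and decodedMessage.nestedInt32Field is 12
```

The stored document never mentions a field name, so a rename only changes the lookup table used at decode time.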

This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.

The code is open-sourced here (godoc). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!

How we use MobX at Dataform to solve our frontend application state problems

Ahmad Faiyaz — Wed, 16 Oct 2019 11:40:51 +0000

Having a state management library in a React-based single page application is quite useful, especially if the application is complex: for example, if we want to share state between two React components which are neither siblings nor parent and child. But even if you use a state management library, it might not manage the application state in a clean and predictable way.

What library did we use before?

We initially used our in-house state management tool, which I will refer to as Goggle Store throughout this article. Goggle Store follows an object-oriented style, in which you create state entities, and the state entities have a flat structure. The store implementation was also type-safe.

What problems did we face with Goggle Store?

As an early stage startup, we couldn’t invest a lot of development time in this in-house Goggle Store, so we had little to no documentation for it.
Goggle Store uses React’s “forceUpdate” method to re-render React components on state change, which made our React app’s rendering inefficient. Use of forceUpdate is also discouraged in React’s documentation.
We had to do “console.log”-based debugging to check the current state of the application with Goggle Store.
Goggle Store gives no control over where state is mutated: any component can set a value by directly calling entity.set(x), which makes it hard to keep track of where state changes. We had to search the whole code base to find out where the set method was being called.
Goggle Store doesn’t have a caching mechanism for combinations of state. For example, in our Dataform web application you can switch Git branches; if you opened some directories on Branch A, switched to Branch B and opened some other directories, then moved back to Branch A, we couldn’t show the directories you had opened last time, due to the lack of a scoped state caching mechanism.
Goggle Store’s code structure doesn’t enforce state dependencies, so one can add a state entity to the store and make it independent even though it is supposed to depend on other state(s). We found many bugs related to this issue, where a developer forgot to reset a value on some state change, which led to inconsistent information in the UI.

After running into all of the issues above, we finally decided to move from Goggle Store to another store library, which should solve these problems and make our lives easier.

We chose MobX

We did some R&D with two state management libraries, Redux and MobX. With Redux, we couldn’t have an object-oriented structure: it seems that best practice for Redux is to have a flat store structure. Redux also requires lots of boilerplate code to work with React, which we found annoying. And last but not least, we couldn’t find a solution to our caching and state dependency problems with Redux.
As a result, we decided to use MobX for our application because of its derivation features, such as computed values and reactions. With MobX we can also follow an object-oriented paradigm, and it requires less boilerplate code to work with React. We turned on the enforceActions flag so that state can only be mutated inside an action, and we turned on mobx-logger so that one can see how state changes. But MobX by itself didn’t solve our caching and state dependency enforcement issues. To solve those, we introduced a state dependency tree.

State Dependency Tree

We grouped our state entities in a store, and created a dependency tree. Our entity structure with Goggle Store (simplified) is like this:

We converted the state into a tree with MobX, as below:

So the code implementation looks like:

import {action, computed, observable, runInAction} from 'mobx';
// ApiService (our API client) and BranchStore are defined elsewhere and omitted here.
export default class Loadable<T> {
  // our state entity class
  public static create<T>(val?: T) {
    return new Loadable<T>(val);
  }
  @observable private value: T;
  @observable private loading: boolean = false;
  constructor(val?: T) {
    this.set(val);
  }
  public isLoading() {
    return this.loading;
  }
  public val() {
    return this.value;
  }
  public set(value: T) {
    this.loading = false;
    this.value = value;
  }
  public setLoading(loading: boolean) {
    this.loading = loading;
  }
}
interface IProject {
  projectName: string;
  projectId: string;
}
export class RootStore {
  @observable public currentProjectId: string = null;
  @observable public projectsList = Loadable.create<IProject[]>();
  public readonly projectStoreMap = new Map<string, ProjectStore>();
  public projectStore(projectId: string) {
    if (!this.projectStoreMap.has(projectId)) {
      const project = this.projectsList
        .val()
        .find(project => project.projectId === projectId);
      if (!project) {
        throw new Error('Project not found');
      }
      this.projectStoreMap.set(projectId, new ProjectStore(project));
    }
    return this.projectStoreMap.get(projectId);
  }
  @computed public get currentProjectStore() {
    return this.projectStore(this.currentProjectId);
  }
  @action public setCurrentProjectId(projectId: string) {
    this.currentProjectId = projectId;
  }
  @action.bound
  public async fetchProjectsList() {
    this.projectsList.setLoading(true);
    const response = await ApiService.get().projectList({});
    runInAction('fetchProjectsListSuccess', () =>
      this.projectsList.set(response.projects)
    );
  }
}
interface IBranch {
  branchName: string;
}
class ProjectStore {
  public readonly currentProject: IProject;
  @observable public branchList = Loadable.create<IBranch[]>();
  @observable public currentBranchName: string = null;
  public readonly branchStoreMap = new Map<string, BranchStore>();
  constructor(project: IProject) {
    this.currentProject = project;
  }
  public branchStore(branchName: string) {
    if (!this.branchStoreMap.has(branchName)) {
      const branch = this.branchList
        .val()
        .find(branch => branch.branchName === branchName);
      if (!branch) {
        throw new Error('Branch not found');
      }
      this.branchStoreMap.set(branchName, new BranchStore(branch));
    }
    return this.branchStoreMap.get(branchName);
  }
  @computed public get currentBranchStore() {
    return this.branchStore(this.currentBranchName);
  }
  @action public setCurrentBranchName(branchName: string) {
    this.currentBranchName = branchName;
  }
  @action.bound
  public async fetchBranchList() {
    this.branchList.setLoading(true);
    const response = await ApiService.get().branchList({
      projectId: this.currentProject.projectId,
    });
    runInAction('fetchBranchListSuccess', () =>
      this.branchList.set(response.branches)
    );
  }
}
const rootStore = new RootStore();

We have used the computed value feature to add state dependencies, so the developer doesn’t need to know which state entities they need to change. And as we have grouped entities together in domain-based store objects, we can now cache state, for which we use ES6 Maps; take a look at the projectStore() and branchStore() methods above for further understanding.
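Stripped of MobX, the caching part is a get-or-create lookup in a Map keyed by the scope (here, the branch name). The sketch below is a simplified illustration rather than our production code; BranchState, BranchStateCache and the directory names are hypothetical.

```typescript
// Hypothetical per-branch state: which directories are expanded in the file tree.
class BranchState {
  public readonly branchName: string;
  public readonly openDirectories: string[] = [];
  constructor(branchName: string) {
    this.branchName = branchName;
  }
}

// Get-or-create cache keyed by branch name, mirroring branchStore() above.
class BranchStateCache {
  private readonly states = new Map<string, BranchState>();

  public get(branchName: string): BranchState {
    let state = this.states.get(branchName);
    if (!state) {
      state = new BranchState(branchName);
      this.states.set(branchName, state);
    }
    return state;
  }
}

const branchCache = new BranchStateCache();
branchCache.get("branch-a").openDirectories.push("definitions/");
branchCache.get("branch-b").openDirectories.push("includes/");

// Switching back to branch A returns the same object, with its directories intact.
const restoredDirectories = branchCache.get("branch-a").openDirectories;
// restoredDirectories is ["definitions/"]
```

Because each branch gets its own state object, nothing has to be reset by hand when the current branch changes.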

Conclusion

In the software development world, no library is good at everything, and that is also true of MobX. For example, its documentation and dev tools are not as rich as Redux’s, but so far it is solving our problems. Many people don’t know about MobX, as Redux is quite popular in the React world. But I think MobX can be a great state management solution for many React developers.

How to write unit tests for your SQL queries

BenBirt — Mon, 15 Jul 2019 14:31:00 +0000

I’ve previously written about how I think we should prefer writing processing pipelines in pure SQL. However, a big difference between SQL and more widely-used languages is that those other languages generally have a strong tradition of unit testing.

Usually, when we talk about ‘tests’ in the context of SQL, we don’t actually mean unit tests. Instead, the term generally refers to data tests, which are really assertions that the data itself conforms to some test criteria.

Unit tests are not assertions. Unit tests verify the logic of a SQL query by running that query on some fixed set of inputs. Assertions necessarily depend upon the real datasets which they validate, while unit tests should never depend on any real data.

The benefits of unit tests

Unit testing is a standard practice in software engineering. Unit tests help ensure that difficult pieces of logic or complex interactions between components work as expected - and continue to work as expected as the surrounding code changes.

Unit tests should not have any external dependencies; tests run the code in question on some faked inputs, ensuring that changes outside of that unit of code do not affect the test. This means that the success or failure of the test comes down purely to the code’s logic. Thus, if the test fails, you know exactly where to start debugging.

Why isn’t SQL unit testing widespread?

In standard languages, a unit test typically consists of injecting fake input into the code under test and checking that the output matches some expected result. However, SQL scripts don’t label their input datasets - typically, they’re just defined statically inline in a FROM clause. This makes it difficult to inject fake input test data into your SQL code.

The result of this is that most SQL code goes untested.

The solution

Various SQL frameworks let you define layers of indirection between your SQL and its input(s); i.e. you declare and label the input datasets upon which a query depends. Unit testing frameworks can use this indirection to replace real input data with faked versions.

We can then run the code under test, using some faked input, and compare the output result rows against a set of expected outputs. If the actual output of the code under test matches the expected output, the test passes; if not, it fails.

This technique is simple and gives you real power to verify that a SQL script does what you think it does. You can pass faked inputs to your SQL that your real data may not currently contain, giving you confidence that it can robustly handle a wide range of data.

Test case support in Dataform

When using Dataform’s enriched SQL, you reference input datasets using either the ref() or resolve() function. This functionality gives us an easy way to inject fake input datasets into a script, thus enabling users to write unit tests.

We have defined a new type of Dataform script: test. In a test query, you specify:

The query you’re testing
The faked inputs, each labeled with their referenced name
The expected output of running the query on the faked inputs

Behind the scenes, when you run the test, we dynamically replace the inputs to the dataset query with your faked input data. We then run the dataset query, along with the query that defines your expected output, and check that the resulting rows match. Simple!

An example

Here’s a worked example written using Dataform’s JavaScript API.

// First, define a dataset - we’ll follow this up with the unit test.
publish("age_groups").query(
  ctx =>
    `
      SELECT
      FLOOR(age / 5) * 5 AS age_group,
      COUNT(1) AS user_count
      FROM ${ctx.ref("ages")}
      GROUP BY age_group
    `
);

// Now, define the unit test.
test("test_age_groups")
  // Specify the name of the dataset under test.
  .dataset("age_groups")
  // Provide the fake input “ages” dataset.
  .input(
    "ages",
    `
      SELECT 15 AS age UNION ALL
      SELECT 21 AS age UNION ALL
      SELECT 24 AS age UNION ALL
      SELECT 34 AS age
    `
  )
  // Provide the expected output of running “age_groups” on the “ages” dataset.
  .expect(
    `
      SELECT 15 AS age_group, 1 AS user_count UNION ALL
      SELECT 20 AS age_group, 2 AS user_count UNION ALL
      SELECT 30 AS age_group, 1 AS user_count
    `
  );

Alternatively, if you prefer to use Dataform’s enriched SQL, the unit test would look as follows (note that publishing the dataset is elided for simplicity):

config {
  type: "test",
  dataset: "age_groups"
}

input "ages" {
  SELECT 15 AS age UNION ALL
  SELECT 21 AS age UNION ALL
  SELECT 24 AS age UNION ALL
  SELECT 34 AS age
}

SELECT 15 AS age_group, 1 AS user_count UNION ALL
SELECT 20 AS age_group, 2 AS user_count UNION ALL
SELECT 30 AS age_group, 1 AS user_count
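The "behind the scenes" substitution can be pictured as swapping the function that ref() calls. The sketch below is a simplified illustration, not Dataform's implementation: QueryFn, compileForProduction and compileForTest are hypothetical names, the analytics schema is made up, and it assumes a warehouse such as BigQuery that accepts an unaliased subquery in a FROM clause.

```typescript
// A query is written against a ref() resolver rather than hard-coded table names.
type QueryFn = (ref: (datasetName: string) => string) => string;

// In production, ref() resolves to the real relation (the schema name is hypothetical).
function compileForProduction(query: QueryFn): string {
  return query(datasetName => "`analytics." + datasetName + "`");
}

// In a test, ref() resolves to an inline subquery holding the faked rows.
function compileForTest(query: QueryFn, fakeInputs: Record<string, string>): string {
  return query(datasetName => {
    const fakeSql = fakeInputs[datasetName];
    if (fakeSql === undefined) {
      throw new Error("No fake input provided for " + datasetName);
    }
    return "(" + fakeSql + ")";
  });
}

const ageGroupsQuery: QueryFn = ref =>
  "SELECT FLOOR(age / 5) * 5 AS age_group, COUNT(1) AS user_count " +
  "FROM " + ref("ages") + " GROUP BY age_group";

const productionSql = compileForProduction(ageGroupsQuery);
// productionSql reads FROM `analytics.ages`

const testSql = compileForTest(ageGroupsQuery, {
  ages: "SELECT 15 AS age UNION ALL SELECT 21 AS age",
});
// testSql reads FROM (SELECT 15 AS age UNION ALL SELECT 21 AS age)
```

The query text itself is untouched; only its inputs differ, which is what keeps the test independent of real data.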

For more details, see our documentation.

We’ve released this functionality as part of the v1.0.0 release of our @dataform NPM packages. Dataform Web will soon support test cases, too. Let us know what you think!

Consider SQL when writing your next processing pipeline

BenBirt — Thu, 27 Jun 2019 11:03:34 +0000

Once a team or organization has some data to manage - customer data, events to be fed into some machine learning system, or whatever else - they almost immediately find themselves writing, running, and maintaining processing pipelines.

Outputs of these pipelines are many and varied, including customer / market analysis, data cleaning, etc, but such pipelines seem to pop up more often and more quickly than one expects.

Today, most non-trivial data processing is done using some pipelining technology, for example Dataflow / Apache Beam, with user code typically written in languages such as Java, Python, or perhaps Go.

My experience

I worked as a software engineer at Google for several years, during which I led multiple teams and projects which required writing, managing, and maintaining various types of processing pipelines. During that time I became convinced that - for the majority of use-cases - expressing these pipelines in SQL is simpler, cheaper, and easier than the alternatives, with few disadvantages.

For what it’s worth, I’ll note that I’m actually a big fan of these pipelining technologies. While at Google, I was a cheerleader for the internal version of Cloud Dataflow (Flume) for both batch and streaming use-cases. However, I think that the reasons for using them - broadly - no longer apply to today’s world of highly scalable cloud warehouses and query engines.

Why isn’t SQL the de facto processing pipeline language today?

SQL wasn’t really a scalable option for processing data before we had widely available cloud data warehouses such as BigQuery and Redshift. Without these highly-scalable query engines, the only reasonable choice was to perform any significant data processing outside of the data warehouse.

Scalable processing

The first truly scalable data processing solution was probably something like Google MapReduce. It then quickly became obvious that chaining MapReduce-like processing steps into a full pipeline using some higher-level API can produce very powerful pipelining systems, and frameworks such as Hadoop, Apache Spark, and Google Cloud Dataflow were born. These systems enabled users to process terabytes of data (or more, with some tuning) quickly and scalably, which was often simply impossible using SQL query engines.

However, cloud data warehouse systems have evolved dramatically over the past 5 years. SQL queries running on BigQuery’s query engine will generally run much more quickly than the alternative, which requires reading all of the relevant data out of the warehouse, processing it, then writing the result back to some other table. It’s also much easier to run in production; there’s no need to manage temporary state, queries are optimized automatically, etc. All of these concerns are pushed to the query engine, and the user doesn’t have to care about them.

The query engine is the best place to optimize the pipeline since it has access to the most metadata about what data is being processed; as a result it’s much easier to manage the pipeline operationally in production. This is much better than the alternative - I can’t tell you how many hours (or days, or even weeks) my teams and I have spent debugging scalability issues and poor optimization choices in Java pipelines.

Existing bias towards imperative languages

I think there is an understandable cultural bias in software engineering teams towards using standard imperative programming languages to implement processing pipelines, and until very recently it wasn’t really possible to mix and match SQL and non-SQL (see below for more on this).

Engineers are much more familiar with configuring jobs written in these languages in production, but happily, modern SaaS options obviate this problem for SQL pipelines by taking responsibility for scheduling and running the user’s code, so that the user needs to do very little productionization at all.

Additionally, SQL scripts have sometimes been treated as second-class citizens versus other languages. Some tools used for SQL script development haven’t supported standard software engineering techniques such as version control or code review. However, this too has changed, with modern toolchain options supporting these practices as first-class features.

SQL has distinct advantages over the alternatives

SQL is a language built and designed to support exactly what you want to do when you’re processing data: joining, filtering, aggregating, and transforming data. Thus, it’s usually much simpler and easier to express your pipeline in SQL than it is in some other pipelining technology. (If you’d like to see an example of just how powerful SQL can be, take a look at this article in which a deep neural network is implemented with it!)

A common language

The biggest advantage of implementing your pipeline in SQL is that it’s likely to be the same language that you or your data team use to actually perform final analysis on the output of the pipeline.

This means that the data team don’t need support from engineers to make changes to the pipeline. Instead, they’re empowered to make the changes themselves.

Debugging

When something goes wrong, SQL pipelines are usually much easier to introspect than the alternative. If you want to check exactly what data is being output by any given processing stage of a SQL pipeline, you can simply pull out those results into some relevant SELECT query.

Doing the same using a pipelining system can be a real pain, involving making significant code changes (just to add enough instrumentation to enable debugging) and re-deploying the pipeline.

Faster development

During development of a SQL-based pipeline, the iteration cycle is significantly faster because the feedback loop is much shorter: make an edit to your query (or queries), re-run the pipeline, and immediately get new results.

If the pipeline processes so much data that it takes more than a minute or two to execute, it’s trivial to process a fraction of the data (to get results more quickly) by adding a LIMIT to your query (or subqueries), or by only selecting rows belonging to a subset of the input dataset.

When writing Java pipelines from scratch, I would often find that testing out a single bugfix would take hours - not so with SQL. I actually often found myself writing a SQL script to validate the output of some productionized Java pipeline, only to belatedly realize that I had essentially re-implemented the Java pipeline in SQL - in far fewer lines of code, with much better readability, and significantly less complexity.

SQL’s disadvantages

In my experience there are two distinct domains where other languages have an edge on pure SQL: (1) unit testing and (2) the readability of particularly complex data transformations.

Some SQL queries can be fairly complex, especially if they use powerful features such as BigQuery’s analytic functions. I’d like to be able to write unit tests for these SQL queries, statically defining sets of input rows and expected output rows, asserting that the query does exactly what it’s supposed to. We’re working on implementing this feature within Dataform, and expect to have basic unit test support out soon. In the meantime, a useful tool here is data assertions, which let you express requirements of your input data (for example, to check for correctness) before continuing to run your processing pipeline.
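To make the idea of a SQL unit test concrete, here is a minimal sketch in Python. It is not Dataform's unit test feature: it uses the standard-library sqlite3 module as a stand-in for the warehouse, and the helper run_with_fixture, the customers fixture, and the query under test are all hypothetical names chosen for illustration.

```python
import sqlite3


def run_with_fixture(query, table, columns, rows):
    """Load statically defined input rows into an in-memory table, then run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    result = conn.execute(query).fetchall()
    conn.close()
    return result


# Hypothetical query under test: number of customers per customer type.
QUERY = """
SELECT customer_type, COUNT(*) AS customers
FROM customers
GROUP BY customer_type
ORDER BY customer_type
"""

# Statically defined input rows and the output rows we expect for them.
input_rows = [(1, "business"), (2, "individual"), (3, "individual")]
expected = [("business", 1), ("individual", 2)]

actual = run_with_fixture(QUERY, "customers", ["customer_id", "customer_type"], input_rows)
assert actual == expected, f"query returned {actual}, expected {expected}"
```

The test never touches production data, so it can run on every code change. SQLite's dialect differs from BigQuery's, so a real test would need to run against the target warehouse's engine.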

Occasionally, you will want to run some particularly complicated data transformation logic. (For one interesting - if slightly insane - example, check out this Medium post.) Sometimes, when expressed in SQL, this can become difficult to read and/or maintain due to its complexity. However, there exists a nice solution to this problem: User-Defined Functions (UDFs). UDFs allow you to break out of SQL and use JavaScript or Python (depending on the warehouse) when you need the power of a full imperative programming language to implement your own function.
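As a rough illustration of the UDF idea, the sketch below registers a Python function with SQLite and calls it from SQL. This uses SQLite's create_function mechanism as a stand-in: BigQuery, Redshift and Snowflake each have their own syntax for defining UDFs. The function email_domain is a hypothetical example.

```python
import sqlite3


def email_domain(address):
    """Return the lower-cased domain of an email address, or None if it has none."""
    if address is None or "@" not in address:
        return None
    return address.rsplit("@", 1)[1].lower()


conn = sqlite3.connect(":memory:")
# Make the Python function callable from SQL as a one-argument function.
conn.create_function("email_domain", 1, email_domain)

rows = conn.execute(
    "SELECT email_domain('Ann@Example.COM'), email_domain('not-an-email')"
).fetchall()
print(rows)  # [('example.com', None)]
```

The query itself stays short and readable, while the awkward string handling lives in an ordinary function that can be tested on its own.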

The future

We’re seeing a general move towards expressing pipelines in plain SQL. Indeed, Apache Beam recently launched support for Beam SQL, allowing Java users to express transformations using inline SQL. I expect that as time goes on, we’ll see fewer and fewer processing pipelines expressed using Java/Python/Go, and much more work being done inside data warehouses using simple SQL, for all of the reasons discussed above.

Testing data quality with SQL assertions

Lewis Hemens — Wed, 26 Jun 2019 10:51:09 +0000

Ensuring that data consumers can use their data to reliably answer questions is of paramount importance to any data analytics team. Having a mechanism to enforce high data quality across datasets is therefore a key requirement for these teams.

Often, input data sources are missing rows, contain duplicates, or include just plain invalid data. Over time, changes to business definitions or the underlying software which produces input data can cause drift in the meaning of columns - or even the overall structure of tables. Addressing these issues is critical to creating a successful data team and generating valuable, correct insights.

In this article we explain the concept of a SQL data assertion, look at some common data quality problems and how to detect them, and - most importantly - show how to fix them in a way that persists for all data consumers.

The SQL snippets in this post apply to Google BigQuery but can be ported easily enough to Redshift, Postgres or Snowflake data warehouses.

What is a data assertion?

A data assertion is a query that looks for problems in a dataset. If the query returns any rows then the assertion fails.

Data assertions are defined this way because it’s much easier to look for problems rather than the absence of them. It also means that assertion queries can themselves be used to quickly inspect the data causing the assertion to fail - making it easy to diagnose and fix the problem.
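The definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how any particular tool implements it: sqlite3 stands in for the warehouse, and the orders table and the helper names failing_rows and assertion_passes are hypothetical.

```python
import sqlite3


def failing_rows(conn, assertion_sql):
    """Run an assertion query and return the rows that violate it."""
    return conn.execute(assertion_sql).fetchall()


def assertion_passes(conn, assertion_sql):
    """An assertion passes only when its query returns no rows."""
    return len(failing_rows(conn, assertion_sql)) == 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, -3.0)])

# The assertion looks for the problem (a negative amount), not for its absence.
negative_amounts = "SELECT order_id FROM orders WHERE amount < 0"

print(assertion_passes(conn, negative_amounts))  # False
print(failing_rows(conn, negative_amounts))      # [(2,)]
```

Because the failing rows come straight back from the same query, the output doubles as a starting point for diagnosing the problem.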

Checking field values

Let’s take a look at a simple example.

Assume that there is a database.customers table containing information about customers in the database.
Some checks that we might want to verify on the table’s contents include:

The field email_address is always set
The field customer_type is one of “business” or “individual”

The following simple query will return any rows violating these rules:

SELECT customer_id
FROM database.customers
WHERE email_address IS NULL
   OR customer_type IS NULL  -- NOT IN alone does not flag NULL values
   OR customer_type NOT IN ("business", "individual")

Checking for unique fields

We may also want to run checks across more than one row. For example, we might want to verify that the customer_id field is unique. A query like the following will return any duplicate customer_id values:

SELECT
    customer_id,
    SUM(1) AS count
FROM database.customers
GROUP BY 1
HAVING count > 1

Combining multiple assertions into a single query

We can combine all of the above into a single query to quickly find any customer_id value violating one of our rules using UNION ALL:

SELECT customer_id, "missing_email" AS reason
FROM database.customers
WHERE email_address IS NULL

UNION ALL

SELECT customer_id, "invalid_customer_type" AS reason
FROM database.customers
WHERE customer_type IS NULL  -- NOT IN alone does not flag NULL values
   OR customer_type NOT IN ("business", "individual")

UNION ALL

SELECT customer_id, "duplicate_id" AS reason
FROM (
    SELECT customer_id, SUM(1) AS count
    FROM database.customers
    GROUP BY 1
)
WHERE count > 1

We now have one query we can run to detect any problems in our table, and we can easily add another unioned SELECT statement if we want to add new conditions in the future.
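One way to keep the combined query easy to extend is to generate it from a list of named checks. The sketch below is a minimal illustration, assuming SQLite (via Python's sqlite3) in place of BigQuery, a local customers table in place of database.customers, and single-quoted strings as SQLite expects. The names CHECKS and combined_assertion, and the sample rows, are hypothetical.

```python
import sqlite3

# Each check selects the customer_id values that violate one rule.
CHECKS = {
    "missing_email": "SELECT customer_id FROM customers WHERE email_address IS NULL",
    "invalid_customer_type": (
        "SELECT customer_id FROM customers "
        "WHERE customer_type IS NULL "
        "OR customer_type NOT IN ('business', 'individual')"
    ),
    "duplicate_id": (
        "SELECT customer_id FROM customers "
        "GROUP BY customer_id HAVING COUNT(*) > 1"
    ),
}


def combined_assertion(checks):
    """Tag each check's rows with its reason and join the checks with UNION ALL."""
    parts = [
        f"SELECT customer_id, '{reason}' AS reason FROM ({sql})"
        for reason, sql in checks.items()
    ]
    return "\nUNION ALL\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, email_address TEXT, customer_type TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "business"),
        (2, None, "individual"),           # missing email
        (3, "c@example.com", "partner"),   # invalid customer type
        (4, "d@example.com", "individual"),
        (4, "d@example.com", "individual"),  # duplicate id
    ],
)

# UNION ALL does not guarantee an order, so sort before inspecting.
violations = sorted(conn.execute(combined_assertion(CHECKS)).fetchall())
print(violations)
# [(2, 'missing_email'), (3, 'invalid_customer_type'), (4, 'duplicate_id')]
```

Adding a new rule is then a matter of adding one entry to CHECKS, which becomes one more unioned SELECT.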

Creating clean datasets

Now that we’ve detected the issues in our data, we need to clean them up. Ultimately, how you handle data quality issues depends on your business use case.

In this example we will:

Remove any rows that are missing the email_address field
Set a default customer type if it’s invalid
Remove rows with duplicate customer_id fields, retaining one row per customer_id value (we don’t care which one)

Rather than editing the dataset directly, we can create a new clean copy of the dataset - this gives us freedom to change or add rules in the future and avoids deleting any data.

The following SQL query defines a view of our database.customers table in which invalid rows are removed, default customer types are set, and duplicate rows for the same customer_id are removed:

SELECT
    customer_id,
    ANY_VALUE(email_address) AS email_address,
    ANY_VALUE(
        CASE customer_type
            WHEN "individual" THEN "individual"
            WHEN "business" THEN "business"
            ELSE "unknown"
        END
    ) AS customer_type
FROM database.customers
WHERE email_address IS NOT NULL
GROUP BY 1

This query can be used to create either a view or a table in our cloud data warehouse, perhaps called database_clean.customers, which can be consumed in dashboards or by analysts who want to query the data.

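Now that the cleaning query is in place, we can check that it has fixed the problems by re-running the original assertion against the new dataset. Note that “unknown” must first be added to the list of allowed customer types, since the clean dataset uses it as the default.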
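The detect, clean, re-check loop can be sketched end to end as follows. This is an illustration under several assumptions: SQLite (via Python's sqlite3) stands in for the warehouse, MIN() stands in for ANY_VALUE() (which SQLite may not provide), the view is called customers_clean rather than database_clean.customers, and the sample rows are made up. Because the clean view maps invalid types to 'unknown', the assertion here accepts that value as well.

```python
import sqlite3

# The combined assertion, parameterised by the table (or view) to check.
ASSERTION_SQL = """
SELECT customer_id, 'missing_email' AS reason
FROM {table}
WHERE email_address IS NULL
UNION ALL
SELECT customer_id, 'invalid_customer_type' AS reason
FROM {table}
WHERE customer_type IS NULL
   OR customer_type NOT IN ('business', 'individual', 'unknown')
UNION ALL
SELECT customer_id, 'duplicate_id' AS reason
FROM {table}
GROUP BY customer_id
HAVING COUNT(*) > 1
"""

# Clean copy: drop rows without an email, default invalid types, one row per id.
CLEAN_VIEW_SQL = """
CREATE VIEW customers_clean AS
SELECT
    customer_id,
    MIN(email_address) AS email_address,
    MIN(
        CASE
            WHEN customer_type IN ('business', 'individual') THEN customer_type
            ELSE 'unknown'
        END
    ) AS customer_type
FROM customers
WHERE email_address IS NOT NULL
GROUP BY customer_id
"""


def violations(conn, table):
    """Return the sorted (customer_id, reason) rows that fail the assertion."""
    return sorted(conn.execute(ASSERTION_SQL.format(table=table)).fetchall())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER, email_address TEXT, customer_type TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "business"),
        (2, None, "individual"),
        (3, "c@example.com", "partner"),
        (4, "d@example.com", "individual"),
        (4, "d@example.com", "individual"),
    ],
)
conn.execute(CLEAN_VIEW_SQL)

raw_violations = violations(conn, "customers")
clean_violations = violations(conn, "customers_clean")
print(raw_violations)    # [(2, 'missing_email'), (3, 'invalid_customer_type'), (4, 'duplicate_id')]
print(clean_violations)  # []
```

The raw table is left untouched, so the rules can be changed later and the view rebuilt without losing any data.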

Continuous data quality testing

Assertions should be run as part of any data pipeline, so that breaking changes are picked up the moment they happen.

If an assertion returns any rows, subsequent steps in the pipeline should fail, or a notification should be delivered to the data owner.
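A minimal sketch of this gating behaviour is shown below. It is not how Dataform runs pipelines. It assumes a pipeline is just an ordered list of steps, uses sqlite3 in place of the warehouse, and both notifies the owner and skips the downstream steps when an assertion fails. The names run_pipeline, notify, and the step names are hypothetical.

```python
import sqlite3


def run_pipeline(conn, steps, notify):
    """Run steps in order and stop at the first failing assertion.

    Each step is (name, kind, sql), where kind is 'statement' or 'assertion'.
    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, kind, sql in steps:
        if kind == "assertion":
            failing = conn.execute(sql).fetchall()
            if failing:
                notify(name, failing)  # tell the data owner what failed
                break                  # do not run anything downstream
        else:
            conn.execute(sql)
        completed.append(name)
    return completed


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email_address TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "a@example.com"), (2, None)]
)

steps = [
    ("emails_present", "assertion",
     "SELECT customer_id FROM customers WHERE email_address IS NULL"),
    ("customer_report", "statement",
     "CREATE TABLE customer_report AS SELECT COUNT(*) AS customers FROM customers"),
]

notifications = []  # stands in for an email or chat message
completed = run_pipeline(
    conn, steps, lambda name, rows: notifications.append((name, rows))
)
print(completed)      # []
print(notifications)  # [('emails_present', [(2,)])]
```

Here the report table is never built from the bad input, and the notification carries the offending rows so the owner can see what went wrong.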

Dataform has built-in support for data assertions, and provides a way to run them as part of a larger SQL pipeline.

These can be run at any frequency, and if an assertion fails an email will be sent to notify you of the problem. Dataform also provides a way to easily create new datasets in your warehouse, which makes the process of cleaning and testing your data straightforward.

For more information on how to start writing data assertions with Dataform, check out the assertions documentation guide for Dataform’s open-source framework, or create an account for free and start using Dataform's fully managed Web platform.