<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: BenBirt</title>
    <description>The latest articles on Forem by BenBirt (@benbirt).</description>
    <link>https://forem.com/benbirt</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F187074%2Fb6d8c644-f216-4259-b070-2c6cf673fad0.jpeg</url>
      <title>Forem: BenBirt</title>
      <link>https://forem.com/benbirt</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/benbirt"/>
    <language>en</language>
    <item>
      <title>Cut data warehouse costs with run caching</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 24 Sep 2020 12:20:29 +0000</pubDate>
      <link>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</link>
      <guid>https://forem.com/dataform/cut-data-warehouse-costs-with-run-caching-2hgb</guid>
      <description>&lt;p&gt;As we've mentioned before, one of the core design goals of Dataform is to make project compilation &lt;strong&gt;hermetic&lt;/strong&gt;. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for 'incremental' tables).&lt;/p&gt;

&lt;p&gt;Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;"run caching" feature&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Don't waste time and money re-computing the same data&lt;/h2&gt;

&lt;p&gt;Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.&lt;/p&gt;

&lt;p&gt;Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn't change between one execution and the next, then the next execution will result in no changes to the output data, but it'll still cost time and money to run.&lt;/p&gt;

&lt;p&gt;Instead, we believe that the pipeline should automatically detect when an execution will not change the output data - and skip the affected stage(s), saving those resources.&lt;/p&gt;

&lt;p&gt;We've built this feature into Dataform.&lt;/p&gt;

&lt;h2&gt;Run caching in Dataform&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://app.dataform.co/#/5682186490478592/file/definitions%2Fincidents_by_date.sqlx?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;&lt;strong&gt;Try out an example project with run caching here!&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can turn run caching on in your project with a few small changes which are described &lt;a href="https://docs.dataform.co/guides/configuration#enable-run-caching-to-cut-warehouse-costs"&gt;here&lt;/a&gt;. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.&lt;/p&gt;

&lt;p&gt;For example, consider the following SQLX file, which configures Dataform to publish a table &lt;code&gt;age_count&lt;/code&gt; containing the transformed results of a query reading a &lt;code&gt;people&lt;/code&gt; relation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config { name: "age_count", type: "table" }

select age, count(1) from ${ref("people")} group by age
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Dataform only needs to (re-)publish this table if any of the following conditions are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; doesn't exist&lt;/li&gt;
&lt;li&gt;The output table &lt;code&gt;age_count&lt;/code&gt; has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)&lt;/li&gt;
&lt;li&gt;The query has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published&lt;/li&gt;
&lt;li&gt;The input table &lt;code&gt;people&lt;/code&gt; has changed since the last time the &lt;code&gt;age_count&lt;/code&gt; table was published (or, if &lt;code&gt;people&lt;/code&gt; is a view, then if any of the input(s) to &lt;code&gt;people&lt;/code&gt; have changed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dataform uses these rules to decide whether or not to publish the table. If none of the conditions hold - i.e. re-publishing the table would produce exactly the same output - then the action is skipped.&lt;/p&gt;
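&lt;p&gt;A rough sketch of that decision logic, in Python (illustrative only - the function names and the fingerprinting scheme here are assumptions, not Dataform's actual implementation):&lt;/p&gt;

```python
import hashlib

def fingerprint(*parts):
    # Stable hash over the query text and the state of each input table.
    h = hashlib.sha256()
    for part in parts:
        h.update(part.encode("utf-8"))
    return h.hexdigest()

def should_publish(output_exists, output_modified_externally,
                   query_text, input_states, last_run_fingerprint):
    """Return True if the table must be (re-)published."""
    if not output_exists:
        return True
    if output_modified_externally:
        return True
    # If the query or any input changed, the stored fingerprint won't match.
    return fingerprint(query_text, *input_states) != last_run_fingerprint

query = "select age, count(1) from people group by age"
cached = fingerprint(query, "people@version1")

print(should_publish(True, False, query, ["people@version1"], cached))  # False: skip
print(should_publish(True, False, query, ["people@version2"], cached))  # True: input changed
```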

&lt;h2&gt;Building in intelligence so you don't have to&lt;/h2&gt;

&lt;p&gt;At Dataform we believe that you shouldn't have to manage the infrastructure involved in running analytics workloads.&lt;/p&gt;

&lt;p&gt;This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don't have to. All you need to do is define your business-logic transformations, and we'll handle the rest.&lt;/p&gt;

&lt;p&gt;If you'd like to learn more, the Dataform framework documentation is &lt;a href="https://docs.dataform.co/?utm_source=dev_to&amp;amp;utm_campaign=run_caching"&gt;here&lt;/a&gt;. Join us on &lt;a href="http://dataform-users.slack.com/"&gt;Slack&lt;/a&gt; and let us know what you think!&lt;/p&gt;

</description>
      <category>elt</category>
      <category>dataengineering</category>
      <category>pipeline</category>
      <category>etl</category>
    </item>
    <item>
      <title>CI/CD for ETL/ELT pipelines</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 08 Jun 2020 12:14:52 +0000</pubDate>
      <link>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</link>
      <guid>https://forem.com/dataform/ci-cd-for-etl-elt-pipelines-2mgm</guid>
      <description>&lt;p&gt;One of Dataform’s key motivations has been to bring software engineering best practices to teams building ETL/ELT pipelines. To further that goal, we recently launched support for you to run Continuous Integration (CI) checks against your Dataform projects.&lt;/p&gt;

&lt;h2&gt;What is CI/CD?&lt;/h2&gt;

&lt;p&gt;CI/CD is a set of processes which aim to help teams ship software quickly and reliably.&lt;/p&gt;

&lt;p&gt;Continuous integration (CI) checks automatically verify that all changes to your code work as expected, and typically run before the change is merged into your Git master branch. This ensures that the version of the code on the master branch always works correctly.&lt;/p&gt;

&lt;p&gt;Continuous deployment (CD) tools automatically (and frequently) deploy the latest version of your code to production. This is intended to minimize the time it takes for new features or bugfixes to be available in production.&lt;/p&gt;

&lt;h2&gt;CI/CD for Dataform projects&lt;/h2&gt;

&lt;p&gt;Dataform already does most of the CD gruntwork for you. By default, all code committed to the master branch is automatically deployed. For more advanced use cases, you can configure exactly what gets deployed, and when, using &lt;a href="https://docs.dataform.co/dataform-web/scheduling/environments?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;environments&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;CI checks, however, are typically configured as part of your Git repository (usually hosted on GitHub, though Dataform supports other Git hosting providers).&lt;/p&gt;

&lt;h2&gt;How to configure CI checks&lt;/h2&gt;

&lt;p&gt;Dataform distributes a &lt;a href="https://hub.docker.com/r/dataformco/dataform"&gt;Docker image&lt;/a&gt; which can be used to run the equivalent of &lt;a href="https://docs.dataform.co/dataform-cli?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;Dataform CLI&lt;/a&gt; commands. For most CI tools, this Docker image is what you'll use to run your automated checks.&lt;/p&gt;

&lt;p&gt;If you host your Dataform Git repository on GitHub, you can use &lt;a href="https://help.github.com/en/actions/configuring-and-managing-workflows/configuring-a-workflow"&gt;GitHub Actions&lt;/a&gt; to run CI workflows. This post assumes you’re using GitHub Actions, but other CI tools are configured in a similar way.&lt;/p&gt;

&lt;p&gt;Here’s a simple example of a GitHub Actions workflow for a Dataform project. Once you put this in a &lt;code&gt;.github/workflows/&amp;lt;some filename&amp;gt;.yaml&lt;/code&gt; file, GitHub will run the workflow on each pull request and commit to your master branch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;compile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Checkout code into workspace directory&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install project dependencies&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;install&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run dataform compile&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker://dataformco/dataform:1.6.11&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;compile&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This workflow runs &lt;code&gt;dataform compile&lt;/code&gt;: if the project fails to compile, the workflow fails, and that failure is reflected in the GitHub UI.&lt;/p&gt;

&lt;p&gt;Note that it’s possible to run any &lt;code&gt;dataform&lt;/code&gt; CLI command in a CI workflow. However, some commands do need credentials in order to run queries against your data warehouse. In these circumstances, you should encrypt those credentials and commit the encrypted file to your Git repository. Then, in your CI workflow, you decrypt the credentials so that the Dataform CLI can use them.&lt;/p&gt;
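&lt;p&gt;For example, with GitHub Actions, decryption could be an extra step in the workflow above. This is a hypothetical sketch - the file names, the secret name, and the use of symmetric &lt;code&gt;gpg&lt;/code&gt; encryption are illustrative choices, not requirements:&lt;/p&gt;

```yaml
      - name: Decrypt warehouse credentials
        run: >
          gpg --quiet --batch --yes --decrypt
          --passphrase="$CREDENTIALS_PASSPHRASE"
          --output .df-credentials.json df-credentials.json.gpg
        env:
          CREDENTIALS_PASSPHRASE: ${{ secrets.CREDENTIALS_PASSPHRASE }}
      - name: Run dataform test
        uses: docker://dataformco/dataform:1.6.11
        with:
          args: test
```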

&lt;p&gt;For further details on configuring CI/CD for your Dataform projects, please see our &lt;a href="https://docs.dataform.co/guides/ci-cd?utm_source=dev_to&amp;amp;utm_campaign=ci_cd"&gt;docs&lt;/a&gt;. As always, if you have any questions, or would like to get in touch with us, please send us a message on &lt;a href="https://join.slack.com/t/dataform-users/shared_invite/zt-dark6b7k-r5~12LjYL1a17Vgma2ru2A"&gt;Slack&lt;/a&gt;!&lt;/p&gt;

</description>
      <category>cicd</category>
      <category>etl</category>
      <category>elt</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>The right way to install Helm charts</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Sun, 05 Apr 2020 13:06:57 +0000</pubDate>
      <link>https://forem.com/benbirt/the-right-way-to-install-helm-charts-4mjp</link>
      <guid>https://forem.com/benbirt/the-right-way-to-install-helm-charts-4mjp</guid>
      <description>&lt;p&gt;&lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt; has become the de facto package management tool for Kubernetes resources. As an example, take a look at these installation instructions for &lt;a href="https://istio.io/docs/setup/install/helm/"&gt;Istio&lt;/a&gt; (a Kubernetes service mesh and observability tool).&lt;/p&gt;

&lt;p&gt;Many common Helm chart installation instructions encourage you to run a single simple command (&lt;code&gt;helm install &amp;lt;chart&amp;gt;&lt;/code&gt;) and - hey presto - some new software is running in your Kubernetes cluster. I think this workflow should generally (if not always) be avoided.&lt;/p&gt;

&lt;p&gt;The big disadvantage of this workflow is that you sacrifice repeatability.&lt;/p&gt;

&lt;h1&gt;Repeatability is critical&lt;/h1&gt;

&lt;p&gt;Consider the scenario when you need to reinstall your Helm charts.&lt;/p&gt;

&lt;p&gt;Say, for example, you need to migrate to a new Kubernetes cluster. You &lt;em&gt;can&lt;/em&gt; run &lt;code&gt;helm ls&lt;/code&gt; to determine all currently installed charts and their versions, and then install all of those on the new cluster - but this is significant manual work, and it only applies if you still have a functioning cluster from which to 'copy' your Helm charts.&lt;/p&gt;

&lt;p&gt;If for some reason your cluster is sufficiently broken - or, perhaps, accidentally deleted - you've now lost the accurate record of which Helm charts you had installed, and at what versions.&lt;/p&gt;

&lt;h1&gt;&lt;code&gt;helm install&lt;/code&gt; is anti-GitOps&lt;/h1&gt;

&lt;p&gt;The GitOps model of managing Kubernetes resources - where a Git repo is treated as the source of truth for what should be running in the cluster - is precisely the solution to the repeatability problem. Somebody made a manual change to the cluster that broke something? No problem: just roll back to what's in Git. (Indeed, preferably some automation will detect the change and do it for you!)&lt;/p&gt;

&lt;p&gt;On the other hand, manually running &lt;code&gt;helm install&lt;/code&gt; commands completely breaks this model, because your Git repo no longer completely encapsulates a description of what should be running in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;To be fair to Helm, it is possible to work around this problem. As long as you're willing to package up all of your Kubernetes resources into Helm chart(s), with all dependencies listed (managed using semver), you can continue using the GitOps deployment model. But this forces you to use Helm for everything.&lt;/p&gt;

&lt;h1&gt;What you should do instead&lt;/h1&gt;

&lt;p&gt;If you want to install Helm charts to your Kubernetes cluster, I strongly recommend:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;vendoring the chart into your Git repo (or otherwise fully specifying the precise version of the chart in source control)&lt;/li&gt;
&lt;li&gt;using &lt;code&gt;helm template&lt;/code&gt; on the chart to render it as Kubernetes YAML&lt;/li&gt;
&lt;li&gt;running plain-old &lt;code&gt;kubectl apply&lt;/code&gt; on the result&lt;/li&gt;
&lt;/ol&gt;
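&lt;p&gt;Here's a minimal sketch of steps 2 and 3 as a Python wrapper. The chart path, release name, and values file are illustrative, and the &lt;code&gt;helm template&lt;/code&gt; invocation assumes Helm 3 syntax:&lt;/p&gt;

```python
import subprocess

def render_and_apply(chart_dir, release, namespace, values_file, dry_run=False):
    """Render a vendored chart to plain YAML, then kubectl-apply the result."""
    template_cmd = ["helm", "template", release, chart_dir,
                    "--namespace", namespace, "--values", values_file]
    apply_cmd = ["kubectl", "apply", "--namespace", namespace, "-f", "-"]
    if dry_run:
        # Just report what would run, without touching a cluster.
        return template_cmd, apply_cmd
    rendered = subprocess.run(template_cmd, check=True, capture_output=True).stdout
    subprocess.run(apply_cmd, input=rendered, check=True)
    return template_cmd, apply_cmd

# Dry run: show the two commands that would be executed.
tmpl, apply_ = render_and_apply("third_party/charts/istio", "istio",
                                "istio-system", "istio-values.yaml", dry_run=True)
print(" ".join(tmpl))
print(" ".join(apply_))
```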

&lt;p&gt;This makes your Kubernetes cluster's configuration fully repeatable. It also means that all chart installations, and version upgrades, are fully auditable in source control.&lt;/p&gt;

&lt;p&gt;In fact - if you take another look at the Istio installation instructions - you'll see that this is exactly the recommended workflow for installing Istio using Helm!&lt;/p&gt;

&lt;p&gt;We do all of this using our build system - &lt;a href="https://bazel.build/"&gt;Bazel&lt;/a&gt;. We've written some custom Bazel rules that help us achieve this workflow - &lt;a href="https://github.com/dataform-co/dataform/tree/de1eb66e558fbd349092d9519a8d5a1edefba94f/tools/helm"&gt;feel free to use them&lt;/a&gt;, if you like.&lt;/p&gt;

&lt;p&gt;Bringing a new chart into the build system is simple. In your Bazel &lt;code&gt;WORKSPACE&lt;/code&gt; file, you'll need to include the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Add the 'dataform' repository as a dependency.
&lt;/span&gt;&lt;span class="n"&gt;git_repository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"df"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"de1eb66e558fbd349092d9519a8d5a1edefba94f"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;remote&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://github.com/dataform-co/dataform.git"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load the Helm repository rules.
&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"@dataform//tools/helm:repository_rules.bzl"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"helm_chart"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"helm_tool"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Download the 'helm' tool.
&lt;/span&gt;&lt;span class="n"&gt;helm_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"helm_tool"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To add a single new chart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Download the 'istio' Helm chart.
&lt;/span&gt;&lt;span class="n"&gt;helm_chart&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chartname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;repo_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://storage.googleapis.com/istio-release/releases/1.4.0/charts/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"v1.4.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, when you want to template it, add a &lt;code&gt;BUILD&lt;/code&gt; rule somewhere, looking something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;helm_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chart_tar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"@istio//:chart.tgz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"istio-system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The output of this rule is plain Kubernetes YAML, ready for you to deploy to your cluster however you wish. (We use the standard Bazel Kubernetes &lt;a href="https://github.com/bazelbuild/rules_k8s"&gt;rules&lt;/a&gt;.)&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>helm</category>
      <category>devops</category>
      <category>gitops</category>
    </item>
    <item>
      <title>How we store protobufs in MongoDB</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 09 Jan 2020 12:09:42 +0000</pubDate>
      <link>https://forem.com/dataform/mongodb-protobuf-codec-2168</link>
      <guid>https://forem.com/dataform/mongodb-protobuf-codec-2168</guid>
      <description>&lt;p&gt;At Dataform, we use Google Datastore to store customer data. However, for various reasons, we need to move off Datastore and onto a self-managed database.&lt;/p&gt;

&lt;p&gt;We store all of our data in protobuf format; each entity we store corresponds to a single protobuf message. Since we already store structured documents (as opposed to SQL table rows), MongoDB is a great fit for us.&lt;/p&gt;

&lt;p&gt;Here's a simple example of a protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Person&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;first_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;last_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;birth_timestamp_millis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;One of the major benefits of using protocol buffers as a storage format is that it's very easy to make changes to our database 'schema'. Renaming a field is as simple as editing the &lt;code&gt;.proto&lt;/code&gt; file, and it's (usually, with some caveats) safe to change a field's type, and so on - whereas renaming a 'field' (column) in a traditional SQL-like table is usually a lot of work, involving some amount of DB migration.&lt;/p&gt;

&lt;p&gt;However, safely making arbitrary changes to a protobuf definition requires the data at rest to actually be stored in binary protobuf format - which would make it impossible to query directly, since the database engine doesn't speak protobuf.&lt;/p&gt;

&lt;p&gt;One solution to this problem is to just store messages in their canonical JSON format. However, we'd then lose the ability to make many kinds of changes to our protobuf definitions. For example, we'd never be able to (easily) rename fields: imagine we stored an instance of &lt;code&gt;Person&lt;/code&gt; (as defined above) in JSON format, but then renamed &lt;code&gt;birth_timestamp_millis&lt;/code&gt; to &lt;code&gt;birthday_timestamp&lt;/code&gt; - the previously stored &lt;code&gt;Person&lt;/code&gt; would now have an undecodable &lt;code&gt;birthTimestampMillis&lt;/code&gt; field, and would be missing a value for &lt;code&gt;birthdayTimestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What we really want is the best of both worlds: we want to be able to store messages as JSON, so that it's possible to easily query the data; but we want stored data to be agnostic to the various kinds of backwards/forwards-compatible changes we might want to make to the protobuf definition.&lt;/p&gt;

&lt;p&gt;Luckily, the MongoDB client libraries include a very helpful feature: they allow the user to define how data is encoded/decoded as it is stored/retrieved from the database, using custom, user-defined codecs.&lt;/p&gt;

&lt;p&gt;We have used this feature to define our own new codec, written in Go, which solves the protobuf storage problem for us. It encodes protobuf messages using their tag numbers as document keys, and uses standard encoding/decoding for each of the protobuf field values.&lt;/p&gt;
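&lt;p&gt;The core idea can be sketched in a few lines of Python (the real codec is written in Go and works on compiled protobuf descriptors; the hand-written tag maps below are purely illustrative):&lt;/p&gt;

```python
def to_tag_keyed(msg, field_tags):
    """Re-key a field-name-keyed dict by protobuf tag number.

    `field_tags` maps each field name to (tag, nested_map_or_None);
    nested messages are re-keyed recursively.
    """
    out = {}
    for name, value in msg.items():
        tag, nested = field_tags[name]
        out[str(tag)] = to_tag_keyed(value, nested) if nested else value
    return out

nested_tags = {"nestedStringField": (2, None), "nestedInt32Field": (1, None)}
example_tags = {
    "stringField": (3, None),
    "enumField": (10, None),
    "int64Field": (33, None),
    "nestedMessage": (107, nested_tags),
}

doc = to_tag_keyed(
    {"stringField": "foo", "enumField": 1, "int64Field": 123456789,
     "nestedMessage": {"nestedStringField": "bar", "nestedInt32Field": 12}},
    example_tags,
)
print(doc)  # {'3': 'foo', '10': 1, '33': 123456789, '107': {'2': 'bar', '1': 12}}
```

&lt;p&gt;Renaming a field only changes the keys of the map, never the stored document - which is exactly the property we want.&lt;/p&gt;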

&lt;p&gt;For example, given the following protobuf definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;Example&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="na"&gt;enum_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;oneof&lt;/span&gt; &lt;span class="n"&gt;example_oneof&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;78&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="na"&gt;int64_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;NestedMessage&lt;/span&gt; &lt;span class="na"&gt;nested_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;107&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;enum&lt;/span&gt; &lt;span class="n"&gt;ExampleEnum&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;VAL_0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="na"&gt;VAL_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;573&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;NestedMessage&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;nested_string_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;nested_int32_field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the following instance of &lt;code&gt;Example&lt;/code&gt;, in canonical JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "stringField": "foo",
  "enumField": "VAL_0",
  // Note that this is represented as a string because the JavaScript number type is smaller than an int64.
  "int64Field": "123456789",
  "nestedMessage": {
    "nestedStringField": "bar",
    "nestedInt32Field": 12
  }
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Our MongoDB codec will encode the instance of &lt;code&gt;Example&lt;/code&gt; as the following Mongo BSON document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"10"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"33"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;123456789&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"107"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;With this encoding, if we change the name of &lt;code&gt;nested_string_field&lt;/code&gt; to &lt;code&gt;something_else&lt;/code&gt;, or the enum value &lt;code&gt;VAL_0&lt;/code&gt; to &lt;code&gt;BETTER_ENUM_VALUE_NAME&lt;/code&gt;, we'll still be able to decode the document, without any loss of data.&lt;/p&gt;

&lt;p&gt;This does make it slightly harder to query the database, since we now need to specify field numbers as opposed to human-readable field names. However, for production use, we have put a gRPC server in front of MongoDB which knows how to construct correct MongoDB queries, and for ad-hoc queries we plan to write a small translator which can do the same when given queries containing protobuf field names.&lt;/p&gt;
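&lt;p&gt;The core idea can be sketched in a few lines of Python (a hypothetical illustration of the technique, not the Go codec linked below). Given a mapping from field names to field numbers - which the real codec derives from the protobuf descriptor - documents are stored keyed by number, so renaming a field only changes the mapping, never the stored data. The names &lt;code&gt;stringField&lt;/code&gt; and &lt;code&gt;enumField&lt;/code&gt; are assumed names for fields 3 and 10 in the example above.&lt;/p&gt;

```python
# Hypothetical sketch: encode/decode documents keyed by protobuf field
# number rather than field name, so field renames are safe.

# Name-to-number mappings for the example message; the real codec
# derives these from the protobuf descriptor.
FIELD_NUMBERS = {"stringField": 3, "enumField": 10, "int64Field": 33,
                 "nestedMessage": 107}
NESTED_NUMBERS = {"nestedStringField": 2, "nestedInt32Field": 1}

def encode(msg, numbers):
    # Replace each field name with its (stringified) field number.
    out = {}
    for name, value in msg.items():
        key = str(numbers[name])
        out[key] = encode(value, NESTED_NUMBERS) if isinstance(value, dict) else value
    return out

def decode(doc, numbers):
    # Invert the mapping to recover whatever the fields are named today.
    names = {str(num): name for name, num in numbers.items()}
    out = {}
    for key, value in doc.items():
        name = names[key]
        out[name] = decode(value, NESTED_NUMBERS) if isinstance(value, dict) else value
    return out

example = {"stringField": "foo", "enumField": 1,
           "int64Field": 123456789,
           "nestedMessage": {"nestedStringField": "bar",
                             "nestedInt32Field": 12}}
encoded = encode(example, FIELD_NUMBERS)
```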

&lt;p&gt;The code is open-sourced &lt;a href="https://github.com/dataform-co/dataform/tree/master/protomongo"&gt;here&lt;/a&gt; (&lt;a href="https://godoc.org/github.com/dataform-co/dataform/protomongo"&gt;godoc&lt;/a&gt;). Examples of how to use it in a MongoDB codec registry are in the tests. Please feel free to use it if it helps you!&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>protobuf</category>
      <category>database</category>
    </item>
    <item>
      <title>Thinking about technical debt</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Sat, 20 Jul 2019 13:09:08 +0000</pubDate>
      <link>https://forem.com/benbirt/thinking-about-technical-debt-b72</link>
      <guid>https://forem.com/benbirt/thinking-about-technical-debt-b72</guid>
      <description>&lt;p&gt;Technical debt is a common concern among software engineers. But we don't often think about it in the right terms.&lt;/p&gt;

&lt;h1&gt;
  
  
  Monetary debt
&lt;/h1&gt;

&lt;p&gt;After you take out a loan, you (usually) have to make interest payments.&lt;/p&gt;

&lt;p&gt;If you cannot make interest payments in full, in the short term, your debt grows. If you continue to be unable to repay interest, you may have to declare bankruptcy. This means that you may lose your assets to pay down the debt.&lt;/p&gt;

&lt;p&gt;If circumstances are good, you will be able to both make interest payments and pay down the debt. After enough of these payments, you will be debt-free.&lt;/p&gt;

&lt;h1&gt;
  
  
  Technical debt as monetary debt
&lt;/h1&gt;

&lt;p&gt;Introducing technical debt to a codebase is equivalent to taking out a loan.&lt;/p&gt;

&lt;p&gt;You must now make continuous interest payments on the loan. Interest can come in various forms. One is engineering time spent on fixing bugs or regressions. Another is an increase in operational load from running the system in production.&lt;/p&gt;

&lt;p&gt;Technical debt will grow if you do not have enough income (engineering time) to meet its costs. Lacking the time to make proper bug fixes, engineers will introduce yet more debt to the codebase.&lt;/p&gt;

&lt;p&gt;If this situation remains unchecked, you may have to declare technical bankruptcy. This is a terrible situation: you must now drop all other work to fix the technical debt. In the worst case, you may have to drop all support for the system while you rewrite it completely from scratch.&lt;/p&gt;

&lt;h1&gt;
  
  
  We should not be afraid of technical debt
&lt;/h1&gt;

&lt;p&gt;There are often good reasons for taking out a loan. For example, it might enable us to make an investment that has a large payoff in the future.&lt;/p&gt;

&lt;p&gt;As with loans, so too with technical debt. We must make tradeoffs between technical debt and investment in new features. Those features may pay off in the form of more profits, enabling us to hire more engineers. Thus paying down technical debt becomes much easier.&lt;/p&gt;

&lt;h1&gt;
  
  
  Manage technical debt as you would a loan
&lt;/h1&gt;

&lt;p&gt;Just as with real debt, technical debt requires careful management.&lt;/p&gt;

&lt;p&gt;We should not be afraid of technical debt for its own sake. What should concern us is poor planning.&lt;/p&gt;

&lt;p&gt;If you were going to use a loan to make an investment, it would be prudent to make a plan. This plan would model both the loan's value and the investment's value over time.&lt;/p&gt;

&lt;p&gt;Technical debt is harder to model. But we should carefully consider both its cost and the opportunity cost of not taking it on. We should also keep track of our current debt, what it is costing us, and what headroom is available to take on more.&lt;/p&gt;

</description>
      <category>codequality</category>
      <category>techdebt</category>
      <category>management</category>
    </item>
    <item>
      <title>How to write unit tests for your SQL queries</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Mon, 15 Jul 2019 14:31:00 +0000</pubDate>
      <link>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</link>
      <guid>https://forem.com/dataform/how-to-write-unit-tests-for-your-sql-queries-2hd7</guid>
      <description>&lt;p&gt;I’ve previously &lt;a href="https://dev.to/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg"&gt;written&lt;/a&gt; about how I think we should prefer writing processing pipelines in pure SQL. However, a big difference between SQL and more widely-used languages is that those other languages generally have a strong tradition of unit testing.&lt;/p&gt;

&lt;p&gt;Usually, when we talk about ‘tests’ in the context of SQL, we don’t actually mean unit tests. Instead, the term generally refers to data tests, which are really &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;assertions&lt;/a&gt; that the data itself conforms to some test criteria.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unit tests are not assertions.&lt;/strong&gt; Unit tests verify the logic of a SQL query by running that query on some fixed set of inputs. Assertions necessarily depend upon the real datasets which they validate, while unit tests should never depend on any real data.&lt;/p&gt;
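&lt;p&gt;The distinction is easy to see in code. An assertion is a query over the real dataset that returns the rows violating some requirement, and it passes only if no such rows exist. A minimal sketch (using Python's built-in sqlite3 in place of a warehouse; the table and column names are made up for illustration):&lt;/p&gt;

```python
import sqlite3

# Sketch of a data assertion: a query over the (real) dataset that
# selects rows violating a requirement. The assertion passes only if
# the query returns zero rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

violations = conn.execute(
    "SELECT id FROM users WHERE email IS NULL"
).fetchall()
assertion_passed = len(violations) == 0
conn.close()
```

Note that this depends entirely on whatever data happens to be in the table - which is exactly why an assertion cannot verify the logic of a query the way a unit test can.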

&lt;h3&gt;
  
  
  The benefits of unit tests
&lt;/h3&gt;

&lt;p&gt;Unit testing is a standard practice in software engineering. Unit tests help ensure that difficult pieces of logic or complex interactions between components work as expected - and continue to work as expected as the surrounding code changes.&lt;/p&gt;

&lt;p&gt;Unit tests should not have any external dependencies; tests run the code in question on some faked inputs, ensuring that changes outside of that unit of code do not affect the test. This means that the success or failure of the test comes down purely to the code’s logic. Thus, if the test fails, you know exactly where to start debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL unit testing widespread?
&lt;/h3&gt;

&lt;p&gt;In standard languages, a unit test typically consists of injecting fake input into the code under test and checking that the output matches some expected result. However, SQL scripts don’t label their input datasets - typically, they’re just defined statically inline in a FROM clause. This makes it difficult to inject fake input test data into your SQL code.&lt;/p&gt;

&lt;p&gt;The result of this is that most SQL code goes untested.&lt;/p&gt;

&lt;h3&gt;
  
  
  The solution
&lt;/h3&gt;

&lt;p&gt;Various SQL frameworks let you define layers of indirection between your SQL and its input(s); i.e. you declare and label the input datasets upon which a query depends. Unit testing frameworks can use this indirection to replace real input data with faked versions.&lt;/p&gt;

&lt;p&gt;We can then run the code under test, using some faked input, and compare the output result rows against a set of expected outputs. If the actual output of the code under test matches the expected output, the test passes; if not, it fails.&lt;/p&gt;

&lt;p&gt;This technique is simple and gives you real power to verify that a SQL script does what you think it does. You can pass faked inputs to your SQL that your real data may not currently contain, giving you confidence that it can robustly handle a wide range of data.&lt;/p&gt;
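&lt;p&gt;To make this concrete, here is a minimal sketch of the technique, again using Python's built-in sqlite3 rather than a cloud warehouse (the &lt;code&gt;orders&lt;/code&gt; table and revenue query are hypothetical): load the faked input into a table, run the query under test against it, and compare the result rows with the expected rows.&lt;/p&gt;

```python
import sqlite3

# Minimal sketch of a SQL unit test. The query under test would
# normally read a real "orders" table; the test injects a faked
# version instead.
QUERY_UNDER_TEST = """
    SELECT country, SUM(amount) AS revenue
    FROM orders
    GROUP BY country
    ORDER BY country
"""

def run_unit_test(fake_input_rows, expected_rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (country TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", fake_input_rows)
    actual = conn.execute(QUERY_UNDER_TEST).fetchall()
    conn.close()
    # The test passes only if the query's logic produces exactly the
    # expected output from the faked input.
    return actual == expected_rows

passed = run_unit_test(
    fake_input_rows=[("DE", 10), ("DE", 5), ("US", 7)],
    expected_rows=[("DE", 15), ("US", 7)],
)
```

Because the input is fixed, a failure here can only mean the query's logic changed - there is no real data involved to muddy the diagnosis.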

&lt;h3&gt;
  
  
  Test case support in Dataform
&lt;/h3&gt;

&lt;p&gt;When using Dataform’s &lt;a href="https://docs.dataform.co/guides/datasets/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;enriched SQL&lt;/a&gt;, you reference input datasets using either the &lt;code&gt;ref()&lt;/code&gt; or &lt;code&gt;resolve()&lt;/code&gt; function. This functionality gives us an easy way to inject fake input datasets into a script, thus enabling users to write unit tests.&lt;/p&gt;

&lt;p&gt;We have defined a new type of Dataform script: &lt;code&gt;test&lt;/code&gt;. In a &lt;code&gt;test&lt;/code&gt; query, you specify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The query you’re testing&lt;/li&gt;
&lt;li&gt;The faked inputs, each labeled with their referenced name&lt;/li&gt;
&lt;li&gt;The expected output of running the query on the faked inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Behind the scenes, when you run the test, we dynamically replace the inputs to the &lt;code&gt;dataset&lt;/code&gt; query with your faked input data. We then run the &lt;code&gt;dataset&lt;/code&gt; query, along with the query that defines your expected output, and check that the resulting rows match. Simple!&lt;/p&gt;

&lt;h3&gt;
  
  
  An example
&lt;/h3&gt;

&lt;p&gt;Here’s a worked example written using Dataform’s JavaScript API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// First, define a dataset - we’ll follow this up with the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;publish&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT
      FLOOR(age / 5) * 5 AS age_group,
      COUNT(1) AS user_count
      FROM &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;
      GROUP BY age_group
    `&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Now, define the unit test.&lt;/span&gt;
&lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test_age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Specify the name of the dataset under test.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the fake input “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age UNION ALL
      SELECT 21 AS age UNION ALL
      SELECT 24 AS age UNION ALL
      SELECT 34 AS age
    `&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="c1"&gt;// Provide the expected output of running “age_groups” on the “ages” dataset.&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;`
      SELECT 15 AS age_group, 1 AS user_count UNION ALL
      SELECT 20 AS age_group, 2 AS user_count UNION ALL
      SELECT 30 AS age_group, 1 AS user_count
    `&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, if you prefer to use Dataform’s enriched SQL, the unit test would look as follows (note that publishing the dataset is elided for simplicity):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;test&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;age_groups&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;input&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
  &lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;34&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt; &lt;span class="nx"&gt;UNION&lt;/span&gt; &lt;span class="nx"&gt;ALL&lt;/span&gt;
&lt;span class="nx"&gt;SELECT&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;age_group&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;AS&lt;/span&gt; &lt;span class="nx"&gt;user_count&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For more details, see our &lt;a href="https://docs.dataform.co/guides/tests/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=write_sql_unit_tests"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’ve released this functionality as part of the v1.0.0 release of our &lt;a href="https://www.npmjs.com/package/@dataform/cli"&gt;@dataform NPM packages&lt;/a&gt;. Dataform Web will soon support test cases, too. Let us know what you think!&lt;/p&gt;

</description>
      <category>sql</category>
      <category>testing</category>
    </item>
    <item>
      <title>Consider SQL when writing your next processing pipeline</title>
      <dc:creator>BenBirt</dc:creator>
      <pubDate>Thu, 27 Jun 2019 11:03:34 +0000</pubDate>
      <link>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</link>
      <guid>https://forem.com/dataform/consider-sql-when-writing-your-next-processing-pipeline-1ojg</guid>
      <description>&lt;p&gt;Once a team or organization has some data to manage - customer data, events to be fed into some machine learning system, or whatever else - they almost immediately find themselves writing, running, and maintaining processing pipelines.&lt;/p&gt;

&lt;p&gt;The outputs of these pipelines are many and varied - customer and market analysis, data cleaning, and so on - and such pipelines seem to pop up more often and more quickly than one expects.&lt;/p&gt;

&lt;p&gt;Today, most non-trivial data processing is done using some pipelining technology, for example Dataflow / Apache Beam, with user code typically written in languages such as Java, Python, or perhaps Go.&lt;/p&gt;

&lt;h4&gt;
  
  
  My experience
&lt;/h4&gt;

&lt;p&gt;I worked as a software engineer at Google for several years, during which I led multiple teams and projects which required writing, managing, and maintaining various types of processing pipelines. During that time I became convinced that - for the majority of use-cases - &lt;strong&gt;expressing these pipelines in SQL is simpler, cheaper, and easier than the alternatives, with few disadvantages&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For what it’s worth, I’ll note that I’m actually a big fan of these pipelining technologies. While at Google, I was a cheerleader for the internal version of Cloud Dataflow (Flume) for both batch and streaming use-cases. However, I think that the reasons for using them - broadly - no longer apply to today’s world of highly scalable cloud warehouses and query engines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why isn’t SQL the de facto processing pipeline language today?
&lt;/h3&gt;

&lt;p&gt;SQL wasn’t really a scalable option for processing data before we had widely available cloud data warehouses such as BigQuery and Redshift. Without these highly-scalable query engines, the only reasonable choice was to perform any significant data processing outside of the data warehouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  Scalable processing
&lt;/h4&gt;

&lt;p&gt;The first truly scalable data processing solution was probably something like Google MapReduce. It quickly became obvious that chaining MapReduce-like processing steps into a full pipeline using some higher-level API could produce very powerful pipelining systems, and frameworks such as Hadoop, Apache Spark, and Google Cloud Dataflow were born. These systems enabled users to process terabytes of data (or more, with some tuning) quickly and scalably, which was often simply impossible using SQL query engines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;However, cloud data warehouse systems have evolved dramatically over the past 5 years.&lt;/strong&gt; SQL queries running on BigQuery’s query engine will generally run much more quickly than the alternative, which requires reading all of the relevant data out of the warehouse, processing it, then writing the result back to some other table. It’s also much easier to run in production; there’s no need to manage temporary state, queries are optimized automatically, etc. All of these concerns are pushed to the query engine, and the user doesn’t have to care about them.&lt;/p&gt;

&lt;p&gt;The query engine is the best place to optimize the pipeline since it has access to the most metadata about what data is being processed; as a result it’s much easier to manage the pipeline operationally in production. This is much better than the alternative - I can’t tell you how many hours (or days, or even weeks) my teams and I have spent debugging scalability issues and poor optimization choices in Java pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Existing bias towards imperative languages
&lt;/h4&gt;

&lt;p&gt;I think there is an understandable cultural bias in software engineering teams towards using standard imperative programming languages to implement processing pipelines, and until very recently it wasn’t really possible to mix and match SQL and non-SQL (see below for more on this).&lt;/p&gt;

&lt;p&gt;Engineers are much more familiar with configuring jobs written in these languages in production, but happily, modern SaaS options obviate this problem for SQL pipelines by taking responsibility for scheduling and running the user’s code, so that &lt;strong&gt;the user needs to do very little productionization&lt;/strong&gt; at all.&lt;/p&gt;

&lt;p&gt;Additionally, SQL scripts have sometimes been treated as a second class citizen versus other languages. Some tools used for SQL script development haven’t supported standard software engineering techniques such as version control or code review. However, this too has changed, with modern toolchain options supporting these practices as first-class features.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL has distinct advantages over the alternatives
&lt;/h3&gt;

&lt;p&gt;SQL is a language built and designed to support exactly what you want to do when you’re processing data: joining, filtering, aggregating, and transforming data. Thus, it’s usually &lt;strong&gt;much simpler and easier to express your pipeline in SQL&lt;/strong&gt; than it is in some other pipelining technology. (If you’d like to see an example of just how powerful SQL can be, take a look at &lt;a href="https://towardsdatascience.com/deep-neural-network-implemented-in-pure-sql-over-bigquery-f3ed245814d3"&gt;this article&lt;/a&gt; in which a deep neural network is implemented with it!)&lt;/p&gt;

&lt;h4&gt;
  
  
  A common language
&lt;/h4&gt;

&lt;p&gt;The biggest advantage of implementing your pipeline in SQL is that it’s likely to be the same language that you or your data team use to actually perform final analysis on the output of the pipeline.&lt;/p&gt;

&lt;p&gt;This means that the data team don’t need support from engineers to make changes to the pipeline. Instead, they’re empowered to make the changes themselves.&lt;/p&gt;

&lt;h4&gt;
  
  
  Debugging
&lt;/h4&gt;

&lt;p&gt;When something goes wrong, SQL pipelines are usually much easier to introspect than the alternative. If you want to check exactly what data is being output by any given processing stage of a SQL pipeline, you can simply pull out those results into some relevant SELECT query.&lt;/p&gt;

&lt;p&gt;Doing the same using a pipelining system can be a real pain, involving making significant code changes (just to add enough instrumentation to enable debugging) and re-deploying the pipeline.&lt;/p&gt;

&lt;h4&gt;
  
  
  Faster development
&lt;/h4&gt;

&lt;p&gt;During development of a SQL-based pipeline, the iteration cycle is significantly faster because the feedback loop is much quicker - edit your queries, re-run the pipeline, and immediately see new results.&lt;/p&gt;

&lt;p&gt;If the pipeline processes so much data that it takes more than a minute or two to execute, it’s trivial to process a fraction of the data (to get results more quickly) by adding a LIMIT to your query (or subqueries), or by only selecting rows belonging to a subset of the input dataset.&lt;/p&gt;

&lt;p&gt;When writing Java pipelines from scratch, I would often find that testing out a single bugfix would take hours - not so with SQL. In fact, I often found myself writing a SQL script to validate the output of some productionized Java pipeline, only to belatedly realize that I had essentially re-implemented the Java pipeline in SQL - in far fewer lines of code, with far greater readability and significantly less complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL’s disadvantages
&lt;/h3&gt;

&lt;p&gt;In my experience there are two distinct domains where other languages have an edge on pure SQL: (1) unit testing and (2) the readability of particularly complex data transformations.&lt;/p&gt;

&lt;p&gt;Some SQL queries can be fairly complex, especially if they use powerful features such as BigQuery’s &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/analytic-function-concepts"&gt;analytic functions&lt;/a&gt;. I’d like to be able to write unit tests for these SQL queries, statically defining sets of input rows and expected output rows, asserting that the query does exactly what it’s supposed to. We’re working on implementing this feature within Dataform, and expect to have basic unit test support out soon. In the meantime, a useful tool which can help here is &lt;a href="https://dataform.co/blog/data-assertions/?utm_medium=organic&amp;amp;utm_source=dev_to&amp;amp;utm_campaign=consider_sql"&gt;data assertions&lt;/a&gt;, with which you can express requirements of your input data - for example, correctness checks - before continuing to run your processing pipeline.&lt;/p&gt;

&lt;p&gt;Occasionally, you will want to run some particularly complicated data transformation logic. (For one interesting - if slightly insane - example, check out &lt;a href="https://medium.com/@urish/yes-i-compiled-1-000-000-typescript-files-in-under-40-seconds-this-is-how-6429a665999c"&gt;this&lt;/a&gt; Medium post.) Sometimes, when expressed in SQL, this can become difficult to read and/or maintain due to its complexity. However, there exists a nice solution to this problem: User-Defined Functions (UDFs). UDFs allow you to break out of SQL and use JavaScript or Python (depending on the warehouse) when you need the power of a full imperative programming language to implement your own function.&lt;/p&gt;
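&lt;p&gt;The same escape hatch can be sketched with SQLite's Python bindings, which let you register an imperative function and then call it by name from SQL. The email-normalization logic here is a made-up example of something awkward to express in pure SQL; warehouse UDFs work analogously, though their registration syntax differs.&lt;/p&gt;

```python
import sqlite3

# Sketch of the UDF idea using SQLite's Python bindings; warehouses
# offer the same escape hatch via their own CREATE FUNCTION syntax.
def normalize_email(email):
    # Imperative logic that would be awkward in pure SQL: strip dots
    # and "+suffix" from the local part, then lowercase everything.
    local, _, domain = email.partition("@")
    return local.split("+")[0].replace(".", "").lower() + "@" + domain.lower()

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL queries can call it by name.
conn.create_function("normalize_email", 1, normalize_email)

rows = conn.execute(
    "SELECT normalize_email('First.Last+promo@Example.COM')"
).fetchall()
```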

&lt;h3&gt;
  
  
  The future
&lt;/h3&gt;

&lt;p&gt;We’re seeing a general move towards expressing pipelines in plain SQL. Indeed, Apache Beam recently launched &lt;a href="https://beam.apache.org/documentation/dsls/sql/overview/"&gt;support&lt;/a&gt; for Beam SQL, allowing Java users to express transformations using inline SQL. I expect that as time goes on, we’ll see fewer and fewer processing pipelines expressed using Java/Python/Go, and much more work being done inside data warehouses using simple SQL, for all of the reasons discussed above.&lt;/p&gt;

</description>
      <category>sql</category>
    </item>
  </channel>
</rss>
