<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Joe Auty</title>
    <description>The latest articles on Forem by Joe Auty (@joeauty).</description>
    <link>https://forem.com/joeauty</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F922742%2F1d8edff0-6b3c-4990-a715-a1b893144fe8.jpeg</url>
      <title>Forem: Joe Auty</title>
      <link>https://forem.com/joeauty</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/joeauty"/>
    <language>en</language>
    <item>
      <title>The Easiest Way to Clone a PostgreSQL Database</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Tue, 06 Dec 2022 15:37:42 +0000</pubDate>
      <link>https://forem.com/joeauty/the-easiest-way-to-clone-a-postgresql-database-3kl</link>
      <guid>https://forem.com/joeauty/the-easiest-way-to-clone-a-postgresql-database-3kl</guid>
      <description>&lt;p&gt;If you use a managed database service like AWS RDS or Google Cloud SQL, it is easy to take a snapshot and create a new instance from it, but there is an even easier way to clone a database. At &lt;a href="https://www.redactics.com" rel="noopener noreferrer"&gt;Redactics&lt;/a&gt;, our use case was building automated workflows to test database migrations against a copy of your production database (incidentally, this is free to use in case it is of interest to you), but there are likely dozens of other use cases. For example, you may want to A/B test a particular database or query optimization.&lt;/p&gt;

&lt;p&gt;This approach leverages Unix pipes and the fact that &lt;code&gt;pg_dump&lt;/code&gt; output can be fed directly into &lt;code&gt;pg_restore&lt;/code&gt; as input. What would normally be two steps is combined into one, and you don't need to set aside disk space for an intermediate dump file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_dump -h [DB_hostname] -U [username] -W -d [source_database] -v -Fc | pg_restore -h [DB_hostname] -U [username] -W -v -d [destination_database]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since you likely need this to run without interaction (i.e., without being prompted for the database password), you might need to create a &lt;a href="https://www.postgresql.org/docs/current/libpq-pgpass.html" rel="noopener noreferrer"&gt;.pgpass file&lt;/a&gt; or leverage &lt;a href="https://www.postgresql.org/docs/current/libpq-envars.html" rel="noopener noreferrer"&gt;environment variables&lt;/a&gt; to store this information. This allows you to remove the &lt;code&gt;-h&lt;/code&gt;, &lt;code&gt;-U&lt;/code&gt;, and &lt;code&gt;-W&lt;/code&gt; arguments.&lt;/p&gt;
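&lt;p&gt;As a sketch, a &lt;code&gt;.pgpass&lt;/code&gt; entry or the equivalent libpq environment variables might look like the following; the hostname, username, and password here are placeholders:&lt;/p&gt;

```shell
# Hypothetical values; the .pgpass format is hostname:port:database:username:password
echo 'db.example.com:5432:*:myuser:mypassword' | tee "$HOME/.pgpass"
chmod 600 "$HOME/.pgpass"   # libpq ignores the file unless permissions are 0600

# Or, equivalently, libpq environment variables:
export PGHOST=db.example.com
export PGUSER=myuser
export PGPASSWORD=mypassword
```

&lt;p&gt;With either in place, the clone command shrinks to &lt;code&gt;pg_dump -d source_database -Fc | pg_restore -d destination_database&lt;/code&gt;.&lt;/p&gt;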

</description>
      <category>database</category>
      <category>postgres</category>
      <category>devops</category>
    </item>
    <item>
      <title>How To Get the Most Out of Airflow's Dynamic Task Mapping</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Mon, 05 Dec 2022 22:28:46 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-get-the-most-out-of-airflows-dynamic-task-mapping-5c74</link>
      <guid>https://forem.com/joeauty/how-to-get-the-most-out-of-airflows-dynamic-task-mapping-5c74</guid>
      <description>&lt;p&gt;Dynamic task mapping (DTM) is a major feature that adds a lot of flexibility to how you build your DAGs. However, it does not provide infinite flexibility or free you from Airflow's patterns. For example, your task mappings are constrained by the datatypes supported by XCom, namely Python dicts and lists. I have not found a way to dynamically set a Kubernetes secret for the &lt;code&gt;KubernetesPodOperator&lt;/code&gt;, for instance.&lt;/p&gt;

&lt;p&gt;Python allows you to &lt;a href="https://docs.python.org/3/library/pickle.html" rel="noopener noreferrer"&gt;pickle&lt;/a&gt; an object such as a Kubernetes secret, converting it into a byte stream, and XCom does support pickled data, but I have not found a way to use this in conjunction with DTM.&lt;/p&gt;

&lt;p&gt;Here are some useful tips and observations collected from our work with DTM at &lt;a href="https://www.redactics.com" rel="noopener noreferrer"&gt;Redactics&lt;/a&gt;:&lt;/p&gt;

&lt;h3&gt;Working with Global Variables&lt;/h3&gt;

&lt;p&gt;Understand that values set inside mapped task functions are assigned at runtime, after variables outside of these functions have already been set at DAG parse time. Therefore, you cannot, for example, assign a value to a global variable inside a mapped task and expect that value to be available outside of it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_vars&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Secret&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;volume&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/secretpath/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secretname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;secrets&lt;/span&gt;

&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_wf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;init_workflow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;init_wf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partial&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;set_vars&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;start_job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;KubernetesPodOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start-job&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres:12&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cmds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;uname&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;secrets&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;secrets&lt;/span&gt; &lt;span class="c1"&gt;### this value is going to be null
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;init_workflow&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;start_job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;secrets&lt;/code&gt; value in the KubernetesPodOperator is going to be the empty list, because &lt;code&gt;start_job&lt;/code&gt; is initialized at DAG parse time, before &lt;code&gt;set_vars&lt;/code&gt; has run.&lt;/p&gt;

&lt;h3&gt;You Can Duplicate Your DAGs To Have Them Run Within a Different Context, e.g. a Different Schedule&lt;/h3&gt;

&lt;p&gt;Not the prettiest pattern, but you can duplicate your DAGs, for example with a Docker entrypoint script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nv"&gt;arr&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt; &lt;span class="s2"&gt;"workflow1"&lt;/span&gt; &lt;span class="s2"&gt;"workflow2"&lt;/span&gt; &lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for &lt;/span&gt;workflow_id &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;arr&lt;/span&gt;&lt;span class="p"&gt;[@]&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /tmp/dag-template.py /opt/airflow/dags/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;workflow_id&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="nt"&gt;-mydag&lt;/span&gt;.py
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, your DAGs can pull a configuration from an API, environment variables, files, or whatever makes the most sense for serving up these variations to your DAG. This may seem obvious, and it certainly isn't pretty, but we had to bite this bullet because we needed certain parameters (e.g. the schedule) to vary per DAG, and this is not territory DTM covers.&lt;/p&gt;

&lt;p&gt;If you elect to retrieve some values via an API, even more of the DAG becomes dynamic, so it doesn't need updating whenever you want to change values. We elected to embed the &lt;code&gt;workflow_id&lt;/code&gt; in the filename and pass it on to the API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dag_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;dag_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dag_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-mydag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;API_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;API_HOST&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.yourthing.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Content-type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text/plain&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-api-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;apiUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;API_HOST&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/airflowconfigs/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;dag_id&lt;/span&gt;
&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;apiUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;wf_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this near the top of the DAG ensures that &lt;code&gt;wf_config&lt;/code&gt; is available as a global variable throughout your DAG. You can control how often your API is polled, and if you are concerned about how this scales, cache these configs with Redis.&lt;/p&gt;

&lt;h3&gt;Accessing the Context Object, Including DagRun Params, Requires the TaskFlow API&lt;/h3&gt;

&lt;p&gt;If you are using the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html" rel="noopener noreferrer"&gt;Airflow REST API&lt;/a&gt; and &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/post_dag_run" rel="noopener noreferrer"&gt;passing a conf object to the DAGRun endpoint&lt;/a&gt;, for example, you cannot access these arguments from within a "classic"-style operator such as &lt;code&gt;PythonOperator&lt;/code&gt;. Instead, you must use the TaskFlow API, which is designed for use with DTM. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;start_job&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;params&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myparam&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>How To Automate Database Migration Testing/Dry-runs in Your CI/CD Pipelines</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Mon, 28 Nov 2022 17:19:54 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-automate-database-migration-testingdry-runs-in-your-cicd-pipelines-549d</link>
      <guid>https://forem.com/joeauty/how-to-automate-database-migration-testingdry-runs-in-your-cicd-pipelines-549d</guid>
      <description>&lt;p&gt;The most logical starting place is to stand up a new database cluster based on a snapshot, especially if you are using a managed database service like AWS RDS where doing so is easy. However, this still requires some manual steps, or some sort of script to automate it via your cloud provider's CLI. And what about the migration job itself? What about teardown? Is there a way to automate all of this and incorporate it into your CI/CD pipeline?&lt;/p&gt;

&lt;h3&gt;Is This Even Worth Automating?&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--knOYIHPa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ro38piupuno5asl7998.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--knOYIHPa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7ro38piupuno5asl7998.jpeg" alt="Is this worth automating?" width="880" height="440"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This depends on how risky your typical migrations are, and whether they tend to manipulate data or simply manipulate schema. If you only need to change schema, you probably don't need to conduct a dry run at all. But having a way to simulate your more stressful migrations is a very nice-to-have, because when a migration fails mid-run you have to treat it as an outage while you repair tainted data (or at least assess whether your data is tainted). If you are diligent about wrapping queries in transactions that can be rolled back when something goes wrong, perhaps you don't need a solution like this either. Even then, it is often easier to take the conservative approach and run a test like this than to trust that your migrations, as coded, will really provide the safety net you need.&lt;/p&gt;

&lt;p&gt;Having automation for your safety net is nice. If you don't have a solution for this, then by the time you have found or built one, tested it, and documented it for next time, you could have put together a more permanent solution you can trust, such as the approach described here:&lt;/p&gt;

&lt;h3&gt;Consider the Hardware and Load Required by Your Migration(s)&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K_YIxU6T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpd35xrx090025o0q09e.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K_YIxU6T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tpd35xrx090025o0q09e.jpeg" alt="Overloaded CPU" width="750" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I like using the exact same hardware to provide as accurate a simulation as possible, or at least hardware with the same memory allocation. That way, if you have written some sort of loop or data-mapping process that reads data into memory, anything destined to run out of memory in production will run out of memory during your simulation instead. Of course, it is much better to rely on a small number of SQL queries wrapped in transactions than on some sort of row-by-row loop, but if you can't get around that (or can't afford the time to level up your SQL skills), I suggest at least considering the hardware this will run on.&lt;/p&gt;

&lt;p&gt;If you are extremely concerned about the load of these migrations, testing on a completely new database cluster might be best; otherwise, simply creating a new database on the same cluster will work. More often than not the latter is just fine, and it is best to focus on optimizing your SQL and your approach to the migration rather than attempting to coddle your infrastructure. Your database is designed to take a beating (within reason), and if your migrations have memory consumption issues, a new database cluster with the same hardware specs is not going to help.&lt;/p&gt;

&lt;h3&gt;The Ingredients&lt;/h3&gt;

&lt;p&gt;Here are the ingredients you need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Something to create a copy of your production database, taking into account the question of whether this should be a brand new database cluster or simply a new database on your existing cluster. We will assume the latter for the remainder of this blog post, and will consider the use case to be more about assuring that the migrations will run smoothly without resulting in data integrity issues, as opposed to trying to ensure your underlying infrastructure doesn't explode.&lt;/li&gt;
&lt;li&gt;Some sort of variable in your CI/CD pipeline that says "I want these migrations to run on my clone, and I do not want to trigger a code deploy".&lt;/li&gt;
&lt;li&gt;The changes to your database migration job and CI/CD pipeline to respect this variable: use it to override the database the migrations normally run against, and to skip deploying a new application release.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;The Steps&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LR2LOSQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czyravf5r74pkzrvqzz0.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LR2LOSQC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czyravf5r74pkzrvqzz0.jpeg" alt="recipe ingredients" width="700" height="1050"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My preferred way to clone a database to another database on the same infrastructure, in PostgreSQL for example, is piping &lt;code&gt;pg_dump&lt;/code&gt; output into &lt;code&gt;pg_restore&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;pg_dump&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;dbhost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="n"&gt;pguser&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="k"&gt;source&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Fc&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;pg_restore&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="n"&gt;dbhost&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;U&lt;/span&gt; &lt;span class="n"&gt;pguser&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;W&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="k"&gt;database&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you've run this command before, you might have to drop existing connections and then drop and recreate the database. Since you will likely need this command to be non-interactive, you'll need either a &lt;a href="https://www.postgresql.org/docs/current/libpq-pgpass.html"&gt;.pgpass file&lt;/a&gt; or environment variables containing this info, so you can remove the &lt;code&gt;-h&lt;/code&gt;, &lt;code&gt;-U&lt;/code&gt;, and &lt;code&gt;-W&lt;/code&gt; arguments.&lt;/p&gt;
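&lt;p&gt;A minimal sketch of that reset step might look like the following; the database names are placeholders, and it is guarded so it only runs when a &lt;code&gt;PGHOST&lt;/code&gt; is actually configured:&lt;/p&gt;

```shell
#!/bin/bash
# Hedged sketch of a re-clone reset step; "production" and "production_clone"
# are placeholder database names.
CLONE_DB="production_clone"

# SQL that kicks any remaining sessions off the clone so it can be dropped
TERMINATE_SQL="SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = '${CLONE_DB}' AND pid != pg_backend_pid();"

# Only talk to PostgreSQL when a database host is configured
if [ -n "${PGHOST:-}" ]; then
  psql -d postgres -c "${TERMINATE_SQL}"
  psql -d postgres -c "DROP DATABASE IF EXISTS ${CLONE_DB};"
  psql -d postgres -c "CREATE DATABASE ${CLONE_DB};"
  pg_dump -Fc -d production | pg_restore -d "${CLONE_DB}"
fi
```

&lt;p&gt;Folding this in before the clone command front-loads the connection-dropping issues mentioned above.&lt;/p&gt;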

&lt;p&gt;How a variable is created for your CI/CD pipeline is obviously platform-specific, so I'll leave that with you. We like to use Kubernetes Helm for deployments, and with Helm we set up our database migrations as a &lt;a href="https://helm.sh/docs/topics/charts_hooks/"&gt;pre-upgrade chart hook&lt;/a&gt;. If your setup is similar, you can create a new hook that runs the above command with a lower weight than your normal database migration hook, so that it runs before your migration.&lt;/p&gt;

&lt;p&gt;Then, all you need to do is set up another variable containing the name of your database clone (i.e. &lt;code&gt;production_clone&lt;/code&gt; as per our example above), and feed this value to your migration job as an override to the database normally used for your migration. This is definitely worth testing in your non-production environment first, and the beauty of this approach is that you'll have this same database migration setup in non-production to piggyback off of. Using variables passed into Helm, you can easily handle these workflow modifications, including skipping the deployment when your boolean "do the dry run instead" variable is set.&lt;/p&gt;
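&lt;p&gt;As a rough sketch of that gating logic, with hypothetical variable names and stub functions standing in for your real pipeline steps:&lt;/p&gt;

```shell
#!/bin/bash
# Hypothetical CI/CD gate: MIGRATION_DRY_RUN and CLONE_DB_NAME are assumed
# variable names; run_migrations and deploy stand in for your real pipeline steps.
run_migrations() { echo "running migrations against: ${1}"; }
deploy() { echo "deploying new release"; }

MIGRATION_DRY_RUN="${MIGRATION_DRY_RUN:-false}"
CLONE_DB_NAME="${CLONE_DB_NAME:-production_clone}"

if [ "${MIGRATION_DRY_RUN}" = "true" ]; then
  # Dry run: point the migration job at the clone and skip the deploy
  run_migrations "${CLONE_DB_NAME}"
else
  run_migrations "production"
  deploy
fi
```

&lt;p&gt;The same conditional works whether it lives in a CI script or in values passed to Helm.&lt;/p&gt;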

&lt;p&gt;Unfortunately, if you are using Helm, there is no way to stream the logs of these jobs into your CI/CD pipeline while they run, although there is a &lt;a href="https://github.com/helm/helm/issues/3481"&gt;proposed issue&lt;/a&gt; to add this feature to Helm. For now, you'll have to rely on the "no failure = success" Linux/Unix convention and retrieve your logs via &lt;code&gt;kubectl&lt;/code&gt; to troubleshoot any issues.&lt;/p&gt;

&lt;p&gt;If you are not using Kubernetes or Helm, the same flows should still work for you; you'll just need to adapt them to your situation. For example, you could create a shell script that issues a &lt;code&gt;psql&lt;/code&gt; command to run the clone command above, or set this up in whatever framework you use (e.g. in Rails this could be a Rake task).&lt;/p&gt;

&lt;p&gt;As far as teardown goes, you could drop your database clone, but since you've already allocated the storage resources for the clone, there aren't going to be any cost savings in tearing it down now vs. just leaving the clone dormant. We prefer to drop before re-cloning rather than dropping on teardown, just to front-load any issues that might exist with this step (such as dropping existing connections).&lt;/p&gt;
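&lt;p&gt;For reference, the drop step (including kicking out existing connections, the issue mentioned above) can be handled with something like the following; &lt;code&gt;production_clone&lt;/code&gt; is our example name:&lt;/p&gt;

```sql
-- Terminate any open connections to the clone, then drop it before re-cloning.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'production_clone'
  AND pid != pg_backend_pid();

DROP DATABASE IF EXISTS production_clone;
```

&lt;p&gt;On PostgreSQL 13+ you can instead use &lt;code&gt;DROP DATABASE production_clone WITH (FORCE);&lt;/code&gt; to do both in one statement.&lt;/p&gt;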

&lt;h3&gt;
  
  
  Don't Let This Clone Tempt You
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2_c730dz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbl9d05832dqqerlb4cc.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2_c730dz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sbl9d05832dqqerlb4cc.jpeg" alt="Temptations" width="880" height="908"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because this clone doesn't provide any data privacy tooling, it is not a good idea to just start using it for general-purpose troubleshooting and testing. For one, it's on the same infrastructure, so you'd have to be mindful of your experimentation causing load issues that impact production traffic, and you'd also have to sort out access issues to create users that have access to this clone but not the master database. The big reason in my mind, though, is that without a data privacy solution you are exposing all of the sensitive/confidential information in the production database, including its PII, to your users. You may trust these users with your life, but data breaches and leaks are statistically almost always the result of human error, so it is far better not to risk a data leak, or copies spawning more copies throughout your organization that have to be accounted for, especially since with an adequate data privacy solution there is no reason you need to accept this risk.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Complete Turnkey Solution
&lt;/h3&gt;

&lt;p&gt;If you are looking for the easiest approach that aligns with the above, check out &lt;a href="https://www.redactics.com"&gt;Redactics&lt;/a&gt;. Their most recent release includes a workflow for handling the above database migration recipe; it just requires plugging into your current setup. It also provides additional workflows for data privacy, including one for creating a PII-free database clone that auto-updates with delta updates going forward. It is very convenient to use the same software for multiple use cases, since once you have it installed in your infrastructure, enabling additional workflows is a very easy lift.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>database</category>
      <category>programming</category>
    </item>
    <item>
      <title>How To Poll an Airflow Job (i.e. DAG Run)</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Tue, 22 Nov 2022 15:31:53 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-poll-an-airflow-job-ie-dag-run-1ffc</link>
      <guid>https://forem.com/joeauty/how-to-poll-an-airflow-job-ie-dag-run-1ffc</guid>
      <description>&lt;p&gt;Ever wanted to actually know when an Airflow DAG Run has completed? Perhaps your use case involves this completed work being some sort of workflow dependency, or perhaps it is used in a CI/CD pipeline. I'm sure there are a myriad of possible scenarios here beyond ours at &lt;a href="https://www.redactics.com"&gt;Redactics&lt;/a&gt;, which is using Airflow to clone a database for a dry-run of a database migration, but you be the judge!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Redactics/airflow-dagrun-poller"&gt;Here is the repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this repo you can find a sample &lt;code&gt;Dockerfile&lt;/code&gt;, a sample Kubernetes job that provides some context as to how this poller script can be used, as well as the script itself. There isn't much to it, but I hope it saves some developers a moment or three of their time should they need to recreate this functionality!&lt;/p&gt;
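&lt;p&gt;To give a flavor of what such a poller does, here is a minimal sketch against Airflow's stable REST API (the &lt;code&gt;/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}&lt;/code&gt; endpoint). This is not the script from the repo, and the base URL, auth, and names are placeholders you would adapt:&lt;/p&gt;

```python
"""Minimal sketch: poll an Airflow DAG run until it reaches a terminal state.
Not the repo's script; base_url, credentials, and DAG names are placeholders."""
import json
import time
import urllib.request

TERMINAL_STATES = {"success", "failed"}

def is_terminal(state):
    """A DAG run is finished once Airflow reports success or failed."""
    return state in TERMINAL_STATES

def poll_dag_run(base_url, dag_id, run_id, interval=30, max_polls=120):
    """Return the terminal state of the DAG run, or raise after max_polls."""
    url = f"{base_url}/api/v1/dags/{dag_id}/dagRuns/{run_id}"
    for _ in range(max_polls):
        with urllib.request.urlopen(url) as resp:  # add an auth header as needed
            state = json.load(resp)["state"]
        if is_terminal(state):
            return state
        time.sleep(interval)
    raise TimeoutError(f"DAG run {run_id} did not finish after {max_polls} polls")
```

&lt;p&gt;In a CI/CD context you'd exit non-zero when the returned state is &lt;code&gt;failed&lt;/code&gt;.&lt;/p&gt;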

</description>
      <category>airflow</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Every Developer Should Know About Data Security/Privacy</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Fri, 18 Nov 2022 16:22:52 +0000</pubDate>
      <link>https://forem.com/joeauty/what-every-developer-should-know-about-data-securityprivacy-2e4</link>
      <guid>https://forem.com/joeauty/what-every-developer-should-know-about-data-securityprivacy-2e4</guid>
      <description>&lt;p&gt;There are studies, including &lt;a href="https://www.grcelearning.com/blog/human-error-is-responsible-for-85-of-data-breaches"&gt;this one&lt;/a&gt;, that show that the vast majority of data breaches are due to human error.&lt;/p&gt;

&lt;p&gt;Some of this human error is the result of being vulnerable to manipulation, social engineering, etc. Other forms are due to bad practices or even technical debt that lead to not adhering to the principles of least privilege, for example.&lt;/p&gt;

&lt;p&gt;The obvious remedy for the former is security training (whether required for compliance or voluntary). Understanding what Johnny did wrong to infect his PC with malware, or how Jane was tricked by a hacker wearing a hoodie into clicking on a thing in her email, can be useful to some extent (depending on how charitable one wants to be), but to others (including myself) this is just super lame. I've always wanted to better understand how and when to access the production database, how to identify sensitive information (including PII), what specific things developers should be doing and not doing in their day-to-day development, etc. Here is what I found that I feel is worth sharing with developers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Understand that Unfettered Access to the Production Database is Not Ideal
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NLaK2o74--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nbbmx0qfb9zf7fbugaav.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NLaK2o74--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nbbmx0qfb9zf7fbugaav.jpg" alt="It's bad mkay" width="700" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I know this is super obvious to many, but this is as good a starting place as any. Some less experienced developers I've met have felt that under no terms should anybody have access to the production data, and in some companies VPN access to the database is embedded in the working culture. In reality, there are always "break glass" scenarios where somebody needs to make an exception to any established rules for the purposes of fire-fighting. Even if you are not fire-fighting, there may be times when access is required in lieu of a better option. Therefore, my position is not an absolutist "no, never, under any circumstances", but at the very least, we should all strive towards providing better, more sustainable solutions. I think many developers understand this, but we tend to kick that can down the street without a justifiable replacement option.&lt;/p&gt;

&lt;p&gt;Even if the risk of leaking sensitive information is somehow not compelling, we all know that a single bad query can cause a myriad of problems, whether load-related or actual data integrity issues from an errant query (or script). To state the obvious yet again: there is no "undo" button in a database, sadly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moving Data Off of the Production Database For Analysis/Testing
&lt;/h3&gt;

&lt;p&gt;Moving data elsewhere is a perfectly rational thing to entertain — it's sort of the idea of taking the bullets out of the loaded gun.&lt;/p&gt;

&lt;p&gt;The problem is, if by doing so you create a copy of this data and this data contains sensitive information, you are potentially creating data leaks, especially if these data copies start breeding more and more copies throughout your organization (including on laptops). It is important to understand the risks involved here and start thinking about a possible solution rather than trading off one can to kick down the street for another.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are Data Leaks Really All That Common? Is the Risk Overhyped a Little?
&lt;/h3&gt;

&lt;p&gt;Truthfully, the actual risk of a copy of production data on an engineer's laptop is probably minimal, but if this becomes a regular practice, it becomes extremely difficult to manage. If your company is subject to some sort of compliance, this may hold up a passing grade on an audit. If you aren't, but your company is ready to exit, an auditor will likely look for skeletons in your closet in the form of bad data management practices. Nobody wants the PR nightmare of a data leak or some sort of exposé on sloppy data management practices. In many cases the psychological pressures and possible narratives here dominate over the actual reality.&lt;/p&gt;

&lt;p&gt;This is why there are very successful, highly valued companies (&lt;a href="https://www.crunchbase.com/organization/datagrail"&gt;an example&lt;/a&gt;) that help identify sensitive information throughout organizations. I'm not criticizing the existence of these companies or downplaying their utility; they are clearly a necessary evil. But I feel that bad data practices are a sort of original sin that provides companies like this their business in the first place. By the time you need them you are kind of trying to put that horse back in the barn.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Do We Keep That Horse in the Barn?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1rC8E60f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i12aofr3vq5vcvfz1xgp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1rC8E60f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i12aofr3vq5vcvfz1xgp.png" alt="Horse in barn" width="880" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I suggest checking out &lt;a href="https://www.redactics.com"&gt;Redactics&lt;/a&gt;. What I feel is compelling about the approach taken here is they automate creating production samples and sending that data to wherever it should go (whether this is another database or an Amazon S3 bucket, for example), and redacting data is a part of this same process. By the time the data arrives at its desired destination, the sensitive information and PII is already gone. There is nothing to account for (aside from the engineers that have physical access to the infrastructure where this software runs). Moreover, this software is free for developer usage without caveat (and may soon be open-sourced), so this product is not some pay-to-play sort of solution targeted only at large enterprise companies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Know Your Data
&lt;/h3&gt;

&lt;p&gt;Whether your solution of choice automatically identifies sensitive information and PII or you have to find it yourself, I feel it's important for developers to be aware of specific fields that contain this information, and at least document this in some way so it can be accounted for, however you choose to do so.&lt;/p&gt;

&lt;p&gt;With this information you could, for example, create table grants that forcefully block access to this data for non-privileged users, or if you are using a solution like Redactics you can ensure that all such fields are included in your redaction configuration. Redactics does include a PII scanner tool that looks for PII using HIPAA identifiers against a limited sample of your production data, but this tool is intended as a guide and an additional check. It is much better to simply know your data.&lt;/p&gt;
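&lt;p&gt;As a hypothetical example of the grants approach (table and column names invented here), PostgreSQL supports column-level grants:&lt;/p&gt;

```sql
-- Give a non-privileged role access to only the non-sensitive columns of
-- "users"; email, ssn, hashed_password, etc. remain blocked for this role.
CREATE ROLE analytics NOLOGIN;
REVOKE ALL ON users FROM analytics;
GRANT SELECT (id, created_at, plan_tier) ON users TO analytics;
```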

&lt;p&gt;It is also important to note that it is not just personal information we need to be careful about, but also sensitive or confidential information like API keys and hashed passwords. In my experience, this information is rarely actually needed for QA, analysis, etc., so my rule of thumb is to apply a redaction rule to whatever you are unsure of, especially if there is no harm in doing so.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Re-identification
&lt;/h3&gt;

&lt;p&gt;In some cases leaving certain (needed) data in its original state can have unintended consequences from a security perspective. For example, if you redact a person's name and email but not the town they live in, then in a small enough town somebody with access to this data who is really invested in knowing who this person is might be able to figure it out if they also have, say, the person's age.&lt;/p&gt;

&lt;p&gt;Redacting fields is not always a bullet-proof solution, especially if the use case requires retaining a certain amount of data for analysis (analytics, machine learning, etc.). If you feel this risk is relevant to you, looking at a tactic such as differential privacy might be worthwhile.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Differential Privacy?
&lt;/h3&gt;

&lt;p&gt;Differential privacy is, loosely speaking, a giant shuffle button you press to take values in a dataset and assign them to random users within that dataset (formal implementations add calibrated statistical noise rather than literally shuffling). This way, the numerical values in the aggregate remain approximately the same, yet if you capture a certain row of data, re-identification is no longer possible. &lt;a href="https://en.wikipedia.org/wiki/Differential_privacy#Adoption_of_differential_privacy_in_real-world_applications"&gt;A number of BigTech companies&lt;/a&gt; have adopted differential privacy tactics in their platforms and applications.&lt;/p&gt;
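&lt;p&gt;The "shuffle button" idea can be illustrated with a toy example: randomly permute a sensitive column across rows, so aggregates are preserved exactly while individual rows no longer line up with real people. A formal differential privacy mechanism would add calibrated noise instead; this is just the intuition:&lt;/p&gt;

```python
"""Toy illustration of shuffling a sensitive column: aggregates (sum, mean)
are unchanged, but values no longer attach to the right user."""
import random

def shuffle_column(rows, column):
    """Return a copy of rows with the given column's values randomly permuted."""
    values = [row[column] for row in rows]
    random.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]

rows = [
    {"user": "a", "salary": 50},
    {"user": "b", "salary": 90},
    {"user": "c", "salary": 70},
]
shuffled = shuffle_column(rows, "salary")
# the aggregate (the sum of salaries) is identical before and after the shuffle
```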

&lt;h3&gt;
  
  
  Don't be Paralyzed
&lt;/h3&gt;

&lt;p&gt;Perhaps some developers are so concerned about the possible risks here that they stick with their seed/dummy data, or find workarounds that don't involve working with their production data, even if these are less efficient. You simply need to be mindful of these issues, and with a solution in place that you are confident in (whether Redactics or something else), you don't have to be blocked. Chances are this same solution will unblock a whole bunch of other use cases in your company; your production data is extremely valuable when leveraged to its full potential!&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>security</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>How To Stop Living With Your Seed Data Sucking</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Thu, 17 Nov 2022 16:56:52 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-stop-living-with-your-seed-data-sucking-4lej</link>
      <guid>https://forem.com/joeauty/how-to-stop-living-with-your-seed-data-sucking-4lej</guid>
      <description>&lt;p&gt;So often throughout my career I've come across projects where seed data was created during the very early days of a repo, and the bare minimum is done to update this seed data as schema changes. If a new string column is added, maybe at best developers will add a "blah" on required string/varchar fields just to get tests to pass, but in no way is this data robust enough to really back rigorous tests, and it is a pain to maintain for thorough regression testing.&lt;/p&gt;

&lt;p&gt;The main problems, as I see them, are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When writing unit tests it can be challenging to think of the edge cases your users will find, and to conjure up the time and creativity to challenge your code with good seed data.&lt;/li&gt;
&lt;li&gt;As your users expose new edge cases, developers have to go back and maintain this seed data if you want your test coverage to account for these edge cases (which is generally a good idea).&lt;/li&gt;
&lt;li&gt;Unit tests typically mock certain data from other services or from external APIs/data sources, while integration tests are designed to provide a more in-depth sort of experience using your product with as realistic data as possible. Developers typically don't have a great way to generate datasets robust enough to justify this entire exercise.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I don't have a great solution to the first problem; unfortunately, we all lack crystal balls. However, let's discuss the other challenges...&lt;/p&gt;

&lt;h3&gt;
  
  
  How Do We Come Up With Realistic Data for Integration Testing?
&lt;/h3&gt;

&lt;p&gt;There are SaaS providers that create synthetic data, which is fake data that has all of the same characteristics as real data. I haven't used it for machine learning modelling, but I would imagine it excels for that purpose, using logic to cover all possible data permutations. However, I'm not sold on it for QA/testing purposes. In my experience, the most realistic form of production-like data is... production data. Customers always seem to have a knack for wanting to do unexpected things with the products I've supported, and following their lead seems like the best approach. The problems with using production data have always been grabbing a fairly small sample (you don't want gigabytes of data belonging to users that are not being used for your tests), and protecting customers' PII and sensitive information. There are solutions for both of these problems...&lt;/p&gt;

&lt;h3&gt;
  
  
  How Do We Support New Features with Production Data?
&lt;/h3&gt;

&lt;p&gt;We are back at the chicken-and-egg problem, as described above. However, in my experience integration tests are great for regression testing, and oftentimes it is a legacy feature or a rarely-exercised part of your code, outside the happy path, that breaks. So use production data as test data for existing features, and do your best with fake data for unreleased features that have no production data yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does This Mean Resetting the Data Used For Integration Tests Periodically?
&lt;/h3&gt;

&lt;p&gt;Yes. Fortunately, this too can be automated (and integrated with your CI/CD platform). When should this be done? Well, I've also found value in using these same datasets for local testing, particularly when a customer has reported a problem; so whenever I install a dataset locally, after I resolve the issue I update my integration tests to use this same dataset. You could also do this before a major release, at the beginning of a coding cycle, or whatever works for you and your team. This does require creating some new habits, but it's worth it to have smooth releases without the frenzy of fire-fighting.&lt;/p&gt;

&lt;h3&gt;
  
  
  With This Production Data Sample, How Do We Impersonate Specific Users?
&lt;/h3&gt;

&lt;p&gt;Your sample can be for the most recent 5-10 users, some of your super users, or even just a random selection of users. Your approach of choice will influence your tactics, but it's not hard to take over an account once you have physical access to the underlying data.&lt;/p&gt;
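&lt;p&gt;As a hypothetical example (schema invented, and only ever in your local/testing copy, never production), taking over an account can be as simple as pointing its login at a credential you control:&lt;/p&gt;

```sql
-- In the LOCAL copy only: swap in a development email and a password digest
-- generated by your framework's hasher (the value below is a placeholder).
UPDATE users
SET email = 'dev@example.com',
    encrypted_password = 'REPLACE_WITH_A_DEV_PASSWORD_DIGEST'
WHERE id = 42;
```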

&lt;h3&gt;
  
  
  How Do We Actually Do All Of This?
&lt;/h3&gt;

&lt;p&gt;Check out &lt;a href="https://www.redactics.com/"&gt;Redactics&lt;/a&gt;. It is free, there are plans to open source it, and it is ideal to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate production data samples (support is coming to pull data for specific users, but you can extract data from specific time periods).&lt;/li&gt;
&lt;li&gt;Handle all PII and sensitive data so that it never finds its way into your testing environments.&lt;/li&gt;
&lt;li&gt;Automatically generate datasets on-demand or on schedule that are pushed to an Amazon S3 bucket your tests can pull from.&lt;/li&gt;
&lt;li&gt;Install these same datasets locally for local testing/QA.&lt;/li&gt;
&lt;li&gt;Automate not only generating these datasets, but also installing them directly into your environment. For example, for local usage there is a CLI tool to install the data locally via docker-compose.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thoughts? Really curious to hear what you have to say!&lt;/p&gt;

</description>
      <category>testing</category>
      <category>programming</category>
      <category>database</category>
    </item>
    <item>
      <title>Microservices Without the Complexity or Architectural Concessions</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Tue, 15 Nov 2022 03:32:33 +0000</pubDate>
      <link>https://forem.com/joeauty/microservices-without-the-complexity-or-architectural-concessions-5blk</link>
      <guid>https://forem.com/joeauty/microservices-without-the-complexity-or-architectural-concessions-5blk</guid>
      <description>&lt;p&gt;Virtually every backend engineer I've ever met has some sort of opinion about microservices. If I had to summarize the entire discussion:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Not having a giant monolith that requires tactics for establishing code boundaries and preventing code paths from getting all tangled up is at least of interest.&lt;/li&gt;
&lt;li&gt;The mere mention of microservices surfaces complex subjects about messaging between services, and data ownership and sharing.&lt;/li&gt;
&lt;li&gt;In my opinion, the messaging problems are largely addressed with some sort of pub/sub service, Kafka, or something that will require messages to be acknowledged and retried in the event of failure (http is usually not the best choice for this).&lt;/li&gt;
&lt;li&gt;The data issues are the most vexing. Namely, once you've determined where that source of truth comes from, what do you do when some piece of data needs to be shared with another service (as it often does)? Do you request it as needed via your messaging service? Create a dedicated service to serve up this data, spawning more and more services in the process? Invest in solutions that provide real-time data replication through Kafka? Rationalize sharing the same database?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What are some simple options that should not offend the purists among us?&lt;/p&gt;

&lt;h3&gt;
  
  
  Database Sharing
&lt;/h3&gt;

&lt;p&gt;This one is often a non-starter for many, and I'm not necessarily advocating for this approach, but I disagree with some of the reasons against this option.&lt;/p&gt;

&lt;p&gt;In my opinion, infrastructure and application code do not have to be joined at the hip. I don't see a problem with, for example, having a single database cluster running multiple databases and having this cluster shared between services. Yes, you don't get resource isolation, so immense load from one service can affect another, but let's face it, databases are designed to withstand a lot of load. We don't necessarily have to coddle them. Unless your service is getting absolutely bananas traffic, sharing a cluster is not necessarily the end of the world.&lt;/p&gt;

&lt;p&gt;If we can rationalize sharing a cluster, why not a database, especially if the same caveat of not being overly concerned with load applies? We can prevent access to tables that are not the business of the service in question with appropriate table/column grants. Yes, this requires diligently keeping up with updating and applying these grants, but this is possible, and may even be a good option if, for example: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The service will only ever need read-only access.&lt;/li&gt;
&lt;li&gt;The schema of the service is fairly static and not likely to change much.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If these simple sorts of conditions apply to you (maybe this is an experimental service and you don't want to invest too much into it, or the context otherwise justifies it), perhaps this is better than going crazy with database replication?&lt;/p&gt;
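&lt;p&gt;For the read-only case, the grants might look something like this (role and table names hypothetical):&lt;/p&gt;

```sql
-- A reporting service gets read-only access to only the tables that are its
-- business on the shared database; nothing else is granted.
CREATE ROLE reporting_service LOGIN PASSWORD 'change-me';
GRANT CONNECT ON DATABASE app TO reporting_service;
GRANT USAGE ON SCHEMA public TO reporting_service;
GRANT SELECT ON orders, order_items TO reporting_service;
```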

&lt;h3&gt;
  
  
  Data Sharing
&lt;/h3&gt;

&lt;p&gt;There isn't anything necessarily wrong with creating copies of data, so long as:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The source of truth is never in question.&lt;/li&gt;
&lt;li&gt;You aren't spreading or leaking sensitive information/PII beyond where it needs to reside, from a data security perspective.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At &lt;a href="https://www.redactics.com/?utm_campaign=simple_microservices&amp;amp;utm_medium=web&amp;amp;utm_source=devto"&gt;Redactics&lt;/a&gt; we built what we think is a really simple approach to all of this, and it is free for developer usage (by the time you read this the relevant technologies here might be open source). I won't jump into sales talk, you can decide for yourself whether this approach is to your liking, but we clone specific tables using only SQL and provide options to redact sensitive information including PII.&lt;/p&gt;

&lt;p&gt;The first time a table is cloned it is copied in its entirety; the next time, only changes are copied (i.e. delta updates) using the techniques described in &lt;a href="https://blog.redactics.com/how-to-sync-data-between-two-databases"&gt;this blog post&lt;/a&gt;. The result, particularly with our support for a lot of concurrency, is updates in near real-time, but without all of the cost and overhead of technologies like Kafka/Confluent, logical replication, etc.&lt;/p&gt;

&lt;p&gt;If your service really needs up-to-the-second data, this approach is probably not for you, but if it can afford to be a few minutes behind, we hope it is worth entertaining. We get around source-of-truth challenges by ensuring that each replicated table includes a column called &lt;code&gt;source_primary_key&lt;/code&gt; that relates it back to the original master. This way it is clear that this is a copy. You probably don't want to update this replicated data, but if you have to, this column will help you reconcile possible differences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do These Options Tip the Scales?
&lt;/h3&gt;

&lt;p&gt;That entirely depends on the situation and context, but these options certainly provide the usual benefits of having multiple services (which I don't think I need to spell out here) while making important compromises to keep the complexity level down. If you're a startup, keeping complexity down can be a great thing, and even if you aren't, not every situation warrants the most complex solution, just like not every website needs to run on Kubernetes.&lt;/p&gt;

&lt;p&gt;Please let us know what you think! We are a new company, we really benefit from having conversations with engineers like you, no matter which way you are inclined to lean with these debates and balances.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>database</category>
      <category>architecture</category>
    </item>
    <item>
      <title>How To Deliver Timely Safe Production Data to your Engineering Teams</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Thu, 10 Nov 2022 03:34:24 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-deliver-timely-safe-production-data-to-your-engineering-teams-4id6</link>
      <guid>https://forem.com/joeauty/how-to-deliver-timely-safe-production-data-to-your-engineering-teams-4id6</guid>
      <description>&lt;p&gt;This guide showcases a brand new tool (free for use for developers) called &lt;a href="https://www.redactics.com/?utm_source=devto&amp;amp;utm_medium=web&amp;amp;utm_campaign=timelysafeproddata"&gt;Redactics&lt;/a&gt;, which is a managed appliance for powering a growing number of data management workflows using your own infrastructure. They would love to hear from you if you appreciate what they are building. This guide is one of a series of recipe-based instructions for solving specific problems and accomplishing specific tasks using Redactics, and its focus is in powering datasets for demo environments with the following features/characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A growing collection of “data feeds”, which are data delivery and replication options, including uploading data to an Amazon S3 bucket used to back a new or existing data lake, or a “digital twin” database/data warehouse which clones new production data (minus sensitive information and PII). If you do not see a data feed that is appropriate for your use case, please let us know, and we will tell you when it will be added based on our product roadmap.&lt;/li&gt;
&lt;li&gt;Options to delta-update this new data so that performance approaches real-time, allowing you to schedule updates every few minutes or so.&lt;/li&gt;
&lt;li&gt;Delivery of the original raw data from a selection of tables, with the only transformations being the handling of sensitive information including PII. This allows your teams to build out their own views and reports based on a copy of this original data, the focus of this workflow being simply to provide your stakeholders with a clean copy of this data that is being constantly updated.&lt;/li&gt;
&lt;li&gt;The replicated tables include a column called &lt;code&gt;source_primary_key&lt;/code&gt; which provides a definitive record of the original primary key from the production database, should data ever need to be reconciled against its master records.&lt;/li&gt;
&lt;li&gt;Options to replicate data from a certain time period for specific tables, in addition to full-copy options, and to exclude tables that provide no value in being replicated.&lt;/li&gt;
&lt;li&gt;Support to aggregate data from multiple input data sources within a single workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Create Your Redactics Account
&lt;/h3&gt;

&lt;p&gt;Have your engineer create your company’s Redactics account (or create it yourself and invite them to it). They will also need to create their first Redactics SMART Agent and workflow. Don’t worry about configuring the workflow for now; the engineer simply needs to follow the instructions to install the SMART Agent with an empty workflow of type &lt;strong&gt;ERL (Extract, Redact, Load)&lt;/strong&gt;. You can give this workflow any name you like, e.g. “ML Experimentation”. They’ll also need to define the master database by clicking on “Add Database” in the “Input Settings” section. This will also require listing all of the tables you intend to use within this workflow (and don’t worry, if you need to change these you can always come back later). Once the SMART Agent has been installed it will report back to the Redactics Dashboard, and you’ll see a green checkmark in the SMART Agent section:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--If9PP_oy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1668048121458/z1MBLXHAr.jpg%2520align%3D%2522left%2522" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--If9PP_oy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1668048121458/z1MBLXHAr.jpg%2520align%3D%2522left%2522" alt="A successful SMART Agent installation (based on its heartbeat tracking)&amp;lt;br&amp;gt;
" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this step complete, once you have established a working workflow and decide to update it later, the SMART Agent will automatically recognize these changes without a re-installation being required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Configure Your Redactics Workflow
&lt;/h3&gt;

&lt;p&gt;Return to your workflow configuration; your Input Settings should already be completed, but if you wish to aggregate data from multiple input sources you can define additional databases here. Some notes on this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The SMART Agent requires network access to each database.&lt;/li&gt;
&lt;li&gt;Please let us know if you require support for input sources other than the available options. It is not difficult for us to add support for additional input sources; we are prioritizing based on customer feedback!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once you’ve defined all of your input sources, proceed to do the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In “Processing/Transformation Settings” define all of your tables and fields containing sensitive information, and select a ruleset for handling this.&lt;/li&gt;
&lt;li&gt;If you are unsure that you’ve identified all of your PII, your engineer can install the SMART Agent CLI and kick off an automated scan using the PII Scanner tool. Results will be reported back to the “PII Scanner” section of the Dashboard where you can automate creating additional redaction rules to your configuration for these new fields.&lt;/li&gt;
&lt;li&gt;In the “Workflow Schedule Options” you can decide to put your workflow on a schedule. Please note that these times are in the UTC timezone (also known as Greenwich Mean Time or GMT), and custom times are expressed in crontab format. You can use &lt;a href="https://crontab.guru/"&gt;this guide&lt;/a&gt; to format a custom time if you wish. You might want to start by running these jobs overnight (e.g. to run at midnight UTC this custom time will be &lt;em&gt;0 0 * * *&lt;/em&gt;). For testing purposes your engineer can run these jobs manually whenever needed, and you can change this schedule whenever you want and have this recognized within minutes.&lt;/li&gt;
&lt;li&gt;In the “Output Settings” section, you can specify time periods for each table. Since a lot of databases include relationships to things like users and companies, it is advisable to include all of these tables so as not to break any relationships, but for data such as individual transactions made by users you can usually safely omit historic data within these tables. You can always return to this setting and make adjustments later. Note that these time periods can be based on creation times, update times, or both, and you’ll need to note which fields are used for recording these respective timestamps.&lt;/li&gt;
&lt;li&gt;Select as many data feeds as you wish for your workflow. For example, the “Create a PII-free Digital Twin/Clone” will create the aforementioned clone, and the “Upload/Sync Data to an Amazon S3 Bucket” will add data to an S3 bucket which could back a data lake.&lt;/li&gt;
&lt;li&gt;If you’ve selected the Digital Twin data feed option you will be provided with instructions for adding the connection info for this database to your SMART Agent configuration (which will require a one-time re-install to inject and save this information).&lt;/li&gt;
&lt;li&gt;Click “Update”, and then “Save Changes”.&lt;/li&gt;
&lt;/ol&gt;
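&lt;p&gt;The crontab format mentioned in step 3 is five space-separated fields. As a small illustration (a hypothetical helper, not part of Redactics), here is how &lt;em&gt;0 0 * * *&lt;/em&gt; breaks down:&lt;/p&gt;

```python
# Minimal sketch showing how the five crontab fields map to schedule
# components, e.g. "0 0 * * *" means "minute 0, hour 0, every day":
# daily at midnight UTC.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def parse_cron(expr):
    """Split a crontab expression into its named fields."""
    parts = expr.split()
    if len(parts) != 5:
        raise ValueError("crontab expressions have exactly five fields")
    return dict(zip(FIELDS, parts))

print(parse_cron("0 0 * * *"))
# {'minute': '0', 'hour': '0', 'day_of_month': '*', 'month': '*', 'day_of_week': '*'}
```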

&lt;h3&gt;
  
  
  Step 3: Running Your Workflow
&lt;/h3&gt;

&lt;p&gt;Congratulations, your data feeds will deliver production data to their destinations! To ensure that you are good to go, you can either bump up your schedule or else have an engineer invoke the workflow manually via the Redactics SMART Agent CLI. Any issues with the workflow will be reported to the Redactics Dashboard, and, of course, you are welcome to contact Redactics Support if you require any assistance!&lt;/p&gt;

&lt;p&gt;As workflows run, the progress will be reported to the &lt;a href="https://app.redactics.com/workflows/jobs"&gt;Workflows -&amp;gt; Jobs&lt;/a&gt; page, and when this work has been completed a report will be provided detailing what was copied. If you’ve enabled the Digital Twin data feed option with delta updates, on subsequent runs you’ll see visual feedback like the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gi-j2ht--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1668048319774/ayhZ4Xmm4.jpg%2520align%3D%2522left%2522" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gi-j2ht--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1668048319774/ayhZ4Xmm4.jpg%2520align%3D%2522left%2522" alt="deltafeedback.jpg" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that if the schema of any of these tables changes, this will be automatically detected and the table will be fully copied on its next run instead. This way, the table’s new schema will be applied in addition to any backfilled data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relating Data Back To Its Master
&lt;/h3&gt;

&lt;p&gt;This is worth re-iterating: when delta updates are enabled, a new column called &lt;code&gt;source_primary_key&lt;/code&gt; will be created containing the primary key of the master record. If you need to search for specific records in your digital twin, you’ll need to adjust your queries to use this field instead. One reason for this design is to establish a single source of truth, with that truth being the master record in your production database. Whenever this master record is updated, this update will automatically be applied to your test data on its next run so long as the &lt;code&gt;updated_at&lt;/code&gt; fields in these original tables have been updated.&lt;/p&gt;
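&lt;p&gt;As a hypothetical illustration (in-memory rows standing in for real tables), a lookup against the digital twin keyed on the master record’s primary key might look like this:&lt;/p&gt;

```python
# Illustrative only: in the twin, the master's primary key lives in the
# source_primary_key column, so a lookup that used the id column against
# production instead filters on source_primary_key in the twin.
twin_rows = [
    {"id": 1, "source_primary_key": 101, "email": "user-101@example.com"},
    {"id": 2, "source_primary_key": 102, "email": "user-102@example.com"},
]

def find_by_master_pk(rows, master_pk):
    """Return the twin row whose source_primary_key matches the master PK."""
    return next((r for r in rows if r["source_primary_key"] == master_pk), None)

print(find_by_master_pk(twin_rows, 102))
```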

&lt;h3&gt;
  
  
  Data Privacy By Design
&lt;/h3&gt;

&lt;p&gt;By setting up all of your stakeholders with access to the data they need, delivered by an appropriate data feed, you can sever any access they had to your production database. This establishes a new paradigm of safe data by default, your own “No PII Zone”, which in turn provides data privacy by design and spares you from having to account for production data access in your compliance audits as you otherwise would.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Case Study: Apache Airflow As a Managed Appliance</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Mon, 07 Nov 2022 17:48:54 +0000</pubDate>
      <link>https://forem.com/joeauty/case-study-apache-airflow-as-a-managed-appliance-2ppm</link>
      <guid>https://forem.com/joeauty/case-study-apache-airflow-as-a-managed-appliance-2ppm</guid>
      <description>&lt;p&gt;We bet our business around our customers being sold on running our software in their own infrastructure, rather than building a more traditional SaaS platform. &lt;a href="https://www.redactics.com/?utm_source=devto&amp;amp;utm_medium=web&amp;amp;utm_campaign=airflow"&gt;Redactics&lt;/a&gt; is all things data management: primarily data pipelining with PII removal for test and demo environments (including ML, analytics, etc.), data delivery to stakeholders, database migration/schema dry-running, data destruction, etc. In short, we are trying to position ourselves as a robust toolkit for building out automated data workflows. We are a new company and would love to hear from all of you, but I'll stop here so that this doesn't come across as a self-serving sales pitch!&lt;/p&gt;

&lt;p&gt;We feel that there would be numerous technological disadvantages to building this company as a SaaS platform: data security concerns in shipping data to the cloud, performance, cost overhead, etc. It seems like certain areas of technology have a knack of swinging back and forth like a pendulum - for example the classic mainframe vs. thin client question. Relatively speaking, there are not many companies taking the managed appliance approach (although we do supplement our local agent with our cloud-hosted APIs), so who knows, perhaps we are part of the pendulum swinging in the other direction (for very specific use cases) or are just out of our minds!&lt;/p&gt;

&lt;p&gt;The existence of Airflow and Kubernetes really tipped the scales for us in making this decision, it is hard to imagine building this company without these technologies. Airflow can certainly scale horizontally to manage a gazillion jobs and workloads, but we were impressed with how well it runs with a minimalist resource footprint, and how resilient it has been for us on very modest hardware. The improvements to the scheduler in 2.x and the dynamic task mapping features in 2.3 have been huge improvements, the latter in allowing our DAGs to be remote managed from the cloud (more on this later).&lt;/p&gt;

&lt;p&gt;We thought there might be an audience interested in understanding our tactics that have been successful for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubernetes and Helm
&lt;/h3&gt;

&lt;p&gt;We are big fans of Helm, and it is a key ingredient to how our customers install/bootstrap our Agent software. Our Dashboard builds out both a workflow configuration for our customers, as well as a helm install/upgrade command for installing and updating our Agent software. We run our own Chartmuseum registry so that customers can take the "if it ain't broke, don't fix it" approach, pinning their version so they can upgrade (and downgrade) at their leisure. A single command that users can copy and paste (along with one-time Docker registry and ChartMuseum authentication steps) really fits well with our approach of building a turnkey-style appliance.&lt;/p&gt;

&lt;p&gt;The popularity and ubiquity of managed Kubernetes services (EKS, GKE, AKS, etc.) was another important factor, so we can provide a simple recipe (or Terraform plan) for setting up this infrastructure for customers not already using Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Database Authentication
&lt;/h3&gt;

&lt;p&gt;If you've used Airflow before you know that it includes support for encrypted variables, which is an appropriate way to store sensitive information such as database passwords. These are not available to the KubernetesPodOperator though, so these will need to be provisioned as secrets. We created a script to run &lt;code&gt;airflow connections delete&lt;/code&gt; and &lt;code&gt;airflow connections add&lt;/code&gt; &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/cli-and-env-variables-ref.html#connections"&gt;commands&lt;/a&gt; to recreate these connections based on values passed in to Helm via a local values.yaml file. This way, as new input sources are defined or passwords rotated, these changes are applied on each helm upgrade command, and we don't have to get our users to input their passwords into the Redactics Dashboard. In other words, their connection information is a local augmentation to the configuration generated by the Redactics Dashboard: the Dashboard generates a template configuration file with "changemes" where authentication information should be provided, and users create their local configuration file by replacing these "changemes" with their real values.&lt;/p&gt;
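&lt;p&gt;A simplified sketch of this pattern (the values.yaml structure and connection names here are assumptions for illustration, not the actual Redactics script) that turns connection entries into the Airflow CLI commands to run:&lt;/p&gt;

```python
# Illustrative helper: for each connection defined in a Helm values file,
# emit the CLI commands that delete and recreate it, so rotated passwords
# and new input sources are applied on every helm upgrade.
values = {
    "connections": [
        {"id": "postgres_input_1", "uri": "postgresql://app:changeme@db:5432/app"},
    ]
}

def connection_commands(vals):
    """Build airflow CLI argument lists from values.yaml-style entries."""
    cmds = []
    for conn in vals["connections"]:
        cmds.append(["airflow", "connections", "delete", conn["id"]])
        cmds.append(["airflow", "connections", "add", conn["id"], "--conn-uri", conn["uri"]])
    return cmds

for cmd in connection_commands(values):
    print(" ".join(cmd))
```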

&lt;h3&gt;
  
  
  The KubernetesPodOperator, Resource Management
&lt;/h3&gt;

&lt;p&gt;Because we are installing onto Kubernetes, we have the luxury of leveraging the KubernetesPodOperator for some of our workflow steps, particularly those with dependencies that go beyond Python. Setting container resource limits helps us stay true to our goal of sticking to a very minimalist resource footprint, even when datasets grow in size. This is particularly helpful when running steps in parallel, and because the number of steps in our workflows is dynamic and context dependent, it is important to set the &lt;code&gt;max_active_tasks&lt;/code&gt; DAG argument to ensure that your resources operate within their allocated footprint. Otherwise your workflows could be really slow and/or you could face pod evictions.&lt;/p&gt;
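&lt;p&gt;As a back-of-envelope illustration (the numbers below are made up, not Redactics defaults), per-container limits and &lt;code&gt;max_active_tasks&lt;/code&gt; relate like this: with a fixed memory budget for the Agent, the task concurrency cap bounds worst-case usage.&lt;/p&gt;

```python
# Illustrative arithmetic: if each task pod is limited to a fixed amount
# of memory, max_active_tasks bounds worst-case concurrent memory usage,
# keeping the Agent inside its allocated footprint.
POD_MEMORY_LIMIT_MI = 256   # per-task container limit (assumed value)
FOOTPRINT_MI = 1024         # memory budget for the whole Agent (assumed)

def max_safe_active_tasks(footprint_mi, pod_limit_mi):
    """Largest max_active_tasks that keeps worst-case usage in budget."""
    return footprint_mi // pod_limit_mi

print(max_safe_active_tasks(FOOTPRINT_MI, POD_MEMORY_LIMIT_MI))  # 4
```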

&lt;p&gt;We set our Helm chart for Airflow to skip installing the webserver; instead, we use &lt;code&gt;on_failure_callback&lt;/code&gt; callbacks to read the logs from the filesystem and send them to our APIs, so that our &lt;a href="https://app.redactics.com/"&gt;Redactics Dashboard&lt;/a&gt; is what customers interface with and they don't have to jump between our Dashboard and the Airflow webserver. We send pod exec commands to the Airflow CLI in the scheduler rather than the REST API for starting workflows manually. Our Agent software installs a few different pods, but the only Airflow pod needed is a single scheduler. It sends a heartbeat to our APIs as well so we can provide feedback in our Dashboard about the status and health of the installation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Task Mapping
&lt;/h3&gt;

&lt;p&gt;We were anxious to try out Airflow 2.3's &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dynamic-task-mapping.html"&gt;Dynamic Task Mapping&lt;/a&gt; along with the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/tutorial/taskflow.html"&gt;Taskflow API&lt;/a&gt; introduced prior, as without these features we could only inject preset workflow configurations into our DAGs, which meant that whenever we wanted to update our workflow configurations these DAGs also had to be updated, which meant another helm upgrade command which needed to be run. Now, our DAGs do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fetch the workflow config from our dashboard's API (we can control how often the DAG is refreshed with &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#worker-refresh-interval"&gt;this Airflow variable&lt;/a&gt;, or else mitigate the load with Redis/memory caching)&lt;/li&gt;
&lt;li&gt;Several workflow steps have &lt;code&gt;KubernetesPodOperator&lt;/code&gt; commands that are dynamically generated based on these workflow configs fetched from our API with Python functions such as the following:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@task(on_failure_callback=post_logs)
        def gen_table_resets(input_id, schema, **context):
            tables = []
            for input in wf_config["inputs"]:
                if input["id"] == input_id:
                    for table in initial_copies:
                        tables.append(table)
            if len(tables):
                return [["/scripts/table-resets.sh", dag_name, schema, ",".join(tables)]]
            else:
                return []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Each workflow configuration has its own DAG so that this work is tracked independently. The &lt;code&gt;dag_name&lt;/code&gt; variable, above, is based on a unique configuration ID. This ensures that if you use Redactics with multiple databases (as we hope you will), a problem with one workflow is isolated and doesn't impact the others.&lt;/li&gt;
&lt;/ul&gt;
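&lt;p&gt;Stripped of Airflow, the fan-out that dynamic task mapping performs on the &lt;code&gt;gen_table_resets&lt;/code&gt; output can be pictured with a plain-Python stand-in (illustrative only; the real DAG wires this up with &lt;code&gt;.expand()&lt;/code&gt;):&lt;/p&gt;

```python
# Plain-Python stand-in (not Airflow) for dynamic task mapping: each
# element of the list returned by the task becomes one mapped task
# instance, and an empty list maps to zero instances.
def fan_out(arg_lists):
    """Simulate .expand(): one 'task instance' per argument list."""
    return ["pod running: " + " ".join(args) for args in arg_lists]

resets = [["/scripts/table-resets.sh", "wf-1", "public", "users,orders"]]
print(fan_out(resets))
print(fan_out([]))  # no tables to reset: no pods launched
```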

&lt;h3&gt;
  
  
  Reporting
&lt;/h3&gt;

&lt;p&gt;We mentioned leveraging the &lt;code&gt;on_failure_callback&lt;/code&gt; callbacks to report issues to our APIs, but we also leverage the &lt;code&gt;on_success_callback&lt;/code&gt; callbacks to render progress bars and show visual feedback of the workflow being run by sending this to our APIs for display in our Dashboard.&lt;/p&gt;

&lt;p&gt;Our CLI supports outputting diagnostic information about the health of the pods, zipping up logs, etc. but so far our customers have not needed this capability. We built this for the possibility of having to deep-dive specific issues with our customers, but one advantage to using Kubernetes is when customers are already familiar with it, they can be somewhat self-sufficient. For example, our Agent supports being pinned to specific nodes, and this works the same way you'd assign any other pod to a node.&lt;/p&gt;

&lt;h3&gt;
  
  
  Persistent Storage, Parallel Tasks
&lt;/h3&gt;

&lt;p&gt;We built an &lt;a href="https://github.com/Redactics/http-nas"&gt;open source streaming network file system&lt;/a&gt;, sort of like a NAS with an http interface, so that pods could easily access and share files, including concurrent usage by tasks running in parallel. We decided on this approach rather than shipping files up to Amazon S3 or the like, in part because this fit better in our "use your infrastructure" appliance approach without having to get customers to navigate creating buckets for use with the Agent (and dealing with possible data privacy issues in doing so). Performance, of course, was another important factor in evaluating our options here, as was providing customers the option to not have to open up any sort of egress networking. We wrote &lt;a href="https://blog.redactics.com/open-source-media-streaming-service-for-kubernetes"&gt;another article&lt;/a&gt; about this in case you're curious to learn more, but this was definitely a challenge, as &lt;code&gt;ReadWriteMany&lt;/code&gt; Kubernetes persistent storage options are not a given.&lt;/p&gt;

&lt;p&gt;Airflow supports running tasks in parallel, and this is a great way to improve the performance of your workflows where possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Airflow is a superb choice not only for workloads in general, but even for those that need the sort of "Airflow-slim" minimalist approach described here. Your only challenge might be some sort of shared filesystem. There is no reason I can think of why Redactics, or a managed Airflow appliance in general, could not work on a VM as well, provided you could come up with a way to simplify installing and updating this software.&lt;/p&gt;

&lt;p&gt;We hope that some of this information is useful! Please be sure to let us know...&lt;/p&gt;

</description>
      <category>database</category>
      <category>kubernetes</category>
      <category>airflow</category>
    </item>
    <item>
      <title>How To Give Awesome Demos Using Updatable Datasets</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Tue, 01 Nov 2022 16:27:03 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-give-awesome-demos-using-updatable-datasets-3lah</link>
      <guid>https://forem.com/joeauty/how-to-give-awesome-demos-using-updatable-datasets-3lah</guid>
      <description>&lt;p&gt;This guide showcases a brand new tool (free for use for developers) called &lt;a href="https://www.redactics.com"&gt;Redactics&lt;/a&gt;, which is a managed appliance for delivering "safe" datasets to your stakeholders. They would love to hear from you if you appreciate what they are building. This guide is one of a series of recipe-based instructions for solving specific problems and accomplishing specific tasks using Redactics, and its focus is in powering datasets for demo environments with the following features/characteristics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Demo-specific customizations (such as using the name and logo of the company you want to demo to) generated as needed without the assistance of an engineer.&lt;/li&gt;
&lt;li&gt;Scheduled or on-demand dataset generation automatically hard reset before each demo to ensure that each new demo does not show data from the last one.&lt;/li&gt;
&lt;li&gt;Multiple datasets for multiple variants of demos you might want to offer.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 1: Prep Your Master Data
&lt;/h3&gt;

&lt;p&gt;Get an engineer to set up a new environment with an empty database. This infrastructure can be reused for your actual demos; it will be used to host your master demo template data. You'll want to integrate this environment with your CI/CD to support code updates/deploys as desired to showcase your latest stable application build, and you'll need some sort of domain pointed at this app (e.g. demo-master.yourcompany.com). Create an account within this environment for each demo variant you intend to give. Tweak this data for each demo to show what you want to show, but don't worry about the company name, logo, or any customizations you wish to make for each demo you give. Consider this data your "start of the demo" data for internal use only. Each time Redactics runs it will hard-reset your demos back to this master data, plus any individual customizations you wish (more on this in a bit...)&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Prep Your "Digital Twin" Database Clone
&lt;/h3&gt;

&lt;p&gt;Set up another empty database in this same environment (we recommend using the same database cluster), which we'll use for running your actual demos. Each demo will feature data based on your template database reset for each demo — a clone or "digital twin" of the original, so that there is no residual data left over from the previous demo. You'll need another domain to point at this application (e.g. demos.yourcompany.com). Do not start writing to this database, but ensure that the engineer who will be installing Redactics has the database connection information (i.e. hostname, username, password, database names) for both your master template and your empty demo database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Create Your Redactics Account
&lt;/h3&gt;

&lt;p&gt;Have your engineer create your company's Redactics account (or create it yourself and invite them to it). They will also need to create their first Redactics SMART Agent and workflow. Don't worry about configuring the workflow right now; the engineer simply needs to follow the instructions to install the SMART Agent with an empty workflow of type &lt;strong&gt;ERL (Extract, Redact, Load)&lt;/strong&gt;. You can give this workflow any name you like, e.g. "Demos". They'll then need to define the master database by clicking on "Add Database" in the "Input Settings" section. This will also require listing all of the tables used by the demo — this listing of tables will be copied over (and reset in your database clone/digital twin whenever the job runs). Once the SMART Agent has been installed it will report back to the Redactics Dashboard, and you'll see a green checkmark in the SMART Agent section:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UVseMEIl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j000t9e5z25daq0x94f2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UVseMEIl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j000t9e5z25daq0x94f2.jpg" alt="Redactics SMART Agent successfully installed" width="880" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this step complete, once you have established a working workflow and decide to update it later, the SMART Agent will automatically recognize these changes without a re-installation being required.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Configure Your Redactics Workflow
&lt;/h3&gt;

&lt;p&gt;Return to your workflow configuration; your Input Settings should already be completed. Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Since this database data won't include any sensitive information you can skip the "Database Fields Containing PII/Confidential Info" section.&lt;/li&gt;
&lt;li&gt;In the "Workflow Schedule Options" you can decide to put your data cloning/reset on a schedule. Please note that these times are in the UTC timezone (also known as Greenwich Mean Time or GMT), and custom times are expressed in crontab format. You can use &lt;a href="https://crontab.guru/"&gt;this guide&lt;/a&gt; to format a custom time if you wish. You might want to start with running this jobs overnight (e.g. to run this at midnight UTC this custom time will be *&lt;em&gt;0 0 * * *&lt;/em&gt;*). For testing purposes your engineer can run these jobs manually whenever needed, and you can change this schedule whenever you want and have this recognized within minutes. The workflow usually takes a few minutes to run, so you can come back here and change the schedule to run shortly before your next demo.&lt;/li&gt;
&lt;li&gt;In the "Output Settings" leave "all table rows" set for each of your tables.&lt;/li&gt;
&lt;li&gt;Create your digital twin data feed by clicking on "Add Data Feed", followed by "Create a PII-free Digital Twin/Clone". Check the "Database Connectivity Should Be TLS/SSL Encrypted" checkbox if this is a technical requirement (and follow the instructions for enabling this if so), and leave "Enable Delta Updates" disabled.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; If you are interested in making data customizations for individual demos, for example displaying the company name and logo of the company you are demoing to, you'll need to have your engineer help out with one more step: creating your prepared statements secret as per the instructions in the "Enable Custom Data Updates" section. Your prepared statements are simply SQL queries needed for your data customizations, but unlike regular SQL queries they work based on parameters which you'll provide here. For example, if your prepared statement is &lt;code&gt;UPDATE company SET name=%(company_name)s WHERE company=%(template_company_name)s&lt;/code&gt;, you'll need to provide parameters for &lt;code&gt;company_name&lt;/code&gt; and &lt;code&gt;template_company_name&lt;/code&gt;. In this example the template company name will be the original company name from your template database, and the company name will be the name of the company you are demoing to. Whenever you book a new demo, simply come back here and change the value for company_name (and company logo URL and anything else you'd like to customize).&lt;/li&gt;
&lt;li&gt;Once you've created your Digital Twin data feed you will be provided with instructions for adding the connection info for this database to your SMART Agent configuration (which will require a one-time re-install to inject and save this information).&lt;/li&gt;
&lt;li&gt;Click "Update", and then "Save Changes".&lt;/li&gt;
&lt;/ol&gt;
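&lt;p&gt;The named-parameter style in the prepared statement example above follows psycopg's pyformat placeholders. As a rough illustration (Python's own %-formatting happens to accept the same &lt;code&gt;%(name)s&lt;/code&gt; syntax; in production the database driver, not string formatting, performs the substitution safely):&lt;/p&gt;

```python
# Illustrative only: named placeholders let parameters be supplied in any
# order via a mapping. The company names below are made-up sample values.
statement = "UPDATE company SET name=%(company_name)s WHERE company=%(template_company_name)s"
params = {
    "template_company_name": "Acme Template Inc",
    "company_name": "Globex",
}
# repr() quotes the values here purely for display; a real driver binds them.
print(statement % {k: repr(v) for k, v in params.items()})
# UPDATE company SET name='Globex' WHERE company='Acme Template Inc'
```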

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nbmh55wP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plzft0k4rlu2xpmvc98s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nbmh55wP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/plzft0k4rlu2xpmvc98s.jpg" alt="Prepared statements for demo data customizations" width="880" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Now What?
&lt;/h3&gt;

&lt;p&gt;Congratulations: your demo databases will be populated and reset as per your schedule, and you are ready to start giving demos via your application configured to use your digital twin/clone database. Your product/sales people can return to the Redactics Dashboard to customize their next demo, and you can edit your master template data at any time and have this copied to your demo environment without engineering assistance. To ensure that you are good to go (because you probably don't want to wait until your next demo to find out), you can either bump up your schedule or else have your engineer invoke the workflow manually via the Redactics SMART Agent CLI. Any issues with the workflow will be reported to the Redactics Dashboard, and, of course, Redactics also provides support should you require any assistance.&lt;/p&gt;

&lt;p&gt;As workflows run, the progress will be reported to the &lt;a href="https://app.redactics.com/workflows/jobs"&gt;Workflows -&amp;gt; Jobs page&lt;/a&gt;, and when this work has been completed a report will be provided detailing what was copied.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Create Prepared Statements with the Airflow PostgresOperator</title>
      <dc:creator>Joe Auty</dc:creator>
      <pubDate>Sat, 29 Oct 2022 23:38:23 +0000</pubDate>
      <link>https://forem.com/joeauty/how-to-create-prepared-statements-with-the-airflow-postgresoperator-57dc</link>
      <guid>https://forem.com/joeauty/how-to-create-prepared-statements-with-the-airflow-postgresoperator-57dc</guid>
      <description>&lt;p&gt;Airflow's &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/_api/airflow/providers/postgres/operators/postgres/index.html"&gt;PostgresOperator&lt;/a&gt; includes a field called &lt;code&gt;parameters&lt;/code&gt; for providing SQL parameters for &lt;a href="https://en.wikipedia.org/wiki/Prepared_statement"&gt;prepared statements&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We wanted to use named parameters since our parameter order would vary, which meant &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers-postgres/stable/operators/postgres_operator_howto_guide.html#passing-parameters-into-postgresoperator"&gt;understanding the syntax&lt;/a&gt;. In their provided example: &lt;code&gt;SELECT * FROM pet WHERE birth_date BETWEEN SYMMETRIC %(begin_date)s AND %(end_date)s&lt;/code&gt;, the format was unclear - for example the trailing &lt;code&gt;s&lt;/code&gt; (we guessed this meant &lt;code&gt;string&lt;/code&gt;, but queries with numerical inputs were failing without it).&lt;/p&gt;

&lt;p&gt;After some research we realized that the underlying dependency here was &lt;a href="https://www.psycopg.org/psycopg3/docs/basic/params.html"&gt;Psycopg&lt;/a&gt;. From this page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Named arguments are supported too using %(name)s placeholders in the query and specifying the values into a mapping. Using named arguments allows to specify the values in any order and to repeat the same value in several places in the query:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The variables placeholder must always be a %s, even if a different placeholder (such as a %d for integers or %f for floats) may look more appropriate for the type.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So, follow the above format for your variable placeholders, and ensure that the variable passed in is cast to the correct format (i.e. don't quote numbers).&lt;/p&gt;
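&lt;p&gt;A minimal sketch of this rule (plain Python standing in for the driver; the table and column are made-up examples): the placeholder is always &lt;code&gt;%(name)s&lt;/code&gt; even for numbers, and the &lt;em&gt;value&lt;/em&gt; carries the type.&lt;/p&gt;

```python
# The placeholder stays %(age)s whether the column is text or numeric;
# what matters is the Python type of the bound value.
sql = "SELECT * FROM pet WHERE age = %(age)s"
good_params = {"age": 7}     # int: bound as a number
bad_params = {"age": "7"}    # str: bound as text, which can fail type checks
print(type(good_params["age"]).__name__, type(bad_params["age"]).__name__)
# prints: int str
```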

</description>
    </item>
  </channel>
</rss>
