<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: abdu masah</title>
    <description>The latest articles on Forem by abdu masah (@abdumasah).</description>
    <link>https://forem.com/abdumasah</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3891331%2Fcf728784-c29e-41ec-9980-28f5abd79d4f.jpg</url>
      <title>Forem: abdu masah</title>
      <link>https://forem.com/abdumasah</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/abdumasah"/>
    <language>en</language>
    <item>
      <title>Arrowjet is now a Cross-Database Sync Tool in Python (PG, MySQL, Redshift)</title>
      <dc:creator>abdu masah</dc:creator>
      <pubDate>Tue, 28 Apr 2026 21:46:50 +0000</pubDate>
      <link>https://forem.com/abdumasah/arrowjet-is-now-a-cross-database-sync-tool-in-python-pg-mysql-redshift-min</link>
      <guid>https://forem.com/abdumasah/arrowjet-is-now-a-cross-database-sync-tool-in-python-pg-mysql-redshift-min</guid>
      <description>&lt;p&gt;I've been building Arrowjet, an open-source Python library for fast bulk data movement. It started as a Redshift speed tool, but it now supports PostgreSQL, MySQL, and cross-database transfers.&lt;/p&gt;

&lt;p&gt;The latest addition: &lt;strong&gt;stateful sync&lt;/strong&gt; that keeps tables in sync across databases.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Moving data between databases usually means writing custom scripts per source/destination pair. Add incremental logic, schema drift handling, retry on failure, and you're maintaining a mini-ETL framework.&lt;/p&gt;

&lt;h2&gt;What sync does&lt;/h2&gt;

&lt;p&gt;One function call (&lt;code&gt;pg_conn&lt;/code&gt; and &lt;code&gt;mysql_conn&lt;/code&gt; are your existing DBAPI connections):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arrowjet_pro&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync&lt;/span&gt;

&lt;span class="n"&gt;pg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;my&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mysql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_conn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg_conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dest_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest_conn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mysql_conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;key_column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# incremental
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Sync SUCCESS: 12,000 rows (incremental)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It decides full vs incremental automatically based on previous state. Truncates the destination on full sync. Validates row counts afterward. Retries with backoff on failure.&lt;/p&gt;
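
&lt;p&gt;Conceptually, the incremental path reduces to tracking a high-water mark per table. A minimal sketch of the idea in plain Python (an illustration, not Arrowjet's actual internals; the JSON state file is a hypothetical stand-in for wherever state lives):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

STATE_FILE = "sync_state.json"  # hypothetical state store

def plan_sync(table, key_column):
    """Decide full vs incremental from previously saved state (sketch)."""
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}
    last_seen = state.get(table)
    if last_seen is None:
        # No previous state: full sync (truncate destination, reload everything)
        return "full", None
    # State exists: copy only rows past the high-water mark
    return "incremental", f"WHERE {key_column} &amp;gt; '{last_seen}'"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;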

&lt;h2&gt;Schema-level sync&lt;/h2&gt;

&lt;p&gt;Sync an entire schema with filtering:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;arrowjet_pro&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sync_schema&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sync_schema&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_conn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pg_conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dest_engine&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dest_conn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;mysql_conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;public&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;exclude&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*_tmp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*_backup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Multi-table sync: ALL OK
#   Tables: 14/14 succeeded
#   Total rows: 2,340,000
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;YAML config for repeatable jobs&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-postgres&lt;/span&gt;
&lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;profile&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-mysql&lt;/span&gt;
&lt;span class="na"&gt;defaults&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;auto&lt;/span&gt;
  &lt;span class="na"&gt;key_column&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updated_at&lt;/span&gt;
  &lt;span class="na"&gt;retry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;orders&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;users&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;products&lt;/span&gt;
    &lt;span class="na"&gt;dest_table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;product_catalog&lt;/span&gt;
    &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;full&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
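
&lt;p&gt;A config like this is also easy to drive from Python. A minimal sketch with PyYAML, reusing the &lt;code&gt;sync()&lt;/code&gt; call from the first snippet (mode, retry, and &lt;code&gt;dest_table&lt;/code&gt; overrides omitted for brevity; &lt;code&gt;sync.yaml&lt;/code&gt; is the file above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import yaml
from arrowjet_pro import sync

cfg = yaml.safe_load(open("sync.yaml"))
defaults = cfg.get("defaults", {})

for entry in cfg["tables"]:
    # Entries are either a bare table name or a mapping with overrides
    opts = {"name": entry} if isinstance(entry, str) else dict(entry)
    sync(
        source_engine=pg, source_conn=pg_conn,  # engines/connections from above
        dest_engine=my, dest_conn=mysql_conn,
        table=opts["name"],
        key_column=opts.get("key_column", defaults.get("key_column")),
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;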



&lt;h2&gt;CLI&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arrowjet &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--table&lt;/span&gt; orders &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-profile&lt;/span&gt; pg &lt;span class="nt"&gt;--to-profile&lt;/span&gt; mysql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--key-column&lt;/span&gt; updated_at

arrowjet &lt;span class="nb"&gt;sync&lt;/span&gt; &lt;span class="nt"&gt;--schema&lt;/span&gt; public &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-profile&lt;/span&gt; pg &lt;span class="nt"&gt;--to-profile&lt;/span&gt; mysql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--exclude&lt;/span&gt; &lt;span class="s2"&gt;"*_tmp"&lt;/span&gt; &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Under the hood&lt;/h2&gt;

&lt;p&gt;All transfers use the fast path for each database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: COPY protocol (850x faster than INSERT)&lt;/li&gt;
&lt;li&gt;MySQL: LOAD DATA LOCAL INFILE (6.6x faster)&lt;/li&gt;
&lt;li&gt;Redshift: COPY/UNLOAD via S3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache Arrow is the in-memory bridge between databases. No intermediate files, no serialization overhead.&lt;/p&gt;
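
&lt;p&gt;To make that concrete, here is the shape of the hand-off in plain pyarrow (illustrative only; Arrowjet manages this internally):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pyarrow as pa

# Rows leave the source as Arrow record batches...
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    names=["id", "val"],
)
table = pa.Table.from_batches([batch])

# ...and feed the destination's bulk loader with no intermediate files.
# Conversion to pandas is zero-copy for many column types.
df = table.to_pandas()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;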

&lt;h2&gt;Future&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pip install arrowjet&lt;/code&gt; - bulk read/write/transfer, CLI, 3 database providers&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip install arrowjet-pro&lt;/code&gt; - sync, drift detection, schema auto-fix, alerting, operation log&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/arrowjet/arrowjet" rel="noopener noreferrer"&gt;https://github.com/arrowjet/arrowjet&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/arrowjet/0.6.0/" rel="noopener noreferrer"&gt;https://pypi.org/project/arrowjet/0.6.0/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>850x Faster PostgreSQL Writes With One Line of Python</title>
      <dc:creator>abdu masah</dc:creator>
      <pubDate>Sun, 26 Apr 2026 10:51:33 +0000</pubDate>
      <link>https://forem.com/abdumasah/850x-faster-postgresql-writes-with-one-line-of-python-1ae3</link>
      <guid>https://forem.com/abdumasah/850x-faster-postgresql-writes-with-one-line-of-python-1ae3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoigjx35fjs776mhtadw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxoigjx35fjs776mhtadw.png" alt=" " width="800" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every Python developer loading data into PostgreSQL hits the same wall.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;executemany()&lt;/code&gt; with 1M rows? 16 minutes. &lt;code&gt;df.to_sql()&lt;/code&gt;? Same thing — it generates INSERT statements under the hood. Even &lt;code&gt;method='multi'&lt;/code&gt; with chunking is slow because each batch is still a SQL statement parsed by the server.&lt;/p&gt;

&lt;p&gt;PostgreSQL has had a faster path since version 7.x: the COPY protocol. It bypasses the SQL parser entirely and streams CSV or binary data directly to the storage engine. But wiring it up yourself means dealing with &lt;code&gt;copy_expert()&lt;/code&gt;, CSV serialization, NULL handling, and type mapping.&lt;/p&gt;
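
&lt;p&gt;For reference, the manual version of that fast path with psycopg2's &lt;code&gt;copy_expert()&lt;/code&gt; looks roughly like this (a sketch: the NULL handling, quoting edge cases, and type mapping are exactly the parts a library has to get right):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import io

def copy_dataframe(conn, df, table):
    """Stream a pandas DataFrame into PostgreSQL via COPY (sketch)."""
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    with conn.cursor() as cur:
        cur.copy_expert(f"COPY {table} FROM STDIN WITH (FORMAT csv)", buf)
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;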

&lt;p&gt;I built &lt;a href="https://github.com/arrowjet/arrowjet" rel="noopener noreferrer"&gt;arrowjet&lt;/a&gt; to wrap that in one line.&lt;/p&gt;

&lt;h2&gt;The numbers&lt;/h2&gt;

&lt;p&gt;Benchmarked on RDS PostgreSQL 16.6 from an EC2 instance in the same region, with 1M rows (3 columns: BIGINT, DOUBLE PRECISION, VARCHAR):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writes:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;vs Arrowjet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;executemany&lt;/code&gt; (batch 1000)&lt;/td&gt;
&lt;td&gt;~16 min&lt;/td&gt;
&lt;td&gt;850x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-row VALUES (batch 1000)&lt;/td&gt;
&lt;td&gt;8.4s&lt;/td&gt;
&lt;td&gt;7.4x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arrowjet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.13s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Reads:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;vs Arrowjet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cursor.fetchall()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1.00s&lt;/td&gt;
&lt;td&gt;1.5x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arrowjet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.65s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The write speedup is the headline: &lt;strong&gt;850x&lt;/strong&gt;. That's not a typo. &lt;code&gt;executemany&lt;/code&gt; sends each row as a separate protocol-level operation. COPY sends the entire dataset in one streaming operation.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;psycopg2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dbname&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mydb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Write — COPY FROM STDIN (850x faster than executemany)
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Read — COPY TO STDOUT (1.5x faster than fetchall)
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM target_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No S3 bucket. No IAM role. No staging config. Just your existing psycopg2 connection.&lt;/p&gt;
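
&lt;p&gt;If you want to reproduce the write comparison yourself, a minimal timing harness is enough. A sketch reusing &lt;code&gt;conn&lt;/code&gt;, &lt;code&gt;engine&lt;/code&gt;, and &lt;code&gt;df&lt;/code&gt; from the snippet above (absolute numbers will vary with hardware and network):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

rows = list(df.itertuples(index=False, name=None))

with conn.cursor() as cur:
    timed("executemany", lambda: cur.executemany(
        "INSERT INTO target_table VALUES (%s, %s, %s)", rows,
    ))

timed("arrowjet COPY", lambda: engine.write_dataframe(conn, df, "target_table"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;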

&lt;h2&gt;Works everywhere&lt;/h2&gt;

&lt;p&gt;The COPY protocol is core PostgreSQL — not an AWS extension. This works with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aurora PostgreSQL&lt;/li&gt;
&lt;li&gt;RDS PostgreSQL&lt;/li&gt;
&lt;li&gt;Self-hosted PostgreSQL&lt;/li&gt;
&lt;li&gt;Docker PostgreSQL&lt;/li&gt;
&lt;li&gt;Supabase, Neon, CockroachDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If psycopg2 can connect to it, arrowjet can bulk-load it.&lt;/p&gt;

&lt;h2&gt;Bring your own connection&lt;/h2&gt;

&lt;p&gt;If you already have connection management (Airflow DAGs, ETL scripts, Django), you don't need to change it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;

&lt;span class="c1"&gt;# Your existing connection — keep it
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;your_existing_pg_connection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Arrowjet doesn't own the connection. It just does the bulk part.&lt;/p&gt;

&lt;h2&gt;Also supports Redshift&lt;/h2&gt;

&lt;p&gt;Arrowjet started as a Redshift bulk engine (COPY/UNLOAD via S3). The PostgreSQL provider is new in v0.3.0. Same API, different execution strategy:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# PostgreSQL — COPY protocol, no staging
&lt;/span&gt;&lt;span class="n"&gt;pg_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Redshift — COPY/UNLOAD via S3
&lt;/span&gt;&lt;span class="n"&gt;rs_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redshift&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_iam_role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123:role/RedshiftS3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Same API for both
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM my_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redshift benchmarks: &lt;a href="https://dev.to/abdumasah/4x-faster-redshift-reads-with-one-line-of-python-l4l"&gt;4x faster reads, ~3,000x faster writes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;CLI&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arrowjet &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--provider&lt;/span&gt; postgresql &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM users"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--to&lt;/span&gt; ./users.parquet &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; your-host &lt;span class="nt"&gt;--password&lt;/span&gt; your-pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;What it doesn't do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's PostgreSQL and Redshift only for now. MySQL is next on the roadmap.&lt;/li&gt;
&lt;li&gt;The read speedup (1.5x) is modest compared to writes (850x). The COPY protocol advantage for reads grows with data size and network latency.&lt;/li&gt;
&lt;li&gt;You need &lt;code&gt;psycopg2&lt;/code&gt; or &lt;code&gt;psycopg3&lt;/code&gt;. Drivers without COPY protocol support won't work.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;arrowjet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/arrowjet/arrowjet" rel="noopener noreferrer"&gt;github.com/arrowjet/arrowjet&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/configuration.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt;, &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/iam_setup.md" rel="noopener noreferrer"&gt;IAM setup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Redshift benchmarks: &lt;a href="https://dev.to/abdumasah/4x-faster-redshift-reads-with-one-line-of-python-l4l"&gt;4x faster reads blog post&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT. Feedback welcome — especially from anyone doing bulk PostgreSQL loads at scale.&lt;/p&gt;

</description>
      <category>python</category>
      <category>postgres</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>4x Faster Redshift Reads With One Line of Python</title>
      <dc:creator>abdu masah</dc:creator>
      <pubDate>Tue, 21 Apr 2026 20:59:05 +0000</pubDate>
      <link>https://forem.com/abdumasah/4x-faster-redshift-reads-with-one-line-of-python-l4l</link>
      <guid>https://forem.com/abdumasah/4x-faster-redshift-reads-with-one-line-of-python-l4l</guid>
      <description>&lt;p&gt;Arrowjet wraps Redshift's COPY/UNLOAD commands in a simple Python API. Benchmarked at 4x faster reads and 3,000x faster writes than standard drivers.&lt;/p&gt;

&lt;p&gt;Standard Redshift drivers fetch data row-by-row over the PostgreSQL wire protocol. For 10M rows, &lt;code&gt;cursor.fetchall()&lt;/code&gt; takes about 145 seconds on a 4-node cluster.&lt;/p&gt;

&lt;p&gt;That's not a Redshift problem. That's a wire protocol problem.&lt;/p&gt;

&lt;p&gt;Redshift has a much faster path: UNLOAD to S3 as Parquet, then read the files. This is how AWS moves data internally. But wiring it up yourself means 30-50 lines of boilerplate every time — S3 paths, IAM roles, Parquet conversion, cleanup, error handling.&lt;/p&gt;
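
&lt;p&gt;For contrast, a stripped-down manual version looks something like this (a sketch: &lt;code&gt;conn&lt;/code&gt; is an open Redshift connection, the bucket and role are placeholders, and real scripts also need unique S3 prefixes, cleanup, and error handling):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

UNLOAD_SQL = """
UNLOAD ('SELECT * FROM events')
TO 's3://your-staging-bucket/unload/events_'
IAM_ROLE 'arn:aws:iam::123456789:role/RedshiftS3Role'
FORMAT AS PARQUET
"""

with conn.cursor() as cur:
    cur.execute(UNLOAD_SQL)

# Read the unloaded Parquet parts back (needs s3fs installed)
df = pd.read_parquet("s3://your-staging-bucket/unload/")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;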

&lt;p&gt;I built &lt;a href="https://github.com/arrowjet/arrowjet" rel="noopener noreferrer"&gt;Arrowjet&lt;/a&gt; to wrap that in one line.&lt;/p&gt;

&lt;h2&gt;The numbers&lt;/h2&gt;

&lt;p&gt;I benchmarked on a 4-node ra3.large cluster with an EC2 instance in the same region. Each test ran 5 iterations in randomized order to eliminate ordering bias.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reads (10M rows):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cursor.fetchall()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;144.5s&lt;/td&gt;
&lt;td&gt;baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual UNLOAD script&lt;/td&gt;
&lt;td&gt;~58s&lt;/td&gt;
&lt;td&gt;2.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arrowjet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;36.3s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4x&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Writes (1M rows):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;vs Arrowjet&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;write_dataframe()&lt;/code&gt; INSERT&lt;/td&gt;
&lt;td&gt;~23 hours&lt;/td&gt;
&lt;td&gt;3,138x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual COPY script&lt;/td&gt;
&lt;td&gt;11.1s&lt;/td&gt;
&lt;td&gt;parity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Arrowjet&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.7s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;baseline&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The read speedup scales with cluster size — more nodes means more UNLOAD parallelism. On an 8-node cluster, you'd expect ~7x. The write path matches what a competent engineer scripts manually, but in one line with automatic cleanup.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;Reads: your query goes through UNLOAD → S3 → Parquet → Arrow.&lt;br&gt;
Writes: your data goes through Arrow → Parquet → S3 → COPY.&lt;/p&gt;

&lt;p&gt;For small queries, Arrowjet uses the standard PostgreSQL wire protocol (safe mode). The bulk path only fires when you explicitly ask for it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cluster.region.redshift.amazonaws.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;awsuser&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-staging-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_iam_role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789:role/RedshiftS3Role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Bulk read — 4x faster for large results
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM events WHERE date &amp;gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-01-01&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Bulk write — COPY-speed with one line
&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_dataframe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Small queries still use the normal path
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT COUNT(*) FROM events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Bring your own connection&lt;/h2&gt;

&lt;p&gt;If you already have connection management (Airflow DAGs, ETL scripts, dbt hooks), you don't need to change it. The Engine API takes any DBAPI connection:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redshift_connector&lt;/span&gt;

&lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redshift_connector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;arrowjet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;staging_bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_iam_role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123:role/RedshiftS3Role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;staging_region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM events&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;target_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Works with &lt;code&gt;redshift_connector&lt;/code&gt;, &lt;code&gt;psycopg2&lt;/code&gt;, ADBC, or anything DBAPI-compatible.&lt;/p&gt;

&lt;h2&gt;CLI&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;arrowjet &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s2"&gt;"SELECT * FROM sales"&lt;/span&gt; &lt;span class="nt"&gt;--to&lt;/span&gt; s3://bucket/sales/
arrowjet import &lt;span class="nt"&gt;--from&lt;/span&gt; s3://bucket/data/ &lt;span class="nt"&gt;--to&lt;/span&gt; sales_table
arrowjet preview &lt;span class="nt"&gt;--file&lt;/span&gt; ./out.parquet
arrowjet validate &lt;span class="nt"&gt;--table&lt;/span&gt; sales &lt;span class="nt"&gt;--row-count&lt;/span&gt; &lt;span class="nt"&gt;--schema&lt;/span&gt; &lt;span class="nt"&gt;--sample&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The S3-direct export path runs UNLOAD straight to your destination — data goes Redshift → S3 with no client roundtrip.&lt;/p&gt;

&lt;h2&gt;What it doesn't do&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;It's Redshift-only for now. The provider abstraction is built (adding Snowflake/BigQuery means implementing one interface), but Redshift is the first and only backend today.&lt;/li&gt;
&lt;li&gt;Bulk mode has S3 overhead. For queries returning 100 rows, the standard wire protocol is faster. Arrowjet's sweet spot is 100K+ rows.&lt;/li&gt;
&lt;li&gt;You need an S3 bucket in the same region as your cluster, and an IAM role attached to Redshift with S3 access. The &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/iam_setup.md" rel="noopener noreferrer"&gt;IAM setup guide&lt;/a&gt; covers three deployment models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;arrowjet              &lt;span class="c"&gt;# core + CLI&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;arrowjet[redshift]    &lt;span class="c"&gt;# + Redshift drivers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/arrowjet/arrowjet" rel="noopener noreferrer"&gt;github.com/arrowjet/arrowjet&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/configuration.md" rel="noopener noreferrer"&gt;configuration&lt;/a&gt;, &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/cli_reference.md" rel="noopener noreferrer"&gt;CLI reference&lt;/a&gt;, &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/iam_setup.md" rel="noopener noreferrer"&gt;IAM setup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Migration from redshift_connector: &lt;a href="https://github.com/arrowjet/arrowjet/blob/main/docs/migration_guide.md" rel="noopener noreferrer"&gt;migration guide&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Apache 2.0. Feedback welcome — especially from anyone doing bulk Redshift work at scale.&lt;/p&gt;

</description>
      <category>python</category>
      <category>aws</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
