<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Burak Karakan</title>
    <description>The latest articles on Forem by Burak Karakan (@burakkarakan).</description>
    <link>https://forem.com/burakkarakan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F235012%2F6fc7714b-c6f7-4c19-8a5c-d4e1ebe5dc46.jpeg</url>
      <title>Forem: Burak Karakan</title>
      <link>https://forem.com/burakkarakan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/burakkarakan"/>
    <language>en</language>
    <item>
      <title>I built a data pipeline tool in Go</title>
      <dc:creator>Burak Karakan</dc:creator>
      <pubDate>Mon, 23 Dec 2024 08:30:00 +0000</pubDate>
      <link>https://forem.com/burakkarakan/i-built-a-data-pipeline-tool-in-go-2al9</link>
      <guid>https://forem.com/burakkarakan/i-built-a-data-pipeline-tool-in-go-2al9</guid>
      <description>&lt;p&gt;Over the past few years, the data world has convinced itself that it needs many different tools to extract insights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one tool to ingest data&lt;/li&gt;
&lt;li&gt;another one to transform it&lt;/li&gt;
&lt;li&gt;another one to check the quality&lt;/li&gt;
&lt;li&gt;another one to orchestrate them all&lt;/li&gt;
&lt;li&gt;another one for cataloging&lt;/li&gt;
&lt;li&gt;another one for governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? A fragile, expensive, and rigid infrastructure with a terrible experience. Teams build a lot of glue between these systems, trying to get the different parts to talk to each other while also onboarding analytical teams onto them.&lt;/p&gt;

&lt;p&gt;Does it work? No.&lt;/p&gt;

&lt;p&gt;Are we ready to have that conversation? I hope so.&lt;/p&gt;

&lt;h2&gt;Obsessing over impact&lt;/h2&gt;

&lt;p&gt;The engineering work behind building a ticking machine is very satisfying: small pieces that each do their part, working together like clockwork. It feels like an engineering marvel:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It is simple: you push code to the main branch, the backend automatically pulls the branch, pulls the DAGs and uploads them to S3. The sync sidecars on Airflow containers automatically pull from S3, which will then update the DAGs. When the DAG runs, for data ingestion jobs it will connect to our Airbyte deployment and trigger the ingestion from Airflow, then we create a sensor that waits until the ingestion is done. Then we connect to dbt Cloud to initiate some parts of the transformation jobs from the analytical team, if anything fails Airflow connects to our notification system to find the right team if they are defined on the catalog, if not we check our AD users and try to find a matching org to send a notification. Once the transformation is done then we execute our custom Python operators that do X and Y, then we provision a pod in our Kubernetes cluster to run quality checks. Our Kafka sinks are ingesting CDC data from the internal Postgres with Debezium in the meantime, then load them to the data lake in Parquet format, then we register them as Glue tables so that they can be queried, then the sensors in our Airflow clusters keep track of these states to run SQL transformations with the internal framework, and…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sounds ridiculous, doesn’t it? It certainly does to me, yet this is a very common response when we ask engineering teams what their data infrastructure looks like. The joy of building a house of cards becomes more important than the business impact being delivered. In the meantime, the analytics teams, data analysts, data scientists, and business teams are waiting for their questions to be answered, trying to understand why it takes 6 weeks to get a new chart on their sales dashboard.&lt;/p&gt;

&lt;p&gt;I am not sure whether it is due to ZIRP, but it is pretty easy to spot organizations where highly inefficient engineering teams, coupled with engineering leaders who don’t know what their teams are doing, rule the game, while the people who create real value with data are left on their own. They have to jump through a billion different tools, trying to figure out why their dashboard didn’t update, waiting for a response to their ticket from the central data team.&lt;/p&gt;

&lt;p&gt;These are data analysts in business teams, a growth hacker running marketing campaigns across 5 different platforms, or an all-rounder data scientist trying to predict LTV. They are trying to create real impact, but their progress is heavily hindered by the internal toys.&lt;/p&gt;

&lt;p&gt;We are building Bruin for these people: simpler data tooling for impact-obsessed teams.&lt;/p&gt;

&lt;h2&gt;Bruin CLI &amp;amp; VS Code Extension&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/bruin-data/bruin" rel="noopener noreferrer"&gt;Bruin CLI&lt;/a&gt; is an end-to-end data pipeline tool that brings together data ingestion, data transformation with SQL and Python, and data quality in a single framework.&lt;/p&gt;

&lt;p&gt;Bruin is batteries-included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📥 ingest data with &lt;a href="https://github.com/bruin-data/ingestr" rel="noopener noreferrer"&gt;ingestr&lt;/a&gt; / Python&lt;/li&gt;
&lt;li&gt;✨ run SQL &amp;amp; Python transformations on &lt;a href="https://bruin-data.github.io/bruin/#supported-platforms" rel="noopener noreferrer"&gt;many platforms&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;📐 table/view &lt;a href="https://bruin-data.github.io/bruin/assets/materialization.html" rel="noopener noreferrer"&gt;materializations&lt;/a&gt;, incremental tables&lt;/li&gt;
&lt;li&gt;🐍 run Python in isolated environments using &lt;a href="https://github.com/astral-sh/uv" rel="noopener noreferrer"&gt;uv&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💅 built-in data quality checks&lt;/li&gt;
&lt;li&gt;🚀 Jinja templating to avoid repetition&lt;/li&gt;
&lt;li&gt;✅ validate pipelines end-to-end via dry-run&lt;/li&gt;
&lt;li&gt;👷 run on your local machine, an EC2 instance, or &lt;a href="https://bruin-data.github.io/bruin/cicd/github-action.html" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🔒 secrets injection via environment variables&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://bruin-data.github.io/bruin/vscode-extension/overview.html" rel="noopener noreferrer"&gt;VS Code extension&lt;/a&gt; for a better developer experience&lt;/li&gt;
&lt;li&gt;⚡ written in Golang&lt;/li&gt;
&lt;li&gt;📦 &lt;a href="https://bruin-data.github.io/bruin/getting-started/introduction/installation.html" rel="noopener noreferrer"&gt;easy to install&lt;/a&gt; and use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that with Bruin, teams can build end-to-end workflows without having to resort to a bunch of different tools. It is extensible through SQL and Python, while its opinionated approach guides users toward building maintainable data pipelines.&lt;/p&gt;
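&lt;p&gt;To give a feel for that opinionated approach, here is a rough sketch of what a single Bruin SQL asset can look like: metadata, materialization, and quality checks live right next to the query. The asset name, platform type, and column fields below are made up for illustration; check the Bruin docs for the exact schema.&lt;/p&gt;

```sql
/* @bruin
name: marts.daily_revenue
type: bq.sql
materialization:
  type: table
columns:
  - name: day
    type: date
    checks:
      - name: not_null
@bruin */

SELECT DATE(created_at) AS day, SUM(amount) AS revenue
FROM raw.orders
GROUP BY 1
```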

&lt;p&gt;One of the things that accompany Bruin CLI is our open-source Visual Studio Code extension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdf5lxq546pdcjw6415q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdf5lxq546pdcjw6415q.gif" alt="Bruin VS Code Demo" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The extension does a few things that make it pretty unique:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;While everything in Bruin is driven with code, the extension adds a UI layer on top of it, which means you get:

&lt;ul&gt;
&lt;li&gt;visual documentation&lt;/li&gt;
&lt;li&gt;rendered queries&lt;/li&gt;
&lt;li&gt;column &amp;amp; quality checks&lt;/li&gt;
&lt;li&gt;lineage&lt;/li&gt;
&lt;li&gt;the ability to validate code &amp;amp; run backfills&lt;/li&gt;
&lt;li&gt;syntax highlighting&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;everything happens locally, which means there are no external servers or systems that can access any of your data&lt;/li&gt;

&lt;li&gt;the extension visualizes a lot of the configuration options, which makes it trivial to run backfills, validations, and more&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;This is a good example of our design principles: everything is version-controlled, while also giving a better experience through a thoughtful UI.&lt;/p&gt;

&lt;p&gt;The extension is a first-class citizen of the Bruin ecosystem, and we intend to expand its functionality further to make it the easiest platform out there for building data workloads.&lt;/p&gt;

&lt;h2&gt;Supported Platforms&lt;/h2&gt;

&lt;p&gt;Bruin supports many cloud data platforms out of the box at launch:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS Athena&lt;/li&gt;
&lt;li&gt;Databricks&lt;/li&gt;
&lt;li&gt;DuckDB&lt;/li&gt;
&lt;li&gt;Google BigQuery&lt;/li&gt;
&lt;li&gt;Microsoft SQL Server&lt;/li&gt;
&lt;li&gt;Postgres&lt;/li&gt;
&lt;li&gt;Redshift&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;Synapse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The list of platforms we support will keep growing over time. We always look forward to community feedback, so feel free to share your thoughts with us in &lt;a href="https://join.slack.com/t/bruindatacommunity/shared_invite/zt-2dl2i8foy-bVsuMUauHeN9M2laVm3ZVg" rel="noopener noreferrer"&gt;our Slack community&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Bruin Cloud&lt;/h2&gt;

&lt;p&gt;We are building Bruin for those obsessed with impact. You can go from zero to full data pipelines in minutes, and we are dedicated to making this experience even better. Using all of our open-source tooling, you can build and run all of your data workloads locally, on GitHub Actions, in Airflow, or anywhere else.&lt;/p&gt;

&lt;p&gt;While we do believe there are many useful ways to deploy Bruin CLI across different infrastructures, we are also obsessed with building the best managed experience for building and running Bruin workloads in production. That’s why we are building Bruin Cloud:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Image: the lineage view on Bruin Cloud.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It has quite a few niceties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;managed environment for ingestion, transformation, and ML workloads&lt;/li&gt;
&lt;li&gt;column-level lineage&lt;/li&gt;
&lt;li&gt;governance &amp;amp; cost reporting&lt;/li&gt;
&lt;li&gt;team management&lt;/li&gt;
&lt;li&gt;cross-pipeline dependencies&lt;/li&gt;
&lt;li&gt;multi-repo “mesh”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and quite a few more. Feel free to drop your email to get a demo.&lt;/p&gt;

&lt;h2&gt;Share your thoughts&lt;/h2&gt;

&lt;p&gt;We are very excited to share Bruin CLI &amp;amp; the VS Code extension with the world, and we would love to hear from the community. We’d appreciate it if you shared your thoughts on what would make Bruin more useful for your needs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/bruin-data/bruin" rel="noopener noreferrer"&gt;https://github.com/bruin-data/bruin&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>golang</category>
    </item>
    <item>
      <title>The Pains of Data Ingestion</title>
      <dc:creator>Burak Karakan</dc:creator>
      <pubDate>Tue, 27 Feb 2024 09:45:51 +0000</pubDate>
      <link>https://forem.com/burakkarakan/the-pains-of-data-ingestion-g05</link>
      <guid>https://forem.com/burakkarakan/the-pains-of-data-ingestion-g05</guid>
      <description>&lt;p&gt;One of the first issues companies run into when it comes to analyzing their data is having to move the data off of their production databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Production databases are designed differently from analytical ones, e.g. row-oriented rather than column-oriented storage, which makes them ill-suited to analytical workloads.&lt;/li&gt;
&lt;li&gt;Analytical use-cases have very different SLAs and requirements than production applications, which means their reliability needs are very different. An analytical query may take 1.2s instead of 400ms and that would be fine, whereas that latency could be a huge disruption to the user experience if it happened on the production database.&lt;/li&gt;
&lt;li&gt;Online Transaction Processing (OLTP) databases are focused on transactional use-cases, which means they lack quite a few features around data analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to these factors, once they reach a certain size and scale, companies usually move their data into an analytical database such as Google BigQuery or Snowflake for analytical purposes. &lt;/p&gt;

&lt;p&gt;While moving the data to a database that is fit for purpose sounds good, it has its own challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data needs to be copied over at a regular cadence via some tool/custom code. This means extra effort and cost to build and maintain.&lt;/li&gt;
&lt;li&gt;The data is now duplicated across multiple databases, meaning that changes in the source need to be accurately reflected in the analytical database.&lt;/li&gt;
&lt;li&gt;The assumptions around the atomicity/reliability of the data change since there are multiple places where the data resides now.&lt;/li&gt;
&lt;li&gt;After a certain data size, transferring the data becomes expensive and slow, requiring further engineering investment to make the process more efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these factors add up to the problem of data ingestion, and there is already a bunch of tools on the market that aim to solve it.&lt;/p&gt;

&lt;h2&gt;Building everything from scratch&lt;/h2&gt;

&lt;p&gt;The moment the data ingestion/copy problem is acknowledged, the first reaction across many teams is to build a tool that does the ingestion for them, and then schedule it via cronjobs or more advanced solutions. The problem sounds simple on the surface:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download a copy of the original data&lt;/li&gt;
&lt;li&gt;Upload it to the destination database either via SQL insert statements, or some other platform-specific way to load the data into the database&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this on-the-surface analysis forgets quite a few crucial questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How do you ensure the data is copied over accurately?&lt;/li&gt;
&lt;li&gt;What happens when there’s a bug that requires copying the data again?&lt;/li&gt;
&lt;li&gt;What if the data does not fit into memory at once? How do you paginate?&lt;/li&gt;
&lt;li&gt;What happens as the data grows and copying/overwriting everything becomes too slow/expensive?&lt;/li&gt;
&lt;li&gt;What happens when the schema of the data changes?&lt;/li&gt;
&lt;li&gt;How does the team know about failures?&lt;/li&gt;
&lt;li&gt;Where do you deploy the logic?&lt;/li&gt;
&lt;li&gt;and quite a few more…&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, there are many open questions, and they all require a solid understanding of the problem at hand, along with the investment to make the overall initiative a success. Otherwise, the engineering team builds quick hacks to get up and running, and these “hacks” start to become the backbone of the analytical use-cases, making it very hard, if not impossible, to evolve the architecture as the business evolves.&lt;/p&gt;
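&lt;p&gt;To make the hidden complexity concrete, here is a minimal sketch of what “just copy the table” turns into once pagination and incremental state enter the picture. It is purely illustrative: it uses SQLite on both ends, a hard-coded schema, and it still ignores retries, schema changes, deletes, and alerting.&lt;/p&gt;

```python
import sqlite3

def copy_incrementally(source, dest, table, watermark_col, batch_size=1000):
    """Naive incremental copy: pull only rows newer than the destination's
    watermark, in pages, so the whole table never has to fit in memory.
    Still unanswered here: retries, schema changes, deletes, alerting."""
    # Hard-coded destination schema; real tools must discover and evolve it.
    dest.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id INTEGER PRIMARY KEY, value TEXT, updated_at INTEGER)"
    )
    # The destination's max watermark tells us where the last run stopped.
    row = dest.execute(f"SELECT MAX({watermark_col}) FROM {table}").fetchone()
    watermark = row[0] if row[0] is not None else -1

    cur = source.execute(
        f"SELECT id, value, updated_at FROM {table} "
        f"WHERE {watermark_col} > ? ORDER BY {watermark_col}",
        (watermark,),
    )
    copied = 0
    while True:
        # Paginate so large tables do not blow up memory.
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        # Idempotent upsert, so a re-run after a crash does not duplicate rows.
        dest.executemany(f"INSERT OR REPLACE INTO {table} VALUES (?, ?, ?)", batch)
        copied += len(batch)
    dest.commit()
    return copied
```

Even this toy version has to decide on a watermark column, a batching scheme, and an idempotent write strategy; every question in the list above would add more code on top.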

&lt;p&gt;Some smart people saw the problem at hand and came up with various solutions to make this process easier.&lt;/p&gt;

&lt;h2&gt;No-code solutions&lt;/h2&gt;

&lt;p&gt;Over the years some teams have decided that data ingestion can be performed simply via UI-driven solutions that have pre-built connectors across various platforms, which means non-technical people can also ingest data. Two major players that come to mind are Fivetran and Airbyte, both giant companies trying to tackle the long-tail of the data ingestion problem.&lt;/p&gt;

&lt;p&gt;Even though there are a few differences between these no-code platforms, their primary approach is that you use their UI to set up connectors, and you forget about the problem without needing any technical person, e.g. a marketing person can set up a data ingestion task from Postgres to BigQuery.&lt;/p&gt;

&lt;p&gt;While these tools do have a great deal of convenience, they still pose some challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moving data is fairly technical work with quite a few open questions about how to copy it, and the data work rarely ends once the copy is done. Technical people such as data analysts, scientists, or engineers still need to be involved in the process, which means the actual audience is data people rather than non-technical folks.&lt;/li&gt;
&lt;li&gt;The UI-driven workflow causes lock-in: the company cannot move away from these platforms until it builds a replacement, which usually means further investment later on to migrate the existing use-cases without disrupting the current ways of working.&lt;/li&gt;
&lt;li&gt;For the open-source solutions such as Airbyte, they still need to be hosted and maintained internally, which means engineering time and effort &amp;amp; infrastructure costs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in all, while UI-driven data ingestion tools like Fivetran or Airbyte allow teams to get going from zero, there are still issues that cause teams to stay away from them and resort to writing code for the flexibility it provides.&lt;/p&gt;

&lt;h2&gt;Yes-code solutions: dlt&lt;/h2&gt;

&lt;p&gt;There is an emerging open-source Python library called dlt, from the company dltHub, which focuses on use cases where code will still be written to ingest the data, but where that code can be much smaller and more maintainable. dlt has built-in open-source connectors, but it also allows teams to build custom sources &amp;amp; destinations for their specific needs. It is flexible, yet allows quick iteration when it comes to ingestion.&lt;/p&gt;

&lt;p&gt;There are a couple of things dlt takes care of very nicely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;dlt supports schema evolution, meaning that when the schema of an upstream dataset changes, dlt will make sure the destination tables are updated accordingly.&lt;/li&gt;
&lt;li&gt;dlt supports custom sources and destinations, meaning that their prebuilt sources &amp;amp; destinations can be combined with custom ones.&lt;/li&gt;
&lt;li&gt;dlt has support for incremental updates and deduplication, which means that the data can be incrementally updated with only the changed data while ensuring it still matches the source data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;dlt is quite a powerful library and has a very vibrant, growing community. It might be the perfect companion for engineers who want to write code for custom requirements.&lt;/p&gt;

&lt;p&gt;However, we felt that there might be a middle ground for simpler use-cases that don’t require coding, but also don’t lock us into a UI-driven workflow.&lt;/p&gt;

&lt;p&gt;While we like dlt a lot at &lt;a href="https://getbruin.com"&gt;Bruin&lt;/a&gt;, we felt that there were quite a few simpler scenarios that we couldn’t justify writing, maintaining, and deploying code for:&lt;/p&gt;

&lt;p&gt;Some of our customers wanted to be able to simply copy a source table to a destination, and override the data in the destination because the data was small enough.&lt;/p&gt;

&lt;p&gt;Some others required fixed incremental strategies, such as “just get the latest data based on the &lt;code&gt;updated_at&lt;/code&gt; column.”&lt;/p&gt;

&lt;p&gt;Some others needed to be able to merge the new records with the old ones incrementally.&lt;/p&gt;

&lt;p&gt;While all of these are possible with dlt, it requires these people to write code and figure out a way to deploy them and monitor them. It is not incredibly hard, but it is also not trivial. Seeing all these patterns, we have decided to take a stab at the problem in an open-source fashion.&lt;/p&gt;

&lt;h1&gt;Introducing: ingestr&lt;/h1&gt;

&lt;p&gt;ingestr is a command-line application that allows you to ingest data from any source into any destination using simple command-line flags, without writing any code, while still keeping the ingestion as part of your tech stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✨ copy data from your Postgres / MySQL / SQL Server or any other source into any destination, such as BigQuery or Snowflake&lt;/li&gt;
&lt;li&gt;➕ incremental loading&lt;/li&gt;
&lt;li&gt;🐍 single-command installation: &lt;code&gt;pip install ingestr&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ingestr takes away the complexity of managing a backend or writing code for ingesting data: simply run the command and watch the magic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk8a11o8xdyq72pk61xe.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgk8a11o8xdyq72pk61xe.gif" alt="A GIF showing how ingestr works" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ingestr makes a couple of opinionated decisions about how the ingestion should work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it is built as an open-source solution, and anyone can use it for any purpose based on its permissive license&lt;/li&gt;
&lt;li&gt;ingestr treats everything as a URI: every source and every destination has its own URI based on SQLAlchemy connection URIs.&lt;/li&gt;
&lt;li&gt;ingestr has a few built-in incremental loading strategies: &lt;code&gt;replace&lt;/code&gt;, &lt;code&gt;append&lt;/code&gt;, &lt;code&gt;merge&lt;/code&gt;, and &lt;code&gt;delete+insert&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
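&lt;p&gt;The semantics of those loading strategies can be sketched in a few lines of plain Python. This is only an illustration of the intended behavior on in-memory rows, not ingestr’s actual implementation, which operates on database tables:&lt;/p&gt;

```python
def apply_strategy(existing, new_rows, strategy, key="id"):
    """Illustrative semantics of the four incremental strategies;
    rows are plain dicts identified by a primary-key column."""
    if strategy == "replace":
        # Drop whatever is in the destination and load only the new data.
        return list(new_rows)
    if strategy == "append":
        # Blindly add the new rows; deduplication is the caller's problem.
        return existing + list(new_rows)
    if strategy == "merge":
        # Upsert: new rows overwrite existing rows that share the same key.
        merged = {row[key]: row for row in existing}
        for row in new_rows:
            merged[row[key]] = row
        return list(merged.values())
    if strategy == "delete+insert":
        # Delete every key present in the new batch, then insert the batch.
        incoming = {row[key] for row in new_rows}
        kept = [row for row in existing if row[key] not in incoming]
        return kept + list(new_rows)
    raise ValueError(f"unknown strategy: {strategy}")
```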

&lt;p&gt;While there will be quite a few scenarios where the teams would benefit from the flexibility dlt provides, we believe that 80% of the real-life scenarios out there would fall into these presets, and for those ingestr could simplify things quite a bit.&lt;/p&gt;

&lt;p&gt;🌟 Give it a look and give us a star &lt;a href="https://github.com/bruin-data/ingestr"&gt;on GitHub&lt;/a&gt;! We’d love to hear your feedback and feel free to join our Slack community &lt;a href="https://join.slack.com/t/bruindatacommunity/shared_invite/zt-2dl2i8foy-bVsuMUauHeN9M2laVm3ZVg"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>The Mythical Data Team</title>
      <dc:creator>Burak Karakan</dc:creator>
      <pubDate>Thu, 08 Feb 2024 15:24:04 +0000</pubDate>
      <link>https://forem.com/burakkarakan/the-mythical-data-team-mnm</link>
      <guid>https://forem.com/burakkarakan/the-mythical-data-team-mnm</guid>
      <description>&lt;p&gt;The mythical "data teams" have been living their prime time over the past few years. The first thing every company that is moving towards a &lt;em&gt;data-driven culture&lt;/em&gt; does is to hire engineers for the thing they wished they did well, without thinking why. &lt;/p&gt;

&lt;p&gt;It's like the "DevOps" of the 2010s: everyone claims they &lt;em&gt;do it&lt;/em&gt;, but everyone understands the role –or the philosophy, for that matter– differently.&lt;/p&gt;

&lt;p&gt;The whole story revolves around this idea that the lifecycle of data is too complex –&lt;em&gt;it is!&lt;/em&gt;– for analysts/scientists to manage –&lt;em&gt;not just for them, all of us!&lt;/em&gt;–, therefore we drop these engineers from outside into a highly complex business environment, expecting them to deliver value using data right away. &lt;/p&gt;

&lt;p&gt;Of course, this never works: engineering teams spend months building infrastructure, restructuring the data, and building tooling for things that might not even be used by the rest of the business. In the end, the company is out millions of dollars, and likely worse than where they were before.&lt;/p&gt;

&lt;h2&gt;Delivering without impact&lt;/h2&gt;

&lt;p&gt;The idea that data can be dealt with outside the rest of the business by throwing more engineers at it is a very old one. It often plays out as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The company builds software and becomes successful in some way.&lt;/li&gt;
&lt;li&gt;They generate a lot of data, though never as a first-class citizen, always as an afterthought.&lt;/li&gt;
&lt;li&gt;There'll be a Jane in marketing and a Joe in sales who know how to pull some data into Excel and get some numbers out. The data and the story it tells are still an afterthought.&lt;/li&gt;
&lt;li&gt;The company grows further, but the "Jane"s and "Joe"s are not enough to serve the rest of the business. They decide to hire an analyst or two. 

&lt;ul&gt;
&lt;li&gt;Guess what? Data is still an afterthought.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;The analysts are trying to help the rest of the business, but are unable to keep up with all the demand and feel abandoned while trying to navigate the hot mess.&lt;/li&gt;
&lt;li&gt;At some point, some exec will get mad at not being able to get some numbers and call the shots to build a data team.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where things get nasty, because an issue that is primarily a cultural matter is being tackled by throwing people at it. The budgets are secured, the openings are posted, and the applications start flowing in, with no change in how the company thinks about data.&lt;/p&gt;

&lt;p&gt;The engineers that are hired jump straight into throwing solutions at problems that are not real problems, simply because that's what they can &lt;em&gt;influence&lt;/em&gt; rather than what should be done. The ability to deliver results is euphoric, regardless of their impact. A complex infrastructure is built to move a CSV from Google Drive to S3, and the leaders feel accomplished: "&lt;em&gt;look at all this cloud bill we have! we are definitely data-driven.&lt;/em&gt;"&lt;/p&gt;

&lt;p&gt;Data, my friends, is &lt;em&gt;still&lt;/em&gt; an afterthought.&lt;/p&gt;

&lt;h2&gt;Data as a core value&lt;/h2&gt;

&lt;p&gt;The reason companies struggle to shift away from the data-as-an-afterthought mindset is the top-down approach towards getting value out of data: data is not like cash, which a leader can decide how best to use at any given moment; on the contrary, &lt;strong&gt;data is like oil&lt;/strong&gt;. Not in the sense of "&lt;a href="https://www.forbes.com/sites/nishatalagala/2022/03/02/data-as-the-new-oil-is-not-enough-four-principles-for-avoiding-data-fires/?sh=25cd5977c208"&gt;data is the new oil yay&lt;/a&gt;", but in the sense that it needs a lengthy and expensive process to bring out its value, and that requires deep investment. You know people, they don't love deep investments. &lt;/p&gt;

&lt;p&gt;The first step in this process is to treat data as an asset, just like any other asset the company has, not as something nice to have. Good use of data can propel the business further than any accounting process or any lawyer protecting the company, so it needs to be treated with the same importance. Do you leave your bookkeeping unattended? Then you should not leave your data unattended either. This needs to propagate through the ranks: the data is incredibly valuable, no data should be wasted, and it should be treated with the utmost care.&lt;/p&gt;

&lt;p&gt;Once the importance of data is clear to the rest of the business, the first tangible action follows: data is not an entity separate from the software organization, so make that organization own it.&lt;/p&gt;

&lt;h2&gt;Own the damn data&lt;/h2&gt;

&lt;p&gt;I have seen it countless times: data is treated as if it were a separate thing from the rest of the software –and the team that builds it–, which means there's a data analyst on one side trying to duct-tape a bunch of tables together in some weird drag-and-drop tool, while the software team drops a full table from production the next day.&lt;/p&gt;

&lt;p&gt;The organizations that have the healthiest data landscape are those that have a very clear understanding of data ownership: every bit of data that is produced/ingested/transformed/used must have an owner, no discussion. Do you own the service that writes to this database? You own the data. Do you own the internal events being generated on Kafka? You own the data. Do you join these billion different tables into a new table? You own the data.&lt;/p&gt;

&lt;p&gt;The most important point here is that the software teams are aware that the data they produce is owned by them. This means that they will be responsible for a few core questions to be answered properly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How will this data be made available to the rest of the organization?&lt;/li&gt;
&lt;li&gt;How will the quality of this data be ensured?&lt;/li&gt;
&lt;li&gt;How would anyone notice if the data went corrupt? How quickly?&lt;/li&gt;
&lt;li&gt;What is the change management process around this data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means that software teams will start treating their data just as they treat their services. Not sure about your experience, but in mine the quality of software services has been held to a much higher standard than the quality of the data they produce, which makes this a win in my book.&lt;/p&gt;

&lt;h2&gt;Data as a product&lt;/h2&gt;

&lt;p&gt;There has been a large shift in the software world with the spread of concepts such as &lt;a href="https://martinfowler.com/bliki/DomainDrivenDesign.html"&gt;Domain-Driven Design&lt;/a&gt;, and the result ended up being domain-oriented services, owned by domain-focused teams, managing individual "products" of the larger product the company produces. This enabled ownership, independence, reliability, and more importantly the ability to deliver high-quality software quickly to become a competitive advantage. The data world is in for a similar transition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6slzbz60q1ikoiwuu24c.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6slzbz60q1ikoiwuu24c.jpeg" alt="An image that compares the different types of data teams" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The days of having a team called the "data team" are over. Any sufficiently large software organization has noticed the drawbacks of isolated functional teams and has instead transitioned towards cross-functional, agile teams; the same principle applies to data teams. Instead of a siloed team that takes in the data everyone else in the company produces and tries to make sense of it, the data team should be distributed among the business teams. &lt;/p&gt;

&lt;p&gt;The mindset of business teams then shifts towards treating data as an end-to-end product, with the associated rise in quality that comes with it. The organization treats data as part of its core product, and applies the appropriate measures to building, changing, governing, and protecting it. &lt;/p&gt;

&lt;p&gt;This requires a rethinking of the structure of the data team:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The team is transparent and spread across the whole organization.&lt;/li&gt;
&lt;li&gt;The data people within the organization work very closely with the business and product teams, &lt;strong&gt;no more siloes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The data team needs fewer engineers, more analysts &amp;amp; scientists who understand the business context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, you do not have a central data team; you have an organization that speaks data on all levels, across all teams. We are making a huge bet on this at &lt;a href="https://getbruin.com"&gt;Bruin&lt;/a&gt;, doing our best to ensure organizations can actually scale to this point of data maturity. &lt;/p&gt;

&lt;h2&gt;This will take time, and it's fine&lt;/h2&gt;

&lt;p&gt;Shifting the whole data-as-an-afterthought mindset to making it a central beat of the company's heart is kind of like finally deciding to clean up that one junk drawer in your kitchen. You know it’s going to be a mess, and it’s way easier to just keep shoving stuff in there, but once you get it sorted, finding batteries or that one specific takeout menu becomes a breeze. It’s all about making data not just something you do, but a part of who you are as a company.&lt;/p&gt;

&lt;p&gt;The real shift in terms of value generated from the data will come once the organizations internalize this change in the mindset.&lt;/p&gt;

</description>
      <category>data</category>
      <category>dataengineering</category>
      <category>team</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
