<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jurgita Motus</title>
    <description>The latest articles on Forem by Jurgita Motus (@jurgitamotus).</description>
    <link>https://forem.com/jurgitamotus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3520963%2F7a8f305f-047d-47a2-bfbc-5b56e4d684a5.png</url>
      <title>Forem: Jurgita Motus</title>
      <link>https://forem.com/jurgitamotus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jurgitamotus"/>
    <language>en</language>
    <item>
      <title>Build B2B data lists in seconds using plain English</title>
      <dc:creator>Jurgita Motus</dc:creator>
      <pubDate>Thu, 12 Mar 2026 11:48:41 +0000</pubDate>
      <link>https://forem.com/jurgitamotus/build-b2b-data-lists-in-seconds-using-plain-english-4011</link>
      <guid>https://forem.com/jurgitamotus/build-b2b-data-lists-in-seconds-using-plain-english-4011</guid>
      <description>&lt;p&gt;Coresignal launched &lt;a href="https://www.producthunt.com/products/coresignal/launches/coresignal-data-search" rel="noopener noreferrer"&gt;Data Search Lists on Product Hunt&lt;/a&gt; today!&lt;/p&gt;

&lt;p&gt;We're excited to introduce our newest product: the Data Search tool with its new Lists feature. We built Lists to simplify B2B data access, which today comes with a few familiar pain points:&lt;/p&gt;

&lt;p&gt;❌ Getting started often means navigating complex filters and Boolean logic, or studying API docs before you can test anything&lt;/p&gt;

&lt;p&gt;❌ It's hard to sample the data without technical skills&lt;/p&gt;

&lt;p&gt;❌ Enrichment is a separate, time-consuming step&lt;/p&gt;

&lt;p&gt;❌ After all that, your list may still lack detail, so you need to enrich it with data from other providers&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So we asked: How can we reduce time-to-value and create easier data flow for non-technical users?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With Data Search Lists, you write a prompt like &lt;em&gt;Find SaaS companies that raised funding in the last year and currently have at least one active job posting&lt;/em&gt; and get a quality list of the best-matched records in our newly created interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes us different:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;✅ AI-ranked results. Top matches based on your prompt, no alphabetical or random dumps&lt;/p&gt;

&lt;p&gt;✅ See the query. Full transparency into how your list is built &lt;/p&gt;

&lt;p&gt;✅ Flexible enrichment. Pick the data fields you need, enrich all or selected profiles &lt;/p&gt;

&lt;p&gt;✅ Multi-source data with up to 500 data points per record&lt;/p&gt;

&lt;p&gt;✅ Preview 100 records free. Refine your search before committing &lt;/p&gt;

&lt;p&gt;✅ Bulk download up to 10K records. Scale from quick research to full campaigns&lt;/p&gt;

&lt;p&gt;🎁 We're celebrating our Product Hunt launch and want you to test Coresignal Data Search yourself! &lt;strong&gt;There’s a promo code &lt;a href="https://www.producthunt.com/products/coresignal/launches/coresignal-data-search" rel="noopener noreferrer"&gt;on the launch page&lt;/a&gt; to get the first month of our Starter plan for free.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Would really appreciate an upvote and any feedback!&lt;/p&gt;

</description>
      <category>database</category>
      <category>data</category>
      <category>b2b</category>
      <category>startup</category>
    </item>
    <item>
      <title>The Rise of Open-Source Data Catalogs: New Opportunities For Implementing Data Mesh</title>
      <dc:creator>Jurgita Motus</dc:creator>
      <pubDate>Tue, 30 Sep 2025 11:07:04 +0000</pubDate>
      <link>https://forem.com/jurgitamotus/the-rise-of-open-source-data-catalogs-new-opportunities-for-implementing-data-mesh-107n</link>
      <guid>https://forem.com/jurgitamotus/the-rise-of-open-source-data-catalogs-new-opportunities-for-implementing-data-mesh-107n</guid>
      <description>&lt;p&gt;&lt;strong&gt;While the concept of data mesh as a data architecture model has been around for a while, it was hard to figure out how to implement it easily and at scale. That is, until now. Two data catalogs went open-source this year, changing how companies manage their data pipelines.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s examine how open-source data catalogs can simplify data mesh implementation and what your organization can do to prepare for this change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding data mesh
&lt;/h2&gt;

&lt;p&gt;Data mesh is a decentralized architecture type that allows different departments to access data independently. It’s different from traditional data architecture, which usually has dedicated data engineering teams that provide access to information after other departments request it.&lt;/p&gt;

&lt;p&gt;Both methods have advantages, but the difference lies in scalability, flexibility, and access speed: more teams can work with the relevant data faster.&lt;/p&gt;

&lt;p&gt;Think about the difference between traditional broadcast television, where the programming is centralized, and streaming services, which work independently and provide access to different types of custom content.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcus2k21k3mr3xqnukw67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcus2k21k3mr3xqnukw67.png" alt="Why choose data mesh" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The creator of the concept, Zhamak Dehghani, &lt;a href="https://learning.oreilly.com/library/view/data-mesh/9781492092384/" rel="noopener noreferrer"&gt;explains&lt;/a&gt; that it is a shift defined by the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain-oriented data ownership.&lt;/strong&gt; The departments closest to data should own it. For example, marketing teams should fully manage the entire marketing data pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data as a product.&lt;/strong&gt; Internally used domain-oriented data should be manageable and usable, just as product data would be. It has to be discoverable, secure, and interoperable, among other things.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-service data infrastructure.&lt;/strong&gt; Although data ownership is domain-oriented, every team should be able to access the data quickly, allowing cross-functional collaboration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Federated computational governance.&lt;/strong&gt; Data should be governed by considering legal, compliance, and security requirements. This means that it has to be interoperable, its governance must be transparent, and its policies should be automatically applied to every data product across the entire organization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data mesh principles would be the most valuable to large and quickly growing mid-sized businesses with hundreds or thousands of data users across multiple departments and business units needing domain-specific data management. In these cases, decentralized ownership improves data access and reduces the dependency on central data teams.&lt;/p&gt;

&lt;p&gt;As most large organizations have the resources to purchase commercial data catalogs, such as Collibra, open-source access might be revolutionary for medium-sized organizations seeking to optimize their costs and prepare their infrastructure for future growth.&lt;/p&gt;

&lt;p&gt;So, how do open-source data catalogs unlock these opportunities?&lt;/p&gt;

&lt;h2&gt;
  
  
  The emergence of open-source data catalogs
&lt;/h2&gt;

&lt;p&gt;Data catalogs provide a bird’s-eye view of your data assets. However, this service used to get quite expensive as your infrastructure expanded. As a result, companies would stick to the usual structure without a comprehensive inventory. After all, data storage and management are costly by default, even without paying for nice-to-have services.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1d31wjsjcp3qxruuxu3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1d31wjsjcp3qxruuxu3.png" alt="What is data catalog" width="720" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This summer, multiple providers decided to go open-source, including &lt;a href="https://www.snowflake.com/en/blog/introducing-polaris-catalog/" rel="noopener noreferrer"&gt;Polaris Catalog&lt;/a&gt; by Snowflake and &lt;a href="https://www.databricks.com/blog/open-sourcing-unity-catalog" rel="noopener noreferrer"&gt;Unity Catalog&lt;/a&gt; from Databricks, joining existing open-source data catalogs such as Apache Atlas and DataHub.&lt;/p&gt;

&lt;p&gt;So, how do these tools help to enable data mesh?&lt;/p&gt;

&lt;p&gt;The open-source data catalogs provide several key features that are beneficial for a data mesh. These include a centralized metadata repository to enable the discovery of data assets across decentralized data domains.&lt;/p&gt;

&lt;p&gt;The tools also help to enforce governance policies, track data lineage, ensure data quality, and understand data assets through a single layer of control, regardless of where those assets reside. They also enable role-based access control and attribute-based access control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advantages of open-source data catalogs in data mesh
&lt;/h2&gt;

&lt;p&gt;Open-source data catalogs unlock a new era of data mesh because these tools are now free and very flexible. There are many options to choose from, and you can integrate multiple platforms and tools without vendor lock-in.&lt;/p&gt;

&lt;p&gt;In addition, open-source tools are supported by a community of millions of developers and engineers ready to share their experiences and assist in solving everyday problems. At the same time, open source enables the community to develop new plugins or extensions, which unlocks multiple opportunities and new use cases.&lt;/p&gt;

&lt;p&gt;In essence, data mesh helps with &lt;a href="https://dev.to/jurgitamotus/building-a-robust-data-observability-framework-to-ensure-data-quality-and-integrity-12hj"&gt;data observability&lt;/a&gt; — another important element every organization should consider. With granular access controls, data lineage, and domain-specific audit logs, data catalogs give engineers and developers a better view of their systems than before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5hy5j47brmked605do2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5hy5j47brmked605do2.png" alt="Open-source data catalogs for data mesh" width="720" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, while the tools have multiple advantages, data catalogs might not be right for smaller organizations. Setting up this architecture and integrating multiple domains requires a lot of technical knowledge. At the same time, using data catalogs for data mesh might not be a good option for companies with on-premises setups or those working in multi-cloud environments.&lt;/p&gt;

&lt;p&gt;As always, it’s important to research your options and ensure that the platforms you use are compatible with your organization’s current situation and future goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to implement data mesh with open-source data catalogs
&lt;/h2&gt;

&lt;p&gt;To build data mesh architecture in your organization, you should start by engaging with the community and exploring how other organizations have used these tools.&lt;/p&gt;

&lt;p&gt;While every journey will be different, generally, you will have to achieve the following milestones:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Assess the needs.&lt;/strong&gt; First, evaluate the current data infrastructure and define your organization’s key domains. Explore how each domain would be structured and whether your organization is big enough to need the restructuring.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try out different data catalogs.&lt;/strong&gt; Compare the features and capabilities of different open-source options to ensure you (and your team) pick the right ones for testing. Consult the community while installing, configuring, and customizing the tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Onboard the domains.&lt;/strong&gt; Once you have compared the options and selected the best tool, it’s time to establish domain teams and assign data ownership.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define and implement governance policies.&lt;/strong&gt; Together with domain owners, legal, compliance, and other responsible teams, define the data governance standards and set up the policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrate with existing data infrastructure.&lt;/strong&gt; Once the domains are defined and onboarded and the data governance rules are clear, you must connect the catalog to data sources, pipelines, and business intelligence tools.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Train the teams.&lt;/strong&gt; Providing training for domain teams and data consumers ensures each team has enough knowledge to own their domain fully.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintain the data mesh infrastructure.&lt;/strong&gt; Once everything is set up, the final step is to review the policies regularly and update the metadata and governance practices.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have the initial direction, you should start small — pick just one domain — and scale up once all the key stakeholders are on board.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12ykse9dl3z7hv0mh0zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F12ykse9dl3z7hv0mh0zl.png" alt="How to implement data mesh with open-source data catalogs" width="720" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next for data mesh?
&lt;/h2&gt;

&lt;p&gt;Data mesh architecture allows organizations to be faster, work smarter, and work as flexibly as necessary. With interoperability, multiple integration options, and customizable plugins, every domain can build its preferred data infrastructure more efficiently without compromising the speed of access.&lt;/p&gt;

&lt;p&gt;While this is a change that most organizations should explore in the near future, data mesh integration cannot be done without prioritizing data security, privacy, and compliance. At the same time, it means that teams that are typically not as tech-savvy will also need to build the necessary skill sets to succeed.&lt;/p&gt;

&lt;p&gt;With the emergence of machine learning and advanced data analytics, accessing data faster is crucial for business. This is an exciting time of transformation for data infrastructure — and open-source data catalogs will allow it to happen.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Automating Data Quality Checks: A Practical Guide Using Dagster and Great Expectations</title>
      <dc:creator>Jurgita Motus</dc:creator>
      <pubDate>Tue, 30 Sep 2025 10:51:45 +0000</pubDate>
      <link>https://forem.com/jurgitamotus/automating-data-quality-checks-a-practical-guide-using-dagster-and-great-expectations-2c83</link>
      <guid>https://forem.com/jurgitamotus/automating-data-quality-checks-a-practical-guide-using-dagster-and-great-expectations-2c83</guid>
      <description>&lt;p&gt;&lt;strong&gt;Ensuring data quality is paramount for businesses relying on data-driven decision-making. As data volumes grow and sources diversify, manual quality checks become increasingly impractical and error-prone. This is where automated data quality checks come into play, offering a scalable solution to maintain data integrity and reliability.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At my organization, which collects large volumes of public web data, we’ve developed a robust system for automated data quality checks using two powerful open-source tools: &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt; and &lt;a href="https://greatexpectations.io/" rel="noopener noreferrer"&gt;Great Expectations&lt;/a&gt;. These tools are the cornerstone of our approach to data quality management, allowing us to efficiently validate and monitor our data pipelines at scale.&lt;/p&gt;

&lt;p&gt;In this article, I’ll explain how we use Dagster, an open-source data orchestrator, and Great Expectations, a data validation framework, to implement comprehensive automated data quality checks. I’ll also explore the benefits of this approach and provide practical insights into our implementation process, including a GitLab demo, to help you understand how these tools can enhance your own data quality assurance practices.&lt;/p&gt;

&lt;p&gt;Let’s discuss each of them in more detail before moving to practical examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dagster: an open-source data orchestrator
&lt;/h2&gt;

&lt;p&gt;Used for ETL, analytics, and machine learning workflows, Dagster lets you build, schedule, and monitor data pipelines. This Python-based tool allows data scientists and engineers to easily debug runs, inspect assets, or get details about their status, metadata, or dependencies.&lt;/p&gt;

&lt;p&gt;As a result, Dagster makes your data pipelines more reliable, scalable, and maintainable. It can be deployed on Azure, Google Cloud, AWS, and alongside many other tools you may already be using. Airflow and Prefect can be named as Dagster competitors, but I personally see more pros in Dagster, and you can find plenty of comparisons online before committing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0sfybor4yci57g2dni7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi0sfybor4yci57g2dni7.png" alt="Dagster pipeline overview" width="800" height="313"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Great Expectations: a data validation framework
&lt;/h2&gt;

&lt;p&gt;A great tool with a great name, Great Expectations is an open-source platform for maintaining data quality. This Python library uses “Expectation” as its in-house term for assertions about data.&lt;/p&gt;

&lt;p&gt;Great Expectations provides validations based on the schema and values. Some examples of such rules could be max or min values and count validations. It can also profile the input data and generate expectations from it automatically. Of course, this feature usually requires some tweaking, but it definitely saves time.&lt;/p&gt;

&lt;p&gt;Another useful aspect is that Great Expectations can be integrated with Google Cloud, Snowflake, Azure, and over 20 other tools. While it can be challenging for data users without technical knowledge, it’s nevertheless worth attempting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foicz45cfwbdg4ges65in.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foicz45cfwbdg4ges65in.png" alt="A simplified Great Expectations data validation flow" width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why are automated data quality checks necessary?
&lt;/h2&gt;

&lt;p&gt;Automated quality checks have multiple benefits for businesses that handle large volumes of business-critical data. If the information must be accurate, complete, and consistent, automation will always beat manual labor, which is prone to errors. Let’s take a quick look at the 5 main reasons why your organization might need automated data quality checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data integrity&lt;/strong&gt;&lt;br&gt;
Your organization can collect reliable data with a set of predefined quality criteria. This reduces the chance of wrong assumptions and decisions that are error-prone and not data-driven. Tools like Great Expectations and Dagster can be very helpful here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Error minimization&lt;/strong&gt;&lt;br&gt;
While there’s no way to eradicate the possibility of errors, you can minimize the chance of them occurring with automated data quality checks. Most importantly, this will help identify anomalies earlier in the pipeline, saving precious resources. In other words, error minimization prevents tactical mistakes from becoming strategic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Efficiency&lt;/strong&gt;&lt;br&gt;
Checking data manually is often time-consuming and may require more than one employee on the job. With automation, your data team can focus on more important tasks, such as finding insights and preparing reports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Real-time monitoring&lt;/strong&gt;&lt;br&gt;
Automation comes with real-time tracking. This way, you can detect issues before they become bigger problems. In contrast, manual checking takes longer and will never catch an error at the earliest possible stage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Compliance&lt;/strong&gt;&lt;br&gt;
Most companies that deal with public web data know about privacy-related regulations. In the same way, there may be a need for data quality compliance, especially if the data is later used in critical industries, such as pharmaceuticals or the military. When you have automated data quality checks implemented, you can provide specific evidence about the quality of your information, and the client only needs to review the data quality rules, not the data itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  How we test data quality
&lt;/h2&gt;

&lt;p&gt;As a public web data provider, we consider a well-oiled automated data quality check mechanism to be key. So how do we do it? First, we differentiate our tests by the type of data. The test naming might seem somewhat confusing because it was originally conceived for internal use, but it helps us to understand what we’re testing.&lt;/p&gt;

&lt;p&gt;We have two types of data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static data.&lt;/strong&gt; Static means that we don’t scrape the data in real-time but rather use a static fixture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic data.&lt;/strong&gt; Dynamic means that we scrape the data from the web in real-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, we further differentiate our tests by the type of data quality check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fixture tests.&lt;/strong&gt; These tests use fixtures to check the data quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coverage tests.&lt;/strong&gt; These tests use a set of rules to check the data quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s take a look at each of these tests in more detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static fixture tests&lt;/strong&gt;&lt;br&gt;
As mentioned earlier, these tests belong to the static data category, meaning we don’t scrape the data in real-time. Instead, we use a static fixture that we have saved previously.&lt;/p&gt;

&lt;p&gt;A static fixture is input data that we have saved previously. In most cases, it’s an HTML file of a web page that we want to scrape. For every static fixture, we have a corresponding expected output. This expected output is the data that we expect to get from the parser. The test works like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The parser receives the static fixture as an input.&lt;/li&gt;
&lt;li&gt;The parser processes the fixture and returns the output.&lt;/li&gt;
&lt;li&gt;The test checks if the output is the same as the expected output. This is not a simple JSON comparison because some fields are expected to change (such as the last updated date), but it is still a simple process.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We run this test in our CI/CD pipeline on merge requests to check if the changes we made to the parser are valid and if the parser works as expected. If the test fails, we know we have broken something and need to fix it.&lt;/p&gt;

&lt;p&gt;Static fixture tests are the most basic tests both in terms of process complexity and implementation because they only need to run the parser with a static fixture and compare the output with the expected output using a rather simple Python script.&lt;/p&gt;
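
&lt;p&gt;In simplified form, such a test boils down to a few lines. The parser import, fixture paths, and field names below are hypothetical stand-ins; the point is only to show the shape of the comparison, including how volatile fields such as the last updated date are excluded:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from pathlib import Path

from my_parser import parse  # hypothetical parser under test

VOLATILE_FIELDS = {"last_updated"}  # fields allowed to differ between runs


def strip_volatile(record):
    return {key: value for key, value in record.items() if key not in VOLATILE_FIELDS}


def test_company_parser():
    fixture_html = Path("fixtures/company_page.html").read_text()
    expected = json.loads(Path("fixtures/company_expected.json").read_text())

    parsed = parse(fixture_html)

    assert strip_volatile(parsed) == strip_volatile(expected)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;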

&lt;p&gt;Nevertheless, they are still really important because they are the first line of defense against breaking changes.&lt;/p&gt;

&lt;p&gt;However, a static fixture test cannot check whether scraping is working as expected or whether the page layout remains the same. This is where the dynamic tests category comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic fixture tests&lt;/strong&gt;&lt;br&gt;
Basically, dynamic fixture tests are the same as static fixture tests, but instead of using a static fixture as an input, we scrape the data in real-time. This way, we check not only the parser but also the scraper and the layout.&lt;/p&gt;

&lt;p&gt;Dynamic fixture tests are more complex than static fixture tests because they need to scrape the data in real-time and then run the parser with the scraped data. This means that we need to launch both the scraper and the parser in the test run and manage the data flow between them. This is where Dagster comes in.&lt;/p&gt;

&lt;p&gt;Dagster is an orchestrator that helps us to manage the data flow between the scraper and the parser. There are four main steps in the process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Seed the queue with the URLs we want to scrape&lt;/li&gt;
&lt;li&gt;Scrape&lt;/li&gt;
&lt;li&gt;Parse&lt;/li&gt;
&lt;li&gt;Check the parsed document against the saved fixture&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last step is the same as in static fixture tests, and the only difference is that instead of using a static fixture, we scrape the data during the test run.&lt;/p&gt;

&lt;p&gt;Dynamic fixture tests play a very important role in our data quality assurance process because they check both the scraper and the parser. Also, they help us understand if the page layout has changed, which is impossible with static fixture tests. This is why we run dynamic fixture tests in a scheduled manner instead of running them on every merge request in the CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;However, dynamic fixture tests do have a pretty big limitation. They can only check the data quality of the profiles over which we have control. For example, if we don’t control the profile we use in the test, we can’t know what data to expect because it can change anytime. This means that dynamic fixture tests can only check the data quality for websites in which we have a profile. To overcome this limitation, we have dynamic coverage tests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic coverage tests&lt;/strong&gt;&lt;br&gt;
Dynamic coverage tests also belong to the dynamic data category, but they differ from dynamic fixture tests in terms of what they check. While dynamic fixture tests check the data quality of profiles we control (which is limited, since we don’t have a profile on every target), dynamic coverage tests can check data quality without needing to control the profile. This is possible because dynamic coverage tests don’t check exact values; instead, they validate the data against a set of rules we have defined. This is where Great Expectations comes in.&lt;/p&gt;

&lt;p&gt;Dynamic coverage tests are the most complex tests in our data quality assurance process. Dagster orchestrates them, just as it does dynamic fixture tests. However, here we use Great Expectations instead of a simple Python script to execute the test.&lt;/p&gt;

&lt;p&gt;First, we need to select the profiles we want to test. Usually, we select profiles from our database that have high field coverage. We do this because we want to ensure the test covers as many fields as possible. Then, we use Great Expectations to generate the rules using the selected profiles. These rules are basically the constraints that we want to check against the data. Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All profiles must have a name.&lt;/li&gt;
&lt;li&gt;At least 50% of the profiles must have a last name.&lt;/li&gt;
&lt;li&gt;The education count value cannot be lower than 0.&lt;/li&gt;
&lt;/ul&gt;
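
&lt;p&gt;Expressed as Great Expectations expectations, those three rules could look roughly like this (the column names are illustrative, and &lt;em&gt;validator&lt;/em&gt; stands for whatever validator or dataset object your Great Expectations setup provides):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;validator.expect_column_values_to_not_be_null("name")
validator.expect_column_values_to_not_be_null("last_name", mostly=0.5)  # at least 50% populated
validator.expect_column_values_to_be_between("education_count", min_value=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;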

&lt;p&gt;After we have generated the rules, called expectations in Great Expectations, we can run the test pipeline, which consists of the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Seed the queue with the URLs we want to scrape&lt;/li&gt;
&lt;li&gt;Scrape&lt;/li&gt;
&lt;li&gt;Parse&lt;/li&gt;
&lt;li&gt;Validate parsed documents using Great Expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, we can check the data quality of profiles over which we have no control. Dynamic coverage tests are the most important tests in our data quality assurance process because they check the whole pipeline from scraping to parsing and validate the data quality of profiles over which we have no control. This is why we run dynamic coverage tests in a scheduled manner for every target we have.&lt;/p&gt;

&lt;p&gt;However, implementing dynamic coverage tests from scratch can be challenging because it requires some knowledge about Great Expectations and Dagster. This is why we have prepared a demo project showing how to use Great Expectations and Dagster to implement automated data quality checks.&lt;/p&gt;
&lt;h2&gt;
  
  
  Automated data quality validation demo project
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://gitlab.com/coresignal/automated-data-quality-checks" rel="noopener noreferrer"&gt;In this Gitlab repository&lt;/a&gt;, you can find a demo of how to use Dagster and Great Expectations to test data quality. The dynamic coverage test graph has more steps, such as &lt;em&gt;seed_urls, scrape, parse,&lt;/em&gt; and so on, but for the sake of simplicity, in this demo, some operations are omitted. However, it contains the most important part of the dynamic coverage test — data quality validation. The demo graph consists of the following operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;load_items&lt;/em&gt; — loads the data from the file and parses it into JSON objects&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;load_structure&lt;/em&gt; — loads the data structure from the file&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;get_flat_items&lt;/em&gt; — flattens the data&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;load_dfs&lt;/em&gt; — loads the data as Spark DataFrames by using the structure from the load_structure operation&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;ge_validation&lt;/em&gt; — executes the Great Expectations validation for every DataFrame&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;post_ge_validation&lt;/em&gt; — checks if the Great Expectations validation passed or failed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaldrttycdaqrg05g975.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqaldrttycdaqrg05g975.png" alt="Quality test demo graph" width="512" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While some of the operations are self-explanatory, let’s examine some that might require further detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating a structure&lt;/strong&gt;&lt;br&gt;
The &lt;em&gt;load_structure&lt;/em&gt; operation itself is not complicated. However, what is important is the type of structure. It’s represented as a Spark schema because we will use it to load the data as Spark DataFrames, which is what Great Expectations works with. Every nested object in the Pydantic model will be represented as an individual Spark schema because Great Expectations doesn’t work well with nested data.&lt;/p&gt;

&lt;p&gt;For example, a Pydantic model like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; python
class CompanyHeadquarters(BaseModel):
    city: str
    country: str

class Company(BaseModel):
    name: str
    headquarters: CompanyHeadquarters
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would be represented as two Spark schemas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   json
{
    "company": {
        "fields": [
            {
                "metadata": {},
                "name": "name",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    },
    "company_headquarters": {
        "fields": [
            {
                "metadata": {},
                "name": "city",
                "nullable": false,
                "type": "string"
            },
            {
                "metadata": {},
                "name": "country",
                "nullable": false,
                "type": "string"
            }
        ],
        "type": "struct"
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo already contains data, structure, and expectations for Owler company data. However, if you want to generate a structure for your own data (and your own structure), you can do that by following the steps below. Run the following command to generate an example of the Spark structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it - rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx structure"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command generates the Spark structure for the Pydantic model and saves it as &lt;em&gt;example_spark_structure.json&lt;/em&gt; in the &lt;em&gt;gx_demo/data&lt;/em&gt; directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preparing and validating data&lt;/strong&gt;&lt;br&gt;
After we have the structure loaded, we need to prepare the data for validation. That leads us to the &lt;em&gt;get_flat_items&lt;/em&gt; operation, which is responsible for flattening the data. We need to flatten the data because each nested object will be represented as a row in a separate Spark DataFrame. So, if we have a list of companies that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   json
[
    {
        "name": "Company 1",
        "headquarters": {
            "city": "City 1",
            "country": "Country 1"
        }
    },
    {
        "name": "Company 2",
        "headquarters": {
            "city": "City 2",
            "country": "Country 2"
        }
    }
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After flattening, the data will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   json
{
    "company": [
        {
            "name": "Company 1"
        },
        {
            "name": "Company 2"
        }
    ],
    "company_headquarters": [
        {
            "city": "City 1",
            "country": "Country 1"
        },
        {
            "city": "City 2",
            "country": "Country 2"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in the &lt;em&gt;load_dfs&lt;/em&gt; operation, the flattened data from the &lt;em&gt;get_flat_items&lt;/em&gt; operation is loaded into separate Spark DataFrames based on the structure that we loaded in the &lt;em&gt;load_structure&lt;/em&gt; operation.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;load_dfs&lt;/em&gt; operation uses &lt;a href="https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs" rel="noopener noreferrer"&gt;DynamicOut&lt;/a&gt;, which allows us to create a dynamic graph based on the structure that we loaded in the &lt;em&gt;load_structure&lt;/em&gt; operation.&lt;/p&gt;

&lt;p&gt;Basically, we will create a separate Spark DataFrame for every nested object in the structure. Dagster will create a separate &lt;em&gt;ge_validation&lt;/em&gt; operation for every DataFrame, parallelizing the Great Expectations validation. Parallelization is useful not only because it speeds up the process but also because the resulting graph can support any kind of data structure.&lt;/p&gt;

&lt;p&gt;So, if we scrape a new target, we can easily add a new structure, and the graph will be able to handle it.&lt;/p&gt;
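
&lt;p&gt;A stripped-down sketch of this fan-out pattern, using simplified stand-ins for the demo’s operations rather than the actual demo code, looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dagster import DynamicOut, DynamicOutput, job, op


@op
def get_flat_items():
    # Stand-in for the demo's load_items / load_structure / get_flat_items steps
    return {
        "company": [{"name": "Company 1"}, {"name": "Company 2"}],
        "company_headquarters": [{"city": "City 1", "country": "Country 1"}],
    }


@op(out=DynamicOut())
def load_dfs(flat_items):
    # Emit one dynamic output per nested object so validation can fan out
    for name, rows in flat_items.items():
        yield DynamicOutput(rows, mapping_key=name)


@op
def ge_validation(context, rows):
    # In the demo, this is where the Great Expectations suite runs for each DataFrame
    context.log.info(f"Validating {len(rows)} rows")
    return True


@op
def post_ge_validation(results):
    # Fan-in step: fail the run if any per-object validation failed
    assert all(results)


@job
def demo_coverage():
    post_ge_validation(load_dfs(get_flat_items()).map(ge_validation).collect())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;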

&lt;p&gt;&lt;strong&gt;Generate expectations&lt;/strong&gt;&lt;br&gt;
Like the structure, expectations are already generated in the demo. However, this section will show you how to generate the structure and expectations for your own data.&lt;/p&gt;

&lt;p&gt;Make sure to delete previously generated expectations if you’re generating new ones with the same name. To generate expectations for the &lt;em&gt;gx_demo/data/owler_company.json&lt;/em&gt; data, run the following command using &lt;em&gt;gx_demo&lt;/em&gt; Docker image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run -it - rm -v $(pwd)/gx_demo:/gx_demo gx_demo /bin/bash -c "gx expectations /gx_demo/data/owler_company_spark_structure.json /gx_demo/data/owler_company.json owler company"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above generates expectations for the data (&lt;em&gt;gx_demo/data/owler_company.json&lt;/em&gt;) based on the flattened data structure (&lt;em&gt;gx_demo/data/owler_company_spark_structure.json&lt;/em&gt;). In this case, we have 1,000 records of Owler company data. It’s structured as a list of objects, where each object represents a company.&lt;/p&gt;

&lt;p&gt;After running the above command, the expectation suites will be generated in the &lt;em&gt;gx_demo/great_expectations/expectations/owler&lt;/em&gt; directory. There will be as many expectation suites as the number of nested objects in the data, in this case, 13.&lt;/p&gt;

&lt;p&gt;Each suite will contain expectations for the data in the corresponding nested object. The expectations are generated based on the structure of the data and the data itself. Keep in mind that after Great Expectations generates the expectation suite, which contains expectations for the data, some manual work might be needed to tweak or improve some of the expectations.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the 6 generated expectations for the followers field in the company suite:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;expect_column_min_to_be_between&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;expect_column_max_to_be_between&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;expect_column_mean_to_be_between&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;expect_column_median_to_be_between&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;expect_column_values_to_not_be_null&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;expect_column_values_to_be_in_type_list&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We know that the &lt;em&gt;followers&lt;/em&gt; field represents the number of followers of the company. Knowing that, we can say that this field can change over time, so we can’t expect the maximum value, mean, or median to be the same.&lt;/p&gt;

&lt;p&gt;However, we can expect the minimum value to be greater than 0 and the values to be integers. We can also expect that the values are not null because if there are no followers, the value should be 0. So, we need to get rid of the expectations that are not suitable for this field: &lt;em&gt;expect_column_max_to_be_between, expect_column_mean_to_be_between,&lt;/em&gt; and &lt;em&gt;expect_column_median_to_be_between.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;However, every field is different, and the expectations might need to be adjusted accordingly. For example, the field &lt;em&gt;completeness_score&lt;/em&gt; represents the company’s completeness score. For this field, it makes sense to expect the values to be between 0 and 100, so we can keep not only &lt;em&gt;expect_column_min_to_be_between&lt;/em&gt; but also &lt;em&gt;expect_column_max_to_be_between.&lt;/em&gt;&lt;/p&gt;
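
&lt;p&gt;In the expectation suite, keeping those bounds for &lt;em&gt;completeness_score&lt;/em&gt; would look roughly like this (a sketch; the exact entries Great Expectations stores will include extra metadata):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# The column minimum and maximum must both fall within the 0-100 range
validator.expect_column_min_to_be_between("completeness_score", min_value=0, max_value=100)
validator.expect_column_max_to_be_between("completeness_score", min_value=0, max_value=100)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;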

&lt;p&gt;Take a look at the &lt;a href="https://greatexpectations.io/expectations" rel="noopener noreferrer"&gt;Gallery of Expectations&lt;/a&gt; to see what kind of expectations you can use for your data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the demo
&lt;/h2&gt;

&lt;p&gt;To see everything in action, go to the root of the project and run the following commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build Docker image:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t gx_demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Run Docker container:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker composer up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the above commands, the Dagit (Dagster UI) will be available at &lt;em&gt;localhost:3000.&lt;/em&gt; Run the &lt;em&gt;demo_coverage&lt;/em&gt; job with the default configuration from the launchpad. After the job execution, you should see dynamically generated &lt;em&gt;ge_validation&lt;/em&gt; operations for every nested object.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pzqda1g2olqnvvvmxuf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2pzqda1g2olqnvvvmxuf.png" alt="Dagit UI project preview" width="720" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the data passed all the checks, and everything is beautiful and green. If data validation for any nested object fails, the corresponding &lt;em&gt;post_ge_validation&lt;/em&gt; operation would be marked as failed (and obviously, it would be red instead of green). Let’s say the &lt;em&gt;company_ceo&lt;/em&gt; validation failed. The &lt;em&gt;post_ge_validation[company_ceo]&lt;/em&gt; operation would be marked as failed. To see exactly which expectations failed, click on the &lt;em&gt;ge_validation[company_ceo]&lt;/em&gt; operation and open “Expectation Results” by clicking on the “[Show Markdown]” link. It will open the validation results overview modal with all the data about the &lt;em&gt;company_ceo&lt;/em&gt; dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9utg57stily73wloenc6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9utg57stily73wloenc6.png" alt="Validation results overview" width="720" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Depending on the stage of the data pipeline, there are many ways to test data quality. However, it is essential to have a well-oiled automated data quality check mechanism to ensure the accuracy and reliability of the data. Tools like Great Expectations and Dagster aren’t strictly necessary (static fixture tests don’t use any of those), but they can greatly help with a more robust data quality assurance process. Whether you’re looking to enhance your existing data quality processes or build a new system from scratch, we hope this guide has provided valuable insights.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>automation</category>
      <category>opensource</category>
      <category>data</category>
    </item>
    <item>
      <title>Building a Robust Data Observability Framework to Ensure Data Quality and Integrity</title>
      <dc:creator>Jurgita Motus</dc:creator>
      <pubDate>Tue, 30 Sep 2025 07:31:55 +0000</pubDate>
      <link>https://forem.com/jurgitamotus/building-a-robust-data-observability-framework-to-ensure-data-quality-and-integrity-12hj</link>
      <guid>https://forem.com/jurgitamotus/building-a-robust-data-observability-framework-to-ensure-data-quality-and-integrity-12hj</guid>
      <description>&lt;p&gt;&lt;strong&gt;Traditional monitoring no longer meets the needs of complex data organizations. Instead of relying on reactive systems to identify known issues, data engineers must create interactive observability frameworks that help them quickly find any type of anomaly.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While observability can encompass many different practices, in this article, I’ll share a high-level overview and practical tips from our experience building an observability framework in our organization using open-source tools.&lt;/p&gt;

&lt;p&gt;So, how do you build infrastructure that offers good visibility into data health and ensures data quality?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is data observability?
&lt;/h2&gt;

&lt;p&gt;Overall, observability defines how much you can tell about an internal system from its external outputs. The term was first defined in 1960 by Hungarian-American engineer &lt;a href="https://learning.oreilly.com/library/view/observability-engineering/9781492076438/ch01.html#the_mathematical_definition_of_observab" rel="noopener noreferrer"&gt;Rudolf E. Kálmán&lt;/a&gt;, who discussed observability in mathematical control systems.&lt;/p&gt;

&lt;p&gt;Over the years, the concept has been adapted to various fields, including data engineering. Here, it addresses the issue of data quality and being able to track where the data was gathered and how it was transformed.&lt;/p&gt;

&lt;p&gt;Data observability means ensuring that the data in all pipelines and systems maintains its integrity and quality. This is done by monitoring and managing data in real time to troubleshoot quality concerns. Observability provides clarity, which allows you to act before a problem spreads.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data observability framework?
&lt;/h2&gt;

&lt;p&gt;A data observability framework is a process for monitoring and validating data integrity and quality within an organization. It helps to proactively ensure data quality and integrity.&lt;/p&gt;

&lt;p&gt;The framework must be based on five mandatory aspects, as defined by &lt;a href="https://www.ibm.com/topics/data-observability#2" rel="noopener noreferrer"&gt;IBM&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Freshness.&lt;/strong&gt; Outdated data, if any, must be found and removed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distribution.&lt;/strong&gt; Expected data values must be recorded to help identify outliers and unreliable data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume.&lt;/strong&gt; The number of expected values must be tracked to ensure data is complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema.&lt;/strong&gt; Changes to data tables and organization must be monitored to help find broken data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage.&lt;/strong&gt; Collecting metadata and mapping the sources is a must to aid troubleshooting.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These five principles ensure that data observability frameworks help maintain and increase data quality. You can achieve these by implementing the following data observability methods.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to add observability practices into the data pipeline
&lt;/h2&gt;

&lt;p&gt;Only high-quality data collected from reputable sources will provide precise insights. As the saying goes: garbage in, garbage out. You cannot expect to extract any actual knowledge from poorly organized datasets.&lt;/p&gt;

&lt;p&gt;As a senior data analyst at public data provider Coresignal, I constantly seek to find new ways to improve data quality. While it’s quite a complex goal to achieve in the dynamic tech landscape, many paths lead to it. Good data observability plays an important role here.&lt;/p&gt;

&lt;p&gt;So, how do we ensure the quality of data? It all comes down to adding better observability methods into each data pipeline stage — from ingestion and transformation to storage and analysis. Some of these methods will work across the entire pipeline while others will be relevant in only one stage of it. Let’s take a look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdya0ldrcp8cuo6nm65fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdya0ldrcp8cuo6nm65fe.png" alt="Data observability across different stages of the data pipeline." width="800" height="802"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First off, we have to consider five items that cover the entire pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end data lineage.&lt;/strong&gt; Tracking lineage lets you quickly access database history and follow your data from the original source to the final output. By understanding the structure and its relationships, you will have less trouble finding inconsistencies before they become problems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;End-to-end testing.&lt;/strong&gt; A validation process that checks data integrity and quality at each data pipeline stage helps engineers determine if the pipeline functions correctly and spot any untypical behaviors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Root cause analysis.&lt;/strong&gt; If issues emerge at any stage of the pipeline, engineers must be able to pinpoint the source precisely and find a quick solution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time alerts.&lt;/strong&gt; One of the most important observability goals is to quickly spot emerging issues. Time is of the essence when flagging abnormal behaviors, so any data observability framework has to be able to send alerts in real time. This is especially important for the data ingestion as well as storage and analysis phases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anomaly detection.&lt;/strong&gt; Issues such as missing data or low performance can happen anywhere across the data pipeline. Anomaly detection is an advanced observability method that is likely to be implemented later in the process. In most cases, machine learning algorithms will be required to detect unusual patterns in your data and logs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Then, we have five other items that will be more relevant in some data pipeline stages than in others:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Service level agreements (SLAs).&lt;/strong&gt; SLAs help set standards for the client and the supplier and define the data quality, integrity, and general responsibilities. SLA thresholds can also help when setting up an alert system, and typically, they will be signed before or during the ingestion phase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data contracts.&lt;/strong&gt; These agreements define how data is structured before it enters other systems. They act as a set of rules that clarify what level of freshness and quality you can expect and will usually be negotiated before the ingestion phase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema validation.&lt;/strong&gt; It guarantees consistent data structures and ensures compatibility with downstream systems. Engineers usually validate the schema during the ingestion or processing stages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs, metrics, and traces.&lt;/strong&gt; While essential for monitoring performance, collecting and easily accessing this crucial information will become a helpful tool in a crisis — it allows one to find the root cause of a problem faster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data quality dashboards.&lt;/strong&gt; Dashboards help monitor the overall health of the data pipeline and have a high-level view of possible problems. They ensure that the data gathered using other observability methods is presented clearly and in real time.&lt;/li&gt;
&lt;/ol&gt;
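
&lt;p&gt;To make the schema validation item above more concrete, here is a minimal sketch using Pydantic; the model and field names are hypothetical, and the same idea applies with whichever validation library you prefer:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pydantic import BaseModel, ValidationError


class CompanyRecord(BaseModel):
    # Hypothetical contract for a single ingested record
    name: str
    followers: int


def is_valid(raw):
    # Reject records that break the agreed schema before they reach downstream systems
    try:
        CompanyRecord(**raw)
        return True
    except ValidationError:
        return False


print(is_valid({"name": "Company 1", "followers": 1200}))  # True
print(is_valid({"name": "Company 2", "followers": "n/a"}))  # False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;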

&lt;p&gt;Finally, data observability cannot be implemented without adding self-evaluation to the framework, so constant auditing and reviewing of the system is a must for any organization.&lt;/p&gt;

&lt;p&gt;Next, let’s discuss the tools you might want to try to make your work easier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data observability platforms and what you can do with them
&lt;/h2&gt;

&lt;p&gt;So, which tools should you consider if you are beginning to build a data observability framework in your organization? While there are many options out there, in my experience, your best bet would be to start out with the following tools.&lt;/p&gt;

&lt;p&gt;As we were building our data infrastructure, we focused on making the most out of open source platforms. The tools listed below ensure transparency and scalability while working with large amounts of data. While most of them have other purposes than data observability, combined, they provide a great way to ensure visibility into the data pipeline.&lt;/p&gt;

&lt;p&gt;Here is a list of five platforms that I would recommend checking out:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus and Grafana&lt;/strong&gt; platforms complement each other and help engineers collect and visualize large amounts of data in real time. Prometheus, an open-source monitoring system, is perfect for data storage and observation, while the observability platform Grafana helps track new trends through an easy-to-navigate visual dashboard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Iceberg&lt;/strong&gt; table format provides an overview of database metadata, including tracking statistics about table columns. Tracking metadata helps to better understand the entire database without unnecessarily processing it. It’s not exactly an observability platform, but its functionalities allow engineers to get better visibility into their data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apache Superset&lt;/strong&gt; is another open-source data exploration and visualization tool that can help to present huge amounts of data, build dashboards, and generate alerts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Great Expectations&lt;/strong&gt; is a Python package that helps test and validate data. For instance, it can scan a sample dataset using predefined rules and create data quality conditions that are later used for the entire dataset. Our teams use Great Expectations to run quality tests on new datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dagster&lt;/strong&gt;, a data pipeline orchestration tool, can help ensure data lineage and run asset checks. While it was not created as a data observability platform, it provides visibility using your existing data engineering tools and table formats. The tool aids in figuring out the root causes of data anomalies. The paid version of the platform also contains AI-generated insights. This application provides self-service observability and comes with a built-in asset catalog for tracking data assets.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Keep in mind that these are just some of the many options available. Make sure to do your research and find the tools that make sense for your organization.&lt;/p&gt;
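
&lt;p&gt;As a small illustration of how Prometheus and Grafana fit into a pipeline, here is a sketch of exposing a custom data freshness metric with the prometheus_client Python package; the metric name and values are made up, and in a real pipeline the gauge would be computed from the data itself:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical metric: minutes since the newest record in a dataset was updated
data_freshness_minutes = Gauge(
    "data_freshness_minutes",
    "Minutes since the most recent record was updated",
    ["dataset"],
)

if __name__ == "__main__":
    start_http_server(8000)  # metrics exposed at :8000/metrics for Prometheus to scrape
    while True:
        data_freshness_minutes.labels(dataset="companies").set(random.uniform(0, 60))
        time.sleep(30)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Grafana would then chart this gauge and alert when freshness drifts past an agreed threshold.&lt;/p&gt;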

&lt;h2&gt;
  
  
  What happens if you ignore the data observability principles
&lt;/h2&gt;

&lt;p&gt;Once a problem arises, organizations usually rely on an engineer’s intuition to find the root cause of the problem. As software engineer Charity Majors vividly explains in &lt;a href="https://learning.oreilly.com/library/view/observability-engineering/9781492076438/" rel="noopener noreferrer"&gt;her recollection&lt;/a&gt; of her time at MBaaS platform Parse, most traditional monitoring is powered by engineers who have been at the company the longest and can quickly guess their system’s issues. This makes senior engineers irreplaceable and creates additional issues, such as high rates of burnout.&lt;/p&gt;

&lt;p&gt;Using data observability tools eliminates guesswork from troubleshooting, minimizes downtime, and enhances trust. Without them, you can expect high downtime, data quality issues, and slow reaction times to emerging issues. As a result, these problems might quickly lead to loss of revenue or customers, or even damage to brand reputation.&lt;/p&gt;

&lt;p&gt;Data observability is vital for enterprise-level companies that handle gargantuan amounts of information and must guarantee its quality and integrity without interruptions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s next for data observability?
&lt;/h2&gt;

&lt;p&gt;Data observability is a must for every organization, especially companies that work with data collection and storage. Once all the tools are set in place, it’s possible to start using advanced methods to optimize the process.&lt;/p&gt;

&lt;p&gt;Machine learning, especially large language models (LLMs), is the obvious solution here. These models can quickly scan the database, flag anomalies, and improve overall data quality by spotting duplicates or adding new enriched fields. At the same time, they can help keep track of changes in the schema and logs, improving data consistency and data lineage.&lt;/p&gt;

&lt;p&gt;However, it is crucial to pick the right time to implement your AI initiatives. Enhancing your observability capabilities requires resources, time, and investment. Before starting to use custom LLMs, you should carefully consider whether this would truly benefit your organization. Sometimes, it might be more efficient to stick to the standard open-source data observability tools listed above, which are already effective in getting the job done.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>opensource</category>
      <category>datascience</category>
      <category>data</category>
    </item>
  </channel>
</rss>
