<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Prudence Waithira</title>
    <description>The latest articles on Forem by Prudence Waithira (@prudiec).</description>
    <link>https://forem.com/prudiec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3397210%2Ff4c647d5-8776-4821-93fe-c9c6204abf51.png</url>
      <title>Forem: Prudence Waithira</title>
      <link>https://forem.com/prudiec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/prudiec"/>
    <language>en</language>
    <item>
      <title>All About Change Data Capture (CDC)</title>
      <dc:creator>Prudence Waithira</dc:creator>
      <pubDate>Tue, 16 Sep 2025 08:01:06 +0000</pubDate>
      <link>https://forem.com/prudiec/all-about-change-data-capture-cdc-3apn</link>
      <guid>https://forem.com/prudiec/all-about-change-data-capture-cdc-3apn</guid>
      <description>&lt;h2&gt;
  
  
  What is Change Data Capture?
&lt;/h2&gt;

&lt;p&gt;Change Data Capture is an approach that detects, captures, and forwards only the modified data from a source system into downstream systems such as data warehouses, dashboards, or streaming applications.&lt;/p&gt;

&lt;p&gt;Core principles of CDC include:&lt;br&gt;
&lt;strong&gt;- Capture&lt;/strong&gt;: Detect changes in source data while minimally impacting source system performance.&lt;br&gt;
&lt;strong&gt;- Incremental Updates&lt;/strong&gt;: Transmit only changed data to reduce overhead.&lt;br&gt;
&lt;strong&gt;- Real-time or Near Real-time Processing&lt;/strong&gt;: Maintain fresh data in targets.&lt;br&gt;
&lt;strong&gt;- Idempotency&lt;/strong&gt;: Ensure changes applied multiple times do not corrupt data.&lt;br&gt;
&lt;strong&gt;- Log-based Tracking&lt;/strong&gt;: Leverage database transaction logs for accurate and scalable data capture.&lt;/p&gt;
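The idempotency principle above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical event shape (the "op"/"key"/"payload" fields are invented for the example), not a production consumer:

```python
# Sketch: applying a keyed CDC event is idempotent because the final state
# is determined by the event's key and payload, not by how many times the
# event is applied. The event shape here is hypothetical.

def apply_event(state, event):
    """Apply a single change event to an in-memory key/value target."""
    key = event["key"]
    if event["op"] == "delete":
        state.pop(key, None)
    else:  # insert and update both become an upsert
        state[key] = event["payload"]
    return state

state = {}
event = {"op": "update", "key": 42, "payload": {"status": "shipped"}}

apply_event(state, event)
once = dict(state)
apply_event(state, event)  # replaying the same event changes nothing

assert once == state  # one application or many: same result
```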
&lt;h2&gt;
  
  
  CDC Implementation Methods
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A) Log-based CDC&lt;/strong&gt;&lt;br&gt;
-- The most robust approach: it reads database transaction logs (e.g., PostgreSQL WAL, MySQL binlogs) directly to stream change events with minimal latency and high scalability.&lt;br&gt;
&lt;strong&gt;Advantages&lt;/strong&gt;: Low system overhead and near‑real‑time performance make it ideal for high-volume environments.&lt;br&gt;
&lt;strong&gt;Disadvantages&lt;/strong&gt;: It requires privileged access to transaction logs and depends on proper log retention settings.&lt;/p&gt;

&lt;p&gt;E.g., logical replication in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;

-- Create a logical replication slot to capture changes
SELECT pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- Fetch recent changes from the WAL
SELECT * FROM pg_logical_slot_changes('cdc_slot', NULL, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;B) Trigger-based CDC&lt;/strong&gt;&lt;br&gt;
Uses database triggers to capture changes as they happen. Changes are recorded within the same transaction, so capture is immediate, but the extra writes add overhead to the source database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;: Straightforward to implement on databases that support triggers and ensures immediate change capture.&lt;br&gt;
&lt;strong&gt;Disadvantages&lt;/strong&gt;: It can add extra load to the database and may complicate schema changes if not managed carefully.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Create an audit table to store changes
CREATE TABLE customers_audit (
    audit_id SERIAL PRIMARY KEY,
    operation_type TEXT,
    customer_id INT,
    customer_name TEXT,
    modified_at TIMESTAMP DEFAULT now()
);

-- Create a function to insert change records
CREATE OR REPLACE FUNCTION capture_customer_changes()
RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO customers_audit (operation_type, customer_id, customer_name)
        VALUES ('INSERT', NEW.id, NEW.name);
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customers_audit (operation_type, customer_id, customer_name)
        VALUES ('UPDATE', NEW.id, NEW.name);
    ELSIF TG_OP = 'DELETE' THEN
        INSERT INTO customers_audit (operation_type, customer_id, customer_name)
        VALUES ('DELETE', OLD.id, OLD.name);
    END IF;
    RETURN NULL; -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

-- Attach the trigger to the customers table
CREATE TRIGGER customer_changes_trigger
AFTER INSERT OR UPDATE OR DELETE ON customers
FOR EACH ROW EXECUTE FUNCTION capture_customer_changes();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;C) Polling-based/Query-based CDC&lt;/strong&gt;&lt;br&gt;
Periodically queries the source database to check for changes based on a timestamp or version column.&lt;/p&gt;

&lt;p&gt;E.g., a products table with a version_number column that&lt;br&gt;
increments on each update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advantages&lt;/strong&gt;: Simple to implement when log access or triggers are unavailable.&lt;br&gt;
&lt;strong&gt;Disadvantages&lt;/strong&gt;: It can delay the capture of changes and increase load if polling is too frequent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;D) Timestamp-based CDC&lt;/strong&gt;&lt;br&gt;
Relies on a dedicated column that records the last modified time for each record. By comparing these timestamps, the system identifies records that have changed since the previous check.&lt;/p&gt;
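The polling and timestamp approaches above can be sketched with an in-memory SQLite database (the products table and modified_at column are illustrative, not from a real system). Each poll reads only rows newer than a watermark, then advances the watermark:

```python
import sqlite3

# Sketch of timestamp-based polling CDC. The schema is hypothetical: a
# modified_at value is stored per row, and each poll fetches only rows
# changed since the last watermark.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, modified_at INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "widget", 100), (2, "gadget", 150)],
)

def poll_changes(conn, last_seen):
    """Return rows modified after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, modified_at FROM products "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_seen,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark

changes, watermark = poll_changes(conn, 0)          # first poll sees both rows
changes, watermark = poll_changes(conn, watermark)  # nothing new yet
conn.execute("UPDATE products SET name = 'widget v2', modified_at = 200 WHERE id = 1")
changes, watermark = poll_changes(conn, watermark)  # only the updated row
```

Note the trade-off the text describes: changes are only visible at the next poll, and deletes are invisible unless rows are soft-deleted.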
&lt;h2&gt;
  
  
  &lt;u&gt;Key CDC Tools and Technologies&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Debezium&lt;/strong&gt;&lt;br&gt;
An open-source, log-based CDC platform that captures row-level changes from various databases, including PostgreSQL, MySQL, SQL Server, and MongoDB, and publishes them as change event streams, typically into Apache Kafka.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "inventory-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "localhost",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.dbname": "inventory",
    "plugin.name": "pgoutput",
    "slot.name": "debezium_slot",
    "publication.name": "dbz_publication",
    "database.server.name": "dbserver1",
    "include.schema.changes": "true"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Debezium supports incremental and blocking snapshots, accommodates schema changes, provides fault tolerance with offset tracking, and supports signal tables for ad hoc snapshots.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkv6qb0vo57tesz0gf1rm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkv6qb0vo57tesz0gf1rm.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka &amp;amp; Kafka Connect&lt;/strong&gt;&lt;br&gt;
Kafka serves as a durable, scalable event streaming platform ideal for transporting CDC events. Kafka Connect offers extensible connectors to ingest CDC events from sources (like Debezium connectors) and deliver them downstream.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaProducer
import json

# Initialize the Kafka producer with bootstrap servers and a JSON serializer for values.
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Define a CDC event that includes details of the operation.
cdc_event = {
    "table": "orders",
    "operation": "update",
    "data": {"order_id": 123, "status": "shipped"}
}

# Send the CDC event to the 'cdc-topic' and flush to ensure transmission.
producer.send('cdc-topic', cdc_event)
producer.flush()
print("CDC event sent successfully!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;CDC events are published as Kafka topics, enabling downstream consumers to perform real-time analytics, caching, and replication tasks [Confluent CDC Blog].&lt;/p&gt;

&lt;p&gt;Two main CDC connector types in Kafka Connect:&lt;br&gt;
&lt;strong&gt;Source connectors:&lt;/strong&gt; Capture and stream change events into Kafka.&lt;br&gt;
&lt;strong&gt;Sink connectors:&lt;/strong&gt; Consume CDC events from Kafka and write them to other data stores.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confluent Cloud CDC Connectors&lt;/strong&gt;&lt;br&gt;
Confluent Cloud provides managed CDC connectors, including Oracle CDC Source Connector, enabling easy capture from Oracle redo logs and publishing to Kafka topics with built-in fault tolerance and support for security ACLs, offset management, and topic partitioning [Confluent Docs: Oracle CDC].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Database Migration Service (DMS)&lt;/strong&gt;&lt;br&gt;
Uses log-based CDC to continuously replicate data from on-premises systems to the AWS cloud with minimal downtime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeuoctoglcvsfoanuoy3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqeuoctoglcvsfoanuoy3.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Talend and Informatica&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Talend&lt;/em&gt; and &lt;em&gt;Informatica&lt;/em&gt; are comprehensive ETL platforms offering built‑in CDC functionality to capture and process data changes, reducing manual configuration. They are especially advantageous in complex data transformation scenarios, where integrated solutions can simplify operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database-native CDC solutions&lt;/strong&gt;&lt;br&gt;
Several relational databases offer native CDC features, reducing the need for external tools:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; - PostgreSQL logical replication: Captures changes in  WAL and streams them to subscribers.
 - SQL Server change data capture (CDC): Uses transaction logs to track changes automatically.
 - MySQL binary log (binlog) replication: Logs changes for replication purposes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;Real-World CDC Implementation Strategies&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Initial Snapshot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Any CDC pipeline begins with capturing a consistent snapshot of the source database so downstream systems start with an accurate baseline.&lt;br&gt;
Debezium takes snapshots using SQL queries within optimized transaction isolation levels.&lt;br&gt;
The snapshot runs once at startup or ad hoc, capturing the existing state before switching to streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Streaming Changes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Subsequent changes (INSERT, UPDATE, DELETE) are streamed as events sourced directly from database logs.&lt;br&gt;
Kafka provides durable messaging and ordering guarantees.&lt;br&gt;
Event consumers rebuild or maintain up-to-date representations of source data efficiently.&lt;/p&gt;
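The "rebuild or maintain up-to-date representations" step can be sketched as folding an ordered change stream into a keyed state (the event shape below is hypothetical):

```python
# Sketch: a downstream consumer rebuilds a table's current state by folding
# ordered change events into a dictionary keyed by primary key.

events = [
    {"op": "insert", "id": 1, "row": {"name": "alice"}},
    {"op": "insert", "id": 2, "row": {"name": "bob"}},
    {"op": "update", "id": 1, "row": {"name": "alice b."}},
    {"op": "delete", "id": 2, "row": None},
]

def rebuild(events):
    """Fold a change stream into the latest state per primary key."""
    state = {}
    for e in events:
        if e["op"] == "delete":
            state.pop(e["id"], None)
        else:
            state[e["id"]] = e["row"]
    return state

current = rebuild(events)
# current == {1: {"name": "alice b."}}
```

This is why ordering matters: replaying the same events in a different order would generally produce a different final state.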

&lt;p&gt;&lt;strong&gt;3. Denormalization Patterns&lt;/strong&gt;&lt;br&gt;
CDC typically mirrors highly normalized source schemas, which can be hard to consume for analytics. Denormalization approaches include:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; - No denormalization: Simple replication with downstream joins.
 - Materialized Views: Create database views that join/enrich data before CDC capture.
 - Outbox Pattern: Application writes change events to an immutable outbox table from which CDC is performed.
 - Stream Processing: Use Kafka Streams or ksqlDB to enrich and denormalize events downstream.
 - Denormalization at Destination: Perform transformations in data warehouse or lake.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Choosing where denormalization occurs depends on latency, complexity, and architectural preferences.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;CDC Challenges and Solutions&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- &lt;strong&gt;Schema Evolution&lt;/strong&gt;&lt;br&gt;
Source database schema changes (add/drop columns, data types) can break CDC pipelines.&lt;br&gt;
&lt;em&gt;Solution:&lt;/em&gt;&lt;br&gt;
Use a schema registry and versioning; Debezium supports some schema change handling; perform backward-compatible schema updates; incremental snapshots can handle some changes gracefully.&lt;br&gt;
-- &lt;strong&gt;Event Ordering&lt;/strong&gt;&lt;br&gt;
Changes arriving out of order can cause incorrect data state.&lt;br&gt;
&lt;em&gt;Solution:&lt;/em&gt;&lt;br&gt;
Use Kafka’s partitioning and ordering guarantees; Debezium buffers snapshot and streaming events to resolve collisions; design idempotent consumers.&lt;br&gt;
-- &lt;strong&gt;Late Data&lt;/strong&gt;&lt;br&gt;
Data changes can be delayed by stream interruptions or replication lag.&lt;br&gt;
&lt;em&gt;Solution:&lt;/em&gt;&lt;br&gt;
Employ windowing and watermark strategies in stream processing; support replay through Kafka’s retained log storage and offset management.&lt;br&gt;
-- &lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;br&gt;
Network or system failures can interrupt pipeline operation.&lt;br&gt;
&lt;em&gt;Solution:&lt;/em&gt;&lt;br&gt;
Debezium offset tracking for resume; Kafka’s durability; idempotent writes at the sink; signal tables for controlled snapshot restarts.&lt;/p&gt;
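One common way to implement the "idempotent consumers" mentioned above is to remember which event IDs have already been processed, so a redelivered event (after a retry or restart) is skipped. A minimal sketch, with invented event IDs:

```python
# Sketch of an idempotent consumer: duplicate deliveries are detected by
# event id and ignored, so replaying part of the stream cannot corrupt the
# target. Event ids and payloads are hypothetical.

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()     # ids of events already applied
        self.applied = []     # stand-in for the target system

    def handle(self, event):
        if event["event_id"] in self.seen:
            return False      # duplicate delivery, safely ignored
        self.seen.add(event["event_id"])
        self.applied.append(event["data"])
        return True

consumer = IdempotentConsumer()
consumer.handle({"event_id": "e1", "data": "row-1"})
consumer.handle({"event_id": "e1", "data": "row-1"})  # redelivery: no effect
consumer.handle({"event_id": "e2", "data": "row-2"})
# consumer.applied == ["row-1", "row-2"]
```

In a real pipeline the seen-set would live in durable storage (or be replaced by an upsert keyed on the record's primary key, which achieves the same effect).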

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Sample Kafka Connect JDBC sink connector for writing CDC data to a data warehouse&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "dw-sink-connector",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "dbserver1.inventory.customers",
    "connection.url": "jdbc:postgresql://datawarehouse:5432/dw",
    "connection.user": "dw_user",
    "connection.password": "password",
    "auto.create": "true",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id"
  }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;em&gt;Aside: Kafka source connectors read data from an external system and write it to Kafka topics, while Kafka sink connectors read data from Kafka topics and write it to an external system&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;References:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debezium Documentation, PostgreSQL Connector: &lt;a href="https://debezium.io/documentation/reference/connectors/postgresql.html" rel="noopener noreferrer"&gt;debezium.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Confluent Blog on CDC Patterns and Implementation: &lt;a href="https://www.confluent.io/blog/how-change-data-capture-works-patterns-solutions-implementation/" rel="noopener noreferrer"&gt;How Change Data Capture (CDC) Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Confluent Oracle CDC Source Connector Documentation: &lt;a href="https://docs.confluent.io/index.html" rel="noopener noreferrer"&gt;docs.confluent.io&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Additional CDC Concepts and Tools Overview: &lt;a href="https://www.striim.com/blog/change-data-capture-cdc-what-it-is-and-how-it-works/" rel="noopener noreferrer"&gt;What Is Change Data Capture (CDC)? Methods &amp;amp; Use Cases&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Apache Kafka CDC Implementation Guide: &lt;a href="https://estuary.dev/blog/change-data-capture-kafka/" rel="noopener noreferrer"&gt;How To Implement Change Data Capture With Apache Kafka and Debezium&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>kafka</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data Engineering Core Concepts</title>
      <dc:creator>Prudence Waithira</dc:creator>
      <pubDate>Sun, 14 Sep 2025 10:06:54 +0000</pubDate>
      <link>https://forem.com/prudiec/data-engineering-core-concepts-5c47</link>
      <guid>https://forem.com/prudiec/data-engineering-core-concepts-5c47</guid>
      <description>&lt;h2&gt;
  
  
  &lt;u&gt;A) Batch vs Streaming Ingestion&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;These are different approaches to bringing data into a system.&lt;br&gt;
&lt;strong&gt;Batch&lt;/strong&gt; – data is ingested in batches over a period of time and processed in a single operation.&lt;br&gt;
E.g., processing historical data for trend analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming&lt;/strong&gt; – data is ingested continuously as it arrives and processed as it is generated.&lt;br&gt;
E.g., personalized recommendations, or monitoring sensor data from IoT devices.&lt;br&gt;
The choice between them depends on factors like &lt;em&gt;data volume, latency requirements, and the nature of the data source.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb8688ngbg3w7h8p7phx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgb8688ngbg3w7h8p7phx.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Key Considerations When Choosing:&lt;br&gt;
• &lt;strong&gt;Data Volume:&lt;/strong&gt;&lt;br&gt;
Batch processing is generally preferred for large datasets, while streaming is better for smaller, continuous streams. &lt;br&gt;
• &lt;strong&gt;Latency Requirements:&lt;/strong&gt;&lt;br&gt;
If real-time analysis is crucial, streaming is the way to go. If latency is not a major concern, batch processing may be sufficient. &lt;br&gt;
• &lt;strong&gt;Data Source:&lt;/strong&gt;&lt;br&gt;
Batch ingestion is often used for data sources that generate data in batches, while streaming is better for continuous data sources like sensors or user activity logs. &lt;br&gt;
• &lt;strong&gt;Complexity:&lt;/strong&gt;&lt;br&gt;
Batch processing is generally simpler to implement and manage than streaming, which can introduce complexities when dealing with stateful operations. &lt;br&gt;
• &lt;strong&gt;Cost:&lt;/strong&gt;&lt;br&gt;
Real-time processing may require more powerful hardware and infrastructure. &lt;br&gt;
• &lt;strong&gt;Data Consistency:&lt;/strong&gt;&lt;br&gt;
Streaming systems may need to handle out-of-order or late-arriving data, which can impact data consistency. &lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;B) Change Data Capture CDC&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;A data integration pattern that identifies and tracks changes made to data in a source system and then delivers those changes to a target system.&lt;br&gt;
It focuses on capturing inserts, updates, and deletes, enabling real-time or near real-time data synchronization and minimizing latency compared to traditional batch processing. &lt;br&gt;
E.g., Debezium.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt;&lt;br&gt;
• CDC identifies and captures changes made to data in a database or other data source. &lt;br&gt;
• These changes are typically captured as a stream of events, often referred to as a CDC feed. &lt;br&gt;
• The captured changes are then propagated to one or more target systems. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why CDC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time data integration&lt;/li&gt;
&lt;li&gt;Reduced latency&lt;/li&gt;
&lt;li&gt;Improved data consistency&lt;/li&gt;
&lt;li&gt;Efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reference: &lt;iframe src="https://www.youtube.com/embed/5KN_feUhtTM"&gt;&lt;/iframe&gt;&lt;/p&gt;

&lt;p&gt;Application backend (mutation operations) -&amp;gt; Database -&amp;gt; Kafka message (from db) -&amp;gt; target systems (where the mutation is to occur)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases&lt;/strong&gt;&lt;br&gt;
- Replicating data into other databases&lt;br&gt;
- Stream processing based on data changes, e.g., when customer info changes&lt;/p&gt;

&lt;p&gt;For example, when you change the name on your Google account, the change is captured immediately and propagated to dependent systems in real time through stream processing.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;C) Idempotency&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- Same input = Same output&lt;br&gt;
-- No Side effects.&lt;br&gt;
-- Focus on final state.&lt;br&gt;
i.e., a data operation repeated multiple times with the same input produces the same result every single time.&lt;br&gt;
E.g., upsert operations: INSERT ... ON CONFLICT DO UPDATE with the same input produces the same final state.&lt;br&gt;
-- In modern data architectures, idempotency guarantees that pipeline operations produce identical results whether executed once or multiple times. This property becomes essential when dealing with distributed systems, streaming data, and fault-tolerant architectures where retries are not just possible but necessary for system reliability.&lt;/p&gt;
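The upsert behaviour described above can be sketched with SQLite (the customers table is illustrative): running the same statement twice leaves the row in the same final state.

```python
import sqlite3

# Sketch: an upsert is idempotent because repeating it with the same input
# leaves the row in the same final state. Table and data are illustrative.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

def upsert(conn, cid, name):
    """INSERT the row, or UPDATE it if the primary key already exists."""
    conn.execute(
        "INSERT INTO customers (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        (cid, name),
    )

upsert(conn, 1, "Prudence")
upsert(conn, 1, "Prudence")  # repeated with the same input: same final state

rows = conn.execute("SELECT id, name FROM customers").fetchall()
# rows == [(1, "Prudence")]
```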

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwylhp6fqli8s8tl7fa71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwylhp6fqli8s8tl7fa71.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://www.tiktok.com/@arjay_mccandless/video/7504431721388166446?is_from_webapp=1&amp;amp;amp;sender_device=pc" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;tiktok.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  &lt;u&gt;D) OLTP vs OLAP&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OLAP (Online Analytical Processing)&lt;/strong&gt; – focuses on analysis of historical data&lt;br&gt;
&lt;strong&gt;OLTP (Online Transaction Processing)&lt;/strong&gt; – focuses on transactional data&lt;/p&gt;

&lt;p&gt;Key differences:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Feature&lt;/th&gt;&lt;th&gt;OLTP&lt;/th&gt;&lt;th&gt;OLAP&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Purpose&lt;/td&gt;&lt;td&gt;Transaction processing&lt;/td&gt;&lt;td&gt;Analytical processing&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data Volume&lt;/td&gt;&lt;td&gt;Smaller, frequently changing&lt;/td&gt;&lt;td&gt;Larger, historical data&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Typical Operations&lt;/td&gt;&lt;td&gt;Inserts, updates, deletes&lt;/td&gt;&lt;td&gt;Aggregations, complex queries&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Data Model&lt;/td&gt;&lt;td&gt;Normalized relational&lt;/td&gt;&lt;td&gt;Multidimensional (star, snowflake)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Performance&lt;/td&gt;&lt;td&gt;Low latency, high throughput&lt;/td&gt;&lt;td&gt;Optimized for complex queries&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Examples&lt;/td&gt;&lt;td&gt;Banking, e-commerce&lt;/td&gt;&lt;td&gt;Data warehousing, business intelligence&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;E) Columnar vs Row-based&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Row-based databases&lt;/em&gt; are optimized for transactional processing.&lt;br&gt;
They are commonly used for Online Transactional Processing (OLTP), where a single transaction, such as an insert, delete, or update, can be performed quickly on small amounts of data.&lt;/p&gt;

&lt;p&gt;Common row oriented databases:&lt;br&gt;
• Postgres&lt;br&gt;
• MySQL&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folb5pixjdhin1en5y06f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folb5pixjdhin1en5y06f.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Column-based&lt;/em&gt; databases are optimized for analytical processing. They store column data in blocks, which makes them great for OLAP.&lt;/p&gt;

&lt;p&gt;Common column oriented databases:&lt;br&gt;
• Redshift&lt;br&gt;
• BigQuery&lt;br&gt;
• Snowflake&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hzkhgc411viik3z1j9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hzkhgc411viik3z1j9j.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;F) Partitioning&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Data partitioning is a technique for dividing large datasets into smaller, manageable chunks called partitions. Each partition contains a subset of data and is distributed across multiple nodes or servers. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset. &lt;br&gt;
&lt;strong&gt;TYPES&lt;/strong&gt;&lt;br&gt;
i. &lt;em&gt;&lt;strong&gt;Horizontal Partitioning (Row-based)&lt;/strong&gt;&lt;/em&gt; – also called sharding in distributed systems, this splits tables by rows, so every partition has the same columns but different records, and each partition contains a subset of the entire data set. For example, a sales table could be partitioned by year, with each partition containing sales data for a specific year. Common schemes:&lt;br&gt;
 Range Partitioning&lt;br&gt;
 Hash Partitioning&lt;br&gt;
 List Partitioning&lt;/p&gt;

&lt;p&gt;ii. &lt;strong&gt;&lt;em&gt;Vertical Partitioning (Column-Based)&lt;/em&gt;&lt;/strong&gt; – dividing a table’s columns into separate partitions so queries read only the data they need. For example, a user table might be split into one partition with user information and another with billing information.&lt;/p&gt;

&lt;p&gt;iii.    &lt;strong&gt;&lt;em&gt;Functional Partitioning&lt;/em&gt;&lt;/strong&gt; - Dividing data based on how it is used by different parts of an application. For example, data related to customer orders might be separated from data related to product inventory. &lt;/p&gt;
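&lt;p&gt;A minimal sketch of how the three horizontal schemes route rows (illustrative Python only; the year boundaries, keys, and region lists are hypothetical, and real databases implement this routing internally):&lt;/p&gt;

```python
# Illustrative sketches of the three horizontal partitioning schemes
# (range, hash, and list). Only the routing logic is shown.
import bisect
import zlib

def range_partition(year, boundaries=(2021, 2022, 2023)):
    """Route a sales row to a partition by year range."""
    return bisect.bisect_right(boundaries, year)

def hash_partition(key, num_partitions=4):
    """Route a row by a deterministic hash of its key."""
    return zlib.crc32(str(key).encode()) % num_partitions

def list_partition(region, default=3):
    """Route a row by an explicit list of values."""
    mapping = {"EU": 0, "US": 1, "APAC": 2}
    return mapping.get(region, default)  # unknown regions fall through

print(range_partition(2020))   # partition 0: years before 2021
print(list_partition("US"))    # partition 1
```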

&lt;p&gt;&lt;strong&gt;USE CASES&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  OLAP operations&lt;/li&gt;
&lt;li&gt;  Machine Learning pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;G) ETL vs ELT&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ETL (Extract, Transform, Load):&lt;/strong&gt;&lt;br&gt;
• Data is extracted from its source.&lt;br&gt;
• Data is transformed into a usable format, often in a staging area.&lt;br&gt;
• The transformed data is then loaded into the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ELT (Extract, Load, Transform):&lt;/strong&gt;&lt;br&gt;
• Data is extracted from its source.&lt;br&gt;
• The raw, untransformed data is loaded directly into the data warehouse.&lt;br&gt;
• Data transformation happens within the data warehouse using its processing power.&lt;/p&gt;
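&lt;p&gt;The difference is purely the order of the load and transform steps, which can be sketched on a toy record set (the "warehouse" here is just a Python list standing in for a real target):&lt;/p&gt;

```python
# Minimal sketch contrasting ETL and ELT on a toy record set.
raw = [{"name": " Alice ", "amount": "10"}, {"name": "Bob ", "amount": "5"}]

def transform(rows):
    # Clean names and cast amounts to integers.
    return [{"name": r["name"].strip(), "amount": int(r["amount"])} for r in rows]

# ETL: transform in a staging step, then load the clean data.
etl_warehouse = transform(raw)

# ELT: load the raw data first, then transform inside the warehouse.
elt_warehouse = list(raw)                  # raw rows land in the warehouse as-is
elt_warehouse = transform(elt_warehouse)   # transformation runs "in-warehouse"

print(etl_warehouse[0])  # same final result either way
```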

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hzaum45yps8yzntd21v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hzaum45yps8yzntd21v.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;H) CAP Theorem/Brewer’s Theorem&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;The CAP theorem states that a distributed database system has to make a tradeoff between Consistency and Availability when a Partition occurs.&lt;br&gt;
Consistency means that the user should be able to see the same data no matter which node they connect to on the system. For example, your bank account should reflect the same balance whether you view it from your PC, tablet, or smartphone!&lt;br&gt;
Availability means that every request from the user should elicit a response from the system.&lt;br&gt;
Partition refers to a communication break between nodes within a distributed system.&lt;br&gt;
Partition tolerance means the system continues operating despite such breaks; it is typically achieved by keeping replicas of the records on multiple different nodes.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;I) Windowing in Streaming&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Windowing divides a continuous data stream into smaller, finite chunks called streaming windows, so that each window can be processed separately, e.g., calculating average website traffic per hour.&lt;br&gt;
Time in windowing:&lt;br&gt;
i.  &lt;em&gt;Processing Time&lt;/em&gt; – the time at which the system observes and processes an event; used when the exact time the event occurred does not matter.&lt;br&gt;
ii. &lt;em&gt;Event Time&lt;/em&gt; – the time at which the event actually occurred at its source.&lt;br&gt;
Time-based windows:&lt;br&gt;
i.  &lt;em&gt;Tumbling windows&lt;/em&gt; – fixed-size and non-overlapping&lt;br&gt;
ii. &lt;em&gt;Hopping windows&lt;/em&gt; – fixed-size, advancing by a hop interval, so they may overlap&lt;br&gt;
iii.    &lt;em&gt;Sliding windows&lt;/em&gt; – overlapping windows that slide continuously with incoming events&lt;br&gt;
iv. &lt;em&gt;Session windows&lt;/em&gt; – variable-size windows bounded by gaps of inactivity&lt;/p&gt;
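&lt;p&gt;A tumbling window, the simplest of these, can be sketched in a few lines: events are bucketed by event time into fixed, non-overlapping intervals, then each bucket is aggregated independently (the timestamps and counts below are made up):&lt;/p&gt;

```python
# Sketch of a tumbling window: events with timestamps (in seconds) are
# grouped into fixed 60-second buckets, then each bucket is aggregated
# independently (e.g., hits per minute).
from collections import defaultdict

events = [(5, 1), (42, 1), (61, 1), (119, 1), (130, 1)]  # (event_time, hits)

def tumbling_windows(events, size=60):
    windows = defaultdict(int)
    for ts, hits in events:
        window_start = (ts // size) * size   # bucket by event time
        windows[window_start] += hits
    return dict(windows)

print(tumbling_windows(events))  # {0: 2, 60: 2, 120: 1}
```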
&lt;h2&gt;
  
  
  &lt;u&gt;J) DAGs and Workflow Orchestration&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- &lt;em&gt;DAGs (Directed Acyclic Graphs)&lt;/em&gt; are a fundamental concept in workflow orchestration, representing tasks and their dependencies in a structured way.&lt;br&gt;
-- &lt;em&gt;Workflow orchestration&lt;/em&gt; manages the execution of these DAGs, ensuring tasks are executed in the correct order, handling dependencies, failures and retries.&lt;/p&gt;
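&lt;p&gt;A sketch of the ordering an orchestrator derives from a DAG, using Kahn's topological sort (the task names are hypothetical):&lt;/p&gt;

```python
# Sketch of how an orchestrator derives an execution order from a DAG:
# a task runs only after all of its upstream dependencies complete
# (Kahn's topological sort).
from collections import deque

deps = {                 # task: upstream dependencies
    "extract": [],
    "transform": ["extract"],
    "validate": ["extract"],
    "load": ["transform", "validate"],
}

def execution_order(deps):
    indegree = {t: len(up) for t, up in deps.items()}
    downstream = {t: [] for t in deps}
    for t, up in deps.items():
        for u in up:
            downstream[u].append(t)
    ready = deque(t for t in deps if indegree[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for d in downstream[t]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return order

print(execution_order(deps))  # extract first, load last
```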
&lt;h2&gt;
  
  
  &lt;u&gt;K) Retry Logic and Dead Letter Queues&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- A Dead Letter Queue (DLQ) acts as a secondary queue in messaging systems, designed to manage messages that fail to process.&lt;br&gt;
-- When a message cannot be delivered, it is redirected to the DLQ instead of being lost or endlessly retried.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://media.geeksforgeeks.org/wp-content/uploads/20240610135952/Dead-Letter-Queue.webp" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;media.geeksforgeeks.org&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;-- Retry logic attempts to redeliver messages that fail to be processed initially. E.g., if a message fails to be processed due to a database connection issue, the retry logic will attempt to resend the message after a short delay. If the database is back online, the message will be processed successfully. &lt;/p&gt;
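&lt;p&gt;The retry-then-park flow can be sketched as follows (the handlers and message ids are made up; a real system would also wait with exponential backoff between attempts):&lt;/p&gt;

```python
# Sketch of retry logic with a dead letter queue: a message is retried
# a few times; if it still fails, it is parked in the DLQ instead of
# being lost or retried forever. Failures are simulated in the handlers.
def process_with_retries(message, handler, max_attempts=3, dlq=None):
    for attempt in range(1, max_attempts + 1):
        try:
            handler(message)
            return "ok"
        except Exception:
            pass  # in a real system: wait with exponential backoff here
    if dlq is not None:
        dlq.append(message)  # park the poison message for later inspection
    return "dead-lettered"

dlq = []
attempts = {"n": 0}

def flaky_handler(msg):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")  # fails once, then succeeds

def always_fails(msg):
    raise RuntimeError("permanent failure")

print(process_with_retries({"id": 1}, flaky_handler, dlq=dlq))  # ok
print(process_with_retries({"id": 2}, always_fails, dlq=dlq))   # dead-lettered
print(dlq)  # only the unprocessable message ends up in the DLQ
```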

&lt;h2&gt;
  
  
  &lt;u&gt;L) Backfilling and Reprocessing&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- &lt;em&gt;Backfilling&lt;/em&gt; is the process of filling in missing historical data, or correcting historical data that was not processed correctly during the initial run. It ensures data consistency, corrects errors, and provides a complete historical record for analysis and reporting. &lt;br&gt;
-- &lt;em&gt;Reprocessing&lt;/em&gt; involves re-running a data pipeline, either partially or fully, with updated logic, code, or configurations. It corrects errors, incorporates new insights, or applies changes to historical data.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;M) Data Governance&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- A framework that ensures data is reliable, consistent and aligns with business goals. It focuses on data quality, data security, compliance, data lifecycle management and data stewardship.&lt;br&gt;
E.g., imagine a hospital storing patient records. Data governance would dictate how that data is collected, stored, accessed, and protected. Data engineers would build the systems to store and process the data, ensuring that it adheres to the data governance policies regarding access control, data security, and data privacy.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;N) Time Travel and Data Versioning&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-- &lt;em&gt;Time Travel&lt;/em&gt; is the ability to access historical versions of datasets at previous points in time. Allows users to query and manipulate data as it existed at any point in the past. Instead of storing each version as a separate entity, time travel maintains a historical record of all changes made to the data.&lt;br&gt;
-- &lt;em&gt;Data versioning&lt;/em&gt; is the practice of keeping multiple versions of data objects with each version representing the state of the data object at a specific point in time. It involves the explicit creation of new versions of data objects whenever changes are made.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;O) Distributed Data Processing&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Handling and analyzing data across multiple interconnected machines (nodes). Crucial for managing and processing large-scale datasets, i.e., big data.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>kafka</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Prudence Waithira</dc:creator>
      <pubDate>Sat, 13 Sep 2025 19:11:03 +0000</pubDate>
      <link>https://forem.com/prudiec/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1833</link>
      <guid>https://forem.com/prudiec/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-1833</guid>
      <description>&lt;p&gt;Apache Kafka – a distributed streaming platform widely used for building real-time data pipelines and streaming applications.&lt;br&gt;
It was originally developed at LinkedIn and open sourced in 2011.&lt;br&gt;
    &lt;strong&gt;Architecture&lt;/strong&gt;&lt;br&gt;
1. Brokers: the server component responsible for storing, replicating, and serving data streams to clients. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleesqe67xkc8uv2gvyyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fleesqe67xkc8uv2gvyyw.png" alt=" " width="486" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. Kafka cluster: a group of brokers working together, with each broker independently managing partitions of topics. Brokers can be scaled horizontally.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhddp16fazqrsd37lc818.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhddp16fazqrsd37lc818.png" alt=" " width="485" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Functions of a Kafka Broker:&lt;/strong&gt;&lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Data Storage:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Brokers store data as part of topics in partitioned logs. &lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Data Serving:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
They receive and respond to produce (write) and fetch (read) requests from client applications. &lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Data Replication:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Brokers replicate data partitions to other brokers in the cluster, ensuring that data is not lost if a broker fails and providing high availability. &lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Cluster Management (via KRaft or ZooKeeper):&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Brokers work together in a cluster, coordinating their activities and managing the distribution of topic partitions. &lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Fault Tolerance:&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
By distributing and replicating data across multiple brokers, the cluster remains operational even if some brokers fail. &lt;br&gt;
• &lt;strong&gt;&lt;em&gt;Dynamic membership&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;ZooKeeper&lt;/strong&gt; – historically used to manage cluster metadata and coordinate brokers, running as a separate cluster external to Kafka.&lt;br&gt;
However, Kafka is evolving to &lt;em&gt;&lt;strong&gt;KRaft&lt;/strong&gt;&lt;/em&gt; – it uses the internal Raft consensus protocol to manage cluster metadata directly within the Kafka brokers, eliminating the need for a separate ZooKeeper cluster.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Topics and Partitions:&lt;/strong&gt;&lt;br&gt;
-- Topics – where messages are published by producers and from which consumers subscribe to retrieve records. Serve as a way to categorize different types of data streams within Kafka. For example, a "user_activity" topic might contain records related to user logins, clicks, and page views, while an "order_processing" topic might contain records about new orders, order updates, and order cancellations.&lt;br&gt;
-- Partitions – Kafka topics are divided into one or more partitions. Each partition is an ordered, immutable sequence of records that is continually appended to. Partitions form the basic unit of parallelism and data distribution in Kafka. Each record within a partition is assigned a unique, incremental identifier called an offset, which represents its position within that partition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn3tlga0771n0e3nsqan.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnn3tlga0771n0e3nsqan.png" alt=" " width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;
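&lt;p&gt;The partition-as-log idea above can be sketched as an append-only list where the index is the offset (a deliberately simplified model, not a Kafka API):&lt;/p&gt;

```python
# Sketch of a partition as an ordered, append-only log: each appended
# record gets the next offset, and a consumer reads forward from a
# committed offset.
class Partition:
    def __init__(self):
        self.log = []

    def append(self, record):
        self.log.append(record)
        return len(self.log) - 1   # the record's offset

    def read_from(self, offset):
        return self.log[offset:]   # consume forward from a committed offset

p = Partition()
print(p.append("order created"))   # offset 0
print(p.append("order updated"))   # offset 1
print(p.read_from(1))              # resume from offset 1
```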
&lt;h2&gt;
  
  
  &lt;u&gt;Producers and Consumers&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;a)  &lt;strong&gt;Producers&lt;/strong&gt;&lt;br&gt;
Role: Send data records (events) to specific Kafka topics. &lt;br&gt;
Decoupling: Producers don't need to know about consumers; they just publish data to topics. &lt;br&gt;
Batching &amp;amp; Compression: Producers can batch records and compress them for efficiency.&lt;br&gt;
Message Partitioning: Producers can determine the partition a message goes into, often by using a key, which ensures related messages are sent to the same partition.&lt;br&gt;
b)  &lt;strong&gt;Consumers&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Role:&lt;/strong&gt;&lt;br&gt;
Subscribe to one or more topics and read (consume) the event streams published to them. &lt;br&gt;
&lt;strong&gt;Decoupling:&lt;/strong&gt; &lt;br&gt;
Consumers also don't need to know about producers; they subscribe to topics to receive messages. &lt;br&gt;
&lt;strong&gt;Consumer Groups:&lt;/strong&gt; &lt;br&gt;
Consumers organize themselves into consumer groups to achieve parallel processing. &lt;br&gt;
&lt;strong&gt;Parallel Processing:&lt;/strong&gt; &lt;br&gt;
Within a consumer group, each partition is assigned to only one consumer, allowing for distributed processing of messages. &lt;br&gt;
&lt;strong&gt;Offset Management:&lt;/strong&gt; &lt;br&gt;
Consumers keep track of their progress by committing offsets, which are essentially pointers to the last message they processed in a partition, allowing them to resume from where they left off.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;Message Delivery Semantics&lt;/u&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;At-most-once Delivery&lt;/strong&gt;&lt;br&gt;
• Description: Messages are delivered zero or one time, meaning they can be lost if the system fails after they are sent but before they are processed.&lt;br&gt;
• Behavior: In Kafka, this is achieved by automatically committing consumer offsets as soon as messages are received. If the consumer fails before processing, the messages are lost and won't be read again.&lt;br&gt;
• Use Case: Suitable for applications where occasional message loss is acceptable and high throughput is prioritized.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At-least-once Delivery&lt;/strong&gt;&lt;br&gt;
• Description: Guarantees that every message is delivered at least once, but it's possible for a message to be delivered multiple times.&lt;br&gt;
• Behavior: Kafka achieves this when the producer waits for acknowledgment before considering a message committed, or when the consumer commits its offset after successfully processing a message. If a failure occurs after processing but before the offset is committed, the message will be re-delivered.&lt;br&gt;
• Use Case: Ideal when message loss is unacceptable, such as in financial transactions, but duplicate processing can be handled by making the consumer idempotent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exactly-once Delivery&lt;/strong&gt;&lt;br&gt;
• Description: Each message is delivered exactly once, without any duplication or loss.&lt;br&gt;
• Behavior: This is the most complex to achieve and requires cooperation between the producer, Kafka, and the consumer.&lt;br&gt;
  – Kafka to Kafka (Kafka Streams): the Kafka Streams API enables exactly-once semantics by leveraging Kafka's transaction API.&lt;br&gt;
  – Kafka to External System (Sink): requires an idempotent consumer that ensures it only processes each message once, even if it receives duplicates.&lt;br&gt;
• Use Case: Essential for scenarios where data accuracy and non-duplication are critical, such as order processing or accounting.&lt;/li&gt;
&lt;/ol&gt;
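&lt;p&gt;The idempotent consumer mentioned under at-least-once delivery can be sketched as a dedupe on message ids (the ids and payloads are hypothetical):&lt;/p&gt;

```python
# Sketch of an idempotent consumer: at-least-once delivery may hand the
# same message over twice, so the consumer remembers the ids it has
# already processed and skips duplicates.
processed_ids = set()
results = []

def handle(message):
    if message["id"] in processed_ids:
        return "skipped duplicate"
    processed_ids.add(message["id"])
    results.append(message["payload"])  # the side effect runs exactly once
    return "processed"

print(handle({"id": "m1", "payload": 100}))  # processed
print(handle({"id": "m1", "payload": 100}))  # skipped duplicate (redelivery)
print(results)  # the payload was applied only once
```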
&lt;h2&gt;
  
  
  &lt;u&gt;Retention Policy and Back Pressure Handling&lt;/u&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Kafka uses retention policies (time or size-based) to manage disk space by deleting old messages, while back pressure occurs when producers generate data faster than consumers can process it.&lt;/li&gt;
&lt;li&gt;  Kafka handles back pressure by enabling consumer groups to scale horizontally, throttling producers, and providing consumer-level configurations to control fetch behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kafka Retention Policies:&lt;br&gt;
• &lt;strong&gt;Time-Based Retention:&lt;/strong&gt; Messages are deleted after a specified duration (e.g., 7 days by default).&lt;br&gt;
• &lt;strong&gt;Size-Based Retention:&lt;/strong&gt; Messages are deleted to keep the total size of a partition below a configured limit.&lt;br&gt;
• &lt;strong&gt;Combined Policies:&lt;/strong&gt; You can combine time and size-based retention for a customized strategy.&lt;/p&gt;

&lt;p&gt;Back Pressure in Kafka:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Back pressure occurs when the rate of data production exceeds the rate of data consumption, creating bottlenecks and potential system instability.&lt;/li&gt;
&lt;li&gt;Indicators: an increase in consumer lag (the difference between a consumer's latest offset and the producer's head of the log) signals back pressure on the consumer side.&lt;/li&gt;
&lt;li&gt;Causes: consumers can't process messages fast enough due to network issues, disk I/O, or processing complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Handling Back Pressure:&lt;br&gt;
a)  Scale consumers horizontally&lt;br&gt;
b)  Consumer-level optimizations – batching and message compression&lt;br&gt;
c)  Rate limiting&lt;br&gt;
d)  Stream processing frameworks&lt;br&gt;
e)  Monitor consumer lag&lt;/p&gt;
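&lt;p&gt;Consumer lag, the main back-pressure indicator above, is just the gap between the log-end offset and the committed offset, summed over partitions (the offsets below are made up):&lt;/p&gt;

```python
# Sketch of the consumer-lag calculation used to detect back pressure:
# lag per partition is the gap between the partition's latest offset
# (log end) and the consumer group's committed offset.
log_end_offsets = {0: 1500, 1: 1200, 2: 900}     # hypothetical values
committed_offsets = {0: 1500, 1: 1100, 2: 400}

def total_lag(log_end, committed):
    return sum(log_end[p] - committed[p] for p in log_end)

print(total_lag(log_end_offsets, committed_offsets))  # 600
```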
&lt;h2&gt;
  
  
  &lt;u&gt;Serialization and Schema Evolution&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;-&lt;em&gt;Serialization&lt;/em&gt; in Kafka refers to the process of converting data objects into a byte array format suitable for transmission over the network and storage in Kafka topics. &lt;br&gt;
-&lt;em&gt;Deserialization&lt;/em&gt; is the reverse process, converting the byte array back into a usable data object. This is crucial as Kafka messages are essentially byte arrays, and applications need to understand the structure of the data within those bytes. Common serialization formats include &lt;strong&gt;&lt;em&gt;Avro, JSON,&lt;/em&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;em&gt;Protobuf&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Schema evolution&lt;/em&gt; addresses the challenge of managing changes to the structure of data over time in Kafka topics. It ensures changes to the data’s schema can be managed without breaking compatibility between producers and consumers, allowing older and newer versions of data to coexist and be processed correctly.
The Kafka Schema Registry plays a central role in facilitating both serialization and schema evolution. It acts as a centralized repository for managing and validating schemas.&lt;/li&gt;
&lt;/ul&gt;
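&lt;p&gt;Backward compatibility usually hinges on defaults for newly added fields. A minimal sketch (the field names are hypothetical, and a real deployment would rely on Avro and the Schema Registry rather than plain dicts):&lt;/p&gt;

```python
# Sketch of backward-compatible schema evolution: a new optional field
# with a default lets a consumer on the new schema read records that
# were written under the old one.
NEW_SCHEMA_DEFAULTS = {"email": None}  # field added in schema v2

def read_with_schema(record):
    decoded = dict(NEW_SCHEMA_DEFAULTS)  # start from the v2 defaults
    decoded.update(record)               # overlay whatever was written
    return decoded

old_record = {"user_id": 7, "name": "Alice"}  # written with schema v1
print(read_with_schema(old_record))           # v2 reader fills in the default
```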
&lt;h2&gt;
  
  
  &lt;u&gt;Kafka Schema Registry&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;A centralized repository for managing and validating schemas for data exchanged within the Kafka ecosystem.&lt;br&gt;
Key aspects of Kafka Schema Registry:&lt;br&gt;
• Centralized Schema Management:&lt;br&gt;
It provides a single location to store and manage schemas, ensuring all producers and consumers share a common understanding of message formats.&lt;br&gt;
• Data Contract Enforcement:&lt;br&gt;
Schemas act as a data contract, defining the structure and types of data within Kafka messages. This helps prevent breaking changes and ensures data quality.&lt;br&gt;
• Schema Evolution:&lt;br&gt;
Schema Registry supports schema evolution, allowing you to introduce changes to your data formats (e.g., adding new fields) while maintaining compatibility with older consumers. It handles compatibility checks (forward and backward compatibility) to prevent data corruption.&lt;br&gt;
• Serialization and Deserialization:&lt;br&gt;
It provides serializers and deserializers that integrate with Kafka clients, handling the process of converting data to and from a binary format (like Avro) based on the registered schemas.&lt;br&gt;
• Data Governance:&lt;br&gt;
Schema Registry plays a crucial role in data governance by providing visibility into data lineage, enabling audit capabilities, and facilitating collaboration among teams working with Kafka data.&lt;br&gt;
• Underlying Storage:&lt;br&gt;
Schema Registry uses Kafka itself as its durable backend, leveraging Kafka's log-based architecture for storing and managing schema metadata.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;Replication and Fault Tolerance&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Kafka achieves durability through replication:&lt;br&gt;
• Each partition has multiple replicas across brokers.&lt;br&gt;
• One replica is the leader, handling all read/write requests.&lt;br&gt;
• Other replicas are followers, synchronizing data from the leader.&lt;br&gt;
• The set of replicas in sync is called ISR (In-Sync Replicas).&lt;br&gt;
If the leader fails, Kafka elects a new leader from followers, ensuring no data loss and continuous availability. The recommended replication factor is typically 3 for balance between fault tolerance and overhead.&lt;/p&gt;
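&lt;p&gt;Leader failover can be sketched as choosing the first surviving in-sync replica (the broker ids are hypothetical; in a real cluster the election is coordinated by the cluster controller):&lt;/p&gt;

```python
# Sketch of leader failover: when the leader of a partition fails, a
# new leader is chosen from the remaining in-sync replicas (ISR),
# never from replicas that have fallen behind.
replicas = [101, 102, 103]   # all replicas of one partition
isr = [101, 103]             # broker 102 has fallen out of sync
leader = 101

def elect_new_leader(failed_leader, isr):
    candidates = [b for b in isr if b != failed_leader]
    if not candidates:
        return None          # an unclean election would risk data loss
    return candidates[0]

leader = elect_new_leader(leader, isr)
print(leader)  # 103: the surviving in-sync replica takes over
```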
&lt;h2&gt;
  
  
  &lt;u&gt;Kafka Connect and Kafka Streams&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;A) Kafka Connect:&lt;br&gt;
• Purpose:&lt;br&gt;
Kafka Connect is a framework for reliably streaming data between Apache Kafka and other data systems. It focuses on simplifying the process of getting data into and out of Kafka.&lt;br&gt;
• Functionality:&lt;br&gt;
It uses pre-built or custom-developed "connectors" to interact with various data sources (e.g., databases, file systems, cloud storage) and sinks (e.g., data warehouses, search indexes).&lt;br&gt;
• Use Cases:&lt;br&gt;
Ideal for data integration, ETL (Extract, Transform, Load) operations where the primary goal is to move data between systems with minimal or no complex transformations.&lt;br&gt;
• Key Feature:&lt;br&gt;
Provides a scalable and fault-tolerant way to manage data pipelines without requiring extensive custom code for data movement. &lt;/p&gt;

&lt;p&gt;B) Kafka Streams:&lt;br&gt;
• Purpose:&lt;br&gt;
Kafka Streams is a client library for building real-time stream processing applications directly on top of Apache Kafka. It focuses on processing and analyzing data within Kafka topics.&lt;br&gt;
• Functionality:&lt;br&gt;
It allows developers to write Java/Scala applications that consume data from Kafka topics, perform various transformations, aggregations, joins, and then produce results back to other Kafka topics or external systems.&lt;br&gt;
• Use Cases:&lt;br&gt;
Suited for real-time analytics, event-driven microservices, complex event processing, and applications requiring continuous data processing and analysis.&lt;br&gt;
• Key Feature:&lt;br&gt;
Offers powerful abstractions (KStream, KTable) for representing and manipulating streams and tables of data, enabling sophisticated stream processing logic with built-in fault tolerance and scalability. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39fse0ym8ubm7cymi0po.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F39fse0ym8ubm7cymi0po.png" alt=" " width="295" height="171"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;ksqlDB: SQL for Streaming&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;ksqlDB offers a &lt;strong&gt;SQL-like interface&lt;/strong&gt; for streaming data on top of Kafka:&lt;br&gt;
• Enables real-time filtering, aggregation, joining, and enrichment.&lt;br&gt;
• Simplifies stream processing without requiring Java/Scala coding.&lt;br&gt;
• Supports creating persistent views/tables on streaming data.&lt;br&gt;
• Used extensively in industries like healthcare for real-time transaction monitoring and anomaly detection.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;u&gt;Transactions and Idempotence&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Kafka supports exactly-once semantics (EOS) through:&lt;br&gt;
• Idempotent Producers: Prevent duplicate message sends during retries by assigning sequence numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaProducer
import json
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # Replace with your Kafka broker addresses
    enable_idempotence=True,              # Enable idempotence (requires a client/broker version that supports it)
    acks='all',                           # Ensure all replicas acknowledge the write
    retries=10,                           # Number of retries for failed sends
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
try:
    future = producer.send('my_topic', {'message': 'This is an idempotent message'})
    record_metadata = future.get(timeout=10)
    print(f"Message sent successfully to topic: {record_metadata.topic}, partition: {record_metadata.partition}, offset: {record_metadata.offset}")
except Exception as e:
    print(f"Error sending message: {e}")
finally:
    producer.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;• Transactions: Enable grouped writes and offset commits to be atomic, preventing partial processing. This means either all messages in the transaction are committed and become visible to consumers, or none of them are. &lt;br&gt;
Producers enable idempotence with enable.idempotence=true and transactions with transactional.id=xxx, improving accuracy in critical workflows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    transactional_id='my_transactional_producer_id', # Unique ID for the transactional producer
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.init_transactions()

try:
    producer.begin_transaction()
    producer.send('topic_a', {'data': 'message from topic A'})
    producer.send('topic_b', {'data': 'message from topic B'})
    producer.commit_transaction()
    print("Transaction committed successfully.")
except Exception as e:
    producer.abort_transaction()
    print(f"Transaction aborted due to error: {e}")
finally:
    producer.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;u&gt;Security in Kafka&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Kafka incorporates robust security features:&lt;br&gt;
• Authentication: Supports SASL (Kerberos, OAuth), TLS client certificates.&lt;br&gt;
• Authorization: Fine-grained access control via ACLs.&lt;br&gt;
• Encryption: TLS encryption for data in transit.&lt;br&gt;
These measures protect Kafka clusters from unauthorized access and data breaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;Operations and Monitoring&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;Key metrics for Kafka health:&lt;br&gt;
• Consumer Lag: Indicates delays in processing.&lt;br&gt;
• Under-replicated Partitions: Signals insufficient replication.&lt;br&gt;
• Broker Health: Disk, network, JVM metrics.&lt;br&gt;
• Throughput and Latency: Evaluated for performance tuning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;Performance Optimization&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;a. Batching and Compression&lt;/strong&gt;&lt;br&gt;
• Batching - Kafka producers group messages into batches before sending them to brokers. This reduces network overhead and improves throughput.&lt;br&gt;
• Compression – Message compression at the producer level. This reduces the amount of data transferred over the network and stored on disk, saving network bandwidth and disk space. &lt;br&gt;
&lt;strong&gt;b. Page cache Usage&lt;/strong&gt;&lt;br&gt;
• By maximizing page cache utilization, Kafka minimizes direct disk I/O, leading to faster read and write operations. &lt;br&gt;
&lt;strong&gt;c. Disk and Network Considerations&lt;/strong&gt;&lt;br&gt;
• Disk:&lt;br&gt;
  – Fast disks (e.g., SSDs) for Kafka data directories are crucial for optimal performance, especially for write-heavy workloads.&lt;br&gt;
  – RAID configuration: employing appropriate RAID configurations can improve disk I/O performance and provide data redundancy.&lt;br&gt;
  – Separate disks: ideally, separate disks should be used for Kafka logs and operating system files to prevent contention.&lt;br&gt;
• Network:&lt;br&gt;
  – High-bandwidth network&lt;br&gt;
  – Network interface cards&lt;br&gt;
  – Proper network configuration&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;u&gt;Scaling Kafka&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;This involves strategies for handling increased data loads and ensuring optimal performance.&lt;br&gt;
&lt;strong&gt;1. Scaling Kafka Partition Count Tuning:&lt;/strong&gt;&lt;br&gt;
• Partitions and Parallelism:&lt;br&gt;
Partitions are the unit of parallelism in Kafka. Increasing the number of partitions for a topic allows for greater parallelism in both producers (writing messages) and consumers (reading messages).&lt;br&gt;
• Consumer Groups:&lt;br&gt;
Within a consumer group, each partition can only be consumed by one consumer instance at a time. Therefore, the maximum parallelism for a consumer group is limited by the number of partitions in the topic. Adding more consumer instances than partitions will result in idle consumers.&lt;br&gt;
• Overhead:&lt;br&gt;
While increasing partitions can improve throughput, too many partitions can introduce overhead on brokers and consumers due to increased metadata management and potential for more frequent rebalances.&lt;br&gt;
• Rule of Thumb:&lt;br&gt;
A common starting point is 3-5 partitions per consumer instance in your consumer group, adjusting based on data volume and processing requirements.&lt;br&gt;
&lt;strong&gt;2. Adding Brokers:&lt;/strong&gt;&lt;br&gt;
• Horizontal Scaling:&lt;br&gt;
Adding new brokers to a Kafka cluster is a form of horizontal scaling, increasing the overall capacity of the cluster to handle higher data loads and improve fault tolerance.&lt;br&gt;
• Uneven Distribution:&lt;br&gt;
When new brokers are added, existing topic partitions are not automatically distributed to them, leading to an unbalanced cluster where new brokers remain idle while older ones carry the load.&lt;br&gt;
• Replication and High Availability:&lt;br&gt;
Adding brokers allows for more replicas of partitions to be stored, enhancing data durability and high availability in case of broker failures.&lt;br&gt;
&lt;strong&gt;3. Rebalancing Partitions:&lt;/strong&gt;&lt;br&gt;
• Necessity:&lt;br&gt;
Rebalancing is crucial after adding new brokers to distribute partitions evenly across all brokers (old and new), ensuring optimal resource utilization and preventing performance bottlenecks on overloaded brokers.&lt;br&gt;
• Tools:&lt;br&gt;
• kafka-reassign-partitions.sh: This command-line utility for self-managed Kafka clusters allows for manual partition reassignment. It requires creating a JSON file defining the desired partition distribution.&lt;br&gt;
• Cruise Control: An open-source tool that automates partition rebalancing. It continuously monitors cluster performance and intelligently rebalances partitions to maintain an optimal distribution, reducing manual effort. &lt;br&gt;
• Impact:&lt;br&gt;
Rebalancing can temporarily impact performance as data is moved between brokers. Planning and monitoring during rebalancing operations are essential.&lt;/p&gt;
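&lt;p&gt;What a reassignment plan computes can be sketched as a round-robin spread of partitions over the enlarged broker set (the broker ids are hypothetical; kafka-reassign-partitions.sh takes an explicit JSON plan rather than computing one):&lt;/p&gt;

```python
# Sketch of the distribution a rebalance aims for: after adding a
# broker, spread the topic's partitions round-robin over all brokers,
# old and new, so no broker sits idle.
def reassignment_plan(num_partitions, broker_ids):
    return {p: broker_ids[p % len(broker_ids)] for p in range(num_partitions)}

# 6 partitions previously on brokers [1, 2]; broker 3 has been added:
plan = reassignment_plan(6, [1, 2, 3])
print(plan)  # {0: 1, 1: 2, 2: 3, 3: 1, 4: 2, 5: 3}
```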

&lt;h2&gt;
  
  
  &lt;u&gt;Real-World Use Cases&lt;/u&gt;
&lt;/h2&gt;

&lt;p&gt;• Netflix: Employs Kafka for real-time event ingestion and stream processing in their data platform to personalize content and monitor services.&lt;br&gt;
• LinkedIn: Originator of Kafka; uses it extensively for data integration, activity tracking, and operational metrics.&lt;br&gt;
• Uber: Uses Kafka for event-driven microservices communication, real-time analytics, and surge pricing algorithms.&lt;br&gt;
These companies benefit from Kafka’s scalability, fault tolerance, and exactly-once semantics to deliver reliable, real-time data-driven services.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>kafka</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
