<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Muhammad Ibrahim Anis</title>
    <description>The latest articles on Forem by Muhammad Ibrahim Anis (@ibrahim_anis).</description>
    <link>https://forem.com/ibrahim_anis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F986799%2F2da4fba1-b0cf-49ad-b65b-24c9670f1537.png</url>
      <title>Forem: Muhammad Ibrahim Anis</title>
      <link>https://forem.com/ibrahim_anis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ibrahim_anis"/>
    <language>en</language>
    <item>
      <title>Learning Kafka Part 4 (III): Kafka Connect</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Fri, 20 Jan 2023 13:23:36 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-4-iii-kafka-connect-4lk2</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-4-iii-kafka-connect-4lk2</guid>
      <description>&lt;p&gt;Sometimes, our source of data might be an external system, like a database, data warehouse or a file system. Also, we might want to send the data in Kafka topics to these external systems. Instead of writing custom implementation, Kafka has a component called Kafka Connect for this solution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Connect
&lt;/h2&gt;

&lt;p&gt;Kafka Connect, also known simply as Connect, is a framework for connecting Kafka to external systems such as databases, search indexes, message queues and file systems, using &lt;em&gt;connectors&lt;/em&gt;.&lt;br&gt;
Connect runs in its own process, separate from the Kafka cluster. Using Kafka Connect requires no programming; it is completely configuration based.&lt;/p&gt;

&lt;p&gt;Connectors are ready-to-use components that import data from external systems into Kafka topics and/or export data from Kafka topics into external systems. We can use existing implementations of these connectors or implement our own.&lt;/p&gt;
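
&lt;p&gt;As a small sketch of how configuration-only a pipeline can be, here is a hypothetical standalone-mode properties file for the FileStreamSource example connector that ships with Kafka (the file path and topic name are made up):&lt;/p&gt;

```properties
# Hypothetical file: file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
# Read lines from this (made-up) file...
file=/tmp/orders.txt
# ...and write each line as a message to this topic
topic=file-lines
```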

&lt;p&gt;There are two types of connectors in Kafka Connect.&lt;br&gt;
A &lt;em&gt;source connector&lt;/em&gt; collects data from a system. Source systems can be databases, filesystems, cloud object stores etc. For example, JDBCSourceConnector would import data from a relational database into Kafka.&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;sink connector&lt;/em&gt; delivers data from Kafka topics to external systems, which can also be databases, filesystems, cloud object stores etc. For example, HDFSSinkConnector would export data from Kafka topics to the Hadoop Distributed File System.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdfj670pz63m474ooodt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdfj670pz63m474ooodt.png" alt="Image of source and sink connectors" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A deep dive into the Connect pipeline
&lt;/h2&gt;

&lt;p&gt;A Connect pipeline is not actually made up of connectors only; it comprises several components, each responsible for carrying out a specific function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connectors&lt;/strong&gt;&lt;br&gt;
Connectors do not perform the actual data copying themselves. Their configuration describes the set of data to be copied, and they are responsible for breaking that job into tasks that can be distributed to Connect workers.&lt;/p&gt;

&lt;p&gt;Connectors run in a Java process called a worker. Connectors can be run in standalone mode or distributed mode.&lt;br&gt;
In standalone mode, we run a single worker process. In this mode, data only flows through the Connect pipeline as long as this single process is up, and we can’t make any changes to the pipeline once it is running. It is intended for testing and temporary or one-time connections, and is not recommended for production environments.&lt;/p&gt;

&lt;p&gt;In distributed mode, a group of workers run in a cluster and collaborate with each other to spread out the workload (like consumers in a consumer group). The Connect pipeline can be reconfigured while running. Having multiple workers means the pipeline can keep sending or receiving data even if a worker goes down, and we can add and remove workers as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tasks&lt;/strong&gt;&lt;br&gt;
Tasks contain the code that actually copies data to or from external systems. They receive their configuration information from their parent connector, which then pushes data to, or pulls data from, the tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Converters&lt;/strong&gt;&lt;br&gt;
Producers and consumers use serializers and deserializers to control how data is translated before being sent to or retrieved from Kafka. So if Connect is getting data out of Kafka, it needs to be aware of the serializers that were used to produce that data; and if Connect is sending data to Kafka, it needs to serialize it to a format the consumers will be able to understand.&lt;/p&gt;

&lt;p&gt;For Connect, we don’t configure a serializer and deserializer separately, like we do for producers and consumers. Instead, we provide a single library, a converter, that can both serialize and deserialize the data to our chosen format.&lt;/p&gt;
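
&lt;p&gt;For example, a worker can be told to use JSON for both keys and values with a couple of configuration lines (a sketch; whether to embed schemas is up to your setup):&lt;/p&gt;

```properties
# Worker (or per-connector) converter settings, sketched
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Set to true if each JSON message should embed its schema
value.converter.schemas.enable=false
```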

&lt;p&gt;For source connectors, converters are invoked after the connector in the Connect pipeline, while for sink connectors, converters are invoked before the connector.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feldn68kj790deu2mrqpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feldn68kj790deu2mrqpn.png" alt="Image of kafka connect converter" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transforms&lt;/strong&gt;&lt;br&gt;
Transforms are simple pieces of logic that alter each message produced by or sent to Connect. They allow us to modify messages as they flow through the Connect pipeline, helping us get the data into the right shape for our use case before it reaches either Kafka or the external system. Transformation is an optional component of the Connect pipeline.&lt;/p&gt;

&lt;p&gt;While it is possible to perform complex transformations, it is considered best practice to stick to fast and simple logic, as a slow or heavy transformation will affect the performance of the Connect pipeline. For advanced transformations, it’s best to use Kafka Streams, a dedicated framework for stream processing.&lt;/p&gt;

&lt;p&gt;For source connectors, transformations are invoked after the connector and before the converter; for sink connectors, transformations are invoked after the converter but before the connector.&lt;/p&gt;
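
&lt;p&gt;As an illustration, a single message transformation (SMT) is configured purely through properties. This sketch uses the built-in InsertField transformation to stamp every record with a static field (the field name and value are hypothetical):&lt;/p&gt;

```properties
# Hypothetical SMT configuration for a connector
transforms=addSource
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=orders-db
```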

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficq19wbr95e027j696t1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficq19wbr95e027j696t1.png" alt="Image of connect transforms" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Error Handling
&lt;/h2&gt;

&lt;p&gt;In an ideal world, once our Connect pipeline is up and running, we could call it a night. Sadly, the world is never ideal; something somewhere will always go wrong, including in our Connect pipeline. An example is when a sink system receives data in an invalid format (JSON instead of Avro). And when something invariably does go wrong, we want our pipeline to be able to handle it elegantly. Error handling is an important part of a reliable data pipeline. In Kafka Connect, error handling is implemented by the sink connector, not the source connector.&lt;/p&gt;

&lt;p&gt;Kafka Connect has incorporated the following error handling options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail Fast&lt;/strong&gt;&lt;br&gt;
By default, our sink connector terminates and stops processing messages when it encounters an error. This is a good option if an error in our sink connector indicates a serious problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignore Errors&lt;/strong&gt;&lt;br&gt;
We can optionally configure our sink connector to ignore all errors and never stop processing messages. This is good if we want to avoid downtime, but we run the risk of missing problems in our connector, as we do not receive any notification when something goes wrong. (It always does.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dead Letter Queue&lt;/strong&gt;&lt;br&gt;
A Dead Letter Queue (DLQ) is a mechanism within messaging systems or data pipelines for storing messages that are not processed successfully. Instead of crashing the pipeline or ignoring errors, the system moves such messages to a dead letter queue so we can handle them later.&lt;/p&gt;

&lt;p&gt;In Connect, the Dead Letter Queue is a separate topic that stores messages that cannot be processed. The invalid messages can then be ignored, or fixed and reprocessed. This is the most graceful, and the recommended, way to handle errors.&lt;/p&gt;
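
&lt;p&gt;In configuration terms, a sink connector can be pointed at a dead letter queue with a few properties (the topic name here is hypothetical):&lt;/p&gt;

```properties
# Hypothetical sink connector error-handling settings
errors.tolerance=all
errors.deadletterqueue.topic.name=orders-dlq
# Record the reason for each failure in the message headers
errors.deadletterqueue.context.headers.enable=true
```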

&lt;p&gt;And this brings us to the end of Kafka Connect, and also the end of this part of the series, where we discussed the components that interact with a Kafka cluster. We have had a look at producers and consumers, the applications that write messages to and read messages from Kafka; Kafka Streams, for processing these messages in real time; and lastly Kafka Connect, a framework used to connect Kafka to external systems.&lt;/p&gt;

&lt;p&gt;Coming up, a look at the medium these components use to communicate with Kafka.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>programming</category>
      <category>learning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Learning Kafka Part Four (II): Kafka Streams</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Thu, 12 Jan 2023 04:17:16 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-4-ii-kafka-streams-56d</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-4-ii-kafka-streams-56d</guid>
<description>&lt;p&gt;While moving streaming data from a source system to a target system is important, sometimes we might want to process, transform or react to the data as soon as it becomes available. This is referred to as stream processing. We could of course use a dedicated stream processing platform like Apache Storm or Apache Flink, but remember the problem with ZooKeeper? Right: we would have to set up, integrate and maintain two different systems. What we need is a way to process the data in Kafka itself without added overhead or complexity. Enter Kafka Streams.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka Streams
&lt;/h2&gt;

&lt;p&gt;Kafka Streams is an easy-to-use, lightweight, yet powerful data processing Java library within Kafka. With Kafka Streams, we don't just &lt;em&gt;move&lt;/em&gt; data to and from Kafka but &lt;em&gt;process&lt;/em&gt; that data in real time.&lt;/p&gt;

&lt;p&gt;Kafka Streams functions as a mix of a producer and a consumer: we read data from a topic, process or enrich the data, then write the processed data back to Kafka (i.e., to make the data available to downstream consumers).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1pyij7571vtgfsyo33k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1pyij7571vtgfsyo33k.png" alt="Image of kafka streams" width="722" height="646"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Small and lightweight&lt;/li&gt;
&lt;li&gt;Already Integrated with Kafka&lt;/li&gt;
&lt;li&gt;Applications built with it are just normal Java Applications &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Topology
&lt;/h3&gt;

&lt;p&gt;A Kafka Streams application is structured as a DAG (Directed Acyclic Graph), called a topology.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglz8e572tlse41e7vni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcglz8e572tlse41e7vni.png" alt="Image of a DAG" width="443" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Kafka Streams application can be made up of one or more topologies.&lt;br&gt;
A topology defines the computational logic to be applied to a stream of data. A node in a topology represents a single processor and the edges represent streams of data. &lt;/p&gt;

&lt;p&gt;There are three types of processors in a topology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source processor: a processor that does not have any upstream processor. It produces an input stream to its topology by consuming messages from one or more topics and passing them to downstream processors.&lt;/li&gt;
&lt;li&gt;Stream processor: represents a processing step that transforms the data in the stream. It receives events from its upstream processor(s), which can be a source or stream processor, applies its transformation, and passes the result on to its downstream processor, which can be another stream processor or a sink processor.&lt;/li&gt;
&lt;li&gt;Sink processor: a processor that does not have a downstream processor. It sends events received from upstream processors to a specified Kafka topic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsxbj866v83lv4g8y83k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flsxbj866v83lv4g8y83k.png" alt="Image of a topology" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;br&gt;
We have a producer that sends the details of orders placed by customers to a Kafka topic, and we process these orders to check whether they are valid. Our topology might look like this:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowfqe79kxzwjrb7fi74s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowfqe79kxzwjrb7fi74s.png" alt="Image of order topology" width="701" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Order placed&lt;/strong&gt; is a source processor; its job is to read stream data from a Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is valid&lt;/strong&gt; is a stream processor, responsible for checking if the transaction is valid or not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Valid&lt;/strong&gt; is a sink processor that writes all the valid orders to a Kafka topic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not Valid&lt;/strong&gt; is also a sink processor that writes the error responses back to a different Kafka topic, where a consumer will read and forward the error messages to customers.&lt;/li&gt;
&lt;/ul&gt;
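
&lt;p&gt;To make the processor roles concrete, here is a plain-Java sketch of the same flow (no Kafka APIs; the order amounts and the validity rule are hypothetical): a source feeds records in, a stream processor routes each one, and two lists stand in for the Valid and Not valid sink topics.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.List;

// A minimal, Kafka-free sketch of the order topology:
// "Order placed" feeds records in, "Is valid" routes each one,
// and the valid / notValid lists stand in for sink processors.
public class OrderTopologySketch {

    // Stream processor: here, an order is valid when its amount is positive.
    static boolean isValid(double amount) {
        return amount > 0;
    }

    // Routes each order to the "valid" or "notValid" sink.
    static void route(double[] orders, List valid, List notValid) {
        for (double amount : orders) {
            if (isValid(amount)) {
                valid.add(amount);
            } else {
                notValid.add(amount);
            }
        }
    }

    public static void main(String[] args) {
        double[] orders = {25.0, -3.0, 10.5};   // source processor input
        List valid = new ArrayList();
        List notValid = new ArrayList();
        route(orders, valid, notValid);
        System.out.println("valid=" + valid + " notValid=" + notValid);
    }
}
```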

&lt;h3&gt;
  
  
  Sub-topology
&lt;/h3&gt;

&lt;p&gt;Kafka also has the concept of a sub-topology.&lt;br&gt;
&lt;em&gt;Scenario continued:&lt;br&gt;
The above topology reads from a single topic, processes the data and, based on a condition, writes back to two different topics. But what if we want to build a topology that further reads from the Valid orders topic and applies further business logic to the data, like checking whether the items ordered are available? Our topology might now look like this:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zzdy6p5233amrxo7g7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4zzdy6p5233amrxo7g7t.png" alt="Image of order subtopology" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Valid&lt;/strong&gt;, which is a sink processor for the topology, is now the source processor for our sub-topology.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Items Available&lt;/strong&gt; is a stream processor that checks whether the items are available.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Yes&lt;/strong&gt; and &lt;strong&gt;No&lt;/strong&gt; are both sink processors that write to different Kafka topics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Kafka Streams, records are processed one after another, in the order they arrive. Only a single record is routed through a topology at a time; it is passed through each processor in the topology before another record is allowed in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak6s33lbhn8d04yab8ot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fak6s33lbhn8d04yab8ot.png" alt="Image of routed data" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kafka Streams provides two ways to define a stream processing topology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kafka Streams Processor API: this is the lower-level API that provides us with flexibility and fine control over how we define our processing topology, but comes with more manual coding.&lt;/li&gt;
&lt;li&gt;Kafka Streams DSL (Domain Specific Language) API: this is built on top of the Streams Processor API and provides more abstraction. It is the recommended API for most users because most data processing use cases can be expressed with it. It is very similar to the Java Streams API; we typically use built-in operations like map, filter and mapValues to process data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stream Processing Concepts
&lt;/h2&gt;

&lt;p&gt;When talking about Kafka Streams, or any other stream processing framework, there are some important concepts we need to be aware of, such as stateless processing, stateful processing, time and windowing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stateless processing
&lt;/h3&gt;

&lt;p&gt;In stateless processing, each event is processed independently of other events. The streams application requires no memory of previous events; it only needs to look at the current event and perform an operation. An example is the filter operation: it requires no memory of earlier events, it simply checks the current event and determines whether it should be passed on to the next processor or not. If our topology is made up of only stateless processors, then our streams application is considered a stateless application.&lt;/p&gt;
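
&lt;p&gt;A tiny Java sketch of the idea (plain Java, no Kafka APIs; the temperature readings and threshold are hypothetical): each reading is judged entirely on its own, with no memory of earlier readings.&lt;/p&gt;

```java
import java.util.Arrays;

// Stateless processing sketch: every element is evaluated in isolation,
// analogous to a filter step in a streams topology.
public class StatelessFilterSketch {

    // Keep only readings above the threshold; no state is carried over.
    static double[] keepAbove(double[] readings, double threshold) {
        return Arrays.stream(readings)
                     .filter(r -> r > threshold)
                     .toArray();
    }

    public static void main(String[] args) {
        double[] readings = {18.0, 31.5, 22.0, 35.2};
        System.out.println(Arrays.toString(keepAbove(readings, 30.0)));
    }
}
```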

&lt;h3&gt;
  
  
  Stateful processing
&lt;/h3&gt;

&lt;p&gt;Stateful processing, on the other hand, requires memory of previous events. The application needs to remember information (state) about previously seen events in one or more steps of the topology. An example is the count operation; it needs the number of all previously seen events to keep track of how many there are. If our topology has even one stateful processor, regardless of the number of stateless processors, the application is considered a stateful application.&lt;/p&gt;
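
&lt;p&gt;The contrast shows up clearly in a plain-Java sketch (no Kafka APIs; the event key is hypothetical): a counter can only work by carrying state from one event to the next.&lt;/p&gt;

```java
import java.util.HashMap;
import java.util.Map;

// Stateful processing sketch: counting events per key requires
// remembering every key seen so far (the "state").
public class StatefulCountSketch {

    private final Map counts = new HashMap();  // key to running count

    // Processes one event and returns the updated count for its key.
    long process(String key) {
        long next = 1;
        Object seen = counts.get(key);
        if (seen != null) {
            next = (Long) seen + 1;
        }
        counts.put(key, next);
        return next;
    }

    public static void main(String[] args) {
        StatefulCountSketch counter = new StatefulCountSketch();
        counter.process("page-view");
        System.out.println(counter.process("page-view")); // running count: 2
    }
}
```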

&lt;h3&gt;
  
  
  Windowing
&lt;/h3&gt;

&lt;p&gt;We can never hope to get a global view of a data stream, as theoretically it is endless. Windowing allows us to slice up the endless stream of events into chunks or segments called windows for processing. Slicing can either be time-based (events from the last five minutes, two hours or three days) or count-based (the last one hundred or five thousand events).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windowing strategies&lt;/strong&gt;&lt;br&gt;
There are four types of windowing strategies in Kafka Streams:&lt;/p&gt;

&lt;p&gt;Tumbling window&lt;br&gt;
A tumbling window is a fixed, non-overlapping and gapless window characterized by its size, usually a time interval; we process the stream data after each interval. For example, if we configure a tumbling window with a size of five minutes, all the events within the same five-minute window will be grouped together to be processed. Each event belongs to one and only one window.&lt;/p&gt;

&lt;p&gt;Hopping window&lt;br&gt;
A hopping window is a type of tumbling window with an advance interval added. The advance interval determines how much the window moves forward (i.e., hops) relative to its previous position. This causes overlaps between windows, and events may belong to more than one window.&lt;/p&gt;
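
&lt;p&gt;The two strategies above can be sketched as plain arithmetic (no Kafka APIs; the sizes below are hypothetical): a timestamp maps to exactly one tumbling window but possibly several hopping windows.&lt;/p&gt;

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Window-assignment sketch: given an event timestamp, which window(s)
// does it belong to? All values are in milliseconds.
public class WindowSketch {

    // Tumbling: every event belongs to exactly one window, aligned
    // to multiples of the window size.
    static long tumblingWindowStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    // Hopping: an event can belong to several overlapping windows.
    // Walk backwards from the latest possible start while the window
    // still contains the timestamp.
    static long[] hoppingWindowStarts(long timestamp, long size, long advance) {
        List starts = new ArrayList();
        long start = timestamp - (timestamp % advance);
        while (start + size > timestamp) {
            if (start >= 0) {
                starts.add(start);
            }
            start = start - advance;
        }
        long[] result = new long[starts.size()];
        for (int i = 0; i != result.length; i++) {
            // reverse so the earliest window comes first
            result[i] = (Long) starts.get(starts.size() - 1 - i);
        }
        return result;
    }

    public static void main(String[] args) {
        // A 12s timestamp with 5s tumbling windows falls in [10s, 15s).
        System.out.println(tumblingWindowStart(12000L, 5000L));
        // With 10s windows hopping every 5s, it falls in two windows.
        System.out.println(Arrays.toString(hoppingWindowStarts(12000L, 10000L, 5000L)));
    }
}
```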

&lt;p&gt;Sliding window&lt;br&gt;
A sliding window is another time-based window, but its boundaries are determined by the events themselves: when two events fall within a predetermined timeframe of each other, they are included in the same window.&lt;/p&gt;

&lt;p&gt;Session window&lt;br&gt;
A session window is defined by a period of activity separated by a gap of inactivity. This can be used to group events using the event keys. Session window typically has a timeout, which is the maximum duration a session stays open. If there are no events under the key received within this duration, the session is closed and processed. Next time when there is an event under the key, a new session will be opened. Session window differ from the previous types because they aren’t defined by time, but user activity. This can be useful for behavioural analysis like when we want to keep track of a particular user’s activity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time
&lt;/h3&gt;

&lt;p&gt;Time is a very important factor in Kafka Streams. During processing, each event needs to be associated with a timestamp. Remember when we said slicing can be time-based? Which time did we mean? There are three different notions of time in stream processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Event time: the time when an event happened at the source. Usually embedded in the event’s payload.&lt;/li&gt;
&lt;li&gt;Ingestion time: when an event is added to a topic on a broker. This always occurs after event time.&lt;/li&gt;
&lt;li&gt;Processing time: when our Kafka Streams application processes the event. Always happens after event time and ingestion time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. The end of this segment on Kafka Streams.&lt;br&gt;
Up next, connecting Kafka to external systems with Kafka Connect.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Learning Kafka Part Four (I): Producers and Consumers</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Sun, 08 Jan 2023 08:09:17 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-4-i-producers-and-consumers-aa9</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-4-i-producers-and-consumers-aa9</guid>
<description>&lt;p&gt;So far in this series, we’ve talked about what Kafka is, what it does and how it does it. But all of our discussions have been centred on the components that live &lt;em&gt;within&lt;/em&gt; Kafka itself; we are yet to discuss those other components that live outside of Kafka but &lt;em&gt;interact&lt;/em&gt; with it. We’ve caught a glimpse of a few, like producers and consumers. In the upcoming segments, we will be having a closer look at each of these components, from producers and consumers to Kafka Streams and Kafka Connect.&lt;/p&gt;

&lt;p&gt;At the most basic level, any component that interacts with Kafka is either a producer, a consumer or a combination of both. Let’s start with the producer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Producer
&lt;/h2&gt;

&lt;p&gt;A producer is any system or application that writes messages to a Kafka topic. This can be anything, from a car that sends GPS coordinates to a mobile application that sends the details of a transaction.&lt;/p&gt;

&lt;p&gt;Producers can send messages to one or more Kafka topics, also, a topic can receive messages from one or more producers.&lt;/p&gt;

&lt;p&gt;When writing messages to Kafka, we must include the topic to send the message to, and a value. Optionally, we can also specify a key, partition, and a timestamp.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff56crwhgcrkquqrfu8wz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff56crwhgcrkquqrfu8wz.png" alt="Image of kafka producer" width="628" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Batching
&lt;/h3&gt;

&lt;p&gt;By default, producers try to send messages to Kafka as soon as possible. But this means an excessive number of produce requests over the network, which can affect the performance of the whole system. An efficient way to solve this is to collect the messages into batches.&lt;br&gt;
Batching is the process of grouping messages that are being sent to the same topic and partition together, in order to reduce the number of requests being sent over the network.&lt;/p&gt;

&lt;p&gt;There are two configurations that control batching in Kafka: time and size.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time is the number of milliseconds a producer waits before sending a batch of messages to Kafka. The default value is 0, which means messages are sent as soon as they are available. By changing it to 20 ms, for example, we can increase the chances of messages being sent together in a batch.&lt;/li&gt;
&lt;li&gt;We can also configure batching by size, the maximum number of bytes a batch of messages can reach before being sent to Kafka. The default is 16384 bytes. Note that if a batch reaches its maximum size before the configured time elapses, it will be sent immediately.&lt;/li&gt;
&lt;/ul&gt;
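
&lt;p&gt;In producer configuration terms, these two knobs are linger.ms and batch.size. A sketch of the tuning described above (the values are illustrative):&lt;/p&gt;

```properties
# Producer batching settings (values are illustrative)
# Wait up to 20 ms for more messages before sending a batch...
linger.ms=20
# ...but send earlier if a batch reaches 16384 bytes
batch.size=16384
```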

&lt;p&gt;Batching increases throughput significantly but also increases latency.&lt;br&gt;
Sending the same amount of data in fewer requests improves the performance of the system, as there is less CPU overhead required to process a smaller number of requests.&lt;br&gt;
But to batch messages together, the producer must wait a configurable period of time while it groups the messages to send; this wait time is added latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitioning
&lt;/h3&gt;

&lt;p&gt;If we don’t explicitly specify a partition when sending a message to Kafka, a partitioner will choose a partition for us. Usually, this is based on the message key; messages with the same key will be sent to the same partition.&lt;/p&gt;

&lt;p&gt;But if the message has no key, then by default the partitioner will use the sticky partitioning strategy: it picks a partition randomly, sends a batch of messages to it, then repeats the process all over again.&lt;/p&gt;

&lt;p&gt;Previously, before Kafka 2.4, the default partitioning strategy was round-robin: the producer sent messages to partitions serially, without grouping them into batches, starting from partition zero upwards until each partition had received a message, then started over again from partition zero.&lt;/p&gt;
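
&lt;p&gt;The key-based strategy boils down to hashing the key modulo the partition count. Here is a plain-Java sketch; note that Kafka’s real default partitioner uses a murmur2 hash of the serialized key bytes, and String.hashCode() is substituted here purely for illustration.&lt;/p&gt;

```java
// Keyed-partitioning sketch. Kafka's real default partitioner hashes
// the serialized key with murmur2; String.hashCode() stands in here
// for illustration only. The property that matters is the same:
// equal keys always map to the same partition.
public class PartitionerSketch {

    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result in [0, numPartitions) even when
        // the hash code is negative
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int first = partitionFor("customer-42", 6);
        int second = partitionFor("customer-42", 6);
        // Same key, same partition, every time
        System.out.println(first == second);
    }
}
```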

&lt;h3&gt;
  
  
  Serialization
&lt;/h3&gt;

&lt;p&gt;Before the producer sends these messages, they must be converted to bytes in a process known as serialization. The messages stored on Kafka have no meaning to Kafka itself, so they are stored as bytes.&lt;/p&gt;

&lt;p&gt;Kafka comes with out-of-the-box serialization support for primitives (String, Integer, ByteArray, Long etc.), but when the messages we want to send are not of a primitive type (which is most of the time), we have the option of writing a custom serializer for the message or using serialization libraries like Apache Avro, Protocol Buffers (Protobuf) or Apache Thrift.&lt;/p&gt;

&lt;p&gt;It is highly recommended to use serialization libraries, as custom serialization means writing a serializer for every different type of message we want to send, which does not handle future changes gracefully.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consumer
&lt;/h2&gt;

&lt;p&gt;A consumer is any application that reads messages from Kafka. A Consumer subscribes to a topic or a list of topics, and receives any messages written to the topic(s). &lt;/p&gt;

&lt;p&gt;When a single consumer is reading from a topic, it receives messages from all the partitions in the topic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmv9i0dtgvmyw2egsexc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsmv9i0dtgvmyw2egsexc.png" alt="Image of consumer" width="607" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from reading messages from Kafka rather than writing them, a consumer is similar to a producer, except for an important distinction: the consumer group.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consumer Group
&lt;/h3&gt;

&lt;p&gt;A consumer generally, but not necessarily, operates as part of a consumer group. A consumer group is a group of consumers that subscribe to the same topic and share the same group ID. A consumer group can contain anywhere from one to many consumers.&lt;/p&gt;

&lt;p&gt;When consuming messages, consumers will always try to balance the workload evenly amongst themselves. If the number of consumers in the group is the same as the number of partitions in the topic, each consumer is assigned a partition to consume from, thus increasing throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faur8oo9xmx0ka00almjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faur8oo9xmx0ka00almjl.png" alt="Image of consumer group" width="607" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if the number of consumers in a consumer group is greater than the number of partitions, the consumers left without a partition will sit idle.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7ywrt1hin3i2spmkf4i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7ywrt1hin3i2spmkf4i.png" alt="Image of an idle consumer" width="729" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consumer groups ensure high availability and fault tolerance for applications consuming messages. For example, if a consumer in a group goes offline, its partition is reassigned to an available consumer in the group, which continues consuming messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoz2s42vk0us59lzew2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffoz2s42vk0us59lzew2w.png" alt="image of an offline consumer" width="632" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This design also allows consumers to scale: if a consumer is lagging behind a producer, we can simply add another consumer instance to the group. Consumer groups also let different applications read the same messages from a topic; we just create a consumer group with a unique group ID for each application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r5xs591unqrj6bbhjqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r5xs591unqrj6bbhjqg.png" alt="image of two consumer groups consuming from a single topic" width="635" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is also important to note that Kafka guarantees message ordering within a partition, but not across partitions. That is, if a consumer is consuming from a single partition, messages will always be read in the order they were written, First-In-First-Out (FIFO). But when consuming from multiple partitions, messages from different partitions may be interleaved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deserialization
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, producers serialize messages before sending them to Kafka, since Kafka only works with bytes. Consumers therefore need to convert these bytes back into their original format. This process is called deserialization.&lt;/p&gt;
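&lt;p&gt;A tiny illustration in Python: the consumer receives raw bytes and must apply the same format the producer used (JSON here, as an assumed example).&lt;/p&gt;

```python
import json

# The producer wrote JSON-encoded bytes (an assumed format for this
# example); the consumer must apply the matching deserializer.
raw = b'{"sensor": "thermostat-1", "temperature": 40}'

reading = json.loads(raw.decode("utf-8"))
print(reading["temperature"])  # 40
```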

&lt;p&gt;And with that, the end of this segment.&lt;br&gt;
Coming up, the Kafka Streams Library.&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Learning Kafka Part Three: Kafka’s Architectural Design</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Tue, 03 Jan 2023 00:38:04 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-three-kafkas-architectural-design-5ij</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-three-kafkas-architectural-design-5ij</guid>
      <description>&lt;p&gt;Hello and welcome to the third installment of learning Kafka.&lt;/p&gt;

&lt;p&gt;In the preceding sections, we’ve discussed what Kafka is, its features and its core components. In this section we will be looking at how Kafka actually works. How are messages sent to and retrieved from Kafka? How are messages stored? How does Kafka ensure durability, availability and throughput? These questions can be answered by looking at some of Kafka’s architectural designs:&lt;br&gt;
Publish/Subscribe, Commit Log, Replication and Bytes-Only System.&lt;/p&gt;

&lt;h2&gt;
  
  
  Publish/Subscribe
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;br&gt;
Imagine we have two applications, App1 and App2, and we want them to exchange data. The logical way to do this is to have them send the data directly to each other.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CvgWmPqv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6t9hnckqb8mq8jgyz15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CvgWmPqv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/g6t9hnckqb8mq8jgyz15.png" alt="Two application sending data to each other directly" width="211" height="61"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens if we add more applications to this system, all of them sharing data with each other? Continuing with this logic, we end up with something like this:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BpDWuYqB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9tp2i9113cc28qa8h0bz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BpDWuYqB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9tp2i9113cc28qa8h0bz.png" alt="Image of many applications sending data to each other directly" width="211" height="201"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is not ideal: the applications are tightly coupled, and if a single application fails, the whole system unravels. We can solve this problem by introducing a system that acts as a buffer between these applications; each application sends its data into the system and pulls the data it needs from the system.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vpUue-cq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p3gbpm879fk7w65asrxa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vpUue-cq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p3gbpm879fk7w65asrxa.png" alt="Image of a publish subscribe system" width="302" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Publish/Subscribe, commonly called Pub/Sub, is a messaging pattern where senders of messages, called publishers, do not send messages directly to receivers, called subscribers. Instead, publishers categorize messages without any knowledge of which subscribers, if any, will receive them. Likewise, subscribers express interest in one or more categories of messages and only receive messages that are of interest to them. Publishers and subscribers are completely decoupled from, and agnostic of, each other.&lt;/p&gt;

&lt;p&gt;Kafka is designed after such a system. Publishers (called producers) send messages to Kafka, categorized into topics; subscribers (called consumers) subscribe to one or more topics and receive messages as soon as they become available. (More on producers and consumers later in the series.)&lt;br&gt;
This design allows Kafka to scale well, as we can add as many producers and consumers as we want without changing or adding complexity to the overall system design.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xlg5wzAj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2l02rzmjityh1755719.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xlg5wzAj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h2l02rzmjityh1755719.png" alt="Image of producers and consumers in kafka" width="713" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Commit log
&lt;/h2&gt;

&lt;p&gt;At the heart of Kafka is the commit log. A commit log is a time-ordered, immutable, append-only data structure. (Anything that data can be stored in and retrieved from is a data structure.) Remember partitions? Well, each partition is actually a log. Do we want to add a message? No problem, it will get appended at the &lt;em&gt;end&lt;/em&gt; of the log. Want to read a message? Still no problem, start from the &lt;em&gt;beginning&lt;/em&gt; of the log. Want to modify the message? Sorry, can’t do that; once written, messages are unchangeable. This makes sense, because once an event has happened in real life, we can’t undo it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SKRyYIV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2il7t4ajptk1t57krds7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SKRyYIV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2il7t4ajptk1t57krds7.png" alt="Image of commit log in kafka" width="866" height="216"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once these messages have been written to the log, they are persisted to local storage on the broker. This ensures durability, because data is retained even after the broker shuts down. The messages also remain available for other consumers to consume; this is different from traditional messaging systems, where messages are consumed once and then deleted.&lt;/p&gt;

&lt;p&gt;Kafka has a configurable retention policy for messages in a log. Messages are retained for a particular period of time, after which they are deleted from storage to free up space. By default, messages are retained for seven days, but we can configure this to suit our needs. For example, if we set the retention period to thirty days instead, messages will be deleted thirty days from the day they were published, whether or not they have been consumed. We also have the option of configuring retention by size, where the oldest messages in a log are deleted once it reaches a certain size, e.g. 1 GB.&lt;/p&gt;
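&lt;p&gt;As a rough illustration, time- and size-based retention are controlled by broker settings along these lines (the values shown are examples; topic-level overrides such as &lt;code&gt;retention.ms&lt;/code&gt; and &lt;code&gt;retention.bytes&lt;/code&gt; also exist):&lt;/p&gt;

```properties
# server.properties (broker-level defaults; values here are examples)

# Keep messages for seven days (168 hours is the default).
log.retention.hours=168

# Alternatively, cap each partition at roughly 1 GB.
log.retention.bytes=1073741824
```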

&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;Replication is a fundamental design of Kafka that allows it to provide durability and high availability.&lt;br&gt;
Remember, Kafka is a distributed system with multiple brokers; messages in brokers are organized into topics, and each topic is further partitioned. Replication is the process of copying these partitions to other brokers in the cluster, so each partition can have multiple replicas. This avoids a single point of failure and ensures the data remains available if a particular broker fails.&lt;/p&gt;

&lt;p&gt;Kafka adopts a leader-follower model: each partition has a single replica assigned as the leader, and all produced messages go through the leader.&lt;/p&gt;

&lt;p&gt;Follower replicas do not serve client requests unless configured otherwise; their main purpose is to copy messages written to the lead replica and stay up to date with it. Followers that are up to date with the leader are referred to as in-sync replicas.&lt;/p&gt;

&lt;p&gt;In a balanced cluster, the lead replicas of the partitions are spread across different brokers. When a lead replica goes offline, only in-sync replicas are eligible to become the new leader; this avoids loss of data, as an out-of-sync replica does not have the complete data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9Ww416Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z3sl0i29qrpkb2ng0ln3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9Ww416Br--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z3sl0i29qrpkb2ng0ln3.png" alt="Image replication in kafka" width="880" height="490"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;From the above image, the replica for Topic B Partition 2 on Broker 3 is out of sync with the lead replica on Broker 2, as such will not be eligible for leadership if the lead replica goes offline.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We set the replication factor at the topic level. If we create a topic with three partitions and a replication factor of three, there will be three copies of each partition, for a total of nine partition replicas (3 partitions x 3 replicas). Note that the replication factor for a topic cannot exceed the number of brokers in the cluster.&lt;br&gt;
In production, the replication factor should be between two and four; the recommended value is three.&lt;/p&gt;
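&lt;p&gt;A small sketch of the arithmetic and of leader spreading (a simplification; Kafka’s actual replica placement is also rack-aware):&lt;/p&gt;

```python
def place_replicas(num_partitions, replication_factor, brokers):
    # Stagger each partition's starting broker so that leaders (the
    # first replica in each list) are spread across the cluster.
    # Simplified: Kafka's real placement is also rack-aware.
    placement = {}
    for p in range(num_partitions):
        placement[p] = [brokers[(p + r) % len(brokers)]
                        for r in range(replication_factor)]
    return placement

layout = place_replicas(3, 3, [1, 2, 3])
print(layout)  # {0: [1, 2, 3], 1: [2, 3, 1], 2: [3, 1, 2]}
# Nine replicas in total, with a leader on each of the three brokers.
```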

&lt;h2&gt;
  
  
  Bytes Only System
&lt;/h2&gt;

&lt;p&gt;Kafka only works with bytes. Kafka receives messages as bytes, stores them as bytes, and, when responding to a fetch request, returns messages as bytes. Data stored in Kafka therefore has no specific format or meaning to Kafka itself.&lt;/p&gt;

&lt;p&gt;There are a couple of advantages to this: bytes occupy less storage space, have faster input/output, and let us send any type of data, from simple text to mp3 audio and videos. But it also adds another layer of complexity, as we now have to convert messages to and from bytes, in processes known as serialization and deserialization.&lt;/p&gt;
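&lt;p&gt;A minimal round trip, assuming JSON as the chosen format:&lt;/p&gt;

```python
import json

event = {"user": "ibrahim", "action": "click"}

# Producer side: serialize the event to bytes before sending.
payload = json.dumps(event).encode("utf-8")
print(isinstance(payload, bytes))  # True: bytes are all Kafka ever sees

# Consumer side: deserialize the bytes back into the original structure.
assert json.loads(payload.decode("utf-8")) == event
```

&lt;p&gt;The format is a contract between producer and consumer; Kafka itself never inspects the payload.&lt;/p&gt;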

&lt;p&gt;This is yet again the end of another segment.&lt;br&gt;
Next, a look at those components that make up Kafka’s ecosystem.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Learning Kafka Part Two: Core Components of Kafka</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Sat, 31 Dec 2022 02:50:38 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-two-core-components-of-kafka-2nen</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-two-core-components-of-kafka-2nen</guid>
      <description>&lt;p&gt;Welcome to the second part of Learning Kafka.&lt;/p&gt;

&lt;p&gt;Previously, we were introduced to Kafka and briefly touched on its origins and some of its features. We also learned some vocabulary that will make our journey much easier, like distributed systems, nodes, durability, scalability etc.&lt;/p&gt;

&lt;p&gt;This time, a look at the core components of Kafka, that is to say those components that reside &lt;em&gt;within&lt;/em&gt; Kafka itself as opposed to any component that &lt;em&gt;interacts&lt;/em&gt; with it.&lt;br&gt;
By the end of this section, we will have a better understanding of brokers, topics, partitions, messages, and offsets. We will also touch briefly on Zookeeper and KRaft.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster/Brokers
&lt;/h2&gt;

&lt;p&gt;A cluster is a group of systems working together to achieve a shared goal, and each system in a cluster is called a server or a node. Likewise, a Kafka cluster is a system that consists of several nodes running Kafka. A single node is referred to as a broker.&lt;br&gt;
A broker is responsible for hosting topics and partitions (more on topics and partitions later) and for writing messages to storage. Each broker must have a unique identifier. Amongst these brokers, one is elected as the controller, while the others are designated as followers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M4pLT8p2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zk56jjing3e0wdr29lzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M4pLT8p2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zk56jjing3e0wdr29lzg.png" alt="A Kafka Cluster with three brokers" width="422" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the usual broker responsibilities, the controller is also responsible for managing partitions, storing events and general administrative tasks. Technically, all the brokers in a cluster are potential controllers, but a cluster can only have one active controller at any given time.&lt;br&gt;
One of the main functions of Zookeeper in Kafka is to handle controller election.&lt;/p&gt;

&lt;h2&gt;
  
  
  Topics/Partitions
&lt;/h2&gt;

&lt;p&gt;In Kafka, a topic is a logical grouping of events (also called messages). We can think of a topic as a folder in a filesystem. Each topic name must be unique.&lt;/p&gt;

&lt;p&gt;Topics are further divided into partitions. When we write messages to Kafka, we are actually writing them to partitions. The number of partitions for a topic is set when creating the topic; it can be increased later, but it cannot be decreased. A topic can have as many partitions as needed. In a cluster, partitions of the same topic are distributed across multiple brokers, which ensures scalability and high throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xKktflF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymwkpgvlnevy15z5op2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xKktflF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ymwkpgvlnevy15z5op2c.png" alt="A broker topics and partitions" width="385" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Messages/offsets
&lt;/h2&gt;

&lt;p&gt;Remember events? Well, in Kafka, events are referred to as messages. Messages are similar to files in a folder or rows in a table, with the folder and table comparable to a Kafka topic. A message contains a key (optional), a value, and a timestamp. An example of a message:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Key: Thermostat 1 (optional)&lt;br&gt;
Value: Temperature reading 40 °C&lt;br&gt;
Timestamp: 2022-12-24 at 01:48 a.m.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sending messages to Kafka individually would result in excessive overhead, so for efficiency, messages are written to Kafka in batches. A batch is a group of messages being sent to the same topic and partition.&lt;/p&gt;
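&lt;p&gt;Conceptually, batching amounts to grouping messages by their destination, something like this simplified sketch (not the producer’s actual buffering code):&lt;/p&gt;

```python
from collections import defaultdict

def batch_messages(messages):
    # Group messages headed for the same (topic, partition) into one
    # batch, the way a producer buffers sends (simplified sketch).
    batches = defaultdict(list)
    for topic, partition, value in messages:
        batches[(topic, partition)].append(value)
    return dict(batches)

msgs = [
    ("readings", 0, "40C"),
    ("readings", 1, "38C"),
    ("readings", 0, "41C"),
]
print(batch_messages(msgs))
# {('readings', 0): ['40C', '41C'], ('readings', 1): ['38C']}
```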

&lt;p&gt;When writing messages to Kafka, we can specify which partition to send a message to. If no partition is specified, messages with the same key are sent to the same partition; for example, all messages from Thermostat 1 will end up in the same partition. If the partition is not specified and the message has no key, then the application sending the message (called a producer) uses the sticky partitioning strategy: the producer picks a partition, sends a batch of messages to it, then picks another partition for the next batch, and so on. That way, over time, messages are evenly distributed among all partitions.&lt;/p&gt;
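&lt;p&gt;The key-to-partition mapping can be sketched as a hash modulo the partition count. Kafka’s default partitioner uses the murmur2 hash; CRC32 below is only a stand-in so the sketch needs nothing outside the standard library:&lt;/p&gt;

```python
import zlib

def partition_for(key, num_partitions):
    # Hash the key, then take it modulo the partition count. Kafka's
    # default partitioner uses murmur2; CRC32 is a stand-in here.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# The same key always maps to the same partition.
assert partition_for("thermostat-1", 6) == partition_for("thermostat-1", 6)
```

&lt;p&gt;This is why increasing the partition count later changes where keys land, which matters for applications that rely on per-key ordering.&lt;/p&gt;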

&lt;p&gt;An offset is a unique integer identifier assigned to each message in a partition. A Kafka broker assigns offsets sequentially to every message written to it. Offsets are also used by applications reading messages from Kafka (called consumers) to keep track of the messages they have consumed, which prevents them from consuming the same message twice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---omO4CWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix4vhot90opevd566c3s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---omO4CWP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ix4vhot90opevd566c3s.png" alt="A kafka broker with topics and partitions, each partition shows the number of messages in it and its position" width="385" height="576"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even though Zookeeper does not reside within a Kafka cluster, KRaft will. So, to understand what KRaft is and why we need it, we first must know the role Zookeeper plays in Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Apache Zookeeper
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;“Zookeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;So, in English, Zookeeper is a distributed system (yes, another one) that is used to manage and coordinate other distributed systems.&lt;br&gt;
Kafka needs Zookeeper to run; in fact, Kafka will not start without Zookeeper already running.&lt;br&gt;
In Kafka, Zookeeper is responsible for metadata about the Kafka cluster, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cluster Membership: which brokers belong to the cluster and whether they are available.&lt;/li&gt;
&lt;li&gt;Access Control: which client applications have read and/or write permission to the Kafka cluster.&lt;/li&gt;
&lt;li&gt;Topics Configuration: list of existing topics and the number of partitions for each topic.&lt;/li&gt;
&lt;li&gt;Controller Election: keeps track of which broker is currently the controller and handles reelection when the controller shuts down.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  KRaft
&lt;/h2&gt;

&lt;p&gt;Even though Zookeeper comes bundled with Kafka, it is a full-fledged Apache project in its own right. In production, we are most likely going to run a Zookeeper cluster (called an ensemble) separately from our Kafka cluster. Zookeeper is lightweight, fast and easy to set up, but using it with Kafka comes with a few limitations.&lt;/p&gt;

&lt;p&gt;Running Zookeeper alongside Kafka adds another layer of complexity for tuning, maintenance and monitoring. Instead of maintaining and monitoring a single system, we now have to monitor both our Kafka cluster and our Zookeeper ensemble. Also, as the cluster grows, the whole system can become cumbersome, and there is a noticeable lag in performance, especially when brokers in the cluster are restarting.&lt;/p&gt;

&lt;p&gt;Kafka Raft, or KRaft for short, aims to replace Apache Zookeeper with Kafka topics and the Raft consensus algorithm, making Kafka self-managed. (Read more on Raft &lt;a href="https://en.wikipedia.org/wiki/Raft_(algorithm)"&gt;here&lt;/a&gt;.) The cluster metadata is now stored inside Kafka itself, in a topic. This makes metadata operations faster and more scalable, as the cluster no longer needs to communicate with an external system (i.e., Zookeeper).&lt;br&gt;
KRaft has been available for testing since Kafka version 2.8 (released April 2021). In version 3.3, it was marked production ready, and Zookeeper will be removed entirely as a dependency in Kafka 4.0.&lt;/p&gt;
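&lt;p&gt;For reference, a KRaft-mode node is configured with properties along these lines (illustrative values):&lt;/p&gt;

```properties
# server.properties for a KRaft-mode node (illustrative values)

# This node acts as both a broker and a quorum controller.
process.roles=broker,controller

# Every node needs a unique id.
node.id=1

# The controller quorum: id@host:port for each voter.
controller.quorum.voters=1@localhost:9093
```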

&lt;p&gt;This brings us to the end of this part. We have discussed the Kafka cluster, brokers, topics, partitions, messages and offsets, along with a brief overview of Apache Zookeeper and KRaft.&lt;br&gt;
Next, Kafka’s architectural design.&lt;/p&gt;
Next, Kafka’s architectural design.&lt;/p&gt;

</description>
      <category>apache</category>
      <category>kafka</category>
      <category>streaming</category>
      <category>datapipeline</category>
    </item>
    <item>
      <title>Learning Kafka Part One: What is Kafka?</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Thu, 29 Dec 2022 00:07:53 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-part-one-what-is-kafka-2da6</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-part-one-what-is-kafka-2da6</guid>
      <description>&lt;p&gt;Welcome to the first installment of Learning Kafka series. It’s time to meet Kafka proper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Exactly is Kafka?
&lt;/h2&gt;

&lt;p&gt;Kafka’s documentation describes it &lt;em&gt;“as an open-source Distributed Event Streaming Platform used by thousands of companies for high-performance data pipelines, data integration and mission-critical applications”.&lt;/em&gt;&lt;br&gt;
But Kafka can really be captured in three words: distributed, event and streaming platform.&lt;br&gt;
Let’s take a closer look at each of these words.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distributed
&lt;/h3&gt;

&lt;p&gt;A distributed system is nothing but a group (two or more) of systems or computers working together in parallel as a single, logical unit; they appear as one unit to the end user. A system in this context can be anything from a laptop, a desktop or a server to a compute instance in the cloud.&lt;br&gt;
For example, a traditional database runs on a single instance; whenever we want to query the database, we send requests directly to that instance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl84owfrh59zih0gt7yku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl84owfrh59zih0gt7yku.png" alt="Image of a single database instance" width="635" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A distributed version of this system will have the same database running on multiple instances at the same time. We would be able to talk to any of these instances and not be able to tell the difference. For example, if we inserted a record into instance 1, then instance 3 must be able to return the record.&lt;br&gt;
A group of these instances is collectively known as a cluster while a single instance in the cluster is called a node or server.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v18gvk99tl9inz7i105.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v18gvk99tl9inz7i105.png" alt="Image of a multiple database instances" width="622" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Kafka works in a distributed fashion. Although it’s possible to run Kafka on a single node, doing so means losing out on all the things that make Kafka……… Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  Event
&lt;/h3&gt;

&lt;p&gt;An event is…… just an event. Okay, that’s not helpful, but an event is really just an event. Sometimes (most times?) ordinary English words can mean a totally different thing in computing, but that’s not the case here. According to the Oxford dictionary, &lt;em&gt;“An event is a thing that happens, especially something important”.&lt;/em&gt; An event in Kafka means just that. A user clicks on a particular link? An event. A traffic light changes from red to green? An event. An administrator logs into a computer? An event. Someone tweets a tweet? (Okay, don’t know if that is correct.) An event. We now hopefully get the idea of what an event is. An event is an event. Moving on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming platform
&lt;/h3&gt;

&lt;p&gt;Before discussing streaming platforms, let’s first understand what streaming is. Streaming is the unending, continuous generation of data. This data can come from different sources, be of diverse types and arrive in different formats, and it is generated by both humans and machines.&lt;br&gt;
A streaming platform is a system that helps in the gathering and movement of streaming data.&lt;br&gt;
From the above explanations, we can say that Kafka is a group of systems (working together as one) that facilitates the movement of streaming data (called events) from source systems to target systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Origin
&lt;/h2&gt;

&lt;p&gt;Developed at LinkedIn in 2010 by a team that included Jay Kreps, Jun Rao and Neha Narkhede, Kafka was originally used to track LinkedIn users’ activities in real time. It was open-sourced and donated to the Apache Software Foundation in 2011, and graduated to a full Apache project in 2012.&lt;br&gt;
Kafka is written in Java and Scala and was named after the author Franz Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features
&lt;/h2&gt;

&lt;p&gt;Kafka is known for being durable, scalable and fault tolerant. Coupled with its high throughput and high availability, Kafka has become the most popular choice for event-driven systems. A quick refresher:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Durability is the ability of a system to retain and not lose data permanently.&lt;/li&gt;
&lt;li&gt;Scalability is the ability of a system to grow and manage increased demands.&lt;/li&gt;
&lt;li&gt;Fault tolerance is the ability of a system to continue operating without interruption when one or more of its components fail.&lt;/li&gt;
&lt;li&gt;Throughput is the measure of how many units of work, information or request a system can handle in a given amount of time.&lt;/li&gt;
&lt;li&gt;Availability is the percentage of time that a system remains operational under normal circumstances.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use cases
&lt;/h2&gt;

&lt;p&gt;Kafka’s original use case was tracking users’ activities like page views, searches and other actions users may take, but its success has seen it evolve to other uses, for example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Data Pipelines&lt;/strong&gt;&lt;br&gt;
One of the most popular use cases for Kafka is building streaming data pipelines. Where data is continuously being moved from source to destination in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Messaging system&lt;/strong&gt;&lt;br&gt;
A messaging system enables applications to share data with each other.&lt;br&gt;
Because of its design architecture (which we will cover in part three), Kafka can also be used as a replacement for traditional messaging systems like ActiveMQ and RabbitMQ.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stream Processing&lt;/strong&gt;&lt;br&gt;
Instead of just storing and moving streams of data, Kafka can also be used to process, transform and enrich this data in real time with Kafka Streams.&lt;/p&gt;

&lt;p&gt;And that’s the end of this section. We have discussed what Kafka is, its origin and its use cases; we also touched briefly on distributed systems, streaming and events, with quick explanations of words like throughput, availability, fault tolerance, durability and scalability. Now that we are better equipped, let’s dive even deeper into Kafka.&lt;br&gt;
Up next, the core components of Kafka.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>learning</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Learning Kafka: Introduction</title>
      <dc:creator>Muhammad Ibrahim Anis</dc:creator>
      <pubDate>Tue, 27 Dec 2022 02:54:36 +0000</pubDate>
      <link>https://forem.com/ibrahim_anis/learning-kafka-introduction-5325</link>
      <guid>https://forem.com/ibrahim_anis/learning-kafka-introduction-5325</guid>
      <description>&lt;p&gt;Hello there and welcome to this series on Apache Kafka called Learning Kafka (I know, not inventive). In this series, we, (that is me and you) will embark on an adventure, to a faraway kingdom, where we’ll meet the protagonist of this series, called Kafka. Kafka was born an orphan on the street of ……… okay, that’s enough.&lt;/p&gt;

&lt;p&gt;Now, learning Kafka, though not &lt;em&gt;hard&lt;/em&gt;, can be quite complicated. It is definitely complicated for those of us new to data engineering or systems design, or who’ve never worked with a distributed system before. It seems like there is an endless &lt;em&gt;stream&lt;/em&gt; (read that again) of terminology to learn: streams, producers, consumers, brokers, topics, partitions, offsets, replication, connect, clusters, serialization, deserialization, distributed, throughput, latency, and on and on and on. Don’t run away yet; if these words sound like something someone would include in a master’s thesis, hopefully, by the end of this series, you’ll leave with your own master’s certificate.&lt;/p&gt;

&lt;p&gt;For those who are familiar with other big data frameworks like Hadoop, Spark, Storm or any other distributed framework, some of these concepts will be familiar or easy to pick up. But for those of us fortunate (or unfortunate) enough to learn Kafka as our first distributed and/or big data framework, it seems like we are not just learning Kafka but a whole new ecosystem. Which, to a point, is true: you can’t truly understand Kafka without knowing how distributed systems work, or what Pub/Sub is.&lt;/p&gt;

&lt;p&gt;By the end of this series, hopefully you’ll be familiar not only with these terms but also with how they relate to Kafka.&lt;/p&gt;

&lt;p&gt;One of the reasons learning Kafka is rather daunting is that we lack a somewhat detailed view of it. On the surface, Kafka is a system for building real time data pipelines, good and fine. But when we try to build the promised data pipeline, things get complicated really quickly. &lt;/p&gt;

&lt;p&gt;Also, the loosely coupled architecture of Kafka, one of the reasons it is so successful, is also one of the reasons it can be hard to grasp. To understand Kafka, we must first understand &lt;em&gt;all&lt;/em&gt; of its components &lt;em&gt;independently&lt;/em&gt;, then figure out &lt;em&gt;how&lt;/em&gt; they relate to each other.&lt;/p&gt;

&lt;p&gt;And that is what we will be doing in this series. We’ll take a step back and study, in depth, Kafka’s design, architecture and components, and how they all fit and work together.&lt;/p&gt;

&lt;p&gt;This series will be divided into six parts; in part one, we will have a look at what Kafka is, its origin, use cases and features. In part two, an introduction to the core components of Kafka, like brokers, topics, partitions etc. Moving on in part three, a look at the design of Kafka. Part four will be further divided into three segments, with each segment focusing on a single component in Kafka’s ecosystem; Producers and Consumers, Kafka Streams and Kafka Connect, in that order. Part five will take a look at how these components interact with Kafka using APIs and Client Libraries. Finally, rounding up in part six, a look at other third party and community applications that can be integrated with Kafka.&lt;/p&gt;

&lt;p&gt;Also, in this series there will be no hands-on or coding walkthroughs, no how-to-do-anything; the objective of this series is to get to know Apache Kafka proper.&lt;br&gt;
This series, despite best attempts, can in no way do justice to Kafka, because Kafka is deep. For further, more detailed and in-depth reading, you can’t go wrong with either of these books:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Kafka, The Definitive Guide by Gwen Shapira, Todd Palino, Rajini Sivaram, Krit Petty &lt;a href="https://www.oreilly.com/library/view/kafka-the-definitive/9781492043072/"&gt;Link here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Effective Kafka by Emil Koutanov &lt;a href="https://www.amazon.com/Effective-Kafka-Hands-Event-Driven-Applications/dp/B0863R7MKG"&gt;Link here&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you’ll enjoy consuming this series. I certainly enjoy producing it.&lt;/p&gt;

&lt;p&gt;Coming up, an introduction to our protagonist. Apache Kafka.&lt;/p&gt;

</description>
      <category>apache</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
