<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Olena Kutsenko</title>
    <description>The latest articles on Forem by Olena Kutsenko (@olena_kutsenko).</description>
    <link>https://forem.com/olena_kutsenko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F601720%2F9dfb66ac-aa96-4905-aa6d-24272c4ae85b.png</url>
      <title>Forem: Olena Kutsenko</title>
      <link>https://forem.com/olena_kutsenko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/olena_kutsenko"/>
    <language>en</language>
    <item>
      <title>Using Apache Kafka® and OpenSearch® to explore Mastodon data</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Tue, 21 Mar 2023 17:14:17 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/using-apache-kafkar-and-opensearchr-to-explore-mastodon-data-j1j</link>
      <guid>https://forem.com/olena_kutsenko/using-apache-kafkar-and-opensearchr-to-explore-mastodon-data-j1j</guid>
      <description>&lt;p&gt;In an earlier blog post we talked about &lt;a href="https://aiven.io/blog/mastodon-kafka-js?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;how to stream data from the Mastodon timeline into an Apache Kafka® cluster&lt;/a&gt;. If you missed it, read it to learn how to connect to Mastodon and stream data into an Apache Kafka topic. This article assumes that you already have an Apache Kafka environment running with data in a topic that is called &lt;strong&gt;Mastodon&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Having a large volume of data coming non-stop from a source (the Mastodon timeline, in our case!) brings a new challenge: how do you make sense of all this data? For instance, you may want to observe trends, search for particular entries, or filter and aggregate data to understand it.&lt;/p&gt;

&lt;p&gt;The biggest advantage of bringing data into an Apache Kafka topic is that Apache Kafka provides a convenient mechanism to plug in other applications and reuse the data for use cases such as analytics, visualisations, metrics or long-term storage. All of this can be achieved with Apache Kafka® Connect with almost no code involved.&lt;/p&gt;

&lt;p&gt;When talking about tools for search, aggregations and visualisations, a great place to start is OpenSearch®. OpenSearch is an open-source search and analytics suite with a powerful visual interface for working with data. It is straightforward to set up and start using, so why not let OpenSearch analyse the data coming from Mastodon?&lt;/p&gt;

&lt;p&gt;To give you a visual picture, below is the architectural diagram connecting all the building blocks used in the previous article and the ones we will add in the current one:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FmAvDt-_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jybrbaqu5m8i1jivz3cc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FmAvDt-_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jybrbaqu5m8i1jivz3cc.png" alt="A diagram that visualises two steps of data transport. The first step (described in detail in the previous article) is bringing data from a Mastodon public feed into an Apache Kafka topic using the Mastodon API, masto.js and node-rdkafka. The second step (described in this article) is using an OpenSearch sink connector into OpenSearch." width="880" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post you will learn how to reuse the data you have in Apache Kafka with OpenSearch for visualisations and aggregations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Set up OpenSearch
&lt;/h2&gt;

&lt;p&gt;Both Apache Kafka and OpenSearch are available on the Aiven platform, so not only can you run them side by side, but you can also use a managed OpenSearch sink connector to connect the two services.&lt;/p&gt;

&lt;p&gt;Create an &lt;a href="https://aiven.io/opensearch?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven for OpenSearch&lt;/a&gt; service from &lt;a href="https://console.aiven.io/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven Console&lt;/a&gt; (read more about &lt;a href="https://docs.aiven.io/docs/products/opensearch.html"&gt;OpenSearch in Aiven docs&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;Once your service is created, make a note of the connection settings; you'll need them in the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Kafka Connect to bring Apache Kafka and OpenSearch together
&lt;/h2&gt;

&lt;p&gt;The easiest way to connect Apache Kafka with other tools is to use one of the already available connectors. Conveniently, there is &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect/howto/opensearch-sink.html"&gt;an open-source sink connector for OpenSearch&lt;/a&gt; that you can use out of the box. &lt;/p&gt;

&lt;p&gt;To add connectors to the running Apache Kafka cluster, &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect/howto/enable-connect"&gt;enable &lt;strong&gt;Apache Kafka Connect&lt;/strong&gt;&lt;/a&gt; in your Aiven for Apache Kafka service or &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect/howto/best-practices#consider-a-standalone-apache-kafka-connect-service"&gt;consider using a standalone Apache Kafka Connect service&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Navigate to the &lt;strong&gt;Connectors&lt;/strong&gt; tab to create a new connector. In the long list of available options, select &lt;strong&gt;OpenSearch sink&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sHYtMyj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3d3wsvvy6zm4ycunkkz9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sHYtMyj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3d3wsvvy6zm4ycunkkz9.png" alt="Select OpenSearch sink from the list of connectors" width="880" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the configuration page you can either enter the properties manually, or speed things up by using a JSON object for the connector configuration. Open the JSON editor by clicking on the pencil icon next to the connector configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jZbBWu90--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzp1bvzt0t6lx5mbn97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jZbBWu90--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6yzp1bvzt0t6lx5mbn97.png" alt="Screenshot showing location of the button to edit connector configuration" width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below is an example of the configuration properties needed for the connection. Replace &lt;code&gt;YOUR_OPENSEARCH_HOST&lt;/code&gt;, &lt;code&gt;PORT&lt;/code&gt;, &lt;code&gt;YOUR_SERVICE_USER&lt;/code&gt; and &lt;code&gt;YOUR_SERVICE_PASSWORD&lt;/code&gt; with values taken from your OpenSearch service. These are the connection properties you saw when you created the Aiven for OpenSearch service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sink_mastodon_json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connection.url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://YOUR_OPENSEARCH_HOST:PORT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"key.ignore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.aiven.kafka.connect.opensearch.OpensearchSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connection.username"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_SERVICE_USER"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"schema.ignore"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"true"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"connection.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"YOUR_SERVICE_PASSWORD"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"key.converter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.kafka.connect.storage.StringConverter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value.converter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"org.apache.kafka.connect.json.JsonConverter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mastodon"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"value.converter.schemas.enable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to know more about the available options, check &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect/howto/opensearch-sink.html"&gt;the documentation for the OpenSearch sink Kafka connector&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once you have filled in the values, copy this configuration into the Apache Kafka connector configuration and press &lt;strong&gt;Apply&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Click to create the connector and wait for the status to change to &lt;strong&gt;RUNNING&lt;/strong&gt;; at this point the data is flowing to OpenSearch. If there are any issues with the connection, you'll see an error message, and more information will be available in the logs.&lt;/p&gt;
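&lt;p&gt;If your Apache Kafka Connect service's REST endpoint is reachable, you can also check on the connector from the terminal using the standard Kafka Connect REST API. This is just a sketch: the host, port and credentials are placeholders, and the connector name in the URL must match the &lt;code&gt;name&lt;/code&gt; field from your configuration.&lt;/p&gt;

```shell
# Query the standard Kafka Connect REST API for the connector's status.
# Replace the host, port and credentials with your Kafka Connect service
# values, and the connector name with the "name" from your configuration.
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  "https://YOUR_KAFKA_CONNECT_HOST:PORT/connectors/YOUR_CONNECTOR_NAME/status"
```

&lt;p&gt;A healthy connector reports &lt;code&gt;"state": "RUNNING"&lt;/code&gt; for both the connector itself and each of its tasks.&lt;/p&gt;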

&lt;p&gt;Now the data from your Apache Kafka topic is sinking into the OpenSearch index. The default name of this newly-created index in OpenSearch is the same as the Kafka topic name.&lt;/p&gt;
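&lt;p&gt;You can confirm this from the terminal with the OpenSearch REST API, using the same connection settings as before (the placeholders are the same as in the connector configuration):&lt;/p&gt;

```shell
# List the index the connector created (named after the Kafka topic)...
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  "https://YOUR_OPENSEARCH_HOST:PORT/_cat/indices/mastodon?v"

# ...and count the documents indexed so far.
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  "https://YOUR_OPENSEARCH_HOST:PORT/mastodon/_count"
```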

&lt;p&gt;Time to look at the data in OpenSearch with the help of &lt;a href="https://opensearch.org/docs/latest/dashboards/index/"&gt;OpenSearch Dashboards&lt;/a&gt;!&lt;/p&gt;

&lt;h2&gt;
  
  
  Log in to OpenSearch Dashboards
&lt;/h2&gt;

&lt;p&gt;To see the data in OpenSearch, open OpenSearch Dashboards using the &lt;strong&gt;Host&lt;/strong&gt;, &lt;strong&gt;User&lt;/strong&gt; and &lt;strong&gt;Password&lt;/strong&gt; details from the "OpenSearch Dashboards" tab in the web console.&lt;/p&gt;

&lt;p&gt;Once you're logged in, create an index pattern for Mastodon. An index pattern is a view over one or more indices that are used together for aggregation. Since we have just one index, you can set it to either &lt;code&gt;mastodon&lt;/code&gt; or &lt;code&gt;mastodon*&lt;/code&gt;. Use &lt;code&gt;createdAt&lt;/code&gt; as the time field and you'll be able to filter your data by time. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---NfPqA0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jx4ittrr8ied6qixnmdm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---NfPqA0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jx4ittrr8ied6qixnmdm.png" alt="Screenshot of creating an index pattern in OpenSearch" width="880" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Explore data with the Discover panel
&lt;/h2&gt;

&lt;p&gt;When you don't yet know much about the data, &lt;a href="https://opensearch.org/docs/latest/dashboards/discover/index/"&gt;the Discover panel&lt;/a&gt; is a great place to start. Here you can either view complete data objects, or choose the specific properties you're interested in. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bfNJsuF4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fevhuwemfzrzha9krnmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bfNJsuF4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fevhuwemfzrzha9krnmd.png" alt="Screenshot of discover panel" width="880" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also filter, search and even look at pre-created visualisations for each of the available fields.&lt;/p&gt;

&lt;p&gt;For example, if you are only interested in messages that include polls, add a filter to show only those records that have &lt;code&gt;poll.id&lt;/code&gt; defined.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o42U-jl5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yopvsk8yi8xnhf4n7osd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o42U-jl5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yopvsk8yi8xnhf4n7osd.png" alt="Screenshot of settings to see if a poll.id property exists" width="880" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you apply the setting, you can see the latest polls.&lt;/p&gt;
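&lt;p&gt;The same filter can be expressed directly in the OpenSearch query DSL. Below is a sketch that returns the latest messages with a poll attached; it assumes the field names produced by the pipeline from the previous article (&lt;code&gt;poll.id&lt;/code&gt;, &lt;code&gt;createdAt&lt;/code&gt;) and that &lt;code&gt;createdAt&lt;/code&gt; is mapped as a date.&lt;/p&gt;

```shell
# Return the five most recent messages that have a poll attached.
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  -H "Content-Type: application/json" \
  "https://YOUR_OPENSEARCH_HOST:PORT/mastodon/_search" \
  -d '{
    "size": 5,
    "query": { "exists": { "field": "poll.id" } },
    "sort": [ { "createdAt": "desc" } ]
  }'
```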

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fZYC_gdh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhpc9m8uwe7ijfn2rbf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fZYC_gdh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhpc9m8uwe7ijfn2rbf.png" alt="Screenshot of different found polls" width="880" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What is interesting, however, is that if you look at &lt;code&gt;poll.voterCount&lt;/code&gt;, the vast majority of polls don't get any voters at all. It seems that opinion polls are not always popular on Mastodon.&lt;/p&gt;
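&lt;p&gt;You can check this with a quick aggregation over &lt;code&gt;poll.voterCount&lt;/code&gt;; a sketch, using the same placeholder connection settings as before:&lt;/p&gt;

```shell
# Summary statistics (min, max, avg, sum) of voter counts across all polls.
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  -H "Content-Type: application/json" \
  "https://YOUR_OPENSEARCH_HOST:PORT/mastodon/_search" \
  -d '{
    "size": 0,
    "query": { "exists": { "field": "poll.id" } },
    "aggs": {
      "voters": { "stats": { "field": "poll.voterCount" } }
    }
  }'
```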

&lt;h2&gt;
  
  
  Create visualisations for aggregations
&lt;/h2&gt;

&lt;p&gt;The Discover panel is fun, but you might want to define a specific aggregation and visualise the result. OpenSearch Dashboards has a variety of options for this. Look at the list of available visualisations: each of them comes with a set of properties to shape the targeted aggregation. Here are a couple of examples:&lt;/p&gt;

&lt;h3&gt;
  
  
  Tag clouds
&lt;/h3&gt;

&lt;p&gt;To quickly see which tags are popular in the latest messages you can use &lt;strong&gt;a tag cloud&lt;/strong&gt;. Create a new bucket, set &lt;strong&gt;Aggregation&lt;/strong&gt; to &lt;code&gt;Terms&lt;/code&gt; and the &lt;strong&gt;Field&lt;/strong&gt; to &lt;code&gt;tags.name.keyword&lt;/code&gt;. To get more tags in your cloud, increase the &lt;strong&gt;Size&lt;/strong&gt; property.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JkK2NXA7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba6p4z1k9a9cqqmctal3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JkK2NXA7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ba6p4z1k9a9cqqmctal3.png" alt="Screenshot of creating a tag cloud" width="880" height="1208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;code&gt;Update&lt;/code&gt; and you'll see a tag cloud of the most popular Mastodon hashtags.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8WFT_TcY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzto0lff6nm33po4s0cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8WFT_TcY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kzto0lff6nm33po4s0cs.png" alt="Tag cloud visualisation" width="880" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're wondering which Mastodon tag is best to follow, search for some keywords and see which tags pop up. If you're a dog lover, apply the &lt;code&gt;content:dog&lt;/code&gt; filter to see which tags have dog-related content 🐕.&lt;/p&gt;
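&lt;p&gt;The same idea expressed as a query: a &lt;code&gt;terms&lt;/code&gt; aggregation over &lt;code&gt;tags.name.keyword&lt;/code&gt;, restricted to messages whose content matches a keyword. The filter value and the bucket size below are just examples:&lt;/p&gt;

```shell
# Most frequent hashtags among messages whose content mentions "dog".
curl -s -u YOUR_SERVICE_USER:YOUR_SERVICE_PASSWORD \
  -H "Content-Type: application/json" \
  "https://YOUR_OPENSEARCH_HOST:PORT/mastodon/_search" \
  -d '{
    "size": 0,
    "query": { "match": { "content": "dog" } },
    "aggs": {
      "top_tags": { "terms": { "field": "tags.name.keyword", "size": 25 } }
    }
  }'
```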

&lt;h3&gt;
  
  
  Bar visualisations
&lt;/h3&gt;

&lt;p&gt;Bar visualisations are useful to compare different values. To compare median values for the number of followers vs following users across accounts that posted the latest messages, create a horizontal bar visualisation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gVBKoe5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xtzw7ujr1evor9ct6de0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gVBKoe5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xtzw7ujr1evor9ct6de0.png" alt="Screenshot showing that the median of followers is higher than the median of following people" width="880" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Table
&lt;/h3&gt;

&lt;p&gt;If you're curious to know which users have the most followers, create a table to show information about the accounts with the highest number of followers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jmJVj7gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59r9mn029h1j710hk7zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jmJVj7gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/59r9mn029h1j710hk7zu.png" alt="Screenshot showing most popular users" width="880" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Organise visualisations in a dashboard
&lt;/h2&gt;

&lt;p&gt;Once you have multiple visualisations, you can combine them into a dashboard. A dashboard will allow you to apply time constraints and filters to multiple visualisations at once. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h9OE77By--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wm4ul7meyucepbxmrwk7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h9OE77By--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wm4ul7meyucepbxmrwk7.png" alt="Screenshot showing a dashboard with visualisations of data for the last 15 minutes" width="880" height="402"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The default time span is the last 15 minutes. If you can't see any data on your dashboard, make sure that you have recent data, or use the time field to apply a specific time span.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find more uses for your data
&lt;/h2&gt;

&lt;p&gt;Apache Kafka Connect offers enormous power to connect multiple systems together by creating data pipelines. In this example you saw how you can bring the data from an Apache Kafka topic to OpenSearch with no code needed. Our goal was to aggregate and visualise the data, which is why we used OpenSearch. In your own scenario, you might want to collect data from an Apache Kafka topic to &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect.html#sink-connectors"&gt;sink it to a different database&lt;/a&gt;, or &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect/howto/s3-sink-connector-aiven.html"&gt;an object store such as S3&lt;/a&gt;, or &lt;a href="https://docs.aiven.io/docs/products/clickhouse/howto/integrate-kafka.html"&gt;put it into ClickHouse® for long term storage and analytics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/aiven_io"&gt;Tell us&lt;/a&gt; what you are building, what connectors you use, and which ones we should add next!&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>mastodon</category>
      <category>opensearch</category>
      <category>data</category>
    </item>
    <item>
      <title>How to stream data from Mastodon public timelines to Apache Kafka with NodeJS and TypeScript</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Tue, 21 Mar 2023 16:59:40 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/how-to-stream-data-from-mastodon-public-timelines-to-apache-kafka-with-nodejs-and-typescript-6k4</link>
      <guid>https://forem.com/olena_kutsenko/how-to-stream-data-from-mastodon-public-timelines-to-apache-kafka-with-nodejs-and-typescript-6k4</guid>
      <description>&lt;p&gt;&lt;a href="https://joinmastodon.org/"&gt;Mastodon&lt;/a&gt; has been rising in popularity over recent months. If you're not yet familiar with this exotic online creature, Mastodon is an open-source social networking software for microblogging. Instead of being a single network, like Twitter, Mastodon is a federated platform that connects independent interconnected servers. This makes it a fully decentralised system. It relies on ActivityPub and uses the ActivityStreams 2.0 data format and JSON-LD. As for the functionality, it resembles closely Twitter - you can read the timeline, post messages and interact with other users. &lt;/p&gt;

&lt;p&gt;If you just recently joined Mastodon and are still exploring it, you might find that scrolling the timeline only gets you so far in understanding everything that is happening there. Applying some engineering skills will give you a better overview of the topics and discussions on the platform.&lt;/p&gt;

&lt;p&gt;Since Mastodon's timeline is nothing more than a stream of continuously arriving events, the data feed is well suited for Apache Kafka®. Adding Kafka connectors on top of that opens up multiple possibilities to use the data for aggregations and visualisations.&lt;/p&gt;

&lt;p&gt;Continue reading to learn how to bring data from Mastodon to Kafka using TypeScript and a couple of helpful libraries. &lt;/p&gt;

&lt;h2&gt;
  
  
  Prepare the Apache Kafka cluster
&lt;/h2&gt;

&lt;p&gt;To bring the data from the Mastodon timeline to a topic in Apache Kafka, you'll need an Apache Kafka cluster and some code to stream the data there. For the former, you can use either your own cluster, or a managed version that runs in the cloud, such as &lt;a href="https://aiven.io/kafka?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven for Apache Kafka&lt;/a&gt;. If you don't have an Aiven account yet, &lt;a href="https://console.aiven.io/signup/email?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;sign up for a free trial&lt;/a&gt; and &lt;a href="https://docs.aiven.io/docs/products/kafka/getting-started.html"&gt;create your cluster&lt;/a&gt;, the setup only takes a few minutes.&lt;/p&gt;

&lt;p&gt;Once your cluster is running, &lt;a href="https://docs.aiven.io/docs/products/kafka/howto/create-topic.html"&gt;add a topic&lt;/a&gt; with the name &lt;code&gt;mastodon&lt;/code&gt;. Alternatively, you can use any other name; just remember it, as you'll need it a bit later.&lt;/p&gt;

&lt;p&gt;To connect securely to the cluster we'll use SSL. Aiven takes care of the server-side configuration, but you'll need to download three files so that the client can establish the connection. Download these files from Aiven's console:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bMuPf9dr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9pt3lx5g7opceaq7ykj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bMuPf9dr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x9pt3lx5g7opceaq7ykj.png" alt="Screenshot of the Aiven for Apache Kafka page in Aiven's console showing where to take certificates and keys" width="880" height="139"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will need these files later, so put them somewhere safe for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working with the Mastodon API
&lt;/h2&gt;

&lt;p&gt;The Mastodon API has excellent documentation that makes it straightforward to access the public data feeds. You don't need to be registered to retrieve a public feed, which makes it very convenient. Actually, just give it a try right now. Run the line below in your terminal to start retrieving a stream of data from &lt;code&gt;mastodon.social&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl https://mastodon.social/api/v1/streaming/public
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In response, you should see an endless flow of incoming events:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5FfK7Mh9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/images/curl-mastodon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5FfK7Mh9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/v1/images/curl-mastodon.png" alt="Running  raw ``curl https://mastodon.social/api/v1/streaming/public`` endraw  in your terminal shows a response with event data" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Parsing the response from the server manually is a monotonous and tedious operation. Rather than reinvent the wheel, you can use one of the &lt;a href="https://docs.joinmastodon.org/client/libraries/"&gt;available libraries for Mastodon&lt;/a&gt;. For this example we'll be using &lt;a href="https://github.com/neet/masto.js"&gt;masto.js&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Jump into the code
&lt;/h2&gt;

&lt;p&gt;For a quick start bringing Mastodon data into an Apache Kafka cluster, clone &lt;a href="https://github.com/aiven/mastodon-to-kafka"&gt;this repository&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/aiven/mastodon-to-kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have the contents of the repo locally, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a folder &lt;code&gt;certificates/&lt;/code&gt; and add the SSL certificates you downloaded earlier into this folder. We will need these to connect securely to Apache Kafka.&lt;/li&gt;
&lt;li&gt;Copy the file &lt;code&gt;.env.example&lt;/code&gt; and rename it to &lt;code&gt;.env&lt;/code&gt;; this file will hold the environment variables.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;kafka.uri&lt;/code&gt; in &lt;code&gt;.env&lt;/code&gt; to the address of your cluster. You can take it from the connection information of your Aiven for Apache Kafka service.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;npm install&lt;/code&gt; to install all dependencies (if you don't have npm or NodeJS yet, follow &lt;a href="https://docs.npmjs.com/downloading-and-installing-node-js-and-npm"&gt;the installation instructions&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;
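&lt;p&gt;The steps above can be sketched as terminal commands. The certificate file names below are the defaults Aiven uses for downloads; adjust the paths to wherever you saved your files:&lt;/p&gt;

```shell
cd mastodon-to-kafka

# 1. Put the SSL files downloaded from the Aiven console into certificates/
mkdir certificates
cp ~/Downloads/service.cert ~/Downloads/service.key ~/Downloads/ca.pem certificates/

# 2. Create the environment file...
cp .env.example .env
# 3. ...then edit .env and set kafka.uri to your cluster's address.

# 4. Install the dependencies.
npm install
```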

&lt;p&gt;Finally, run &lt;code&gt;npm run start&lt;/code&gt; and you should see a flow of delivery reports for every new message coming from the Mastodon public feed that is defined in the code (in the next section you'll see how to change it to whichever Mastodon feed you like!).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zcbU4lR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kykyx7pm873qt920pps1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zcbU4lR0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kykyx7pm873qt920pps1.png" alt="Screenshot showing running code to send data to the Kafka topic" width="880" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If things don't work first time, check for error messages printed in the terminal. They will help you navigate the problem. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can verify that the data is flowing and see what messages you get by enabling the Apache Kafka REST API.&lt;/p&gt;

&lt;p&gt;In the contextual menu for the topic, select &lt;strong&gt;Apache Kafka REST&lt;/strong&gt;:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9z_9VEp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/temwi2r9q5y5rhs5un4n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9z_9VEp---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/temwi2r9q5y5rhs5un4n.png" alt="Screenshot showing Apache Kafka REST menu option for a topic" width="880" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can step back to see exactly how the code works to send the data from Mastodon to Apache Kafka. This can be divided into two logical steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Streaming messages from a public Mastodon timeline.&lt;/li&gt;
&lt;li&gt;Sending these messages to Apache Kafka.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the sections below you can see these steps in detail.&lt;/p&gt;
&lt;h2&gt;
  
  
  Streaming messages from a public Mastodon timeline
&lt;/h2&gt;

&lt;p&gt;Open the file &lt;code&gt;mastostream.ts&lt;/code&gt;. It contains a small module to stream the Mastodon data. &lt;/p&gt;

&lt;p&gt;To initialise the Mastodon client, call &lt;code&gt;login()&lt;/code&gt; from the &lt;code&gt;masto.js&lt;/code&gt; client library and provide the required configuration. This is also the place to provide authentication information; however, since we are only interested in public feeds, the URL property is enough. For the URL, &lt;a href="https://joinmastodon.org/servers"&gt;use your favourite Mastodon server&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;masto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;login&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://mastodon.social/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// choose your favourite mastodon server&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the initialised Mastodon client you connect to the public stream API by calling the asynchronous function &lt;code&gt;masto.stream.streamPublicTimeline()&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;masto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;streamPublicTimeline&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, you're ready to subscribe to the updates from the public stream provided by the API.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;// next - send status data to Apache Kafka topic&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it's time to put these building blocks together.&lt;/p&gt;

&lt;p&gt;For the sake of encapsulation, you wouldn't want the &lt;code&gt;mastostream&lt;/code&gt; module to know directly about the Apache Kafka producer. That's why, when putting all the above ingredients together, we give the &lt;code&gt;mastostream&lt;/code&gt; module a more generic callback argument.&lt;br&gt;
This callback function receives each Mastodon status message converted to a string, so whoever triggered &lt;code&gt;mastostream&lt;/code&gt; gets the data and can act on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;masto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;login&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://fosstodon.org/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Connect to the streaming api&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;masto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;streamPublicTimeline&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Subscribe to updates&lt;/span&gt;
        &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;update&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Callback failed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is all you need to stream the data from Mastodon! Time to send these messages to an Apache Kafka topic.&lt;/p&gt;
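&lt;p&gt;To see the decoupling idea in isolation, here is a minimal, self-contained sketch where a fake event source stands in for the Mastodon stream (all names in this sketch are illustrative, not part of the project code):&lt;/p&gt;

```typescript
// Sketch of the callback-based decoupling used by mastostream.
// A fake event source stands in for the Mastodon timeline.
type StatusCallback = (status: string) => void;

function fakeTimeline(onUpdate: (status: object) => void): void {
    // Pretend two statuses arrive from the public timeline.
    onUpdate({ id: "1", content: "hello" });
    onUpdate({ id: "2", content: "world" });
}

function streamToCallback(callback: StatusCallback): void {
    fakeTimeline((status) => {
        try {
            // Serialise the status before handing it over,
            // just as mastostream does with JSON.stringify.
            callback(JSON.stringify(status));
        } catch (err) {
            console.log("Callback failed", err);
        }
    });
}

// The party that triggered the stream decides what to do with the data.
const received: string[] = [];
streamToCallback((s) => received.push(s));
console.log(received.length); // 2
```

&lt;p&gt;The caller never imports anything Kafka-specific; it only supplies a function, which is exactly what lets the producer code live elsewhere.&lt;/p&gt;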

&lt;h2&gt;
  
  
  Sending messages to Apache Kafka using &lt;code&gt;node-rdkafka&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Open &lt;code&gt;producer.ts&lt;/code&gt; to see the code you need to send the data to an Apache Kafka topic. To work with Apache Kafka you can use one of several existing client libraries. This project uses &lt;code&gt;node-rdkafka&lt;/code&gt;, a Node.js wrapper for the Kafka C/C++ library. Check &lt;a href="https://github.com/Blizzard/node-rdkafka"&gt;its GitHub repository README&lt;/a&gt; for installation steps.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;node-rdkafka&lt;/code&gt; you can create a producer to send data to the cluster. This is where you'll use the Apache Kafka configuration settings defined earlier in &lt;code&gt;.env&lt;/code&gt;, together with the certificates you downloaded, to establish a secure connection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="c1"&gt;//create a producer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Kafka&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;metadata.broker.list&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;kafka.uri&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;security.protocol&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssl&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssl.key.location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ssl.key.location&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssl.certificate.location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ssl.certificate.location&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ssl.ca.location&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;ssl.ca.location&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;dr_cb&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
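&lt;p&gt;For reference, the &lt;code&gt;.env&lt;/code&gt; file that these &lt;code&gt;process.env&lt;/code&gt; lookups rely on would contain entries along these lines (the values below are placeholders, not real settings):&lt;/p&gt;

```ini
kafka.uri=YOUR-SERVICE-NAME.aivencloud.com:PORT
ssl.key.location=service.key
ssl.certificate.location=service.cert
ssl.ca.location=ca.pem
```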



&lt;p&gt;The producer emits events as things happen, so to understand what is going on and to catch any errors, we subscribe to a number of them, including delivery reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;
&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;event.log&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;//logging all errors&lt;/span&gt;
&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;event.error&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;connection.failure&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;delivery-report&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Message was delivered&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;report&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;disconnected&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;producer disconnected. &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One last event, and an especially important one, is &lt;code&gt;ready&lt;/code&gt;. It fires when the producer is ready to dispatch messages to the topic. Its handler relies on the callback provided by the &lt;code&gt;mastostream&lt;/code&gt; module that we implemented in the previous section:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ready&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;mastostream&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;mastodon&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// topic to send the message to&lt;/span&gt;
            &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// partition, null for librdkafka default partitioner&lt;/span&gt;
            &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;  &lt;span class="c1"&gt;// value&lt;/span&gt;
            &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// optional key&lt;/span&gt;
            &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;// optional timestamp&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;None of the above will work, though, until you call the &lt;code&gt;connect()&lt;/code&gt; method. Add the snippet below, run your code, and watch the data start to flow!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({},&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This method takes an optional second parameter: a callback that informs you about any errors during the connection.&lt;/p&gt;

&lt;p&gt;We've now seen all the code and examined how it all works together. By separating the concerns of collecting data from Mastodon, and passing it to Apache Kafka, we have a system that can also be adapted to handle different data sources as needed.&lt;/p&gt;
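&lt;p&gt;For example, because &lt;code&gt;mastostream&lt;/code&gt; only knows about a generic callback, the Kafka producer could be swapped for any other sink. Here's a hypothetical alternative that appends statuses to a local file instead (the file name and helper below are made up for illustration):&lt;/p&gt;

```typescript
import * as fs from "fs";

// Hypothetical alternative sink: append each status to a newline-delimited
// JSON file. mastostream itself would stay unchanged; only the callback differs.
function fileSink(path: string): (status: string) => void {
    return (status) => fs.appendFileSync(path, status + "\n");
}

// Usage sketch: instead of wiring up the Kafka producer, you'd call
// mastostream(fileSink("statuses.ndjson"));
const sink = fileSink("statuses.ndjson");
sink(JSON.stringify({ id: "1", content: "hello" }));
console.log(fs.readFileSync("statuses.ndjson", "utf8").trim());
```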

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;With the data constantly flowing into the topic, you can now use it as an input for other tools and databases, such as OpenSearch®, ClickHouse®, or PostgreSQL®. Apache Kafka® Connect connectors help you bring the data into other systems with no code required. Learn more about &lt;a href="https://aiven.io/blog/what-is-apache-kafka#apache-kafka-connect?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Apache Kafka and Kafka Connect&lt;/a&gt; and check the full list of &lt;a href="https://docs.aiven.io/docs/products/kafka/kafka-connect.html#sink-connectors"&gt;sink connectors&lt;/a&gt; available on the Aiven platform to see where you can send the data for further storage and analysis.&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>mastodon</category>
      <category>data</category>
    </item>
    <item>
      <title>First steps with the Apache Kafka® Java client library</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Tue, 21 Mar 2023 16:35:39 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/first-steps-with-the-apache-kafkar-java-client-library-2ea</link>
      <guid>https://forem.com/olena_kutsenko/first-steps-with-the-apache-kafkar-java-client-library-2ea</guid>
      <description>&lt;p&gt;It's difficult to imagine the development of mission-critical software without relying on an event streaming platform such as Apache Kafka®. But perhaps you're new to Apache Kafka® and want to go deeper. You're in the right place! This article will guide your first steps using Apache Kafka® with Java.&lt;/p&gt;

&lt;p&gt;If you can't wait to see the final result, &lt;a href="https://github.com/anelook/apache-kafka-first-steps-java"&gt;this GitHub repository has the producer and consumer&lt;/a&gt; we'll write step by step in this article. &lt;/p&gt;

&lt;h2&gt;
  
  
  Get equipped with what you need
&lt;/h2&gt;

&lt;p&gt;In this blog post you'll learn how to create an Apache Kafka® producer and a consumer in Java. You'll prepare configuration files needed for a secure connection and write some Java to send messages to the cluster and poll them back.&lt;/p&gt;

&lt;p&gt;Before you start writing the code, there are several things you'll need to prepare. &lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Kafka cluster
&lt;/h3&gt;

&lt;p&gt;First, you'll need an Apache Kafka cluster itself. To simplify the setup, you can use &lt;a href="https://aiven.io/kafka?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven for Apache Kafka®&lt;/a&gt;, a fully managed solution that builds a cluster with the correct configuration in just minutes and takes care of secure authentication and other essentials. If you don't have an Aiven account yet, register for &lt;a href="https://console.aiven.io/signup?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;a free trial&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Once you're in the console, create a new service: in the &lt;strong&gt;Create service&lt;/strong&gt; dialog select &lt;strong&gt;Apache Kafka&lt;/strong&gt;, the &lt;strong&gt;cloud&lt;/strong&gt; of your choice, and the &lt;strong&gt;cloud region&lt;/strong&gt; nearest to you. The &lt;strong&gt;Startup&lt;/strong&gt; service plan is sufficient for today. Set a name for your service, for example &lt;strong&gt;apache-kafka-playground&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;While the service is deploying, you can proceed with other tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9k4Ox09O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1zjel9uz5igox5uj8x3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9k4Ox09O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1zjel9uz5igox5uj8x3o.png" alt="Screenshot showing newly created Aiven for Apache Kafka service, service is still rebuilding" width="880" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Java project with dependencies
&lt;/h3&gt;

&lt;p&gt;The second thing you'll need is a JDK installed on your computer and a basic Java project. This article assumes you have basic knowledge of Java. I used the Java 11 JDK when running this code, but Apache Kafka® supports Java 17, so you have plenty of choice.&lt;/p&gt;

&lt;p&gt;You'll also need an official low-level Apache Kafka® client library for Java, a &lt;em&gt;reference client&lt;/em&gt;, to create a producer and a consumer. Note that if you plan to work with Java APIs for Kafka Streams or Kafka Connect, you'll need an additional set of libraries. &lt;/p&gt;

&lt;p&gt;The most convenient way of including &lt;code&gt;kafka-clients&lt;/code&gt; in your Java project is to use either &lt;a href="https://maven.apache.org/"&gt;Maven&lt;/a&gt; or &lt;a href="https://gradle.org/"&gt;Gradle&lt;/a&gt;. Select the latest version of the &lt;a href="https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients"&gt;kafka-clients artifact from mvnrepository&lt;/a&gt;, choose the build tool you use, copy the dependency, and add it to your project. &lt;/p&gt;
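&lt;p&gt;With Gradle, for instance, the dependency block would look roughly like this (the version number is illustrative; pick the latest one from mvnrepository):&lt;/p&gt;

```groovy
dependencies {
    implementation 'org.apache.kafka:kafka-clients:3.4.0'
}
```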

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Jxc0pQfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ibi45q693t3lqkspdr7v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Jxc0pQfi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ibi45q693t3lqkspdr7v.png" alt="Screenshot showing selecting gradle dependency for apache kafka client from mvnrepository" width="880" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used Gradle: I pasted the dependency into the &lt;em&gt;build.gradle&lt;/em&gt; file and let IntelliJ IDEA load the necessary files by selecting &lt;strong&gt;Reload All Gradle Projects&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In addition to the Apache Kafka® client, you'll also need several other libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://mvnrepository.com/artifact/org.slf4j/slf4j-simple"&gt;slf4j-simple&lt;/a&gt; for logging&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://mvnrepository.com/artifact/org.json/json"&gt;JSON&lt;/a&gt; to create and parse JSON objects&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Set up configuration and authentication for Apache Kafka®
&lt;/h2&gt;

&lt;p&gt;Before creating the producers and consumers, you need to specify several configuration properties. These ensure that the information exchanged between Kafka brokers and clients is kept complete, secure, and confidential. &lt;/p&gt;

&lt;p&gt;Aiven offers two authentication approaches: &lt;strong&gt;TLS&lt;/strong&gt; and &lt;strong&gt;SASL&lt;/strong&gt;. In this article we'll use TLS for both authentication and encryption. If you want to use SASL, check out &lt;a href="https://docs.aiven.io/docs/products/kafka/howto/kafka-sasl-auth.html"&gt;the SASL instructions in Aiven's documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Usually, to perform a TLS handshake, you need to configure both Apache Kafka® brokers and the clients. To simplify things Aiven takes care of TLS configuration for the brokers, so you only need to configure the clients. And, as we'll see, even with the clients Aiven does most of the work for you. &lt;/p&gt;

&lt;p&gt;To establish a TLS connection between the client and the server three things need to happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client needs to verify the identity of the server.&lt;/li&gt;
&lt;li&gt;The server needs to verify the identity of the client.&lt;/li&gt;
&lt;li&gt;All messages in transit between the client and server must be encrypted.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To do this, we'll use a Java truststore and keystore.&lt;/p&gt;

&lt;p&gt;A truststore in Java is a place where you store the certificates of external systems that you trust. These certificates don't contain sensitive information, but they are important to identify and connect to a third-party system. On the other hand, the keystore contains the private access key and its corresponding access certificate, which are needed to authenticate the client. You shouldn't share keystore data with anyone.&lt;/p&gt;

&lt;p&gt;If you're adventurous, you can create these files manually (here is &lt;a href="https://docs.aiven.io/docs/products/kafka/howto/keystore-truststore.html"&gt;the guide&lt;/a&gt; on how to do this). However, you can also take a convenient shortcut and let the Aiven platform do all the work for you. &lt;/p&gt;

&lt;p&gt;Run &lt;a href="https://docs.aiven.io/docs/tools/cli/service/user.html#avn-service-user-kafka-java-creds"&gt;&lt;code&gt;avn service user-kafka-java-creds&lt;/code&gt;&lt;/a&gt; using the &lt;a href="https://docs.aiven.io/docs/tools/cli.html"&gt;Aiven CLI&lt;/a&gt; with the information about the service and the user:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;YOUR-SERVICE-NAME&lt;/strong&gt; - the name of your Apache Kafka service as you defined it during creation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;YOUR-USER-NAME&lt;/strong&gt; - the name of the user who performs the operation (if you're in doubt, run &lt;code&gt;avn service user-list --format '{username}' --project YOUR-PROJECT-NAME YOUR-SERVICE-NAME&lt;/code&gt; to see users)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PASSWORD&lt;/strong&gt; - select a secure password for your keystore and truststore&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now apply those values to the command below and run it:&lt;br&gt;
&lt;/p&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;avn&lt;/span&gt; &lt;span class="n"&gt;service&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;kafka&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;java&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt; &lt;span class="no"&gt;YOUR&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="no"&gt;SERVICE&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="no"&gt;NAME&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt; &lt;span class="no"&gt;YOUR&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="no"&gt;USER&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="no"&gt;NAME&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt; &lt;span class="no"&gt;PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all goes well, you will see six new files in the &lt;strong&gt;resources&lt;/strong&gt; folder. Aiven downloads the necessary certificates, creates both the keystore and the truststore, and puts all the references into a single file, &lt;strong&gt;client.properties&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OMg-eYHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64f5kncoctjlyh4u7sjo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OMg-eYHQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64f5kncoctjlyh4u7sjo.png" alt="6 new files that were added after running  raw ``avn service user-kafka-java-creds`` endraw " width="880" height="345"&gt;&lt;/a&gt;&lt;/p&gt;
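&lt;p&gt;For reference, the generated &lt;strong&gt;client.properties&lt;/strong&gt; is roughly of this shape (the host, port, and passwords below are placeholders; the exact contents depend on your service):&lt;/p&gt;

```
bootstrap.servers=YOUR-SERVICE-NAME-YOUR-PROJECT.aivencloud.com:PORT
security.protocol=SSL
ssl.keystore.type=PKCS12
ssl.keystore.location=client.keystore.p12
ssl.keystore.password=PASSWORD
ssl.key.password=PASSWORD
ssl.truststore.location=client.truststore.jks
ssl.truststore.password=PASSWORD
```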

&lt;p&gt;To make it easier to read the settings stored in &lt;strong&gt;client.properties&lt;/strong&gt;, add a small static method &lt;code&gt;loadProperties&lt;/code&gt; to a new class &lt;code&gt;Utils&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;org.example&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.io.IOException&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.io.InputStream&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Properties&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.common.serialization.StringSerializer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.Logger&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Utils&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="nf"&gt;loadProperties&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;InputStream&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ProducerOneMessage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getClassLoader&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getResourceAsStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"client-ssl.properties"&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sorry, unable to find config.properties"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
            &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
            &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;IOException&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Congratulations! You're done with the configuration settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dispatch events to Apache Kafka® cluster
&lt;/h2&gt;

&lt;p&gt;Time to send the data to the Apache Kafka® cluster. For this you need a producer.&lt;br&gt;
In your project create a new Java class called &lt;code&gt;Producer&lt;/code&gt; and add the main method there. &lt;/p&gt;

&lt;p&gt;To send a message you'll need to do these four steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 1: create a producer and connect to the cluster&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 2: define the topic name&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 3: create a message record&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 4: send the record to the cluster&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each of these steps you can rely on the functionality provided by the official Apache Kafka® client library for Java, which you added as a dependency previously.&lt;/p&gt;

&lt;p&gt;Here is what you have to import for the Producer class to work:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.producer.KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.producer.ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.common.serialization.StringSerializer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.json.JSONObject&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.Logger&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Properties&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's also a good idea to use an instance of &lt;code&gt;Logger&lt;/code&gt; to log events later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;Logger&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getLogger&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: create a producer and connect to the cluster
&lt;/h3&gt;

&lt;p&gt;The constructor of &lt;code&gt;KafkaProducer&lt;/code&gt; from &lt;em&gt;Apache Kafka client library&lt;/em&gt; expects a list of properties to establish a connection. You already did most of the heavy lifting to define a connection configuration in the previous section. Now, just reference those entries with the helpful utility method &lt;code&gt;Utils.loadProperties()&lt;/code&gt; that you added above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// step # 1: create a producer and connect to the cluster&lt;/span&gt;
&lt;span class="c1"&gt;// get connection data from the configuration file&lt;/span&gt;

&lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadProperties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
        &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One more thing you need to define is the format used to serialize the data. In this example we'll send JSON objects as strings using &lt;code&gt;StringSerializer&lt;/code&gt;. You should also add a serializer for the key: even though you won't use keys explicitly in the first example, specifying &lt;code&gt;key.serializer&lt;/code&gt; is mandatory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value.serializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringSerializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you have a set of properties to establish a connection, you can create an instance of &lt;code&gt;KafkaProducer&lt;/code&gt; and pass the properties into its constructor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point you aren't sending any data to the cluster yet. However, it's useful to run &lt;code&gt;Producer&lt;/code&gt; to see that the connection to the server is established and check for any errors:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QQNzqQyJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pac1h5zv1tqk5qx5b0g8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QQNzqQyJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pac1h5zv1tqk5qx5b0g8.png" alt="establishing-connection" width="880" height="642"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: define the topic name
&lt;/h3&gt;

&lt;p&gt;When sending data to the cluster, you need to specify the topic the message goes to.&lt;/p&gt;

&lt;p&gt;I created a topic named &lt;em&gt;customer-activity&lt;/em&gt; which records activity of customers in an online shop. You can be more creative and choose a different theme for your messages!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topicName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"customer-activity"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that once you select the name of your topic, you need to create it in your Aiven for Apache Kafka® cluster. Even though you can configure Apache Kafka® to create topics automatically on message arrival, it's best to keep that option disabled to avoid accidentally creating a bunch of unnecessary topics. You can create a topic in Aiven for Apache Kafka® using the handy CLI command &lt;a href="https://docs.aiven.io/docs/tools/cli/service/topic.html#avn-cli-service-topic-create"&gt;&lt;code&gt;avn service topic-create&lt;/code&gt;&lt;/a&gt; or follow &lt;a href="https://docs.aiven.io/docs/products/kafka/howto/create-topic.html"&gt;these steps to create a topic&lt;/a&gt; through the Aiven console.&lt;/p&gt;

&lt;p&gt;Here is the configuration of the topic I used; it has three partitions and a replication factor of three:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ca3oPF-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h44nb4ho119atjscdjyt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ca3oPF-R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h44nb4ho119atjscdjyt.png" alt="Screenshot that shows adding a new topic through Aiven's console" width="880" height="405"&gt;&lt;/a&gt;&lt;/p&gt;
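&lt;p&gt;If you prefer the CLI over the console, a topic with the same settings can be created with a single command (shown here as a sketch; check &lt;code&gt;avn service topic-create --help&lt;/code&gt; for the exact flags available in your CLI version):&lt;/p&gt;

```
avn service topic-create YOUR-SERVICE-NAME customer-activity \
  --partitions 3 \
  --replication 3
```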

&lt;h3&gt;
  
  
  Step # 3: create a message record
&lt;/h3&gt;

&lt;p&gt;Messages can be sent in a variety of formats: String, JSON, Avro, protobuf, etc. In fact, Kafka doesn't have any opinion on the structure of data you want to send, which makes the platform very flexible. At times this gets messy, but &lt;a href="https://aiven.io/blog/what-is-karapace?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Karapace, Aiven's open source schema registry&lt;/a&gt;, can help you organize your data better if needed.&lt;/p&gt;

&lt;p&gt;For simplicity, use JSON for this example and define an object with three properties: a customer name, an operation that was performed and a product that was affected.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;JSONObject&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JSONObject&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Judy Hopps🐰"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"product"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Carrot 🥕"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"operation"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"ordered"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new &lt;code&gt;ProducerRecord&lt;/code&gt; instance by passing the topic name and the message to the constructor:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// package the message in the record&lt;/span&gt;
&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Record created: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that using &lt;code&gt;&amp;lt;String, String&amp;gt;&lt;/code&gt; indicates that the producer expects both the key and the value to be of type &lt;code&gt;String&lt;/code&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Step # 4: send the record to the cluster
&lt;/h3&gt;

&lt;p&gt;Finally, to send the message to a topic in the Apache Kafka® cluster, call the &lt;code&gt;send()&lt;/code&gt; method of the producer instance and pass it the record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the producer, call the &lt;code&gt;main()&lt;/code&gt; method of the &lt;code&gt;Producer&lt;/code&gt; class from your IDE. Alternatively, you can use Gradle and set up a task to run the producer, as is done &lt;a href="https://github.com/anelook/apache-kafka-first-steps-java/blob/main/build.gradle"&gt;in the accompanying repository&lt;/a&gt;. You should see output similar to this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--57V4WCXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m59f55nflr74kux5hv2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--57V4WCXJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/m59f55nflr74kux5hv2w.png" alt="Screenshot showing running producer that sent a single message to the cluster" width="880" height="323"&gt;&lt;/a&gt;&lt;/p&gt;
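&lt;p&gt;If you'd rather not rely on the IDE, a hypothetical Gradle task for this could look like the following (assuming the &lt;code&gt;java&lt;/code&gt; plugin and the &lt;code&gt;org.example&lt;/code&gt; package used in this article; the accompanying repository is the authoritative version):&lt;/p&gt;

```
// build.gradle - sketch of a task that runs the producer
task runProducer(type: JavaExec) {
    classpath = sourceSets.main.runtimeClasspath
    mainClass = 'org.example.Producer'
}
```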

&lt;p&gt;The &lt;code&gt;send()&lt;/code&gt; method of the producer also accepts a callback, which gives you metadata about the sent record and information about any exceptions. You can introduce it with the following changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Callback&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@Override&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onCompletion&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RecordMetadata&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sent successfully. Metadata: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;});&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;close&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;RecordMetadata&lt;/code&gt; and &lt;code&gt;Callback&lt;/code&gt; will need extra imports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.producer.RecordMetadata&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.producer.Callback&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Send multiple messages
&lt;/h2&gt;

&lt;p&gt;Great, you successfully sent a single message to the cluster! However, sending messages one by one is tedious. Before moving to the consumer, transform the code to imitate a continuous (even if overly simplified) flow of data.&lt;/p&gt;

&lt;p&gt;To do this, extract message generation into a separate method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"searched"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"bought"&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Judy Hopps🐰"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Nick Wilde🦊"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Chief Bogo🐃"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Officer Clawhauser😼"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Mayor Lionheart 🦁"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Mr. Big 🪑"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Fru Fru💐"&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;
&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"Donut 🍩"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Carrot 🥕"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Tie 👔"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Glasses 👓️️"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Phone ☎️"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Ice cream 🍨"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Dress 👗"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Pineapple pizza 🍕"&lt;/span&gt;&lt;span class="o"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;JSONObject&lt;/span&gt; &lt;span class="nf"&gt;generateMessage&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;JSONObject&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;JSONObject&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// randomly assign values&lt;/span&gt;
    &lt;span class="nc"&gt;Random&lt;/span&gt; &lt;span class="n"&gt;randomizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Random&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;randomizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;)]);&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"product"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;randomizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="o"&gt;)]);&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"operation"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;operations&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;randomizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;nextInt&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;]);&lt;/span&gt; &lt;span class="c1"&gt;// prefer 'search' over 'bought'&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now combine the steps to generate and send data within an endless while loop. Note that using &lt;code&gt;while(true)&lt;/code&gt; and &lt;code&gt;Thread.sleep&lt;/code&gt; aren't things you want to do in a production environment, but for our purposes they work well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// step # 2: define the topic name&lt;/span&gt;
    &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topicName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"customer-activity"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// step # 3: generate and send message data&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// generate a new message&lt;/span&gt;
        &lt;span class="nc"&gt;JSONObject&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generateMessage&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// package the message in a record&lt;/span&gt;
        &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
                &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Record created: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// send data&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Callback&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;@Override&lt;/span&gt;
            &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;onCompletion&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;RecordMetadata&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt; &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sent successfully. Metadata: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;printStackTrace&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;});&lt;/span&gt;
        &lt;span class="nc"&gt;Thread&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;sleep&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
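&lt;p&gt;As an aside, if you wanted to avoid the busy loop in a longer-lived application, a scheduler is the usual replacement. Here's a minimal, self-contained sketch of the idea, where a plain &lt;code&gt;Runnable&lt;/code&gt; stands in for the generate-and-send step (it is not part of the article's producer code):&lt;/p&gt;

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SchedulerSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch threeTicks = new CountDownLatch(3);

        // stand-in for generateMessage() + producer.send()
        Runnable sendTask = () -> {
            System.out.println("would generate and send a message here");
            threeTicks.countDown();
        };

        // run the task once every 100 ms instead of sleeping inside a loop
        scheduler.scheduleAtFixedRate(sendTask, 0, 100, TimeUnit.MILLISECONDS);

        threeTicks.await(); // wait for a few runs, then stop cleanly
        scheduler.shutdown();
    }
}
```

&lt;p&gt;A scheduler also gives you a clean way to stop the producer, which a bare &lt;code&gt;while(true)&lt;/code&gt; does not.&lt;/p&gt;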



&lt;p&gt;Now, while the producer is running, it continuously sends records into the cluster:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RLTQSuRw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tbsf1q71z37rvi05e2d1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RLTQSuRw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tbsf1q71z37rvi05e2d1.png" alt="Screenshot showing the producer sending multiple messages to the Kafka topic" width="880" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Consume the data from the Apache Kafka topic
&lt;/h2&gt;

&lt;p&gt;Now that the messages are generated and sent by the producer into the cluster, you can create a consumer to poll and process those messages.&lt;/p&gt;

&lt;p&gt;Creation of a simple consumer can be divided into three steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 1: create a consumer and connect to the cluster&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 2: subscribe consumer to the topics&lt;/span&gt;
        &lt;span class="c1"&gt;// Step # 3: poll and process new data&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the imports for the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.consumer.ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.consumer.ConsumerRecords&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.clients.consumer.KafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.kafka.common.serialization.StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.Logger&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.slf4j.LoggerFactory&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.time.Duration&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Collections&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Properties&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 1: create a consumer and connect to the cluster
&lt;/h3&gt;

&lt;p&gt;Similar to how you configured the producer's properties, you need to specify connection information for the consumer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// step # 1: create a consumer and connect to the cluster&lt;/span&gt;
&lt;span class="c1"&gt;// get connection data from the configuration file&lt;/span&gt;
&lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Utils&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;loadProperties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key.deserializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value.deserializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"group.id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"first"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"auto.offset.reset"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"earliest"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;//choose from earliest/latest/none&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to the properties that you used for the producer, the consumer has a couple of new ones. First, the consumer needs to be able to deserialize the data that it reads from the cluster, so instead of serialization properties you define deserialization ones:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"key.deserializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"value.deserializer"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringDeserializer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getName&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You also need to assign the consumer to a consumer group. Do this by specifying a &lt;code&gt;group.id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"group.id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"first"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The last thing you should define is the point from which the consumer should start reading the data when it first connects to a topic. You can define a specific offset, or, alternatively, point to either the earliest or the latest message currently present in the topic. Set &lt;code&gt;auto.offset.reset&lt;/code&gt; to &lt;code&gt;earliest&lt;/code&gt; to read the messages from the start.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"auto.offset.reset"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"earliest"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using the connection properties that you defined, you're ready to create the consumer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KafkaConsumer&amp;lt;String,String&amp;gt; consumer = new KafkaConsumer&amp;lt;String, String&amp;gt;(properties);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: subscribe the consumer to the topic
&lt;/h3&gt;

&lt;p&gt;Subscribe the consumer to one or more topics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topicName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"customer-activity"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;subscribe&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Collections&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;singleton&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: poll and process new data
&lt;/h3&gt;

&lt;p&gt;The last step is to poll data on a regular basis from the Apache Kafka® topic. For this, use the &lt;code&gt;poll()&lt;/code&gt; method and specify how long the consumer should wait for new messages to arrive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// step # 3 poll andprocess new data&lt;/span&gt;
&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// poll new data&lt;/span&gt;
    &lt;span class="nc"&gt;ConsumerRecords&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;poll&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Duration&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ofMillis&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;
    &lt;span class="c1"&gt;// process new data&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"message: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, it's time to start the consumer to read all the data sent by the producer. Again, you can either use the help of your IDE to run the &lt;code&gt;main()&lt;/code&gt; method, or use the powers of &lt;strong&gt;Gradle&lt;/strong&gt; – see &lt;a href="https://github.com/anelook/apache-kafka-first-steps-java"&gt;how it's done&lt;/a&gt; in the accompanying repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4LomcSHf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c2iao7td6elxnfpux7b3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4LomcSHf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c2iao7td6elxnfpux7b3.png" alt="Screenshot showing consumer polling and printing out data" width="880" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintain the ordering of messages for every customer
&lt;/h2&gt;

&lt;p&gt;With the producer and consumer created, you can now send and read data from the Kafka cluster.&lt;br&gt;
However, if you look closely at the records, you might notice that the order in which the consumer reads them differs from the order in which the producer sent them.&lt;/p&gt;

&lt;p&gt;Even though it's a natural side effect of a distributed system, you often want to maintain the order across the messages. This challenge is discussed in detail in a separate blog post, &lt;a href="https://aiven.io/blog/balance-data-across-kafka-partitions?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;ways to balance your data across Apache Kafka® partitions&lt;/a&gt;. In this post, we'll use one of the strategies suggested in that article: preserving the order of messages with the help of a key.&lt;/p&gt;

&lt;p&gt;In an online shop, the order of operations performed by the customers is important. A customer first adds a product to the basket, then pays for it, and only then do you dispatch the item. To maintain the sequence of messages related to each individual customer when balancing data across partitions, you can use the customer's &lt;code&gt;id&lt;/code&gt; as the key.&lt;/p&gt;

&lt;p&gt;To do this, specify the record's key on the producer side when creating the record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// create a producer record&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customer"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To see the effect of this change on the consumer side, print out the partition and offset of each record as it arrives from the brokers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ConsumerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;records&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"partition "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"| offset "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;offset&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
            &lt;span class="s"&gt;"| "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can run the updated producer and consumer. In the consumer output, notice that the data for each individual customer is always added to the same partition. With this, even though messages about different customers can be interleaved, messages related to the same customer maintain their original order.&lt;/p&gt;
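&lt;p&gt;Why does the same key always land in the same partition? The producer chooses a partition by hashing the record's key. Kafka's default partitioner actually uses a murmur2 hash of the serialized key; the sketch below uses &lt;code&gt;hashCode()&lt;/code&gt; as a simplified stand-in (so the exact partition numbers differ from Kafka's), but it shows the principle: identical keys always map to the same partition.&lt;/p&gt;

```java
public class PartitionSketch {
    // Simplified stand-in for Kafka's default partitioner.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        // hypothetical customer keys, not taken from the article's data
        String[] keys = {"customer-1", "customer-2", "customer-1"};
        for (String key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
        // both records keyed "customer-1" report the same partition
    }
}
```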

&lt;p&gt;You can further improve this setup by using each separate shopping trip performed by the customer as the key. Customers perform multiple shopping trips, but each trip is unique and contains a sequence of events that must stay in exactly the same order when consumed. A shopping trip also contains fewer records than the overall activity of a customer, and is therefore less likely to lead to unbalanced partitions.&lt;/p&gt;
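&lt;p&gt;The trip-based key can be sketched as a composite of the customer and a trip identifier. Note that &lt;code&gt;tripId&lt;/code&gt; here is a hypothetical field that the messages generated earlier in this article don't include:&lt;/p&gt;

```java
public class TripKeySketch {
    // Composite key: all events of one shopping trip share the same key,
    // so they keep their order, while different trips can spread across partitions.
    static String tripKey(String customerId, String tripId) {
        return customerId + ":" + tripId;
    }

    public static void main(String[] args) {
        System.out.println(tripKey("customer-1", "trip-42")); // customer-1:trip-42
    }
}
```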

&lt;h2&gt;
  
  
  Final thoughts and next steps
&lt;/h2&gt;

&lt;p&gt;In this article we covered the first steps to start using Apache Kafka with the official Java client library. You can find the code used for this article in &lt;a href="https://github.com/anelook/apache-kafka-first-steps-java"&gt;a GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There is still a lot to uncover when using Apache Kafka, so if you'd like to learn more, check out these articles:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aiven.io/blog/kafka-simply-explained?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Apache Kafka® simply explained&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aiven.io/blog/apache-kafka-key-concepts?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Apache Kafka® key concepts, A glossary of terms related to Apache Kafka&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aiven.io/blog/balance-data-across-kafka-partitions?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Ways to balance your data across Apache Kafka® partitions&lt;/a&gt;&lt;br&gt;
&lt;a href="https://aiven.io/blog/what-is-karapace?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;What is Karapace? Find out more about the magic that is the schema registry!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or poke around our &lt;a href="https://docs.aiven.io/docs/products/kafka"&gt;Apache Kafka documentation&lt;/a&gt; and try out &lt;a href="https://console.aiven.io/signup?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven for Apache Kafka&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>java</category>
      <category>data</category>
    </item>
    <item>
      <title>From a data stream to the data warehouse</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Wed, 09 Nov 2022 07:16:10 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/from-a-data-stream-to-the-data-warehouse-3jh7</link>
      <guid>https://forem.com/olena_kutsenko/from-a-data-stream-to-the-data-warehouse-3jh7</guid>
<description>&lt;p&gt;Apache Kafka® and ClickHouse® are quite different, but also have a lot in common. They are both open source, highly scalable, work best with immutable data, and let us process large volumes of data, yet they do all of this in quite different ways. That’s why, instead of competing, these technologies actually complement each other quite well.&lt;/p&gt;

&lt;p&gt;Apache Kafka is amazing at handling real-time data feeds. However, in certain cases we need to come back to older records to analyse and process data at later times. This is challenging because Apache Kafka, a streaming platform, is not optimised to access large chunks of data and act as an OLAP (online analytical processing) engine.&lt;/p&gt;

&lt;p&gt;ClickHouse, on the other hand, is a scalable and reliable storage solution designed to handle petabytes of data and, at the same time, a powerful tool for fast online analytical processing, used by many companies for their data analytics.&lt;/p&gt;

&lt;p&gt;By combining both technologies we get a performant data warehouse in ClickHouse that stays up to date by constantly receiving fresh data from Apache Kafka.&lt;/p&gt;

&lt;p&gt;You can think of Apache Kafka topics as rivers where real-time data flows. ClickHouse, on the other hand, is the sea where all data eventually goes.&lt;/p&gt;

&lt;p&gt;With that, it's time to roll up our sleeves and try integrating these two data solutions in practice. Below, we'll create the services step by step, integrate them, and run some query experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create services
&lt;/h3&gt;

&lt;p&gt;To simplify the setup we'll be using managed versions of Apache Kafka and ClickHouse, both run by Aiven. If you don't have an Aiven account yet, no worries, &lt;a href="https://console.aiven.io/signup/email?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;registration&lt;/a&gt; is just a step away, and you can use a free trial for this experiment.&lt;/p&gt;

&lt;p&gt;You can create Aiven for ClickHouse and Aiven for Apache Kafka services directly from &lt;a href="https://console.aiven.io/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven's console&lt;/a&gt;. In the examples below I'm using &lt;code&gt;apache-kafka-service&lt;/code&gt; and &lt;code&gt;clickhouse-service&lt;/code&gt; as names for these services, but you can be more creative ;) &lt;/p&gt;

&lt;p&gt;Note that Aiven for ClickHouse needs at least a Startup plan to allow adding integrations.&lt;/p&gt;

&lt;p&gt;Once you've created the services, wait until they are completely deployed and are in &lt;code&gt;RUNNING&lt;/code&gt; state. Now you're ready for action! &lt;/p&gt;

&lt;h3&gt;
  
  
  Prepare Apache Kafka
&lt;/h3&gt;

&lt;p&gt;In order to move data from Apache Kafka to ClickHouse, we need to have some data in Apache Kafka in the first place. So we start by creating a topic in Apache Kafka. You can do this &lt;a href="https://developer.aiven.io/docs/products/kafka/howto/create-topic.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;directly from the Aiven console&lt;/a&gt;. Name it &lt;code&gt;measurements&lt;/code&gt;. Here we'll send continuous measurements for our imaginary set of devices.&lt;/p&gt;

&lt;p&gt;To imitate a continuous flow of new data we'll use a short bash script. In this script we create a JSON object with three properties: the timestamp of the event, the id of the device and a value. Then we send this object into the topic &lt;code&gt;measurements&lt;/code&gt; using &lt;code&gt;kcat&lt;/code&gt;. To understand how to set up &lt;code&gt;kcat&lt;/code&gt;, &lt;a href="https://developer.aiven.io/docs/products/kafka/howto/kcat.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;check this article&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; :
&lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;stamp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%s&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;RANDOM%100&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nv"&gt;val&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;RANDOM%1000&lt;span class="k"&gt;))&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;timestamp&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$stamp&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;device_id&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$id&lt;/span&gt;&lt;span class="s2"&gt;,&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;value&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;$val&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
        | kcat &lt;span class="nt"&gt;-F&lt;/span&gt; kcat.config &lt;span class="nt"&gt;-P&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; measurements
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the script and leave it running; it will continuously create and send messages to the topic.&lt;/p&gt;
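If you want to sanity-check the JSON that the script produces without touching Kafka or kcat, the same message-building logic can be run on its own:

```shell
# Sanity check (no Kafka or kcat needed): build one sample measurement
# using the same logic as the producer script above, and print it.
stamp=$(date +%s)
device_id=$((RANDOM % 100))
value=$((RANDOM % 1000))
msg="{\"timestamp\":$stamp,\"device_id\":$device_id,\"value\":$value}"
echo "$msg"
```

The printed line should look like `{"timestamp":1679416457,"device_id":42,"value":873}`, matching the three columns we'll define on the ClickHouse side.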

&lt;p&gt;Our work on the Apache Kafka side is done. Now let's move to ClickHouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Connect Aiven for ClickHouse to Apache Kafka
&lt;/h3&gt;

&lt;p&gt;You can actually integrate your Aiven for ClickHouse service with any Apache Kafka service, but for us, having two services within the same Aiven project makes the integration straightforward.&lt;/p&gt;

&lt;p&gt;To integrate Aiven for ClickHouse with Apache Kafka we need to complete two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Establish a connection.&lt;/li&gt;
&lt;li&gt;Specify the structure and origin of the integrated data. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We'll do these steps with help from the &lt;a href="https://developer.aiven.io/docs/tools/cli.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven CLI&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, establish the connection by creating an integration of type &lt;code&gt;clickhouse_kafka&lt;/code&gt;, specifying the names of your services with Apache Kafka as the source and ClickHouse as the destination:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avn service integration-create            \
    --integration-type clickhouse_kafka   \
    --source-service apache-kafka-service \
    --dest-service clickhouse-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this command won't return anything (unless there is a problem). But if you now check the list of available databases in your Aiven for ClickHouse service (with the help of &lt;a href="https://console.aiven.io/?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven's console&lt;/a&gt;, for example), you'll notice a new one: &lt;code&gt;service_apache-kafka-service&lt;/code&gt;. The name of the created database is the combination of &lt;code&gt;service_&lt;/code&gt; and your Apache Kafka service name.&lt;/p&gt;

&lt;p&gt;The database is still empty, because we haven't yet specified what data we want to bring from our Apache Kafka service. We can define the data source in a JSON payload, but first we need to find the id of our integration. You can get it by running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avn service integration-list clickhouse-service | grep apache-kafka-service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PesUJLXS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gapj73kf0t8usvwhxfwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PesUJLXS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gapj73kf0t8usvwhxfwh.png" alt="Image description" width="880" height="67"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case, the integration id was &lt;code&gt;88546a37-5a8a-4c0c-8bd7-80960e3adab0&lt;/code&gt;. Yours will be a different &lt;a href="https://en.wikipedia.org/wiki/Universally_unique_identifier"&gt;UUID&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Knowing the integration id, we can set the proper configuration for our connection, where we specify that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we want to bring data from the topic named &lt;code&gt;measurements&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;we expect the data to be in JSON format (in particular, &lt;code&gt;JSONEachRow&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;the data will be transformed into a table with three columns: &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;device_id&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    avn service integration-update 88546a37-5a8a-4c0c-8bd7-80960e3adab0 &lt;span class="se"&gt;\&lt;/span&gt;
      &lt;span class="nt"&gt;--user-config-json&lt;/span&gt; &lt;span class="s1"&gt;'{
          "tables": [
              {
                  "name": "measurements_from_kafka",
                  "columns": [
                      {"name": "timestamp", "type": "DateTime"},
                      {"name": "device_id", "type": "Int8"},
                      {"name": "value", "type": "Int16"}
                  ],
                  "topics": [{"name": "measurements"}],
                  "data_format": "JSONEachRow",
                  "group_name": "measurements_from_kafka_consumer"
              }
          ]
      }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ClickHouse tracks which messages from the topic have been consumed using the consumer group that you specify in the &lt;code&gt;group_name&lt;/code&gt; field; no extra effort is needed on your side. Each entry is read once per consumer group. If you want to consume your data twice, you can create a copy of the table with another group name.&lt;/p&gt;
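As a sketch of that second-consumer setup (the copy's table and group names below are made up for illustration), the `tables` array in the integration config would simply gain a second entry pointing at the same topic:

```json
{
    "tables": [
        {
            "name": "measurements_from_kafka",
            "columns": [
                {"name": "timestamp", "type": "DateTime"},
                {"name": "device_id", "type": "Int8"},
                {"name": "value", "type": "Int16"}
            ],
            "topics": [{"name": "measurements"}],
            "data_format": "JSONEachRow",
            "group_name": "measurements_from_kafka_consumer"
        },
        {
            "name": "measurements_from_kafka_copy",
            "columns": [
                {"name": "timestamp", "type": "DateTime"},
                {"name": "device_id", "type": "Int8"},
                {"name": "value", "type": "Int16"}
            ],
            "topics": [{"name": "measurements"}],
            "data_format": "JSONEachRow",
            "group_name": "measurements_copy_consumer"
        }
    ]
}
```

Because each table has its own consumer group, each receives every message from the topic independently.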

&lt;h3&gt;
  
  
  Consume Kafka messages on the fly from ClickHouse
&lt;/h3&gt;

&lt;p&gt;The setup we've done is already sufficient to start reading data from the Apache Kafka topic from within ClickHouse.&lt;br&gt;
The most convenient way to run ClickHouse SQL commands is with &lt;strong&gt;clickhouse-client&lt;/strong&gt;. If you're unsure how to run it, check &lt;a href="https://developer.aiven.io/docs/products/clickhouse/howto/use-cli.html"&gt;this article explaining how to use the CLI&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;For example, I used Docker and ran the client with the command below. Just replace USERNAME, PASSWORD, HOST and PORT with your values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--rm&lt;/span&gt; clickhouse/clickhouse-client &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--user&lt;/span&gt; USERNAME &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--password&lt;/span&gt; PASSWORD &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--host&lt;/span&gt; HOST &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--port&lt;/span&gt; PORT &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--secure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once in the client, you can check the list of databases&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;DATABASES&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hE_4N8-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4gqt7z33ghydv0v8vth.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hE_4N8-6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d4gqt7z33ghydv0v8vth.png" alt=" raw `show databases` endraw  terminal output" width="858" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You'll see the database we created by establishing the integration, &lt;code&gt;service_apache-kafka-service&lt;/code&gt; (yours will be named differently if you chose another name for your Apache Kafka service).&lt;/p&gt;

&lt;p&gt;If you get the list of tables from this database, you'll see the name of the table that you specified in the integration settings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="n"&gt;TABLES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`service_apache-kafka-service`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sVeJ8JJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ai6hb7ngjfxw6srnz6w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sVeJ8JJP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ai6hb7ngjfxw6srnz6w7.png" alt="show tables terminal output" width="866" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also double-check its structure with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DESCRIBE&lt;/span&gt; &lt;span class="nv"&gt;`service_apache-kafka-service`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;measurements_from_kafka&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1_VzW_VC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czk2xl64g982xnorekvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1_VzW_VC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/czk2xl64g982xnorekvo.png" alt="describe tables terminal output" width="880" height="198"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you might want to read from this table directly, and it will work. However, remember that we can consume messages only once! So once you read the items, they will be gone. Still, nothing stops you from running the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`service_apache-kafka-service`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;measurements_from_kafka&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;`service_apache-kafka-service`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;measurements_from_kafka&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--baymjKkq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ytaji9nlv87hd28e27uc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--baymjKkq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ytaji9nlv87hd28e27uc.png" alt="selecting items directly from the connecting table terminal output" width="880" height="252"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, this is not the most convenient way of consuming data from Apache Kafka, and apart from debugging it won't be used much. Most probably you want to copy the data into ClickHouse and keep it for later. And this is exactly what we'll do in the next section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Persist Kafka messages in a ClickHouse table
&lt;/h3&gt;

&lt;p&gt;To store the data coming from Apache Kafka in ClickHouse we need two pieces:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A destination table, where all data will be stored permanently.&lt;/li&gt;
&lt;li&gt;A materialised view that acts as a bridge between our connector table (&lt;code&gt;measurements_from_kafka&lt;/code&gt;) and our destination table.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can create them with these two queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE device_measurements (timestamp DateTime, device_id Int8, value Int16)
ENGINE = ReplicatedMergeTree()
ORDER BY timestamp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE MATERIALIZED VIEW materialised_view TO device_measurements AS
SELECT * FROM `service_apache-kafka-service`.measurements_from_kafka;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we create a materialised view, a trigger is actually added behind the scenes. This trigger reacts to any new data items added to our table &lt;code&gt;measurements_from_kafka&lt;/code&gt;. Once triggered, the data goes through the materialised view (where you can also transform it if you want) into the table &lt;code&gt;device_measurements&lt;/code&gt;.&lt;/p&gt;
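As an illustrative sketch of such an in-flight transformation (the view name here is made up), a materialised view could bucket each reading to the start of its minute on the way into the destination table:

```sql
-- Illustrative only: transform rows as they pass through the materialised view,
-- rounding each timestamp down to the minute before it lands in device_measurements.
CREATE MATERIALIZED VIEW materialised_view_by_minute TO device_measurements AS
SELECT
    toStartOfMinute(timestamp) AS timestamp,
    device_id,
    value
FROM `service_apache-kafka-service`.measurements_from_kafka;
```

Any expression valid in a `SELECT` (casts, arithmetic, string functions) can be applied here, as long as the result matches the destination table's columns.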

&lt;p&gt;You can check that the data is flowing by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;device_measurements&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can run a query to count all readings from the devices and see which devices have higher values on average. Here we use a nice and simple visualisation mechanism with the &lt;code&gt;bar&lt;/code&gt; function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;readings_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;average_measurement_value&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;device_measurements&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="k"&gt;ASC&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UfsFU6M4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9okugfayuu505sm7ufys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UfsFU6M4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9okugfayuu505sm7ufys.png" alt="visualising data with bar terminal output" width="880" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Now you are equipped with the skills to bring data into Aiven for ClickHouse and use materialised views to store it. Here are some more materials you might be interested in reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://aiven.io/blog/what-is-clickhouse?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;What is ClickHouse and how it achieves high performance&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.aiven.io/docs/products/clickhouse/getting-started.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Getting started with Aiven for ClickHouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Documentation on how to &lt;a href="https://developer.aiven.io/docs/products/clickhouse/howto/integrate-kafka.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;integrate Aiven for ClickHouse with Apache Kafka&lt;/a&gt; (with some extra details that are omitted here)&lt;/li&gt;
&lt;li&gt;Information on &lt;a href="https://developer.aiven.io/docs/products/clickhouse/howto/list-integrations.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;other integrations available with Aiven for ClickHouse&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>clickhouse</category>
      <category>apachekafka</category>
      <category>data</category>
      <category>streams</category>
    </item>
    <item>
      <title>Introduction to ClickHouse</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Tue, 08 Nov 2022 11:36:21 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/introduction-to-clickhouse-8em</link>
      <guid>https://forem.com/olena_kutsenko/introduction-to-clickhouse-8em</guid>
      <description>&lt;h2&gt;
  
  
  What is ClickHouse?
&lt;/h2&gt;

&lt;p&gt;ClickHouse is a highly scalable open source database management system (DBMS) that uses a column-oriented structure. It's designed for online analytical processing (OLAP) and is highly performant. ClickHouse can return processed results in real time in a fraction of a second. This makes it ideal for applications working with massive structured data sets: data analytics, complex data reports, data science computations...&lt;/p&gt;

&lt;p&gt;ClickHouse is most praised for its exceptionally high performance. That performance comes from a sum of many factors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Column-oriented data storage&lt;/li&gt;
&lt;li&gt;Data compression&lt;/li&gt;
&lt;li&gt;The vector computation engine&lt;/li&gt;
&lt;li&gt;Approximated calculations&lt;/li&gt;
&lt;li&gt;The use of physical sparse indices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But performance isn't the only benefit of ClickHouse. ClickHouse is more than a database, it's a sophisticated database management system that supports distributed query processing, partitioning, data replication and sharding. It's a highly scalable and reliable system capable of handling terabytes of data.&lt;/p&gt;

&lt;p&gt;In fact, ClickHouse is designed to write huge amounts of data and simultaneously process a large number of reading requests. And you can conveniently use a declarative SQL-like query language.&lt;/p&gt;

&lt;h2&gt;
  
  
  Main features of ClickHouse
&lt;/h2&gt;

&lt;p&gt;ClickHouse has a booming development community and continues to be actively developed and improved. You can look at &lt;a href="https://clickhouse.com/docs/en/whats-new/changelog/"&gt;the changelog&lt;/a&gt; and &lt;a href="https://clickhouse.com/docs/en/whats-new/roadmap/"&gt;the roadmap&lt;/a&gt; to see the latest features and future plans. Even as the system grows quickly, every new feature is evaluated performance-wise to make sure it doesn't slow the system down. And many of ClickHouse's biggest existing features are aimed specifically at enhancing its performance and efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Column-oriented DBMS
&lt;/h3&gt;

&lt;p&gt;As a truly columnar database, ClickHouse stores the values of the same column physically next to each other with no extra data attached to each value. This matters because even an insignificant amount of extra data (such as the length of a string) attached to hundreds of millions of items in a column substantially affects the speed of compression, decompression and reads.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data compression
&lt;/h3&gt;

&lt;p&gt;To achieve the desired performance, ClickHouse uses data compression. This includes general-purpose compression, as well as a number of specialised codecs targeting different types of data stored in separate columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query processing across multiple servers
&lt;/h3&gt;

&lt;p&gt;ClickHouse supports distributed query processing with data stored across different shards. Large queries are parallelized across multiple cores and use the resources they need.&lt;/p&gt;

&lt;h3&gt;
  
  
  SQL query syntax
&lt;/h3&gt;

&lt;p&gt;ClickHouse supports SQL syntax similar to ANSI SQL. However, it is not identical, so a migration from another SQL-compatible system might require translations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector computation engine
&lt;/h3&gt;

&lt;p&gt;During data processing, ClickHouse works with chunks of columns (so-called vectors) and operations are performed on the arrays of items, rather than on individual values.&lt;/p&gt;

&lt;h3&gt;
  
  
  No database locks
&lt;/h3&gt;

&lt;p&gt;ClickHouse updates tables continually without relying on locks when adding new data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Primary and data skipping indices
&lt;/h3&gt;

&lt;p&gt;ClickHouse keeps data physically sorted by primary key. Secondary indices (also called "data skipping indices") indicate in advance which data won't match the filtering criteria and should be skipped (hence the name).&lt;/p&gt;
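As a sketch (the table, column and index names here are hypothetical), a min/max skipping index records the value range of each block of rows, so blocks whose range cannot satisfy a filter are never read:

```sql
-- Hypothetical example: a min/max data skipping index on the value column.
ALTER TABLE readings ADD INDEX value_range value TYPE minmax GRANULARITY 4;

-- A filter like this can now skip whole blocks whose [min, max] range
-- cannot contain matching rows:
SELECT count() FROM readings WHERE value > 900;
```

How much is skipped depends on how well the filtered column correlates with the table's physical sort order.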

&lt;h3&gt;
  
  
  Approximated calculations
&lt;/h3&gt;

&lt;p&gt;To gain a further performance boost for complex queries, you can perform calculations on a sample of the data, finding a trade-off between accuracy and performance. This is relevant, for example, for complex data science calculations.&lt;/p&gt;
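For example (the table and column names below are hypothetical), ClickHouse ships approximate aggregate functions, and tables created with a SAMPLE BY clause can be queried over a fraction of their rows:

```sql
-- Hypothetical examples of trading accuracy for speed:
-- uniq() estimates the number of distinct values,
-- quantile() estimates a percentile,
-- and SAMPLE 0.1 reads roughly 10% of the data
-- (the table must have been created with a SAMPLE BY clause).
SELECT
    uniq(user_id)                    AS approx_unique_users,
    quantile(0.95)(response_time_ms) AS approx_p95_latency
FROM events SAMPLE 0.1;
```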

&lt;p&gt;While ClickHouse can be an excellent choice for many scenarios, it's important to keep in mind its architectural characteristics. Because ClickHouse is pretty unique, it's easy to make mistakes that lead to sub-optimal performance. That's why it is important to understand what stands behind this DBMS and how it functions.&lt;/p&gt;

&lt;p&gt;Let's start by looking at its most distinguishable feature - column-oriented structure of the storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a column-oriented database management system?
&lt;/h2&gt;

&lt;p&gt;To understand better where the need for the column-oriented approach is coming from and why ClickHouse uses it, let's take a closer look at two different types of systems: Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). In particular, pay attention to granularity with which they manipulate the data and to the types of operations that are prevalent in these systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  OLTP: Online Transaction Processing
&lt;/h3&gt;

&lt;p&gt;OLTP applications perform small but very frequent operations to insert, update and select a modest number of rows. In this type of application we traditionally use the row-oriented approach, as it is the most effective way to work with entire individual rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  OLAP: Online Analytical Processing
&lt;/h3&gt;

&lt;p&gt;OLAP systems are a completely different thing: operations do not target single rows; instead, we work with hundreds of thousands and even millions of records at a time, relying on grouping and aggregation mechanisms. Data in OLAP systems is represented by events and rarely needs to be updated. Importantly, usually only a fraction of the fields needs to be retrieved and processed at a time. This makes it very inefficient to read complete rows, like row-oriented systems do.&lt;/p&gt;

&lt;p&gt;The bottom line is, in OLTP applications records are stored for easy updates of individual rows, while in OLAP systems data is stored primarily for fast reads and analysis of massive chunks of data.&lt;/p&gt;

&lt;p&gt;Therefore, a row-oriented DBMS cannot effectively manage the analytical processing of data volumes typical of OLAP applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  OLAP and column-oriented systems
&lt;/h3&gt;

&lt;p&gt;Column-oriented systems were designed to solve OLAP challenges. In truly columnar databases, the data is physically grouped and stored by columns. This minimizes disk access and improves performance, because processing a specific query only requires reading a fraction of the data. Since each column contains data of the same type, it can use effective compression mechanisms.&lt;/p&gt;

&lt;p&gt;Additionally, the columnar approach allows adding or removing new columns with no performance overhead, since it means simply creating or deleting files. In contrast, adding a new column in a row-oriented database would require updating the data in every row.&lt;/p&gt;
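For illustration (the table and column names are hypothetical), adding a column in ClickHouse is a lightweight metadata change:

```sql
-- Hypothetical example: adding a column does not rewrite existing rows;
-- for old data parts the DEFAULT expression is computed on read.
ALTER TABLE readings ADD COLUMN unit String DEFAULT 'celsius';
```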

&lt;p&gt;Understanding the difference between OLAP and OLTP systems, and the distinction between row and columnar approaches, is key when deciding whether to use ClickHouse or not. In the next section we'll look into how this relates to specific system requirements, and what you should pay attention to when making a decision to adopt ClickHouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use ClickHouse
&lt;/h2&gt;

&lt;p&gt;If used correctly and in suitable scenarios, ClickHouse is a powerful, scalable and fast solution that outperforms its competitors. ClickHouse is made for OLAP applications, and includes a number of optimizations to read data and process complex requests at high speeds.&lt;/p&gt;

&lt;p&gt;You'll get the most out of ClickHouse if&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you work with enormous volumes of data (measured in terabytes) continuously written and read;&lt;/li&gt;
&lt;li&gt;you have tables with a massive number of columns (ClickHouse loves large numbers of columns!), but column values are reasonably short;&lt;/li&gt;
&lt;li&gt;your data is well-structured and not yet aggregated;&lt;/li&gt;
&lt;li&gt;you insert data in large batches of thousands of rows (a million is a good number);&lt;/li&gt;
&lt;li&gt;the vast majority of operations are reads with aggregations;&lt;/li&gt;
&lt;li&gt;for reads, you process a large number of rows, but a fairly low number of columns;&lt;/li&gt;
&lt;li&gt;you don't need to modify data later;&lt;/li&gt;
&lt;li&gt;you don't need to retrieve specific rows;&lt;/li&gt;
&lt;li&gt;you don't need transactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, Yandex uses over 500 servers with 25 million records coming in each day. Another company that uses ClickHouse, Bloomberg, has over a hundred servers and accepts approximately a trillion new records each day (as of data from 2018).&lt;/p&gt;

&lt;h2&gt;
  
  
  When not to use ClickHouse
&lt;/h2&gt;

&lt;p&gt;ClickHouse is designed to be fast. However, the optimisations that make ClickHouse the perfect solution for OLAP applications make it suboptimal for other types of projects.&lt;/p&gt;

&lt;p&gt;Do not use ClickHouse for OLTP. ClickHouse expects data to remain immutable. Even though it is technically possible to remove big chunks of data from the ClickHouse database, it is not fast. ClickHouse simply isn't designed for data modifications. It's also inefficient at finding and retrieving single rows by keys, due to sparse indexing. Lastly, ClickHouse does not fully support ACID transactions.&lt;/p&gt;

&lt;p&gt;ClickHouse is not a key-value DBMS. It is also not designed to be a file storage.&lt;/p&gt;

&lt;p&gt;It's not a document-oriented database, either. ClickHouse uses a pre-defined schema that needs to be specified during table creation. The better the schema, the more effective and performant the queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to get started
&lt;/h2&gt;

&lt;p&gt;I hope this has got you intrigued by ClickHouse and its superpowers. And maybe you're wondering how you can start using it on your own. ClickHouse is an open source project, and you can follow its documentation to build it yourself.&lt;/p&gt;

&lt;p&gt;However, we know that setting up and maintaining a ClickHouse cluster can be quite a challenge. Ensuring proper data replication, fault tolerance and stability takes plenty of time and energy. That's why Aiven has decided to offer Aiven for ClickHouse, which provides you with the benefits of ClickHouse without the headache overload.&lt;/p&gt;

&lt;p&gt;With Aiven for ClickHouse, you can focus on the product you are building, and we'll keep the underlying infrastructure running so smoothly that you can totally forget about it.&lt;/p&gt;

&lt;p&gt;So, how can you create Aiven for ClickHouse? Select Aiven for ClickHouse in the Aiven Console when creating a new service. Read the detailed instructions &lt;a href="https://docs.aiven.io/docs/products/clickhouse/getting-started.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;over here&lt;/a&gt;. Once the server is up and running (which happens in just a couple of minutes), &lt;a href="https://docs.aiven.io/docs/products/clickhouse/sample-dataset.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;add some test data&lt;/a&gt; and see how you can work &lt;a href="https://docs.aiven.io/docs/products/clickhouse/howto/add-service-users.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;with users&lt;/a&gt;, &lt;a href="https://docs.aiven.io/docs/products/clickhouse/concepts/databases-and-tables.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;tables and databases&lt;/a&gt;. To dive deeper and understand how indexing and data processing work in ClickHouse, &lt;a href="https://docs.aiven.io/docs/products/clickhouse/concepts/indexing.html?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;check this article&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>datawarehouse</category>
      <category>columnar</category>
      <category>database</category>
    </item>
    <item>
      <title>Ways to balance your data across Apache Kafka partitions</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Tue, 08 Nov 2022 10:57:32 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/ways-to-balance-your-data-across-apache-kafka-partitions-5f13</link>
      <guid>https://forem.com/olena_kutsenko/ways-to-balance-your-data-across-apache-kafka-partitions-5f13</guid>
      <description>&lt;p&gt;Apache Kafka® is a distributed system. At its heart is a &lt;strong&gt;set of brokers&lt;/strong&gt; that stores records persistently inside &lt;strong&gt;topics&lt;/strong&gt;. Topics, in turn, are split into &lt;strong&gt;partitions&lt;/strong&gt;. Dividing topics into such pieces allows storing and reading data in parallel. In this way producers and consumers can work with data simultaneously, achieving higher throughput and scalability.&lt;/p&gt;

&lt;p&gt;This makes partitions crucial for a performant cluster. Reading data from distributed locations comes with two big challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Message order:&lt;/strong&gt; Distributed systems split load-intensive tasks into multiple pieces that can be independently processed in parallel. In this way, we get results faster compared to the linear model. Unlike the linear approach, however, distributed systems &lt;strong&gt;by design do not guarantee the order of processed data&lt;/strong&gt;. That’s why for such systems to work successfully, we need to make sure that the data is properly divided into independent chunks and that we understand the effect of this division on the data ordering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Uneven record distribution:&lt;/strong&gt; Dividing data across partitions means there's a risk that partition records are distributed unevenly. To prevent this, our system needs to partition records intelligently, so that the data is proportionately balanced across available servers and across their local filesystems. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below we look deeper into these challenges and mechanisms to balance load over partitions to make the best use of the cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge of message order
&lt;/h2&gt;

&lt;p&gt;To understand what is happening with record ordering, take a look at the example visualized below. There you can see the data flow for a topic that is divided into three partitions. Messages are pushed by a producer and later retrieved by a consuming application one by one. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1iyCFgNE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzq03wid21363dno60i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1iyCFgNE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzq03wid21363dno60i4.png" alt="New records A-I being assigned to partitions 1-3 randomly. Partitions get ADG, BEH and CFI. The consumer receives IFCHEGBDA. The order within partitions is preserved." width="800" height="248"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When consuming data from distributed partitions, we cannot guarantee the order in which consumers go through the list of partitions. That's why the sequence of the messages read by a consumer ends up different from the original order sent by the producer.&lt;/p&gt;

&lt;p&gt;Reshuffling records can be totally fine for some scenarios, but for other cases you might want to read the messages in the same order as they were pushed by the producer.&lt;/p&gt;

&lt;p&gt;The solution to this challenge is to rely on the order of the records within a single partition, where the data is guaranteed to maintain the original sequence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YJSobWOd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp8kbw83n7e58huas2sp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YJSobWOd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vp8kbw83n7e58huas2sp.png" alt="Sequence order is guaranteed per partition (ADG, BEH and CFI), but not across partitions (the consumer gets ADBFEHCFI)" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's why, when building the product architecture, we should carefully weigh up the partitioning logic and mechanisms used to ensure that the sequence of the messages remains correct when consumers read the data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ways to partition data based on different scenarios
&lt;/h3&gt;

&lt;p&gt;The way messages are divided across partitions is always defined in the logic of the client, meaning that it is not the &lt;strong&gt;topic&lt;/strong&gt; which specifies this logic, but the &lt;strong&gt;producers&lt;/strong&gt;, who push the data into the cluster. In fact, if needed, different producers can have separate partitioning approaches.&lt;/p&gt;

&lt;p&gt;There are a variety of tools you can use to distribute data across partitions. To understand these alternatives we'll look at several scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario # 1: the order of messages is not important
&lt;/h3&gt;

&lt;p&gt;It's possible that, in your system, it is not necessary to preserve the order of messages. Lucky you! You can rely on the default partitioning mechanism provided by Apache Kafka and no additional logic is needed for the producers.&lt;/p&gt;

&lt;p&gt;As an example of this scenario, imagine a service to send SMS messages. Your organization uses SMS to notify customers, and the messages are divided across multiple partitions so that they can be consumed by different processing applications in parallel. We want to distribute the work and process the messages as fast as possible. However, the &lt;strong&gt;order&lt;/strong&gt; in which the SMS messages reach the recipients is not important. &lt;/p&gt;

&lt;p&gt;In such cases, Apache Kafka uses a &lt;strong&gt;sticky partitioning&lt;/strong&gt; approach (introduced as the default partitioner in version 2.4.0). This default method batches records together before sending them to the cluster. Once a batch is full or the linger time (&lt;code&gt;linger.ms&lt;/code&gt;) has elapsed, the batch is sent and a new one is created for a different partition. This approach helps decrease latency when producing messages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b7rdjXAL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhwjl8ssdb0lepi662y6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b7rdjXAL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhwjl8ssdb0lepi662y6.png" alt="New records A, B, C, D and are sent to partition 1 as a batch" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's a code snippet written in Java which sends a single message into a randomly assigned partition. This is the default behavior and doesn't require any additional logic on your side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;
   &lt;span class="c1"&gt;// add necessary properties to connect &lt;/span&gt;
   &lt;span class="c1"&gt;// to the cluster and set up security protocols&lt;/span&gt;
   &lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt; 

   &lt;span class="c1"&gt;// create a producer&lt;/span&gt;
   &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
          &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topicName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"topic-name"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;// generate new message&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"A message"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;

   &lt;span class="c1"&gt;// create a producer record&lt;/span&gt;
   &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
           &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

   &lt;span class="c1"&gt;// send data&lt;/span&gt;
   &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;send&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
   &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sent: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario # 2: the order is important for groups of records defined with a key
&lt;/h3&gt;

&lt;p&gt;Even though some scenarios do not require maintaining message sequence, the majority of cases do. Imagine, for example, that you run an online shop where customers trigger different events through your applications, and information about their activity is stored in a topic in an Apache Kafka cluster. In this scenario, the order of events for every single customer is important, while the order of events across the customers is irrelevant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LRFxWVKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyd1039p82bzbw68khu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LRFxWVKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qyd1039p82bzbw68khu0.png" alt="The order of actions taken by individual users are preserved, but don't need to stay adjacent." width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's why our goal is to preserve the correct sequence of the messages related to every individual customer. We can achieve this if we store the records for every individual customer consistently in a dedicated partition.&lt;/p&gt;

&lt;p&gt;The default partitioner can already do it for you, if you define a proper key for each of the messages.&lt;/p&gt;

&lt;p&gt;Every record in an Apache Kafka topic consists of two parts - the &lt;strong&gt;value&lt;/strong&gt; of the record and an &lt;strong&gt;optional key&lt;/strong&gt;. The key plays a decisive role in how messages are distributed across the partitions - all messages with the same key are added to the same partition.&lt;/p&gt;

&lt;p&gt;For our example, the most obvious choice for a key is the id of a customer, which we can use to partition the data. This is visualized below where, for simplicity, we assume that we have three customers (&lt;code&gt;John&lt;/code&gt;, &lt;code&gt;Claire&lt;/code&gt; and &lt;code&gt;Burt&lt;/code&gt;) and three partitions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0TWXd5Nw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/djdhxt3j7trtqprxl04q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0TWXd5Nw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/djdhxt3j7trtqprxl04q.png" alt="A key can be used to decide which partition an event goes to. Here, the user is the key. Partition 1 gets events for John, partition 2 those for Claire, and partition 3 those for Burt." width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because the partition is derived deterministically from the key, once data with the key &lt;code&gt;John&lt;/code&gt; is stored in a partition, Apache Kafka sends all future messages with an identical key to the same partition.&lt;/p&gt;

&lt;p&gt;This visualization includes just three customers, one for each partition. In real life you might need to store data for multiple customers (or devices, or vehicles, etc.) in a single partition.&lt;/p&gt;

&lt;p&gt;The code snippet below shows how to use a key when creating a record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;   &lt;span class="c1"&gt;// create a producer record&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customerId"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What's important to note is that Apache Kafka doesn't use the string representation of the key. Instead it converts the key into a &lt;strong&gt;hash value&lt;/strong&gt;, which means that there is a probability of a hash collision, when two different keys produce the same hash, resulting in their data being assigned to the same partition. Is this something you need to avoid? Scroll down to read about the custom partitioner!&lt;/p&gt;
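&lt;p&gt;To make this behavior concrete, here is a minimal, dependency-free sketch of hash-based partition assignment. Note that it is an illustration only: Kafka's real default partitioner hashes the serialized key bytes with the murmur2 algorithm, while this sketch uses a simple array hash as a stand-in. The class and method names are invented for the example.&lt;/p&gt;

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified sketch of key-based partition assignment: the same key always
// maps to the same partition, and two different keys can collide.
public class KeyPartitioningSketch {

    // map a key to one of numPartitions partitions
    public static int partitionFor(String key, int numPartitions) {
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        // mask off the sign bit so the modulo result is never negative
        int hash = Arrays.hashCode(keyBytes) & 0x7fffffff;
        return hash % numPartitions;
    }

    public static void main(String[] args) {
        // the same key always lands in the same partition
        System.out.println(
            partitionFor("John", 3) == partitionFor("John", 3)); // true
    }
}
```

&lt;p&gt;Running the same key through the function twice always yields the same partition number, which is exactly why keyed records keep their per-key ordering.&lt;/p&gt;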

&lt;h3&gt;
  
  
  Scenario # 3: partition numbers are known in advance
&lt;/h3&gt;

&lt;p&gt;Sometimes you want to control which message goes to which partition. For example, maybe the target partition depends on the day of the week when the data is generated. Assuming your system has seven partitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;   &lt;span class="c1"&gt;// create a producer record&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;get&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"customer"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;LocalDate&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LocalDate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;now&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;Integer&lt;/span&gt; &lt;span class="n"&gt;partitionNumber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDayOfWeek&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;getValue&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
   &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ProducerRecord&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;topicName&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;partitionNumber&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario # 4: achieve maximum flexibility
&lt;/h3&gt;

&lt;p&gt;The tools we've looked at above will help in many use cases. In some situations, however, you might need more flexibility and want to customize the partitioning logic even further. For this, Apache Kafka provides a mechanism to plug in a &lt;strong&gt;custom partitioner&lt;/strong&gt;, which divides the records across partitions based on the content of a message or some other conditions.&lt;/p&gt;

&lt;p&gt;You can use this approach if you want to group the data within a partition according to a custom logic. For example, if you know that some sources of data bring more records than others, you can group them so that no single partition is significantly bigger or smaller than others. Alternatively, you might want to use this approach if you want to base partitioning on a group of fields, but prefer to keep the key untouched.&lt;/p&gt;

&lt;p&gt;In a custom partitioner you have access to both the key and the value of the record before deciding which partition the message goes into. To create a custom partitioner you'll need to implement the &lt;code&gt;Partitioner&lt;/code&gt; interface and define the logic of its methods. Here is an example of a custom partitioner written in Java:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;customPartitioner&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Partitioner&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="o"&gt;?&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;configs&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;keyBytes&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Object&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="o"&gt;[]&lt;/span&gt; &lt;span class="n"&gt;valueBytes&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Cluster&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// get the list of available partitions&lt;/span&gt;
        &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;PartitionInfo&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cluster&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;partitionsForTopic&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;numPartitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;partitions&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

        &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;...;&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;partition&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you've defined the custom partitioner, reference it in your producer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;    &lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;put&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"partitioner.class"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"customPartitioner"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;KafkaProducer&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now every arriving record is analyzed by a custom partitioner before it is put into a designated partition.&lt;/p&gt;
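&lt;p&gt;As one illustration of the kind of logic a custom &lt;code&gt;partition()&lt;/code&gt; method might contain, here is a dependency-free sketch that routes records by their content: it assumes (purely for the example) a convention where values starting with &lt;code&gt;URGENT&lt;/code&gt; go to a dedicated partition, while everything else is spread over the remaining ones.&lt;/p&gt;

```java
// Sketch of content-based routing logic for a custom partitioner.
// The "URGENT" prefix is an invented convention for illustration;
// assumes numPartitions > 1.
public class RoutingSketch {

    public static int choosePartition(String value, int numPartitions) {
        if (value.startsWith("URGENT")) {
            return 0; // reserved partition for urgent records
        }
        // spread all other records over partitions 1..numPartitions-1
        int hash = value.hashCode() & 0x7fffffff;
        return 1 + hash % (numPartitions - 1);
    }
}
```

&lt;p&gt;Inside a real &lt;code&gt;Partitioner&lt;/code&gt; implementation, the same decision would be made in &lt;code&gt;partition()&lt;/code&gt;, using the deserialized value and the partition count obtained from the &lt;code&gt;Cluster&lt;/code&gt; object, as in the skeleton above.&lt;/p&gt;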

&lt;h3&gt;
  
  
  Scenario # 5: round robin and uniform sticky partitioning
&lt;/h3&gt;

&lt;p&gt;There are two more built-in partitioners that you can consider. The first is &lt;code&gt;RoundRobinPartitioner&lt;/code&gt;, which acts according to its name - iterating over the partitions and distributing records one by one, ignoring any provided key values. Round robin, unfortunately, is known to cause uneven distribution of records across partitions. Furthermore, it is less performant than the default sticky mechanism, where records are combined into batches to speed up producing. &lt;/p&gt;

&lt;p&gt;Another built-in partitioner is &lt;code&gt;UniformStickyPartitioner&lt;/code&gt;, which acts similarly to &lt;code&gt;DefaultPartitioner&lt;/code&gt; but ignores the key value. &lt;/p&gt;

&lt;h2&gt;
  
  
  Challenge of uneven record distribution
&lt;/h2&gt;

&lt;p&gt;When defining partitioning logic, carefully evaluate how your partitions will grow over time, and understand whether the selected mechanism risks distributing messages unevenly. &lt;/p&gt;

&lt;p&gt;There are a variety of scenarios when uneven distribution can happen.&lt;/p&gt;

&lt;p&gt;One example is when the default partitioner sends a huge batch of data to a single partition. When using the default partitioner, choose linger time (&lt;code&gt;linger.ms&lt;/code&gt;) and maximum batch size (&lt;code&gt;batch.size&lt;/code&gt;) settings that fit your particular scenario. For example, if your product is frequently used during the day, but almost no records come in at night, it is common to set the linger time low and the batch size high. With these settings, however, an unexpected surge of data can end up in a single batch and be sent to a single partition, leading to uneven message distribution.&lt;/p&gt;
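&lt;p&gt;These two settings are ordinary producer properties. The sketch below shows where they go; the specific values are illustrative examples for the trade-off just described, not recommendations (the class and method names are invented for the example):&lt;/p&gt;

```java
import java.util.Properties;

// Illustrative producer settings for the batching trade-off: a low linger
// time flushes batches quickly, while a large batch size lets a burst of
// records pile up into one batch (and thus one partition under sticky
// partitioning).
public class BatchingConfigSketch {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("linger.ms", "5");      // wait at most 5 ms before sending a batch
        props.put("batch.size", "65536"); // up to 64 KiB of records per batch
        return props;
    }
}
```

&lt;p&gt;These properties would be merged into the same &lt;code&gt;Properties&lt;/code&gt; object used to construct the &lt;code&gt;KafkaProducer&lt;/code&gt;, alongside the connection and security settings.&lt;/p&gt;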

&lt;p&gt;Another case of uneven message distribution can happen when you distribute records by key, but the amount of data related to some keys is significantly bigger than for others. For instance, imagine that you run an image gallery service and divide data across partitions by user id. If some of your users use the service much more frequently than others, they produce far more records, inflating the size of some partitions.&lt;/p&gt;

&lt;p&gt;Similar to the scenario above, if you rely on days and times to distribute the data, some dates - such as Black Friday or Christmastime - can generate considerably more records.&lt;/p&gt;

&lt;p&gt;Additionally, uneven distribution can happen when you move data from other data sources with the help of Kafka Connect. Make sure that the data is not heavily written to a single partition, but distributed evenly.&lt;/p&gt;

&lt;p&gt;Overall, uneven message distribution is a complex problem that's easier to prevent than to solve later. Rebalancing messages across partitions is a challenging task because in many scenarios partitions preserve the necessary order of the messages, and rebalancing can destroy the correct sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Apache Kafka provides a set of tools to distribute records across multiple partitions. However, the responsibility for a durable architecture, and for selecting the right message distribution strategy, lies with the engineers building the system.  &lt;/p&gt;

&lt;p&gt;If you'd like to learn more about Apache Kafka, check out these articles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aiven.io/blog/what-is-apache-kafka?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;What is Apache Kafka®?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aiven.io/blog/db-technology-migration-with-apache-kafka-and-kafka-connect?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Database migration with Apache Kafka® and Apache Kafka® Connect&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aiven.io/blog/using-kafka-connect-jdbc-source-a-postgresql-example?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Using Kafka Connect JDBC Source: a PostgreSQL® example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aiven.io/blog/manage-apache-kafka-connect-connectors-with-kcctl?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Manage Apache Kafka® Connect connectors with kcctl&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Or poke around &lt;a href="https://developer.aiven.io/docs/products/kafka?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;our Apache Kafka® documentation&lt;/a&gt; and try out &lt;a href="https://console.aiven.io/signup?utm_source=devto&amp;amp;utm_medium=organic&amp;amp;utm_campaign=blog_art"&gt;Aiven for Apache Kafka&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>partitioning</category>
      <category>data</category>
    </item>
    <item>
      <title>Apache Kafka Simply Explained</title>
      <dc:creator>Olena Kutsenko</dc:creator>
      <pubDate>Wed, 06 Jul 2022 12:22:37 +0000</pubDate>
      <link>https://forem.com/olena_kutsenko/apache-kafka-simply-explained-18d4</link>
      <guid>https://forem.com/olena_kutsenko/apache-kafka-simply-explained-18d4</guid>
      <description>&lt;p&gt;Apache Kafka is a de facto standard for data streaming. There is a lot of interest and usage of Apache Kafka, but the learning curve can be steep. So here is my attempt to explain it using simple and friendly vocabulary:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aiven.io/blog/kafka-simply-explained"&gt;https://aiven.io/blog/kafka-simply-explained&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The article covers the following topics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Apache Kafka?&lt;/li&gt;
&lt;li&gt;Where Apache Kafka is used&lt;/li&gt;
&lt;li&gt;Apache Kafka’s way of thinking&lt;/li&gt;
&lt;li&gt;How Apache Kafka coordinates events&lt;/li&gt;
&lt;li&gt;Topics and messages&lt;/li&gt;
&lt;li&gt;Brokers and partitions&lt;/li&gt;
&lt;li&gt;Apache Kafka connectors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you’ll find it useful!&lt;/p&gt;

</description>
      <category>apachekafka</category>
      <category>data</category>
      <category>streaming</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
