<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hamed Karbasi</title>
    <description>The latest articles on Forem by Hamed Karbasi (@hoptical).</description>
    <link>https://forem.com/hoptical</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1050129%2Fecf5ea92-076e-46bd-a7ba-b42ba181c3c8.jpg</url>
      <title>Forem: Hamed Karbasi</title>
      <link>https://forem.com/hoptical</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hoptical"/>
    <language>en</language>
    <item>
      <title>ClickHouse Advanced Tutorial: Apply CDC from MySQL to ClickHouse</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Thu, 15 Jun 2023 20:48:30 +0000</pubDate>
      <link>https://forem.com/hoptical/clickhouse-advanced-tutorial-apply-cdc-from-mysql-to-clickhouse-44na</link>
      <guid>https://forem.com/hoptical/clickhouse-advanced-tutorial-apply-cdc-from-mysql-to-clickhouse-44na</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Suppose that you have a database handling &lt;a href="https://en.wikipedia.org/wiki/Online_transaction_processing" rel="noopener noreferrer"&gt;OLTP&lt;/a&gt; queries. To tackle intensive analytical BI reports, you set up an &lt;a href="https://en.wikipedia.org/wiki/Online_analytical_processing" rel="noopener noreferrer"&gt;OLAP&lt;/a&gt;-friendly database such as ClickHouse. How do you synchronize your follower database (which is ClickHouse here)? What challenges should you be prepared for?&lt;/p&gt;

&lt;p&gt;Synchronizing two or more databases in a data-intensive application is one of the usual routines you may have encountered before or are dealing with now. Thanks to Change Data Capture (CDC) and technologies such as Kafka, this process is not sophisticated anymore. However, depending on the databases you’re utilizing, it could be challenging if the source database works in the OLTP paradigm and the target in the OLAP. In this article, I will walk through this process from MySQL as the source to ClickHouse as the destination. Although I’ve limited this article to those technologies, it’s pretty generalizable to similar cases.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Design Overview
&lt;/h2&gt;

&lt;p&gt;Contrary to what it sounds like, it’s quite straightforward. The database changes are captured via &lt;a href="https://debezium.io/" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; and published as events on Apache Kafka. ClickHouse consumes those changes, in per-partition order, via the &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka/" rel="noopener noreferrer"&gt;Kafka Engine&lt;/a&gt;. Real-time and eventually consistent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2950%2F1%2AUBnMzAphYgnyM79_gcZv7g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2950%2F1%2AUBnMzAphYgnyM79_gcZv7g.jpeg"&gt;&lt;/a&gt;&lt;br&gt;CDC Architecture
  &lt;/p&gt;

&lt;h2&gt;
  
  
  Case Study
&lt;/h2&gt;

&lt;p&gt;Imagine that we have an &lt;em&gt;orders&lt;/em&gt; table in MySQL with the following DDL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;`orders`&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="nb"&gt;varchar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`id`&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;InnoDB&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;CHARSET&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;latin1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Users may create records, update any column, or delete whole records. We want to capture those changes and sink them into ClickHouse to keep it synchronized.&lt;/p&gt;

&lt;p&gt;We’re going to use Debezium &lt;em&gt;v2.1&lt;/em&gt; and the &lt;em&gt;ReplacingMergeTree&lt;/em&gt; engine in ClickHouse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: CDC with Debezium
&lt;/h3&gt;

&lt;p&gt;Most databases keep a log to which every operation is written before it is applied to the data (the Write-Ahead Log, or WAL). In MySQL, this file is called the Binlog. If you read that file, then parse and apply it to your target database, you’re following the Change Data Capture (CDC) approach.&lt;/p&gt;

&lt;p&gt;CDC is one of the best ways to synchronize two or more heterogeneous databases. It’s real-time, eventually consistent, and spares you costlier alternatives such as batch backfills with Airflow. No matter what happens on the source, you can capture it in order and stay consistent with the original (eventually, of course!)&lt;/p&gt;

&lt;p&gt;Debezium is a well-known tool for reading and parsing the Binlog. It simply integrates with Kafka Connect as a connector and produces every change on a Kafka topic.&lt;/p&gt;

&lt;p&gt;To do so, you have to enable log-bin on the MySQL database and set up Kafka Connect, Kafka, and Debezium accordingly. Since this is well explained in other articles like &lt;a href="https://rmoff.net/2018/03/24/streaming-data-from-mysql-into-kafka-with-kafka-connect-and-debezium/" rel="noopener noreferrer"&gt;this&lt;/a&gt; or &lt;a href="https://medium.com/nagoya-foundation/simple-cdc-with-debezium-kafka-a27b28d8c3b8" rel="noopener noreferrer"&gt;this&lt;/a&gt;, I’ll only focus on the Debezium configuration customized for our purpose: capturing the changes in a form that is functional and parsable by ClickHouse.&lt;/p&gt;
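&lt;p&gt;As a quick check before registering the connector (a sketch; exact settings depend on your MySQL version and deployment), you can verify that the Binlog is enabled and row-based:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Should return ON; otherwise start mysqld with log-bin set in my.cnf
SHOW VARIABLES LIKE 'log_bin';

-- Debezium needs row-based logging to capture full row images
SHOW VARIABLES LIKE 'binlog_format';  -- expected: ROW
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;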

&lt;p&gt;Before showing the whole configuration, we should discuss three necessary configs:&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting New Record State
&lt;/h3&gt;

&lt;p&gt;By default, Debezium emits, for every operation, a record containing both the &lt;em&gt;before&lt;/em&gt; and &lt;em&gt;after&lt;/em&gt; states, which is hard to parse with the ClickHouse Kafka table. Additionally, it creates tombstone records (i.e., records with a Null value) for delete operations (again, unparsable by ClickHouse). The entire behavior is demonstrated in the table below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A1lEoL6n7uxLRy-lrMre1hQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A1lEoL6n7uxLRy-lrMre1hQ.png"&gt;&lt;/a&gt;&lt;br&gt;Records state for different operations in the default configuration.
  &lt;/p&gt;
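&lt;p&gt;To make this concrete, a simplified sketch of the default Debezium value for an update on our &lt;em&gt;orders&lt;/em&gt; table looks roughly like this (envelope metadata fields such as &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;ts_ms&lt;/code&gt; omitted; the values are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "before": { "id": 1, "status": "pending", "price": "100" },
  "after":  { "id": 1, "status": "shipped", "price": "100" },
  "op": "u"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;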

&lt;p&gt;We use the ExtractNewRecordState transformer in the Debezium configuration to handle this problem. With this option, Debezium only keeps the &lt;em&gt;after&lt;/em&gt; state for &lt;strong&gt;create/update&lt;/strong&gt; operations and disregards the before state. As a drawback, however, it drops the &lt;strong&gt;delete&lt;/strong&gt; record containing the previous state, as well as the tombstone record mentioned earlier. In other words, you won’t capture the delete operation anymore. Don’t worry! We’ll tackle it in the next section.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The picture below shows how the &lt;em&gt;before&lt;/em&gt; state is dropped and the &lt;em&gt;after&lt;/em&gt; state is flattened when using the &lt;em&gt;ExtractNewRecordState&lt;/em&gt; configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2Rtrofg166dxl1xOdqwkzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A2Rtrofg166dxl1xOdqwkzw.png"&gt;&lt;/a&gt;&lt;br&gt;Left: Record without ExtractNewRecord config; Right: Record with ExtractNewRecord config
  &lt;/p&gt;

&lt;h3&gt;
  
  
  Rewriting Delete Events
&lt;/h3&gt;

&lt;p&gt;To capture delete operations, we must add the &lt;strong&gt;rewrite&lt;/strong&gt; config as below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"transforms.unwrap.delete.handling.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"rewrite"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this config, Debezium adds a &lt;code&gt;__deleted&lt;/code&gt; field, which is true for delete operations and false for the others. Hence, a deletion event contains the previous state along with a &lt;code&gt;__deleted: true&lt;/code&gt; field.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABU7--CdPTPySFR2DzjF72A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ABU7--CdPTPySFR2DzjF72A.png"&gt;&lt;/a&gt;&lt;br&gt;The field __deleted is added after using the rewrite configuration
  &lt;/p&gt;
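&lt;p&gt;For illustration, a delete event after the unwrap and rewrite transforms would look roughly like this flattened record (a sketch; the values are made up):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "id": 1,
  "status": "shipped",
  "price": "100",
  "__deleted": "true"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;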

&lt;h3&gt;
  
  
  Handling Non-Primary Keys Update
&lt;/h3&gt;

&lt;p&gt;With the configurations above, updating a record (any column except the primary key) emits a simple record with the new state. Having another relational database with the same DDL as the destination is fine, since the updated record replaces the previous one. But in the case of ClickHouse, the story goes wrong!&lt;/p&gt;

&lt;p&gt;In our example, the source uses &lt;em&gt;id&lt;/em&gt; as the primary key, while ClickHouse uses &lt;em&gt;id&lt;/em&gt; and &lt;em&gt;status&lt;/em&gt; as its order key. Replacement and uniqueness are only guaranteed for records with the same &lt;em&gt;id&lt;/em&gt; and &lt;em&gt;status&lt;/em&gt;! So what happens if the source updates the &lt;em&gt;status&lt;/em&gt; column? We end up with duplicate records in ClickHouse: equal &lt;em&gt;ids&lt;/em&gt; but different &lt;em&gt;statuses&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;Fortunately, there is a way. By default, Debezium emits a delete record and a create record for updates on primary keys. So if the source updates the &lt;em&gt;id&lt;/em&gt;, it emits a &lt;strong&gt;delete&lt;/strong&gt; record with the previous &lt;em&gt;id&lt;/em&gt; and a &lt;strong&gt;create&lt;/strong&gt; record with the new &lt;em&gt;id&lt;/em&gt;. The previous record, carrying &lt;code&gt;__deleted=true&lt;/code&gt;, replaces our stale record in ClickHouse, and the records implying deletion can then be filtered out in the view. We can extend this behavior to other columns with the option below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"message.key.columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory.orders:id;inventory.orders:status"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, by putting together all the options above along with the usual ones, we’ll have a fully functional Debezium configuration that captures changes in a form ClickHouse can handle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql-connector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.mysql.MySqlConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mypassword"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"3306"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.server.id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.server.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.inventory.v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"root"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"message.key.columns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory.orders:id;inventory.orders:status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mysql-connector-v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"broker:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.inventory.history.v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"snapshot.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema_only"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"table.include.list"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inventory.orders"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbz.inventory.v2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.delete.handling.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rewrite"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Important: How to choose the Debezium key columns?
&lt;/h3&gt;

&lt;p&gt;By changing the key columns of the connector, Debezium uses those columns as the topic keys instead of the default primary key of the source table. Consequently, different operations related to a single database record may end up in different partitions in Kafka. Since records lose their relative order across partitions, this can lead to inconsistency in ClickHouse unless you ensure that the ClickHouse order keys and the Debezium message keys are the same.&lt;/p&gt;

&lt;p&gt;The rule of thumb is as below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Design the partition key and order key based on your desired table design.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Trace the partition and sort keys back to their source columns, in case they are computed during materialization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Take the union of all those columns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the result of step 3 as the &lt;em&gt;message.key.columns&lt;/em&gt; in the Debezium connector configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check whether the ClickHouse sort key has all those columns. If not, add them.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
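&lt;p&gt;Applied to our &lt;em&gt;orders&lt;/em&gt; example: the desired ClickHouse sort key is (&lt;em&gt;id&lt;/em&gt;, &lt;em&gt;status&lt;/em&gt;), both columns come directly from the source table, so their union becomes the message key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;"message.key.columns": "inventory.orders:id;inventory.orders:status"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;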

&lt;h2&gt;
  
  
  Step 2: ClickHouse Tables
&lt;/h2&gt;

&lt;p&gt;ClickHouse can sink Kafka records into a table by utilizing the &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka/" rel="noopener noreferrer"&gt;Kafka Engine&lt;/a&gt;. We need to define three tables: the Kafka table, the consumer materialized view, and the main table.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kafka Table
&lt;/h3&gt;

&lt;p&gt;The Kafka table defines the record structure and the Kafka topic to read from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`__deleted`&lt;/span&gt; &lt;span class="k"&gt;Nullable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Kafka&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'broker:9092'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'inventory.orders'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'clickhouse'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'AvroConfluent'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;format_avro_schema_registry_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'http://schema-registry:8081'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Consumer Materializer
&lt;/h3&gt;

&lt;p&gt;Every record in the Kafka table is read only once: its consumer group bumps the offset, so we can’t read it twice. Therefore, we define a main table and materialize every Kafka table record into it via a materialized view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;consumer__orders&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`__deleted`&lt;/span&gt; &lt;span class="k"&gt;Nullable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;__deleted&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;__deleted&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kafka_orders&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Main Table
&lt;/h3&gt;

&lt;p&gt;The main table has the source structure plus the &lt;code&gt;__deleted&lt;/code&gt; field. I’m using a ReplacingMergeTree, since we need to replace stale records with their deleted or updated versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`__deleted`&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReplacingMergeTree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  View Table
&lt;/h3&gt;

&lt;p&gt;Finally, we need to filter out every deleted record (since we don’t want to see them) and keep only the most recent record when several share the same sort key. This can be tackled with the &lt;em&gt;FINAL&lt;/em&gt; modifier. But to avoid writing the filter and FINAL in every query, we can define a simple view to do the job implicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`__deleted`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stream_orders&lt;/span&gt;
&lt;span class="k"&gt;FINAL&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;__deleted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'false'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: It’s inefficient to use FINAL for every query, especially in production. You can use aggregations to retrieve the latest records, or wait for ClickHouse to merge records in the background.&lt;/em&gt;&lt;/p&gt;
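&lt;p&gt;As a sketch of the aggregation alternative, assuming you extend the main table with a version column (e.g., a hypothetical &lt;code&gt;ts_ms&lt;/code&gt; taken from the Debezium source metadata), you could read the latest visible state without FINAL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Latest state per id; ts_ms is a hypothetical version column
SELECT
    id,
    argMax(status, ts_ms) AS status,
    argMax(price, ts_ms) AS price
FROM default.stream_orders
GROUP BY id
HAVING argMax(__deleted, ts_ms) = 'false'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;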

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we saw how to synchronize ClickHouse with MySQL via CDC, handling deletes with a soft-delete approach and preventing duplicates by aligning the Debezium message keys with the ClickHouse sort key.&lt;/p&gt;

</description>
      <category>cdc</category>
      <category>clickhouse</category>
      <category>mysql</category>
      <category>debezium</category>
    </item>
    <item>
      <title>ClickHouse Advanced Tutorial: Performance Comparison with MySQL</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Thu, 08 Jun 2023 10:14:24 +0000</pubDate>
      <link>https://forem.com/hoptical/clickhouse-advanced-tutorial-performance-comparison-with-mysql-2cj2</link>
      <guid>https://forem.com/hoptical/clickhouse-advanced-tutorial-performance-comparison-with-mysql-2cj2</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;p&gt; 1. Introduction&lt;/p&gt;

&lt;p&gt;       1.1. OLTP&lt;/p&gt;

&lt;p&gt;       1.2. OLAP&lt;/p&gt;

&lt;p&gt;       1.3. MySQL&lt;/p&gt;

&lt;p&gt;       1.4. ClickHouse&lt;br&gt;
 2. Comparison Case Study&lt;/p&gt;

&lt;p&gt;       2.1. System Specification&lt;/p&gt;

&lt;p&gt;       2.2. Benchmark Flow&lt;/p&gt;

&lt;p&gt;       2.3. Queries&lt;br&gt;
 3. Results&lt;/p&gt;

&lt;p&gt;       3.1. Dataset Load&lt;/p&gt;

&lt;p&gt;       3.2. Table Size&lt;/p&gt;

&lt;p&gt;       3.3. Read Queries Execution&lt;/p&gt;

&lt;p&gt;       3.4. Update Query Execution&lt;br&gt;
 4. Conclusion&lt;/p&gt;


  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.azquotes.com%2Fpicture-quotes%2Fquote-it-s-more-important-to-know-your-weaknesses-than-your-strengths-ray-l-hunt-102-72-52.jpg"&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Nothing is perfect. In terms of databases, you can't expect the best performance for every task and query from your deployed database. The vital step, as a software developer, is to know its strengths and weaknesses and how to deal with them.&lt;/p&gt;

&lt;p&gt;In this post, I will compare ClickHouse, as a representative of OLAP databases, with MySQL, representing OLTP. This will help us choose better solutions for our challenges according to our conditions and needs. Before jumping into the main context, let's discuss OLTP, OLAP, MySQL, and ClickHouse.&lt;/p&gt;
&lt;h3&gt;
  
  
  OLTP
&lt;/h3&gt;

&lt;p&gt;OLTP stands for Online Transaction Processing and is used for day-to-day operations, such as processing orders and updating customer information. OLTP is best for short, fast transactions and is optimized for quick response times. It is essential to ensure data accuracy and consistency and provide an efficient way to access data.&lt;/p&gt;
&lt;h3&gt;
  
  
  OLAP
&lt;/h3&gt;

&lt;p&gt;OLAP stands for Online Analytical Processing and is used for data mining and analysis. It enables organizations to analyze large amounts of data from multiple perspectives and identify trends and patterns. OLAP is best for complex queries and data mining, and can provide insights that are impossible with traditional reporting tools.&lt;/p&gt;
&lt;h3&gt;
  
  
  MySQL
&lt;/h3&gt;

&lt;p&gt;MySQL is a popular open-source relational database management system, used by websites and applications to store and manage their data. It holds data in tables, lets users query that data, and provides features such as triggers, stored procedures, and views. MySQL is easy to use and offers a wide range of features for building powerful, efficient applications.&lt;/p&gt;
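&lt;p&gt;As a small sketch of one of those features (the table names are hypothetical), a view can hide a join behind a simple name:&lt;/p&gt;

```sql
-- Hypothetical view wrapping a join.
CREATE VIEW customer_orders AS
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

-- Queried like an ordinary table.
SELECT * FROM customer_orders WHERE name = 'Alice';
```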
&lt;h3&gt;
  
  
  ClickHouse
&lt;/h3&gt;

&lt;p&gt;ClickHouse is an open-source column-oriented database management system developed by Yandex. It is designed to provide high performance for analytical queries.&lt;br&gt;
ClickHouse uses a SQL-like query language and supports various data types, including integers, strings, dates, and floats. It offers features such as clustering, distributed query processing, and fault tolerance, and also supports replication and data sharding. You can learn more about this database in the first part of this series:&lt;/p&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/hoptical" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1050129%2Fecf5ea92-076e-46bd-a7ba-b42ba181c3c8.jpg" alt="hoptical"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/hoptical/clickhouse-basic-tutorial-an-introduction-52il" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;ClickHouse Basic Tutorial: An Introduction&lt;/h2&gt;
      &lt;h3&gt;Hamed Karbasi ・ Apr 13 '23&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#clickhouse&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#tutorial&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;


&lt;p&gt;Now we can talk about the performance comparison.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison Case Study
&lt;/h2&gt;

&lt;p&gt;I've followed the &lt;a href="https://github.com/ClickHouse/ClickBench" rel="noopener noreferrer"&gt;ClickBench&lt;/a&gt; repository methodology for the case study. It uses the &lt;em&gt;hits&lt;/em&gt; dataset, obtained from the actual traffic recording of one of the world's largest web analytics platforms. &lt;code&gt;hits&lt;/code&gt; contains about 100M rows in a single flat table. The repository studies more than 20 databases regarding dataset load time, elapsed time for 43 OLAP queries, and occupied storage. You can access their visualized results &lt;a href="https://benchmark.clickhouse.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ClickHouse" rel="noopener noreferrer"&gt;
        ClickHouse
      &lt;/a&gt; / &lt;a href="https://github.com/ClickHouse/ClickBench" rel="noopener noreferrer"&gt;
        ClickBench
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      ClickBench: a Benchmark For Analytical Databases
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;ClickBench: a Benchmark For Analytical Databases&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://benchmark.clickhouse.com/" rel="nofollow noopener noreferrer"&gt;https://benchmark.clickhouse.com/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Discussion: &lt;a href="https://news.ycombinator.com/item?id=32084571" rel="nofollow noopener noreferrer"&gt;https://news.ycombinator.com/item?id=32084571&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;This benchmark represents typical workload in the following areas: clickstream and traffic analysis, web analytics, machine-generated data, structured logs, and events data. It covers the typical queries in ad-hoc analytics and real-time dashboards.&lt;/p&gt;
&lt;p&gt;The dataset from this benchmark was obtained from the actual traffic recording of one of the world's largest web analytics platforms. It is anonymized while keeping all the essential distributions of the data. The set of queries was improvised to reflect the realistic workloads, while the queries are not directly from production.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Goals&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;The main goals of this benchmark are:&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Reproducibility&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;You can quickly reproduce every test in as little as 20 minutes (although some systems may take several hours) in a semi-automated way. The test setup is documented and uses inexpensive cloud VMs. The test process is documented in the form of a shell script, covering…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ClickHouse/ClickBench" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;To investigate ClickHouse and MySQL performance specifically, I separated 10M rows of the table and chose some of the predefined &lt;a href="https://github.com/ClickHouse/ClickBench/blob/main/mysql/queries.sql" rel="noopener noreferrer"&gt;queries&lt;/a&gt; that make our point clearer. Those queries are mainly OLAP-style, so they only show ClickHouse's strengths compared to MySQL (i.e., MySQL loses on all of them). Hence, I added other, OLTP-style queries showing the opposite. Although I've limited the benchmark to these two databases, you can generalize the concept to other row-oriented and column-oriented DBMSs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; This benchmark only clarifies the main differences between column-oriented and row-oriented databases regarding their performance and use cases. It should not be considered a reference for your use cases; you should run your own benchmarks with your own queries to make the best decision.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  System Specification
&lt;/h3&gt;

&lt;p&gt;Databases are installed on Ubuntu 22.04 LTS on a system with the below specifications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU: Intel® Core™ i7-10510U CPU @ 1.80GHz × 8&lt;/li&gt;
&lt;li&gt;RAM: 16 GiB&lt;/li&gt;
&lt;li&gt;Storage: 256 GiB SSD&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Benchmark Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;The database is created.&lt;/li&gt;
&lt;li&gt;The table is created with the &lt;a href="https://github.com/ClickHouse/ClickBench/blob/main/mysql/create.sql" rel="noopener noreferrer"&gt;defined DDL&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Data (&lt;code&gt;hits.tsv&lt;/code&gt;) is loaded into the table, and its time is measured.&lt;/li&gt;
&lt;li&gt;Queries are run, and each query's elapsed time is measured.&lt;/li&gt;
&lt;/ol&gt;
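&lt;p&gt;Assuming &lt;code&gt;hits.tsv&lt;/code&gt; sits in the working directory, step 3 can be sketched as follows (the ClickHouse statement is run in &lt;code&gt;clickhouse-client&lt;/code&gt;; MySQL needs &lt;code&gt;local_infile&lt;/code&gt; enabled):&lt;/p&gt;

```sql
-- ClickHouse: the client reads the local file and streams it to the server.
INSERT INTO hits FROM INFILE 'hits.tsv' FORMAT TSV;

-- MySQL equivalent.
LOAD DATA LOCAL INFILE 'hits.tsv' INTO TABLE hits;
```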

&lt;h3&gt;
  
  
  Queries
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Number&lt;/th&gt;
&lt;th&gt;Statement&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT COUNT(*) FROM hits;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT URL, COUNT(*) AS PageViews FROM hits WHERE CounterID = 62 AND EventDate &amp;gt;= '2013-07-01' AND EventDate &amp;lt;= '2013-07-31' AND DontCountHits = 0 AND IsRefresh = 0 AND URL &amp;lt;&amp;gt; '' GROUP BY URL ORDER BY PageViews DESC LIMIT 10;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT WatchID, ClientIP, COUNT(*) AS c, SUM(IsRefresh), AVG(ResolutionWidth) FROM hits GROUP BY WatchID, ClientIP ORDER BY c DESC LIMIT 10;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLAP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT EventTime, WatchID  FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;SELECT Title, URL, Referer  FROM hits WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;UPDATE hits SET Title='my title', URL='my url', Referer='my referer' WHERE CounterID = 38 AND EventDate = '2013-07-15' AND UserID = '1387668437822950552' AND WatchID = '8899477221003616239';&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;OLTP&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;I'll study the results under four categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset Load&lt;/li&gt;
&lt;li&gt;Table Size&lt;/li&gt;
&lt;li&gt;Read Queries Execution&lt;/li&gt;
&lt;li&gt;Update Query Execution: I've discussed the update query (query number 7) separately since it needs more discussion and attention.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Dataset Load
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;th&gt;MySQL&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;65s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;11m35s&lt;/td&gt;
&lt;td&gt;x10.7&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Thanks to its LSM-tree-based storage and sparse indexes, ClickHouse's load time is much faster than MySQL's, which relies on a B-tree. However, ClickHouse's insert efficiency shows up in bulk inserts rather than in many individual inserts: each insert creates an immutable part, and ClickHouse is reluctant to create, change, or remove parts for just a few rows.&lt;/p&gt;
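&lt;p&gt;In practice, this means preferring one large insert over many small ones, since every &lt;code&gt;INSERT&lt;/code&gt; statement produces a new immutable part (values elided for brevity):&lt;/p&gt;

```sql
-- Preferred: one statement, thousands of rows, one new part.
INSERT INTO hits VALUES (...), (...), (...);

-- Anti-pattern: N statements create N tiny parts that must be merged later.
INSERT INTO hits VALUES (...);
INSERT INTO hits VALUES (...);
```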

&lt;h3&gt;
  
  
  Table Size
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ClickHouse (GiB)&lt;/th&gt;
&lt;th&gt;MySQL (GiB)&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1.3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.32&lt;/td&gt;
&lt;td&gt;x4.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The column-oriented structure enables &lt;a href="https://clickhouse.com/docs/en/about-us/distinctive-features#data-compression" rel="noopener noreferrer"&gt;&lt;em&gt;Data Compression&lt;/em&gt;&lt;/a&gt;, which is far less effective in row-oriented databases. That is why ClickHouse can do teams storing large amounts of data a practical favor by reducing their storage costs.&lt;/p&gt;
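&lt;p&gt;You can check the compression ratio of a ClickHouse table yourself through the &lt;code&gt;system.parts&lt;/code&gt; table; a sketch for the &lt;code&gt;hits&lt;/code&gt; table:&lt;/p&gt;

```sql
-- Compressed vs. uncompressed bytes across the table's active parts.
SELECT
    formatReadableSize(sum(data_compressed_bytes))   AS compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE active AND table = 'hits';
```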

&lt;h3&gt;
  
  
  Read Queries Execution
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Number&lt;/th&gt;
&lt;th&gt;ClickHouse (s)&lt;/th&gt;
&lt;th&gt;MySQL (s)&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.005&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.79&lt;/td&gt;
&lt;td&gt;x1558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.030&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16.0&lt;/td&gt;
&lt;td&gt;x533.3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.193&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;4.35&lt;/td&gt;
&lt;td&gt;x22.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.600&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;180.93&lt;/td&gt;
&lt;td&gt;x69.58&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.01&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;0.011&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;x0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;ClickHouse's &lt;a href="https://clickhouse.com/docs/en/optimize/sparse-primary-indexes#:~:text=At%20the%20very%20large%20scale,technique%20is%20called%20sparse%20index." rel="noopener noreferrer"&gt;sparse index&lt;/a&gt; and column-oriented structure have outperformed MySQL in all OLAP queries (numbers 1 to 4). That's why BI and Data Analysts would be more than happy with ClickHouse for their daily reports.&lt;/p&gt;

&lt;p&gt;However, MySQL wins the battle when it comes to OLTP queries (numbers 5 and 6). The B-tree index used by MySQL indeed performs better for point queries: short transactions that require only a few rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Update Query Execution
&lt;/h3&gt;

&lt;p&gt;For the update query (number 7), we have to execute a different statement in ClickHouse, as it doesn't support a plain &lt;code&gt;UPDATE&lt;/code&gt;; the &lt;code&gt;ALTER TABLE&lt;/code&gt; command has to be used:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;JavaEnable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;CounterID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;38&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;EventDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2013-07-15'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;UserID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'1387668437822950552'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;WatchID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'8899477221003616239'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Additionally, ClickHouse applies the update asynchronously. To see the result immediately, you have to run an &lt;code&gt;OPTIMIZE&lt;/code&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="n"&gt;OPTIMIZE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;hits&lt;/span&gt; &lt;span class="k"&gt;FINAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Running the query number 7 statement from the queries table on MySQL, and the two statements above on ClickHouse, we get the results below:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Query Number&lt;/th&gt;
&lt;th&gt;ClickHouse (s)&lt;/th&gt;
&lt;th&gt;MySQL (s)&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.00&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Again, ClickHouse's aversion to mutations makes it the loser for real-time updates (and, similarly, deletes) compared to MySQL. Consequently, other methods, such as deduplication using ReplacingMergeTree, can be utilized to handle updates. You can find valuable resources in the links below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://clickhouse.com/docs/en/guides/developer/deduplication" rel="noopener noreferrer"&gt;Row-level Deduplication Strategies for Upserts and Frequent Updates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://clickhouse.com/blog/handling-updates-and-deletes-in-clickhouse" rel="noopener noreferrer"&gt;Handling Updates and Deletes in ClickHouse&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://altinity.com/blog/2020/4/14/handling-real-time-updates-in-clickhouse" rel="noopener noreferrer"&gt;Handling Real-Time Updates in ClickHouse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
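&lt;p&gt;As a minimal sketch of the deduplication approach (the table here is hypothetical), a &lt;code&gt;ReplacingMergeTree&lt;/code&gt; table keeps only the latest version of each key once parts are merged, and &lt;code&gt;FINAL&lt;/code&gt; forces deduplication at query time:&lt;/p&gt;

```sql
CREATE TABLE hits_latest
(
    WatchID UInt64,
    Title   String,
    Version UInt32
)
ENGINE = ReplacingMergeTree(Version)
ORDER BY WatchID;

-- An "update" is just inserting a newer version of the row...
INSERT INTO hits_latest VALUES (1, 'old title', 1);
INSERT INTO hits_latest VALUES (1, 'my title', 2);

-- ...and reading with FINAL returns only the latest version per key.
SELECT Title FROM hits_latest FINAL WHERE WatchID = 1;
```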

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, I benchmarked MySQL and ClickHouse databases to study some of their strengths and weaknesses that may help us choose a suitable solution. To summarize:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MySQL performs better on point and OLTP queries.&lt;/li&gt;
&lt;li&gt;ClickHouse performs better on OLAP queries.&lt;/li&gt;
&lt;li&gt;ClickHouse is not designed for frequent updates and deletes; you have to handle them with deduplication methods.&lt;/li&gt;
&lt;li&gt;ClickHouse reduces storage costs thanks to its column-oriented structure.&lt;/li&gt;
&lt;li&gt;ClickHouse loads bulk inserts far faster than MySQL.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>clickhouse</category>
      <category>mysql</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>ClickHouse Basic Tutorial: Keys &amp; Indexes</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Fri, 02 Jun 2023 16:04:08 +0000</pubDate>
      <link>https://forem.com/hoptical/clickhouse-basic-tutorial-keys-indexes-5d7a</link>
      <guid>https://forem.com/hoptical/clickhouse-basic-tutorial-keys-indexes-5d7a</guid>
      <description>&lt;p&gt;In the previous parts, we saw an introduction to ClickHouse and its features. Furthermore, we learned about its different table engine families and their most usable members. In this part, I will walk through the special keys and indexes in ClickHouse, which can help reduce query latency and database load significantly.&lt;/p&gt;

&lt;p&gt;It should be said that these concepts only apply to the default table engine family: MergeTree.&lt;/p&gt;

&lt;h2&gt;
  
  
  Primary Key
&lt;/h2&gt;

&lt;p&gt;ClickHouse indexes are based on &lt;em&gt;Sparse Indexing&lt;/em&gt;, an alternative to the &lt;a href="https://en.wikipedia.org/wiki/B-tree" rel="noopener noreferrer"&gt;B-Tree&lt;/a&gt; index utilized by traditional DBMSs. In a B-tree, every row is indexed, which is suitable for locating and updating a single row: the point queries common in OLTP tasks. This comes at the cost of slow high-volume inserts and high memory and storage consumption. On the contrary, the sparse index splits data into multiple &lt;em&gt;parts&lt;/em&gt;, each of which is divided into fixed-size blocks of rows called &lt;em&gt;granules&lt;/em&gt;. ClickHouse keeps one index entry per granule (a group of rows) instead of per row, and that's where the term &lt;em&gt;sparse index&lt;/em&gt; comes from. Given a query filtered on the primary key, ClickHouse looks for the matching granules and loads them into memory in parallel. That brings notable performance on the range queries common in OLAP tasks. Additionally, as data is stored column by column in multiple files, it can be compressed, resulting in much lower storage consumption.&lt;/p&gt;

&lt;p&gt;The sparse index is based on &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;LSM trees&lt;/a&gt;, allowing you to insert high volumes of data per second. All this comes at the cost of being unsuitable for point queries, which is not what ClickHouse is designed for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure
&lt;/h3&gt;

&lt;p&gt;In the below figure, we can see how ClickHouse stores data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgwpv7hn78pjm0zmiv9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjgwpv7hn78pjm0zmiv9y.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;ClickHouse Data Store Structure&lt;/cite&gt;
  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is split into multiple parts (by the ClickHouse default or a user-defined partition key).&lt;/li&gt;
&lt;li&gt;Parts are divided into granules, which is a logical concept: ClickHouse doesn't physically split data into granules; instead, it locates them via marks. Granules' locations (start and end) are recorded in mark files with the &lt;code&gt;mrk2&lt;/code&gt; extension.&lt;/li&gt;
&lt;li&gt;Index values are stored in the &lt;code&gt;primary.idx&lt;/code&gt; file, which contains one row per granule.&lt;/li&gt;
&lt;li&gt;Columns are stored as compressed blocks in &lt;code&gt;.bin&lt;/code&gt; files: one file per column in the &lt;code&gt;Wide&lt;/code&gt; format and a single file for all columns in the &lt;code&gt;Compact&lt;/code&gt; format. Whether a part is Wide or Compact is determined by ClickHouse based on its size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's see how ClickHouse finds the matching rows using primary keys:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;ClickHouse finds the matching granule marks utilizing the &lt;code&gt;primary.idx&lt;/code&gt; file via the binary search.&lt;/li&gt;
&lt;li&gt;Looks into the mark files to find the granules' location in the &lt;code&gt;bin&lt;/code&gt; files.&lt;/li&gt;
&lt;li&gt;Loads the matching granules from the &lt;code&gt;bin&lt;/code&gt; files into the memory in parallel and looks for the matching rows in those granules using binary search.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Case Study
&lt;/h3&gt;

&lt;p&gt;To clarify the flow mentioned above, let's create a table and insert data into it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;projects&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="nv"&gt;`project_id`&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`name`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`created_date`&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;generateRandom&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'project_id Int32, name String, created_date Date'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;First, if you don't specify a primary key separately, ClickHouse considers the sort keys (in &lt;code&gt;ORDER BY&lt;/code&gt;) as the primary key. Hence, in this table, &lt;code&gt;project_id&lt;/code&gt; and &lt;code&gt;created_date&lt;/code&gt; form the primary key. Every time you insert data into this table, it will be sorted first by &lt;code&gt;project_id&lt;/code&gt; and then by &lt;code&gt;created_date&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If we look into the data structure stored on the hard drive, we face this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u1cfv0vor7vj98dzrcj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8u1cfv0vor7vj98dzrcj.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Physical files stored in a part&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;We have five parts, one of which is &lt;code&gt;all_1_1_0&lt;/code&gt;. You can visit &lt;a href="https://kb.altinity.com/engines/mergetree-table-engine-family/part-naming-and-mvcc/#part-names--multiversion-concurrency-control" rel="noopener noreferrer"&gt;this link&lt;/a&gt; if you're curious about the naming convention. As you can see, columns are stored in &lt;code&gt;bin&lt;/code&gt; files, and mark files named after the primary-key columns sit alongside the &lt;code&gt;primary.idx&lt;/code&gt; file.&lt;/p&gt;
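&lt;p&gt;Instead of browsing the filesystem, you can also list these parts from SQL; a sketch:&lt;/p&gt;

```sql
-- Active parts of the projects table and whether each is Wide or Compact.
SELECT name, part_type, rows
FROM system.parts
WHERE table = 'projects' AND active;
```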

&lt;h4&gt;
  
  
  Filter on the first primary-key
&lt;/h4&gt;

&lt;p&gt;Now let's filter on &lt;code&gt;project_id&lt;/code&gt;, which is the first primary key, and explain its indexes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt3u2ool0pla83zjokbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frt3u2ool0pla83zjokbq.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis of a query on first primary key&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;As you can see, the system has detected &lt;code&gt;project_id&lt;/code&gt; as a primary key and ruled out 1224 granules out of 1225 using it!&lt;/p&gt;

&lt;h4&gt;
  
  
  Filter on second primary-key
&lt;/h4&gt;

&lt;p&gt;What if we filter on &lt;code&gt;created_date&lt;/code&gt;: the second PK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeauo6583t2d9vclbb1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmeauo6583t2d9vclbb1u.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis of a query on second primary key&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;The database has detected &lt;code&gt;created_date&lt;/code&gt; as a primary key, but it hasn't been able to rule out any granules. Why?&lt;br&gt;
Because ClickHouse uses binary search only for the first key and a &lt;a href="https://github.com/ClickHouse/ClickHouse/blob/22.3/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp#L1444" rel="noopener noreferrer"&gt;generic exclusive search&lt;/a&gt; for the other keys, which is much less efficient. So how can we make it more efficient?&lt;/p&gt;

&lt;p&gt;If we swap &lt;code&gt;project_id&lt;/code&gt; and &lt;code&gt;created_date&lt;/code&gt; in the sort keys when defining the table, we achieve better results when filtering on the non-first keys, since &lt;code&gt;created_date&lt;/code&gt; has lower cardinality (fewer unique values) than &lt;code&gt;project_id&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;projects&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="nv"&gt;`project_id`&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`name`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`created_date`&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;700&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz9fgeo23rncr5cx3oah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftz9fgeo23rncr5cx3oah.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis of a query on second primary key on an improved sort keys table&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;If we filter on &lt;code&gt;project_id&lt;/code&gt;, now the second key, ClickHouse reads only 909 granules instead of the whole table.&lt;/p&gt;

&lt;p&gt;So to summarize, always try to order the primary keys from &lt;strong&gt;low&lt;/strong&gt; to &lt;strong&gt;high&lt;/strong&gt; cardinality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Order Key
&lt;/h3&gt;

&lt;p&gt;I mentioned earlier that if you don't specify the &lt;code&gt;PRIMARY KEY&lt;/code&gt; option, ClickHouse uses the sort keys as the primary keys. If you want to set the primary keys separately, however, they must be a prefix of the sort keys. As a result, the additional columns in the sort keys are used only for sorting and play no role in indexing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;projects&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="nv"&gt;`project_id`&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`name`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`created_date`&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, the &lt;code&gt;created_date&lt;/code&gt; and &lt;code&gt;project_id&lt;/code&gt; columns are used for both the sparse index and sorting, while the &lt;code&gt;name&lt;/code&gt; column serves only as the last sorting criterion.&lt;/p&gt;

&lt;p&gt;Use this option when your queries frequently sort by a column in their &lt;code&gt;ORDER BY&lt;/code&gt; clause: since the data is already stored in that order, the database is spared the sorting effort at query time.&lt;/p&gt;
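
&lt;p&gt;For example, with the table above, a query that orders by the sort keys lets ClickHouse read rows in stored order (the &lt;code&gt;optimize_read_in_order&lt;/code&gt; optimization) instead of re-sorting them. This is a hypothetical query for illustration:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
SELECT *
FROM projects
WHERE created_date = '2020-02-01'
ORDER BY created_date, project_id, name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;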

&lt;h3&gt;
  
  
  Partition Key
&lt;/h3&gt;

&lt;p&gt;A partition is a logical grouping of parts in ClickHouse. By default, all parts belong to a single, unnamed partition. To see this, look into the &lt;code&gt;system.parts&lt;/code&gt; table for the &lt;code&gt;projects&lt;/code&gt; table defined in the previous section:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'projects'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepygdscxk5pahpfzgjnu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fepygdscxk5pahpfzgjnu.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Parts structure in an unpartitioned table&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;You can see that the &lt;code&gt;projects&lt;/code&gt; table has no particular partition. However, you can customize it using the &lt;code&gt;PARTITION BY&lt;/code&gt; option:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;projects_partitioned&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;

    &lt;span class="nv"&gt;`project_id`&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`name`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

    &lt;span class="nv"&gt;`created_date`&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;toYYYYMM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above table, ClickHouse partitions data based on the month of the &lt;code&gt;created_date&lt;/code&gt; column:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqthhonsue7p6qj26mgxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqthhonsue7p6qj26mgxj.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Parts structure in a partitioned table&lt;/cite&gt;
  &lt;/p&gt;
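
&lt;p&gt;Querying &lt;code&gt;system.parts&lt;/code&gt; again shows how the parts are now grouped into monthly partitions (a sketch; the exact values depend on your data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
SELECT
    partition,
    count() AS parts_count
FROM system.parts
WHERE table = 'projects_partitioned' AND active
GROUP BY partition
ORDER BY partition
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;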

&lt;h3&gt;
  
  
  Index
&lt;/h3&gt;

&lt;p&gt;ClickHouse creates a &lt;em&gt;min-max&lt;/em&gt; index for the partition key and uses it as the first filter layer when running a query. Let's see what happens when we filter data by a column present in the partition key:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects_partitioned&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2020-02-01'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z17e0jmy8v8nvbp04ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4z17e0jmy8v8nvbp04ow.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis on a partitioned table&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;You can see that the database has chosen one part out of 16 using the min-max index of the partition key.&lt;/p&gt;

&lt;h3&gt;
  
  
  Usage
&lt;/h3&gt;

&lt;p&gt;Partitioning in ClickHouse exists to bring data manipulation capabilities to the table. For instance, you can delete or move all parts belonging to partitions older than a year. Because ClickHouse physically splits the data on storage by month, such operations are far more efficient than on an unpartitioned table.&lt;/p&gt;
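
&lt;p&gt;For instance, with the monthly &lt;code&gt;toYYYYMM&lt;/code&gt; partitioning above, removing a whole month becomes a cheap operation on entire parts (a hypothetical example; the partition ID is the year and month):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
-- remove all rows belonging to January 2020 at once
ALTER TABLE projects_partitioned DROP PARTITION 202001;

-- or detach the partition, keeping its data on disk for a later re-attach
ALTER TABLE projects_partitioned DETACH PARTITION 202001;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;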

&lt;p&gt;Although ClickHouse creates an additional index for the partition key, never treat partitioning as a query-performance technique: it always loses that battle to placing the column in the sort keys. So if you wish to enhance query performance, put those columns in the sort keys, and choose a partition key only when you have concrete plans for data manipulation based on that column.&lt;/p&gt;

&lt;p&gt;Finally, don't confuse partitions in ClickHouse with the same term in distributed systems, where data is split across different nodes. If that is your goal, use &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/special/distributed" rel="noopener noreferrer"&gt;shards and distributed tables&lt;/a&gt; instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Skip Index
&lt;/h2&gt;

&lt;p&gt;You may have noticed that placing a column among the last items of the sort key is of little help, especially if you filter only on that column without the leading sort keys. What should you do in those cases?&lt;/p&gt;

&lt;p&gt;Consider a dictionary you want to read. You can find words using its table of contents, which is sorted alphabetically; that table of contents plays the role of the sort keys. You can easily find a word starting with &lt;em&gt;W&lt;/em&gt;, but how can you find the pages containing words related to wars?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mrrbbhb98bgm75kjcot.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3mrrbbhb98bgm75kjcot.jpeg" alt="A book with sticky notes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can put marks or sticky notes on those pages so that finding them takes less effort next time. That's how a &lt;a href="https://clickhouse.com/docs/en/optimize/skipping-indexes" rel="noopener noreferrer"&gt;Skip Index&lt;/a&gt; works: it creates additional indexes that help the database skip granules that cannot contain the desired values of certain columns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Case Study
&lt;/h3&gt;

&lt;p&gt;Consider the &lt;code&gt;projects&lt;/code&gt; table defined in the &lt;em&gt;Order By&lt;/em&gt; section. &lt;code&gt;created_date&lt;/code&gt; and &lt;code&gt;project_id&lt;/code&gt; were defined as primary keys. Now if we filter on the &lt;code&gt;name&lt;/code&gt; column, we'll encounter this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;EXPLAIN&lt;/span&gt; &lt;span class="n"&gt;indexes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'hamed'&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiofrdndocznh2d242u5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feiofrdndocznh2d242u5.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis on a query on non-indexed column&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;The result was expected. Now what if we define a skip index on it?&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;name_index&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;bloom_filter&lt;/span&gt; &lt;span class="n"&gt;GRANULARITY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The above command creates a skip index on the &lt;code&gt;name&lt;/code&gt; column. I've used the bloom filter type because the column was a string. You can find more about the other kinds &lt;a href="https://clickhouse.com/docs/en/optimize/skipping-indexes#skip-index-types" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This command builds the index only for newly inserted data. To build it for already-inserted data as well, materialize the index:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;projects&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;name_index&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's see the query analysis this time:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimwu2pw2hbl2zeh7cb9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimwu2pw2hbl2zeh7cb9x.png"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Index analysis on a query on skip-indexed column&lt;/cite&gt;
  &lt;/p&gt;

&lt;p&gt;As you can see, the skip index dramatically reduced the number of granules read, greatly improving performance.&lt;/p&gt;

&lt;p&gt;While the skip index performed efficiently in this example, it can perform poorly in other cases. Its effectiveness depends on how well the indexed column correlates with the sort keys, as well as on settings such as the index type and granularity.&lt;/p&gt;
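
&lt;p&gt;You can list the skip indexes defined on a table through the &lt;code&gt;system.data_skipping_indices&lt;/code&gt; table, and drop one that doesn't pay off (a sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
SELECT table, name, type, granularity
FROM system.data_skipping_indices
WHERE table = 'projects';

-- remove the index if it doesn't help your queries
ALTER TABLE projects DROP INDEX name_index;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;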

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, understanding and utilizing ClickHouse's primary keys, order keys, partition keys, and skip index is crucial for optimizing query performance and scalability. Choosing appropriate primary keys, order keys, and partitioning strategies can enhance data distribution, improve query execution speed, and prevent overloading. Additionally, leveraging the skip index feature intelligently helps minimize disk I/O and reduce query execution time. By considering these factors in your ClickHouse schema design, you can unlock the full potential of ClickHouse for efficient and performant data solutions.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Schedule Database Backups with Cronjob and Upload to AWS S3</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Wed, 24 May 2023 15:53:02 +0000</pubDate>
      <link>https://forem.com/hoptical/how-to-schedule-database-backups-with-cronjob-and-upload-to-aws-s3-4ikh</link>
      <guid>https://forem.com/hoptical/how-to-schedule-database-backups-with-cronjob-and-upload-to-aws-s3-4ikh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The backup procedure is a vital operation every team should consider. However, doing it manually is exhausting toil. You can automate it simply by creating a cronjob that takes the backup and uploads it to your desired object storage.&lt;/p&gt;

&lt;p&gt;This article explains how to automate your database backup with a Cronjob and upload it to cloud S3 storage. Postgres is used as the database, but you can generalize the approach to any other database or data type you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Create configmap
&lt;/h2&gt;

&lt;p&gt;To perform the backup and upload, we need a bash script. It first logs into the cluster via the &lt;code&gt;oc login&lt;/code&gt; command, then gets your desired database pod's name, executes the dump command, zips the result, and downloads it via the &lt;code&gt;oc rsync&lt;/code&gt; command. Finally, it uploads the file to AWS S3 object storage.&lt;/p&gt;

&lt;p&gt;Before dumping, the script prompts the user with an &lt;em&gt;Are you sure?&lt;/em&gt; confirmation, which you can bypass with the &lt;code&gt;-y&lt;/code&gt; option.&lt;/p&gt;

&lt;p&gt;All credentials like &lt;em&gt;OKD Token&lt;/em&gt; or &lt;em&gt;Postgres Password&lt;/em&gt; are passed to the application as environment variables.&lt;/p&gt;

&lt;p&gt;By putting this bash script in a config map, it can be mounted as a volume in the cronjob. Remember to replace &lt;code&gt;PROJECT_HERE&lt;/code&gt; with your project name and customize the variables in the bash script according to your project specifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash-script&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROJECT_HERE&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;backup.sh&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="s"&gt;#!/bin/bash&lt;/span&gt;

    &lt;span class="s"&gt;# This file provides a backup script for postgres&lt;/span&gt;

    &lt;span class="s"&gt;# Variables: Modify your variables according to the okd projects and database secrets&lt;/span&gt;

    &lt;span class="s"&gt;NAMESPACE=PROJECT_HERE&lt;/span&gt;

    &lt;span class="s"&gt;S3_URL=https://s3.mycompany.com&lt;/span&gt;

    &lt;span class="s"&gt;STATEFULSET_NAME=postgres&lt;/span&gt;

    &lt;span class="s"&gt;BACKUP_NAME=backup-$(date "+%F")&lt;/span&gt;

    &lt;span class="s"&gt;S3_BUCKET=databases-backup&lt;/span&gt;

    &lt;span class="s"&gt;# Exit the script anywhere faced the error &lt;/span&gt;

    &lt;span class="s"&gt;set -e&lt;/span&gt;

    &lt;span class="s"&gt;# Define the confirm option about user prompt (yes or no)&lt;/span&gt;

    &lt;span class="s"&gt;confirm=""&lt;/span&gt;

    &lt;span class="s"&gt;# Parse command-line options&lt;/span&gt;

    &lt;span class="s"&gt;while getopts "y" opt; do&lt;/span&gt;
        &lt;span class="s"&gt;case $opt in&lt;/span&gt;
        &lt;span class="s"&gt;y)&lt;/span&gt;
            &lt;span class="s"&gt;confirm="y"&lt;/span&gt;
            &lt;span class="s"&gt;;;&lt;/span&gt;
        &lt;span class="s"&gt;\?)&lt;/span&gt;
            &lt;span class="s"&gt;echo "Invalid option: -$OPTARG" &amp;gt;&amp;amp;2&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
            &lt;span class="s"&gt;;;&lt;/span&gt;
        &lt;span class="s"&gt;esac&lt;/span&gt;
    &lt;span class="s"&gt;done&lt;/span&gt;

    &lt;span class="s"&gt;# Login to OKD&lt;/span&gt;

    &lt;span class="s"&gt;oc login ${S3_URL} --token=${OKD_TOKEN}&lt;/span&gt;

    &lt;span class="s"&gt;POD_NAME=$(oc get pods -n ${NAMESPACE} | grep ${STATEFULSET_NAME} | cut -d' '&lt;/span&gt;
    &lt;span class="s"&gt;-f1)&lt;/span&gt;

    &lt;span class="s"&gt;echo The backup of database in pod ${POD_NAME} will be dumped in ${BACKUP_NAME}&lt;/span&gt;
    &lt;span class="s"&gt;file.&lt;/span&gt;

    &lt;span class="s"&gt;DUMP_COMMAND='PGPASSWORD="'${POSTGRES_USER_PASSWORD}'" pg_dump -U&lt;/span&gt;
    &lt;span class="s"&gt;'${POSTGRES_USER}' '${POSTGRES_DB}' &amp;gt; /bitnami/postgresql/backup/'${BACKUP_NAME}&lt;/span&gt;

    &lt;span class="s"&gt;GZIP_COMMAND='gzip /bitnami/postgresql/backup/'${BACKUP_NAME}&lt;/span&gt;

    &lt;span class="s"&gt;REMOVE_COMMAND='rm /bitnami/postgresql/backup/'${BACKUP_NAME}.gz&lt;/span&gt;

    &lt;span class="s"&gt;# Prompt the user for confirmation if the -y option was not provided&lt;/span&gt;

    &lt;span class="s"&gt;if [[ $confirm != "y" ]]; then&lt;/span&gt;
        &lt;span class="s"&gt;read -r -p "Are you sure you want to proceed? [y/N] " response&lt;/span&gt;
        &lt;span class="s"&gt;case "$response" in&lt;/span&gt;
        &lt;span class="s"&gt;[yY][eE][sS] | [yY])&lt;/span&gt;
            &lt;span class="s"&gt;confirm="y"&lt;/span&gt;
            &lt;span class="s"&gt;;;&lt;/span&gt;
        &lt;span class="s"&gt;*)&lt;/span&gt;
            &lt;span class="s"&gt;echo "Aborted"&lt;/span&gt;
            &lt;span class="s"&gt;exit 0&lt;/span&gt;
            &lt;span class="s"&gt;;;&lt;/span&gt;
        &lt;span class="s"&gt;esac&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;

    &lt;span class="s"&gt;# Dump the backup and zip it&lt;/span&gt;

    &lt;span class="s"&gt;oc exec -n ${NAMESPACE} "${POD_NAME}" -- sh -c "${DUMP_COMMAND} &amp;amp;&amp;amp; ${GZIP_COMMAND}"&lt;/span&gt;

    &lt;span class="s"&gt;echo Transfer it to current local folder&lt;/span&gt;

    &lt;span class="s"&gt;oc rsync -n ${NAMESPACE} ${POD_NAME}:/bitnami/postgresql/backup/ /backup-files&lt;/span&gt;
    &lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="s"&gt;oc exec -n ${NAMESPACE} "${POD_NAME}" -- sh -c "${REMOVE_COMMAND}"&lt;/span&gt;

    &lt;span class="s"&gt;# Send backup files to AWS S3&lt;/span&gt;

    &lt;span class="s"&gt;aws --endpoint-url "${S3_URL}" s3 sync /backup-files&lt;/span&gt;
    &lt;span class="s"&gt;s3://${S3_BUCKET}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 2: Create secrets
&lt;/h2&gt;

&lt;p&gt;Database, AWS, and OC credentials should be kept as secrets. First, we’ll create a secret containing the AWS CA Bundles. After downloading the bundle, you can make a secret file from it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc create secret &lt;span class="nt"&gt;-n&lt;/span&gt; PROJECT_HERE generic certs &lt;span class="nt"&gt;--from-file&lt;/span&gt; ca-bundle.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should replace &lt;code&gt;PROJECT_HERE&lt;/code&gt; with your project name.&lt;/p&gt;

&lt;p&gt;Now let’s create another secret for the other credentials. Note that &lt;code&gt;AWS_CA_BUNDLE&lt;/code&gt; should be set to &lt;code&gt;/certs/ca-bundle.crt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecret&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROJECT_HERE&lt;/span&gt;

&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;AWS_CA_BUNDLE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;OKD_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;POSTGRES_USER_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="na"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 

&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Opaque&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
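
&lt;p&gt;Since the secret type is &lt;code&gt;Opaque&lt;/code&gt;, every value under &lt;code&gt;data&lt;/code&gt; must be base64-encoded. You can encode each credential like this (shown with a placeholder value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;$ echo -n 'supersecret' | base64
c3VwZXJzZWNyZXQ=
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;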



&lt;h2&gt;
  
  
  Step 3: Create cronjob
&lt;/h2&gt;

&lt;p&gt;To create the cronjob, we need a Docker image capable of running &lt;code&gt;oc&lt;/code&gt; and &lt;code&gt;aws&lt;/code&gt; commands. You can find this image and its Dockerfile &lt;a href="https://hub.docker.com/repository/docker/hamedkarbasi/aws-cli-oc/"&gt;here&lt;/a&gt; if you'd like to customize it.&lt;/p&gt;

&lt;p&gt;Now let’s create the cronjob:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database-backup&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROJECT_HERE&lt;/span&gt; 
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0 3 * * *&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hamedkarbasi/aws-cli-oc:1.0.0&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/bin/bash"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/backup-script/backup.sh&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-y"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
            &lt;span class="na"&gt;envFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;secretRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecret&lt;/span&gt;
            &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;script&lt;/span&gt;
                &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/backup-script/backup.sh&lt;/span&gt;
                &lt;span class="na"&gt;subPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup.sh&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
                &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/certs/ca-bundle.crt&lt;/span&gt;
                &lt;span class="na"&gt;subPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ca-bundle.crt&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dir&lt;/span&gt;
                &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/.kube&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-files&lt;/span&gt;
                &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/backup-files&lt;/span&gt;
          &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;script&lt;/span&gt;
              &lt;span class="na"&gt;configMap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-script&lt;/span&gt;
                &lt;span class="na"&gt;defaultMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0777&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
              &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
                &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-dir&lt;/span&gt;
              &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backup-files&lt;/span&gt;
              &lt;span class="na"&gt;emptyDir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Again, you should replace &lt;code&gt;PROJECT_HERE&lt;/code&gt; with your project name and set the &lt;code&gt;schedule&lt;/code&gt; parameter to your desired job frequency. By putting all the manifests in a folder named &lt;code&gt;backup&lt;/code&gt;, we can apply them to Kubernetes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;oc apply &lt;span class="nt"&gt;-f&lt;/span&gt; backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cronjob will run at 3:00 AM every night, dumping the database and uploading the dump to AWS S3.&lt;/p&gt;
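
&lt;p&gt;The nightly 3:00 AM run corresponds to the cron expression &lt;code&gt;0 3 * * *&lt;/code&gt; (minute, hour, day of month, month, day of week). As a minimal sketch of the relevant CronJob fields, note that the &lt;code&gt;concurrencyPolicy&lt;/code&gt; line is an optional addition, not part of the original manifest:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;spec:
  # minute=0, hour=3: run at 03:00 every day
  schedule: "0 3 * * *"
  # optional: skip a run if the previous one is still in progress
  concurrencyPolicy: Forbid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;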

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, automating database backups to AWS S3 with a Kubernetes cronjob saves time and effort while keeping your valuable data stored securely in the cloud. Following the steps outlined in this guide, you can set up a backup schedule that meets your needs and upload the backups to AWS S3 for safekeeping. Remember to test your backups regularly to ensure they can be restored when needed, and keep your AWS credentials and permissions secure to prevent unauthorized access. With these best practices in mind, you can rest assured that your database backups are automated and safely stored.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>automation</category>
      <category>database</category>
    </item>
    <item>
      <title>ClickHouse Basic Tutorial: Table Engines</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Sat, 29 Apr 2023 08:38:56 +0000</pubDate>
      <link>https://forem.com/hoptical/clickhouse-basic-tutorial-table-engines-30i1</link>
      <guid>https://forem.com/hoptical/clickhouse-basic-tutorial-table-engines-30i1</guid>
      <description>&lt;p&gt;In this part, I will cover ClickHouse table engines. Like any other database, ClickHouse uses engines to determine a table's storage, replication, and concurrency methodologies. Every engine has pros and cons, and you should choose them by your need. Moreover, engines are categorized into families sharing the main features. As a practical article, I will deep dive into the most usable ones in every family and leave the others to your interest. &lt;/p&gt;

&lt;p&gt;Now, let's start with the first and most widely used family:&lt;/p&gt;

&lt;h2&gt;
  
  
  Merge-Tree Family
&lt;/h2&gt;

&lt;p&gt;As mentioned, this is the most likely choice when you want to create a table in ClickHouse. It's based on the &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree" rel="noopener noreferrer"&gt;Log-Structured Merge-Tree&lt;/a&gt; data structure. LSM trees are optimized for write-intensive workloads: they handle a large volume of writes by buffering them in memory and then periodically flushing them to disk in sorted order. This allows for faster writes of massive data volumes and reduces the likelihood of disk fragmentation. They are considered an alternative to the &lt;a href="https://en.wikipedia.org/wiki/B-tree" rel="noopener noreferrer"&gt;B-Tree&lt;/a&gt; data structure, which is common in traditional relational databases like MySQL.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: For all engines of this family, you can use &lt;code&gt;Replicated&lt;/code&gt; as a prefix to the engine name to create a replication of the table on every ClickHouse node.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now let's investigate common engines in this family.&lt;/p&gt;

&lt;h3&gt;
  
  
  MergeTree
&lt;/h3&gt;

&lt;p&gt;Here is an example of a merge-tree DDL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`comment`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Merge-tree tables use &lt;a href="https://clickhouse.com/docs/en/optimize/sparse-primary-indexes" rel="noopener noreferrer"&gt;sparse indexing&lt;/a&gt; to optimize queries.&lt;br&gt;
Briefly, in sparse indexing, data is split into multiple parts. Every part is sorted by the &lt;code&gt;order by&lt;/code&gt; keys (referred to as &lt;em&gt;sort keys&lt;/em&gt;), where the first key has the highest priority in sorting. Then every part is broken down into groups called &lt;em&gt;granules&lt;/em&gt; whose first and last items for primary keys are considered as marks. Since these marks are extracted from the sorted data, primary keys should be a subset of sort keys. Then for every query containing a filter on primary keys, ClickHouse performs a binary search on those marks to find the target granules as fast as possible. Finally, ClickHouse loads target granules in memory and searches for the matching rows.&lt;/p&gt;
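
&lt;p&gt;The granule size is configurable per table. As an illustrative sketch (the table name here is hypothetical), the &lt;code&gt;index_granularity&lt;/code&gt; setting, which defaults to 8192 rows per granule, can be set explicitly in the DDL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE inventory_tuned
(
    `id` Int32,
    `status` String,
    `price` String
)
ENGINE = MergeTree
ORDER BY (id, price)
SETTINGS index_granularity = 8192;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;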

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: You can omit the &lt;code&gt;PRIMARY KEY&lt;/code&gt; in DDL, and ClickHouse will consider sort keys as primary keys.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  ReplacingMergeTree
&lt;/h3&gt;

&lt;h4&gt;
  
  
  DDL
&lt;/h4&gt;

&lt;p&gt;In this engine, rows with equal sort keys are replaced by the most recently inserted row. Consider the table below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`comment`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ReplacingMergeTree&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Suppose that you insert a row in this table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'success'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Confirmed'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now let's insert another row with the same sort keys:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'success'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Cancelled'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the latter row will eventually replace the previous one. Note that if you select rows right away, you may still see both of them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttb0dodjn9lsa72k7hpe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttb0dodjn9lsa72k7hpe.png" alt="Result of the replacing merge tree without Final modifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's because ClickHouse performs the replacement while merging the parts, which happens asynchronously in the background, not immediately. To see the final result immediately, you can use the &lt;code&gt;FINAL&lt;/code&gt; modifier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;FINAL&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz5kceh1wut9hdsze7bj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmz5kceh1wut9hdsze7bj.png" alt="Result of the replacing merge tree with Final modifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: You can specify a &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree#ver" rel="noopener noreferrer"&gt;column as version&lt;/a&gt; while defining the table to replace rows accordingly.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
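
&lt;p&gt;As a hedged sketch, a versioned variant of the table above could look like the following (the table name and &lt;code&gt;version&lt;/code&gt; column are hypothetical). With a version column, ClickHouse keeps the row with the maximum version rather than the last inserted one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE inventory_versioned
(
    `id` Int32,
    `status` String,
    `price` String,
    `version` UInt32
)
ENGINE = ReplacingMergeTree(version)
PRIMARY KEY (id)
ORDER BY (id, status);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;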

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;ReplacingMergeTree is widely used for deduplication. Since ClickHouse performs poorly under frequent updates, you can instead update a row by inserting a new one with the same sort keys, and ClickHouse will remove the stale rows in the background. However, updating the sort keys themselves is challenging because the old rows won't be deleted in that situation. In that case, you can use CollapsingMergeTree, explained in the next part.&lt;/p&gt;
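
&lt;p&gt;If you can't wait for a background merge, you can also trigger the deduplication manually. Note that &lt;code&gt;OPTIMIZE&lt;/code&gt; forces an unscheduled merge and can be heavy on large tables, so use it sparingly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;OPTIMIZE TABLE inventory FINAL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;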

&lt;h3&gt;
  
  
  CollapsingMergeTree
&lt;/h3&gt;

&lt;p&gt;In this engine, you define a sign column and ask the database to delete stale rows marked with &lt;code&gt;sign=-1&lt;/code&gt; and keep the new rows marked with &lt;code&gt;sign=1&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  DDL
&lt;/h4&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`comment`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`sign`&lt;/span&gt; &lt;span class="n"&gt;Int8&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CollapsingMergeTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sign&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Let's insert a row in this table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'success'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Confirmed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzxhg4v0p9mxb5w7th6r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzxhg4v0p9mxb5w7th6r.png" alt="Data in Collapsing Merge Tree before update"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now to update the row:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'success'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'1000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Confirmed'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'success'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2000'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Cancelled'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To see the results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd95a2qzea1rabxbifc6a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd95a2qzea1rabxbifc6a.png" alt="Data in Collapsing Merge Tree after update without final modifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To see the final results immediately:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt; &lt;span class="k"&gt;FINAL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8bg0xpqhqh1ph0zg223.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl8bg0xpqhqh1ph0zg223.png" alt="Data in Collapsing Merge Tree after update with final modifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;CollapsingMergeTree can handle updates and deletes in a more controlled manner. For example, you can update sort keys by inserting the same row with &lt;code&gt;sign=-1&lt;/code&gt; together with a row carrying the new sort keys and &lt;code&gt;sign=1&lt;/code&gt;. There are two challenges with this engine:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Since you need to insert a copy of the old row with &lt;code&gt;sign=-1&lt;/code&gt;, you need to know its current values, fetching them from the database or another data store.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In case of inserting multiple rows accidentally or deliberately, with the &lt;code&gt;sign&lt;/code&gt; equal to 1 or -1, you may face unwanted results. That's why you should consider all situations explained &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/collapsingmergetree#table_engine-collapsingmergetree-collapsing-algorithm" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
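
&lt;p&gt;Because collapsing happens asynchronously, a common alternative to &lt;code&gt;FINAL&lt;/code&gt; is to write aggregations in which cancelled pairs sum to zero. A sketch for the table above; the cast is needed only because &lt;code&gt;price&lt;/code&gt; was declared as a String:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    id,
    -- rows inserted with sign=1 and cancelled with sign=-1 sum to zero
    sum(toInt32(price) * sign) AS price,
    sum(sign) AS net_sign
FROM inventory
GROUP BY id
HAVING sum(sign) = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;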

&lt;h3&gt;
  
  
  AggregatingMergeTree
&lt;/h3&gt;

&lt;p&gt;Using this engine, you can materialize the aggregation of a table into another one.&lt;/p&gt;

&lt;h4&gt;
  
  
  DDL
&lt;/h4&gt;

&lt;p&gt;Consider the following &lt;em&gt;inventory2&lt;/em&gt; table. We want another table holding, for every item id, the maximum price and the total number of items. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;inventory&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`status`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`num_items`&lt;/span&gt; &lt;span class="n"&gt;UInt64&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now let's materialize its results into another table via AggregatingMergeTree:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;MATERIALIZED&lt;/span&gt; &lt;span class="k"&gt;VIEW&lt;/span&gt; &lt;span class="n"&gt;agg_inventory&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`max_price`&lt;/span&gt; &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nv"&gt;`sum_items`&lt;/span&gt; &lt;span class="n"&gt;AggregateFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;UInt64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AggregatingMergeTree&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;maxState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sumState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sum_items&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;inventory2&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now let's insert rows into it and see the results:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;inventory2&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxMerge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;max_price&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sumMerge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum_items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sum_items&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;agg_inventory&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19fapbotd8grxn916e8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19fapbotd8grxn916e8x.png" alt="Output of aggregating merge tree"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;This engine helps you reduce the response time of heavy, fixed analytics queries by calculating them at write time. This also reduces the database load at query time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Log Family
&lt;/h2&gt;

&lt;p&gt;This family consists of lightweight engines with minimal functionality. They're most effective when you need to quickly write many small tables (up to approximately 1 million rows) and read them later. Additionally, there are no indexes in this family. However, the &lt;code&gt;Log&lt;/code&gt; and &lt;code&gt;StripeLog&lt;/code&gt; engines can break data down into multiple blocks to support multi-threaded reads.&lt;/p&gt;

&lt;p&gt;I will only look into the &lt;em&gt;TinyLog&lt;/em&gt; engine. To check the others, you can visit &lt;a href="https://clickhouse.com/docs/en/engines/table-engines#log" rel="noopener noreferrer"&gt;this link&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  TinyLog
&lt;/h3&gt;

&lt;p&gt;This engine is mainly used in a write-once fashion; i.e., you write data once and then read it as often as you want. As ClickHouse reads the data in a single stream, it's better to keep the table size up to 1M rows.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;log_location&lt;/span&gt;
 &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`long`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`lat`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TinyLog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;

&lt;p&gt;You can use this engine as intermediate storage for batch operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration Family
&lt;/h2&gt;

&lt;p&gt;The engines in this family are widely used to connect with other databases and brokers with the ability to fetch or insert data.&lt;/p&gt;

&lt;p&gt;I'll cover MySQL and Kafka Engines, but you can study the others &lt;a href="https://clickhouse.com/docs/en/engines/table-engines#integration-engines" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  MySQL Engine
&lt;/h3&gt;

&lt;p&gt;With this engine, you can connect with a MySQL database through ClickHouse and read its data or insert rows.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;mysql_inventory&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;`id`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;`price`&lt;/span&gt; &lt;span class="n"&gt;Int32&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MySQL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'host:port'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'database'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'table'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'user'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'password'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
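
&lt;p&gt;Once created, the table can be read from and written to like any local table, with queries forwarded to the remote MySQL server. A minimal sketch, with illustrative values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO mysql_inventory VALUES (1, 100);

SELECT * FROM mysql_inventory;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
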
&lt;h3&gt;
  
  
  Kafka Engine
&lt;/h3&gt;

&lt;p&gt;Using this engine, you can connect to a Kafka cluster and read its data with a defined consumer group. This engine is broadly used for CDC purposes.&lt;/p&gt;
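
&lt;p&gt;A minimal sketch of a Kafka-engine table; the broker address, topic, consumer-group name, and message format below are placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE kafka_inventory
(
    `id` Int32,
    `price` Int32
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'host:9092',
         kafka_topic_list = 'inventory-topic',
         kafka_group_name = 'clickhouse-consumer',
         kafka_format = 'JSONEachRow';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;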

&lt;p&gt;To learn more about this feature, read &lt;a href="https://medium.com/@hoptical/apply-cdc-from-mysql-to-clickhouse-d660873311c7" rel="noopener noreferrer"&gt;this&lt;/a&gt; article specifically on this topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we reviewed some of the most important engines of the ClickHouse database. ClickHouse clearly provides a wide range of engine options to suit various use cases. The MergeTree engine is the default and suits most scenarios, but it can be swapped for others like AggregatingMergeTree, TinyLog, etc.&lt;/p&gt;

&lt;p&gt;It's important to note that choosing the right engine for your use-case can significantly improve performance and efficiency. Therefore, it's worth taking the time to understand the strengths and limitations of each engine and select the one that best meets your needs. &lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Step-by-Step Guide: Deploying Kafka Connect via Strimzi Operator on Kubernetes</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Tue, 25 Apr 2023 12:19:43 +0000</pubDate>
      <link>https://forem.com/hoptical/step-by-step-guide-deploying-kafka-connect-via-strimzi-operator-on-kubernetes-3bjb</link>
      <guid>https://forem.com/hoptical/step-by-step-guide-deploying-kafka-connect-via-strimzi-operator-on-kubernetes-3bjb</guid>
      <description>&lt;p&gt;&lt;a href="https://strimzi.io/"&gt;Strimzi&lt;/a&gt; is almost the richest Kubernetes Kafka operator, which you can utilize to deploy Apache Kafka or its other components like Kafka Connect, Kafka Mirror, etc. This article will provide a step-by-step tutorial about deploying Kafka Connect on Kubernetes. I brought all issues I encountered during the deployment procedure and their best mitigation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Consider that this operator is based on &lt;a href="https://kafka.apache.org/"&gt;Apache Kafka&lt;/a&gt;, not the &lt;a href="https://docs.confluent.io/platform/current/platform.html"&gt;Confluent Platform&lt;/a&gt;. That's why you may need to add some confluent artifacts like &lt;a href="https://www.confluent.io/hub/confluentinc/kafka-connect-avro-converter"&gt;Confluent Avro Converter&lt;/a&gt; to get the most out of it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This article is based on &lt;code&gt;Strimzi v0.29.0&lt;/code&gt;. Thus, you can install the following versions of Kafka Connect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Strimzi: 0.29.0&lt;/li&gt;
&lt;li&gt;  Apache Kafka &amp;amp; Kafka Connect: Up to 3.2&lt;/li&gt;
&lt;li&gt;  Equivalent Confluent Platform: 7.2.4&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: You can convert Confluent Platform version to Apache Kafka version and vice versa with the provided table &lt;a href="https://docs.confluent.io/platform/current/installation/versions-interoperability.html#supported-versions-and-interoperability-for-cp"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Openshift GUI and Kubernetes CLI
&lt;/h3&gt;

&lt;p&gt;If you're using Openshift, navigate to Operators &amp;gt; Installed Operators &amp;gt; Strimzi &amp;gt; Kafka Connect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkdc82y0t6yf0zjc7y6l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkdc82y0t6yf0zjc7y6l.png" alt="Openshift Strimzi Operator Page" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you will see a form containing the Kafka Connect configuration. You can get the equivalent YAML of the form by clicking on the YAML view; any update in the form view is applied to the YAML view on the fly. Although the form view is quite straightforward, it's strongly recommended not to use it to create the instance directly. Use it only to convert your desired configuration to a YAML file, and then deploy the resource with the &lt;code&gt;kubectl apply&lt;/code&gt; command. To summarize:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Enter the configuration in the form view&lt;/li&gt;
&lt;li&gt; Click on Yaml view&lt;/li&gt;
&lt;li&gt; Copy its contents to a Yaml file on your local (e.g. &lt;code&gt;kafka-connect.yaml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt; Run: &lt;code&gt;kubectl apply -f kafka-connect.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now the KafkaConnect resource should be deployed or updated. The deployed resources consist of a Deployment (and its pods), a Service, ConfigMaps, and Secrets.&lt;/p&gt;

&lt;p&gt;Let's get through the minimum configuration and make it more advanced, step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Minimum Configuration
&lt;/h2&gt;

&lt;p&gt;To deploy a minimal Kafka Connect configuration, you can use the YAML below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka.strimzi.io/v1beta2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KafkaConnect&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-connect-cluster&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;YOUR_PROJECT_NAME&amp;gt;&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;config.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-configs&lt;/span&gt;
    &lt;span class="na"&gt;group.id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster&lt;/span&gt;
    &lt;span class="na"&gt;offset.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;offset.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-offsets&lt;/span&gt;
    &lt;span class="na"&gt;status.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;status.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-status&lt;/span&gt;
  &lt;span class="na"&gt;bootstrapServers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka1, kafka2&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.2.0&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Kafka Connect REST API is exposed on port 8083 of the pod. You can expose it on a private or internal network by defining a route on OKD.&lt;/p&gt;
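&lt;p&gt;As an illustrative sketch (the route name and namespace are assumptions, and the service name assumes a KafkaConnect resource named &lt;code&gt;my-connect-cluster&lt;/code&gt;, for which Strimzi exposes the REST API through a &lt;code&gt;&amp;lt;cluster-name&amp;gt;-connect-api&lt;/code&gt; service), such a route could look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Route
apiVersion: route.openshift.io/v1
metadata:
  name: kafka-connect-api
  namespace: kafka-connect
spec:
  to:
    kind: Service
    # Service created by Strimzi for the REST API
    name: my-connect-cluster-connect-api
  port:
    targetPort: rest-api
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;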

&lt;h2&gt;
  
  
  REST API Authentication
&lt;/h2&gt;

&lt;p&gt;With the configuration explained &lt;a href="https://docs.confluent.io/platform/current/security/basic-auth.html#kconnect-rest-api"&gt;here&lt;/a&gt;, you can add authentication to the Kafka Connect REST proxy. Unfortunately, that doesn't work with the Strimzi operator, as discussed &lt;a href="https://github.com/strimzi/strimzi-kafka-operator/issues/3229"&gt;here&lt;/a&gt;. So to secure Kafka Connect, you have two options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Use the &lt;em&gt;KafkaConnector&lt;/em&gt; resource. The Strimzi operator lets you define connectors in YAML files instead of calling the REST API. However, this may not be practical for some use cases, since updating, pausing, and stopping connectors via the REST API may still be necessary.&lt;/li&gt;
&lt;li&gt; Put the insecure REST API behind an authenticating API gateway such as Apache APISIX, or any other tool or self-developed application.&lt;/li&gt;
&lt;/ol&gt;
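&lt;p&gt;To sketch the first option, a hypothetical Debezium MySQL connector defined as a KafkaConnector resource could look like the following (the connector name, labels, and config values are illustrative placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaConnector
metadata:
  name: inventory-connector
  namespace: kafka-connect
  labels:
    # Must match the name of your KafkaConnect resource
    strimzi.io/cluster: my-connect-cluster
spec:
  class: io.debezium.connector.mysql.MySqlConnector
  tasksMax: 1
  config:
    database.hostname: mysql
    database.port: 3306
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;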

&lt;h2&gt;
  
  
  JMX Prometheus Metrics
&lt;/h2&gt;

&lt;p&gt;To expose JMX Prometheus metrics, which are useful for observing connector statuses in Grafana, add the configuration below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;metricsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jmxPrometheusExporter&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jmx-prometheus&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;configs&lt;/span&gt;
  &lt;span class="na"&gt;jmxOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It uses a pre-defined config for Prometheus export. You can use this config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;startDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;ssl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;lowercaseOutputName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;lowercaseOutputLabelNames&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.connect&amp;lt;type=connect-worker-metrics&amp;gt;([^:]+):"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka_connect_connect_worker_metrics_$1"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern &lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka.connect&amp;lt;type=connect-metrics,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;client-id=([^:]+)&amp;gt;&amp;lt;&amp;gt;([^:]+)"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kafka_connect_connect_metrics_$2"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;client&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$1"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debezium.([^:]+)&amp;lt;type=connector-metrics,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context=([^,]+),&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;server=([^,]+),&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;key=([^&amp;gt;]+)&amp;gt;&amp;lt;&amp;gt;RowsScanned"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debezium_metrics_RowsScanned"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;plugin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$1"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$3"&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$2"&lt;/span&gt;
    &lt;span class="na"&gt;table&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$4"&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debezium.([^:]+)&amp;lt;type=connector-metrics,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;context=([^,]+),&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;server=([^&amp;gt;]+)&amp;gt;([^:]+)"&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;debezium_metrics_$4"&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;plugin&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$1"&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$3"&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Service for External Prometheus
&lt;/h3&gt;

&lt;p&gt;If you intend to deploy Prometheus alongside Strimzi to collect the metrics, follow the instructions &lt;a href="https://strimzi.io/docs/operators/latest/deploying.html#assembly-metrics-setup-str"&gt;here&lt;/a&gt;. However, if you're using an external Prometheus, the story goes another way:&lt;/p&gt;

&lt;p&gt;The Strimzi operator only creates Service port mappings for these ports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  8083: Kafka Connect REST API&lt;/li&gt;
&lt;li&gt;  9999: JMX port&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unfortunately, it doesn't create a mapping for port 9404, the Prometheus exporter HTTP port. &lt;a href="https://github.com/strimzi/strimzi-kafka-operator/issues/8403"&gt;So we have to create a Service on our own&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect-jmx-prometheus&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/instance&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/managed-by&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;strimzi-cluster-operator&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;app.kubernetes.io/part-of&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;strimzi-kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;strimzi.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;strimzi.io/kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KafkaConnect&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tcp-prometheus&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9404&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9404&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;strimzi.io/cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
    &lt;span class="na"&gt;strimzi.io/kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KafkaConnect&lt;/span&gt;
    &lt;span class="na"&gt;strimzi.io/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect-connect&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loadBalancer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: This method only works for single-pod deployments, since you have to define a route for the service, and even with a headless service, the route returns the IP of one pod at a time. Hence, Prometheus can't scrape all pods' metrics. That's why it is recommended to use a PodMonitor with a Prometheus deployed in the cluster. This issue is discussed here.&lt;/p&gt;
&lt;/blockquote&gt;
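&lt;p&gt;For reference, a PodMonitor for a multi-pod deployment could be sketched as follows (this assumes the Prometheus Operator is installed; the port name &lt;code&gt;tcp-prometheus&lt;/code&gt; matches the metrics port Strimzi declares on the pods):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-connect-metrics
  namespace: kafka-connect
spec:
  selector:
    matchLabels:
      strimzi.io/kind: KafkaConnect
  podMetricsEndpoints:
    - path: /metrics
      port: tcp-prometheus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;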

&lt;h2&gt;
  
  
  Plugins and Artifacts
&lt;/h2&gt;

&lt;p&gt;To add plugins and artifacts, there are two ways:&lt;/p&gt;

&lt;h3&gt;
  
  
  Operator Build Section
&lt;/h3&gt;

&lt;p&gt;To add plugins, you can use the operator's build section. It takes the plugin or artifact addresses, downloads them in the build stage (the operator creates the BuildConfig automatically), and adds them to the plugin directory of the image.&lt;/p&gt;

&lt;p&gt;It supports the &lt;code&gt;jar&lt;/code&gt;, &lt;code&gt;tgz&lt;/code&gt;, &lt;code&gt;zip&lt;/code&gt;, and &lt;code&gt;maven&lt;/code&gt; artifact types. However, in the case of Maven, a multi-stage Dockerfile is created, which is &lt;a href="https://bugzilla.redhat.com/show_bug.cgi?id=1937243"&gt;problematic on Openshift&lt;/a&gt; and fails in the build stage. Hence, you should only use the types that don't need a compile stage (i.e., jar, zip, tgz) and end up with a single-stage Dockerfile.&lt;/p&gt;

&lt;p&gt;For example, to add the Debezium MySQL plugin, you can use the below configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafkaconnect:1.0'&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;imagestream&lt;/span&gt;
    &lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;artifacts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tgz&lt;/span&gt;
            &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;-&lt;/span&gt;
              &lt;span class="s"&gt;https://repo1.maven.org/maven2/io/debezium/debezium-connector-mysql/2.1.4.Final/debezium-connector-mysql-2.1.4.Final-plugin.tar.gz&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;debezium-connector-mysql&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Note: The Strimzi operator can only download public artifacts. So if you need a private, secured artifact that is not accessible from Kubernetes, you have to give up this method and follow the next one.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Changing Image
&lt;/h3&gt;

&lt;p&gt;The operator can use your desired image instead of its default one, so you can add your artifacts and plugins by building an image manually or via CI/CD. Another reason to use this method is that Strimzi uses the Apache Kafka image, not the Confluent Platform one, so the deployments lack useful Confluent packages like the Confluent Avro Converter. You need to add them to your image and configure the operator to use your Docker image.&lt;/p&gt;

&lt;p&gt;For example, If you want to add your customized Debezium MySQL Connector plugin from Gitlab Generic Packages and Confluent Avro Converter to the base image, first use this Dockerfile:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;ARG CONFLUENT_VERSION=7.2.4&lt;/span&gt;

&lt;span class="c1"&gt;# Install confluent avro converter&lt;/span&gt;
&lt;span class="s"&gt;FROM confluentinc/cp-kafka-connect:${CONFLUENT_VERSION} as cp&lt;/span&gt;
&lt;span class="c1"&gt;# Reassign version&lt;/span&gt;
&lt;span class="s"&gt;ARG CONFLUENT_VERSION&lt;/span&gt;
&lt;span class="s"&gt;RUN confluent-hub install --no-prompt confluentinc/kafka-connect-avro-converter:${CONFLUENT_VERSION}&lt;/span&gt;

&lt;span class="c1"&gt;# Copy privious artifacts to the main strimzi kafka image&lt;/span&gt;
&lt;span class="s"&gt;FROM quay.io/strimzi/kafka:0.29.0-kafka-3.2.0&lt;/span&gt;
&lt;span class="s"&gt;ARG GITLAB_TOKEN&lt;/span&gt;
&lt;span class="s"&gt;ARG CI_API_V4_URL=https://gitlab.snapp.ir/api/v4&lt;/span&gt;
&lt;span class="s"&gt;ARG CI_PROJECT_ID=3873&lt;/span&gt;
&lt;span class="s"&gt;ARG DEBEZIUM_CONNECTOR_MYSQL_CUSTOMIZED_VERSION=1.0&lt;/span&gt;
&lt;span class="s"&gt;USER root:root&lt;/span&gt;

&lt;span class="c1"&gt;# Copy Confluent packages from previous stage&lt;/span&gt;
&lt;span class="s"&gt;RUN mkdir -p /opt/kafka/plugins/avro/&lt;/span&gt;
&lt;span class="s"&gt;COPY --from=cp /usr/share/confluent-hub-components/confluentinc-kafka-connect-avro-converter/lib /opt/kafka/plugins/avro/&lt;/span&gt;

&lt;span class="c1"&gt;# Connector plugin debezium-connector-mysql&lt;/span&gt;
&lt;span class="s"&gt;RUN 'mkdir' '-p' '/opt/kafka/plugins/debezium-connector-mysql' \&lt;/span&gt;
    &lt;span class="s"&gt;&amp;amp;&amp;amp; curl --header "${GITLAB_TOKEN}" -f -L \&lt;/span&gt;
    &lt;span class="s"&gt;--output /opt/kafka/plugins/debezium-connector-mysql.tgz \&lt;/span&gt;
    &lt;span class="s"&gt;${CI_API_V4_URL}/projects/${CI_PROJECT_ID}/packages/generic/debezium-customized/${DEBEZIUM_CONNECTOR_MYSQL_CUSTOMIZED_VERSION}/debezium-connector-mysql-customized.tar.gz \&lt;/span&gt;
    &lt;span class="s"&gt;&amp;amp;&amp;amp; 'tar' 'xvfz' '/opt/kafka/plugins/debezium-connector-mysql.tgz' '-C' '/opt/kafka/plugins/debezium-connector-mysql' \&lt;/span&gt;
    &lt;span class="s"&gt;&amp;amp;&amp;amp; 'rm' '-vf' '/opt/kafka/plugins/debezium-connector-mysql.tgz'&lt;/span&gt;

&lt;span class="s"&gt;USER &lt;/span&gt;&lt;span class="m"&gt;1001&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Build the image, push it to the ImageStream or any other Docker registry, and configure the operator to use it by adding the line below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image-registry.openshift-image-registry.svc:5000/kafka-connect/kafkaconnect-customized:1.0&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Kafka Authentication
&lt;/h2&gt;

&lt;p&gt;Depending on the authentication type, you need different configurations. As an example, here is the configuration for Kafka with the SASL/PLAINTEXT mechanism and SCRAM-SHA-512:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;authentication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;passwordSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-password&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecrets&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scram-sha-512&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myuser&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Needless to say, you must provide the password in a Secret named &lt;em&gt;mysecrets&lt;/em&gt;.&lt;/p&gt;
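&lt;p&gt;For example, such a Secret could be created as follows (the password value is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kind: Secret
apiVersion: v1
metadata:
  name: mysecrets
  namespace: kafka-connect
stringData:
  # stringData avoids manual base64 encoding
  kafka-password: &amp;lt;YOUR_KAFKA_PASSWORD&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;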

&lt;h2&gt;
  
  
  Handling File Credentials
&lt;/h2&gt;

&lt;p&gt;Since connectors need credentials to access databases, you have to define them as Secrets and access them via environment variables. However, if there are too many of them, you can put all credentials in a file and reference them in the connector with the &lt;code&gt;file&lt;/code&gt; config provider:&lt;/p&gt;

&lt;p&gt;1- Put all credentials as the value of a key named &lt;em&gt;credentials&lt;/em&gt; in a secret file.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Credentials file:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;USERNAME_DB_1=user1
PASSWORD_DB_1=pass1

USERNAME_DB_2=user2
PASSWORD_DB_2=pass2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Secret file:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Secret&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecrets&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;credentials&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;BASE64 YOUR DATA&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2- Configure the operator with the secret mounted as a volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config.providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
    &lt;span class="na"&gt;config.providers.file.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.common.config.provider.FileConfigProvider&lt;/span&gt;  
  &lt;span class="na"&gt;externalConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database_credentials&lt;/span&gt;
        &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credentials&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credentials&lt;/span&gt;
          &lt;span class="na"&gt;optional&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecrets&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;3- Now, in the connector, you can access PASSWORD_DB_1 with the expression below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="s"&gt;"${file:/opt/kafka/external-configuration/database_credentials/credentials:PASSWORD_DB_1}"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Put it all together
&lt;/h2&gt;

&lt;p&gt;If we put all configurations together, we'll have the below configuration for Kafka Connect:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Service, Route, and build configurations are omitted since they were discussed earlier in the article.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka.strimzi.io/v1beta2&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;KafkaConnect&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-connect&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;authentication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;passwordSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kafka-password&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecrets&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;scram-sha-512&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myuser&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config.providers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;file&lt;/span&gt;
    &lt;span class="na"&gt;config.providers.file.class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org.apache.kafka.common.config.provider.FileConfigProvider&lt;/span&gt;
    &lt;span class="na"&gt;config.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;config.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-configs&lt;/span&gt;
    &lt;span class="na"&gt;group.id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster&lt;/span&gt;
    &lt;span class="na"&gt;offset.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;offset.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-offsets&lt;/span&gt;
    &lt;span class="na"&gt;status.storage.replication.factor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
    &lt;span class="na"&gt;status.storage.topic&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;okd4-connect-cluster-status&lt;/span&gt;
  &lt;span class="na"&gt;bootstrapServers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;kafka1:9092,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;kafka2:9092'&lt;/span&gt;
  &lt;span class="na"&gt;metricsConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jmxPrometheusExporter&lt;/span&gt;
    &lt;span class="na"&gt;valueFrom&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;configMapKeyRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jmx-prometheus&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;configs&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
  &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;jmxOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
    &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;20&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;image-registry.openshift-image-registry.svc:5000/kafka-connect/kafkaconnect-customized:1.0&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;3.2.0&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;externalConfiguration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;database_credentials&lt;/span&gt;
        &lt;span class="na"&gt;secret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;items&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credentials&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;credentials&lt;/span&gt;
          &lt;span class="na"&gt;optional&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;mysecrets&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, deploying Kafka Connect with the Strimzi Operator can be a powerful and efficient way to manage data integration in your organization. By combining the flexibility and scalability of Kafka with the automation the Strimzi Operator provides, you can streamline your data pipelines and improve your data-driven decision-making. In this article, I've covered the key steps involved in deploying Kafka Connect via the Strimzi Operator: creating a minimal custom resource, working around the REST API Basic authentication issue, configuring Kafka authentication, exposing JMX Prometheus metrics, adding plugins and artifacts, and handling file-based credentials. By following these steps, you can customize your Kafka Connect deployment to meet your specific needs.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>strimzi</category>
      <category>tutorial</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>ClickHouse Basic Tutorial: An Introduction</title>
      <dc:creator>Hamed Karbasi</dc:creator>
      <pubDate>Thu, 13 Apr 2023 23:01:20 +0000</pubDate>
      <link>https://forem.com/hoptical/clickhouse-basic-tutorial-an-introduction-52il</link>
      <guid>https://forem.com/hoptical/clickhouse-basic-tutorial-an-introduction-52il</guid>
      <description>&lt;p&gt;This is the first part of the &lt;strong&gt;ClickHouse Tutorial Series&lt;/strong&gt;. In this series, I cover some practical and vital aspects of the ClickHouse database, a robust OLAP technology many enterprise companies utilize.&lt;/p&gt;

&lt;p&gt;In this part, I'll talk about the main features, weaknesses, installation, and usage of ClickHouse. I'll also refer to some helpful links for those who want to dive into broader details.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ClickHouse
&lt;/h2&gt;

&lt;p&gt;ClickHouse is an open-source column-oriented database developed by Yandex. It is designed to provide high performance for analytical queries. ClickHouse uses a SQL-like query language for querying data and supports different data types, including integers, strings, dates, and floats. It offers various features such as clustering, distributed query processing, and fault tolerance. It also supports replication and data sharding. ClickHouse is used by companies such as Yandex, Facebook, and Uber for data analysis, machine learning, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Main Features
&lt;/h3&gt;

&lt;p&gt;The main features of Clickhouse Database are:&lt;/p&gt;

&lt;h4&gt;
  
  
  Column-Oriented
&lt;/h4&gt;

&lt;p&gt;Data in ClickHouse is stored in &lt;a href="https://clickhouse.com/docs/en/about-us/distinctive-features#true-column-oriented-database-management-system" rel="noopener noreferrer"&gt;columns instead of rows&lt;/a&gt;, bringing at least two benefits:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every column is stored in a separate file, so values of the same type sit next to each other; hence, each column, and consequently the whole table, compresses much more strongly.&lt;/li&gt;
&lt;li&gt;For range queries, which are common in analytical processing, the system can access and process data more efficiently because data is sorted by certain columns (i.e., the columns defined as sort keys). Additionally, it can parallelize processing across multiple cores while loading large columns.&lt;/li&gt;
&lt;/ol&gt;
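To make the second benefit concrete, here is a toy Python comparison of the two layouts: in the columnar layout, an aggregation over one column only touches that column's array, while the row layout must visit every full row. The table and column names are invented for illustration; none of this is ClickHouse code.

```python
# Toy illustration of row- vs column-oriented layouts (not ClickHouse itself).
rows = [
    {"OrderID": i, "CustomerID": i % 100, "Amount": float(i % 7)}
    for i in range(10_000)
]

# Column-oriented layout: one contiguous array per column.
columns = {
    "OrderID": [r["OrderID"] for r in rows],
    "CustomerID": [r["CustomerID"] for r in rows],
    "Amount": [r["Amount"] for r in rows],
}

# Row layout: every row (with all of its columns) is visited to sum one field.
total_row = sum(r["Amount"] for r in rows)

# Columnar layout: only the "Amount" array is scanned; the other columns
# are never touched, which is why analytical scans read far less data.
total_col = sum(columns["Amount"])

assert total_row == total_col
```

In a real column store the untouched columns never even leave disk, which is the main win for wide tables queried on a few columns.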

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fdocs%2Fassets%2Fimages%2Frow-oriented-3e6fd5aa48e3075202d242b4799da8fa.gif%23" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fdocs%2Fassets%2Fimages%2Frow-oriented-3e6fd5aa48e3075202d242b4799da8fa.gif%23"&gt;&lt;/a&gt;&lt;br&gt;Row-Oriented Database (Gif by &lt;a href="https://clickhouse.com/docs/en/faq/general/columnar-database" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt;)
  &lt;/p&gt;




&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb05sjd69q2feq3w6gfn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcb05sjd69q2feq3w6gfn.gif"&gt;&lt;/a&gt;&lt;br&gt;&lt;cite&gt;Columnar Database (Gif by &lt;a href="https://clickhouse.com/docs/en/faq/general/columnar-database" rel="noopener noreferrer"&gt;ClickHouse&lt;/a&gt;)&lt;/cite&gt;
  &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: Column-oriented databases should not be confused with wide-column databases like Cassandra, which still store data in rows but let you denormalize intensive data into tables with many columns, resulting in a NoSQL structure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Data Compression
&lt;/h4&gt;

&lt;p&gt;Thanks to compression algorithms (&lt;a href="https://github.com/facebook/zstd" rel="noopener noreferrer"&gt;zstd&lt;/a&gt; and &lt;a href="https://github.com/lz4/lz4" rel="noopener noreferrer"&gt;LZ4&lt;/a&gt;), data occupies much less storage, even more than 20x smaller! You can study some of the benchmarks on ClickHouse and other databases storage &lt;a href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQXRoZW5hIChwYXJ0aXRpb25lZCkiOnRydWUsIkF0aGVuYSAoc2luZ2xlKSI6dHJ1ZSwiQXVyb3JhIGZvciBNeVNRTCI6dHJ1ZSwiQXVyb3JhIGZvciBQb3N0Z3JlU1FMIjp0cnVlLCJCeXRlSG91c2UiOnRydWUsIkNpdHVzIjp0cnVlLCJjbGlja2hvdXNlLWxvY2FsIChwYXJ0aXRpb25lZCkiOnRydWUsImNsaWNraG91c2UtbG9jYWwgKHNpbmdsZSkiOnRydWUsIkNsaWNrSG91c2UgKHdlYikiOnRydWUsIkNsaWNrSG91c2UiOnRydWUsIkNsaWNrSG91c2UgKHR1bmVkKSI6dHJ1ZSwiQ2xpY2tIb3VzZSAoenN0ZCkiOnRydWUsIkNsaWNrSG91c2UgQ2xvdWQiOnRydWUsIkNyYXRlREIiOnRydWUsIkRhdGFiZW5kIjp0cnVlLCJEYXRhRnVzaW9uIChzaW5nbGUpIjp0cnVlLCJBcGFjaGUgRG9yaXMiOnRydWUsIkRydWlkIjp0cnVlLCJEdWNrREIgKFBhcnF1ZXQpIjp0cnVlLCJEdWNrREIiOnRydWUsIkVsYXN0aWNzZWFyY2giOnRydWUsIkVsYXN0aWNzZWFyY2ggKHR1bmVkKSI6ZmFsc2UsIkdyZWVucGx1bSI6dHJ1ZSwiSGVhdnlBSSI6dHJ1ZSwiSHlkcmEiOnRydWUsIkluZm9icmlnaHQiOnRydWUsIktpbmV0aWNhIjp0cnVlLCJNYXJpYURCIENvbHVtblN0b3JlIjp0cnVlLCJNYXJpYURCIjpmYWxzZSwiTW9uZXREQiI6dHJ1ZSwiTW9uZ29EQiI6dHJ1ZSwiTXlTUUwgKE15SVNBTSkiOnRydWUsIk15U1FMIjp0cnVlLCJQaW5vdCI6dHJ1ZSwiUG9zdGdyZVNRTCAodHVuZWQpIjpmYWxzZSwiUG9zdGdyZVNRTCI6dHJ1ZSwiUXVlc3REQiAocGFydGl0aW9uZWQpIjp0cnVlLCJRdWVzdERCIjp0cnVlLCJSZWRzaGlmdCI6dHJ1ZSwiU2VsZWN0REIiOnRydWUsIlNpbmdsZVN0b3JlIjp0cnVlLCJTbm93Zmxha2UiOnRydWUsIlNRTGl0ZSI6dHJ1ZSwiU3RhclJvY2tzIjp0cnVlLCJUaW1lc2NhbGVEQiAoY29tcHJlc3Npb24pIjp0cnVlLCJUaW1lc2NhbGVEQiI6dHJ1ZX0sInR5cGUiOnsic3RhdGVsZXNzIjp0cnVlLCJtYW5hZ2VkIjp0cnVlLCJKYXZhIjp0cnVlLCJjb2x1bW4tb3JpZW50ZWQiOnRydWUsIkMrKyI6dHJ1ZSwiTXlTUUwgY29tcGF0aWJsZSI6dHJ1ZSwicm93LW9yaWVudGVkIjp0cnVlLCJDIjp0cnVlLCJQb3N0Z3JlU1FMIGNvbXBhdGlibGUiOnRydWUsIkNsaWNrSG91c2UgZGVyaXZhdGl2ZSI6dHJ1ZSwiZW1iZWRkZWQiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsIlJ1c3QiOnRy
dWUsInNlYXJjaCI6dHJ1ZSwiZG9jdW1lbnQiOnRydWUsInRpbWUtc2VyaWVzIjp0cnVlfSwibWFjaGluZSI6eyJzZXJ2ZXJsZXNzIjp0cnVlLCIxNmFjdSI6dHJ1ZSwiTCI6dHJ1ZSwiTSI6dHJ1ZSwiUyI6dHJ1ZSwiWFMiOnRydWUsImM2YS40eGxhcmdlLCA1MDBnYiBncDIiOnRydWUsImM1bi40eGxhcmdlLCAyMDBnYiBncDIiOnRydWUsImM1LjR4bGFyZ2UsIDUwMGdiIGdwMiI6dHJ1ZSwiYzZhLm1ldGFsLCA1MDBnYiBncDIiOnRydWUsIjE2IHRocmVhZHMiOnRydWUsIjIwIHRocmVhZHMiOnRydWUsIjI0IHRocmVhZHMiOnRydWUsIjI4IHRocmVhZHMiOnRydWUsIjMwIHRocmVhZHMiOnRydWUsIjQ4IHRocmVhZHMiOnRydWUsIjYwIHRocmVhZHMiOnRydWUsIm01ZC4yNHhsYXJnZSI6dHJ1ZSwiYzZhLjR4bGFyZ2UsIDE1MDBnYiBncDIiOnRydWUsInJhMy4xNnhsYXJnZSI6dHJ1ZSwicmEzLjR4bGFyZ2UiOnRydWUsInJhMy54bHBsdXMiOnRydWUsImRjMi44eGxhcmdlIjp0cnVlLCJTMiI6dHJ1ZSwiUzI0Ijp0cnVlLCIyWEwiOnRydWUsIjNYTCI6dHJ1ZSwiNFhMIjp0cnVlLCJYTCI6dHJ1ZX0sImNsdXN0ZXJfc2l6ZSI6eyIxIjp0cnVlLCIyIjp0cnVlLCI0Ijp0cnVlLCI4Ijp0cnVlLCIxNiI6dHJ1ZSwiMzIiOnRydWUsIjY0Ijp0cnVlLCIxMjgiOnRydWUsInNlcnZlcmxlc3MiOnRydWUsInVuZGVmaW5lZCI6dHJ1ZX0sIm1ldHJpYyI6InNpemUiLCJxdWVyaWVzIjpbdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZSx0cnVlLHRydWUsdHJ1ZV19" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
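To get a feel for why homogeneous, sorted columns compress so well, the following Python sketch compresses the same toy data set row-wise and column-wise. It uses zlib purely as a stand-in for zstd/LZ4 (simply because it ships with Python), and the columns are invented for illustration.

```python
import zlib

n = 20_000

# Three toy columns: a high-cardinality id, a sorted low-cardinality
# country code, and a cyclic price (all names invented for illustration).
order_ids = [str(i) for i in range(n)]
countries = sorted(["DE", "FR", "IR", "JP", "US"][i % 5] for i in range(n))
prices = [str((i % 50) * 10) for i in range(n)]

# Row-oriented serialization: the columns interleave on every line.
row_blob = "\n".join(
    ",".join(vals) for vals in zip(order_ids, countries, prices)
).encode()
row_size = len(zlib.compress(row_blob))

# Column-oriented serialization: each column is a homogeneous block
# compressed on its own, as with one file per column on disk.
col_size = sum(
    len(zlib.compress("\n".join(col).encode()))
    for col in (order_ids, countries, prices)
)

print(f"row-wise: {row_size} bytes, column-wise: {col_size} bytes")
```

The sorted country column collapses into long runs and the cyclic price column into repeated blocks, so the column-wise total comes out smaller than the interleaved row serialization; real codecs like zstd exploit the same structure far more aggressively.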

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxauowhmmaq89hmgoxvht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxauowhmmaq89hmgoxvht.png"&gt;&lt;/a&gt;&lt;br&gt;ClickHouse columnar structure leads to storing and reading columns more efficiently (Graph by &lt;a href="https://altinity.com/blog/using-clickhouse-as-an-analytic-extension-for-mysql" rel="noopener noreferrer"&gt;Altinity&lt;/a&gt;)
  &lt;/p&gt;

&lt;h4&gt;
  
  
  Scalability
&lt;/h4&gt;

&lt;p&gt;ClickHouse scales well both vertically and horizontally. It can be scaled out by adding &lt;a href="https://altinitydb.medium.com/clickhouse-for-time-series-scalability-benchmarks-e181132a895b" rel="noopener noreferrer"&gt;extra replicas&lt;/a&gt; and extra shards to process queries in a distributed way. ClickHouse supports multi-master asynchronous replication and can be deployed across multiple data centers. All nodes are equal, which avoids single points of failure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weaknesses
&lt;/h3&gt;

&lt;p&gt;To mention a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lack of a full-fledged UPDATE/DELETE implementation: ClickHouse is not built for frequent modifications and mutations, so you'll see poor performance on those kinds of queries.&lt;/li&gt;
&lt;li&gt;OLTP workloads, such as point queries, won't make you happy: traditional RDBMSs like MySQL easily outperform ClickHouse on them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rivals and Alternatives
&lt;/h3&gt;

&lt;p&gt;To name a few:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Apache Druid&lt;/li&gt;
&lt;li&gt;ElasticSearch&lt;/li&gt;
&lt;li&gt;SingleStore&lt;/li&gt;
&lt;li&gt;Snowflake&lt;/li&gt;
&lt;li&gt;TimescaleDB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each one suits different use cases and has its own pros and cons, but I won't compare them here. However, you can study some valuable benchmarks &lt;a href="https://benchmark.clickhouse.com/" rel="noopener noreferrer"&gt;here&lt;/a&gt; and &lt;a href="https://www.timescale.com/blog/what-is-clickhouse-how-does-it-compare-to-postgresql-and-timescaledb-and-how-does-it-perform-for-time-series-data/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;I only cover the Docker approach here. For other methods on different distros, please follow &lt;a href="https://clickhouse.com/docs/en/install" rel="noopener noreferrer"&gt;ClickHouse's official installation guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The docker-compose file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;

&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;clickhouse&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;container_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myclickhouse&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;clickhouse/clickhouse-server:latest&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;8123:8123"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9000:9000"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./clickhouse-data:/var/lib/clickhouse/&lt;/span&gt;  
    &lt;span class="na"&gt;restart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;unless-stopped&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;And then run it by:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;As you can see, two ports have been exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;8123&lt;/strong&gt;: the HTTP interface port, used by JDBC and ODBC drivers and web interfaces.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9000&lt;/strong&gt;: the native protocol (TCP) port, used by ClickHouse tools such as &lt;em&gt;clickhouse-server&lt;/em&gt; and &lt;em&gt;clickhouse-client&lt;/em&gt;, and for inter-server communication in distributed queries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's up to your client driver which one to use. For example, &lt;a href="https://dbeaver.io/" rel="noopener noreferrer"&gt;DBeaver&lt;/a&gt; uses 8123, and the &lt;a href="https://clickhouse-driver.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Python clickhouse-driver&lt;/a&gt; uses 9000.&lt;/p&gt;

&lt;p&gt;To continue the tutorial, we use &lt;code&gt;clickhouse-client&lt;/code&gt;, which ships with the server container:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;

docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; myclickhouse clickhouse-client


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Database and Table Creation
&lt;/h3&gt;

&lt;p&gt;Create the &lt;code&gt;test&lt;/code&gt; database:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Create the &lt;code&gt;orders&lt;/code&gt; table:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;`OrderID`&lt;/span&gt; &lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`CustomerID`&lt;/span&gt; &lt;span class="n"&gt;Int64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`OrderDate`&lt;/span&gt; &lt;span class="nb"&gt;DateTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`Comments`&lt;/span&gt; &lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="nv"&gt;`Cancelled`&lt;/span&gt; &lt;span class="nb"&gt;Bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrderDate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrderDate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CustomerID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;index_granularity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;In the next parts, we'll talk about other configurations like &lt;code&gt;Engine&lt;/code&gt;, &lt;code&gt;PRIMARY KEY&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, etc.&lt;/p&gt;
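As a small preview of how PRIMARY KEY and index_granularity interact: MergeTree keeps rows sorted by the ORDER BY key and stores a sparse primary index with one mark per granule of index_granularity (here 8192) rows, so a key-range query only scans the granules that may contain matching keys. The following Python sketch is a conceptual toy model of that idea, not ClickHouse internals.

```python
import bisect

GRANULARITY = 8192  # rows per granule, as in index_granularity = 8192

# Sorted primary-key column (imagine OrderID values of one sorted table part).
keys = list(range(0, 1_000_000, 2))  # 500,000 rows, sorted

# Sparse index: one mark (the first key) per granule instead of one per row.
marks = keys[::GRANULARITY]

def granules_for_range(lo, hi):
    """Return the indexes of granules whose key range may overlap [lo, hi]."""
    first = max(bisect.bisect_right(marks, lo) - 1, 0)
    last = bisect.bisect_right(marks, hi) - 1
    return list(range(first, last + 1))

hit = granules_for_range(100_000, 101_000)
print(f"{len(marks)} marks for {len(keys)} rows; granules to scan: {hit}")
```

With only 62 marks covering 500,000 rows, the index stays tiny and fits in memory, while a narrow range query touches a single granule instead of the whole column.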
&lt;h3&gt;
  
  
  Insert Data
&lt;/h3&gt;

&lt;p&gt;To insert sample data:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt; 
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;334&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;123&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'2021-09-15 14:30:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'some comment'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="k"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Read Data
&lt;/h3&gt;

&lt;p&gt;Just like any other SQL query:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;OrderID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OrderDate&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the first part of the &lt;em&gt;ClickHouse Tutorial Series&lt;/em&gt;, we discussed the traits, features, and weaknesses of ClickHouse. Then we saw how to set up an instance with minimum configuration, create a database and table, insert data into it, and read from it.&lt;/p&gt;
&lt;h2&gt;
  
  
  Useful Links
&lt;/h2&gt;


&lt;div class="ltag__link"&gt;
  &lt;a href="/taw" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F673595%2Fa3ca7eb8-f1aa-48f9-b544-1b90b6a6f948.png" alt="taw"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/taw/getting-started-with-clickhouse-3nf9" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Getting Started with ClickHouse&lt;/h2&gt;
      &lt;h3&gt;Tomasz Wegrzanowski ・ Dec 24 '22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#clickhouse&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#nosql&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



&lt;div class="ltag__link"&gt;
  &lt;a href="/olena_kutsenko" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F601720%2F9dfb66ac-aa96-4905-aa6d-24272c4ae85b.png" alt="olena_kutsenko"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/olena_kutsenko/introduction-to-clickhouse-8em" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Introduction to ClickHouse&lt;/h2&gt;
      &lt;h3&gt;Olena Kutsenko ・ Nov 8 '22&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#clickhouse&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#datawarehouse&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#columnar&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#database&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



&lt;div class="ltag__link"&gt;
  &lt;a href="https://george3d6.medium.com/clickhouse-an-analytics-database-for-the-21st-century-82d3828f79cc" class="ltag__link__link" rel="noopener noreferrer"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afill%3A88%3A88%2F1%2ARPJSpIh08hBKVoCgDth-4Q.jpeg" alt="George Hosu"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="https://george3d6.medium.com/clickhouse-an-analytics-database-for-the-21st-century-82d3828f79cc" class="ltag__link__link" rel="noopener noreferrer"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Clickhouse, an analytics database for the 21st century | by George Hosu | Medium&lt;/h2&gt;
      &lt;h3&gt;George Hosu ・ &lt;time&gt;May 4, 2019&lt;/time&gt; ・ 
      &lt;div class="ltag__link__servicename"&gt;
        &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fmedium-f709f79cf29704f9f4c2a83f950b2964e95007a3e311b77f686915c71574fef2.svg" alt="Medium Logo"&gt;
        george3d6.Medium
      &lt;/div&gt;
    &lt;/h3&gt;
&lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;



</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
