<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: marijaselakovic</title>
    <description>The latest articles on Forem by marijaselakovic (@marijaselakovic).</description>
    <link>https://forem.com/marijaselakovic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F754431%2Fe43b9819-78f1-48eb-880d-d4741d30a3f6.JPG</url>
      <title>Forem: marijaselakovic</title>
      <link>https://forem.com/marijaselakovic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/marijaselakovic"/>
    <language>en</language>
    <item>
      <title>From data storage to data analysis: Tutorial on CrateDB and pandas 2</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 03 May 2023 06:55:15 +0000</pubDate>
      <link>https://forem.com/crate/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas-2-48i0</link>
      <guid>https://forem.com/crate/from-data-storage-to-data-analysis-tutorial-on-cratedb-and-pandas-2-48i0</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Pandas is an open-source data manipulation and analysis library for Python. It is widely used for handling and analyzing data in a variety of fields, including finance and research.&lt;/p&gt;

&lt;p&gt;One of the key benefits of pandas is its ability to handle and manipulate large datasets, making it a valuable tool for data scientists and analysts. The library provides easy-to-use data structures and functions for data cleaning, transformation, and analysis, making it an essential part of the data analysis workflow.&lt;/p&gt;

&lt;p&gt;Using CrateDB and pandas together can be a powerful combination for handling large volumes of data and performing complex data analysis tasks. In this tutorial, we will use a real-world dataset to show how to combine CrateDB and pandas for effective data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;p&gt;To follow along with this tutorial, you will need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A running instance of CrateDB 5.2.&lt;/li&gt;
&lt;li&gt;Python 3.x with the &lt;a href="https://pandas.pydata.org/pandas-docs/version/2.0/whatsnew/v2.0.0.html" rel="noopener noreferrer"&gt;pandas 2&lt;/a&gt; and &lt;a href="https://github.com/crate/crate-python" rel="noopener noreferrer"&gt;crate 0.31&lt;/a&gt; packages installed.&lt;/li&gt;
&lt;li&gt;A real-world dataset in CSV format. In this tutorial, we will be using the shop customer data available on &lt;a href="https://www.kaggle.com/datasets/datascientistanna/customers-dataset" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up CrateDB
&lt;/h2&gt;

&lt;p&gt;Before we can start using CrateDB, we need to set it up. You can either download and install CrateDB locally via &lt;a href="https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#docker" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; or &lt;a href="https://crate.io/docs/crate/tutorials/en/latest/basic/index.html#try-cratedb-without-installing" rel="noopener noreferrer"&gt;tarball&lt;/a&gt;, or use a &lt;a href="https://crate.io/download?hsCtaTracking=caa20047-f2b6-4e8c-b7f9-63fbf818b17f%7Cf1ad6eaa-39ac-49cd-8115-ed7d5dac4d63" rel="noopener noreferrer"&gt;CrateDB Cloud&lt;/a&gt; instance, which offers a free-cluster option.&lt;/p&gt;

&lt;p&gt;Once you have a running instance of CrateDB, create a new table to store the customer data dataset. Here is an SQL command to create a table:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
CREATE TABLE IF NOT EXISTS "doc"."customer_data" (
   "customerid" INTEGER,
   "gender" TEXT,
   "age" INTEGER,
   "annualincome" INTEGER,
   "spendingscore" INTEGER,
   "profession" TEXT,
   "workexperience" INTEGER,
   "familysize" INTEGER
)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After creating the table, you can import the customer data dataset into CrateDB using the &lt;code&gt;COPY FROM&lt;/code&gt; command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;

&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="nv"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"customer_data"&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'file:///path/to/Customers.csv'&lt;/span&gt; 
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'csv'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;delimiter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;','&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once you have CrateDB running, you can start exploring data with pandas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Querying data with CrateDB and pandas
&lt;/h2&gt;

&lt;p&gt;The first step is to import the &lt;code&gt;pandas&lt;/code&gt; library and specify the query you want to execute on CrateDB. In our example, we want to fetch all customer data.&lt;/p&gt;

&lt;p&gt;To read data from CrateDB and work with it in a pandas DataFrame, use the &lt;code&gt;read_sql&lt;/code&gt; function, as illustrated below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM customer_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above code, we establish a connection to a local CrateDB instance running on localhost on port 4200, execute a SQL query, and return the results as a pandas DataFrame. You can further modify the query to retrieve only the columns you need or to filter the data based on some condition.&lt;/p&gt;
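&lt;p&gt;For instance, you can retrieve only selected columns filtered on a condition by passing a more specific query to &lt;code&gt;read_sql&lt;/code&gt;. The sketch below is runnable without a cluster because it substitutes an in-memory SQLite database for CrateDB; against a real instance you would pass the &lt;code&gt;crate://localhost:4200&lt;/code&gt; connection string instead. The table and column names follow the earlier example:&lt;/p&gt;

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for CrateDB, so the read_sql
# mechanics can be shown without a running cluster. Against CrateDB you
# would pass 'crate://localhost:4200' as the second argument instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_data (customerid INTEGER, profession TEXT, annualincome INTEGER);
    INSERT INTO customer_data VALUES (1, 'Engineer', 70000), (2, 'Doctor', 90000), (3, 'Engineer', 50000);
""")

# Fetch only the columns we need, filtered on a condition
query = "SELECT customerid, annualincome FROM customer_data WHERE profession = 'Engineer'"
df = pd.read_sql(query, conn)
print(df)  # two rows, columns customerid and annualincome
```

&lt;p&gt;Pushing the filter into the SQL query keeps the transferred DataFrame small, which matters for large tables.&lt;/p&gt;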

&lt;h2&gt;
  
  
  Analyze the data
&lt;/h2&gt;

&lt;p&gt;Now that the data is loaded into a pandas DataFrame, we can perform various analyses and manipulations on it. For instance, we can group the data by one column and calculate the average value of another:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;avg_income&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profession&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;annualincome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this example, we group the data in the DataFrame by the &lt;code&gt;profession&lt;/code&gt; column and calculate the average annual income for each profession. You can plot the average incomes using the DataFrame's &lt;code&gt;plot()&lt;/code&gt; method, specifying the type of plot (a bar chart):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;income_by_profession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bar&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We also use &lt;code&gt;plt.show()&lt;/code&gt; from &lt;code&gt;matplotlib&lt;/code&gt; to display the plot:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly5008r8tol329qtihra.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fly5008r8tol329qtihra.png" alt="Plot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;That's it! You should now have a good idea of how to use CrateDB and pandas together to analyze large datasets stored in CrateDB. This allows you to take advantage of the powerful data manipulation capabilities of pandas to analyze and visualize your data.&lt;br&gt;
 To learn more about updates, features, and other questions you might have, join our &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB&lt;/a&gt; community.&lt;/p&gt;

</description>
      <category>python</category>
      <category>database</category>
      <category>dataframe</category>
      <category>pandas</category>
    </item>
    <item>
      <title>Guide to bitwise operators in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 02 Mar 2023 09:30:44 +0000</pubDate>
      <link>https://forem.com/crate/guide-to-bitwise-operators-in-cratedb-41ag</link>
      <guid>https://forem.com/crate/guide-to-bitwise-operators-in-cratedb-41ag</guid>
<description>&lt;p&gt;Bitwise operators let you perform efficient, concise operations on individual bits within integer values, which comes in handy in a variety of SQL queries.&lt;/p&gt;

&lt;p&gt;CrateDB continues to add valuable features: version 5.2 introduced support for bitwise operators. When is this feature handy? There are at least a few scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Storing multiple pieces of information in a single column.&lt;/strong&gt; For example, you can use the bitwise OR operator to combine multiple flags into a single value.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Simplifying conditional statements.&lt;/strong&gt; Bitwise operators can check the state of individual bits within a value. For example, a bitwise AND tells you whether a particular bit is set.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Easier data manipulation.&lt;/strong&gt; Bitwise operators can set or manipulate specific bits within a value. For example, a bitwise OR sets a particular bit to 1, and a bitwise XOR flips a bit (1 to 0 or 0 to 1).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CrateDB supports three bitwise operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bitwise AND (&lt;code&gt;&amp;amp;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bitwise OR (&lt;code&gt;|&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;bitwise XOR (&lt;code&gt;#&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let’s take a look at each operator and some interesting examples.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise AND
&lt;/h2&gt;

&lt;p&gt;The Bitwise AND operator compares each bit of two values and returns a new value with the bits that are set in both of the original values.&lt;/p&gt;

&lt;p&gt;The syntax for this operator is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;value1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;value2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;value1&lt;/code&gt; and &lt;code&gt;value2&lt;/code&gt; are the two values that you want to compare using the bitwise AND operator. These values can be any valid expressions or constants in SQL, such as columns, variables, or literals.&lt;/p&gt;

&lt;p&gt;For example, let’s imagine a table of &lt;code&gt;employees&lt;/code&gt; with a column &lt;code&gt;status&lt;/code&gt; that stores a &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/data-types.html#bit-strings" rel="noopener noreferrer"&gt;bitmask&lt;/a&gt; value representing the status of each employee. The status is stored as a byte, and each bit represents a different aspect of the employee’s status. To save storage space, we recommend the byte data type for storing up to 7 states simultaneously: the first bit always represents the sign, and we don’t use negative values to encode states. Similarly, use the short data type for up to 15 states, integer for up to 31 states, and long for up to 63 states; for more than 63 states, use the Bit String type. To learn more about the data types supported in CrateDB, check out our &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/data-types.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The first bit of the status value says whether the employee works full-time, the second bit says whether the employee works remotely, and so on. For instance, if the employee works full-time, the status value is 1 (&lt;code&gt;B'01'&lt;/code&gt;); if remote, the status value is 2 (&lt;code&gt;B'10'&lt;/code&gt;); and if both, the status value is 3 (&lt;code&gt;B'11'&lt;/code&gt;).&lt;/p&gt;
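&lt;p&gt;The same bitmask logic can be sketched in a few lines of Python. This is only an illustration of the flag arithmetic, not CrateDB code; &lt;code&gt;operator.and_&lt;/code&gt; is the function form of Python's bitwise AND operator:&lt;/p&gt;

```python
from operator import and_  # function form of the bitwise AND operator

FULL_TIME = 0b01  # first bit: working full-time
REMOTE = 0b10     # second bit: working remotely

# Combine flags with bitwise OR, exactly how the status values above are built
status_ana = FULL_TIME            # 1, i.e. B'01'
status_mary = FULL_TIME | REMOTE  # 3, i.e. B'11'

# Test a flag with bitwise AND: the bit survives only if it is set
mary_is_remote = and_(status_mary, REMOTE) == REMOTE  # True
ana_is_remote = and_(status_ana, REMOTE) == REMOTE    # False
```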

&lt;p&gt;Now, let’s create a table and populate it with sample data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="nb"&gt;BYTE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Ana'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Ana is working full-time from office'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Mary'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Mary is working full-time remotely'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Sara'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Sara is working part-time remotly'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To select all employees who are working full-time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Ana&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Mary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query will select all rows from the employees table where the first bit (full-time status) is set to 1.&lt;/p&gt;

&lt;p&gt;To find all employees who are working full-time and remotely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;Mary&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the bitwise AND operator in combination with other logical operators, such as OR and NOT, to create more complex queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise OR
&lt;/h2&gt;

&lt;p&gt;The bitwise OR operator is used to compare two binary values and return a new binary value where the resulting bit is set to 1 if either of the input bits is 1. It is represented by the symbol &lt;code&gt;|&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;To understand how the bitwise OR operator works, let's consider two binary values: 1011 and 1100. The bitwise OR operation compares each bit position of the two values and returns a new binary value based on the following rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If either of the input bits is 1, the resulting bit is set to 1.&lt;/li&gt;
&lt;li&gt;If both of the input bits are 0, the resulting bit is set to 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applying these rules to the example above, we get the following result: &lt;code&gt;1011 | 1100 = 1111&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The resulting binary value is 1111, which is equivalent to 15 in decimal.&lt;/p&gt;
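&lt;p&gt;You can verify this arithmetic directly in Python, whose &lt;code&gt;|&lt;/code&gt; operator behaves the same way on integers:&lt;/p&gt;

```python
# Verify the worked example: 1011 | 1100 = 1111 (decimal 15)
a = 0b1011  # decimal 11
b = 0b1100  # decimal 12
result = a | b
formatted = format(result, "04b")
print(result, formatted)  # 15 1111
```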

&lt;p&gt;Considering the example with the &lt;code&gt;employees&lt;/code&gt; table, let’s say we would like to select all employees who work full-time, remotely, or both. Here's how you can use the bitwise OR operator in the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;comment&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following query illustrates how to change a flag specifying whether an employee is working full-time without changing the existing flags:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="cm"&gt;/* FULL-TIME */&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Sara'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s important to note that bitwise operators work on the binary representation of integer values. If you want to perform a logical OR on boolean conditions instead, use the SQL OR operator.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bitwise XOR
&lt;/h2&gt;

&lt;p&gt;The bitwise XOR operator, represented by the symbol &lt;code&gt;#&lt;/code&gt;, compares two binary values and returns a new value based on the following rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If both bits are 0, the result is 0.&lt;/li&gt;
&lt;li&gt;If both bits are 1, the result is 0.&lt;/li&gt;
&lt;li&gt;If one bit is 1 and the other is 0, the result is 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using these rules on the example from above, the result of the XOR operation is: &lt;code&gt;1011 # 1100 = 0111&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;One interesting use of the bitwise XOR operator is toggling the status of an employee between “working remotely” and “working in the office”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Ana'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, the status of the employee named Ana is toggled from “working remotely” to “working in the office”, or vice versa, depending on the current value of the status field.&lt;/p&gt;
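&lt;p&gt;The toggle behavior is easy to see outside the database as well. The Python sketch below mirrors the &lt;code&gt;UPDATE&lt;/code&gt; above, with the caveat that Python spells the XOR operator &lt;code&gt;^&lt;/code&gt; rather than &lt;code&gt;#&lt;/code&gt;:&lt;/p&gt;

```python
REMOTE = 0b10  # second bit of the status bitmask: working remotely

def toggle_remote(status: int) -> int:
    # XOR flips the remote bit and leaves every other bit unchanged,
    # mirroring the SET clause of the CrateDB query above
    return status ^ REMOTE

ana = 0b01                 # full-time, in the office
ana = toggle_remote(ana)   # now 0b11: full-time and remote
back = toggle_remote(ana)  # toggling twice restores the original value
```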

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;In summary, the bitwise operators in CrateDB allow you to perform bitmasking operations on values in your database and use the results to filter or modify rows in your tables. By combining these operators with other logical operators, you can create powerful queries that manipulate and extract specific data from your database. If you have any further questions or would like to learn more about CrateDB, check out our &lt;a href="https://crate.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and join the &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>watercooler</category>
      <category>employment</category>
      <category>career</category>
    </item>
    <item>
      <title>Find the latest reported values with ease. Introducing max_by and min_by aggregations in CrateDB 5.2</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 23 Feb 2023 08:42:04 +0000</pubDate>
      <link>https://forem.com/crate/find-the-latest-reported-values-with-ease-introducing-maxby-and-minby-aggregations-in-cratedb-52-3ig8</link>
      <guid>https://forem.com/crate/find-the-latest-reported-values-with-ease-introducing-maxby-and-minby-aggregations-in-cratedb-52-3ig8</guid>
      <description>&lt;p&gt;&lt;strong&gt;CrateDB 5.2 adds two new aggregation functions: max_by and min_by&lt;/strong&gt;&lt;br&gt;
These aggregation functions let users retrieve the value of one column based on the minimum or maximum value of another column, making them useful for analyzing trends, identifying outliers, or simply understanding the range of values within a dataset. An example use case is getting the latest measurement by using the time column and &lt;code&gt;max_by(measurement, time)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;MIN_BY&lt;/code&gt; and &lt;code&gt;MAX_BY&lt;/code&gt; functions find the value of one column based on the minimum or maximum value of another column. For example, if you have a table with the columns product, category, and price, you can use &lt;code&gt;min_by(product, price)&lt;/code&gt; to find the product with the lowest price in each category:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;price&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cheapest_product&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;product_list&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;max_by(returned_value, maximized_value)&lt;/code&gt; and &lt;code&gt;min_by(returned_value, minimized_value)&lt;/code&gt; return the value of the first column for which the value of the second column is maximized or minimized. If multiple rows maximize or minimize the result of the second column, the output will be non-deterministic and CrateDB can return any value from the list of resulting rows.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;max_by&lt;/code&gt; and &lt;code&gt;min_by&lt;/code&gt; can be used for numerical and non-numerical data.&lt;/p&gt;
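&lt;p&gt;For intuition, the same lookup can be reproduced client-side in pandas with &lt;code&gt;idxmin&lt;/code&gt;. The hypothetical in-memory table below stands in for the &lt;code&gt;product_list&lt;/code&gt; table from the example, with made-up products and prices:&lt;/p&gt;

```python
import pandas as pd

# Made-up stand-in for the product_list table from the SQL example
df = pd.DataFrame({
    "product": ["apple", "banana", "steak", "tofu"],
    "category": ["fruit", "fruit", "meat", "meat"],
    "price": [1.2, 0.5, 9.9, 3.4],
})

# min_by(product, price) grouped by category: take the row label of the
# minimum price within each group, then look up the product at that label
idx = df.groupby("category")["price"].idxmin()
cheapest = df.loc[idx].set_index("category")["product"]
print(cheapest)  # fruit -> banana, meat -> tofu
```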

&lt;h2&gt;
  
  
  Load the dataset
&lt;/h2&gt;

&lt;p&gt;Let’s start with examples using the dataset about power consumption. First, create a table with the schema below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;"ts"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_active_power"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Voltage"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Global_intensity"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_1"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_2"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"Sub_metering_3"&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"meter_id"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"location"&lt;/span&gt; &lt;span class="n"&gt;GEO_POINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;"city"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To import data, use the following &lt;code&gt;COPY FROM&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;COPY&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="s1"&gt;'https://srv.demo.crate.io/datasets/power_consumption.json'&lt;/span&gt;
&lt;span class="k"&gt;RETURN&lt;/span&gt; &lt;span class="n"&gt;SUMMARY&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The dataset covers consumption data over several years and shows the differences between several measured utilities. For instance, the column &lt;code&gt;"Sub_metering_1"&lt;/code&gt; shows how much energy is consumed in the kitchen. Similarly, the columns &lt;code&gt;"Sub_metering_2"&lt;/code&gt; and &lt;code&gt;"Sub_metering_3"&lt;/code&gt; show the energy consumed by the laundry and the climate control systems. The full description of the dataset can be found here.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example queries
&lt;/h2&gt;

&lt;p&gt;Given the dataset, let’s find the ids of the house meters that at some point in time had the highest consumption in the kitchen and in the laundry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;max_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Sub_metering_1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_kitchen&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  
       &lt;span class="n"&gt;max_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Sub_metering_2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;max_laundry&lt;/span&gt;  
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result of this query should contain the following meter ids:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------------+-------------+
| max_kitchen | max_laundry |
+-------------+-------------+
| 84007B127R  | 840070504U  |
+-------------+-------------+
SELECT 1 row in set (7.423 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another example would be to find the meter id for the house with the lowest unused power:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;min_unused&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result tells us which house meter recorded the lowest value of unused power:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------+
| min_unused |
+------------+
| 84007B008L |
+------------+
SELECT 1 row in set (0.197 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also combine these functions with &lt;code&gt;WHERE&lt;/code&gt; or &lt;code&gt;GROUP BY&lt;/code&gt; clauses in CrateDB. For example, let's find, for each meter id, the total consumption at the moment when the unused power was the lowest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;meter_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;min_by&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"Global_active_power"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nv"&gt;"Global_reactive_power"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_consumption&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;power_consumption&lt;/span&gt; 
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;meter_id&lt;/span&gt; 
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The query result will list the consumption for each meter id:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------+-------------------+
| meter_id   | total_consumption |
+------------+-------------------+
| 840073190N |             0.202 |
| 840071457E |             0.258 |
| 840072897V |             0.14  |
| 840072655G |             0.218 |
| 840072219H |             0.274 |
| 840071893D |             1.342 |
| 840075260N |             0.246 |
| 840076398A |             0.226 |
| 840072328B |             0.212 |
| 840071760J |             0.222 |
+------------+-------------------+
SELECT 10 rows in set (0.067 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
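
&lt;p&gt;These functions also combine with a &lt;code&gt;WHERE&lt;/code&gt; clause. As a small sketch (the filter condition here is made up for illustration), the following restricts the search for the highest kitchen consumption to moments when the laundry sub-meter was active:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT max_by(meter_id, "Sub_metering_1") AS max_kitchen
FROM doc.power_consumption
WHERE "Sub_metering_2" &amp;gt; 0;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;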



&lt;h2&gt;
  
  
  Performance and alternatives
&lt;/h2&gt;

&lt;p&gt;As you have seen above, &lt;code&gt;min_by&lt;/code&gt; and &lt;code&gt;max_by&lt;/code&gt; provide a concise and convenient way to find the value of one column based on the minimum or maximum value of another column. Not only is this convenient, it also brings significant performance gains over the alternative queries one had to write in earlier versions of CrateDB.&lt;/p&gt;

&lt;p&gt;Let us look at another example and see how much easier and faster it is to get the right results in CrateDB 5.2. We start with our often-used dataset containing IoT device data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
       &lt;span class="nv"&gt;"time"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;battery_level&lt;/span&gt; &lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="n"&gt;battery_status&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to find the latest reported &lt;code&gt;battery_level&lt;/code&gt; and &lt;code&gt;battery_status&lt;/code&gt; for each device in our dataset holding 30 million records in total. In CrateDB 5.1 and earlier versions, one had to fall back on a two-step approach and use a &lt;code&gt;JOIN&lt;/code&gt; like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;battery_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;battery_temperature&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
            &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;device_id&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;
        &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;max_r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not only does the nested structure make query adjustments more difficult (and one needs to remember the pattern), but the performance, due to the expensive &lt;code&gt;JOIN&lt;/code&gt;, is also not great, with a runtime of roughly 9 seconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Runtime (in ms):
    mean:    8982.507 ± 57.494
    min/max: 8578.380 → 9843.830
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
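
&lt;p&gt;In CrateDB 5.2, the same result can be expressed in a single aggregation with &lt;code&gt;max_by&lt;/code&gt;. The query below is a sketch reconstructed from the description; the exact statement used for the benchmark may differ:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT device_id,
       MAX(time) AS time,
       max_by(battery_level, time) AS battery_level,
       max_by(battery_status, time) AS battery_status
FROM devices.readings
GROUP BY device_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;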



&lt;p&gt;With &lt;code&gt;max_by&lt;/code&gt;, the same result comes back in roughly 1.3 seconds: an improvement of about 85% in query speed (8.9 s → 1.3 s), with a simpler syntax on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap up
&lt;/h2&gt;

&lt;p&gt;Overall, the &lt;code&gt;max_by&lt;/code&gt; and &lt;code&gt;min_by&lt;/code&gt; functions in CrateDB provide an easy and efficient way to find the maximum or minimum value of a given column in a table based on the values in a different column. These functions can be used in a variety of scenarios to quickly and easily find the highest or lowest values in a set of data.&lt;/p&gt;

&lt;p&gt;If you like this blog post and want to learn more about CrateDB, check out our documentation and join the CrateDB community!&lt;/p&gt;

</description>
      <category>cratedb</category>
      <category>sql</category>
      <category>performance</category>
      <category>aggregations</category>
    </item>
    <item>
      <title>Guide to sharding and partitioning best practices in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Fri, 17 Feb 2023 14:36:47 +0000</pubDate>
      <link>https://forem.com/crate/guide-to-sharding-and-partitioning-best-practices-in-cratedb-5207</link>
      <guid>https://forem.com/crate/guide-to-sharding-and-partitioning-best-practices-in-cratedb-5207</guid>
      <description>&lt;p&gt;Sharding and partitioning are very important concepts when it comes to system scaling. When defining your strategy, you should account upfront for any future growth, given the significant burden of moving data and restructuring the tables. In this article, we will give you a thorough understanding of how sharding and partitioning work in CrateDB. We will start by covering the basic definitions, discussing the principles behind shard distribution and replication in CrateDB, and how to avoid common bottlenecks. &lt;/p&gt;

&lt;h2&gt;
  
  
  Partition, shard, and Lucene index
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A table&lt;/strong&gt; in CrateDB is a collection of data. It consists of a specified number of columns and any number of rows. Every table must have a schema describing the table structure. Very often, a table is divided into independent, smaller parts based on a particular column. Such a “smaller part” is called a &lt;strong&gt;partition&lt;/strong&gt;. A table becomes a partitioned table if a partition column is defined during table creation. In this case, when a new record is inserted, a new partition is created if a partition for that column value doesn’t exist yet. Partitioning makes large tables easier to maintain and improves the performance of particular SQL operations. However, a poor choice of partition column can lead to too many partitions, which can slow down the system.&lt;/p&gt;

&lt;p&gt;Now let’s take a look at the concept of a shard. In CrateDB, &lt;strong&gt;a shard&lt;/strong&gt; is a subdivision of a table or a table partition; the number of shards is configurable, and each shard is stored on a node in the cluster. When a node is added to or removed from the cluster, or when the data distribution becomes unbalanced, CrateDB automatically redistributes the shards across the nodes in the cluster to ensure an even data distribution. If the number of shards is not defined during table creation, CrateDB will apply a sensible default value depending on &lt;a href="https://crate.io/docs/crate/reference/en/3.3/sql/statements/create-table.html#ref-clustered-clause" rel="noopener noreferrer"&gt;the number of nodes&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, each shard in CrateDB represents a Lucene index. A Lucene index is a collection of Lucene segments, where each segment contains inverted indexes, doc values, and k-d trees. Take a look at &lt;a href="https://crate.io/blog/guide-to-write-operations-in-cratedb" rel="noopener noreferrer"&gt;our previous article&lt;/a&gt; to get a better overview of Lucene indexes and segments. The diagram below illustrates the connection between partitions, shards, and the Lucene index.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxtqdabscguhpg6tjp9y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffxtqdabscguhpg6tjp9y.png" alt="connections" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How the data is distributed
&lt;/h2&gt;

&lt;p&gt;CrateDB uses &lt;em&gt;row-based sharding&lt;/em&gt; to split data across multiple shards. This means that data is split based on the values in specific columns of the table, and distributed across multiple nodes. The column that the data should be sharded on is called the &lt;em&gt;routing column&lt;/em&gt;. When you insert, update, or delete a row, CrateDB will use the routing column value to determine which shard to access. The number of shards and the routing column can be specified in the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/create-table.html#clustered" rel="noopener noreferrer"&gt;CLUSTERED&lt;/a&gt; clause when creating a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, the product table is sharded into four shards. If a primary key is set, as illustrated, the routing column can be omitted, as CrateDB uses the primary key for routing by default. However, if neither a primary key nor a routing column is set, the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/system-columns.html#sql-administration-system-column-id" rel="noopener noreferrer"&gt;internal document ID&lt;/a&gt; is used.&lt;/p&gt;
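
&lt;p&gt;The routing column can also be specified explicitly. The statement below is a sketch equivalent to the example above (note that when a primary key is defined, the routing column must be part of it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;CREATE TABLE product (
     product_id INT PRIMARY KEY,
     name TEXT,
     amount INT
) CLUSTERED BY (product_id) INTO 4 SHARDS;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;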

&lt;p&gt;To distribute data across the cluster, CrateDB uses a hash-based approach based on the simple formula:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;shard number = hash(routing column) % total primary shards&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As a result, all documents with the same value of &lt;code&gt;CLUSTERED BY&lt;/code&gt; column will be stored in the same shard. With a hash function, CrateDB will try to distribute the data roughly equally, even if the original data values are not evenly distributed.&lt;/p&gt;
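
&lt;p&gt;As an illustration of this formula, here is a simplified sketch in Python. It is only a model: CrateDB uses its own internal hash function, so the actual shard assignments will differ, but the properties are the same, i.e. deterministic routing and a roughly even spread.&lt;/p&gt;

```python
import zlib
from collections import Counter

def shard_for(routing_value: str, total_primary_shards: int) -> int:
    # shard number = hash(routing column) % total primary shards
    # crc32 stands in for CrateDB's internal hash function here
    return zlib.crc32(routing_value.encode("utf-8")) % total_primary_shards

# All rows with the same routing value always land on the same shard,
# and the values spread roughly evenly across the 4 shards.
assignments = [shard_for(f"product-{i}", 4) for i in range(1000)]
counts = Counter(assignments)
print(counts)
```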

&lt;h2&gt;
  
  
  Shard replication
&lt;/h2&gt;

&lt;p&gt;Shard replication in CrateDB is a feature that allows you to replicate data across multiple nodes in a cluster. This can be useful for increasing data availability, improving performance, and reducing the risk of data loss. You can &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/create-table.html#sql-create-table-number-of-replicas" rel="noopener noreferrer"&gt;configure&lt;/a&gt; the number of replicas for each shard with the &lt;code&gt;number_of_replicas&lt;/code&gt; table setting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="n"&gt;amount&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When there are multiple copies of the same shard, CrateDB will mark one copy as the &lt;em&gt;primary shard&lt;/em&gt; and treat the rest as &lt;em&gt;replica shards&lt;/em&gt;. When data is written to a shard, it is first written to the primary shard. The primary shard then replicates the data to one or more replica shards on other nodes in the cluster. This process is done in real-time, ensuring that data is always up-to-date across all replicas.&lt;/p&gt;

&lt;p&gt;In the event of a node failure, CrateDB will automatically promote one of the replica shards to become the new primary shard. Having more shard replicas means a lower chance of permanent data loss and more throughput as queries will utilize the extra replica shards so that the primary shard is not congested with many requests. In terms of the cost you pay, you will have higher disk space utilization and inter-node network traffic which leads to increased latency of inserts and updates. Additionally, CrateDB supports automatic failover, where the system &lt;em&gt;automatically&lt;/em&gt; detects a failed node and promotes a replica shard to take its place as the primary shard.&lt;/p&gt;

&lt;p&gt;It is also possible to specify the number of replicas for a specific table by using &lt;code&gt;ALTER TABLE&lt;/code&gt; command. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that changing the number of replicas after a table is created will cause the cluster to redistribute the shards and may take some time to complete (if you want only new partitions to be affected, use the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/alter-table.html" rel="noopener noreferrer"&gt;ONLY keyword&lt;/a&gt;).&lt;/p&gt;
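
&lt;p&gt;For instance, on a partitioned table (here a hypothetical &lt;code&gt;parted_sales&lt;/code&gt; table), the following would change the setting for future partitions only, leaving existing partitions untouched:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;ALTER TABLE ONLY parted_sales SET (number_of_replicas = 2);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;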

&lt;h2&gt;
  
  
  Automatic creation of new partitions
&lt;/h2&gt;

&lt;p&gt;Tables in CrateDB are designed to be dynamic and expandable, allowing for the addition of new rows and columns as needed. This means that you can create a table with an unlimited number of rows. Whenever new data is inserted, CrateDB dynamically creates a new table partition, if needed, based on the partition column, as illustrated by the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sales&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"name"&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"ts"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nv"&gt;"month"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'month'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="nv"&gt;"value"&lt;/span&gt; &lt;span class="nb"&gt;DOUBLE&lt;/span&gt; &lt;span class="nb"&gt;PRECISION&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt;&lt;span class="err"&gt;​&lt;/span&gt;
  &lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;month&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiixo7mlf5ch2ri4fn29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmiixo7mlf5ch2ri4fn29.png" alt="example sales" width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For every unique value in the &lt;code&gt;month&lt;/code&gt; column, a new partition will be created. In our example, the table can have up to twelve partitions, one for each month of the year. If more columns are used for partitioning, a new partition will be created for every unique combination of values. The partition column can also be a generated column, i.e. a column whose value is calculated based on other columns. For instance, if you have a column containing a timestamp value, you can partition the data by a generated column that extracts the day from the timestamp.&lt;/p&gt;

&lt;p&gt;The automatic creation of new partitions allows for horizontal scaling and enables the database to handle large amounts of data by distributing it across multiple partitions. Each partition’s shard is stored on a separate node in a CrateDB cluster, which helps to improve query performance and reduce the load on individual nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to avoid too many shards
&lt;/h2&gt;

&lt;p&gt;If the routing column is badly chosen, you can end up with too many shards in the cluster, which negatively affects overall stability and performance. To work out how many shards your tables need, consider the type of data you are processing, the required queries, and the hardware configuration. If you do end up with too many shards, you will have to manually reduce their number by merging shards and moving them to the same node, which is a time-consuming and tedious operation. To get an idea of how many shards your cluster needs, check out &lt;a href="https://community.crate.io/t/sharding-and-partitioning-guide-for-time-series-data/737" rel="noopener noreferrer"&gt;our recent tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The general rule for avoiding performance bottlenecks is to have at least as many shards for a table as there are CPUs in the cluster. This increases the chances that a query can be parallelized and distributed maximally. However, if most nodes have more shards per table than they have CPUs, you could actually see performance degradation. Each shard comes with a cost in terms of open files, RAM, and CPU cycles. Having too many small shards can negatively impact performance and scalability for a few reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Increased overhead&lt;/em&gt; in terms of managing and maintaining your cluster.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Reduced performance&lt;/em&gt; because it takes longer for CrateDB to gather the results, which is especially true for queries that need to join data from multiple shards.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Limited scalability&lt;/em&gt; because it becomes more difficult to scale your cluster if you have too many shards. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Increased complexity&lt;/em&gt; as too many shards can make it more difficult to understand and troubleshoot your data distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Finally, for performance reasons, consider one thousand shards per node the highest recommended configuration. Likewise, a single table should not have more than one thousand partitions. If you exceed these numbers, you will see a failing cluster check.&lt;/p&gt;
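
&lt;p&gt;To see how the shards are currently spread across your cluster, you can query the &lt;code&gt;sys.shards&lt;/code&gt; system table; a minimal sketch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT node['name'] AS node_name, count(*) AS num_shards
FROM sys.shards
GROUP BY node['name'];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;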

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Sharding and partitioning in CrateDB are two key concepts that help to improve the scalability and performance of your database. In this article, we explored the basic principles of sharding and partitioning in CrateDB, and how they can be used to improve performance and scalability in your database. If you have any further questions or would like to learn more about CrateDB, check out our &lt;a href="https://crate.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and join &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;the CrateDB community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>support</category>
      <category>discuss</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>Monitoring an on-premises CrateDB cluster with Prometheus and Grafana</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 19 Jan 2023 12:28:56 +0000</pubDate>
      <link>https://forem.com/crate/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana-31g1</link>
      <guid>https://forem.com/crate/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana-31g1</guid>
      <description>&lt;p&gt;If you are running CrateDB in a production environment, you have probably wondered what would be the best way to monitor the servers to identify issues before they become problematic and to collect statistics that you can use for capacity planning.&lt;/p&gt;

&lt;p&gt;We recommend pairing two well-known OSS solutions, &lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; which is a system that collects and stores performance metrics, and &lt;a href="https://grafana.com/" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt; which is a system to create dashboards.&lt;/p&gt;

&lt;p&gt;For a CrateDB environment, we are interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CrateDB-specific metrics, such as the number of shards or number of failed queries&lt;/li&gt;
&lt;li&gt;and OS metrics, such as available disk space, memory usage, or CPU usage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For what concerns CrateDB-specific metrics we recommend making these available to Prometheus by using the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/admin/monitoring.html#exposing-jmx-via-http" rel="noopener noreferrer"&gt;Crate JMX HTTP Exporter&lt;/a&gt; and &lt;a href="https://github.com/justwatchcom/sql_exporter" rel="noopener noreferrer"&gt;Prometheus SQL Exporter&lt;/a&gt;. For what concerns OS metrics, in Linux environments, we recommend using the &lt;a href="https://prometheus.io/docs/guides/node-exporter/" rel="noopener noreferrer"&gt;Prometheus Node Exporter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Things are a bit different of course if you are using containers, or if you are using the fully-managed cloud-hosted &lt;a href="https://crate.io/products/cratedb-cloud" rel="noopener noreferrer"&gt;CrateDB Cloud&lt;/a&gt;, but let’s see how all this works on an on-premises installation by setting all this up together.&lt;/p&gt;

&lt;h2&gt;
  
  
  First we need a CrateDB cluster
&lt;/h2&gt;

&lt;p&gt;First things first, we will need a CrateDB cluster. You may have one already, and that is great; if you do not, we can get one up quickly.&lt;/p&gt;

&lt;p&gt;You can review the install documentation at &lt;a href="https://crate.io/docs/crate/tutorials/en/latest/self-hosted/index.html" rel="noopener noreferrer"&gt;https://crate.io/docs/crate/tutorials/en/latest/self-hosted/index.html&lt;/a&gt; and &lt;a href="https://crate.io/docs/crate/howtos/en/latest/clustering/multi-node-setup.html" rel="noopener noreferrer"&gt;https://crate.io/docs/crate/howtos/en/latest/clustering/multi-node-setup.html&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In my case, I am using Ubuntu and I did it like this: first, I SSH to the first machine and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano /etc/default/crate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a configuration file that will be used by CrateDB. We only need one line here, to configure the memory settings (a required step; otherwise, we will fail the bootstrap checks):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CRATE_HEAP_SIZE=4G
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to create another configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; /etc/crate
nano /etc/crate/crate.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case I used the following values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;network.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;_local_,_site_&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This tells CrateDB to respond to requests both from localhost and the local network.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;discovery.seed_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ubuntuvm1:4300&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ubuntuvm2:4300&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lists all the machines that make up our cluster. Here I only have 2, but for production use we recommend having at least 3 nodes, so that a quorum can be established in case of a network partition, avoiding split-brain scenarios.&lt;br&gt;
&lt;/p&gt;
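&lt;p&gt;The quorum arithmetic behind that recommendation can be sketched in a few lines (an illustration, not CrateDB code): electing a master requires a strict majority of the master-eligible nodes, so a two-node cluster cannot tolerate losing either node, while a three-node cluster survives the loss of one.&lt;/p&gt;

```python
def quorum(n):
    # Strict majority of n master-eligible nodes.
    return n // 2 + 1

# With 2 nodes, quorum is 2: losing either node (or any partition)
# leaves the cluster unable to elect a master.
print(quorum(2))  # 2
# With 3 nodes, quorum is still 2: one node can fail, and a network
# partition can never produce two disjoint majorities.
print(quorum(3))  # 2
```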

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cluster.initial_master_nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ubuntuvm1&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ubuntuvm2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lists the nodes that are eligible to act as master nodes during bootstrap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;auth.host_based.enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;auth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host_based&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;crate&lt;/span&gt;
        &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;_local_&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trust&lt;/span&gt;
      &lt;span class="na"&gt;99&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;password&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This indicates that the &lt;code&gt;crate&lt;/code&gt; superuser will work for local connections, but connections from other machines will require a username and password.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gateway.recover_after_data_nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;span class="na"&gt;gateway.expected_data_nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This requires both nodes to be available for the cluster to operate. With more nodes, we could have set &lt;code&gt;recover_after_data_nodes&lt;/code&gt; to a value smaller than the total number of nodes.&lt;/p&gt;

&lt;p&gt;Now let’s install CrateDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://cdn.crate.io/downloads/deb/DEB-GPG-KEY-crate
apt-key add DEB-GPG-KEY-crate
add-apt-repository &lt;span class="s2"&gt;"deb https://cdn.crate.io/downloads/deb/stable/ &lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;lsb_release &lt;span class="nt"&gt;-cs&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt; main"&lt;/span&gt;
apt update
apt &lt;span class="nb"&gt;install &lt;/span&gt;crate &lt;span class="nt"&gt;-o&lt;/span&gt; Dpkg::Options::&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"--force-confold"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(&lt;code&gt;force-confold&lt;/code&gt; is used to keep the configuration files we created earlier)&lt;/p&gt;

&lt;p&gt;Repeat the above steps on the other node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup of the Crate JMX HTTP Exporter
&lt;/h2&gt;

&lt;p&gt;This is very simple. On each node, run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /usr/share/crate/lib
wget https://repo1.maven.org/maven2/io/crate/crate-jmx-exporter/1.0.0/crate-jmx-exporter-1.0.0.jar
nano /etc/default/crate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;then uncomment the &lt;code&gt;CRATE_JAVA_OPTS&lt;/code&gt; line and change its value to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CRATE_JAVA_OPTS="-javaagent:/usr/share/crate/lib/crate-jmx-exporter-1.0.0.jar=8080"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and restart the crate daemon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart crate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prometheus Node Exporter
&lt;/h2&gt;

&lt;p&gt;This can be set up with a one-liner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt install prometheus-node-exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prometheus SQL Exporter
&lt;/h2&gt;

&lt;p&gt;The SQL Exporter allows running arbitrary SQL statements against a CrateDB cluster to retrieve additional information. As the cluster contains information from each node, we do not need to install the SQL Exporter on every node. Instead, we install it centrally on the same machine that also hosts Prometheus.&lt;/p&gt;

&lt;p&gt;Please note that setting up a Grafana data source pointing at CrateDB to display query output in real time is not the same as using Prometheus to collect these values over time.&lt;/p&gt;

&lt;p&gt;Installing the package is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;prometheus-sql-exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the SQL Exporter to connect to the cluster, we need to create a new user &lt;code&gt;sql_exporter&lt;/code&gt; and grant it read access to the &lt;code&gt;sys&lt;/code&gt; schema. Run the commands below on any CrateDB node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s1"&gt;'http://localhost:4200/_sql'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"stmt":"CREATE USER sql_exporter WITH (password = '&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;'insert_password'&lt;/span&gt;&lt;span class="se"&gt;\'&lt;/span&gt;&lt;span class="s1"&gt;');"}'&lt;/span&gt;
curl &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s1"&gt;'http://localhost:4200/_sql'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"stmt":"GRANT DQL ON SCHEMA sys TO sql_exporter;"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then create a configuration file in &lt;code&gt;/etc/prometheus-sql-exporter.yml&lt;/code&gt; with a sample query that retrieves the number of shards per node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global"&lt;/span&gt;
  &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;5m'&lt;/span&gt;
  &lt;span class="na"&gt;connections&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;postgres://sql_exporter:insert_password@ubuntuvm1:5433?sslmode=disable'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;queries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shard_distribution"&lt;/span&gt;
    &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Number&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;shards&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;per&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;node"&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node_name"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shards"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;SELECT node['name'] AS node_name, COUNT(*) AS shards&lt;/span&gt;
      &lt;span class="s"&gt;FROM sys.shards&lt;/span&gt;
      &lt;span class="s"&gt;GROUP BY 1;&lt;/span&gt;
    &lt;span class="na"&gt;allow_zero_rows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heap_usage"&lt;/span&gt;
    &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Used&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;heap&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;space&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;per&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;node"&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node_name"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heap_used"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;SELECT name AS node_name, heap['used'] / heap['max']::DOUBLE AS heap_used&lt;/span&gt;
      &lt;span class="s"&gt;FROM sys.nodes;&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;global_translog"&lt;/span&gt;
    &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Global&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;translog&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;statistics"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;translog_uncommitted_size"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;SELECT COALESCE(SUM(translog_stats['uncommitted_size']), 0) AS translog_uncommitted_size&lt;/span&gt;
      &lt;span class="s"&gt;FROM sys.shards;&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;checkpoints"&lt;/span&gt;
    &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Maximum&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;global/local&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;checkpoint&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;delta"&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_checkpoint_delta"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;SELECT COALESCE(MAX(seq_no_stats['local_checkpoint'] - seq_no_stats['global_checkpoint']), 0) AS max_checkpoint_delta&lt;/span&gt;
      &lt;span class="s"&gt;FROM sys.shards;&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shard_allocation_issues"&lt;/span&gt;
    &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Shard&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;allocation&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;issues"&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shard_type"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shards"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;SELECT IF(s.primary = TRUE, 'primary', 'replica') AS shard_type, COALESCE(shards, 0) AS shards&lt;/span&gt;
        &lt;span class="s"&gt;FROM UNNEST([true, false]) s(primary)&lt;/span&gt;
        &lt;span class="s"&gt;LEFT JOIN (&lt;/span&gt;
          &lt;span class="s"&gt;SELECT primary, COUNT(*) AS shards&lt;/span&gt;
          &lt;span class="s"&gt;FROM sys.allocations&lt;/span&gt;
          &lt;span class="s"&gt;WHERE current_state &amp;lt;&amp;gt; 'STARTED'&lt;/span&gt;
          &lt;span class="s"&gt;GROUP BY 1&lt;/span&gt;
        &lt;span class="s"&gt;) a ON s.primary = a.primary;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Please note: There exist two implementations of the SQL Exporter: &lt;a href="https://github.com/burningalchemist/sql_exporter" rel="noopener noreferrer"&gt;burningalchemist/sql_exporter&lt;/a&gt; and &lt;a href="https://github.com/justwatchcom/sql_exporter" rel="noopener noreferrer"&gt;justwatchcom/sql_exporter&lt;/a&gt;. They don't share the same configuration options.&lt;br&gt;
Our example is based on the implementation that is shipped with the Ubuntu package, which is justwatchcom/sql_exporter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To apply the new configuration, we restart the service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart prometheus-sql-exporter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SQL Exporter can also be used to monitor business metrics, but be careful with regularly running expensive queries. Below is another, more advanced monitoring query for CrateDB that may be useful:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* Time since the last successful snapshot (backup) */&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;started&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;MinutesSinceLastSuccessfulSnapshot&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;snapshots&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"state"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'SUCCESS'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Prometheus setup
&lt;/h2&gt;

&lt;p&gt;You would run this on a machine that is not part of the CrateDB cluster. It can be installed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt install prometheus --no-install-recommends
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that, by default, Prometheus becomes available right away on port 9090 without any authentication. You can use &lt;code&gt;policy-rcd-declarative&lt;/code&gt; to prevent the service from starting immediately after installation, and you can define a YAML web config file with &lt;code&gt;basic_auth_users&lt;/code&gt; and then refer to that file in &lt;code&gt;/etc/default/prometheus&lt;/code&gt;.&lt;/p&gt;
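&lt;p&gt;As an illustration of that last point, such a web config file could look roughly like the following (file name, username, and hash are placeholders; Prometheus expects passwords as bcrypt hashes):&lt;/p&gt;

```yaml
# /etc/prometheus/web.yml -- hypothetical example
# A bcrypt hash can be generated e.g. with: htpasswd -nBC 10 "" | tr -d ':\n'
basic_auth_users:
  monitoring: $2y$10$replace.this.with.a.real.bcrypt.hash
```

&lt;p&gt;You would then point Prometheus at it, e.g. via &lt;code&gt;--web.config.file=/etc/prometheus/web.yml&lt;/code&gt; in &lt;code&gt;/etc/default/prometheus&lt;/code&gt;.&lt;/p&gt;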

&lt;p&gt;For a large deployment where you also use Prometheus to monitor other systems, you may want to use a CrateDB cluster as the storage backend for all Prometheus metrics; you can read more about this at &lt;a href="https://github.com/crate/cratedb-prometheus-adapter" rel="noopener noreferrer"&gt;CrateDB Prometheus Adapter&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now we will configure Prometheus to scrape metrics from the Node Exporter on the CrateDB machines, as well as metrics from our Crate JMX HTTP Exporter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nano /etc/prometheus/prometheus.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We replace this with the configuration below, which reflects port 9100 (Prometheus Node Exporter), port 8080 (Crate JMX HTTP Exporter), and port 9237 (Prometheus SQL Exporter).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;node'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ubuntuvm1:9100'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ubuntuvm2:9100'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cratedb_jmx'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ubuntuvm1:8080'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ubuntuvm2:8080'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;job_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sql_exporter'&lt;/span&gt;
  &lt;span class="na"&gt;static_configs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;targets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9237'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the &lt;code&gt;prometheus&lt;/code&gt; daemon if it was already started (&lt;code&gt;systemctl restart prometheus&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Grafana setup
&lt;/h2&gt;

&lt;p&gt;Grafana can run on the same machine as Prometheus and can be installed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb https://packages.grafana.com/oss/deb stable main"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/grafana.list
wget &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; - https://packages.grafana.com/gpg.key | &lt;span class="nb"&gt;sudo &lt;/span&gt;apt-key add -
apt update
apt &lt;span class="nb"&gt;install &lt;/span&gt;grafana
systemctl start grafana-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you now point your browser to &lt;em&gt;http://&amp;lt;Grafana host&amp;gt;:3000&lt;/em&gt;, you will be welcomed by the Grafana login screen. The first time, you can log in with &lt;code&gt;admin&lt;/code&gt; as both the username and password; make sure to change this password right away.&lt;/p&gt;

&lt;p&gt;Click on "Add your first data source", then click on "Prometheus", and enter the URL &lt;em&gt;http://&amp;lt;Prometheus host&amp;gt;:9090&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If you had configured basic authentication for Prometheus this is where you would need to enter the credentials.&lt;/p&gt;

&lt;p&gt;Click "Save &amp;amp; test".&lt;/p&gt;

&lt;p&gt;An example dashboard based on the discussed setup is available for easy importing on &lt;a href="https://grafana.com/grafana/dashboards/17174-cratedb-monitoring/" rel="noopener noreferrer"&gt;grafana.com&lt;/a&gt;. In your Grafana installation, on the left-hand side, hover over the “Dashboards” icon and select “Import”. Specify the ID 17174 and load the dashboard. On the next screen, finalize the setup by selecting your previously created Prometheus data sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg8rkfobxaidskovyvvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg8rkfobxaidskovyvvq.png" alt="Grafana dashboard" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Alternative implementations
&lt;/h2&gt;

&lt;p&gt;If you decide to build your own dashboard or use an entirely different monitoring approach, we recommend still covering similar metrics as discussed in this article. The list below is a good starting point for troubleshooting most operational issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CrateDB metrics (with example Prometheus queries based on the Crate JMX HTTP Exporter)

&lt;ul&gt;
&lt;li&gt;Thread pools rejected: &lt;code&gt;sum(rate(crate_threadpools{property="rejected"}[5m])) by (name)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Thread pool queue size: &lt;code&gt;sum(crate_threadpools{property="queueSize"}) by (name)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Thread pools active: &lt;code&gt;sum(crate_threadpools{property="active"}) by (name)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Queries per second: &lt;code&gt;sum(rate(crate_query_total_count[5m])) by (query)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Query error rate: &lt;code&gt;sum(rate(crate_query_failed_count[5m])) by (query)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Average Query Duration over the last 5 minutes: &lt;code&gt;sum(rate(crate_query_sum_of_durations_millis[5m])) by (query) / sum(rate(crate_query_total_count[5m])) by (query)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Circuit breaker memory in use: &lt;code&gt;sum(crate_circuitbreakers{property="used"}) by (name)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Number of shards: &lt;code&gt;crate_node{name="shard_stats",property="total"}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Garbage Collector rates: &lt;code&gt;sum(rate(jvm_gc_collection_seconds_count[5m])) by (gc)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Operating system metrics

&lt;ul&gt;
&lt;li&gt;CPU utilization&lt;/li&gt;
&lt;li&gt;Memory usage&lt;/li&gt;
&lt;li&gt;Open file descriptors&lt;/li&gt;
&lt;li&gt;Disk usage&lt;/li&gt;
&lt;li&gt;Disk read/write operations and throughput&lt;/li&gt;
&lt;li&gt;Received and transmitted network traffic&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping up
&lt;/h2&gt;

&lt;p&gt;We now have a Grafana dashboard that lets us check live and historical data on performance and capacity metrics in our CrateDB cluster. This illustrates one possible setup; you could use different tools depending on your environment and preferences. Still, we recommend using the interface of the Crate JMX HTTP Exporter to collect CrateDB-specific metrics, and always also monitoring the health of the environment at the OS level, as we have done here with the Prometheus Node Exporter.&lt;/p&gt;

</description>
      <category>gamedev</category>
      <category>showdev</category>
      <category>indiegames</category>
      <category>retrogaming</category>
    </item>
    <item>
      <title>Advanced downsampling with the LTTB algorithm</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 19 Jan 2023 12:24:43 +0000</pubDate>
      <link>https://forem.com/crate/advanced-downsampling-with-the-lttb-algorithm-3okh</link>
      <guid>https://forem.com/crate/advanced-downsampling-with-the-lttb-algorithm-3okh</guid>
      <description>&lt;p&gt;Downsampling is the process of transforming data, reducing its resolution so that it requires less space, while preserving some of its basic characteristics so that it is still usable.&lt;/p&gt;

&lt;p&gt;There are a number of reasons why we may want to do this.&lt;/p&gt;

&lt;p&gt;We may, for instance, have sensors reporting readings at a high frequency, say every 5 seconds, and we may have accumulated historical data over a period of time. This may start to consume lots of disk space.&lt;/p&gt;

&lt;p&gt;Looking at data from last year, there may never be a need to go down to the 5-second level of precision, but we may want to keep the data at a lower level of granularity for future reference.&lt;/p&gt;

&lt;p&gt;There is another case where we may want to downsample data. Let’s consider again the same scenario with a reading every 5 seconds. A week of data would require 120,960 data points, but we may be presenting this information to users as a graph which may be only 800 pixels wide on the screen. It should be possible then to transfer just 800 points, saving network bandwidth and CPU cycles on the client side. The trick resides in getting the right points.&lt;/p&gt;
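&lt;p&gt;Those figures are easy to verify with a quick back-of-the-envelope calculation:&lt;/p&gt;

```python
# One reading every 5 seconds, accumulated over a full week:
seconds_per_week = 7 * 24 * 60 * 60   # 604800
readings = seconds_per_week // 5
print(readings)  # 120960

# An 800-pixel-wide chart has only 800 columns to draw into,
# so each pixel would have to summarize about 151 raw points:
print(readings / 800)  # 151.2
```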

&lt;p&gt;There are many ways to tackle this problem. Sometimes we can just keep one reading per day and drop the rest (think stock prices at closing time), and sometimes we can get good results with averages, but there are cases where these simple approaches result in a distorted view of the data.&lt;/p&gt;
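&lt;p&gt;A tiny example illustrates how plain averaging can distort the data: a short spike that dominates the raw signal almost disappears once each bucket is reduced to its mean.&lt;/p&gt;

```python
# A flat signal with a single short spike.
signal = [0.0] * 20
signal[7] = 100.0

# Downsample 20 points to 4 by averaging fixed buckets of 5 points.
bucket = 5
means = [sum(signal[i:i + bucket]) / bucket for i in range(0, len(signal), bucket)]

print(max(signal))  # 100.0 -- the spike dominates the raw data
print(means)        # [0.0, 20.0, 0.0, 0.0] -- the spike is flattened to 20
```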

&lt;p&gt;Sveinn Steinarsson at the University of Iceland looked at this problem and came up with an algorithm called Largest-Triangle-Three-Buckets, or LTTB for short, which is published under the MIT license at &lt;a href="https://github.com/sveinn-steinarsson/flot-downsample" rel="noopener noreferrer"&gt;sveinn-steinarsson/flot-downsample: Downsample plugin for Flot charts. (github.com)&lt;/a&gt;.&lt;/p&gt;
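&lt;p&gt;The core idea fits in a short function: keep the first and last points, split the rest into evenly sized buckets, and from each bucket select the point forming the largest triangle with the previously selected point and the average of the next bucket. The sketch below is a minimal Python rendition of that idea, not the reference implementation from the repository.&lt;/p&gt;

```python
def lttb(points, threshold):
    """Downsample a list of (x, y) points to `threshold` points with LTTB."""
    n = len(points)
    if threshold >= n or threshold < 3:
        return list(points)
    bucket_size = (n - 2) / (threshold - 2)
    sampled = [points[0]]       # always keep the first point
    a = 0                       # index of the most recently selected point
    for i in range(threshold - 2):
        # Average point of the *next* bucket, used as the third triangle vertex.
        nxt_lo = int((i + 1) * bucket_size) + 1
        nxt_hi = min(int((i + 2) * bucket_size) + 1, n)
        avg_x = sum(p[0] for p in points[nxt_lo:nxt_hi]) / (nxt_hi - nxt_lo)
        avg_y = sum(p[1] for p in points[nxt_lo:nxt_hi]) / (nxt_hi - nxt_lo)
        # Pick the point in the current bucket that forms the largest triangle
        # with the previously selected point and the next bucket's average.
        lo = int(i * bucket_size) + 1
        hi = int((i + 1) * bucket_size) + 1
        ax, ay = points[a]
        best_area, best_idx = -1.0, lo
        for j in range(lo, hi):
            x, y = points[j]
            area = abs((ax - avg_x) * (y - ay) - (ax - x) * (avg_y - ay)) / 2
            if area > best_area:
                best_area, best_idx = area, j
        sampled.append(points[best_idx])
        a = best_idx
    sampled.append(points[-1])  # always keep the last point
    return sampled
```

&lt;p&gt;Given a series of (x, y) tuples and a target size, it returns a subset of the original points, always including the first and the last one, and it tends to keep outliers such as spikes that bucket averaging would flatten.&lt;/p&gt;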

&lt;p&gt;Let’s take a look at how we can use this with CrateDB and the kind of results we can get.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting some sample data
&lt;/h2&gt;

&lt;p&gt;I will be using Python and &lt;a href="https://jupyter.org/install" rel="noopener noreferrer"&gt;Jupyter Notebook&lt;/a&gt; for this demonstration, but you can use any language of your choice.&lt;/p&gt;

&lt;p&gt;I will start by installing a few dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;seaborn
pip &lt;span class="nb"&gt;install &lt;/span&gt;crate[sqlalchemy]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I will now import the demo data from Sveinn Steinarsson’s GitHub repo into a table called &lt;code&gt;demo&lt;/code&gt; in my local CrateDB instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urllib.request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;urllib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;urlopen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/sveinn-steinarsson/flot-downsample/8bb3db00f035dfab897e29d01fd7ae1d9dc999b2/demo_data.js&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;demo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since I already have this dataset of 5,000 points in a DataFrame, I will plot it using &lt;code&gt;seaborn&lt;/code&gt; to see what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt;
&lt;span class="n"&gt;seaborn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lineplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0irfc1w470by8p2oip1n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0irfc1w470by8p2oip1n.png" alt="Demo 1" width="574" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So this is the actual profile of this noisy-looking signal.&lt;/p&gt;

&lt;h2&gt;
  
  
  A basic approach using averages
&lt;/h2&gt;

&lt;p&gt;Let’s try downsampling this 50-fold, down to 100 data points, with a simple query in CrateDB:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;avg_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;avg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;demo&lt;/span&gt; 
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
dfavg_data = pd.read_sql(avg_query, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
seaborn.lineplot(dfavg_data,x=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bucket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,y=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fdtjvu2bsuqo4z85dqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fdtjvu2bsuqo4z85dqv.png" alt="Demo 2" width="578" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not wrong if we know we are looking at an average. But if we are doing this downsampling behind the scenes to save space, bandwidth, or CPU cycles, we risk giving users a very misleading picture: notice, for instance, that the y-axis now only reaches about 1.5 instead of 10+, because averaging each bucket flattens the spikes.&lt;/p&gt;
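&lt;p&gt;We can reproduce this effect with a few lines of plain Python, using a synthetic noisy signal with injected spikes (an assumed shape for illustration, not the actual demo data):&lt;/p&gt;

```python
import random

# Synthetic signal: 5000 readings of low-amplitude noise with periodic spikes.
random.seed(42)
readings = [random.gauss(0, 0.5) for _ in range(5000)]
for i in range(0, 5000, 500):
    readings[i] = 10.0  # inject a spike every 500 readings

# Downsample 50x by averaging buckets of 50 consecutive readings,
# which is what the (n/50)*50 GROUP BY query does.
bucket_avgs = [sum(readings[i:i + 50]) / 50 for i in range(0, 5000, 50)]

print(max(readings))     # 10.0: the spikes dominate the raw signal
print(max(bucket_avgs))  # well below 1: each spike is diluted across its bucket
```

&lt;p&gt;Each spike is averaged together with 49 small readings, so it contributes only 10/50 = 0.2 to its bucket, and the plot loses the true amplitude of the signal.&lt;/p&gt;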

&lt;h2&gt;
  
  
  Enter LTTB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/sveinn-steinarsson/flot-downsample" rel="noopener noreferrer"&gt;The reference implementation&lt;/a&gt; of the LTTB algorithm is written in JavaScript, which is very handy because CrateDB supports JavaScript &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/user-defined-functions.html" rel="noopener noreferrer"&gt;user-defined functions&lt;/a&gt; out of the box.&lt;/p&gt;
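&lt;p&gt;To build some intuition for what the UDF computes, here is a minimal pure-Python sketch of the LTTB algorithm (an illustrative port, not the JavaScript code we will deploy):&lt;/p&gt;

```python
def lttb(data, threshold):
    """Downsample [(x, y), ...] to `threshold` points with Largest-Triangle-Three-Buckets."""
    n = len(data)
    if threshold >= n or threshold < 3:
        return list(data)
    sampled = [data[0]]                  # always keep the first point
    every = (n - 2) / (threshold - 2)    # bucket size for the interior points
    a = 0                                # index of the previously selected point
    for i in range(threshold - 2):
        # Average point of the *next* bucket, used as the third triangle vertex.
        nxt_start = int((i + 1) * every) + 1
        nxt_end = min(int((i + 2) * every) + 1, n)
        avg_x = sum(p[0] for p in data[nxt_start:nxt_end]) / (nxt_end - nxt_start)
        avg_y = sum(p[1] for p in data[nxt_start:nxt_end]) / (nxt_end - nxt_start)
        # In the current bucket, pick the point forming the largest triangle
        # with the previously selected point and the next bucket's average.
        start, end = int(i * every) + 1, int((i + 1) * every) + 1
        ax, ay = data[a]
        max_area, max_idx = -1.0, start
        for j in range(start, end):
            area = abs((ax - avg_x) * (data[j][1] - ay)
                       - (ax - data[j][0]) * (avg_y - ay)) / 2
            if area > max_area:
                max_area, max_idx = area, j
        sampled.append(data[max_idx])
        a = max_idx
    sampled.append(data[-1])             # always keep the last point
    return sampled
```

&lt;p&gt;Because each bucket contributes the point with the largest triangle area, extreme values such as spikes tend to survive the downsampling instead of being averaged away.&lt;/p&gt;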

&lt;p&gt;&lt;a href="https://community.cratedb.com/uploads/short-url/evM3SD7DKaHZ5Fjbnt3XDcGCDD2.sql" rel="noopener noreferrer"&gt;Take a look at the linked script&lt;/a&gt;: compared to the original JavaScript, we have made only minor adjustments so that the input and output variables are in a format that is easy to work with in SQL in CrateDB. It relies on arrays and objects, two very useful CrateDB features that are not available in many other database systems. This particular version requires CrateDB 5.1.1, but it is perfectly possible to make it work on earlier versions with some minor changes.&lt;/p&gt;

&lt;p&gt;In this case the x-axis holds integers, but this approach works equally well with timestamps; just change &lt;code&gt;ARRAY(INTEGER)&lt;/code&gt; in the script to &lt;code&gt;ARRAY(TIMESTAMP WITH TIME ZONE)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Another cool CrateDB feature we can use here is function overloading: we can deploy multiple versions of the same function, one whose &lt;code&gt;xarray&lt;/code&gt; parameter is an array of integers and another where it is an array of timestamps, and both will be available to use in SQL queries.&lt;/p&gt;

&lt;p&gt;We can check the different versions deployed by querying the &lt;code&gt;routines&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;routine_schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;specific_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;routines&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;routine_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'lttb_with_parallel_arrays'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have our UDF ready let’s give it a try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lttb_query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;downsampleddata&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; 
     &lt;span class="p"&gt;(&lt;/span&gt;  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;lttb_with_parallel_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;   
            &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;demo&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;                           
            &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;demo&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lttb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lttb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'0'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;unnest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lttb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'1'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;reading&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;downsampleddata&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
dflttb_data = pd.read_sql(lttb_query, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
seaborn.lineplot(dflttb_data,x=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;,y=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reading&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hsujt9brjiittqqbrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8hsujt9brjiittqqbrp.png" alt="Demo 3" width="574" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, this dataset of just 100 points resembles the original profile of the signal much more closely: the y-axis range is back to -10 to 10, the same as in the original graph, the noisy character of the curve has been preserved, and we can see the spikes all along the curve.&lt;/p&gt;

&lt;p&gt;We hope you find this interesting, do not hesitate to let us know your thoughts.&lt;/p&gt;

</description>
      <category>welcome</category>
      <category>firstpost</category>
      <category>community</category>
    </item>
    <item>
      <title>Fetching large results sets from CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 19 Jan 2023 09:49:26 +0000</pubDate>
      <link>https://forem.com/crate/fetching-large-results-sets-from-cratedb-3jen</link>
      <guid>https://forem.com/crate/fetching-large-results-sets-from-cratedb-3jen</guid>
<description>&lt;p&gt;As a distributed database system with support for standard SQL, CrateDB is great for running aggregations server-side, working on huge datasets and getting summarized results back. There are, however, cases where we still want to retrieve lots of data from CrateDB, to train a machine learning model for instance.&lt;/p&gt;

&lt;p&gt;CrateDB collects results in memory before sending them back to clients, so trying to run a &lt;code&gt;SELECT&lt;/code&gt; statement that returns a very large result set in one go can trigger &lt;a href="https://crate.io/docs/crate/reference/en/5.0/config/cluster.html#query-circuit-breaker" rel="noopener noreferrer"&gt;circuit breakers&lt;/a&gt; or result in an &lt;code&gt;OutOfMemoryError&lt;/code&gt;. Getting all results in a single operation can also be a challenge client-side, so we need a mechanism to fetch results in manageable batches as we are ready to process them.&lt;/p&gt;

&lt;p&gt;One option, for cases where we are looking at a single table and already know that we need all the records that satisfy a condition, is to do a bulk export with the &lt;a href="https://crate.io/docs/crate/reference/en/5.0/sql/statements/copy-to.html" rel="noopener noreferrer"&gt;COPY TO&lt;/a&gt; command, which accepts a &lt;code&gt;WHERE&lt;/code&gt; clause. In many cases, however, we may want to run more complex queries, or storing the results in files may simply not fit well into our application.&lt;/p&gt;

&lt;h2&gt;
  
  
  The case for pagination
&lt;/h2&gt;

&lt;p&gt;Another common requirement is pagination: presenting results to users in pages with a set number of results each, allowing them to move between those pages. Since in this scenario many users will only look at the first few pages of results, we want to implement it as efficiently as possible.&lt;/p&gt;

&lt;p&gt;Let’s imagine we have a table called “observations” with the following data:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;ts&lt;/th&gt;
&lt;th&gt;device&lt;/th&gt;
&lt;th&gt;reading&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2021-10-14T09:39:19&lt;/td&gt;
&lt;td&gt;dev2&lt;/td&gt;
&lt;td&gt;-1682&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-02-02T00:33:47&lt;/td&gt;
&lt;td&gt;dev1&lt;/td&gt;
&lt;td&gt;827&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-06-11T21:49:53&lt;/td&gt;
&lt;td&gt;dev2&lt;/td&gt;
&lt;td&gt;-713&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-06-29T23:23:28&lt;/td&gt;
&lt;td&gt;dev1&lt;/td&gt;
&lt;td&gt;1059&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-01T09:22:56&lt;/td&gt;
&lt;td&gt;dev2&lt;/td&gt;
&lt;td&gt;-689&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-07-10T02:43:36&lt;/td&gt;
&lt;td&gt;dev2&lt;/td&gt;
&lt;td&gt;-570&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-09-18T03:28:02&lt;/td&gt;
&lt;td&gt;dev1&lt;/td&gt;
&lt;td&gt;303&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2022-10-14T20:34:10&lt;/td&gt;
&lt;td&gt;dev1&lt;/td&gt;
&lt;td&gt;1901&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We will work with a very small number of records here to visualize how the different techniques work, but imagine that we have thousands or even millions of records. In particular, the examples here retrieve 2 rows at a time, but depending on the use case you would probably retrieve 50, 1,000, or even 5,000 rows at a time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;LIMIT&lt;/code&gt; + &lt;code&gt;OFFSET&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observations&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01 00:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2022-10-01 00:00'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+--------+---------+
| date_format(ts)             | device | reading |
+-----------------------------+--------+---------+
| 2022-02-02T00:33:47.000000Z | dev1   |     827 |
| 2022-06-11T21:49:53.000000Z | dev2   |    -713 |
+-----------------------------+--------+---------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We could then re-issue the query with &lt;code&gt;LIMIT 2 OFFSET 2&lt;/code&gt; and we would get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------+--------+---------+
| date_format(ts)             | device | reading |
+-----------------------------+--------+---------+
| 2022-06-29T23:23:28.000000Z | dev1   |    1059 |
| 2022-07-01T09:22:56.000000Z | dev2   |    -689 |
+-----------------------------+--------+---------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are a number of issues to be aware of with this approach.&lt;/p&gt;

&lt;p&gt;Each new query is considered a new request and looks at current data. Consider what happens if the observation for 11 June 2022 above were deleted after we run the first query, but before we run the second one with &lt;code&gt;OFFSET 2&lt;/code&gt;. By skipping 2 rows we now also skip the observation from 29 June 2022, and users will never see it.&lt;/p&gt;
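&lt;p&gt;We can simulate this drift with an in-memory list standing in for the table (a toy model of the queries above, not actual CrateDB behavior):&lt;/p&gt;

```python
# Toy table: (ts, device, reading) rows, already sorted by ts.
rows = [
    ("2022-02-02T00:33:47", "dev1", 827),
    ("2022-06-11T21:49:53", "dev2", -713),
    ("2022-06-29T23:23:28", "dev1", 1059),
    ("2022-07-01T09:22:56", "dev2", -689),
]

def page(table, limit, offset):
    """Like LIMIT/OFFSET: slices the *current* state of the table."""
    return table[offset:offset + limit]

first_page = page(rows, 2, 0)           # the 02 Feb and 11 June rows

# The 11 June observation is deleted between the two requests...
rows = [r for r in rows if r[0] != "2022-06-11T21:49:53"]

second_page = page(rows, 2, 2)          # OFFSET 2 now lands past the 29 June row

seen = [r[0] for r in first_page + second_page]
print("2022-06-29T23:23:28" in seen)    # False: the row was silently skipped
```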

&lt;p&gt;Another issue is that there is not always an efficient way for CrateDB to skip rows, so for certain queries, as the &lt;code&gt;OFFSET&lt;/code&gt; value goes up, we may find that execution times grow larger and larger, because the engine actually goes through the rows that need to be skipped and discards them server-side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using &lt;code&gt;LIMIT&lt;/code&gt; with a &lt;code&gt;WHERE&lt;/code&gt; clause on a watermark field
&lt;/h2&gt;

&lt;p&gt;Continuing from the example above, after we get the initial 2 rows, instead of using &lt;code&gt;OFFSET 2&lt;/code&gt; we could run a query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;date_format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observations&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'2022-06-11T21:49:53.000000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="s1"&gt;'2022-10-01 00:00'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That 11 June value is the last value we have observed so far in the &lt;code&gt;ts&lt;/code&gt; column, which in this case we know to be always increasing. This approach is very efficient, but it can only be used if there is a suitable field in the data, which is not always the case.&lt;/p&gt;

&lt;p&gt;Also, compared to the &lt;code&gt;LIMIT&lt;/code&gt; + &lt;code&gt;OFFSET&lt;/code&gt; approach we discussed earlier, this cannot let users jump to a given page of results without first obtaining all the results for the previous pages; we cannot, for instance, jump directly to page 10, as we do not know the last value of &lt;code&gt;ts&lt;/code&gt; on page 9.&lt;/p&gt;
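&lt;p&gt;The watermark technique translates into a simple client-side loop. Here is a sketch against an in-memory sorted list, with a hypothetical &lt;code&gt;fetch_batch&lt;/code&gt; function standing in for the parameterized SQL query:&lt;/p&gt;

```python
# Toy data sorted by the watermark column ts (made-up values).
rows = [(f"2022-0{m}-01T00:00:00", m * 100) for m in range(1, 10)]

def fetch_batch(after_ts, limit):
    """Stands in for: SELECT ... WHERE ts > ? ORDER BY ts LIMIT ?"""
    return [r for r in rows if r[0] > after_ts][:limit]

watermark = ""                 # sorts before any real timestamp
batches = []
while True:
    batch = fetch_batch(watermark, 2)
    if not batch:
        break
    batches.append(batch)
    watermark = batch[-1][0]   # last ts seen becomes the new watermark

print(len(batches))            # 5 batches covering all 9 rows (2+2+2+2+1)
```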

&lt;p&gt;Some people call the approach above “cursor pagination”, but the term “cursor” most commonly refers to something a bit different, which we are going to discuss now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cursors
&lt;/h2&gt;

&lt;p&gt;A cursor is like a bookmark pointing to a specific record in the result set of a query. This is a generic approach that is implemented efficiently and does not require a special anchor/watermark column.&lt;/p&gt;

&lt;p&gt;In CrateDB we can use cursors at the protocol level or with SQL commands.&lt;/p&gt;

&lt;p&gt;Cursors in CrateDB are &lt;code&gt;INSENSITIVE&lt;/code&gt;, meaning that the client can take all the time it needs to retrieve the results, and the data will always reflect the status of the tables as it was at the time the cursor was declared, ignoring any records that were updated, deleted, or newly inserted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using cursors in Python
&lt;/h2&gt;

&lt;p&gt;In Python one way to work with cursors is with &lt;a href="https://magicstack.github.io/asyncpg/current/" rel="noopener noreferrer"&gt;asyncpg&lt;/a&gt;, taking advantage of CrateDB's compatibility with the &lt;a href="https://crate.io/docs/crate/reference/en/5.0/interfaces/postgres.html" rel="noopener noreferrer"&gt;PostgreSQL wire protocol&lt;/a&gt;.&lt;br&gt;
First, we need to install the library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;asyncpg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can use it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncpg&lt;/span&gt;

&lt;span class="c1"&gt;# If you are using jupyter-notebook 
# remove this function definition line and the indentation in the block of code that follows
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# we will then establish a connection    
&lt;/span&gt;    &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncpg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;crate&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# and we need a “transaction” context, 
&lt;/span&gt;    &lt;span class="c1"&gt;# there are no transactions as such in CrateDB, 
&lt;/span&gt;    &lt;span class="c1"&gt;# but this gives a scope where the cursor lives:  
&lt;/span&gt;    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transaction&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

        &lt;span class="c1"&gt;# and now we can declare the cursor 
&lt;/span&gt;        &lt;span class="c1"&gt;# specifying how many rows we want asyncpg to fetch at a time from CrateDB, 
&lt;/span&gt;        &lt;span class="c1"&gt;# and we can iterate over the results: 
&lt;/span&gt;        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT ts,device,reading FROM doc.observations WHERE ts BETWEEN &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2022-01-01 00:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; AND &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2022-10-01 00:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prefetch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove this line if you are using jupyter-notebook
&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Just to clarify, our Python code works with one record at a time, but behind the scenes &lt;code&gt;asyncpg&lt;/code&gt; is requesting 1000 records at a time from CrateDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using cursors in Java
&lt;/h2&gt;

&lt;p&gt;In Java, we can use the PostgreSQL JDBC driver.&lt;br&gt;
In a Maven project add this to your &lt;code&gt;pom.xml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight xml"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;dependencies&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;dependency&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;groupId&amp;gt;&lt;/span&gt;org.postgresql&lt;span class="nt"&gt;&amp;lt;/groupId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;artifactId&amp;gt;&lt;/span&gt;postgresql&lt;span class="nt"&gt;&amp;lt;/artifactId&amp;gt;&lt;/span&gt;
        &lt;span class="nt"&gt;&amp;lt;version&amp;gt;&lt;/span&gt;42.5.0&lt;span class="nt"&gt;&amp;lt;/version&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;/dependency&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/dependencies&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can use it like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.sql.*&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="cm"&gt;/* ... */&lt;/span&gt;   
&lt;span class="cm"&gt;/* we first establish a connection to CrateDB */&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Connection&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DriverManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getConnection&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"jdbc:postgresql://localhost/"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"crate"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Statement&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conn&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStatement&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="cm"&gt;/* We will then open the cursor 
         * defining how many rows we want to retrieve at a time, 
         * in this case 1,000: 
         */&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setFetchSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"SELECT ts,device,reading "&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"FROM doc.observations "&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="s"&gt;"WHERE ts BETWEEN '2022-01-01 00:00' AND '2022-10-01 00:00';"&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt; 
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;ResultSet&lt;/span&gt; &lt;span class="n"&gt;resultSet&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;executeQuery&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="cm"&gt;/* and while there are rows available, we will iterate over them: */&lt;/span&gt;
            &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resultSet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;next&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;                      
                &lt;span class="nc"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resultSet&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getDate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ts"&lt;/span&gt;&lt;span class="o"&gt;).&lt;/span&gt;&lt;span class="na"&gt;toString&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;                     
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works like the Python case above: in our Java code we see one row at a time, but the rows are retrieved from CrateDB 1,000 at a time and kept in memory on the client side.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using cursors with SQL commands
&lt;/h2&gt;

&lt;p&gt;An approach that works with all clients is to use the cursor &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/declare.html" rel="noopener noreferrer"&gt;SQL commands&lt;/a&gt; supported since CrateDB 5.1.0.&lt;/p&gt;

&lt;p&gt;First, we need to issue this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a SQL command that would normally start a transaction. There are no transactions as such in CrateDB, but this creates a scope in which cursors can be declared.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;observations_cursor&lt;/span&gt; &lt;span class="k"&gt;NO&lt;/span&gt; &lt;span class="k"&gt;SCROLL&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; 
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observations&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01 00:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2022-10-01 00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This associates a cursor name with a query and determines the point in time at which data is “frozen” from the point of view of the cursor.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;observations_cursor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This retrieves 10 rows from the query; issued again, it retrieves the next 10 rows, and so on. We can request a different number of records each time, and we know we have reached the end of the result set when &lt;code&gt;FETCH&lt;/code&gt; returns zero rows.&lt;/p&gt;

&lt;p&gt;Once the cursor is not needed anymore it can be closed with either &lt;code&gt;END;&lt;/code&gt;, &lt;code&gt;CLOSE ALL;&lt;/code&gt;, &lt;code&gt;CLOSE observations_cursor;&lt;/code&gt;, &lt;code&gt;COMMIT;&lt;/code&gt;, &lt;code&gt;COMMIT TRANSACTION;&lt;/code&gt;, or &lt;code&gt;COMMIT WORK;&lt;/code&gt;.&lt;/p&gt;
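&lt;p&gt;As a sketch of how a client could drive this loop, the following Python helper issues &lt;code&gt;FETCH&lt;/code&gt; repeatedly until zero rows come back. The &lt;code&gt;execute&lt;/code&gt; callable and the stub below are hypothetical stand-ins for whatever database driver is in use:&lt;/p&gt;

```python
def fetch_all(execute, cursor_name, batch_size=1000):
    """Yield all rows of a server-side cursor, batch_size rows per FETCH.

    `execute` is a hypothetical stand-in for a driver call that runs one
    SQL statement and returns the resulting rows as a list.
    """
    while True:
        rows = execute(f"FETCH {batch_size} FROM {cursor_name};")
        if not rows:  # an empty FETCH marks the end of the result set
            break
        yield from rows

# Demo with a stub serving 25 rows in place of a real driver:
data = list(range(25))
def stub_execute(sql):
    # Ignores the SQL text and just hands out the next 10 rows.
    batch, data[:] = data[:10], data[10:]
    return batch

print(sum(1 for _ in fetch_all(stub_execute, "observations_cursor", 10)))  # 25
```

&lt;p&gt;With a real client, &lt;code&gt;BEGIN;&lt;/code&gt; and the &lt;code&gt;DECLARE&lt;/code&gt; statement would be issued first, and the cursor closed afterwards as described above.&lt;/p&gt;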

&lt;p&gt;Take a look at this short animation showing an example of how this works: &lt;a href="https://twitter.com/mfussenegger/status/1572256514798424066" rel="noopener noreferrer"&gt;Mathias Fußenegger on Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We hope you find this useful, and we will be happy to hear about your experience in the &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Scaling CrateDB clusters up and down to cope with peaks in demand</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Thu, 19 Jan 2023 09:46:44 +0000</pubDate>
      <link>https://forem.com/crate/scaling-cratedb-clusters-up-and-down-to-cope-with-peaks-in-demand-2mn</link>
      <guid>https://forem.com/crate/scaling-cratedb-clusters-up-and-down-to-cope-with-peaks-in-demand-2mn</guid>
<description>&lt;p&gt;Many organizations size their database infrastructure to handle the maximum load they can anticipate, but load is often seasonal, in some cases concentrated around specific events on certain days of the year. The usual compromise is infrastructure that sits idle most of the year yet still falls short of the desired performance when requests peak.&lt;/p&gt;

&lt;p&gt;CrateDB allows clusters to be scaled both up and down by adding and removing nodes. This enables significant savings during quiet periods and the provisioning of extra capacity for optimal performance during periods of high activity.&lt;/p&gt;

&lt;p&gt;When nodes are added or removed, CrateDB automatically rebalances shards. In cases where we have very large volumes of historical data and new nodes are only added for a short period of time, we may want to avoid any of the historical data being relocated to the temporary nodes. The approach described below achieves this using &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/shard-allocation.html" rel="noopener noreferrer"&gt;shard allocation filtering&lt;/a&gt; on a table that is partitioned by day. The idea is that the period of high activity can be foreseen, so scaling up takes place the day before a big event and scaling down some days after the event has ended.&lt;/p&gt;

&lt;p&gt;This same approach can be applied to multiple tables. It is particularly relevant for the larger tables; smaller tables can be kept on the baseline nodes. It is always good, though, to consider the impact on query performance if the small tables will be queried during the big event &lt;code&gt;JOIN&lt;/code&gt;ed to the big tables whose data resides on the temporary nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing the test environment
&lt;/h2&gt;

&lt;p&gt;In this example, we will imagine that the surge in demand we are preparing for is related to the 2022 FIFA Men’s World Cup running from 20/11/2022 to 18/12/2022.&lt;/p&gt;

&lt;p&gt;We will start with a 3-node cluster, on which we will create a test table and populate it with some pre-World Cup data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;recorddetails&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nv"&gt;"day"&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;date_trunc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;PARTITIONED&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"day"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="n"&gt;CLUSTERED&lt;/span&gt;  &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt; 
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-18'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-19'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The shards will initially look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s8k1elhmnxxwetoca9d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0s8k1elhmnxxwetoca9d.png" alt="Shards" width="800" height="193"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Before deploying extra nodes
&lt;/h2&gt;

&lt;p&gt;We want to make sure that the addition of the temporary nodes does not result in data from the large tables getting rebalanced onto these nodes.&lt;/p&gt;

&lt;p&gt;We will be able to identify the new nodes by a custom attribute (&lt;code&gt;node.attr.storage=temporarynodes&lt;/code&gt;; see further down for details on how to configure this), so the first step is to configure the existing partitions so that they do not consider the new nodes suitable targets for shard allocation.&lt;/p&gt;

&lt;p&gt;In CrateDB 5.1.2 or higher we can achieve this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* this applies the setting to all existing partitions and new partitions */&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.exclude.storage"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'temporarynodes'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* then we run this other command so that the setting does not apply to new partitions */&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.exclude.storage"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No data gets reallocated when running this, and there is no impact on querying or ingestion.&lt;/p&gt;

&lt;p&gt;Starting in CrateDB 5.2 this setting is visible in &lt;code&gt;settings['routing']&lt;/code&gt; in &lt;code&gt;information_schema.table_partitions&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the extra nodes
&lt;/h2&gt;

&lt;p&gt;We want to deploy the new nodes with a custom attribute set.&lt;br&gt;
If using containers, add a line to &lt;code&gt;args&lt;/code&gt; in your YAML file with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- -Cnode.attr.storage=temporarynodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Otherwise, add this to &lt;code&gt;crate.yml&lt;/code&gt; (typically in &lt;code&gt;/etc/crate&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node.attr.storage=temporarynodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note that the word &lt;code&gt;storage&lt;/code&gt; has no special meaning for CrateDB in this context; it is just the name we have chosen for the custom attribute.&lt;/p&gt;

&lt;p&gt;Starting with CrateDB 5.2 these node attributes will be visible in &lt;code&gt;sys.nodes&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We need to calculate how many shards will be created each day (that is, on each partition) during the special event. Since our test table is &lt;code&gt;CLUSTERED INTO 4 SHARDS&lt;/code&gt; &lt;code&gt;WITH (number_of_replicas=1)&lt;/code&gt;, we would have 4 (shards per partition) x 2 (primary + replica) = 8.&lt;/p&gt;

&lt;p&gt;We then take the ceiling of that number (8) divided by the total number of nodes (baseline + temporary); if that is, for instance, 3+2=5, then we have ceiling(8/5)=2.&lt;/p&gt;

&lt;p&gt;This means that if at most 2 of the new shards created each day during the event are allocated to each node, the new data will be balanced across all nodes.&lt;/p&gt;
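&lt;p&gt;This little calculation can be sketched in Python; the function name is ours, and the numbers match the example above (4 shards, 1 replica, 3 baseline + 2 temporary nodes):&lt;/p&gt;

```python
import math

def max_shards_per_node(shards_per_partition, number_of_replicas, total_nodes):
    """Ceiling of (primaries + replicas) per partition over the node count."""
    total_shards = shards_per_partition * (1 + number_of_replicas)
    return math.ceil(total_shards / total_nodes)

# 4 shards, number_of_replicas=1, 3 baseline + 2 temporary nodes:
print(max_shards_per_node(4, 1, 3 + 2))  # 2
```

&lt;p&gt;The result is the value to use for &lt;code&gt;total_shards_per_node&lt;/code&gt; below.&lt;/p&gt;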

&lt;p&gt;With the nodes ready, on the day before the event, we need to configure the special tables so that any new partitions follow this rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.total_shards_per_node"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setting (&lt;code&gt;total_shards_per_node&lt;/code&gt;) is visible at partition level in &lt;code&gt;settings['routing']&lt;/code&gt; in &lt;code&gt;information_schema.table_partitions&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;No data gets reallocated when running this and there is no impact on querying or ingestion.&lt;/p&gt;

&lt;p&gt;This setting can be checked by using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SHOW&lt;/span&gt; &lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  During the event
&lt;/h2&gt;

&lt;p&gt;Let’s now simulate the arrival of data during the event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; 
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-20'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-21'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-22'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-23'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-24'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-25'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-26'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-27'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-28'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-29'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-11-30'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-01'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-02'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-03'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-04'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-05'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-06'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-07'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-08'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-09'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-10'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-11'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-12'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-13'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-14'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-15'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-16'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-17'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-18'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can see that data from before the event stays on the baseline nodes while data for the days of the event gets distributed over all nodes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7ar0xp6j60ztunz9lvh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7ar0xp6j60ztunz9lvh.png" alt="Data1" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same can be checked programmatically with this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;   &lt;span class="n"&gt;table_partitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_schema&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;table_partitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;table_partitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;values&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'day'&lt;/span&gt;&lt;span class="p"&gt;]::&lt;/span&gt;&lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;shards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
         &lt;span class="n"&gt;shards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'name'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shards&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;information_schema&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;table_partitions&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;shards&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partition_ident&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;table_partitions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;partition_ident&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The day the event ends
&lt;/h2&gt;

&lt;p&gt;On the last day of the event, we need to configure the table so that the next partition goes to the baseline nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.exclude.storage"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'temporarynodes'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="k"&gt;ONLY&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.total_shards_per_node"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  A day after the event has ended
&lt;/h2&gt;

&lt;p&gt;New data should now again go to the baseline nodes only.&lt;/p&gt;

&lt;p&gt;Let’s confirm it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-19'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="s1"&gt;'2022-12-20'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x7t3lgm3iwlaiu1a6c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2x7t3lgm3iwlaiu1a6c7.png" alt="Baseline node" width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we are ready to decommission the temporary nodes, we need to move the data collected during the days of the event.&lt;/p&gt;

&lt;p&gt;In CrateDB 5.1.2 or higher we can achieve this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.exclude.storage"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'temporarynodes'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt; &lt;span class="k"&gt;RESET&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"routing.allocation.total_shards_per_node"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data movement takes place one replica at a time, with no impact on querying the event’s data while it is being moved. New data continues to flow to the baseline nodes, and its ingestion and querying are likewise unaffected.&lt;/p&gt;

&lt;p&gt;We can monitor the progress of the data relocation by querying &lt;code&gt;sys.shards&lt;/code&gt; and &lt;code&gt;sys.allocations&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Once all shards have moved away from the temporary nodes, we can decommission them gracefully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="n"&gt;DECOMMISSION&lt;/span&gt; &lt;span class="s1"&gt;'nodename'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once this is done, the machines can safely be shut down.&lt;/p&gt;

&lt;h2&gt;
  
  
  When the time comes for the next event
&lt;/h2&gt;

&lt;p&gt;If desired, new nodes can be deployed reusing the same names that were used for the temporary nodes before.&lt;/p&gt;

</description>
      <category>gratitude</category>
      <category>performance</category>
      <category>devops</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Ingesting with CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 18 Jan 2023 13:58:08 +0000</pubDate>
      <link>https://forem.com/crate/ingesting-with-cratedb-9c8</link>
      <guid>https://forem.com/crate/ingesting-with-cratedb-9c8</guid>
      <description>&lt;p&gt;&lt;strong&gt;In this post, we would like to explore CrateDB’s writing performance through a series of batched insert benchmarks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We don’t discuss performance very often because we see several issues in the benchmarking space: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overall lack of transparency &lt;/li&gt;
&lt;li&gt;Reproducibility issues &lt;/li&gt;
&lt;li&gt;Use case issues&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To eliminate all of these issues, we would like to start a series where we &lt;strong&gt;transparently explore our own performance&lt;/strong&gt; limits and disclose all the tools and infrastructure used.  &lt;/p&gt;

&lt;p&gt;Let's start with &lt;strong&gt;writing performance&lt;/strong&gt;. It is one of CrateDB's biggest strengths when utilized correctly. We will demonstrate this through a series of batched insert benchmarks performed on different tiers of CrateDB Cloud clusters.&lt;/p&gt;

&lt;p&gt;We applied many learnings from our previous benchmarking adventure: &lt;a href="https://crate.io/blog/how-we-scaled-ingestion-to-one-million-rows-per-second"&gt;Scaling ingestion to one million rows per second&lt;/a&gt;. If you haven't read it yet, we encourage you to do so! &lt;/p&gt;

&lt;h2&gt;
  
  
  Content
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What we want to achieve&lt;/li&gt;
&lt;li&gt;Infrastructure&lt;/li&gt;
&lt;li&gt;CrateDB Clusters&lt;/li&gt;
&lt;li&gt;Benchmarking&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we want to achieve
&lt;/h2&gt;

&lt;p&gt;We like to be transparent with our customers and users about the performance they can expect from our product. We want to introduce a &lt;strong&gt;reliable and verifiable way of measuring performance&lt;/strong&gt; with a workload that suits CrateDB. We also want to make clear that this is only one use case and one type of workload.&lt;/p&gt;

&lt;p&gt;If you want to see how we handle a different use case or workload, &lt;strong&gt;don’t hesitate to &lt;a href="https://crate.io/contact"&gt;contact&lt;/a&gt; us and challenge us&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Additional steps we took to get as close as possible to the real-world use case: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data generated by the benchmarking tool must be &lt;strong&gt;representative&lt;/strong&gt;, including numerical and textual columns in a non-trivial structure. &lt;/li&gt;
&lt;li&gt;All the components of the infrastructure we used are &lt;strong&gt;disclosed and easily replicable&lt;/strong&gt;. &lt;/li&gt;
&lt;li&gt;We measured the performance with an &lt;strong&gt;out-of-the-box configuration&lt;/strong&gt; of CrateDB Cloud clusters. &lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure
&lt;/h2&gt;

&lt;p&gt;For the ingesting tool/benchmark VM, we decided to go with the &lt;a href="https://aws.amazon.com/ec2/instance-types/c5/"&gt;c5.12xlarge&lt;/a&gt; EC2 instance. We don’t need all 48 vCPUs this instance offers, but it’s &lt;strong&gt;the most affordable instance&lt;/strong&gt; with 12 Gigabit networking, which can be necessary when &lt;strong&gt;ingesting into higher-tier&lt;/strong&gt; CrateDB Cloud clusters. &lt;/p&gt;

&lt;p&gt;As for the &lt;strong&gt;CrateDB Cloud clusters&lt;/strong&gt;, this benchmark was performed on an AWS deployment. We also offer clusters running on Azure; the performance is comparable in both environments. These are the instances that we currently use: &lt;/p&gt;

&lt;p&gt;AWS - &lt;a href="https://aws.amazon.com/ec2/instance-types/m5/"&gt;m5.4xlarge&lt;/a&gt; &lt;br&gt;
Azure - &lt;a href="https://azure.microsoft.com/en-au/pricing/details/virtual-machines/linux/#ddv4-series"&gt;D16s v5&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy7jb6bvr6z7g7cqmk8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy7jb6bvr6z7g7cqmk8n.png" alt="AWS benchmarks setup" width="397" height="774"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  CrateDB Clusters
&lt;/h2&gt;

&lt;p&gt;For the &lt;strong&gt;CrateDB clusters&lt;/strong&gt;, we used our Cloud platform. It offers &lt;strong&gt;fully managed clusters that can be deployed in a matter of seconds&lt;/strong&gt; and comes in four tiers, depending on the workload that you’re expecting: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglcolcnurqyj4m0zvxin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglcolcnurqyj4m0zvxin.png" alt="CrateDB Cloud Tiers spec" width="800" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are &lt;strong&gt;per-node resources&lt;/strong&gt;. When deploying multi-node clusters, simply multiply them by the number of nodes. Fast storage is a big part of write performance. The amount of storage offered depends on the cluster tier, but in general, we offer up to 8 TiB of enterprise-level SSD storage per node.&lt;/p&gt;

&lt;p&gt;If you want to follow this step of the setup, &lt;strong&gt;deploy your CrateDB Cloud cluster &lt;a href="https://console.cratedb.cloud/?utm_campaign=2022-Q3-WS-Developer-Motion&amp;amp;utm_source=website&amp;amp;utm_medium=ingesting-with-cratedb"&gt;here&lt;/a&gt;&lt;/strong&gt;. &lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarking
&lt;/h2&gt;

&lt;p&gt;We used &lt;a href="https://github.com/proddata/nodeIngestBench"&gt;nodeIngestBench&lt;/a&gt; for all the benchmarking. It is a multi-process Node.js script that runs high-performance ingest benchmarks against CrateDB, using a data model adapted from &lt;a href="https://github.com/timescale/tsbs"&gt;Timescale’s Time Series Benchmark Suite (TSBS)&lt;/a&gt;. One thing we want to make clear is that nodeIngestBench is a write benchmark: the data structure it creates is unsuitable for performance-indicative read tests because of its high &lt;a href="https://en.wikipedia.org/wiki/Cardinality_(SQL_statements)"&gt;cardinality&lt;/a&gt; (due to random data) and lack of &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/partitioned-tables.html"&gt;partitioning&lt;/a&gt;. This is the table it creates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cpu&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
 &lt;span class="nv"&gt;"tags"&lt;/span&gt; &lt;span class="k"&gt;OBJECT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DYNAMIC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; 
   &lt;span class="nv"&gt;"arch"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"datacenter"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"hostname"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"os"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"rack"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"region"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"service"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"service_environment"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"service_version"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
   &lt;span class="nv"&gt;"team"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; 
 &lt;span class="p"&gt;),&lt;/span&gt; 
 &lt;span class="nv"&gt;"ts"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="nb"&gt;TIME&lt;/span&gt; &lt;span class="k"&gt;ZONE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_user"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_system"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_idle"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_nice"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_iowait"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_irq"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_softirq"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_steal"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_guest"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="nv"&gt;"usage_guest_nice"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;CLUSTERED&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;shards&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;SHARDS&lt;/span&gt; 
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_replicas&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="n"&gt;replicas&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It creates a &lt;strong&gt;single &lt;code&gt;doc.cpu&lt;/code&gt; table&lt;/strong&gt; where each row consists of 10 randomly generated numerical metrics. The rows are ingested in batched inserts. As mentioned previously, default settings are used, which also means that &lt;strong&gt;all columns are indexed automatically&lt;/strong&gt;. One notable setting is &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/ddl/replication.html"&gt;shard replication&lt;/a&gt;. It comes at a slight write-performance cost, but we &lt;strong&gt;encourage the use of replication in every use case&lt;/strong&gt;, and practically all of our customers do use it.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwi940onvywq8nmll1waq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwi940onvywq8nmll1waq.png" alt="Ingested data" width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example of a single benchmark run on a 3-node CR4 cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node appCluster.js &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--batchsize&lt;/span&gt; 10000 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--max_rows&lt;/span&gt; 200000000 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--shards&lt;/span&gt; 42 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--concurrent_requests&lt;/span&gt; 10 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--processes&lt;/span&gt; 3 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--replicas&lt;/span&gt; 1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batchsize&lt;/code&gt;: The number of rows passed in a single INSERT statement. The higher the number, the fewer inserts you need to ingest all of the generated data; at the same time, there is a point of diminishing returns. We found the best value to be around 10 000 rows per INSERT.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_rows&lt;/code&gt;: The maximum number of rows that will be generated. This parameter lets you control the total runtime of the benchmark.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;shards&lt;/code&gt;: The number of shards the table will be split into. Each shard can be written to independently, so we aim for a number that allows enough concurrency. As a rule of thumb, we create one shard per CPU.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;concurrent_requests&lt;/code&gt;: The number of INSERT statements that each child process runs concurrently as asynchronous operations. For single-node and 3-node clusters, we found ten concurrent requests to be ideal.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;processes&lt;/code&gt;: The main node process will start this number of child processes (workers) that generate INSERT statements in parallel. Generally, we found the best performance with one process per CrateDB node.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;replicas&lt;/code&gt;: Replicas are pretty much pointless on single-node clusters. On multi-node clusters, however, they provide a great way to avoid data loss in case of node failure. For the 3-node clusters, we created one replica.&lt;/li&gt;
&lt;/ul&gt;
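&lt;p&gt;To make the batching concrete, here is a minimal Python sketch (our own illustration, not part of nodeIngestBench; the placeholder statement and helper names are assumptions) that splits generated rows into batches of &lt;code&gt;batchsize&lt;/code&gt; rows, each of which would back one bulk INSERT:&lt;/p&gt;

```python
import itertools
import json
import random

def generate_rows(n):
    """Yield rows loosely shaped like doc.cpu: a hostname tag plus 10 random metrics."""
    for i in range(n):
        yield [f"host-{i % 100}", *(random.randint(0, 100) for _ in range(10))]

def batches(rows, batchsize):
    """Split an iterable of rows into lists of at most `batchsize` rows."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batchsize))
        if not batch:
            return
        yield batch

# Each batch becomes the bulk arguments of one parameterized INSERT; fewer,
# larger batches mean fewer round trips, up to the point of diminishing
# returns described above.
payloads = [
    json.dumps({"stmt": "INSERT INTO doc.cpu (...) VALUES (...)",  # placeholder
                "bulk_args": batch})
    for batch in batches(generate_rows(25_000), 10_000)
]
print(len(payloads))  # 25 000 rows at 10 000 per batch -> 3 batches
```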

&lt;p&gt;As for the single-node benchmarks, this is an example configuration for a CR4 single-node cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;node appCluster.js &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--batchsize&lt;/span&gt; 10000 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--max_rows&lt;/span&gt; 20000000 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--shards&lt;/span&gt; 14 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--concurrent_requests&lt;/span&gt; 10 &lt;span class="se"&gt;\ &lt;/span&gt;
 &lt;span class="nt"&gt;--processes&lt;/span&gt; 1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;We ran the benchmark on both single-node and 3-node clusters to see how they compare. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bo4ybb72r3jnho9v5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4bo4ybb72r3jnho9v5g.png" alt="Horizontal and vertical scaling in CrateDB Cloud" width="556" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first thing we noticed is that lower-tier 3-node clusters (CR1 and CR2) offer a smaller improvement over their single-node counterpart than CR3 and CR4. This is because of the increased intra-cluster network traffic and overhead created by enabled replicas.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkxhvhythsvm5xebcycn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frkxhvhythsvm5xebcycn.png" alt="horizontal scaling performance gain" width="638" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we have a comparison of &lt;strong&gt;the performance gained by scaling from one node to three nodes&lt;/strong&gt;. You can see that the performance gain varies from around 30% in the smaller tiers to around 60% in the higher tiers. However, as mentioned before, &lt;strong&gt;performance isn’t the only thing you gain by scaling your cluster to multi-node&lt;/strong&gt;. Data loss prevention and high availability are big topics for many customers, and multi-node is the only way to get them. &lt;/p&gt;

&lt;p&gt;That being said, we also want to enable our customers to do whatever they want with their clusters. If you’re purely after performance and are willing to sacrifice some protective measures for it, &lt;strong&gt;one step you can take is disabling replicas in multi-node clusters. We don’t encourage users to do this&lt;/strong&gt;, and we haven’t encountered a production-grade setup where this was necessary, but you’re free to do so.&lt;/p&gt;

&lt;p&gt;Disabling replicas improved the &lt;strong&gt;ingest performance by 50%&lt;/strong&gt; on average. Keep in mind that without replicas there is a possibility of data loss upon a non-recoverable node failure. As the data is split evenly among the nodes, &lt;strong&gt;the data loss&lt;/strong&gt; would be around 1/X, where X is the number of nodes in the cluster. &lt;/p&gt;
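&lt;p&gt;As a quick sanity check on that 1/X figure, here is a tiny helper (our own illustration, not a CrateDB API) that computes the expected loss for a single unrecoverable node failure:&lt;/p&gt;

```python
def expected_loss_fraction(nodes: int, replicas: int = 0) -> float:
    """Approximate fraction of data lost when one node fails unrecoverably.

    Shards are spread evenly, so each node holds roughly 1/nodes of the data;
    with at least one replica, a copy lives on another node, so a single
    node failure loses nothing.
    """
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return 0.0 if replicas >= 1 else 1.0 / nodes

print(expected_loss_fraction(3))     # 3-node cluster, no replicas -> ~1/3
print(expected_loss_fraction(3, 1))  # one replica -> 0.0
```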

&lt;p&gt;&lt;strong&gt;Vertical scaling&lt;/strong&gt; can, of course, also be useful. If you’re already scaled to multi-node and need a bit more performance, or &lt;strong&gt;you’re testing your application&lt;/strong&gt; and are not interested in multi-node benefits, increasing the tier of your cluster can also help. It should, however, be secondary to horizontal scaling: we advise our customers to &lt;strong&gt;scale to a higher node count&lt;/strong&gt; for as long as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyrymkq7je6zp8nj77z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcyrymkq7je6zp8nj77z9.png" alt="Vertical scaling of 1 and 3 node cluster" width="708" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Vertical scaling from CR1 offers the biggest improvement, as it's a very affordable tier suited for small applications and compatibility testing. In higher tiers, the performance gained by vertical scaling drops. The average improvement provided by vertical scaling is around 90%. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;With this post, we hope to have &lt;strong&gt;provided users with a relatively simple introduction to benchmarking their CrateDB clusters&lt;/strong&gt;. In our scenario, we showed what performance can be expected from CrateDB Cloud clusters, but the same process can be replicated on on-premise clusters using a diverse range of hardware. &lt;/p&gt;

&lt;p&gt;These benchmarks cover only a single use case with batched inserts, and we hope to follow this post with others in the future. The &lt;strong&gt;next logical step would be looking into read performance&lt;/strong&gt;, or perhaps the write performance of the aforementioned on-premise hardware.  &lt;/p&gt;

&lt;p&gt;If you have a specific scenario or use case in mind that you would like to see, &lt;strong&gt;don’t hesitate to &lt;a href="https://crate.io/contact"&gt;reach out&lt;/a&gt;&lt;/strong&gt;. &lt;/p&gt;

</description>
      <category>cratedb</category>
      <category>database</category>
      <category>performance</category>
      <category>sql</category>
    </item>
    <item>
      <title>Correlated sub-queries in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 18 Jan 2023 12:29:16 +0000</pubDate>
      <link>https://forem.com/crate/correlated-sub-queries-in-cratedb-31lg</link>
      <guid>https://forem.com/crate/correlated-sub-queries-in-cratedb-31lg</guid>
<description>&lt;p&gt;Correlated sub-queries are a new feature introduced in CrateDB 5.1. This post will introduce you to the concept of a correlated sub-query, its usage, and how it's currently implemented in CrateDB!  &lt;/p&gt;

&lt;p&gt;Our main motivation for this implementation was to increase compatibility with tools that use correlated sub-queries. &lt;a href="https://github.com/grafana/grafana" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, for example, uses correlated sub-queries for the auto-completion of tables and columns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction: Sub-query and Correlated sub-query
&lt;/h2&gt;

&lt;p&gt;Before we start with correlated sub-queries, we'll first have a look at how a sub-query works.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a sub-query?
&lt;/h3&gt;

&lt;p&gt;A sub-query, also called an inner query or inner select, is a query embedded within another query. Let’s have a look at the following example:&lt;br&gt;&lt;br&gt;
Suppose we have a table with customers and a table that lists CrateDB clusters owned by those customers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------+---------+
| id | name  | country |
+----+-------+---------+
|  1 | Anton | Austria |
|  2 | Maria | Germany |
|  3 | Anna  | Italy   |
+----+-------+---------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------------+-----------------+-------------+
| id | customer_id | number_of_nodes | name        |
+----+-------------+-----------------+-------------+
|  1 |           1 |              10 | cluster 1   |
|  2 |           2 |              15 | cluster 2   |
|  3 |           2 |               8 | cluster 3   |
|  4 |           3 |              12 | cluster 4   |
+----+-------------+-----------------+-------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We want to find all the clusters owned by customers from Austria. We can first start by finding all customers from Austria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Austria'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------+
| id | name  |
+----+-------+
|  1 | Anton |
+----+-------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, for each row in clusters, we want to look up whether the cluster’s &lt;code&gt;customer_id&lt;/code&gt; matches a customer whose country is Austria. Therefore, we can include the first query as a sub-query in the WHERE clause of another query, the so-called outer-query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                                &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;customers&lt;/span&gt;
                                &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Austria'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------------------+
| id | name              |
+----+-------------------+
|  1 | cluster 1         |
+----+-------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can learn how the query is executed using the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/explain.html" rel="noopener noreferrer"&gt;EXPLAIN&lt;/a&gt; command. With this command, we will retrieve the execution plan of the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
                                &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;customers&lt;/span&gt;
                                &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Austria'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                        |
+------------------------------------------------------------------------------------------------+
| MultiPhase                                                                                     |
|   └ Collect[doc.clusters | [id, name] | (customer_id = ANY((SELECT id FROM (doc.customers))))] |
|   └ OrderBy[id ASC]                                                                            |
|     └ Collect[doc.customers | [id] | (country = 'Austria')]                                    |
+------------------------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s dive into the details of the execution plan to understand what exactly is happening here. The execution plan is a tree of operators that gives us insight into how this query is handled internally. Each operator represents an operation the database will execute. We will look at each operation from bottom to top:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.customers]&lt;/code&gt; collects the &lt;code&gt;id&lt;/code&gt; values from the table &lt;code&gt;doc.customers&lt;/code&gt; for all rows matching the expression &lt;code&gt;(country = 'Austria')&lt;/code&gt;. It represents the sub-query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;OrderBy[id ASC]&lt;/code&gt; sorts the &lt;code&gt;id&lt;/code&gt; values in ascending order. This operator is added by the query optimizer to return the values sorted, because it makes the outer-query’s filter faster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.clusters]&lt;/code&gt; collects the values &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt; from the table &lt;code&gt;doc.clusters&lt;/code&gt; where the &lt;code&gt;customer_id&lt;/code&gt; matches the result from the sub-query &lt;code&gt;select id from doc.customers&lt;/code&gt;. It represents the outer-query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;MultiPhase&lt;/code&gt; combines the operators &lt;code&gt;OrderBy[id ASC]&lt;/code&gt; and &lt;code&gt;Collect[doc.clusters]&lt;/code&gt;, executes them in order, and injects the result of the first operator into the second operator.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The order of execution is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Collect all &lt;code&gt;ids&lt;/code&gt; from &lt;code&gt;doc.customers&lt;/code&gt; where the country is Austria ordered by &lt;code&gt;id&lt;/code&gt; ascending.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Collect the values &lt;code&gt;id&lt;/code&gt; and &lt;code&gt;name&lt;/code&gt; from &lt;code&gt;doc.clusters&lt;/code&gt; where the &lt;code&gt;customer_id&lt;/code&gt; matches the ids collected in the first operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return these values back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since the first operation, which represents the inner query, does not depend on any outer context, it can be executed once, standalone, and its result can be injected into the outer-query.&lt;/p&gt;
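&lt;p&gt;The execution order above can be sketched in a few lines of Python (a toy in-memory model of the two tables, not how CrateDB is actually implemented): the inner query runs exactly once, and its result is reused while scanning the outer table.&lt;/p&gt;

```python
customers = [
    {"id": 1, "name": "Anton", "country": "Austria"},
    {"id": 2, "name": "Maria", "country": "Germany"},
    {"id": 3, "name": "Anna", "country": "Italy"},
]
clusters = [
    {"id": 1, "customer_id": 1, "number_of_nodes": 10, "name": "cluster 1"},
    {"id": 2, "customer_id": 2, "number_of_nodes": 15, "name": "cluster 2"},
    {"id": 3, "customer_id": 2, "number_of_nodes": 8, "name": "cluster 3"},
    {"id": 4, "customer_id": 3, "number_of_nodes": 12, "name": "cluster 4"},
]

# Inner query, executed once: ids of customers from Austria, sorted ascending.
austrian_ids = sorted(c["id"] for c in customers if c["country"] == "Austria")

# Outer query: the injected result is reused for every row of clusters.
result = [(c["id"], c["name"]) for c in clusters
          if c["customer_id"] in austrian_ids]
print(result)  # [(1, 'cluster 1')]
```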

&lt;h3&gt;
  
  
  So what, then, is a correlated sub-query?
&lt;/h3&gt;

&lt;p&gt;Imagine now the following use case: we would like to know the total number of nodes each customer has across their clusters. Let’s try the same approach as before and break this down into multiple steps: we can first create a query retrieving the sum of all nodes for a single &lt;code&gt;customer_id&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+
| sum(number_of_nodes) |
+----------------------+
|                   23 |
+----------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then we embed this query again as a sub-query into an outer-query to generalize it for all customers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;number_of_nodes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------+-----------------+
| id | name  | number_of_nodes |
+----+-------+-----------------+
|  1 | Anton |              10 |
|  3 | Anna  |              12 |
|  2 | Maria |              23 |
+----+-------+-----------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the sub-query now uses the reference &lt;code&gt;customers.id&lt;/code&gt;, which correlates it with the outer query; that is why it is called a correlated sub-query. The result of the sub-query differs for each value of &lt;code&gt;customers.id&lt;/code&gt; coming from the outer query, so the inner query has to be executed again for every row. Consequently, executing correlated sub-queries efficiently is not a straightforward task for a database.&lt;/p&gt;

&lt;p&gt;Let’s now check the execution plan of the query using the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/explain.html" rel="noopener noreferrer"&gt;EXPLAIN&lt;/a&gt; command again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;number_of_nodes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------------------------+
| EXPLAIN                                                                                 |
+-----------------------------------------------------------------------------------------+
| Eval[id, name, (SELECT sum(number_of_nodes) FROM (doc.clusters)) AS number_of_nodes]    |
|   └ CorrelatedJoin[id, name, (SELECT sum(number_of_nodes) FROM (doc.clusters))]         |
|     └ Collect[doc.customers | [id, name] | true]                                        |
|     └ SubPlan                                                                           |
|       └ Limit[2::bigint;0::bigint]                                                      |
|         └ HashAggregate[sum(number_of_nodes)]                                           |
|           └ Collect[doc.clusters | [number_of_nodes, customer_id] | (customer_id = id)] |
+-----------------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s look into the execution plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.clusters]&lt;/code&gt; collects the values &lt;code&gt;[number_of_nodes, customer_id]&lt;/code&gt; from the rows of the table &lt;code&gt;doc.clusters&lt;/code&gt; where &lt;code&gt;customer_id&lt;/code&gt; matches a given &lt;code&gt;id&lt;/code&gt;. It represents the data collection part of the sub-query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;HashAggregate[sum(number_of_nodes)]&lt;/code&gt; aggregates the values of &lt;code&gt;number_of_nodes&lt;/code&gt; produced by the previous operator. It represents the sum function of the sub-query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Limit[2::bigint;0::bigint]&lt;/code&gt; acts as an assertion: a sub-query must return at most one result row, so the query is limited to 2 rows, which makes it possible to detect a violation without retrieving the full result.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;SubPlan&lt;/code&gt; marks the operator tree described above as a correlated sub-query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.customers]&lt;/code&gt; collects the values &lt;code&gt;[id, name]&lt;/code&gt; from the table &lt;code&gt;doc.customers&lt;/code&gt;. It represents the outer query.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;CorrelatedJoin&lt;/code&gt; joins the outer-query results with the sub-query results. The outer query is executed first, and then the sub-query is executed for each row of the outer-query result.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Eval[id, name]&lt;/code&gt; combines the results from the underlying &lt;code&gt;CorrelatedJoin&lt;/code&gt; and returns the values &lt;code&gt;[id, name, number_of_nodes]&lt;/code&gt; back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The order of execution is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Collect all the values &lt;code&gt;[id, name]&lt;/code&gt; from &lt;code&gt;doc.customers&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For each of the ids from the first operation, collect the values &lt;code&gt;[number_of_nodes, customer_id]&lt;/code&gt; where &lt;code&gt;clusters.customer_id&lt;/code&gt; matches the &lt;code&gt;id&lt;/code&gt; value, and aggregate &lt;code&gt;number_of_nodes&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return the values &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt; and the result of the aggregation back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The inner query now depends on the outer context, so it has to be executed again for every single row of the outer query. This is the state of the current implementation in CrateDB: it works, but the execution is not optimal.&lt;/p&gt;
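To make the cost of this per-row execution concrete, here is a minimal Python simulation. The in-memory tables are made-up data mirroring the example tables; this is an illustration, not CrateDB's actual implementation:

```python
# Hypothetical in-memory versions of the example tables
customers = [(1, "Anton"), (2, "Maria"), (3, "Anna")]
clusters = [(1, 10), (2, 11), (2, 12), (3, 12)]  # (customer_id, number_of_nodes)

def correlated_query():
    results = []
    for cust_id, name in customers:              # outer query over customers
        # The inner query is re-executed for every outer row:
        # a full scan of `clusters` plus an aggregation each time.
        total = sum(n for cid, n in clusters if cid == cust_id)
        results.append((cust_id, name, total))
    return results

print(correlated_query())
# -> [(1, 'Anton', 10), (2, 'Maria', 23), (3, 'Anna', 12)]
```

The nested loop does work proportional to `customers × clusters`, which is exactly the cost the `CorrelatedJoin` operator incurs.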

&lt;h3&gt;
  
  
  Can we do better?
&lt;/h3&gt;

&lt;p&gt;The same query can also be expressed as a join. Relational databases are good at running joins because there is plenty of room for optimizations. The question is: how do we convert a correlated sub-query into a join?&lt;/p&gt;

&lt;p&gt;Let’s take our correlated sub-query example from the beginning again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;number_of_nodes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As mentioned above, the correlated sub-query contains the reference &lt;code&gt;customers.id&lt;/code&gt; to the outer query. Hence the sub-query is executed once for every row of the outer query. Let’s have a closer look at the sub-query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt;  &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sub-query selects data from the table &lt;code&gt;doc.clusters&lt;/code&gt;. So instead of embedding the sub-query in the outer query, we can replace it with a join to the table &lt;code&gt;doc.clusters&lt;/code&gt;. The where-clause of the sub-query, which contains the reference &lt;code&gt;customers.id&lt;/code&gt; to the outer query, becomes the join condition. We then replace the sub-query expression in the outer query with the aggregation function &lt;code&gt;Sum(number_of_nodes)&lt;/code&gt; from the sub-query and finally add a group-by for the aggregation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;number_of_nodes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
       &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
         &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voilà, we converted the correlated sub-query into a join. Let’s run this query again to see if it yields the same result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----+-------+-----------------+
| id | name  | number_of_nodes |
+----+-------+-----------------+
|  3 | Anna  |              12 |
|  1 | Anton |              10 |
|  2 | Maria |              23 |
+----+-------+-----------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks good. Let’s now check the logical plan using the &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/explain.html" rel="noopener noreferrer"&gt;EXPLAIN&lt;/a&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXPLAIN&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;Sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;number_of_nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;number_of_nodes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;   &lt;span class="n"&gt;clusters&lt;/span&gt;
       &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
         &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;clusters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt;  &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----------------------------------------------------------------------+
| EXPLAIN                                                               |
+-----------------------------------------------------------------------+
| Eval[id, name, sum(number_of_nodes) AS number_of_nodes]               |
|   └ GroupHashAggregate[id, name | sum(number_of_nodes)]               |
|     └ HashJoin[(customer_id = id)]                                    |
|       ├ Collect[doc.clusters | [number_of_nodes, customer_id] | true] |
|       └ Collect[doc.customers | [id, name] | true]                    |
+-----------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s look into the execution plan:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.customers]&lt;/code&gt; collects from the table &lt;code&gt;doc.customers&lt;/code&gt; the values &lt;code&gt;[id, name]&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Collect[doc.clusters]&lt;/code&gt; collects the values &lt;code&gt;[number_of_nodes, customer_id]&lt;/code&gt; from the table &lt;code&gt;doc.clusters&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;HashJoin[(customer_id = id)]&lt;/code&gt; performs a hash-join on the output from the previous two operators with the join condition &lt;code&gt;customer_id = id&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;GroupHashAggregate[id, name | sum(number_of_nodes)]&lt;/code&gt; aggregates the values of &lt;code&gt;number_of_nodes&lt;/code&gt; for each &lt;code&gt;[id, name]&lt;/code&gt; pair.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The operator &lt;code&gt;Eval[id, name, sum(number_of_nodes)]&lt;/code&gt; takes the results from the underlying operator tree and returns the values &lt;code&gt;[id, name, number_of_nodes]&lt;/code&gt; back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The order of execution is the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Collect the values &lt;code&gt;[id, name]&lt;/code&gt; from &lt;code&gt;doc.customers&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Collect the values &lt;code&gt;[number_of_nodes, customer_id]&lt;/code&gt; from &lt;code&gt;doc.clusters&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join the previous two datasets together on &lt;code&gt;customer_id = id&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregate the values of &lt;code&gt;number_of_nodes&lt;/code&gt; for each &lt;code&gt;[id, name]&lt;/code&gt; group.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return the values &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt; and the result of the aggregation back to the client.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The new execution plan is much more efficient than the previous one: the expensive execution of the correlated sub-query for each row of the outer query is eliminated. This transformation from a correlated sub-query to a regular join is called decorrelation. Ideally, the query optimizer is clever enough to perform the decorrelation for us automatically, so we don’t have to rearrange our queries manually. In practice, there are more pitfalls to generalizing this concept to a broader range of queries, because queries can be much more complex. However, this gives you an idea of the concept and of the further work we have to put in to integrate it into CrateDB.&lt;/p&gt;
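The effect of decorrelation can be sketched in the same toy Python model (hypothetical in-memory data; the real optimizer transforms logical plans, not Python loops):

```python
from collections import defaultdict

# Hypothetical in-memory versions of the example tables
customers = [(1, "Anton"), (2, "Maria"), (3, "Anna")]
clusters = [(1, 10), (2, 11), (2, 12), (3, 12)]  # (customer_id, number_of_nodes)

def decorrelated_query():
    # One pass over clusters: hash-aggregate number_of_nodes per customer_id.
    sums = defaultdict(int)
    for cid, nodes in clusters:
        sums[cid] += nodes
    # One pass over customers: hash-join on customers.id = clusters.customer_id
    # (an inner join, so customers without any cluster drop out).
    return [(cid, name, sums[cid]) for cid, name in customers if cid in sums]

print(decorrelated_query())
# -> [(1, 'Anton', 10), (2, 'Maria', 23), (3, 'Anna', 12)]
```

Each table is scanned once, so the quadratic per-row re-execution of the correlated version is gone.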

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, we covered what a correlated sub-query is and how it is implemented in CrateDB 5.1. We also showed where the bottleneck and limitations are with the current approach and how we can improve this in the future.&lt;/p&gt;

&lt;p&gt;Are you interested in CrateDB? Have a look at the documentation &lt;a href="https://crate.io/docs/crate/reference/en/5.1/" rel="noopener noreferrer"&gt;CrateDB Reference&lt;/a&gt;. If you have questions, check out our &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB Community&lt;/a&gt;. We're happy to help!&lt;/p&gt;

</description>
      <category>networking</category>
      <category>devops</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Guide to write operations in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 18 Jan 2023 12:08:03 +0000</pubDate>
      <link>https://forem.com/crate/guide-to-write-operations-in-cratedb-2jp0</link>
      <guid>https://forem.com/crate/guide-to-write-operations-in-cratedb-2jp0</guid>
      <description>&lt;p&gt;In our &lt;a href="https://crate.io/blog/indexing-and-storage-in-cratedb" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, we gave a general overview of the storage layer in CrateDB. In general, every shard in CrateDB represents a Lucene index that is broken down into segments and stored in the filesystem. A Lucene segment can be seen as a sub-index and can be searched independently.&lt;/p&gt;

&lt;p&gt;When new records are written to CrateDB, Lucene first creates in-memory segments before flushing them to disk. In this article, our goal is to give you a thorough understanding of how CrateDB writes new records. With that in mind, we will go through the basic concepts of Lucene, such as Lucene segments and the refresh and flush operations, and introduce the translog, which guarantees that write operations are persisted to disk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lucene segments
&lt;/h3&gt;

&lt;p&gt;A Lucene segment is a part of the logical Lucene index, and a Lucene index maps 1:1 to a CrateDB shard. A segment is an independent index that can contain inverted indexes, k-d trees, and doc values. It can be searched independently of other segments, and the documents it contains are immutable. Every time a field of a document is updated, the document is flagged as deleted in the segment it belonged to, and the updated document is added to a new segment. The same behavior applies when a document is deleted in CrateDB. All subsequent queries skip the documents that were previously marked as deleted.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1geglzl4dnfd2fwfs2cz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1geglzl4dnfd2fwfs2cz.png" alt="Lucene index and segments" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To keep the number of segments manageable, Lucene periodically merges them according to a merge policy. When segments are merged, documents marked as deleted are discarded, and the newly created segments contain only valid, non-deleted documents from the original segments. A merge is triggered when new documents are added and, perhaps surprisingly, it can result in a smaller index size. To merge the segments of a table or a partition on demand, one can use the &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt; command in CrateDB. When the &lt;code&gt;max_num_segments&lt;/code&gt; parameter is set to 1, CrateDB fully merges the table or partition for optimal query performance. For more details on table optimization in CrateDB, check out our &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/optimize.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;.&lt;/p&gt;
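The merge behavior described above can be illustrated with a small Python sketch (a toy model of segments, not Lucene's actual data structures):

```python
def merge_segments(segments):
    """Merge segments into one, discarding documents flagged as deleted.
    Each segment is modeled as a list of (doc, deleted) pairs."""
    return [(doc, False)
            for segment in segments
            for doc, deleted in segment
            if not deleted]

# Document id=2 was updated: it is flagged deleted in the old segment
# and its new version lives in a new segment.
seg_old = [({"id": 1}, False), ({"id": 2}, True)]
seg_new = [({"id": 2, "version": 2}, False)]

merged = merge_segments([seg_old, seg_new])
print(len(merged))  # -> 2: the merged segment holds only live documents
```

Because the deleted copy of document 2 is dropped, the merged segment is smaller than the sum of its inputs, which is why adding documents can shrink the index.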

&lt;h3&gt;
  
  
  Retention period
&lt;/h3&gt;

&lt;p&gt;To speed up the recovery of nodes, CrateDB needs to replay operations that were executed on one shard on other shards (replicas) residing on different nodes. To preserve recent deletions within the Lucene index, CrateDB supports the concept of &lt;a href="https://crate.io/docs/crate/reference/en/5.1/sql/statements/create-table.html#soft-deletes-enabled" rel="noopener noreferrer"&gt;&lt;em&gt;soft deletes&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt; Soft-deleted documents still take up disk space, which is why CrateDB keeps only recently deleted documents. It is possible to customize the &lt;em&gt;retention lease period&lt;/em&gt;, which specifies how long deleted documents have to be preserved: CrateDB discards them only after this period expires, so even if a merge takes place before the expiration, the deleted documents remain physically available. The default retention lease period is 12 hours.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing data to CrateDB
&lt;/h2&gt;

&lt;p&gt;The following diagram illustrates how a new record is stored in CrateDB. When a new document arrives, it is first written to a &lt;em&gt;memory buffer&lt;/em&gt; and the &lt;em&gt;translog&lt;/em&gt; (step 1). Documents in the memory buffer become in-memory Lucene segments once &lt;em&gt;the refresh&lt;/em&gt; operation takes place (step 2). CrateDB executes &lt;a href="https://crate.io/docs/crate/reference/en/5.1/general/dql/refresh.html#introduction" rel="noopener noreferrer"&gt;a refresh operation&lt;/a&gt; every second if a search request was received in the last 30 seconds. Eventually, Lucene commits the new segments to disk (step 3). Once the data is stored on disk, the merge operation may be triggered and some of the segments merged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l60b68z287x1w8l6u2w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l60b68z287x1w8l6u2w.png" alt="Flow of write operation" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Translog
&lt;/h3&gt;

&lt;p&gt;The translog stores write operations for documents that are held in memory and have not yet been committed. A translog exists for each shard and is stored on the physical disk. In case of a node failure, CrateDB can recover potentially lost data by replaying the operations from the translog (step 4).&lt;/p&gt;

&lt;p&gt;Data in the translog is persisted to disk when the translog is &lt;code&gt;fsynced&lt;/code&gt; and committed. By default, CrateDB flushes the translog to disk after every operation. Alternatively, it can be flushed every &lt;code&gt;translog.sync_interval&lt;/code&gt;; which behavior applies is controlled by the &lt;code&gt;translog.durability&lt;/code&gt; parameter.&lt;/p&gt;

&lt;p&gt;The following parameters control the behavior of the translog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;translog.flush_threshold_size&lt;/code&gt; sets the size of the transaction log containing the operations that are not yet safely persisted. This is done to prevent recoveries from taking too long.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;translog.interval&lt;/code&gt; sets how often to check whether a flush is necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;translog.durability&lt;/code&gt; can be set as &lt;code&gt;ASYNC&lt;/code&gt; or &lt;code&gt;REQUEST&lt;/code&gt;. When set to &lt;code&gt;ASYNC&lt;/code&gt; the translog is flushed every &lt;code&gt;translog.sync_interval&lt;/code&gt;, and when set to &lt;code&gt;REQUEST&lt;/code&gt; the flush happens after every operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;translog.sync_interval&lt;/code&gt; sets how often the translog is &lt;code&gt;fsynced&lt;/code&gt; to disk. The default value is 5s. The setting of this parameter takes effect only if the &lt;code&gt;translog.durability&lt;/code&gt; is set to &lt;code&gt;ASYNC&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
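Put together, the durability settings amount to a small decision rule, sketched below (a simplified model of when an fsync is due, based on the parameters above):

```python
def translog_fsync_due(durability, seconds_since_last_sync, sync_interval=5.0):
    """Return True when the translog should be fsynced (simplified model)."""
    if durability == "REQUEST":
        return True  # fsync after every write operation (the default)
    if durability == "ASYNC":
        # fsync only once per sync_interval (default 5s)
        return seconds_since_last_sync >= sync_interval
    raise ValueError(f"unknown durability mode: {durability}")

print(translog_fsync_due("REQUEST", 0.0))  # -> True
print(translog_fsync_due("ASYNC", 2.0))    # -> False
print(translog_fsync_due("ASYNC", 6.0))    # -> True
```

`REQUEST` trades write latency for safety; `ASYNC` accepts losing up to `sync_interval` seconds of writes on a crash in exchange for fewer fsyncs.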

&lt;h3&gt;
  
  
  Refresh
&lt;/h3&gt;

&lt;p&gt;Between CrateDB and the disk sits RAM: new records are first buffered in memory before being written to a new segment. The refresh operation makes in-memory segments available for search. Furthermore, a table can be refreshed explicitly to ensure that its latest state is fetched; in CrateDB this is done with the &lt;code&gt;REFRESH TABLE&lt;/code&gt; command. However, the refresh operation doesn’t guarantee durability: to address persistence, CrateDB relies on the &lt;em&gt;translog&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If not refreshed explicitly, the table is refreshed periodically at a configurable interval. The default is one second, but the interval can be changed with the table parameter &lt;code&gt;refresh_interval&lt;/code&gt;. If no query accesses the table during the refresh interval, the table becomes &lt;em&gt;idle&lt;/em&gt; and is not refreshed until the next query arrives; that query first refreshes the table and then executes, which also re-enables the periodic refresh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Flush
&lt;/h3&gt;

&lt;p&gt;A flush operation triggers a Lucene commit: it writes the segments permanently to disk and clears the transaction log. Writing segments to disk is expensive, so it happens at less frequent intervals than the refresh operation; a flush is triggered automatically based on the translog configuration illustrated in the previous section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize, this article provides an overview of how data is written to CrateDB. The following properties of CrateDB are important to be aware of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Lucene index is organized in segments which are first built and kept in memory and later flushed to disk. This can influence the time when data becomes available on disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In-memory segments become searchable after the refresh operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documents are immutable and deleted documents are not discarded until a merge takes place.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Segments are occasionally merged and deleted documents are discarded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;If &lt;code&gt;translog.durability&lt;/code&gt; is set to &lt;code&gt;REQUEST&lt;/code&gt;, the translog is flushed after every operation. Otherwise, if set to &lt;code&gt;ASYNC&lt;/code&gt;, the translog is flushed when certain configurable conditions are met.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CrateDB maintains the transaction log of operations on each shard for recovery in case of a node crash.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We hope you found this topic interesting. Do you want to learn more about CrateDB? Take a look at our &lt;a href="https://crate.io/docs" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; and in case of any questions, check out our &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;CrateDB Community.&lt;/a&gt; We're happy to help!&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>career</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Working with cursors with SQL commands in CrateDB</title>
      <dc:creator>marijaselakovic</dc:creator>
      <pubDate>Wed, 18 Jan 2023 11:53:34 +0000</pubDate>
      <link>https://forem.com/crate/working-with-cursors-with-sql-commands-in-cratedb-33pl</link>
      <guid>https://forem.com/crate/working-with-cursors-with-sql-commands-in-cratedb-33pl</guid>
      <description>&lt;p&gt;As a distributed database system with support for the standard SQL query language, CrateDB is great to run aggregations server-side, working on huge datasets, and getting summarized results back; there are however cases where we may still want to retrieve lots of data from CrateDB.&lt;/p&gt;

&lt;p&gt;Getting a large number of results in a single operation can put stress on the network and other resources, both client and server-side, so we need a mechanism to fetch results in manageable batches as we are ready to process them.&lt;/p&gt;

&lt;p&gt;A common requirement is also what is called pagination: presenting results to users in pages with a set number of results each, allowing them to move between these pages. In this case, many users will only look at the first few pages of results, so we want to implement this in the most efficient way possible.&lt;/p&gt;

&lt;p&gt;This can be done using cursors. A cursor is like a bookmark pointing to a specific record in the result set of a query.&lt;/p&gt;

&lt;p&gt;In CrateDB we were already able to use cursors at the protocol level, but with version 5.1.1 we introduced the ability to work with them using standard SQL commands.&lt;/p&gt;

&lt;p&gt;First we issue this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;BEGIN&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the SQL command that would normally start a transaction. There are no transactions as such in CrateDB, but this creates a scope in which cursors can be declared.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;observations_cursor&lt;/span&gt; &lt;span class="k"&gt;NO&lt;/span&gt; &lt;span class="k"&gt;SCROLL&lt;/span&gt; &lt;span class="k"&gt;CURSOR&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; 
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;reading&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;observations&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2022-01-01 00:00'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2022-10-01 00:00'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This associates a cursor name with a query and determines the point in time at which the data is “frozen”. Cursors in CrateDB are &lt;code&gt;INSENSITIVE&lt;/code&gt;, meaning that the client can take all the time it needs to retrieve the results: the data will always reflect the state of the tables as it was when the cursor was declared, ignoring any records that are updated, deleted, or newly inserted in the meantime.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;FETCH&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;observations_cursor&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This retrieves 10 rows from the query; when issued again it retrieves the next 10 rows, and so on. We can request a different number of records each time, and we know we have reached the end of the result set when &lt;code&gt;FETCH&lt;/code&gt; returns zero rows.&lt;/p&gt;

&lt;p&gt;Once the cursor is not needed anymore it can be closed with either &lt;code&gt;END&lt;/code&gt;, &lt;code&gt;CLOSE&lt;/code&gt;, or &lt;code&gt;COMMIT&lt;/code&gt;.&lt;/p&gt;
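Putting the steps together, the fetch-until-empty loop a client would run against such a cursor can be sketched as follows. This is a minimal, self-contained illustration: the `fake_fetch` function stands in for a real CrateDB connection (the commented SQL shows the statements an actual client would send), and the row values are invented for the example.

```python
def drain_cursor(fetch_batch, batch_size):
    """Repeatedly fetch batches until FETCH returns zero rows."""
    while True:
        rows = fetch_batch(batch_size)
        if not rows:          # empty batch => end of the result set
            return
        yield from rows

# Stand-in for a server-side cursor; a real client would instead send:
#   BEGIN;
#   DECLARE observations_cursor NO SCROLL CURSOR FOR SELECT ...;
#   FETCH 10 FROM observations_cursor;   -- repeated until empty
#   CLOSE observations_cursor;
result_set = list(range(25))  # pretend 25 rows matched the query

def fake_fetch(n, _pos=[0]):
    batch = result_set[_pos[0]:_pos[0] + n]
    _pos[0] += n
    return batch

rows = list(drain_cursor(fake_fetch, 10))
print(len(rows))  # 25, retrieved in batches of 10, 10, and 5
```

The same loop works unchanged for pagination: fetch one page-sized batch per page the user requests, and stop early if they never go past the first few pages.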

&lt;p&gt;Take a look at this short animation showing an example of how this works: &lt;iframe class="tweet-embed" id="tweet-1572256514798424066-541" src="https://platform.twitter.com/embed/Tweet.html?id=1572256514798424066"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;We hope you find this useful, and we will be happy to hear about your experience in the &lt;a href="https://community.crate.io/" rel="noopener noreferrer"&gt;Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
