Forem: Hernán Lionel Cianfagna

Retrieving records in bulk with a list of primary key values in CrateDB

Hernán Lionel Cianfagna — Fri, 23 Feb 2024 09:33:28 +0000

When we send SQL statements to CrateDB they need to be parsed, but in most situations we do not think about this because the resources used for parsing the statements are trivial in relation to what is required to actually execute the queries.

One exception to this is when INSERTing a large number of rows. For this case CrateDB has a very efficient bulk operations interface, which can also be used for UPDATEs and DELETEs.
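As a rough illustration of what a bulk request looks like, the sketch below only builds the JSON body that CrateDB's HTTP endpoint (`/_sql`) accepts for bulk operations: one statement with `?` placeholders plus a `bulk_args` list of argument lists. The table name `demo_readings` and the helper `build_bulk_payload` are hypothetical, and nothing is sent over the network here.

```python
import json


def build_bulk_payload(stmt, rows):
    """Return the JSON body for one bulk request: a single statement, many argument lists."""
    return json.dumps({"stmt": stmt, "bulk_args": [list(row) for row in rows]})


# Hypothetical table and values, for illustration only.
payload = build_bulk_payload(
    "INSERT INTO demo_readings (ts, machine_id, sensor_type) VALUES (?, ?, ?)",
    [(1708502400000, "machine1", 8), (1708502400000, "machine2", 5)],
)
print(payload)
```

The statement is parsed once and executed for every argument list, which is why parsing cost stops mattering for this kind of workload.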

However, I recently came across an unusual requirement: we had a very large table with a primary key made of multiple fields, and given tens of thousands of values for these fields we needed to retrieve all the corresponding records.

Let me exemplify the situation with this table definition:

CREATE TABLE sensor_data (
  ts TIMESTAMP
  ,machine_id TEXT
  ,sensor_type SMALLINT
  ,payload OBJECT
  ,PRIMARY KEY (ts,machine_id,sensor_type)
);

Let’s also create some sample data:

INSERT INTO sensor_data (ts,machine_id,sensor_type,payload )
SELECT now()
    ,concat('machine',a.b)
    ,random()*10
    ,{"test"='abc'}
FROM GENERATE_SERIES(1,100000) a(b);

There are different approaches we could use to retrieve multiple rows for the given PK values, such as:

SELECT * FROM sensor_data WHERE ts='2024-02-21 08:00:00.000Z' AND machine_id='machine1' AND sensor_type=8 
UNION 
SELECT * FROM sensor_data WHERE ts='2024-02-21 08:00:00.000Z' AND machine_id='machine2' AND sensor_type=5

or:

SELECT * 
FROM sensor_data 
WHERE (ts='2024-02-21 08:00:00.000Z' AND machine_id='machine1' AND sensor_type=8)
OR (ts='2024-02-21 08:00:00.000Z' AND machine_id='machine2' AND sensor_type=5);

This works reasonably well up to a few hundred records, but let’s see what happens if we try to use this approach to look up tens of thousands of different records in a single statement, which was the requirement in this particular case.

Let’s dynamically generate a query like the above but for 10,000 records:

WITH thedata
AS (
    SELECT CONCAT (
            'OR (ts=',(ts::BIGINT)::TEXT
            ,' and machine_id=''',machine_id
            ,''' and sensor_type=',sensor_type
            ,')'
            ) AS onewherecondition
    FROM sensor_data 
    LIMIT 10000
    )
SELECT CONCAT (
        'SELECT * FROM sensor_data WHERE '
        ,replace(replace(replace({ "thearray" = array_agg(onewherecondition) }::TEXT, '{"thearray":["OR ', ''), '","', ' '), '"]}', '')
        ,';')
FROM thedata;

This will generate a very long statement, and when we try to run it we may get:

StackOverflowError[null]

io.crate.exceptions.SQLParseException: line 1:1: statement is too large (stack overflow while parsing)

So we will need a different strategy, and we also want this to run as quickly as possible.

Let’s start by preparing a CSV file with 10,000 primary key values we will use for testing:

pip install crash
crash -c "SELECT ts,machine_id,sensor_type FROM sensor_data LIMIT 10000;" --format csv > pkvalues.csv

What we are going to do now is take advantage of a system column called _id which exists on all CrateDB tables. This column contains a unique identifier for each row, and for tables with a PK defined it is a compound string representation of all primary key values of that row. The useful characteristic here is that the value is deterministic: given two tables with the same PK definition, rows with the same PK values will have the same _id values.

So to perform this “bulk SELECT” we are going to use a staging table defined with the same PK as the main table. The Python code below bulk loads the values from the CSV file to the staging table and then uses the _id values to locate all the rows we are interested in:

pip install pandas "crate[sqlalchemy]" --upgrade

import pandas as pd
import sqlalchemy as sa
from crate.client.sqlalchemy.support import insert_bulk

df = pd.read_csv("pkvalues.csv")

engine = sa.create_engine(
    "crate://localhost:4200",
    connect_args={"verify_ssl_cert": False},
)
connection = engine.connect()

connection.execute(sa.text("DROP TABLE IF EXISTS relevant_pk_values;"))
connection.execute(
    sa.text(
        """
        CREATE TABLE relevant_pk_values (
            ts TIMESTAMP
            ,machine_id TEXT
            ,sensor_type SMALLINT
            ,PRIMARY KEY (ts,machine_id,sensor_type)
        ) CLUSTERED INTO 1 SHARDS;
        """
    )
)

df.to_sql(
    name="relevant_pk_values",
    con=engine,
    if_exists="append",
    index=False,
    chunksize=5_000,
    method=insert_bulk,
)
connection.execute(sa.text("REFRESH TABLE relevant_pk_values;"))

resultset = connection.execute(
    sa.text(
        """
        SELECT *
        FROM sensor_data
        WHERE _id IN (SELECT _id FROM relevant_pk_values);
        """
    )
)
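For comparison, the multiple-OR approach shown earlier can still be used when the keys are split into chunks small enough to parse comfortably. The sketch below is a minimal illustration of that alternative, not the method used in this post: the helper names and the chunk size of 300 are assumptions (picked from the "few hundred records" observation), and the `?` placeholder style is assumed to match what your client library expects.

```python
from itertools import islice


def chunked(keys, size):
    """Yield lists of at most `size` keys."""
    iterator = iter(keys)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_lookup(chunk):
    """Return (sql, params) selecting the rows for one chunk of (ts, machine_id, sensor_type) keys."""
    one_key = "(ts = ? AND machine_id = ? AND sensor_type = ?)"
    sql = "SELECT * FROM sensor_data WHERE " + " OR ".join([one_key] * len(chunk))
    params = [value for key in chunk for value in key]
    return sql, params


# 1,000 made-up keys split into statements of at most 300 keys each.
keys = [("2024-02-21T08:00:00Z", f"machine{i}", i % 10) for i in range(1, 1001)]
statements = [build_lookup(chunk) for chunk in chunked(keys, 300)]
print(len(statements))  # 4
```

Each statement still has to be parsed and executed separately, so for tens of thousands of keys the staging-table approach above keeps the number of round trips much lower.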

I hope you found this interesting. If you have any questions, please do not hesitate to reach out to us through the CrateDB Community.

Using common table expressions to speed up queries

Hernán Lionel Cianfagna — Thu, 22 Feb 2024 09:44:56 +0000

Today I want to share with you a pattern you can use to replace JOINs with CTEs in your SQL queries and achieve consistent and faster execution times.

Consider a database where we store information about invoices. A simplified model could consist of a table where we store details about the customer and payment terms, a separate table where we have the items included in the invoice, and a third table where we store product information:

CREATE TABLE invoices (
  invoice_number  BIGINT PRIMARY KEY
  ,customer_id  BIGINT
  ,payment_terms  TEXT  DEFAULT '30 days from issue date'
  ,issue_date TIMESTAMP
);

CREATE TABLE invoice_items (
  invoice_number  BIGINT
  ,product_id  BIGINT
  ,quantity  REAL
  ,unit_price REAL
  ,PRIMARY KEY (invoice_number,product_id)
);

CREATE TABLE products (
  product_id  BIGINT PRIMARY KEY
  ,product_description  TEXT
  ,applicable_tax_percentage  REAL
);

Let’s now imagine we want to know how many units of “super cool product” have been sold in January 2024. We could write a SQL query with JOINs like this:

SELECT SUM(quantity)
FROM invoices
JOIN invoice_items USING (invoice_number)
JOIN products USING (product_id)
WHERE product_description='super cool product' 
AND invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';

This is perfectly valid SQL, but it leaves the database engine with a lot of options.

Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance:

“super cool product” may only be sold very rarely; we could then start by looking up its product_id, then all the instances where it has been sold, and then check whether the corresponding invoices were in January 2024
or perhaps the product is sold often and we have data for 20 years of sales, so we could start by looking up all the invoices from January, then their line items, and see which ones are for this product
or maybe the company only sells a handful of products and this one is a best seller, we may also only keep invoices for the last 45 days, meaning that neither the date of the invoice nor the product are very selective, in which case it may be faster to consider the full list of invoice_items

With up-to-date statistics database engines like CrateDB can usually do a good job at identifying an optimal execution plan, but there is the risk some day statistics may not be available on your target environment, or even with statistics available other factors may induce the query engine to go down the wrong path.

The impact of using a suboptimal execution plan here could be huge: we could find ourselves trying to JOIN millions and millions of records.

Let’s do a small test creating sample data for the first scenario above, the one where “super cool product” is only sold very rarely:

/* one million invoices in December */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2023-12-01'
FROM GENERATE_SERIES(1,1000000) a(b);

/* one million invoices in January */
INSERT INTO invoices (invoice_number,customer_id,issue_date)
SELECT a.b,1,'2024-01-01'
FROM GENERATE_SERIES(1000001,2000000) a(b);

/* 2 products */
INSERT INTO products (product_id,product_description)
VALUES (1,'super cool product'),(2,'another product');

/* one line item per invoice and only 1 instance in 2 million where product 1 was sold */
REFRESH TABLE invoices;
INSERT INTO invoice_items (invoice_number,product_id,quantity)
SELECT invoice_number
,CASE WHEN invoice_number=2000000 THEN 1 ELSE 2 END AS product_id
,ceiling(random()*10)
FROM invoices;

We can now run the query with the JOINs a few times; in my small test environment it settles at around 750 milliseconds.

We can also look at the execution plan and all its details using the EXPLAIN command.

Let’s now try this approach where we use CTEs to guide the engine to execute the query using steps we know are more optimal for the profile of our data:

WITH relevant_product_ids AS (
    SELECT product_id
    FROM products
    WHERE product_description='super cool product'
    )
    ,relevant_invoice_lines AS (
    SELECT invoice_number,quantity
    FROM invoice_items
    WHERE invoice_items.product_id IN (SELECT relevant_product_ids.product_id FROM relevant_product_ids)
    )
    ,relevant_invoices AS (
    SELECT invoice_number,issue_date
    FROM invoices
    WHERE invoices.invoice_number IN (SELECT relevant_invoice_lines.invoice_number FROM relevant_invoice_lines)
    )
SELECT SUM(quantity)
FROM relevant_invoices
JOIN relevant_invoice_lines USING (invoice_number)
WHERE relevant_invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01';

We now see this runs consistently in single-digit milliseconds, roughly a 100x improvement.
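A rewrite like this is only useful if it returns the same answer as the original query. The sketch below checks that on a tiny made-up dataset, using Python's built-in sqlite3 module purely as a stand-in for CrateDB. It says nothing about execution time, and the three sample invoices are assumptions chosen so that only one row qualifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (invoice_number INTEGER PRIMARY KEY, customer_id INTEGER, issue_date TEXT);
    CREATE TABLE invoice_items (invoice_number INTEGER, product_id INTEGER, quantity REAL,
                                PRIMARY KEY (invoice_number, product_id));
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_description TEXT);
    INSERT INTO products VALUES (1, 'super cool product'), (2, 'another product');
    -- invoice 1 is outside January, invoice 2 sells the other product
    INSERT INTO invoices VALUES (1, 1, '2023-12-01'), (2, 1, '2024-01-15'), (3, 1, '2024-01-20');
    INSERT INTO invoice_items VALUES (1, 1, 4), (2, 2, 7), (3, 1, 5);
    """
)

JOIN_QUERY = """
SELECT SUM(quantity)
FROM invoices
JOIN invoice_items USING (invoice_number)
JOIN products USING (product_id)
WHERE product_description = 'super cool product'
AND invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01'
"""

CTE_QUERY = """
WITH relevant_product_ids AS (
    SELECT product_id FROM products WHERE product_description = 'super cool product'
), relevant_invoice_lines AS (
    SELECT invoice_number, quantity FROM invoice_items
    WHERE invoice_items.product_id IN (SELECT relevant_product_ids.product_id FROM relevant_product_ids)
), relevant_invoices AS (
    SELECT invoice_number, issue_date FROM invoices
    WHERE invoices.invoice_number IN (SELECT relevant_invoice_lines.invoice_number FROM relevant_invoice_lines)
)
SELECT SUM(quantity)
FROM relevant_invoices
JOIN relevant_invoice_lines USING (invoice_number)
WHERE relevant_invoices.issue_date BETWEEN '2024-01-01' AND '2024-02-01'
"""

join_total = conn.execute(JOIN_QUERY).fetchone()[0]
cte_total = conn.execute(CTE_QUERY).fetchone()[0]
print(join_total, cte_total)  # 5.0 5.0
```

Running both versions side by side on a small, known dataset is a cheap sanity check before comparing timings on the real tables.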

In large and busy environments this kind of optimization may make a big difference, so it may be something to add to your toolbox.

I hope you found this interesting, and as usual if you have any questions do not hesitate to reach out to us in the CrateDB Community.

Connecting with SSL to CrateDB using the PostgreSQL protocol from Java-based applications

Hernán Lionel Cianfagna — Wed, 13 Sep 2023 12:39:48 +0000

If you are using CrateDB Cloud, or if you have configured a server certificate for an on-premises deployment, and you try to enforce SSL on a PostgreSQL connection to CrateDB you may come across an error message like this:

Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt.
  C:\Users\Hernan\AppData\Roaming\postgresql\root.crt (The system cannot find the path specified)

org.postgresql.util.PSQLException: Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt

This is not specific to CrateDB, and you would get the same message trying to connect to an actual PostgreSQL instance, but I found no simple explanation of this error message and the options available, so here are my two cents.

What happens here is that the client is trying to confirm that the server we are establishing an encrypted connection with is indeed the machine we intended to reach. Doing this involves validating that the certificate used by the server has been issued by a trusted certification authority.
In this case, the client driver is trying to find the details of valid certification authorities in a root certificate file at the location indicated in the error message.

I find that in most cases it makes sense to pick one of the two options below to address this.

If we want the communication channel with the server to be encrypted, but we are on a trusted network environment and do not require verification of the server certificate, we can use this in our connection string:

ssl=true&sslmode=require

But if we want to have both encryption and the confirmation that we are talking to the intended server, we can tell the driver to use the list of certification authorities our JVM accepts:

ssl=true&sslmode=verify-full&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
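To make it explicit where these parameters go, the sketch below assembles complete JDBC URLs for both options. It is only string building in Python: the host name `db.example.com`, the database name `crate`, and the helper `jdbc_url` are hypothetical, and the resulting URL is what you would paste into your Java application's configuration.

```python
from urllib.parse import urlencode


def jdbc_url(host, port=5432, database="crate", verify_server=True):
    """Build a PostgreSQL JDBC URL with one of the two SSL options described above."""
    params = {"ssl": "true"}
    if verify_server:
        # Encryption plus verification against the CAs trusted by the JVM.
        params["sslmode"] = "verify-full"
        params["sslfactory"] = "org.postgresql.ssl.DefaultJavaSSLFactory"
    else:
        # Encryption only, no verification of the server certificate.
        params["sslmode"] = "require"
    return f"jdbc:postgresql://{host}:{port}/{database}?{urlencode(params)}"


print(jdbc_url("db.example.com"))
print(jdbc_url("db.example.com", verify_server=False))
```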

Some software (DBeaver for instance) may have separate configuration settings where you can set the SSL Factory and SSL mode.

I hope this helps. As usual please do not hesitate to let us know your thoughts in the CrateDB Community.

Using dbt with CrateDB

Hernán Lionel Cianfagna — Fri, 18 Aug 2023 16:32:06 +0000

dbt is a tool for transforming data in data warehouses using Python and SQL. The idea is that Data Engineers make source data available to an environment where dbt projects run (for instance with Debezium or with Airflow), and Data Analysts can then run their dbt projects against this data to produce models (tables and views) that can be used with BI tools.
This layer allows the decoupling of the models on which reports and dashboards rely from the source data, and if our business rules or our source systems change we can still maintain the same models as a stable interface.

Some of the things that dbt can do include:

import reference data from csv files
track changes in source data with different strategies so that downstream models do not need to be built every time from scratch
run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models' logic

Due to its unique capabilities, CrateDB is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally. In this article, I will illustrate how to get the most important functionalities of dbt working by making the necessary changes to the configuration.

Our starting point will be a fresh install of dbt-postgres:

pip install dbt-postgres==1.6.0

We can then create a profiles file with our connection details:

cd ~
mkdir .dbt
cat << EOF > .dbt/profiles.yml
example_datawarehouse_profile:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      database: crate
      schema: doc
      search_path: doc
      user: dbt   
      password: pwd1234567A
EOF

(please note the values for database, schema, and search_path in this example)

We will not go into the details of how the project files are structured (for more information check out dbt’s documentation), but in general, a dbt project consists of a combination of SQL, Jinja, YAML, and markdown files. In our project folder, alongside the models folder that most projects have, we can also create a folder called macros where we can place macro overrides.
Let's then create a macros folder and place some files with overrides in it:

mkdir macros
cd macros
wget https://community.crate.io/uploads/short-url/fKupQCFUHtuoKom3jAfKrldUXkt.sql
wget https://community.crate.io/uploads/short-url/qvQExEq1OopiVUcXACLGfpdGHYF.sql
wget https://community.crate.io/uploads/short-url/3jcFxL1EExLrERJSTc6ScnzTS9f.sql
cd ..

A few things I have tested with these overrides:

models with view, table, and ephemeral materializations
dbt source freshness
dbt test
dbt seed
Incremental materializations (with incremental_strategy='delete+insert' and without involving OBJECT columns)

I hope you find this useful. CrateDB is continuously adding new features and I will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.

Using regex comparisons and other advanced database features for real-time inspection of web server logs

Hernán Lionel Cianfagna — Mon, 14 Aug 2023 12:23:16 +0000

In "Storing server logs on CrateDB for fast search and aggregations" we saw how we can get server logs sent to CrateDB in real time, and for demo purposes we set up an instance of MediaWiki.
It was just an example, but it could have been any web server application.
Let's now imagine that we suspect people are trying to perform SQL injection attacks against our website, so we need to keep an eye on the logs.
We have already seen how we can use fulltext search to look for specific error messages, but would it not be great if we could have some rules inspecting the log entries as they come in, extracting relevant information, and flagging anything potentially suspicious?
There are a lot of nice features in CrateDB to support this kind of setup. Let me show you an example.
The log entries our web server container is producing look like this:

 192.168.0.121 - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852 "http://192.168.0.202/mw-config/index.php?page=Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"

We see we have the client IP address and the HTTP request that was sent.
Today I want to do two checks:

I want to query only for requests coming from a specific IP subnet,
and I want to see if the HTTP request has anything suspicious that could suggest an attempt to perform a SQL injection attack

I will use generated columns to extract this information from the log entries as they arrive.
CrateDB supports adding columns live to existing tables, but generated columns are special: they can only be added to empty tables. To add our new columns to our systemevents table without any downtime, we will use another feature of CrateDB: table swapping.

Let's create a new table with the 2 special columns:

CREATE TABLE doc.systemevents2 (
    message TEXT
    ,INDEX message_ft USING FULLTEXT(message)
    ,facility INTEGER
    ,fromhost TEXT
    ,priority INTEGER
    ,DeviceReportedTime TIMESTAMP
    ,ReceivedAt TIMESTAMP
    ,InfoUnitID INTEGER
    ,SysLogTag TEXT 
    ,clientip IP GENERATED ALWAYS AS TRY_CAST(btrim(split_part(message,'-',1)) AS IP)
    ,suspectedSQLinjection BOOLEAN GENERATED ALWAYS AS message ~* 
                CONCAT('.*SELECT.*FROM.*' , 
                       '|.*UNION.*SELECT.*',
                       '|.*DELETE.*FROM.*',
                       '|.*UPDATE.*SET.*',
                       '|.*ALTER.*TABLE.*',
                       '|.*(%27|'')%20.*%20(%27|'').*')
    );

Here we are extracting the client IP address from the message text and storing it using the dedicated IP data type in CrateDB. We use the split_part function to look up the string up to the dash symbol, then we use the btrim function to remove spaces from both sides of the string, and finally we use the TRY_CAST function so that log entries that do not have an IP address in this position get a NULL value as clientip but no error message is raised.

We are also using the case-insensitive ~* regex comparison operator to look for indications of a possible SQL injection attempt. We are looking for occurrences of SELECT ... FROM, UNION ... SELECT, DELETE ... FROM, UPDATE ... SET, ALTER ... TABLE, or attempts to break a string delimiter by injecting a single quote character. This will match entries like:

 172.17.0.1 - - [11/Aug/2023:13:03:07 +0000] "GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27%20AND%20%27Lshb%27%3D%27Lshb HTTP/1.1" 200 4627 "-" "sqlmap/1.7.8#pip (https://sqlmap.org)"
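To experiment with the extraction logic and the regular expression before putting them into a table definition, the two generated columns can be approximated in plain Python. This is a sketch for trying out rules, not CrateDB's implementation: the function names are hypothetical, `str.strip()` removes any whitespace (btrim without arguments removes spaces), and the shortened log lines are made up from the examples above.

```python
import ipaddress
import re

# Same alternatives as described above, matched case-insensitively.
SUSPICIOUS = re.compile(
    r".*SELECT.*FROM.*"
    r"|.*UNION.*SELECT.*"
    r"|.*DELETE.*FROM.*"
    r"|.*UPDATE.*SET.*"
    r"|.*ALTER.*TABLE.*"
    r"|.*(%27|')%20.*%20(%27|').*",
    re.IGNORECASE,
)


def client_ip(message):
    """Return the text before the first dash as an IP address, or None if it is not one."""
    candidate = message.split("-", 1)[0].strip()
    try:
        return ipaddress.ip_address(candidate)
    except ValueError:
        return None


def suspected_sql_injection(message):
    """Return True when the message matches one of the suspicious patterns."""
    return SUSPICIOUS.search(message) is not None


normal = ' 192.168.0.121 - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852'
attack = (
    ' 172.17.0.1 - - [11/Aug/2023:13:03:07 +0000] '
    '"GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27 HTTP/1.1" 200 4627'
)

subnet = ipaddress.ip_network("172.17.0.0/16")
for line in (normal, attack):
    ip = client_ip(line)
    print(ip, ip in subnet, suspected_sql_injection(line))
```

The `ip in subnet` check plays the same role as a subnet containment test on the IP column, so new rules can be tried against sample log lines before they are applied to the table.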

Let's now swap in this new table and rename the old one as systemevents_archive:

ALTER CLUSTER SWAP TABLE doc.systemevents2 TO doc.systemevents;
GRANT DML ON TABLE doc.systemevents TO rsyslog;
REVOKE DML ON TABLE doc.systemevents2 FROM rsyslog;
ALTER TABLE doc.systemevents2 RENAME TO systemevents_archive;

To show how this works we will need both normal activity, which we can generate just by navigating to http://localhost/ (perhaps from another machine to get a different IP address in the logs), and some malicious-looking activity. To generate the latter against our website we can use sqlmap, a well-known pentesting tool:

sudo pip install sqlmap
sqlmap -u http://localhost/ --crawl=2

Press ENTER when prompted to proceed with the default settings. There will be around 10 questions before the tool concludes there are no obvious vulnerabilities in the website.

We can now run queries like:

SELECT * 
FROM systemevents 
WHERE clientip << '172.17.0.0/16' 
ORDER BY devicereportedtime DESC
LIMIT 100;

to get the most recent activity from IP addresses in the 172.17.0.0/16 subnet, and we can also look for suspicious activity only:

SELECT * 
FROM systemevents 
WHERE suspectedSQLinjection
ORDER BY devicereportedtime DESC
LIMIT 100;

Queries like these could then be integrated into a Grafana dashboard or an alerting system.

I hope you found this interesting. Please do not hesitate to let us know your thoughts in the CrateDB Community.

Storing server logs on CrateDB for fast search and aggregations

Hernán Lionel Cianfagna — Thu, 10 Aug 2023 08:27:19 +0000

Did you know that CrateDB can be a great store for your server logs?

If you have been using log aggregation tools or even some of the most advanced commercial SIEM systems, you have probably experienced the same frustrations I have:

timeouts when searching logs over long periods of time
a complex and proprietary query syntax
difficulties integrating queries on logs data into application monitoring dashboards

Storing server logs on CrateDB solves these problems: it allows you to query the logs with standard SQL from any tool supporting the PostgreSQL protocol, and its unique indexing also makes full-text queries and aggregations very fast.
Let me show you an example.

First, we will need an instance of CrateDB. It may be best to have a dedicated cluster for this purpose, to separate the monitoring system from the systems being monitored, but for the purpose of this demo we can just run a single-node cluster in a Docker container:

sudo docker run -d --name cratedb --publish 4200:4200 --publish 5432:5432 crate -Cdiscovery.type=single-node

Next, we need a table to store the logs. Let's connect to http://localhost:4200/#!/console and run:

CREATE TABLE doc.systemevents (
    message TEXT
    ,INDEX message_ft USING FULLTEXT(message)
    ,facility INTEGER
    ,fromhost TEXT
    ,priority INTEGER
    ,DeviceReportedTime TIMESTAMP
    ,ReceivedAt TIMESTAMP
    ,InfoUnitID INTEGER
    ,SysLogTag TEXT 
    );

Tip: if you are on a headless system you can also run queries with command-line tools.

Then we need an account for the logging system:

CREATE USER rsyslog WITH (PASSWORD='pwd123');

and we need to grant permissions on the table above:

GRANT DML ON TABLE doc.systemevents TO rsyslog;

We will use rsyslog to send the logs to CrateDB. For this setup we need rsyslog v8.2202 or higher and the ompgsql module:

sudo add-apt-repository ppa:adiscon/v8-stable
sudo apt-get update
sudo apt-get install rsyslog
sudo debconf-set-selections <<< 'rsyslog-pgsql rsyslog-pgsql/dbconfig-install string false'
sudo apt-get install rsyslog-pgsql

Let's now configure it to use the account we created earlier:

echo 'module(load="ompgsql")' | sudo tee /etc/rsyslog.d/pgsql.conf
echo '*.* action(type="ompgsql" conninfo="postgresql://rsyslog:pwd123@localhost/doc")' | sudo tee -a /etc/rsyslog.d/pgsql.conf
sudo systemctl restart rsyslog

If you are interested in more advanced setups involving queuing for additional reliability in production scenarios, you can read more about available settings in the rsyslog documentation.

Now let's imagine that we want to run a container with MediaWiki to host an intranet and we want all logs to go to CrateDB. We can deploy this with:

sudo docker run --name mediawiki -p 80:80 -d --log-driver syslog --log-opt syslog-address=unixgram:///dev/log mediawiki

If we now point a web browser to http://localhost/ (port 80) we will see a new MediaWiki page.
Let's play around a bit to generate log entries: just click on "set up the wiki" and then once on Continue.
This will have generated entries in the doc.systemevents table with syslogtag matching the container ID of the container running the site.

We can now use the MATCH predicate to find the error messages we are interested in:

SELECT devicereportedtime,message
FROM doc.systemevents
WHERE MATCH(message_ft, 'Could not reliably determine') USING PHRASE
ORDER BY 1 DESC;

+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| devicereportedtime | message                                                                                                                                                                     |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
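The devicereportedtime values in this output are shown as milliseconds since the Unix epoch. A minimal sketch for turning such a value into a readable UTC timestamp on the client side (the helper name is hypothetical):

```python
from datetime import datetime, timezone


def from_epoch_millis(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


print(from_epoch_millis(1691510710000).isoformat())  # 2023-08-08T16:05:10+00:00
```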

Let's now see which log sources created the most entries:

SELECT syslogtag,count(*)
FROM doc.systemevents
GROUP BY 1
ORDER BY 2 DESC
LIMIT 5;

+----------------------+----------+
| syslogtag            | count(*) |
+----------------------+----------+
| kernel:              |       23 |
| 083053ae8ea3[52134]: |       20 |
| systemd[1]:          |       15 |
| sudo:                |       10 |
| rsyslogd:            |        5 |
+----------------------+----------+

I hope you found this interesting. Please do not hesitate to let us know your thoughts in the CrateDB Community.

How to add new nodes to on-premises CrateDB clusters

Hernán Lionel Cianfagna — Tue, 18 Jul 2023 13:12:26 +0000

A significant feature in CrateDB is that it can scale horizontally, which means that instead of adding more RAM, CPU, and disk resources to our existing nodes we can add more nodes to our CrateDB cluster.
This allows the handling of volumes of data that simply could not fit on a single node, but it is also very useful in scenarios where hosting everything on a single node, or a small number of nodes, would still be possible, because smaller nodes are often easier to manage infrastructure-wise.
More nodes also mean more resilience: in a scenario where we have, for instance, 5 nodes and configure our tables with 2 replicas, we could lose 2 nodes and still serve our production workloads. This means we can carry out maintenance on the nodes one at a time and still be able to withstand an unplanned issue on another node without downtime.

Today we want to review how to add a new node to an existing on-premises cluster.

Discovery

When a CrateDB node starts it needs a mechanism to get a list of the nodes that make up the cluster. This is called discovery.
At the time of writing, there are 3 ways for a node to get this list:

The list of nodes can be defined in the discovery.seed_hosts setting in the configuration file (typically in /etc/crate/crate.yml)
The list can be retrieved with a DNS query, see https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns
In AWS environments, the list of nodes can be looked up via the EC2 API, filtering on specific security groups, availability zones, and tags, see https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2

For the purpose of this post, we will work with the discovery.seed_hosts list.

Scaling from a single-node deployment

If a node is started without specifying the cluster.initial_master_nodes setting (the default configuration), or with discovery.type set to single-node, it will be started as a standalone instance and it cannot later be scaled into a cluster. Single-node deployments are great for development and testing, but for production setups we recommend using a cluster with at least 3 nodes.

If you are going for a single-node deployment initially, but plan to scale to a multi-node cluster in the future, there are some settings to configure before the very first run of the CrateDB node so that it bootstraps as a 1-node cluster instead of a standalone instance.
The settings that we need are:

discovery.seed_hosts set to the hostname or the node.name of the node
cluster.initial_master_nodes set to the hostname or the node.name of the node
optionally we can set a cluster.name
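The settings listed above could be rendered into a crate.yml fragment as in the sketch below. The helper and the node name `crate-node-1` are hypothetical, the master-nodes setting is written with its full name `cluster.initial_master_nodes`, and `node.name` is set explicitly here (an assumption, to make it obvious that the two lists refer to this node).

```python
def bootstrap_settings(node_name, cluster_name="crate"):
    """Render crate.yml lines that bootstrap a 1-node cluster able to grow later."""
    lines = [
        f"cluster.name: {cluster_name}",
        f"node.name: {node_name}",
        "discovery.seed_hosts:",
        f"  - {node_name}",
        "cluster.initial_master_nodes:",
        f"  - {node_name}",
    ]
    return "\n".join(lines) + "\n"


print(bootstrap_settings("crate-node-1", cluster_name="my-cluster"))
```

Treat the output as a starting point and compare it with the template configuration file for your CrateDB version before using it.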

If you are using containers, you would pass these settings with lines in the args section of your YAML file. Otherwise you could create /etc/crate/crate.yml before deploying the package for your distribution (refer to https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml for the template), or you could prevent the package installation from auto-starting the daemon by using a mechanism such as policy-rcd-declarative, then edit the configuration file (crate.yml) and start the crate daemon once all settings are ready.

Networking considerations

Nodes need to be able to resolve each other's hostnames at DNS level, and they need to be able to reach each other on a TCP port which is 4300 by default.

For security reasons you should configure your network so that CrateDB cluster nodes are only reachable on port 4300 from other CrateDB nodes in the cluster.
In a Kubernetes environment this can be achieved with a Service resource with a ClusterIP.
In a non-containerized environment one way to do this is to use firewall software directly on each node, for instance:

#Enable ufw - all incoming connections blocked by default
sudo ufw enable

#Allow SSH if you are using it to manage your server
sudo ufw allow 22

#Allow 4200 for clients to connect to CrateDB via the http endpoint
sudo ufw allow 4200

#Allow 5432 if you have PostgreSQL clients
sudo ufw allow 5432

#Allow 4300 from 192.168.0.202 (another cluster node in this example)
sudo ufw allow proto tcp from 192.168.0.202 to any port 4300

You may also want to consider network access control and/or a separate network adapter for intra-cluster communications.

Deploying the new node

Make sure the new node does not auto-bootstrap as a single-node instance. You may want to either create /etc/crate/crate.yml in advance or use a mechanism such as policy-rcd-declarative, as mentioned earlier.
In the configuration file for the new node:

Set discovery.seed_hosts to the full list of nodes, including the new one you are adding.
Optionally set a node.name; if you do not, the node gets assigned a random name from the sys.summits table. You may wonder what those default names are about: they are the names of mountains in the area around our main office, we love mountains at Crate.io.
Set cluster.name to a value that matches the other nodes in the cluster, if not specified the default cluster name is crate.
Consider whether you want to set the cluster-wide settings gateway.expected_data_nodes, gateway.recover_after_data_nodes, and/or gateway.recover_after_time. These prevent the unnecessary creation of new replicas and rebalancing of shards when a node takes a little longer to start, or in case of transient issues, while the cluster is starting up from a situation where all nodes are shut down. Please note these settings are only used when the cluster is starting up from being offline; if you want to delay the allocation of replicas when a node becomes unavailable on a cluster that stays online, there is a different setting at table level.

Now we can start the crate daemon. You will see the node joining the cluster, and CrateDB will start using it for shard allocation.

Remember to add the new nodes alongside the old ones in any monitoring system and load balancer configuration you may have in your environment.

Updating settings on the old nodes

Now we need to align a number of settings on the other nodes. These are typically in the /etc/crate/crate.yml file:

Update discovery.seed_hosts, adding the new node
If you have configured gateway.* settings, update them to have the same values on all nodes

These settings only play a role during restart, not at runtime, so you do not need to restart the nodes after making these changes. If the gateway.* settings need updating, you may see a warning in the Admin UI, which can be acknowledged.

Please also note there is no need to update the cluster.initial_master_nodes list, as this is only considered during the initial cluster bootstrapping.

And that is it, we have scaled out our cluster and we are ready to work with larger volumes of data. I hope you find this useful and, as usual, please do not hesitate to raise any thoughts or questions in the CrateDB Community.

Replicating data from other databases to CrateDB with Debezium and Kafka

Hernán Lionel Cianfagna — Tue, 28 Feb 2023 16:29:49 +0000

You may have line-of-business applications such as ERP software that work with transactional database systems like MSSQL, Oracle, or MySQL.

The setup may work perfectly fine for day-to-day operations, but you may find that it is not ideal for doing data analytics.

When running analytic workloads against the operational databases, you may see concurrency issues caused by locking, the analytics queries may degrade the performance of business-critical operations, and you may find that the performance and feature set of the transactional database system are not good enough for analyzing large amounts of data.

Considering this, many organisations conclude that they need to copy data to a separate environment to run reporting and dashboards. This is sometimes done with replication, sometimes with backups, and sometimes with complex ETL pipelines, and it often comes with a set of challenges:

ballooning license costs
custom ad-hoc routines for getting the data to the analytics environment, requiring development, monitoring, and troubleshooting
a need to design and maintain an indexing strategy for the analytics copy of the data
high availability requirements for the analytics environment as the business starts relying on it

We know we can address several of these points by using a system like CrateDB. CrateDB is a feature-rich, open-source, SQL database which out-of-the-box automatically implements indexes, compression, and a columnar store so that most analytical queries can run much faster without any need to fiddle with settings. Because it is open-source, there is no need to be concerned about licensing expenses. Additionally, it can scale horizontally, which means that the number of nodes can be adjusted as needed to handle changing data volumes and workloads, and it provides high availability without requiring administrative effort.

If only we could replicate data from our operational database to CrateDB without having to write custom code… it turns out we can.

Enter Debezium. Debezium is an open-source system, built on top of Kafka, which allows us to capture changes on a source database system and replicate them to another system without having to write custom scripts.

In this post I want to show an example replicating changes on a table from MSSQL to CrateDB.

Setup on the MSSQL side

We will need a SQL Server instance with the SQL Server Agent service up and running. If you are running MSSQL in a container, you can get the agent running by setting the environment variable MSSQL_AGENT_ENABLED to True.

Connect to the instance with a client such as sqlcmd, SSMS, or DBeaver.

We are now going to go through a number of steps. If you already have a working system, feel free to skip the operations you do not need.

Let’s create a database with a test table on it:

CREATE DATABASE erp;
GO
USE erp; 
CREATE TABLE dbo.tbltest (
    id INT PRIMARY KEY IDENTITY,
    createdon DATETIME DEFAULT getdate(),
    srcsystem NVARCHAR(max)
    );

Let’s now create an account for Debezium to use to pull the changes:

CREATE LOGIN debeziumlogin WITH PASSWORD = '<enterStrongPasswordHere>';
CREATE USER debeziumuser FOR LOGIN debeziumlogin;
CREATE ROLE debeziumrole;
EXEC sp_addrolemember 'debeziumrole', 'debeziumuser';
EXEC sp_addrolemember 'db_datareader', 'debeziumuser';

And let’s enable change data capture on our example table:

EXEC sys.sp_cdc_enable_db;
ALTER DATABASE erp ADD FILEGROUP cdcfg;
ALTER DATABASE erp ADD FILE (
    NAME= erp_cdc_file1,
    FILENAME='/var/opt/mssql/data/erp_cdc_file1.ndf'
    ) TO FILEGROUP cdcfg;
EXEC sys.sp_cdc_enable_table
    @source_schema='dbo',
    @source_name='tbltest',
    @role_name='debeziumrole',
    @filegroup_name='cdcfg',
    @supports_net_changes=0;

Setup on the CrateDB side

We will need a CrateDB instance. For the purpose of this example we can spin one up with:

sudo apt install docker.io
sudo docker run --publish 4200:4200 --publish 5432:5432 crate:latest -Cdiscovery.type=single-node

Now we need to run a couple of SQL commands on this instance. An easy way to do this is with the Admin UI: navigate with a web browser to port 4200 on the server where CrateDB is running, for instance http://localhost:4200, and then open the console (second icon from the top on the left-hand side navigation bar).

We will create a user account for Debezium to use:

CREATE USER debezium WITH (password='debeziumpwdincratedb123');

The table on our MSSQL source is in the dbo schema. Let’s imagine we want to have a dbo schema on CrateDB as well, so the debezium account will need permissions on it:

GRANT DQL,DML,DDL ON SCHEMA dbo to debezium;

And let’s create the structure of the table that will receive the data:

CREATE TABLE dbo.tbltest (
    id INT PRIMARY KEY /* we need the PK definition to match the source table so that this can be used to lookup records when they need to be updated */
    ,createdon TIMESTAMP /* CrateDB supports defaults -of course- but because the source table already has a default value we do not need that here */
    ,srcsystem TEXT
    );

Zookeeper and Kafka

To use Debezium we will need to have working setups of Zookeeper and Kafka.

For the purpose of this example I will spin them up with containers on the same machine:

sudo docker run -it --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
sudo docker run -it --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper --add-host host.docker.internal:host-gateway debezium/kafka

We need to create some special topics in Kafka:

sudo docker exec -it kafka "bash"
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_configs --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_offsets --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
exit

Please note this is a very basic setup. For production purposes you may want to adjust some of these settings.

Preparing and starting a Debezium container image

We need to customize the base debezium/connect Docker image adding a JDBC sink and the PostgreSQL drivers.

For this we need to download the zip file from kafka-connect-jdbc and then run the commands below, replacing ************* with the appropriate URL:

mkdir customdockerimg
cd customdockerimg
wget *************/confluentinc-kafka-connect-jdbc-10.6.3.zip
sudo apt install unzip
mkdir confluentinc-kafka-connect-jdbc-10.6.3
cd confluentinc-kafka-connect-jdbc-10.6.3
unzip -j ../confluentinc-kafka-connect-jdbc-10.6.3.zip
cd ..
cat > Dockerfile <<EOF  
FROM debezium/connect
USER root:root
COPY ./confluentinc-kafka-connect-jdbc-10.6.3/ /kafka/connect/
RUN cd /kafka/libs && curl -sO https://jdbc.postgresql.org/download/postgresql-42.5.4.jar
USER 1001
EOF
sudo docker build -t cratedb-connect-debezium .

Let’s now start this custom image:

sudo docker run -it --rm --name connect -p 8083:8083 \
           -e GROUP_ID=1 \
           -e CONFIG_STORAGE_TOPIC=my_connect_configs \
           -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
           --add-host host.docker.internal:host-gateway \
           --add-host $(hostname):host-gateway \
           -e BOOTSTRAP_SERVERS=host.docker.internal:9092 \
           -e KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
           -e VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
           cratedb-connect-debezium

This assumes Kafka is running locally on the same server. You will need to adjust BOOTSTRAP_SERVERS if that is not the case.
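
Once the container is up, it can be worth confirming that the worker actually loaded both connector classes used in this post. The Kafka Connect REST API lists installed plugins under /connector-plugins. The sketch below is only an illustration: the helper names and the sample response are made up, and it assumes the worker is reachable on localhost:8083 as in the docker run command above.

```python
import json
import urllib.request

REQUIRED_PLUGINS = {
    "io.debezium.connector.sqlserver.SqlServerConnector",
    "io.confluent.connect.jdbc.JdbcSinkConnector",
}


def fetch_plugins(base_url="http://localhost:8083"):
    """Call GET /connector-plugins on the Kafka Connect REST API."""
    with urllib.request.urlopen(f"{base_url}/connector-plugins") as resp:
        return json.load(resp)


def missing_plugins(plugins, required=REQUIRED_PLUGINS):
    """Return the required connector classes that the worker did not report."""
    available = {plugin["class"] for plugin in plugins}
    return sorted(set(required) - available)


# Offline example with a made-up response; against a live worker you would
# call missing_plugins(fetch_plugins()) instead.
sample_response = [
    {"class": "io.debezium.connector.sqlserver.SqlServerConnector", "type": "source"},
    {"class": "org.apache.kafka.connect.mirror.MirrorSourceConnector", "type": "source"},
]
print(missing_plugins(sample_response))
```

An empty list means both classes are available. If the JDBC sink class is reported as missing, the copy step in the Dockerfile is the first thing to check.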

Configure a source connector

Let’s create a connector.json file as follows:

{
    "name": "mssql-source-tbltest",
    "config": {
        "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
        "tasks.max": "1",

        "database.history.kafka.bootstrap.servers": "host.docker.internal:9092",
        "schema.history.internal.kafka.bootstrap.servers": "host.docker.internal:9092",
        "topic.prefix": "cratedbdemo",
        "database.encrypt": "false",

        "database.hostname": "host.docker.internal",
        "database.port": "1433",
        "database.user": "debeziumlogin",
        "database.password": "<enterStrongPasswordHere>",
        "database.server.name": "mssql-server",

        "database.names": "erp",
        "table.include.list": "dbo.tbltest",
        "database.history.kafka.topic": "schema-changes.mssql-server.tbltest",
        "schema.history.internal.kafka.topic": "schema-changes.inventory.mssql-server.tbltest"              
    }
}

Here we can see settings for the Kafka setup to use, the details to connect to MSSQL, the name of the table that we want to pull changes from, and the Kafka topics that will be used to track these changes.
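
The change events for the table end up in a Kafka topic whose name is derived from these settings. Judging from the topics value used by the sink configuration later in this post, cratedbdemo.erp.dbo.tbltest, the name is made up of the topic prefix, the database, the schema, and the table. A minimal Python sketch of that composition follows. Here change_topic is a hypothetical helper, and the naming pattern is inferred from this example rather than taken from the Debezium documentation.

```python
def change_topic(topic_prefix, database, schema, table):
    """Compose the change-event topic name as prefix.database.schema.table."""
    return ".".join([topic_prefix, database, schema, table])


# Values taken from connector.json above.
topic = change_topic("cratedbdemo", "erp", "dbo", "tbltest")
print(topic)
```

If you change topic.prefix or point the connector at a different table, the topic the sink subscribes to has to change accordingly.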

Let’s deploy this:

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @connector.json
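
After posting the configuration, we can ask the Kafka Connect REST API whether the connector and its tasks are running, using the /connectors/<name>/status endpoint. The following Python sketch is one possible way to interpret that response. The helper names and the sample responses are made up for illustration, and real responses carry additional fields such as worker_id.

```python
import json
import urllib.request


def fetch_status(name, base_url="http://localhost:8083"):
    """Call GET /connectors/<name>/status on the Kafka Connect REST API."""
    with urllib.request.urlopen(f"{base_url}/connectors/{name}/status") as resp:
        return json.load(resp)


def is_running(status):
    """True when the connector and every task it lists report RUNNING."""
    states = [status["connector"]["state"]]
    states += [task["state"] for task in status.get("tasks", [])]
    return all(state == "RUNNING" for state in states)


# Offline example with made-up responses; against a live worker you would
# call is_running(fetch_status("mssql-source-tbltest")) instead.
healthy = {"connector": {"state": "RUNNING"}, "tasks": [{"id": 0, "state": "RUNNING"}]}
failed = {"connector": {"state": "RUNNING"}, "tasks": [{"id": 0, "state": "FAILED"}]}
print(is_running(healthy), is_running(failed))
```

A task in the FAILED state usually comes with a trace field in the same response, which is a good starting point for troubleshooting connection or permission problems on the MSSQL side.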

Configure a target

Let’s create a destination-connector.json file as follows:

{
    "name": "cratedb-sink-tbltest",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "tasks.max": "1",       

        "connection.url": "jdbc:postgresql://host.docker.internal:5432/",              
        "connection.user": "debezium",      
        "connection.password": "debeziumpwdincratedb123",               

        "topics": "cratedbdemo.erp.dbo.tbltest", 
        "table.name.format": "dbo.tbltest",
        "auto.create": "false",
        "auto.evolve": "false",

        "insert.mode": "upsert",
        "pk.fields": "id",
        "pk.mode": "record_value",      

        "transforms": "unwrap",                                                 
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
    }
}

Here we have the details to connect to CrateDB, the name of the table that will receive the changes (please note this is case sensitive), and some transform instructions to flatten the JSON data stored in the Kafka topic.
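
To make the roles of the unwrap transform and the upsert mode more tangible, here is a conceptual Python sketch. It is not how Debezium or the JDBC sink are implemented. It assumes a heavily simplified change event that keeps only the op, before, and after fields, while real events carry more, such as source metadata and timestamps. The sample rows are made up, and how deletes are handled in practice depends on the transform and sink configuration.

```python
def unwrap(envelope):
    """Keep only the row state after the change; None when there is none (deletes)."""
    return envelope.get("after")


def upsert(table, row, pk_field="id"):
    """Insert the row, or overwrite the existing row with the same primary key."""
    table[row[pk_field]] = row


# Made-up change events, reduced to the fields this sketch needs.
events = [
    {"op": "c", "before": None,
     "after": {"id": 1, "srcsystem": "web"}},
    {"op": "u", "before": {"id": 1, "srcsystem": "web"},
     "after": {"id": 1, "srcsystem": "mobile"}},
]

target = {}  # stands in for dbo.tbltest on CrateDB, keyed by id
for event in events:
    row = unwrap(event)
    if row is not None:
        upsert(target, row)

print(target)
```

The create and the update for id 1 collapse into a single row holding the latest values, which is why the primary key on the CrateDB table needs to match the pk.fields setting.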