<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Hernán Lionel Cianfagna</title>
    <description>The latest articles on Forem by Hernán Lionel Cianfagna (@hlcianfagna).</description>
    <link>https://forem.com/hlcianfagna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1035734%2Ff9ec2bc7-0e15-47e9-a9bf-5a4af4dd293e.jpeg</url>
      <title>Forem: Hernán Lionel Cianfagna</title>
      <link>https://forem.com/hlcianfagna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/hlcianfagna"/>
    <language>en</language>
    <item>
      <title>Retrieving records in bulk with a list of primary key values in CrateDB</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Fri, 23 Feb 2024 09:33:28 +0000</pubDate>
      <link>https://forem.com/crate/retrieving-records-in-bulk-with-a-list-of-primary-key-values-in-cratedb-3em0</link>
      <guid>https://forem.com/crate/retrieving-records-in-bulk-with-a-list-of-primary-key-values-in-cratedb-3em0</guid>
      <description>&lt;p&gt;When we send SQL statements to CrateDB they need to be parsed, but in most situations we do not think about this because the resources used for parsing the statements are trivial in relation to what is required to actually execute the queries.&lt;/p&gt;

&lt;p&gt;One exception to this is when INSERTing a large amount of rows, and for this case CrateDB has &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/interfaces/http.html#bulk-operations"&gt;a very efficient bulk operations interface&lt;/a&gt; which can also be used for UPDATEs and DELETEs.&lt;/p&gt;

&lt;p&gt;However I recently came across an unusual requirement, we had a very large table with a primary key made of multiple fields, and given tens of thousands of values for these fields we needed to retrieve all the corresponding records.&lt;/p&gt;

&lt;p&gt;Let me exemplify the situation with this table definition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;ts&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt; &lt;span class="nb"&gt;SMALLINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="k"&gt;OBJECT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s also create some sample data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'machine'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;
    &lt;span class="p"&gt;,{&lt;/span&gt;&lt;span class="nv"&gt;"test"&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'abc'&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;100000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are different approaches we could use to retrieve multiple rows for the given PK values, such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; 
&lt;span class="k"&gt;UNION&lt;/span&gt; 
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine2'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine1'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'2024-02-21 08:00:00.000Z'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;machine_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'machine2'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;sensor_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works reasonably well up to a few hundred records, but let’s see what happens if we try to use this approach to lookup tens of thousands of different records in a single statement as it was the requirement in this very particular case.&lt;/p&gt;

&lt;p&gt;Let’s dynamically generate a query like the above but for 10,000 records:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;thedata&lt;/span&gt;
&lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="s1"&gt;'OR (ts='&lt;/span&gt;&lt;span class="p"&gt;,(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;BIGINT&lt;/span&gt;&lt;span class="p"&gt;)::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;' and machine_id=&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;machine_id&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt; and sensor_type='&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sensor_type&lt;/span&gt;
            &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;')'&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;onewherecondition&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;sensor_data&lt;/span&gt; 
    &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;CONCAT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s1"&gt;'SELECT * FROM sensor_data WHERE '&lt;/span&gt;
        &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nv"&gt;"thearray"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array_agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;onewherecondition&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}::&lt;/span&gt;&lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"thearray":["OR '&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'","'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;' '&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'"]}'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;';'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;thedata&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will generate a very long statement, and when we try to run it we may get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;StackOverflowError[null]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;io.crate.exceptions.SQLParseException: line 1:1: statement is too large (stack overflow while parsing)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So we will need a different strategy, and we also want this to run as quickly as possible.&lt;/p&gt;

&lt;p&gt;Let’s start by preparing a CSV file with 10,000 primary key values we will use for testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;crash
crash &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"SELECT ts,machine_id,sensor_type FROM sensor_data LIMIT 10000;"&lt;/span&gt; &lt;span class="nt"&gt;--format&lt;/span&gt; csv &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pkvalues.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What we are going to do now is take advantage of a &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/general/ddl/system-columns.html"&gt;system column&lt;/a&gt; called &lt;code&gt;_id&lt;/code&gt; which exists on all CrateDB tables. This column contains a unique identifier for each row, and for tables with a PK defined it is a compound string representation of all primary key values of that row. The useful characteristic here is that the value is deterministic, given 2 tables with the same PK definition rows with the same PK values will have the same &lt;code&gt;_id&lt;/code&gt; values.&lt;/p&gt;

&lt;p&gt;So to perform this “bulk SELECT” we are going to use a staging table defined with the same PK as the main table. The Python code below bulk loads the values from the CSV file to the staging table and then uses the &lt;code&gt;_id&lt;/code&gt; values to locate all the rows we are interested in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas &lt;span class="s2"&gt;"crate[sqlalchemy]"&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crate.client.sqlalchemy.support&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;insert_bulk&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pkvalues.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crate://localhost:4200&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;connect_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;verify_ssl_cert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DROP TABLE IF EXISTS relevant_pk_values;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        CREATE TABLE relevant_pk_values (
            ts TIMESTAMP
            ,machine_id TEXT
            ,sensor_type SMALLINT
            ,PRIMARY KEY (ts,machine_id,sensor_type)
        ) CLUSTERED INTO 1 SHARDS;
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;relevant_pk_values&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;con&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunksize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;method&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;insert_bulk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REFRESH TABLE relevant_pk_values;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;resultset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;connection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;sa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        SELECT *
        FROM sensor_data
        WHERE _id IN (SELECT _id FROM relevant_pk_values);
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you found this interesting, if you have any questions please do not hesitate to reach out to us through the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>cratedb</category>
      <category>python</category>
      <category>sql</category>
      <category>performance</category>
    </item>
    <item>
      <title>Using common table expressions to speed up queries</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Thu, 22 Feb 2024 09:44:56 +0000</pubDate>
      <link>https://forem.com/crate/using-common-table-expressions-to-speed-up-queries-29om</link>
      <guid>https://forem.com/crate/using-common-table-expressions-to-speed-up-queries-29om</guid>
      <description>&lt;p&gt;Today I want to share with you a pattern you can use to replace &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/concepts/joins.html"&gt;JOINs&lt;/a&gt; with &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/general/dql/selects.html#sql-dql-with"&gt;CTEs&lt;/a&gt; in your SQL queries and achieve consistent and faster execution times.&lt;/p&gt;

&lt;p&gt;Consider a database where we store information about invoices, a simplified model could consist of a table where we store details about the customer and payment terms, a separate table where we have the items included in the invoice, and a 3rd table where we store product information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;invoice_number&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;payment_terms&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;  &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="s1"&gt;'30 days from issue date'&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;invoice_number&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;  &lt;span class="nb"&gt;REAL&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;unit_price&lt;/span&gt; &lt;span class="nb"&gt;REAL&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;product_id&lt;/span&gt;  &lt;span class="nb"&gt;BIGINT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_description&lt;/span&gt;  &lt;span class="nb"&gt;TEXT&lt;/span&gt;
  &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;applicable_tax_percentage&lt;/span&gt;  &lt;span class="nb"&gt;REAL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now imagine we want to know how many units of “super cool product” have been sold in January 2024, we could write a SQL query with JOINs like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt; 
&lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is perfectly valid SQL, but it leaves the database engine with a lot of options.&lt;/p&gt;

&lt;p&gt;Even without considering the complexities of a distributed system, parallel processing, and disk/memory options there are still many different possible strategies here, for instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“super cool product“ may only be sold very rarely, we could then start by looking up its &lt;code&gt;product_id&lt;/code&gt; then all the instances where it has been sold and then check if the corresponding invoices were in January 2024&lt;/li&gt;
&lt;li&gt;or perhaps the product is sold often and we have data for 20 years of sales, so we could start by looking up all the invoices from January, then their line items, and see which ones are for this product&lt;/li&gt;
&lt;li&gt;or maybe the company only sells a handful of products and this one is a best seller, we may also only keep invoices for the last 45 days, meaning that neither the date of the invoice nor the product are very selective, in which case it may be faster to consider the full list of &lt;code&gt;invoice_items&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With up-to-date &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/sql/statements/analyze.html"&gt;statistics&lt;/a&gt; database engines like CrateDB can usually do a good job at identifying an optimal execution plan, but there is the risk some day statistics may not be available on your target environment, or even with statistics available other factors may induce the query engine to go down the wrong path.&lt;/p&gt;

&lt;p&gt;The impact of using a suboptimal execution plan here could be huge, we could find ourselves trying to JOIN millions and millions of records.&lt;/p&gt;

&lt;p&gt;Let’s do a small test creating sample data for the first scenario above, the one where “super cool product“ is only sold very rarely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="cm"&gt;/* one million invoices in December */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'2023-12-01'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* one million invoices in January */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;GENERATE_SERIES&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* 2 products */&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt;&lt;span class="p"&gt;),(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'another product'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="cm"&gt;/* one line item per invoice and only 1 instance in 2 million where product 1 was sold */&lt;/span&gt;
&lt;span class="n"&gt;REFRESH&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;CASE&lt;/span&gt; &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2000000&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ceiling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can now run the query with the JOINs a few times, in my small test environment it settles in running in around 750 milliseconds.&lt;/p&gt;

&lt;p&gt;We can also look at the execution plan and all its details using the &lt;a href="https://cratedb.com/docs/crate/reference/en/5.6/sql/statements/explain.html"&gt;EXPLAIN command&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let’s now try this approach where we use CTEs to guide the engine to execute the query using steps we know are more optimal for the profile of our data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;product_id&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;product_description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'super cool product'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoice_items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product_id&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_product_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;relevant_invoices&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;relevant_invoices&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;relevant_invoice_lines&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;invoice_number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;relevant_invoices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;issue_date&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="s1"&gt;'2024-02-01'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now see this runs consistently in single digit milliseconds, a 100x improvement.&lt;/p&gt;

&lt;p&gt;In large and busy environments this kind of optimization may make a big difference, so it may be something to add to your toolbox.&lt;/p&gt;

&lt;p&gt;I hope you found this interesting, and as usual if you have any questions do not hesitate to reach out to us in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>database</category>
      <category>sql</category>
      <category>performance</category>
    </item>
    <item>
      <title>Connecting with SSL to CrateDB using the PostgreSQL protocol from Java-based applications</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Wed, 13 Sep 2023 12:39:48 +0000</pubDate>
      <link>https://forem.com/crate/connecting-with-ssl-to-cratedb-using-the-postgresql-protocol-from-java-based-applications-4mpp</link>
      <guid>https://forem.com/crate/connecting-with-ssl-to-cratedb-using-the-postgresql-protocol-from-java-based-applications-4mpp</guid>
      <description>&lt;p&gt;If you are using &lt;a href="https://crate.io/lp-free-trial"&gt;CrateDB Cloud&lt;/a&gt;, or if you have &lt;a href="https://crate.io/docs/crate/reference/en/latest/admin/ssl.html"&gt;configured a server certificate&lt;/a&gt; for an on-premises deployment, and you try to enforce SSL on a PostgreSQL connection to CrateDB you may come across an error message like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt.
  C:\Users\Hernan\AppData\Roaming\postgresql\root.crt (The system cannot find the path specified)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;org.postgresql.util.PSQLException: Could not open SSL root certificate file C:\Users\Hernan\AppData\Roaming\postgresql\root.crt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not specific to CrateDB, and you would get the same message trying to connect to an actual PostgreSQL instance, but I found no simple explanation of this error message and the options available, so here are my two cents.&lt;/p&gt;

&lt;p&gt;What happens here is that the client is trying to confirm the server we are establishing an encrypted connection with is indeed the machine we intended to reach, doing this involves validating that the certificate used by the server has been issued by a trusted certification authority.&lt;br&gt;
In this case, the client driver is trying to find the details of valid certification authorities on a PKCS12 file on the location indicated in the error message.&lt;/p&gt;

&lt;p&gt;I find that in most cases it makes sense to pick one of the two options below to address this.&lt;/p&gt;

&lt;p&gt;If we want the communication channel with the server to be encrypted, but we are on a trusted network environment and do not require verification of the server certificate, we can use this in our connection string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl=true&amp;amp;sslmode=require
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But if we want to have both encryption and the confirmation that we are talking to the intended server, we can tell the driver to use the list of certification authorities our JVM accepts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ssl=true&amp;amp;sslmode=verify-full&amp;amp;sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some software (&lt;a href="https://dbeaver.io/"&gt;DBeaver&lt;/a&gt; for instance) may have separate configuration settings where you can set the SSL Factory and SSL mode:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k85hn3z5j7q6axwgfy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2k85hn3z5j7q6axwgfy2.png" alt="SSL Mode and Factory settings can be found in DBeaver on the SSL tab of the connection settings" width="518" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this helps. As usual please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>postgres</category>
      <category>java</category>
    </item>
    <item>
      <title>Using dbt with CrateDB</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Fri, 18 Aug 2023 16:32:06 +0000</pubDate>
      <link>https://forem.com/crate/using-dbt-with-cratedb-65i</link>
      <guid>https://forem.com/crate/using-dbt-with-cratedb-65i</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/dbt-labs/dbt-core"&gt;Dbt&lt;/a&gt; is a tool for transforming data in data warehouses using Python and SQL. The idea is that Data Engineers make source data available to an environment where dbt projects run (for instance with &lt;a href="https://dev.to/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981"&gt;Debezium&lt;/a&gt; or with &lt;a href="https://crate.io/integrations/cratedb-and-apache-airflow"&gt;Airflow&lt;/a&gt;), and Data Analysts can then run their dbt projects against this data to produce models (tables and views) that can be used with &lt;a href="https://crate.io/integrations#visualization-analytics"&gt;BI tools&lt;/a&gt;.&lt;br&gt;
This layer allows the decoupling of the models on which reports and dashboards rely from the source data, and if our business rules or our source systems change we can still maintain the same models as a stable interface.&lt;/p&gt;

&lt;p&gt;Some of the things that dbt can do include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;import reference data from csv files&lt;/li&gt;
&lt;li&gt;track changes in source data with different strategies so that downstream models do not need to be built every time from scratch&lt;/li&gt;
&lt;li&gt;run tests on data, to confirm assumptions remain valid, and to validate any changes made to the models' logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Due to its unique capabilities, &lt;a href="https://cratedb.com/"&gt;CrateDB&lt;/a&gt; is an excellent warehouse choice for data transformation projects. It offers automatic indexing, fast aggregations, easy partitioning, and the ability to scale horizontally. In this article, I will illustrate how to get the most important functionalities of dbt working by doing the necessary changes in the configuration.&lt;/p&gt;

&lt;p&gt;Our starting point will be a fresh install of &lt;code&gt;dbt-postgres&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;dbt-postgres&lt;span class="o"&gt;==&lt;/span&gt;1.6.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can then create a profiles file with our connection details:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~
&lt;span class="nb"&gt;mkdir&lt;/span&gt; .dbt
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; .dbt/profiles.yml
example_datawarehouse_profile:
  target: dev
  outputs:
    dev:
      type: postgres
      host: localhost
      port: 5432
      database: crate
      schema: doc
      search_path: doc
      user: dbt   
      password: pwd1234567A
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(please note the values for &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;schema&lt;/code&gt;, and &lt;code&gt;search_path&lt;/code&gt; in this example)&lt;/p&gt;

&lt;p&gt;We will not go into the details of how the project files are structured (for more information check out dbt’s documentation), but in general, a dbt project consists of a combination of SQL, Jinja, YAML, and markdown files. In our project folder, alongside the &lt;code&gt;models&lt;/code&gt; folder that most projects have, we can also create a folder called &lt;code&gt;macros&lt;/code&gt; where we can place macro overrides.&lt;br&gt;
Let's then create a &lt;code&gt;macros&lt;/code&gt; folder and place some files with overrides on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;macros
&lt;span class="nb"&gt;cd &lt;/span&gt;macros
wget https://community.crate.io/uploads/short-url/fKupQCFUHtuoKom3jAfKrldUXkt.sql
wget https://community.crate.io/uploads/short-url/qvQExEq1OopiVUcXACLGfpdGHYF.sql
wget https://community.crate.io/uploads/short-url/3jcFxL1EExLrERJSTc6ScnzTS9f.sql
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things I have tested with these overrides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;models with &lt;a href="https://docs.getdbt.com/docs/build/materializations"&gt;view, table, and ephemeral materializations&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/deploy/source-freshness"&gt;dbt source freshness&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/build/tests"&gt;dbt test&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.getdbt.com/docs/build/seeds"&gt;dbt seed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.getdbt.com/docs/build/incremental-models"&gt;Incremental materializations&lt;/a&gt; (with &lt;code&gt;incremental_strategy='delete+insert'&lt;/code&gt; and without involving &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/ddl/data-types.html#objects"&gt;OBJECT&lt;/a&gt; columns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you find this useful. CrateDB is continuously adding new features and I will endeavor to come back and update this article if there are any developments and some of these overrides require changes or become obsolete.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Using regex comparisons and other advanced database features for real-time inspection of web server logs</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Mon, 14 Aug 2023 12:23:16 +0000</pubDate>
      <link>https://forem.com/crate/using-regex-comparisons-and-other-advanced-database-features-for-real-time-inspection-of-web-server-logs-35ip</link>
      <guid>https://forem.com/crate/using-regex-comparisons-and-other-advanced-database-features-for-real-time-inspection-of-web-server-logs-35ip</guid>
      <description>&lt;p&gt;In &lt;a href="https://dev.to/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no"&gt;Storing server logs on CrateDB for fast search and aggregations&lt;/a&gt; we saw how we can get server logs sent to CrateDB in real-time, and for demo purposes we set up an instance of MediaWiki.&lt;br&gt;
It was just an example, but it could have been any web server application.&lt;br&gt;
Let's now imagine that we suspect people are trying to perform &lt;a href="https://en.wikipedia.org/wiki/SQL_injection"&gt;SQL injection&lt;/a&gt; attacks against our website, we need to keep an eye on the logs.&lt;br&gt;
We have already seen how we can use &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/dql/fulltext.html"&gt;fulltext search&lt;/a&gt; to look for specific error messages, but would it not be great if we could have some rules inspecting the log entries as they come in, extracting relevant information, and flagging anything potentially suspicious?&lt;br&gt;
There are a lot of nice features in CrateDB to support this kind of setup, let me show you an example.&lt;br&gt;
The log entries our web server container is producing look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 192.168.0.121 - - [11/Aug/2023:12:59:42 +0000] "GET /favicon.ico HTTP/1.1" 200 852 "http://192.168.0.202/mw-config/index.php?page=Welcome" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see we have the client IP address and the HTTP request that was sent.&lt;br&gt;
Today I want to do 2 checks,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I want to query only for requests coming from a specific IP subnet,&lt;/li&gt;
&lt;li&gt;and I want to see if the HTTP request has anything suspicious that could suggest an attempt to perform a SQL injection attack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I will use &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/ddl/generated-columns.html"&gt;generated columns&lt;/a&gt; to extract this information from the log entries as they arrive.&lt;br&gt;
CrateDB supports &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/alter-table.html#add-column"&gt;adding columns&lt;/a&gt; live to existing tables, but generated columns are special and they can only be added to empty tables, to add our new columns to our &lt;code&gt;systemevents&lt;/code&gt; table without any downtime we will use another feature of CrateDB, &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/alter-cluster.html#swap-table"&gt;table swapping&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's create a new table with the 2 special columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;message_ft&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;FULLTEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;facility&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fromhost&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DeviceReportedTime&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ReceivedAt&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;InfoUnitID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;SysLogTag&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; 
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;clientip&lt;/span&gt; &lt;span class="n"&gt;IP&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;TRY_CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;btrim&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;split_part&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s1"&gt;'-'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;IP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;suspectedSQLinjection&lt;/span&gt; &lt;span class="nb"&gt;BOOLEAN&lt;/span&gt; &lt;span class="k"&gt;GENERATED&lt;/span&gt; &lt;span class="n"&gt;ALWAYS&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;~*&lt;/span&gt; 
                &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'.*SELECT.*FROM.*'&lt;/span&gt; &lt;span class="p"&gt;,&lt;/span&gt; 
                       &lt;span class="s1"&gt;'|.*UNION.*SELECT.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*DELETE.*FROM.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*UPATE.*SET.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*ALTER.*TABLE.*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                       &lt;span class="s1"&gt;'|.*(%27|&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;)%20.*%20(%27|&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;).*'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we are extracting the client IP address from the &lt;code&gt;message&lt;/code&gt; text and storing it using the dedicated &lt;code&gt;IP&lt;/code&gt; data type in CrateDB. We use the &lt;code&gt;split_part&lt;/code&gt; function to look up the string up to the dash symbol, then we use the &lt;code&gt;btrim&lt;/code&gt; function to remove spaces from both sides of the string, and finally we use the &lt;code&gt;TRY_CAST&lt;/code&gt; function so that log entries that do not have an IP address in this position get a &lt;code&gt;NULL&lt;/code&gt; value as &lt;code&gt;clientip&lt;/code&gt; but no error message is raised.&lt;/p&gt;

&lt;p&gt;We are also using the case insensitive &lt;code&gt;~*&lt;/code&gt; &lt;a href="https://crate.io/docs/crate/reference/en/master/general/dql/selects.html#sql-dql-regexp"&gt;regex comparison&lt;/a&gt; operator to look for indications of a possible SQL injection attack attempt, we are looking for occurrences of &lt;code&gt;SELECT .. FROM&lt;/code&gt; , &lt;code&gt;UNION ... SELECT&lt;/code&gt; , &lt;code&gt;DELETE ... FROM&lt;/code&gt; , &lt;code&gt;UPDATE ... SET&lt;/code&gt; , &lt;code&gt;ALTER ... TABLE&lt;/code&gt; , or attempts to break a string delimiter injecting a single quote character. This will match entries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; 172.17.0.1 - - [11/Aug/2023:13:03:07 +0000] "GET /mw-config/index.php?css=1%27%20WAITFOR%20DELAY%20%270%3A0%3A5%27%20AND%20%27Lshb%27%3D%27Lshb HTTP/1.1" 200 4627 "-" "sqlmap/1.7.8#pip (https://sqlmap.org)"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now swap in this new table and rename the old one as &lt;code&gt;systemevents_archive&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;CLUSTER&lt;/span&gt; &lt;span class="n"&gt;SWAP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;REVOKE&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents2&lt;/span&gt; &lt;span class="k"&gt;RENAME&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;systemevents_archive&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To show how this work we will need both normal activity, which we can generate just by navigating to &lt;a href="http://localhost/"&gt;http://localhost/&lt;/a&gt; (perhaps from another machine to get a different IP address in the logs), and we will also need some malicious-looking activity, to generate this against our website we can use a well-known pentesting tool &lt;a href="https://github.com/sqlmapproject/sqlmap"&gt;sqlmap&lt;/a&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;sqlmap
sqlmap &lt;span class="nt"&gt;-u&lt;/span&gt; http://localhost/ &lt;span class="nt"&gt;--crawl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Press ENTER when prompted to proceed with default settings, there will be around 10 questions before the tool concludes there are no obvious vulnerabilities in the website.&lt;/p&gt;

&lt;p&gt;We can now run queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;systemevents&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;clientip&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'172.17.0.0/16'&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to get the most recent activity from IP addresses in the 172.17.0.0/16 subnet, and we can also look for suspicious activity only:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;systemevents&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;suspectedSQLinjection&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kind of queries could then be integrated into a &lt;a href="https://community.crate.io/t/monitoring-an-on-premises-cratedb-cluster-with-prometheus-and-grafana/1236"&gt;Grafana dashboard&lt;/a&gt; or some alerting system.&lt;/p&gt;

&lt;p&gt;I hope you found this interesting. Please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>regex</category>
      <category>security</category>
      <category>sql</category>
      <category>networking</category>
    </item>
    <item>
      <title>Storing server logs on CrateDB for fast search and aggregations</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Thu, 10 Aug 2023 08:27:19 +0000</pubDate>
      <link>https://forem.com/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no</link>
      <guid>https://forem.com/crate/storing-server-logs-on-cratedb-for-fast-search-and-aggregations-no</guid>
      <description>&lt;p&gt;Did you know that CrateDB can be a great store for your server logs?&lt;/p&gt;

&lt;p&gt;If you have been using log aggregation tools or even some of the most advanced commercial SIEM systems, you have probably experienced the same frustrations I have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;timeouts when searching logs over long periods of time&lt;/li&gt;
&lt;li&gt;a complex and proprietary query syntax&lt;/li&gt;
&lt;li&gt;difficulties integrating queries on logs data into application monitoring dashboards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Storing server logs on CrateDB solves these problems, it allows to query the logs with standard SQL and from any tool supporting the PostgreSQL protocol; its unique indexing also makes full-text queries and aggregations super fast.&lt;br&gt;
Let me show you an example.&lt;/p&gt;

&lt;p&gt;First, we will need an instance of CrateDB, it may be best to have a dedicated cluster for this purpose, to separate the monitoring system from the systems being monitored, but for the purpose of this demo we can just have a single node cluster on a docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; cratedb &lt;span class="nt"&gt;--publish&lt;/span&gt; 4200:4200 &lt;span class="nt"&gt;--publish&lt;/span&gt; 5432:5432 crate &lt;span class="nt"&gt;-Cdiscovery&lt;/span&gt;.type&lt;span class="o"&gt;=&lt;/span&gt;single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need a table to store the logs, let's connect to &lt;a href="http://localhost:4200/#!/console"&gt;http://localhost:4200/#!/console&lt;/a&gt; and run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;message_ft&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;FULLTEXT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;facility&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;fromhost&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;priority&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DeviceReportedTime&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ReceivedAt&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;InfoUnitID&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;SysLogTag&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; 
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tip: if you are on a headless system you can also run queries with &lt;a href="https://cratedb.com/docs/crate/clients-tools/en/latest/cli.html"&gt;command-line tools&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then we need an account for the logging system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;PASSWORD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'pwd123'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and we need to grant permissions on the table above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DML&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;rsyslog&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We will use &lt;a href="https://github.com/rsyslog/rsyslog"&gt;rsyslog&lt;/a&gt; to send the logs to CrateDB, for this setup we need &lt;code&gt;rsyslog&lt;/code&gt; v8.2202 or higher and the &lt;code&gt;ompgsql&lt;/code&gt; module:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository ppa:adiscon/v8-stable
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;rsyslog
&lt;span class="nb"&gt;sudo &lt;/span&gt;debconf-set-selections &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s1"&gt;'rsyslog-pgsql rsyslog-pgsql/dbconfig-install string false'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;rsyslog-pgsql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now configure it to use the account we created earlier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'module(load="ompgsql")'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/rsyslog.d/pgsql.conf
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'*.* action(type="ompgsql" conninfo="postgresql://rsyslog:pwd123@localhost/doc")'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/rsyslog.d/pgsql.conf
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart rsyslog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are interested in more advanced setups involving queuing for additional reliability in production scenarios, you can read more about available settings in the &lt;a href="https://www.rsyslog.com/doc/v8-stable/tutorials/high_database_rate.html"&gt;rsyslog documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now let's imagine that we want to run a container with &lt;a href="https://www.mediawiki.org/wiki/MediaWiki"&gt;MediaWiki&lt;/a&gt; to host an intranet and we want all logs to go to CrateDB, we can just deploy this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;--name&lt;/span&gt; mediawiki &lt;span class="nt"&gt;-p&lt;/span&gt; 80:80 &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--log-driver&lt;/span&gt; syslog &lt;span class="nt"&gt;--log-opt&lt;/span&gt; syslog-address&lt;span class="o"&gt;=&lt;/span&gt;unixgram:///dev/log mediawiki
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we now point a web browser to port 80 &lt;a href="http://localhost/"&gt;http://localhost/&lt;/a&gt; we will see a new MediaWiki page.&lt;br&gt;
Let's play around a bit to generate log entries, just click on "set up the wiki" and then once on Continue.&lt;br&gt;
This will have generated entries in the &lt;code&gt;doc.systemevents&lt;/code&gt; table with &lt;code&gt;syslogtag&lt;/code&gt; matching the container id of the container running the site.&lt;/p&gt;

&lt;p&gt;We can now use the &lt;a href="https://crate.io/docs/crate/reference/en/5.4/general/dql/fulltext.html#predicates-match"&gt;MATCH predicate&lt;/a&gt; to find the error messages we are interested in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;devicereportedtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;MATCH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message_ft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Could not reliably determine'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;PHRASE&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| devicereportedtime | message                                                                                                                                                                     |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
|      1691510710000 | AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.3. Set the 'ServerName' directive globally to suppress this message |
+--------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now see which log sources created the most entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;syslogtag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;systemevents&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------+----------+
| syslogtag            | count(*) |
+----------------------+----------+
| kernel:              |       23 |
| 083053ae8ea3[52134]: |       20 |
| systemd[1]:          |       15 |
| sudo:                |       10 |
| rsyslogd:            |        5 |
+----------------------+----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I hope you found this interesting. Please do not hesitate to let us know your thoughts in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>docker</category>
      <category>containers</category>
      <category>database</category>
    </item>
    <item>
      <title>How to add new nodes to on-premises CrateDB clusters</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Tue, 18 Jul 2023 13:12:26 +0000</pubDate>
      <link>https://forem.com/crate/how-to-add-new-nodes-to-on-premises-cratedb-clusters-4lfo</link>
      <guid>https://forem.com/crate/how-to-add-new-nodes-to-on-premises-cratedb-clusters-4lfo</guid>
      <description>&lt;p&gt;A significant feature in CrateDB is that it can scale horizontally, which means that instead of adding more RAM, CPU, and disk resources to our existing nodes we can add more nodes to our CrateDB cluster.&lt;br&gt;
This allows the handling of volumes of data that simply could not fit on a single node, but it is also very useful in scenarios where hosting everything in a single node, or a small number of nodes, would still be possible, this is because smaller nodes are often easier to manage infrastructure-wise.&lt;br&gt;
More nodes also mean more resiliency to issues, on a scenario where we have for instance 5 nodes, and configure our tables with 2 replicas, we could lose 2 nodes and still serve our production workloads. This means we can carry out maintenance on the nodes one at a time and still be able to withstand an unplanned issue on another node without downtime.&lt;/p&gt;

&lt;p&gt;Today we want to review how to add a new node to an existing on-premises cluster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Related reading
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/howtos/en/latest/clustering/multi-node-setup.html"&gt;CrateDB multi-node setup — CrateDB: How-Tos&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/reference/en/latest/concepts/clustering.html"&gt;Clustering — CrateDB: Reference&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://crate.io/docs/crate/reference/en/latest/concepts/storage-consistency.html"&gt;Storage and consistency — CrateDB: Reference&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Discovery
&lt;/h2&gt;

&lt;p&gt;When a CrateDB node starts it needs a mechanism to get a list of the nodes that make up the cluster, this is called discovery.&lt;br&gt;
At the time of writing, there are 3 ways for a node to get this list:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The list of nodes can be defined in the &lt;code&gt;discovery.seed_hosts&lt;/code&gt; setting in the configuration file (typically in &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;The list can be retrieved with a DNS query, see &lt;a href="https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns"&gt;https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-via-dns&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;In AWS environments, the list of nodes can be looked up via the EC2 API, filtering on specific security groups, availability zones, and tags, see &lt;a href="https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2"&gt;https://crate.io/docs/crate/reference/en/5.4/config/cluster.html#discovery-on-amazon-ec2&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the purpose of this post, we will work with the &lt;code&gt;discovery.seed_hosts&lt;/code&gt; list.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling from a single-node deployment
&lt;/h2&gt;

&lt;p&gt;If a node is started without specifying the &lt;code&gt;initial_master_nodes&lt;/code&gt; setting (the default configuration), or with &lt;code&gt;discovery_type&lt;/code&gt; set to &lt;code&gt;single-node&lt;/code&gt;, it will be started as a standalone instance and it cannot later be scaled into a cluster. Single-node deployments are great for development and testing, but for production setups we recommend using a cluster with at least 3 nodes.&lt;/p&gt;

&lt;p&gt;If you are going for a single-node deployment initially, but plan to scale to a multi-node cluster in the future, there are some settings to configure before the very first run of the CrateDB node so that it bootstraps as a 1-node cluster instead of a standalone instance.&lt;br&gt;
The settings that we need are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;discovery.seed_hosts&lt;/code&gt; set to the hostname or the &lt;code&gt;node.name&lt;/code&gt; of the node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;initial_master_nodes&lt;/code&gt; set to the hostname or the &lt;code&gt;node.name&lt;/code&gt; of the node&lt;/li&gt;
&lt;li&gt;optionally we can set a &lt;code&gt;cluster.name&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are using containers you would pass these settings with lines in the &lt;code&gt;args&lt;/code&gt; section of your YAML file, otherwise you could create &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; before deploying the package for your distribution (refer to &lt;a href="https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml"&gt;https://github.com/crate/crate/blob/master/app/src/main/dist/config/crate.yml&lt;/a&gt; for the template), or you could prevent the package installation from auto-starting the daemon by using a mechanism such as &lt;code&gt;policy-rcd-declarative&lt;/code&gt;, then edit the configuration file (&lt;code&gt;crate.yml&lt;/code&gt; ), and start the &lt;code&gt;crate&lt;/code&gt; daemon once all settings are ready.&lt;/p&gt;
&lt;h2&gt;
  
  
  Networking considerations
&lt;/h2&gt;

&lt;p&gt;Nodes need to be able to resolve each other's hostnames at DNS level, and they need to be able to reach each other on a TCP port which is 4300 by default.&lt;/p&gt;

&lt;p&gt;For security reasons you should configure your network so that CrateDB cluster nodes are only reachable on port 4300 from other CrateDB nodes in the cluster.&lt;br&gt;
In a Kubernetes environment this can be achieved with a &lt;code&gt;Service&lt;/code&gt; resource with a &lt;code&gt;ClusterIP&lt;/code&gt;.&lt;br&gt;
In a non-containerized environment one way to do this is to use firewall software directly on each node, for instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#Enable ufw - all incoming connections blocked by default&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;

&lt;span class="c"&gt;#Allow SSH if you are using it to manage your server&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 22

&lt;span class="c"&gt;#Allow 4200 for clients to connect to CrateDB via the http endpoint&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 4200

&lt;span class="c"&gt;#Allow 5432 if you have PostgreSQL clients&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 5432

&lt;span class="c"&gt;#Allow 4300 from 192.168.0.202 (another cluster node in this example)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow proto tcp from 192.168.0.202 to any port 4300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may also want to consider network access control and/or a separate network adapter for intra-cluster communications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deploying the new node
&lt;/h2&gt;

&lt;p&gt;Make sure the new node does not auto-bootstrap as a single-node instance, you may want to either create &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; in advance or use a mechanism as &lt;code&gt;policy-rcd-declarative&lt;/code&gt; as mentioned earlier.&lt;br&gt;
On the configuration file for the new node:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;discovery.seed_hosts&lt;/code&gt; to the full list of nodes, including the new one you are adding.&lt;/li&gt;
&lt;li&gt;Optionally set a &lt;code&gt;node.name&lt;/code&gt; , if not done the node get assigned a random name from the &lt;code&gt;sys.summits&lt;/code&gt; table. You may wonder what those default names are about, they are the names of mountains in the area around our main office, we love mountains at &lt;a href="http://crate.io/"&gt;Crate.io&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;cluster.name&lt;/code&gt; to a value that matches the other nodes in the cluster, if not specified the default cluster name is &lt;code&gt;crate&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Consider if you want to set the &lt;a href="https://crate.io/docs/crate/reference/en/latest/config/cluster.html#metadata-gateway"&gt;cluster-wide settings&lt;/a&gt; &lt;code&gt;gateway.expected_data_nodes&lt;/code&gt;, &lt;code&gt;gateway.recover_after_data_nodes&lt;/code&gt;, and/or &lt;code&gt;gateway.recover_after_time&lt;/code&gt; to prevent the unnecessary creation of new replicas and the rebalancing of shards when a node takes a little bit longer to start, or in case of transient issues, when the cluster is starting up from a situation where all nodes are shutdown. Please note these settings are used when the cluster is starting up from being offline, if you want to delay the allocation of replicas when a node becomes unavailable on a cluster that stays online there is &lt;a href="https://crate.io/docs/crate/reference/en/5.4/sql/statements/create-table.html#unassigned-node-left-delayed-timeout"&gt;a different setting at table level&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we can start the &lt;code&gt;crate&lt;/code&gt; daemon, you will see the node joining the cluster and CrateDB will start using it for shards allocation.&lt;/p&gt;

&lt;p&gt;Remember to add the new nodes alongside the old ones in any monitoring system and load balancer configuration you may have in your environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating settings on the old nodes
&lt;/h2&gt;

&lt;p&gt;Now we need to align a number of settings in the other nodes, these are typically in the &lt;code&gt;/etc/crate/crate.yml&lt;/code&gt; file:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Update &lt;code&gt;discovery.seed_hosts&lt;/code&gt; adding the new node&lt;/li&gt;
&lt;li&gt;If you have configured &lt;code&gt;gateway.&lt;/code&gt; settings, update them to have the same values on all nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These settings only play a role during restart, not at runtime, so you do not need to restart the nodes after making these changes, but if the &lt;code&gt;gateway.&lt;/code&gt; settings need updating you may see a warning in the Admin UI which can be acknowledged.&lt;/p&gt;

&lt;p&gt;Please also note there is no need to update the &lt;code&gt;initial_master_nodes&lt;/code&gt; list, this is only considered during the initial cluster bootstrapping.&lt;/p&gt;

&lt;p&gt;And that is it, we have scaled out our cluster and we are ready to work with larger volumes of data. I hope you find this useful and, as usual, please do not hesitate to raise any thoughts or questions in the &lt;a href="https://community.crate.io/"&gt;CrateDB Community&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>clustering</category>
      <category>scaling</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Replicating data from other databases to CrateDB with Debezium and Kafka</title>
      <dc:creator>Hernán Lionel Cianfagna</dc:creator>
      <pubDate>Tue, 28 Feb 2023 16:29:49 +0000</pubDate>
      <link>https://forem.com/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981</link>
      <guid>https://forem.com/crate/replicating-data-from-other-databases-to-cratedb-with-debezium-and-kafka-1981</guid>
      <description>&lt;p&gt;You may have line-of-business applications such as ERP software that work with transactional database systems like MSSQL, Oracle, or MySQL.&lt;/p&gt;

&lt;p&gt;The setup may work perfectly fine for day-to-day operations, but you may find that it is not ideal for doing data analytics.&lt;/p&gt;

&lt;p&gt;Attempting to run analytic workloads against the operational databases you may see concurrency issues deriving from locking, the analytics queries may have an impact on the performance of business-critical operations, and you may also find that the performance and feature-set in the transactional database system may not be good enough for analyzing large amounts of data.&lt;/p&gt;

&lt;p&gt;Considering this, many organisations come to the conclusion they need to copy data to a separate environment to run reporting and dashboards, this is sometimes done with replication, sometimes with backups, and sometimes with complex ETL pipelines. This often comes with a set of challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ballooning license costs&lt;/li&gt;
&lt;li&gt;custom ad-hoc routines for getting the data to the analytics environment, requiring development, monitoring, and troubleshooting&lt;/li&gt;
&lt;li&gt;a need to design and maintain an indexing strategy for the analytics copy of the data&lt;/li&gt;
&lt;li&gt;high availability requirements for the analytics environment as the business starts relying on it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We know we can address several of these points by using a system like CrateDB. CrateDB is a feature-rich, open-source, SQL database which out-of-the-box automatically implements indexes, compression, and a columnar store so that most analytical queries can run much faster without any need to fiddle with settings. Because it is open-source, there is no need to be concerned about licensing expenses. Additionally, it can scale horizontally, which means that the number of nodes can be adjusted as needed to handle changing data volumes and workloads, and it provides high availability without requiring administrative effort.&lt;/p&gt;

&lt;p&gt;If only we could replicate data from our operational database to CrateDB without having to write custom code… it turns out we can.&lt;/p&gt;

&lt;p&gt;Enter Debezium, &lt;a href="https://debezium.io/"&gt;Debezium &lt;/a&gt;is a standard open-source system, built on top of Kafka, which allows to capture changes on a source database system and replicate them on another system without having to write custom scripts.&lt;/p&gt;

&lt;p&gt;In this post I want to show an example replicating changes on a table from MSSQL to CrateDB.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup on the MSSQL side
&lt;/h2&gt;

&lt;p&gt;We will need a SQL Server instance with the SQL Server Agent service up and running, if you are running MSSQL on a container you can get the agent running by setting the environment variable &lt;code&gt;MSSQL_AGENT_ENABLED&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Connect to the instance with a client such as &lt;code&gt;sqlcmd&lt;/code&gt;, SSMS, or &lt;a href="https://dbeaver.io/"&gt;DBeaver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are now going to go through a number of steps, if you already have a working system feel free to skip the operations you do not need.&lt;/p&gt;

&lt;p&gt;Let’s create a database with a test table on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;GO&lt;/span&gt;
&lt;span class="n"&gt;USE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="k"&gt;IDENTITY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;createdon&lt;/span&gt; &lt;span class="nb"&gt;DATETIME&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;getdate&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="n"&gt;NVARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now create an account for Debezium to use to pull the changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;debeziumlogin&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;PASSWORD&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&amp;lt;enterStrongPasswordHere&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;debeziumuser&lt;/span&gt; &lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="n"&gt;LOGIN&lt;/span&gt; &lt;span class="n"&gt;debeziumlogin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;ROLE&lt;/span&gt; &lt;span class="n"&gt;debeziumrole&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sp_addrolemember&lt;/span&gt; &lt;span class="s1"&gt;'debeziumrole'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'debeziumuser'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sp_addrolemember&lt;/span&gt; &lt;span class="s1"&gt;'db_datareader'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'debeziumuser'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let’s enable change data capture on our example table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sp_cdc_enable_db&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FILEGROUP&lt;/span&gt; &lt;span class="n"&gt;cdcfg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;DATABASE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="n"&gt;FILE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;erp_cdc_file1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;FILENAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'/var/opt/mssql/data/erp_cdc_file1.ndf'&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="n"&gt;FILEGROUP&lt;/span&gt; &lt;span class="n"&gt;cdcfg&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;EXEC&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sp_cdc_enable_table&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;source_schema&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'dbo'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;source_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'tbltest'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'debeziumrole'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;filegroup_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'cdcfg'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;supports_net_changes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setup on the CrateDB side
&lt;/h2&gt;

&lt;p&gt;We will need a CrateDB instance, for the purpose of this example we can spin one up with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;--publish&lt;/span&gt; 4200:4200 &lt;span class="nt"&gt;--publish&lt;/span&gt; 5432:5432 crate:latest &lt;span class="nt"&gt;-Cdiscovery&lt;/span&gt;.type&lt;span class="o"&gt;=&lt;/span&gt;single-node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we need to run a couple of SQL commands on this instance, an easy way to do this is using the Admin UI that can be accessed navigating with a web browser to port 4200 on the server where CrateDB is running, for instance &lt;code&gt;http://localhost:4200&lt;/code&gt; and then open the console (second icon from the top on the left-hand side navigation bar).&lt;/p&gt;

&lt;p&gt;We will create a user account for Debezium to use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;USER&lt;/span&gt; &lt;span class="n"&gt;debezium&lt;/span&gt; &lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'debeziumpwdincratedb123'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The table on our MSSQL source is on the &lt;code&gt;dbo&lt;/code&gt; schema, let’s imagine we want to have a &lt;code&gt;dbo&lt;/code&gt; schema on CrateDB as well, the &lt;code&gt;debezium&lt;/code&gt; account will need permissions on it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;GRANT&lt;/span&gt; &lt;span class="n"&gt;DQL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DML&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;DDL&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="k"&gt;SCHEMA&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt; &lt;span class="k"&gt;to&lt;/span&gt; &lt;span class="n"&gt;debezium&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And let’s create the structure of the table that will receive the data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt; &lt;span class="cm"&gt;/* we need the PK definition to match the source table so that this can be used to lookup records when they need to be updated */&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;createdon&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="cm"&gt;/* CrateDB supports defaults -of course- but because the source table already has a default value we do not need that here */&lt;/span&gt;
    &lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Zookeeper and Kafka
&lt;/h2&gt;

&lt;p&gt;To use Debezium we will need to have working setups of Zookeeper and Kafka.&lt;/p&gt;

&lt;p&gt;For the purpose of this example I will spin them up with containers on the same machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; zookeeper &lt;span class="nt"&gt;-p&lt;/span&gt; 2181:2181 &lt;span class="nt"&gt;-p&lt;/span&gt; 2888:2888 &lt;span class="nt"&gt;-p&lt;/span&gt; 3888:3888 debezium/zookeeper
&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; kafka &lt;span class="nt"&gt;-p&lt;/span&gt; 9092:9092 &lt;span class="nt"&gt;--link&lt;/span&gt; zookeeper:zookeeper &lt;span class="nt"&gt;--add-host&lt;/span&gt; host.docker.internal:host-gateway debezium/kafka
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We need to create some special topics in Kafka:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; kafka &lt;span class="s2"&gt;"bash"&lt;/span&gt;
bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--topic&lt;/span&gt; my_connect_configs &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; host.docker.internal:9092 &lt;span class="nt"&gt;--config&lt;/span&gt; cleanup.policy&lt;span class="o"&gt;=&lt;/span&gt;compact
bin/kafka-topics.sh &lt;span class="nt"&gt;--create&lt;/span&gt; &lt;span class="nt"&gt;--replication-factor&lt;/span&gt; 1 &lt;span class="nt"&gt;--partitions&lt;/span&gt; 1 &lt;span class="nt"&gt;--topic&lt;/span&gt; my_connect_offsets &lt;span class="nt"&gt;--bootstrap-server&lt;/span&gt; host.docker.internal:9092 &lt;span class="nt"&gt;--config&lt;/span&gt; cleanup.policy&lt;span class="o"&gt;=&lt;/span&gt;compact
&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note this is a very basic setup, for production purposes you may want to adjust some of &lt;a href="https://kafka.apache.org/documentation/#topicconfigs"&gt;these settings&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing and starting a Debezium container image
&lt;/h2&gt;

&lt;p&gt;We need to customize the base &lt;code&gt;debezium/connect&lt;/code&gt; Docker image adding a JDBC sink and the PostgreSQL drivers.&lt;/p&gt;

&lt;p&gt;For this we need to download the zip file from &lt;a href="https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc"&gt;kafka-connect-jdbc&lt;/a&gt; and then run the below replacing &lt;code&gt;*************&lt;/code&gt; with the appropriate URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;customdockerimg
&lt;span class="nb"&gt;cd &lt;/span&gt;customdockerimg
wget &lt;span class="k"&gt;*************&lt;/span&gt;/confluentinc-kafka-connect-jdbc-10.6.3.zip
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;unzip
&lt;span class="nb"&gt;mkdir &lt;/span&gt;confluentinc-kafka-connect-jdbc-10.6.3
&lt;span class="nb"&gt;cd &lt;/span&gt;confluentinc-kafka-connect-jdbc-10.6.3
unzip &lt;span class="nt"&gt;-j&lt;/span&gt; ../confluentinc-kafka-connect-jdbc-10.6.3.zip
&lt;span class="nb"&gt;cd&lt;/span&gt; ..
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Dockerfile &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;  
FROM debezium/connect
USER root:root
COPY ./confluentinc-kafka-connect-jdbc-10.6.3/ /kafka/connect/
RUN cd /kafka/libs &amp;amp;&amp;amp; curl -sO https://jdbc.postgresql.org/download/postgresql-42.5.4.jar
USER 1001
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; cratedb-connect-debezium &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s now start this custom image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;docker run &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--name&lt;/span&gt; connect &lt;span class="nt"&gt;-p&lt;/span&gt; 8083:8083 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GROUP_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;CONFIG_STORAGE_TOPIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_connect_configs &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OFFSET_STORAGE_TOPIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;my_connect_offsets &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;--add-host&lt;/span&gt; host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;--add-host&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;BOOTSTRAP_SERVERS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:9092 &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;KEY_CONVERTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;org.apache.kafka.connect.json.JsonConverter &lt;span class="se"&gt;\&lt;/span&gt;
           &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;VALUE_CONVERTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;org.apache.kafka.connect.json.JsonConverter &lt;span class="se"&gt;\&lt;/span&gt;
           cratedb-connect-debezium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assumes Kafka is running locally on the same server, you will need to adjust &lt;code&gt;BOOTSTRAP_SERVERS&lt;/code&gt; if that is not the case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configure a source connector
&lt;/h2&gt;

&lt;p&gt;Let’s create a &lt;code&gt;connector.json&lt;/code&gt; file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mssql-source-tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.connector.sqlserver.SqlServerConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.history.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.bootstrap.servers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal:9092"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"topic.prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedbdemo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.encrypt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.hostname"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"host.docker.internal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.port"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1433"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debeziumlogin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;enterStrongPasswordHere&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"database.server.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mssql-server"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"database.names"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"erp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"table.whitelist"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;        
        &lt;/span&gt;&lt;span class="nl"&gt;"database.history.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema-changes.mssql-server.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"schema.history.internal.kafka.topic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"schema-changes.inventory.mssql-server.tbltest"&lt;/span&gt;&lt;span class="w"&gt;              
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can observe that there are settings there concerning the Kafka setup to use, the details to connect to MSSQL, the name of the table that we want to pull changes from, and the Kafka topics that will be used to track these changes.&lt;/p&gt;

&lt;p&gt;Let’s deploy this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; http://localhost:8083/connectors/ &lt;span class="nt"&gt;-d&lt;/span&gt; @connector.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configure a target
&lt;/h2&gt;

&lt;p&gt;Let’s create a &lt;code&gt;destination-connector.json&lt;/code&gt; file as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedb-sink-tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"connector.class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.confluent.connect.jdbc.JdbcSinkConnector"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"tasks.max"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;       

        &lt;/span&gt;&lt;span class="nl"&gt;"connection.url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"jdbc:postgresql://host.docker.internal:5432/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;              
        &lt;/span&gt;&lt;span class="nl"&gt;"connection.user"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debezium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      
        &lt;/span&gt;&lt;span class="nl"&gt;"connection.password"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"debeziumpwdincratedb123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;               

        &lt;/span&gt;&lt;span class="nl"&gt;"topics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cratedbdemo.erp.dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; 
        &lt;/span&gt;&lt;span class="nl"&gt;"table.name.format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dbo.tbltest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"auto.create"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"auto.evolve"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"false"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="nl"&gt;"insert.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"upsert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pk.fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pk.mode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"record_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;      

        &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"unwrap"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;                                                 
        &lt;/span&gt;&lt;span class="nl"&gt;"transforms.unwrap.type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"io.debezium.transforms.ExtractNewRecordState"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We got details to connect to CrateDB, the name of table that will receive the changes (please note this is case sensitive), and some transform instructions to flatten the JSON data stored in the Kafka topic.&lt;/p&gt;

&lt;p&gt;Let’s deploy this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Accept:application/json"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type:application/json"&lt;/span&gt; http://localhost:8083/connectors/ &lt;span class="nt"&gt;-d&lt;/span&gt; @destination-connector.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;Let’s see this in action.&lt;/p&gt;

&lt;p&gt;Let’s create a record from the MSSQL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcsystem&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;@@&lt;/span&gt;&lt;span class="k"&gt;version&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now let’s go to CrateDB and check the table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As magic the record is there.&lt;/p&gt;

&lt;p&gt;Let’s now try an update from the MSSQL side:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;erp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dbo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tbltest&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;srcsystem&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Updated successfully'&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs04ve699lndjq45vg8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgs04ve699lndjq45vg8u.png" alt="Image description" width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using Debezium we can replicate changes from different database systems to CrateDB without having to develop any custom logic, we can then take advantage of CrateDB’s performance and features for our analytic workloads.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>database</category>
      <category>architecture</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
