<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohamed Hussain S</title>
    <description>The latest articles on Forem by Mohamed Hussain S (@mohhddhassan).</description>
    <link>https://forem.com/mohhddhassan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3228500%2Fc2e87a6e-70e3-4023-a3d3-5adddef072f2.jpeg</url>
      <title>Forem: Mohamed Hussain S</title>
      <link>https://forem.com/mohhddhassan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohhddhassan"/>
    <language>en</language>
    <item>
      <title>PostgreSQL Restore Failures: It Wasn’t pgBackRest, It Was My Recovery Logic</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 06 May 2026 12:24:28 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/postgresql-restore-failures-it-wasnt-pgbackrest-it-was-my-recovery-logic-527d</link>
      <guid>https://forem.com/mohhddhassan/postgresql-restore-failures-it-wasnt-pgbackrest-it-was-my-recovery-logic-527d</guid>
      <description>&lt;p&gt;I was building and testing a PostgreSQL backup and restore workflow using pgBackRest.&lt;/p&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;take backups&lt;/li&gt;
&lt;li&gt;restore them automatically&lt;/li&gt;
&lt;li&gt;validate the database&lt;/li&gt;
&lt;li&gt;make recovery predictable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead, I ended up repeatedly breaking PostgreSQL recovery itself.&lt;/p&gt;

&lt;p&gt;At one point, PostgreSQL refused to start entirely, the application depending on it failed to start, and I started seeing errors like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invalid checkpoint record
could not locate a valid checkpoint record at 0/DEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Later, I also hit timeline mismatch errors like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: [058]: target timeline 3 forked from backup timeline 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first, I thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;pgBackRest restores were corrupting PostgreSQL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That assumption turned out to be completely wrong.&lt;/p&gt;

&lt;p&gt;The real problem was the way I was handling recovery.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Was Building
&lt;/h2&gt;

&lt;p&gt;I was testing a PostgreSQL backup/restore flow locally after repeated restore failures elsewhere.&lt;/p&gt;

&lt;p&gt;To isolate the issue properly, I moved PostgreSQL onto my local machine and started testing the restore logic independently through API-triggered workflows.&lt;/p&gt;

&lt;p&gt;The restore flow looked roughly like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Download backup repo&lt;/li&gt;
&lt;li&gt;Stop PostgreSQL&lt;/li&gt;
&lt;li&gt;Restore backup&lt;/li&gt;
&lt;li&gt;Start PostgreSQL&lt;/li&gt;
&lt;li&gt;Validate database&lt;/li&gt;
&lt;/ol&gt;
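
&lt;p&gt;In shell terms, a minimal sketch of that flow (the service name, stanza name, and download step are placeholders, not my actual code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;download_backup_repo                 # custom step (placeholder)
sudo systemctl stop postgresql
sudo -u postgres pgbackrest --stanza=main restore
sudo systemctl start postgresql
psql -c 'SELECT 1;'                  # basic validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
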

&lt;p&gt;Sounds straightforward.&lt;/p&gt;

&lt;p&gt;It wasn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  The First Major Failure
&lt;/h2&gt;

&lt;p&gt;After a restore attempt, PostgreSQL refused to start.&lt;/p&gt;

&lt;p&gt;The logs looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LOG: database system was interrupted
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at 0/DEAD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At that point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL was down&lt;/li&gt;
&lt;li&gt;the application couldn't start&lt;/li&gt;
&lt;li&gt;authentication-related functionality stopped working&lt;/li&gt;
&lt;li&gt;and repeated restore attempts made things even worse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What confused me initially was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The restore itself appeared to complete.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But PostgreSQL would immediately enter recovery problems afterward.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Wrong Assumption
&lt;/h2&gt;

&lt;p&gt;This was the real issue.&lt;/p&gt;

&lt;p&gt;Every time recovery failed, I kept seeing files like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;backup_label&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;recovery.signal&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;standby.signal&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I assumed they were leftover artifacts from failed restores.&lt;/p&gt;

&lt;p&gt;My restore automation started aggressively cleaning them up.&lt;/p&gt;

&lt;p&gt;Something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; recovery.signal standby.signal backup_label
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I genuinely believed this was helping PostgreSQL start cleanly.&lt;/p&gt;

&lt;p&gt;In reality:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I was deleting the exact recovery metadata PostgreSQL needed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That misunderstanding caused almost every major issue afterward.&lt;/p&gt;




&lt;h2&gt;
  
  
  What PostgreSQL Was Actually Trying To Do
&lt;/h2&gt;

&lt;p&gt;This was the turning point.&lt;/p&gt;

&lt;p&gt;pgBackRest wasn't randomly writing junk files into the data directory.&lt;/p&gt;

&lt;p&gt;Those files exist for a reason.&lt;/p&gt;

&lt;p&gt;During restore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup_label&lt;/code&gt; tells PostgreSQL where recovery should begin&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recovery.signal&lt;/code&gt; tells PostgreSQL to enter recovery mode&lt;/li&gt;
&lt;li&gt;WAL replay reconstructs a consistent database state&lt;/li&gt;
&lt;/ul&gt;
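
&lt;p&gt;In a freshly restored data directory, that typically looks something like this (a simplified sketch for PostgreSQL 12+, not my actual listing):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$PGDATA/
  backup_label            # where WAL replay must begin
  recovery.signal         # "enter recovery mode on startup"
  postgresql.auto.conf    # restore_command and recovery settings written by pgBackRest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
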

&lt;p&gt;PostgreSQL was actually trying to perform a valid recovery process.&lt;/p&gt;

&lt;p&gt;My automation kept interrupting or invalidating it.&lt;/p&gt;

&lt;p&gt;Once I understood that, the entire problem started making sense.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Recovery Loop Problem
&lt;/h2&gt;

&lt;p&gt;Because my cleanup logic removed recovery metadata prematurely, PostgreSQL ended up in inconsistent states repeatedly.&lt;/p&gt;

&lt;p&gt;Sometimes it would:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;enter recovery mode&lt;/li&gt;
&lt;li&gt;fail WAL replay&lt;/li&gt;
&lt;li&gt;lose checkpoint continuity&lt;/li&gt;
&lt;li&gt;refuse startup entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Other times it would partially start, but remain stuck in recovery mode.&lt;/p&gt;

&lt;p&gt;That led to additional logic being added just to stabilize startup behavior.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_is_in_recovery&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and when required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pg_promote&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The goal wasn't to "force PostgreSQL to work".&lt;/p&gt;

&lt;p&gt;The goal was:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;let PostgreSQL finish recovery properly, then promote only when necessary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That distinction mattered a lot.&lt;/p&gt;
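
&lt;p&gt;As a rough illustration, the wait-then-promote logic can be sketched in shell like this (connection details omitted; assumes a local superuser):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# wait until PostgreSQL accepts connections
until pg_isready -q; do sleep 2; done

# promote only if the server is still stuck in recovery mode
if [ "$(psql -Atc 'SELECT pg_is_in_recovery();')" = "t" ]; then
  psql -c 'SELECT pg_promote();'
fi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
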




&lt;h2&gt;
  
  
  The Timeline Mismatch Error
&lt;/h2&gt;

&lt;p&gt;At one stage, I also hit this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ERROR: [058]: target timeline 3 forked from backup timeline 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This one was especially confusing at first.&lt;/p&gt;

&lt;p&gt;The issue was not just corrupted startup state anymore.&lt;/p&gt;

&lt;p&gt;Now PostgreSQL was rejecting WAL history itself.&lt;/p&gt;

&lt;p&gt;This happened because earlier restore attempts had already created inconsistent recovery timelines.&lt;/p&gt;

&lt;p&gt;I had essentially created multiple broken recovery histories while repeatedly testing and modifying the restore process.&lt;/p&gt;

&lt;p&gt;That was another important lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL backups are not just data files.&lt;br&gt;
They are tightly connected to WAL history and recovery timelines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, I realized I was no longer debugging a simple restore failure. I was debugging recovery history itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem In My Restore Flow
&lt;/h2&gt;

&lt;p&gt;Initially, my restore logic tried to "fix" PostgreSQL after restore.&lt;/p&gt;

&lt;p&gt;That approach was fundamentally flawed.&lt;/p&gt;

&lt;p&gt;The older flow looked roughly like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Old Approach&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Delta restore&lt;/td&gt;
&lt;td&gt;Mixed old/new recovery state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete &lt;code&gt;backup_label&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Broke recovery metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delete &lt;code&gt;recovery.signal&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Interrupted recovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Force archive changes&lt;/td&gt;
&lt;td&gt;Caused WAL continuity issues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hope PostgreSQL starts&lt;/td&gt;
&lt;td&gt;No validation or recovery awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I was treating recovery artifacts like corruption.&lt;/p&gt;

&lt;p&gt;They weren't corruption.&lt;/p&gt;

&lt;p&gt;They were part of PostgreSQL recovery itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Change That Finally Fixed It
&lt;/h2&gt;

&lt;p&gt;The biggest realization was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Stop fighting PostgreSQL recovery.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of trying to manually "clean up" PostgreSQL after restore, I changed the restore flow completely.&lt;/p&gt;

&lt;p&gt;The corrected restore flow became:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Stop PostgreSQL cleanly&lt;/li&gt;
&lt;li&gt;Completely empty the data directory&lt;/li&gt;
&lt;li&gt;Run pgBackRest restore properly&lt;/li&gt;
&lt;li&gt;Let PostgreSQL recover normally&lt;/li&gt;
&lt;li&gt;Wait for readiness&lt;/li&gt;
&lt;li&gt;Promote only if recovery mode persists&lt;/li&gt;
&lt;li&gt;Validate using pgBackRest check&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The critical change was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_run_pgbackrest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;restore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--type=immediate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And equally important:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_empty_directory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pg_data_dir&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of attempting partial or delta-style recovery cleanup, the restore process now starts from a completely clean data directory.&lt;/p&gt;

&lt;p&gt;That eliminated a huge amount of inconsistent state.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;--type=immediate&lt;/code&gt; Helped
&lt;/h2&gt;

&lt;p&gt;This turned out to be extremely important.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--type=immediate&lt;/code&gt; tells pgBackRest:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;end recovery as soon as a consistent state is reached (the end of the backup), rather than replaying all available WAL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL could perform proper WAL-based recovery&lt;/li&gt;
&lt;li&gt;recovery metadata stayed intact&lt;/li&gt;
&lt;li&gt;WAL replay remained valid&lt;/li&gt;
&lt;li&gt;timeline handling became predictable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL itself was finally allowed to control recovery correctly.&lt;/p&gt;
&lt;/blockquote&gt;
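
&lt;p&gt;For reference, the standalone CLI equivalent of that restore call looks like this (the stanza name is a placeholder):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sudo -u postgres pgbackrest --stanza=main restore --type=immediate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
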




&lt;h2&gt;
  
  
  The Mistake That Increased the Blast Radius
&lt;/h2&gt;

&lt;p&gt;One thing I learned the hard way:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Never test restore automation against a database actively used by an application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even though this was a testing workflow, the PostgreSQL instance was still tied to application startup behavior.&lt;/p&gt;

&lt;p&gt;So whenever PostgreSQL failed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;application startup failed too&lt;/li&gt;
&lt;li&gt;user-related functionality broke&lt;/li&gt;
&lt;li&gt;debugging became much harder under pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After repeated failures, I moved the restore testing flow entirely onto my local machine and isolated PostgreSQL from the rest of the application stack.&lt;/p&gt;

&lt;p&gt;That made debugging significantly easier.&lt;/p&gt;




&lt;h2&gt;
  
  
  Another Subtle Issue: Backup Failures After Restore
&lt;/h2&gt;

&lt;p&gt;I also ran into another confusing problem after some restore attempts.&lt;/p&gt;

&lt;p&gt;In certain cases, subsequent backups started failing unexpectedly after a restore.&lt;/p&gt;

&lt;p&gt;Part of the issue came from mixing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restore operations&lt;/li&gt;
&lt;li&gt;delta-style restore assumptions&lt;/li&gt;
&lt;li&gt;and archive/WAL state inconsistencies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At one stage, I was also toggling archive-related behavior incorrectly during recovery experiments, which further complicated WAL continuity.&lt;/p&gt;

&lt;p&gt;This reinforced another important realization:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL backups are tightly coupled with WAL history and recovery timelines.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even when the database appears to start correctly, inconsistent recovery state can break future backup behavior in subtle ways.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned From This
&lt;/h2&gt;

&lt;p&gt;This experience completely changed how I think about PostgreSQL recovery.&lt;/p&gt;

&lt;p&gt;Some major lessons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;backup_label&lt;/code&gt; and &lt;code&gt;recovery.signal&lt;/code&gt; are not garbage files&lt;/li&gt;
&lt;li&gt;PostgreSQL recovery is heavily WAL-dependent&lt;/li&gt;
&lt;li&gt;Timelines matter more than most people realize&lt;/li&gt;
&lt;li&gt;Partial cleanup creates inconsistent recovery states&lt;/li&gt;
&lt;li&gt;A clean restore is often safer than trying to "repair" recovery manually&lt;/li&gt;
&lt;li&gt;pgBackRest already knows how to orchestrate PostgreSQL recovery properly&lt;/li&gt;
&lt;li&gt;Restore validation matters as much as backup creation&lt;/li&gt;
&lt;li&gt;Backup testing should happen in isolated environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PostgreSQL recovery is not something you should "fight".&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I stopped trying to override recovery behavior manually and instead allowed PostgreSQL + pgBackRest to handle recovery the way they were designed to, the restore flow finally became stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Final Restore Flow That Actually Worked
&lt;/h2&gt;

&lt;p&gt;After multiple failed recovery attempts, timeline mismatches, and broken startup states, I stopped trying to manually "fix" PostgreSQL recovery and instead simplified the restore process completely.&lt;/p&gt;

&lt;p&gt;The final stable flow looked roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# simplified restore flow
&lt;/span&gt;
&lt;span class="nf"&gt;stop_postgres&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;empty_data_directory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;pgbackrest_restore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--type=immediate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;start_postgres&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;wait_for_connection&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;postgres_is_in_recovery&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;promote_postgres&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;pgbackrest_check&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important part here is not the code itself.&lt;/p&gt;

&lt;p&gt;It's the recovery philosophy behind it.&lt;/p&gt;

&lt;p&gt;The earlier versions of my restore logic tried to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;partially clean recovery state&lt;/li&gt;
&lt;li&gt;remove recovery metadata&lt;/li&gt;
&lt;li&gt;force PostgreSQL out of recovery&lt;/li&gt;
&lt;li&gt;preserve old data directory state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That approach kept creating inconsistent recovery conditions.&lt;/p&gt;

&lt;p&gt;The corrected flow instead does three important things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;starts from a completely clean data directory&lt;/li&gt;
&lt;li&gt;lets pgBackRest manage recovery metadata properly&lt;/li&gt;
&lt;li&gt;allows PostgreSQL to perform WAL recovery the way it was designed to&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest change was no longer treating files like &lt;code&gt;backup_label&lt;/code&gt; or &lt;code&gt;recovery.signal&lt;/code&gt; as corruption artifacts.&lt;/p&gt;

&lt;p&gt;They were part of the recovery process itself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;At the beginning, I thought PostgreSQL restores were failing because the database was corrupted.&lt;/p&gt;

&lt;p&gt;In reality, the corruption was coming from my own recovery assumptions.&lt;/p&gt;

&lt;p&gt;The system wasn't broken.&lt;/p&gt;

&lt;p&gt;My mental model of PostgreSQL recovery was.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>arrayJoin in ClickHouse: Why Your Rows Are Duplicating (and How to Control It)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:20:11 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/arrayjoin-in-clickhouse-why-your-rows-are-duplicating-and-how-to-control-it-5862</link>
      <guid>https://forem.com/mohhddhassan/arrayjoin-in-clickhouse-why-your-rows-are-duplicating-and-how-to-control-it-5862</guid>
      <description>&lt;p&gt;When working with arrays in ClickHouse, &lt;code&gt;arrayJoin&lt;/code&gt; feels straightforward.&lt;/p&gt;

&lt;p&gt;Until your query suddenly returns far more rows than expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Use Case
&lt;/h2&gt;

&lt;p&gt;Let’s say you have a table like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="n"&gt;UInt32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;actions&lt;/span&gt; &lt;span class="n"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ENGINE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MergeTree&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;user_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;span class="na"&gt;actions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;click'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;scroll'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;purchase'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you want each action as a separate row.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Tool: &lt;code&gt;arrayJoin&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;arrayJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1   click
1   scroll
1   purchase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far, everything looks correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Things Go Wrong
&lt;/h2&gt;

&lt;p&gt;Now let’s say you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;arrayJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;arrayJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;action2&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You might expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But you actually get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;9 rows&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;arrayJoin&lt;/code&gt; doesn’t just flatten arrays.&lt;/p&gt;

&lt;p&gt;It expands rows.&lt;/p&gt;

&lt;p&gt;Each element in the array creates a new row.&lt;/p&gt;

&lt;p&gt;So when you use it multiple times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First &lt;code&gt;arrayJoin&lt;/code&gt; → expands rows&lt;/li&gt;
&lt;li&gt;Second &lt;code&gt;arrayJoin&lt;/code&gt; → expands again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;p&gt;3 elements → 3 × 3 = 9 rows&lt;/p&gt;

&lt;p&gt;This is effectively a Cartesian product of the two expansions.&lt;/p&gt;
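
&lt;p&gt;Concretely, the double-&lt;code&gt;arrayJoin&lt;/code&gt; query above returns something like this (row order may vary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1   click      click
1   click      scroll
1   click      purchase
1   scroll     click
1   scroll     scroll
1   scroll     purchase
1   purchase   click
1   purchase   scroll
1   purchase   purchase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
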




&lt;h2&gt;
  
  
  The Hidden Impact
&lt;/h2&gt;

&lt;p&gt;This becomes a real problem when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arrays are large&lt;/li&gt;
&lt;li&gt;Multiple &lt;code&gt;arrayJoin&lt;/code&gt;s are used&lt;/li&gt;
&lt;li&gt;You don’t expect row multiplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect output&lt;/li&gt;
&lt;li&gt;Sudden increase in row count&lt;/li&gt;
&lt;li&gt;Slower queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Better Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use a single &lt;code&gt;arrayJoin&lt;/code&gt; when possible
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;arrayJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  2. Use &lt;code&gt;ARRAY JOIN&lt;/code&gt; syntax (cleaner and explicit)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="n"&gt;ARRAY&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;actions&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  3. Use &lt;code&gt;arrayZip&lt;/code&gt; to avoid unintended multiplication
&lt;/h3&gt;

&lt;p&gt;If you’re working with multiple arrays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="n"&gt;arrayJoin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;arrayZip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actions&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;zipped&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures elements are paired instead of multiplied.&lt;/p&gt;
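
&lt;p&gt;The example above zips &lt;code&gt;actions&lt;/code&gt; with itself just to show the shape. With a second, equal-length array column - say a hypothetical &lt;code&gt;timestamps Array(DateTime)&lt;/code&gt; - the paired version would look roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT user_id,
       zipped.1 AS action,
       zipped.2 AS ts
FROM events
ARRAY JOIN arrayZip(actions, timestamps) AS zipped;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;One expansion, three rows, and each action sits next to its own timestamp instead of every combination. Note that &lt;code&gt;arrayZip&lt;/code&gt; requires arrays of equal length.&lt;/p&gt;
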




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;arrayJoin&lt;/code&gt; is powerful - but easy to misuse.&lt;/p&gt;

&lt;p&gt;If used without understanding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row count can explode&lt;/li&gt;
&lt;li&gt;Queries become expensive&lt;/li&gt;
&lt;li&gt;Results can be misleading&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Event tracking pipelines&lt;/li&gt;
&lt;li&gt;Flattening nested JSON&lt;/li&gt;
&lt;li&gt;Working with semi-structured logs&lt;/li&gt;
&lt;li&gt;Exploding arrays into rows for analysis&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One Important Gotcha
&lt;/h2&gt;

&lt;p&gt;Every &lt;code&gt;arrayJoin&lt;/code&gt; multiplies rows.&lt;/p&gt;

&lt;p&gt;If your result size looks unexpectedly large, this is one of the first things to check.&lt;/p&gt;
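
&lt;p&gt;A quick sanity check, assuming the same &lt;code&gt;events&lt;/code&gt; table as above, is to compare the base row count with the expanded one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT
    (SELECT count() FROM events) AS base_rows,
    count() AS expanded_rows
FROM (SELECT arrayJoin(actions) FROM events);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If &lt;code&gt;expanded_rows&lt;/code&gt; is far larger than you expected, an unintended expansion is the likely culprit.&lt;/p&gt;
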




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;arrayJoin&lt;/code&gt; is one of the most useful tools in ClickHouse.&lt;/p&gt;

&lt;p&gt;But its behavior is not always intuitive.&lt;/p&gt;

&lt;p&gt;In many cases, the issue is not the data itself - but how the query expands it.&lt;/p&gt;

&lt;p&gt;Understanding this early can save a lot of debugging time.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>sql</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>greatCircleDistance in ClickHouse: Avoiding Full Table Scans</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Mon, 20 Apr 2026 16:18:23 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/greatcircledistance-in-clickhouse-avoiding-full-table-scans-375p</link>
      <guid>https://forem.com/mohhddhassan/greatcircledistance-in-clickhouse-avoiding-full-table-scans-375p</guid>
      <description>&lt;p&gt;When working with location data, one problem shows up almost immediately:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“How do I calculate the distance between two coordinates stored in my database?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At first, it seems like something you’d have to handle outside the database.&lt;/p&gt;

&lt;p&gt;But if you're using ClickHouse, there’s a built-in function for this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Right Tool: &lt;code&gt;greatCircleDistance&lt;/code&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;greatCircleDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lat2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It calculates the shortest distance between two points on Earth. Note the argument order: ClickHouse expects longitude first, then latitude.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;greatCircleDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0827&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2707&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;9716&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;5946&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;distance_meters&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you the distance between Chennai and Bangalore - in &lt;strong&gt;meters&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Looks Simple… But There’s a Catch
&lt;/h2&gt;

&lt;p&gt;Now let’s say you write a query like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;greatCircleDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0827&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2707&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At first glance, this looks perfectly fine.&lt;/p&gt;

&lt;p&gt;But this can quietly turn into a full table scan - especially on large datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Happens
&lt;/h2&gt;

&lt;p&gt;In ClickHouse, indexes don’t work like traditional B-tree indexes.&lt;/p&gt;

&lt;p&gt;They are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sparse&lt;/li&gt;
&lt;li&gt;Designed for &lt;strong&gt;range pruning&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They work well for queries like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But not for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;greatCircleDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The function is applied to the columns, so ClickHouse cannot use the sparse index to skip data efficiently.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Better Approach (What You Should Actually Do)
&lt;/h2&gt;

&lt;p&gt;Instead of directly applying the function, &lt;strong&gt;reduce the dataset first&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bounding Box Filter
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;locations&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;lat&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0827&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0827&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;lon&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2707&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2707&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;greatCircleDistance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0827&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2707&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;5000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;(The bounding box is an approximation to reduce the search space before exact filtering.)&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;lat BETWEEN&lt;/code&gt; → uses index&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;lon BETWEEN&lt;/code&gt; → reduces rows further&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;greatCircleDistance&lt;/code&gt; → applied only on filtered data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So instead of scanning the entire table, you narrow the rows down first, then compute the exact distance only on what remains.&lt;/p&gt;
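
&lt;p&gt;If you want the box derived from the radius instead of hard-coding &lt;code&gt;0.05&lt;/code&gt;, here is a rough sketch (one degree of latitude is roughly 111.32 km, and a degree of longitude shrinks by the cosine of the latitude):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;WITH
    5000 AS radius_m,
    radius_m / 111320.0 AS dlat,                           -- degrees of latitude for the radius
    radius_m / (111320.0 * cos(radians(13.0827))) AS dlon  -- degrees of longitude for the radius
SELECT city
FROM locations
WHERE lat BETWEEN 13.0827 - dlat AND 13.0827 + dlat
  AND lon BETWEEN 80.2707 - dlon AND 80.2707 + dlon
  AND greatCircleDistance(lon, lat, 80.2707, 13.0827) &amp;lt; radius_m;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
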




&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;This pattern is useful in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivery radius filtering&lt;/li&gt;
&lt;li&gt;Finding nearby users&lt;/li&gt;
&lt;li&gt;Geo-based analytics&lt;/li&gt;
&lt;li&gt;Ride-sharing systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  One Important Gotcha
&lt;/h2&gt;

&lt;p&gt;Make sure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coordinates are in &lt;strong&gt;degrees (not radians)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;The order is &lt;strong&gt;longitude first&lt;/strong&gt;: &lt;code&gt;(lon1, lat1, lon2, lat2)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Swapping them will give incorrect results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;greatCircleDistance&lt;/code&gt; is powerful - but if used blindly, it can hurt performance.&lt;/p&gt;

&lt;p&gt;In ClickHouse, performance often depends more on how you query than what you query.&lt;/p&gt;

&lt;p&gt;Sometimes, the right approach isn’t just using a function - but knowing &lt;strong&gt;when and how to use it efficiently&lt;/strong&gt;.&lt;/p&gt;




</description>
      <category>clickhouse</category>
      <category>sql</category>
      <category>dataengineering</category>
      <category>performance</category>
    </item>
    <item>
      <title>Why My S3 Backup Setup Broke: Buckets, “Folders”, and Scheduling Misconceptions</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:04:20 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/why-my-s3-backup-setup-broke-buckets-folders-and-scheduling-misconceptions-315e</link>
      <guid>https://forem.com/mohhddhassan/why-my-s3-backup-setup-broke-buckets-folders-and-scheduling-misconceptions-315e</guid>
      <description>&lt;p&gt;&lt;em&gt;Another lesson in building reliable systems - not just configuring them.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I thought I had everything set up correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups configured&lt;/li&gt;
&lt;li&gt;S3-compatible storage connected&lt;/li&gt;
&lt;li&gt;Backup triggered via cron jobs during testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet nothing showed up where I expected.&lt;/p&gt;

&lt;p&gt;What looked like a simple configuration issue turned out to be a wrong mental model of how S3 actually works.&lt;/p&gt;

&lt;p&gt;This post breaks down what went wrong and what fixed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An S3-compatible object storage (not AWS directly)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A system that allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuring a bucket&lt;/li&gt;
&lt;li&gt;Setting a backup path&lt;/li&gt;
&lt;li&gt;Defining backup frequency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Everything seemed straightforward.&lt;/p&gt;

&lt;p&gt;But the problem started with one assumption:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Buckets can behave like folders.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The First Mistake: Treating Buckets Like Folders
&lt;/h2&gt;

&lt;p&gt;In a traditional file system, you think like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;backups/
  app1/
    db.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So it felt natural to assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a “folder” in object storage&lt;/li&gt;
&lt;li&gt;Then create buckets inside it for different use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8owl9w7awnhyhx94611.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8owl9w7awnhyhx94611.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case, I had something like a folder already created in the object storage UI, and I assumed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;That is my base, and I can create buckets under it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting to that “folder” as a bucket&lt;/li&gt;
&lt;li&gt;Then creating another bucket inside it (for vector DB backups)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kept failing.&lt;/p&gt;

&lt;p&gt;At first, I thought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maybe it is a permission issue&lt;/li&gt;
&lt;li&gt;Maybe my user does not have enough access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that was not the real problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Actually Going Wrong
&lt;/h2&gt;

&lt;p&gt;I was effectively trying to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat a bucket like a parent directory&lt;/li&gt;
&lt;li&gt;And create another bucket inside it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not how S3 works.&lt;/p&gt;

&lt;p&gt;In S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buckets are top-level containers&lt;/li&gt;
&lt;li&gt;You cannot nest buckets inside other buckets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when I tried to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to an existing bucket&lt;/li&gt;
&lt;li&gt;And then create another bucket under it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It failed because the concept itself is invalid.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Correct Mental Model
&lt;/h2&gt;

&lt;p&gt;This is how S3 actually works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backups&lt;/span&gt;
&lt;span class="na"&gt;object key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app1/2026-04-15/db.sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are only two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket (top-level)&lt;/li&gt;
&lt;li&gt;Object key (full path as a string)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no real folder hierarchy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Organizing Data the Right Way
&lt;/h2&gt;

&lt;p&gt;The fix was not about creating folders.&lt;/p&gt;

&lt;p&gt;It was about changing how I name objects.&lt;/p&gt;

&lt;p&gt;Instead of trying to structure things at the bucket level, I moved that structure into the object key.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;object_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives a structure like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backups&lt;/span&gt;

&lt;span class="s"&gt;qdrant/&lt;/span&gt;
  &lt;span class="s"&gt;collection_1/&lt;/span&gt;
    &lt;span class="s"&gt;snapshot_001&lt;/span&gt;
  &lt;span class="s"&gt;collection_2/&lt;/span&gt;
    &lt;span class="s"&gt;snapshot_002&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though S3 is flat internally, most UIs render this as a folder-like structure.&lt;/p&gt;

&lt;p&gt;This is the correct way to organize data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Second Mistake: Mixing Bucket and Path
&lt;/h2&gt;

&lt;p&gt;Another issue was passing paths as part of the bucket name.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;backups/qdrant&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is invalid.&lt;/p&gt;

&lt;p&gt;Correct approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bucket = backups&lt;/li&gt;
&lt;li&gt;object key = qdrant/collection_name/snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 APIs expect a valid bucket name, not a path.&lt;/p&gt;
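
&lt;p&gt;Here is a minimal sketch of that separation using boto3 (the endpoint URL is a placeholder; any S3-compatible client behaves the same way):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

# endpoint_url points at the S3-compatible storage, not AWS itself
s3 = boto3.client("s3", endpoint_url="https://storage.example.com")

s3.upload_file(
    Filename="snapshot_001",
    Bucket="backups",                        # bucket name only - no slashes
    Key="qdrant/collection_1/snapshot_001",  # the "folders" live in the key
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
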




&lt;h2&gt;
  
  
  What Finally Clicked
&lt;/h2&gt;

&lt;p&gt;The breakthrough was realizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I was not dealing with folders at all&lt;/li&gt;
&lt;li&gt;I was dealing with string prefixes inside object keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I stopped trying to create hierarchy at the bucket level and moved everything into object naming, the entire setup started working as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Correct configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Object naming:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone was enough to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organize backups cleanly&lt;/li&gt;
&lt;li&gt;Avoid bucket-related errors&lt;/li&gt;
&lt;li&gt;Make the storage layout intuitive in the UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Buckets are not folders&lt;/li&gt;
&lt;li&gt;You cannot create a bucket inside another bucket&lt;/li&gt;
&lt;li&gt;S3 is a flat object store&lt;/li&gt;
&lt;li&gt;Folder-like structures come from object key prefixes&lt;/li&gt;
&lt;li&gt;Always keep bucket and path separate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The issue was not with permissions or configuration.&lt;/p&gt;

&lt;p&gt;It was a mismatch between how I expected storage to behave and how it actually works.&lt;/p&gt;

&lt;p&gt;Once the mental model changed, the implementation became simple.&lt;/p&gt;

&lt;p&gt;If something feels unnecessarily complicated in S3, it is often a sign that the model being used is incorrect.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>cloudstorage</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Debugging a Broken Metrics Pipeline: What Actually Went Wrong</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 03:29:50 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o</link>
      <guid>https://forem.com/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 4 of a series on building a metrics pipeline into ClickHouse&lt;br&gt;
Read &lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 3: Understanding Vector Transforms&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  When Things Still Don’t Work
&lt;/h2&gt;

&lt;p&gt;At this point, the pipeline looked correct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sources were defined&lt;/li&gt;
&lt;li&gt;Transforms were working&lt;/li&gt;
&lt;li&gt;Data structure matched expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet, something was still off.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data wasn’t behaving the way it should.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where debugging became the main task.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Only Way Forward: Logs
&lt;/h2&gt;

&lt;p&gt;When dealing with ingestion issues in ClickHouse, logs become your best source of truth.&lt;/p&gt;

&lt;p&gt;I started monitoring the error logs directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/clickhouse-server/clickhouse-server.err.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This immediately surfaced issues that were not visible from the pipeline configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Error That Didn’t Make Sense
&lt;/h2&gt;

&lt;p&gt;At one point, I started seeing this error repeatedly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;There exists no table monitoring.cpu in database monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was confusing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I hadn’t created a table named &lt;code&gt;cpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It wasn’t part of my current setup&lt;/li&gt;
&lt;li&gt;My Vector configuration didn’t reference it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So where was it coming from?&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Actually Happening
&lt;/h2&gt;

&lt;p&gt;After digging deeper, the issue had nothing to do with my current pipeline.&lt;/p&gt;

&lt;p&gt;It turned out that a previously used Telegraf process was still running in the background.&lt;/p&gt;

&lt;p&gt;Even though I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed configurations&lt;/li&gt;
&lt;li&gt;Switched tools&lt;/li&gt;
&lt;li&gt;Rebuilt the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The old process was still active and sending data using an outdated setup.&lt;/p&gt;

&lt;p&gt;That’s why ClickHouse was reporting errors for a table I never intended to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;This wasn’t a configuration issue.&lt;/p&gt;

&lt;p&gt;It was a &lt;strong&gt;runtime issue&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system I was debugging was not the only system running.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That realization changed how I approached debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing It
&lt;/h2&gt;

&lt;p&gt;The solution was simple - but easy to miss.&lt;/p&gt;

&lt;p&gt;First, I checked for any running Telegraf processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps aux | &lt;span class="nb"&gt;grep &lt;/span&gt;telegraf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then stopped them explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop telegraf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the old process was stopped, the errors disappeared.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Teaches
&lt;/h2&gt;

&lt;p&gt;This led to an important lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always validate the runtime environment - not just the configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When working with pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old processes may still be running&lt;/li&gt;
&lt;li&gt;Multiple agents may write to the same destination&lt;/li&gt;
&lt;li&gt;Previous setups can interfere with new ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t account for this, you may end up debugging the wrong problem.&lt;/p&gt;
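
&lt;p&gt;One way to catch this early, assuming query logging is enabled (it is by default), is to ask ClickHouse which clients have been inserting recently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT query_start_time, address, user, query
FROM system.query_log
WHERE type = 'QueryStart'
  AND query ILIKE 'INSERT INTO monitoring.%'
ORDER BY query_start_time DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
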




&lt;h2&gt;
  
  
  The Debugging Loop
&lt;/h2&gt;

&lt;p&gt;Most of the pipeline development ended up looking like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write → Run → Fail → Check logs → Fix → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each iteration helped refine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transform logic&lt;/li&gt;
&lt;li&gt;Data structure&lt;/li&gt;
&lt;li&gt;Schema alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This loop is where real progress happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Finally Worked
&lt;/h2&gt;

&lt;p&gt;Once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transforms were correct&lt;/li&gt;
&lt;li&gt;Timestamps were fixed&lt;/li&gt;
&lt;li&gt;Old processes were stopped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline stabilized.&lt;/p&gt;

&lt;p&gt;Data started flowing consistently into ClickHouse, and queries returned expected results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;p&gt;This series covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 4: Debugging and making the pipeline reliable (this post)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Building data pipelines is rarely about getting things right on the first try.&lt;/p&gt;

&lt;p&gt;It’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observing how the system behaves&lt;/li&gt;
&lt;li&gt;Identifying where it breaks&lt;/li&gt;
&lt;li&gt;Iterating until it stabilizes&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Debugging is not a side task - it is the process.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>vector</category>
      <category>debugging</category>
    </item>
    <item>
      <title>From Pipelines to Transforms: Making Vector Work with ClickHouse</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 01:13:59 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho</link>
      <guid>https://forem.com/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of a series on building a metrics pipeline into ClickHouse&lt;br&gt;
Read &lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector Pipelines&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Where Things Got Real
&lt;/h2&gt;

&lt;p&gt;By this point, the pipeline structure made sense.&lt;/p&gt;

&lt;p&gt;I understood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sources&lt;/li&gt;
&lt;li&gt;Transforms&lt;/li&gt;
&lt;li&gt;Sinks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the pipeline still wasn’t working reliably.&lt;/p&gt;

&lt;p&gt;That’s when it became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The hardest part wasn’t collecting data.&lt;br&gt;
It was transforming it correctly.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Why Transforms Matter
&lt;/h2&gt;

&lt;p&gt;Raw metrics are rarely usable as-is.&lt;/p&gt;

&lt;p&gt;When sending data into ClickHouse, even small inconsistencies can break ingestion.&lt;/p&gt;

&lt;p&gt;Some common issues I encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong data types&lt;/li&gt;
&lt;li&gt;Unexpected field structures&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Incorrect timestamp formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if everything else is correct, these issues cause inserts to fail.&lt;/p&gt;


&lt;h2&gt;
  
  
  Enter VRL
&lt;/h2&gt;

&lt;p&gt;In Vector, transformations are written using &lt;strong&gt;Vector Remap Language&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At first, VRL feels simple.&lt;/p&gt;

&lt;p&gt;But in practice, it’s strict.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types must be explicit&lt;/li&gt;
&lt;li&gt;Fields must be handled carefully&lt;/li&gt;
&lt;li&gt;Errors are not ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That strictness is what makes pipelines reliable - but also harder to get right.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Timestamp Problem
&lt;/h2&gt;

&lt;p&gt;One of the biggest issues I faced was timestamp handling.&lt;/p&gt;

&lt;p&gt;ClickHouse expects timestamps in a specific format.&lt;/p&gt;

&lt;p&gt;The raw data didn’t match that format.&lt;/p&gt;

&lt;p&gt;Even when everything else was correct, inserts would fail silently because of this.&lt;/p&gt;

&lt;p&gt;The fix looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.timestamp = to_unix_timestamp!(parse_timestamp!(.timestamp, "%+"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line did three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsed the incoming timestamp&lt;/li&gt;
&lt;li&gt;Converted it to a Unix timestamp&lt;/li&gt;
&lt;li&gt;Made it compatible with ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It seems simple - but this was a major blocker.&lt;/p&gt;
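
&lt;p&gt;The result is easy to sanity-check from the ClickHouse side, since &lt;code&gt;toDateTime&lt;/code&gt; accepts exactly the kind of Unix value the transform produces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- A Unix timestamp (in seconds) converts cleanly into a DateTime value
SELECT toDateTime(1745700000) AS ts;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;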




&lt;h2&gt;
  
  
  Normalizing Metrics
&lt;/h2&gt;

&lt;p&gt;Another challenge was aligning the data structure with what ClickHouse expects.&lt;/p&gt;

&lt;p&gt;For both host and GPU metrics, this required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converting values into numeric types&lt;/li&gt;
&lt;li&gt;Standardizing field names&lt;/li&gt;
&lt;li&gt;Adding metadata like host and source&lt;/li&gt;
&lt;li&gt;Ensuring consistent structure across all metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this step, ingestion would fail even if the pipeline looked correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Raw Data to Queryable Format
&lt;/h2&gt;

&lt;p&gt;One important transformation was changing how metrics were structured.&lt;/p&gt;

&lt;p&gt;Instead of storing multiple values in a single record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cpu, memory, disk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data was reshaped into a row-based format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metric_name = "cpu", value = ...
metric_name = "memory", value = ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made it easier to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query data in ClickHouse&lt;/li&gt;
&lt;li&gt;Aggregate metrics&lt;/li&gt;
&lt;li&gt;Maintain a consistent schema&lt;/li&gt;
&lt;/ul&gt;
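
&lt;p&gt;For concreteness, a ClickHouse table matching this shape might look like the following sketch (the names are illustrative, not the exact schema I used):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One row per metric sample, instead of one wide row per snapshot
CREATE TABLE metrics (
    timestamp   DateTime,
    host        String,
    source      String,
    metric_name String,
    value       Float64
) ENGINE = MergeTree
ORDER BY (metric_name, host, timestamp);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;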




&lt;h2&gt;
  
  
  Why This Was Hard
&lt;/h2&gt;

&lt;p&gt;Most of the time spent on this pipeline wasn’t on setup.&lt;/p&gt;

&lt;p&gt;It was here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write transform → Run → Fail → Fix → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each iteration revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A type mismatch&lt;/li&gt;
&lt;li&gt;A missing field&lt;/li&gt;
&lt;li&gt;A formatting issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the pipeline actually gets built.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed After This
&lt;/h2&gt;

&lt;p&gt;Once the transforms were correct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data started flowing reliably&lt;/li&gt;
&lt;li&gt;Inserts into ClickHouse succeeded&lt;/li&gt;
&lt;li&gt;Queries started returning meaningful results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the pipeline finally felt stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Even after fixing transformations, one major challenge remained:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Debugging unexpected failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the next part, I’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How I debugged pipeline issues&lt;/li&gt;
&lt;li&gt;What ClickHouse logs revealed&lt;/li&gt;
&lt;li&gt;And a mistake that cost me time&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 3: Writing transforms and handling data correctly (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Transforms are where pipelines either succeed or fail.&lt;/p&gt;

&lt;p&gt;Understanding how data needs to be shaped is more important than the tool itself.&lt;/p&gt;

&lt;p&gt;Once that becomes clear, everything else starts to fall into place.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>observability</category>
      <category>vector</category>
    </item>
    <item>
      <title>Understanding Vector Pipelines: From Config Files to Data Flow</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 09 Apr 2026 19:29:55 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k</link>
      <guid>https://forem.com/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a series on building a metrics pipeline into ClickHouse&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Read Part 1: Why my metrics pipeline with Telegraf didn’t work&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Picking Up Where Things Broke
&lt;/h2&gt;

&lt;p&gt;In the previous part, I talked about trying to build a metrics pipeline using Telegraf - and why that approach didn’t work for my use case.&lt;/p&gt;

&lt;p&gt;The biggest issue wasn’t just tooling.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I didn’t have enough control over how data moved through the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s what led me to explore a different approach.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Vector
&lt;/h2&gt;

&lt;p&gt;I came across Vector while looking for something more flexible.&lt;/p&gt;

&lt;p&gt;At a glance, it felt different.&lt;/p&gt;

&lt;p&gt;Instead of thinking in terms of plugins and configs, Vector is built around a &lt;strong&gt;pipeline model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that changes everything.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Idea: Pipelines
&lt;/h2&gt;

&lt;p&gt;At the center of Vector is a simple concept:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sources → Transforms → Sinks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;But this model makes the flow of data explicit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sources&lt;/strong&gt; → where data comes from&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforms&lt;/strong&gt; → how data is modified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sinks&lt;/strong&gt; → where data is sent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to my earlier approach, this immediately felt clearer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;Instead of writing a config and hoping everything connects correctly, you define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What data you are collecting&lt;/li&gt;
&lt;li&gt;How that data should be shaped&lt;/li&gt;
&lt;li&gt;Where that data should go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That shift sounds small - but it changes how you think about the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Config Files to Data Flow
&lt;/h2&gt;

&lt;p&gt;With Telegraf, my thinking looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write config → Run → Debug errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Vector, it started becoming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collect → Transform → Route → Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The focus moved from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What config do I write?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“How does data move through each stage?”&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The New Learning Curve
&lt;/h2&gt;

&lt;p&gt;Of course, switching tools didn’t magically solve everything.&lt;/p&gt;

&lt;p&gt;There were new challenges.&lt;/p&gt;

&lt;p&gt;Vector uses YAML for configuration, which was different from the TOML I was used to.&lt;/p&gt;

&lt;p&gt;And more importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pipeline only works if every stage is defined correctly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the early issues I ran into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect source definitions&lt;/li&gt;
&lt;li&gt;Misconfigured sinks&lt;/li&gt;
&lt;li&gt;Data not flowing as expected&lt;/li&gt;
&lt;li&gt;Silent failures when something didn’t connect properly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At times, it felt like nothing was happening - even though everything looked “correct.”&lt;/p&gt;




&lt;h2&gt;
  
  
  First Realization: Everything Is Connected
&lt;/h2&gt;

&lt;p&gt;One important thing I learned quickly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If one stage breaks, the entire pipeline breaks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike simpler setups, you can’t treat components independently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bad transform can stop data entirely&lt;/li&gt;
&lt;li&gt;A misconfigured sink can drop everything silently&lt;/li&gt;
&lt;li&gt;A source that doesn’t emit correctly makes debugging harder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This forced me to start thinking in terms of &lt;strong&gt;end-to-end flow&lt;/strong&gt;, not individual pieces.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Improved Immediately
&lt;/h2&gt;

&lt;p&gt;Despite the challenges, a few things became better compared to before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visibility into how data moves&lt;/li&gt;
&lt;li&gt;Better control over transformations&lt;/li&gt;
&lt;li&gt;More flexibility in shaping data before sending it to ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though things weren’t fully working yet, I finally felt like I was closer to solving the actual problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Still Missing
&lt;/h2&gt;

&lt;p&gt;At this stage, the pipeline structure made sense.&lt;/p&gt;

&lt;p&gt;But one part was still unclear - and turned out to be the hardest:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How to correctly transform the data so that ClickHouse would accept it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where most of the complexity showed up.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part, I’ll dive into the most challenging part of this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing transforms using &lt;strong&gt;Vector Remap Language (VRL)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handling strict data types&lt;/li&gt;
&lt;li&gt;Fixing timestamp issues&lt;/li&gt;
&lt;li&gt;And shaping metrics into a format that ClickHouse can actually ingest&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;p&gt;This post is part of a series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 2: Understanding Vector pipelines (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data correctly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Switching tools didn’t solve the problem immediately.&lt;/p&gt;

&lt;p&gt;But it did something more important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It made the system visible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I could see how data moved through each stage, debugging stopped being guesswork - and started becoming structured.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>observability</category>
      <category>backend</category>
    </item>
    <item>
      <title>Why My Metrics Pipeline with Telegraf Didn’t Work (and What I Learned)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:18:51 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h</link>
      <guid>https://forem.com/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of a series on building a metrics pipeline into ClickHouse&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Collecting metrics is easy.&lt;/p&gt;

&lt;p&gt;Shipping them to an analytical database without losing your mind is the hard part.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;At one point, the task seemed straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Collect system metrics (CPU, memory, GPU) and store them in ClickHouse for analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a common observability use case.&lt;br&gt;
You collect metrics, send them somewhere, and run queries on top.&lt;/p&gt;

&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;But in practice, it didn’t go as planned.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Initial Approach: Telegraf
&lt;/h2&gt;

&lt;p&gt;I started with Telegraf.&lt;/p&gt;

&lt;p&gt;It’s widely used for collecting system metrics and has a plugin-based architecture, which makes it a natural first choice.&lt;/p&gt;

&lt;p&gt;This was also where I first came across TOML.&lt;/p&gt;

&lt;p&gt;At first, it felt like I just needed to “write a config and run it.”&lt;br&gt;
But very quickly, I realized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Configuration isn’t just syntax - it defines how your system behaves.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What I Was Trying to Build
&lt;/h2&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect host-level metrics (CPU, memory, etc.)&lt;/li&gt;
&lt;li&gt;Collect GPU metrics&lt;/li&gt;
&lt;li&gt;Push everything into ClickHouse&lt;/li&gt;
&lt;li&gt;Run analytical queries on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essentially, a basic observability pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  Where Things Started Breaking
&lt;/h2&gt;

&lt;p&gt;On paper, Telegraf looked like it should work.&lt;/p&gt;

&lt;p&gt;In reality, I ran into a few issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No straightforward way to push data into ClickHouse&lt;/li&gt;
&lt;li&gt;Lack of a native ClickHouse output plugin&lt;/li&gt;
&lt;li&gt;Debugging wasn’t very intuitive&lt;/li&gt;
&lt;li&gt;Configurations became rigid as complexity increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point, I was spending more time trying to make the tool fit the use case than actually solving the problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Shift in Perspective
&lt;/h2&gt;

&lt;p&gt;This is where something important clicked.&lt;/p&gt;

&lt;p&gt;Up until this point, I was thinking in terms of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write config → Run tool → Expect output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that approach wasn’t working.&lt;/p&gt;

&lt;p&gt;What I needed instead was a clearer understanding of how data actually flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data source → Transformation → Destination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem wasn’t just the tool - it was the lack of control over how data moved through the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Decided to Move Away
&lt;/h2&gt;

&lt;p&gt;At this stage, it became clear that I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More control over data transformations&lt;/li&gt;
&lt;li&gt;Better visibility into how data flows&lt;/li&gt;
&lt;li&gt;A system that is easier to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telegraf, while powerful, didn’t give me that level of flexibility for this use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;That’s when I decided to try a different approach using Vector.&lt;/p&gt;

&lt;p&gt;Instead of treating configuration as static setup, Vector treats it as a &lt;strong&gt;pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the next part, I’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Vector pipelines work&lt;/li&gt;
&lt;li&gt;Why the &lt;em&gt;sources → transforms → sinks&lt;/em&gt; model made a difference&lt;/li&gt;
&lt;li&gt;And what changed when I adopted that approach&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;p&gt;This post is part of a series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Why the Telegraf approach didn’t work (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;What started as a simple setup turned into a deeper lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools don’t solve problems - understanding systems does.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once that became clear, the direction forward was much easier.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Full Text Search in ClickHouse: What Works in 2026</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 01 Apr 2026 06:46:21 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/full-text-search-in-clickhouse-what-works-and-what-doesnt-241c</link>
      <guid>https://forem.com/mohhddhassan/full-text-search-in-clickhouse-what-works-and-what-doesnt-241c</guid>
      <description>&lt;p&gt;ClickHouse is the undisputed heavyweight champion of analytics famed for fast aggregations, massive columnar storage, and processing trillions of rows. Historically, however, if you wanted "real" full-text search, the engineering consensus was clear: &lt;strong&gt;Don't use ClickHouse.&lt;/strong&gt; You had to pay the "architectural tax" of syncing your data to a dedicated engine like Elasticsearch or OpenSearch.&lt;/p&gt;

&lt;p&gt;But as of 2026, that consensus has shifted. With the General Availability of &lt;strong&gt;Inverted Indices&lt;/strong&gt; and native ranking functions, the question is no longer &lt;em&gt;if&lt;/em&gt; ClickHouse can do search, but how much of your infrastructure you can now simplify by moving it all into one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Do We Mean by "Full-Text Search"?
&lt;/h2&gt;

&lt;p&gt;Full-text search is fundamentally different from simple string filtering. In a dedicated search ecosystem, it typically requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization:&lt;/strong&gt; Breaking sentences into individual, searchable words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inverted Indexing:&lt;/strong&gt; A specialized data structure that maps tokens to row IDs so the engine doesn't have to scan the entire table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance Scoring:&lt;/strong&gt; Ranking results using algorithms like &lt;strong&gt;BM25&lt;/strong&gt; so the best matches appear first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the past, ClickHouse only handled basic filtering. Today, it handles the entire stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Works: The 2026 Reality
&lt;/h2&gt;

&lt;p&gt;ClickHouse now provides a tiered approach to text search. Depending on your performance needs, you have three primary tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Heavyweight: Native Inverted Indices
&lt;/h3&gt;

&lt;p&gt;This is the single biggest update to the ClickHouse ecosystem. You no longer need to rely on brute-force &lt;code&gt;LIKE&lt;/code&gt; patterns that scan every byte of data. By defining an Inverted Index, ClickHouse creates a mapping that allows it to jump directly to the relevant data blocks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Creating a high-performance inverted index&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;inv_idx&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;inverted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GRANULARITY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The performance impact is massive. For datasets in the billions of rows, an indexed search can be &lt;strong&gt;10x to 100x faster&lt;/strong&gt; than a standard query because it narrows the search space to a few "granules" of data.&lt;/p&gt;
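
&lt;p&gt;One caveat: &lt;code&gt;ADD INDEX&lt;/code&gt; only applies to parts written after the change, so existing data needs a one-off materialization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Build the index for parts that existed before the ALTER
ALTER TABLE logs MATERIALIZE INDEX inv_idx;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;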

&lt;h3&gt;
  
  
  2. Industry-Standard Ranking (BM25)
&lt;/h3&gt;

&lt;p&gt;A search engine is only as good as its sorting. ClickHouse now supports &lt;strong&gt;BM25 scoring&lt;/strong&gt; natively. This allows you to find "connection errors" and ensure the rows where those terms appear most prominently are at the top of your result set.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Precision Tokenization
&lt;/h3&gt;

&lt;p&gt;Using functions like &lt;code&gt;hasToken()&lt;/code&gt;, ClickHouse understands word boundaries. It knows that a search for the word "log" should not return results for "logger" or "biological." This brings a level of precision previously reserved for dedicated search engines.&lt;/p&gt;
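
&lt;p&gt;For example, against the &lt;code&gt;logs&lt;/code&gt; table from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Whole-token match: counts rows containing the word "log",
-- but not "logger" or "biological"
SELECT count()
FROM logs
WHERE hasToken(message, 'log');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;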




&lt;h2&gt;
  
  
  Where ClickHouse Excels
&lt;/h2&gt;

&lt;p&gt;In the current landscape, ClickHouse is the "sweet spot" for several specific high-growth use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log Analytics &amp;amp; Observability:&lt;/strong&gt; This is the primary "Elasticsearch killer." You can search billions of logs for a specific error message and, in the same query, calculate the average latency or error rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Simplicity:&lt;/strong&gt; Managing a ClickHouse cluster &lt;em&gt;and&lt;/em&gt; a search cluster is an operational nightmare. Moving both workloads to ClickHouse reduces your infrastructure footprint, simplifies your ingestion pipelines, and slashes your cloud bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Queries:&lt;/strong&gt; ClickHouse allows you to join search results with structured metadata (like user IDs or pricing tables) instantly - something that is notoriously difficult in traditional search engines.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Still "Doesn't Work"
&lt;/h2&gt;

&lt;p&gt;Despite these massive strides, ClickHouse is not a magic bullet for every search problem. There are still areas where dedicated engines hold the lead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex Linguistics:&lt;/strong&gt; If you need deep morphological analysis (e.g., matching "mice" to "mouse" or handling complex compounding in German), dedicated engines still have more mature language plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy Matching &amp;amp; Auto-Correct:&lt;/strong&gt; While ClickHouse can calculate &lt;code&gt;levenshteinDistance()&lt;/code&gt;, it isn't yet optimized for high-concurrency "did you mean?" style suggestions found on major e-commerce sites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant Search Products:&lt;/strong&gt; If you are building a consumer-facing product where search &lt;em&gt;is&lt;/em&gt; the entire product, the fine-grained tuning of a search-first engine is still superior.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ClickHouse vs. Search Engines: The 2026 Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ClickHouse (2026)&lt;/th&gt;
&lt;th&gt;Elasticsearch / OpenSearch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Strength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analytics + Search&lt;/td&gt;
&lt;td&gt;High-Relevance Search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very Low (Columnar)&lt;/td&gt;
&lt;td&gt;High (Index Overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aggregation Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best-in-class&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relevance (BM25)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully Supported&lt;/td&gt;
&lt;td&gt;Industry Standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational Effort&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Single System)&lt;/td&gt;
&lt;td&gt;High (Multiple Systems)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The boundary between "Analytics" and "Search" has officially blurred. &lt;/p&gt;

&lt;p&gt;If you are analyzing logs, building internal observability tools, or need to search across massive datasets where cost and aggregation speed matter most, &lt;strong&gt;ClickHouse is now a full-text search engine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing ClickHouse in 2026 means opting for a simpler architecture and better performance without sacrificing the core search capabilities your team needs.&lt;/p&gt;




</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>backend</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When Synthetic Data Lies: A Hidden Correlation Problem I Didn’t Expect</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 26 Mar 2026 10:06:38 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/when-synthetic-data-lies-a-hidden-correlation-problem-i-didnt-expect-3f8b</link>
      <guid>https://forem.com/mohhddhassan/when-synthetic-data-lies-a-hidden-correlation-problem-i-didnt-expect-3f8b</guid>
      <description>&lt;p&gt;While working on a small analytics setup using ClickHouse and Superset, I generated some synthetic data to test queries and dashboards.&lt;/p&gt;

&lt;p&gt;Initially, everything looked fine. The distributions seemed reasonable, and the dashboards behaved as expected.&lt;/p&gt;

&lt;p&gt;But as I increased the dataset size, a few patterns started to look off.&lt;/p&gt;

&lt;p&gt;Revenue seemed to concentrate in a single country.&lt;br&gt;
In some cases, certain countries had no purchases at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kaac2iu9fs1sfqn44t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kaac2iu9fs1sfqn44t6.png" alt=" " width="652" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first, it looked like a simple distribution issue. But the patterns were too consistent to ignore.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Checking the Usual Suspects&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first assumption was that something was wrong with the queries or aggregations.&lt;/p&gt;

&lt;p&gt;So I checked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query logic&lt;/li&gt;
&lt;li&gt;filters&lt;/li&gt;
&lt;li&gt;materialized views&lt;/li&gt;
&lt;li&gt;dashboard configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything seemed correct.&lt;/p&gt;

&lt;p&gt;Which pointed to a different possibility:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The issue wasn’t in how the data was queried - it was in how the data was generated.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Looking at the Data More Closely&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of relying on dashboards, I went back to the raw data.&lt;/p&gt;

&lt;p&gt;A simple aggregation made things clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one country dominating purchases&lt;/li&gt;
&lt;li&gt;another missing entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hdhnwpeki8684ehk29f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hdhnwpeki8684ehk29f.png" alt=" " width="646" height="304"&gt;&lt;/a&gt;&lt;br&gt;
Overall event distribution looks reasonable - the issue isn’t obvious here&lt;/p&gt;

&lt;p&gt;At this point, it was clear that the data itself wasn’t behaving as expected.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The First Issue: Randomness That Wasn’t Quite Random&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The initial data generation logic used &lt;code&gt;rand()&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;multiIf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'India'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'US'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'UK'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Germany'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'UAE'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a glance, this looks reasonable.&lt;/p&gt;

&lt;p&gt;But each &lt;code&gt;rand()&lt;/code&gt; call is evaluated independently.&lt;/p&gt;

&lt;p&gt;So instead of generating one random value and assigning a category from it, the logic draws a fresh random value at each step.&lt;/p&gt;

&lt;p&gt;The effective probabilities drift badly as a result: India still gets its intended ~40%, but UAE is only chosen when all four earlier checks fail independently (0.60 × 0.35 × 0.20 × 0.10 ≈ 0.4%), not the intended 10%.&lt;/p&gt;

&lt;p&gt;This leads to unintended distributions and subtle bias in the data.&lt;/p&gt;
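
&lt;p&gt;The straightforward correction is to draw once per row and reuse that value across the branches - roughly like this sketch (I ended up taking a different, deterministic route, shown next):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- One draw per row; every branch tests the same value
SELECT multiIf(
    r &amp;lt; 40, 'India',
    r &amp;lt; 65, 'US',
    r &amp;lt; 80, 'UK',
    r &amp;lt; 90, 'Germany',
    'UAE'
) AS country
FROM (SELECT rand() % 100 AS r FROM numbers(1000000));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;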




&lt;h2&gt;
  
  
  &lt;strong&gt;Fixing That… and Introducing Another Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make the data more stable, I switched to a deterministic approach using &lt;code&gt;number&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;country_rand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made the distributions predictable and easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cfoboq9mi0t8drjt87w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cfoboq9mi0t8drjt87w.png" alt=" " width="654" height="300"&gt;&lt;/a&gt;&lt;br&gt;
After fixing randomness - a different issue appears (some countries have zero purchases)&lt;/p&gt;

&lt;p&gt;But it introduced a different issue.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;event_type&lt;/code&gt; and &lt;code&gt;country&lt;/code&gt; were now derived from the same base value: &lt;code&gt;number&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The Real Issue: Hidden Correlation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even with different multipliers, these values were not independent.&lt;/p&gt;

&lt;p&gt;They were mathematically related: both are functions of &lt;code&gt;number&lt;/code&gt; alone, so each row can produce exactly one (event_rand, country_rand) pair, and the pairs repeat in a fixed cycle of at most 100 out of the 10,000 possible combinations.&lt;/p&gt;

&lt;p&gt;This meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;certain rows could only produce certain combinations&lt;/li&gt;
&lt;li&gt;some combinations would never occur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, the rows that produced &lt;code&gt;"purchase"&lt;/code&gt; never aligned with the rows that produced &lt;code&gt;"UAE"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Which resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UAE having zero purchases&lt;/li&gt;
&lt;li&gt;other countries showing skewed distributions&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;What Was Actually Wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The issue wasn’t randomness.&lt;/p&gt;

&lt;p&gt;It was lack of independence.&lt;/p&gt;

&lt;p&gt;The variables in the synthetic dataset were not independent of each other.&lt;/p&gt;

&lt;p&gt;And that’s enough to produce misleading analytics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Fixing the Data Generation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To resolve this, I changed how the values were generated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;used different transformations&lt;/li&gt;
&lt;li&gt;added offsets&lt;/li&gt;
&lt;li&gt;ensured each variable had its own distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;country_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;47&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;device_rand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This breaks the alignment between the variables, so every category combination can actually occur - close enough to independence for test data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpw35esilqryhqhncsx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpw35esilqryhqhncsx4.png" alt=" " width="650" height="302"&gt;&lt;/a&gt;&lt;br&gt;
After this change, the distributions behaved as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At smaller scales, the issue wasn’t obvious.&lt;/p&gt;

&lt;p&gt;The data looked fine, and the dashboards seemed reasonable.&lt;/p&gt;

&lt;p&gt;But as the dataset grew:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;patterns became more consistent&lt;/li&gt;
&lt;li&gt;biases became more visible&lt;/li&gt;
&lt;li&gt;incorrect assumptions started to look like real insights&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Synthetic data can look correct while still producing misleading results.&lt;/p&gt;

&lt;p&gt;The problem wasn’t query performance.&lt;/p&gt;

&lt;p&gt;It was data correctness.&lt;/p&gt;

&lt;p&gt;Scaling the data didn’t create the issue.&lt;/p&gt;

&lt;p&gt;It revealed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This was a good reminder that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data generation deserves as much attention as querying&lt;/li&gt;
&lt;li&gt;small assumptions can lead to large inconsistencies&lt;/li&gt;
&lt;li&gt;and “reasonable-looking” data isn’t always reliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When working with analytics systems, it’s easy to trust what the data shows.&lt;/p&gt;

&lt;p&gt;But sometimes, it’s worth questioning how that data was created in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Related&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This came up while building an analytics setup using ClickHouse and Superset, where I was comparing raw tables and materialized views.&lt;/p&gt;

&lt;p&gt;If you're interested in that setup, you can read about it here:&lt;br&gt;
👉 &lt;a href="https://dev.to/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n"&gt;How ClickHouse + Superset Work Together for Analytics&lt;/a&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>analytics</category>
      <category>debugging</category>
    </item>
    <item>
      <title>How ClickHouse + Superset Work Together for Analytics (And What Actually Matters)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 25 Mar 2026 00:59:24 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n</link>
      <guid>https://forem.com/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n</guid>
      <description>&lt;p&gt;Modern analytics systems require more than just fast databases - they need a complete workflow from data storage to visualization.&lt;/p&gt;

&lt;p&gt;I set up a small analytics pipeline using ClickHouse and Apache Superset to understand how dashboards are built end to end.&lt;/p&gt;

&lt;p&gt;The setup itself was straightforward, but while testing it, one question kept coming up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does query optimization actually matter at smaller scales?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To explore this, I compared queries on a raw table with queries on a materialized view. The difference wasn’t huge - but it was enough to reveal how things behave as data grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why I Built This&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The goal wasn’t to simulate a production system, but to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understand how ClickHouse works in an analytics workflow&lt;/li&gt;
&lt;li&gt;explore how Superset interacts with a database&lt;/li&gt;
&lt;li&gt;observe how query performance changes with different data models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was more of a hands-on exploration than a benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why a BI Tool?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Running SQL queries directly is sufficient for basic analysis. However, as requirements grow, teams need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reusable datasets&lt;/li&gt;
&lt;li&gt;interactive dashboards&lt;/li&gt;
&lt;li&gt;faster exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A BI tool provides a structured way to bridge raw data and decision-making.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Apache Superset Instead of Grafana&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both tools serve different purposes:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apache Superset&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SQL-first analytics workflow&lt;/li&gt;
&lt;li&gt;rich visualization capabilities&lt;/li&gt;
&lt;li&gt;designed for OLAP use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Grafana&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;strong in monitoring and observability&lt;/li&gt;
&lt;li&gt;optimized for time-series metrics&lt;/li&gt;
&lt;li&gt;less flexible for ad-hoc analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For analytics workloads on ClickHouse, Superset provides greater flexibility and control.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why ClickHouse + Superset?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ClickHouse and Superset complement each other in a typical analytics stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse handles large-scale aggregations efficiently&lt;/li&gt;
&lt;li&gt;Superset enables exploration and visualization on top of SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ClickHouse performs the computation, while Superset exposes it for analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The overall architecture follows a simple flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data → ClickHouse → Materialized View → Superset → Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This separation makes it easier to control performance - heavy computation stays in ClickHouse, while Superset focuses on visualization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Dataset Design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A simple events table was created in ClickHouse using synthetic data.&lt;/p&gt;

&lt;p&gt;The goal was not to simulate production-scale workloads, but to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate the integration&lt;/li&gt;
&lt;li&gt;build dashboards&lt;/li&gt;
&lt;li&gt;observe query behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpdqto00t8k3h1q07qlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpdqto00t8k3h1q07qlw.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Dashboard Creation in Superset&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After establishing the connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasets were defined on ClickHouse tables&lt;/li&gt;
&lt;li&gt;charts were built using SQL queries&lt;/li&gt;
&lt;li&gt;dashboards were assembled with filters for interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Superset acts as a visualization layer while still relying heavily on SQL for data definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33zl7admx15nv5bl8jrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33zl7admx15nv5bl8jrc.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explore View&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwembd2vw0vmm477ajihl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwembd2vw0vmm477ajihl.png" alt="Final Dashboard" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Final Dashboard&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Raw Table vs Materialized View&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To understand performance behavior, queries were executed on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the raw table&lt;/li&gt;
&lt;li&gt;a materialized view with pre-aggregated data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw table → ~281 ms&lt;/li&gt;
&lt;li&gt;Materialized view → ~222 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0uvvn8jmutthrcxnlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0uvvn8jmutthrcxnlt.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw Table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezt2z03c345vuwjptad4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezt2z03c345vuwjptad4.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MV Table&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Materialized Views Improve Performance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Materialized views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce the volume of data scanned&lt;/li&gt;
&lt;li&gt;pre-compute aggregations&lt;/li&gt;
&lt;li&gt;simplify query logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though the dataset is small, the improvement is measurable.&lt;/p&gt;

&lt;p&gt;At this scale, the difference is minor - but it highlights something important:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As data grows, these small optimizations compound significantly.&lt;/strong&gt;&lt;/p&gt;
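
&lt;p&gt;As a rough illustration of the pattern (a sketch - the real schema differed), a pre-aggregating materialized view in ClickHouse looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Maintain daily per-country counts at insert time,
-- so dashboards read the small rollup instead of raw events
CREATE MATERIALIZED VIEW events_daily_mv
ENGINE = SummingMergeTree
ORDER BY (event_date, country)
AS SELECT
    toDate(event_time) AS event_date,
    country,
    count() AS events
FROM events
GROUP BY event_date, country;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;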




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Insight&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The performance difference is small at low scale, but the pattern is clear.&lt;/p&gt;

&lt;p&gt;As datasets grow, query performance becomes less about the BI tool and more about how the data is modeled.&lt;/p&gt;

&lt;p&gt;Materialized views, pre-aggregation, and query design matter far more than visualization tooling.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges Faced&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Driver Not Detected by Superset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Could not load database driver: ClickHouseConnectEngineSpec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Root Cause
&lt;/h4&gt;

&lt;p&gt;Superset runs inside its own internal virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/app/.venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The package was installed using system pip instead of the venv pip, making it invisible to Superset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/app/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; ensurepip
/app/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;clickhouse-connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;ClickHouse Not Visible in UI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ClickHouse did not appear in the database dropdown.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Use manual connection string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clickhousedb://default:password@clickhouse:8123/default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Authentication Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Authentication failures occurred due to existing volumes storing old credentials.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Reset the ClickHouse volume and restart containers.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;SQLite Migration Errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table ab_permission already exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Rebuild containers and allow Superset to handle initialization automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Learnings&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data modeling plays a critical role in analytics performance&lt;/li&gt;
&lt;li&gt;Materialized views are essential for scalable query performance&lt;/li&gt;
&lt;li&gt;Superset relies on a properly optimized backend&lt;/li&gt;
&lt;li&gt;Docker environment isolation can introduce subtle issues&lt;/li&gt;
&lt;li&gt;Understanding internal environments (like virtualenvs) is crucial&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A Note on Synthetic Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One interesting issue I ran into during this process was with synthetic data generation.&lt;/p&gt;

&lt;p&gt;At first, everything looked correct - but as the dataset grew, some unexpected patterns started to appear in the results.&lt;/p&gt;

&lt;p&gt;It turned out to be a subtle problem related to how the data was being generated, not queried.&lt;/p&gt;

&lt;p&gt;I’ll cover that in a follow-up post.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This setup was a good way to understand how modern analytics systems are put together - combining storage, computation, and visualization.&lt;/p&gt;

&lt;p&gt;Even with a small dataset, experimenting with different query strategies shows how systems behave as they scale.&lt;/p&gt;

&lt;p&gt;The tools themselves are powerful, but performance ultimately depends on how the data is structured and queried.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://superset.apache.org/user-docs/intro/" rel="noopener noreferrer"&gt;Apache Superset Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://clickhouse.com/docs/integrations/superset" rel="noopener noreferrer"&gt;Superset to ClickHouse&lt;/a&gt;&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>superset</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>Managing Large PostgreSQL Tables with Native Partitioning and pg_partman</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Mon, 16 Mar 2026 08:31:49 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/managing-large-postgresql-tables-with-native-partitioning-and-pgpartman-59ak</link>
      <guid>https://forem.com/mohhddhassan/managing-large-postgresql-tables-with-native-partitioning-and-pgpartman-59ak</guid>
      <description>&lt;p&gt;As databases grow, tables that store large volumes of time-based data can quickly become difficult to manage.&lt;/p&gt;

&lt;p&gt;Over time, this leads to several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries become slower&lt;/li&gt;
&lt;li&gt;indexes grow larger&lt;/li&gt;
&lt;li&gt;maintenance operations like &lt;code&gt;VACUUM&lt;/code&gt; take longer&lt;/li&gt;
&lt;li&gt;managing old data becomes complicated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL provides &lt;strong&gt;native table partitioning&lt;/strong&gt; to help address these problems.&lt;/p&gt;

&lt;p&gt;However, while partitioning improves performance and data management, creating and removing partitions by hand introduces operational complexity.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how &lt;strong&gt;native PostgreSQL partitioning&lt;/strong&gt; works&lt;/li&gt;
&lt;li&gt;the operational challenges of managing partitions manually&lt;/li&gt;
&lt;li&gt;how pg_partman automates partition management&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Native PostgreSQL Partitioning
&lt;/h1&gt;

&lt;p&gt;PostgreSQL supports table partitioning, allowing a large logical table to be split into multiple smaller physical tables called &lt;strong&gt;partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL currently supports three partitioning methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Range&lt;/li&gt;
&lt;li&gt;List&lt;/li&gt;
&lt;li&gt;Hash&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;For time-based data, &lt;strong&gt;range partitioning&lt;/strong&gt; is the most common approach.&lt;/p&gt;
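
&lt;p&gt;For comparison, here is a minimal sketch of the other two methods (the table and column names are illustrative, not part of this article's schema):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- List partitioning: rows are routed by discrete values
CREATE TABLE users (
    id     BIGSERIAL,
    region TEXT
) PARTITION BY LIST (region);

-- Hash partitioning: rows are spread across partitions by a hash of the key
CREATE TABLE sessions (
    id BIGSERIAL
) PARTITION BY HASH (id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The rest of this article uses range partitioning.&lt;/p&gt;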

&lt;p&gt;For example, imagine storing application events with a timestamp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;events&lt;/code&gt; becomes the &lt;strong&gt;parent table&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Actual data is stored in child tables called &lt;strong&gt;partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example partitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_03_20&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-20'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_03_21&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-22'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each partition is a &lt;strong&gt;physical table inside PostgreSQL&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Data Is Inserted into Partitions
&lt;/h1&gt;

&lt;p&gt;Applications still insert data into the &lt;strong&gt;parent table&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21 10:15:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"event":"login"}'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL automatically routes the row to the correct partition based on the partition key.&lt;/p&gt;

&lt;p&gt;In this example, the row would be stored inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events_2026_03_21
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Queries still run against the parent table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL internally performs &lt;strong&gt;partition pruning&lt;/strong&gt;, meaning only the relevant partitions are scanned.&lt;/p&gt;
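
&lt;p&gt;Pruning is easy to verify with &lt;code&gt;EXPLAIN&lt;/code&gt;. Against the example table, the plan should mention only the matching partition (plan output abbreviated, not exact):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;EXPLAIN SELECT * FROM events
WHERE created_at &amp;gt;= '2026-03-21';

-- Seq Scan on events_2026_03_21 events
--   Filter: (created_at &amp;gt;= '2026-03-21 00:00:00'::timestamp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;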




&lt;h1&gt;
  
  
  Why Partitioning Improves Performance
&lt;/h1&gt;

&lt;p&gt;Partitioning helps in several ways.&lt;/p&gt;

&lt;p&gt;Instead of scanning a very large table, PostgreSQL only scans the partitions that match the query conditions.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          Query for last 7 days
                                   ↓
                 PostgreSQL scans only recent partitions
                                   ↓
                        Older partitions are skipped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioning also simplifies operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;archiving historical data&lt;/li&gt;
&lt;li&gt;dropping old data quickly&lt;/li&gt;
&lt;li&gt;managing index sizes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, dropping old data can be done instantly by dropping a partition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2024_03_01&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is far faster than deleting millions of rows from a single large table.&lt;/p&gt;
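
&lt;p&gt;Archiving works the same way: instead of dropping a partition, it can be detached so it lives on as an ordinary standalone table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Remove the partition from the parent without deleting its data
ALTER TABLE events DETACH PARTITION events_2024_03_01;

-- events_2024_03_01 is now a normal table that can be dumped,
-- moved to cheaper storage, or dropped later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;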




&lt;h1&gt;
  
  
  The Operational Challenge with Native Partitioning
&lt;/h1&gt;

&lt;p&gt;While native partitioning is powerful, managing partitions manually introduces operational challenges.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new partitions must be &lt;strong&gt;created ahead of time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;old partitions must be &lt;strong&gt;removed manually&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;retention policies must be implemented manually&lt;/li&gt;
&lt;li&gt;missing partitions can cause insert failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  application inserts event at midnight
                                   ↓
                        new partition does not exist
                                   ↓
                              insert fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
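
&lt;p&gt;With the example table from earlier, that failure looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;INSERT INTO events (created_at, data)
VALUES ('2026-03-25 00:00:01', '{"event":"login"}');

-- ERROR:  no partition of relation "events" found for row
-- DETAIL: Partition key of the failing row contains
--         (created_at) = (2026-03-25 00:00:01)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;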



&lt;p&gt;As systems scale and tables grow, manually managing partitions becomes increasingly difficult.&lt;/p&gt;

&lt;p&gt;This is where automation becomes valuable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Introducing pg_partman
&lt;/h1&gt;

&lt;p&gt;pg_partman is a PostgreSQL extension designed to automate partition management.&lt;/p&gt;

&lt;p&gt;It builds on top of PostgreSQL’s native partitioning and helps manage partitioned tables more efficiently.&lt;/p&gt;

&lt;p&gt;pg_partman can automatically handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creation of future partitions&lt;/li&gt;
&lt;li&gt;retention and removal of old partitions&lt;/li&gt;
&lt;li&gt;partition maintenance operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces the operational overhead of managing partitioned tables manually.&lt;/p&gt;
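
&lt;p&gt;As a rough sketch, registering the &lt;code&gt;events&lt;/code&gt; table for daily partitions looks like this in the pg_partman 5.x style (the schema name and parameter values here are assumptions; check the docs for the version you install):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- pg_partman is conventionally installed into its own schema
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;

-- Register the parent table; future partitions are created automatically
SELECT partman.create_parent(
    p_parent_table =&amp;gt; 'public.events',
    p_control      =&amp;gt; 'created_at',
    p_interval     =&amp;gt; '1 day'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;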




&lt;h1&gt;
  
  
  How pg_partman Works
&lt;/h1&gt;

&lt;p&gt;pg_partman works by managing a &lt;strong&gt;parent partitioned table&lt;/strong&gt; and automatically maintaining its partitions.&lt;/p&gt;

&lt;p&gt;A simplified workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              Parent table
                                   ↓
                         pg_partman configuration
                                   ↓
                   Automatic creation of future partitions
                                   ↓
               Optional retention policies for old partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if a table is partitioned by &lt;strong&gt;day&lt;/strong&gt;, pg_partman can automatically create upcoming partitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events_p2026_03_22
events_p2026_03_23
events_p2026_03_24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that new inserts always have a valid partition.&lt;/p&gt;
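
&lt;p&gt;Partitions are created during pg_partman's maintenance run, which needs to be executed periodically. A minimal sketch (the pg_cron scheduling line is an assumption about the environment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Run maintenance manually: creates upcoming partitions,
-- applies retention to old ones
CALL partman.run_maintenance_proc();

-- Or schedule it hourly, e.g. with pg_cron
SELECT cron.schedule('partman-maintenance', '0 * * * *',
                     'CALL partman.run_maintenance_proc()');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;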




&lt;h1&gt;
  
  
  Benefits of Using pg_partman
&lt;/h1&gt;

&lt;p&gt;Compared to managing partitions manually, pg_partman provides several advantages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Automatic partition creation
                                   ↓
                    Reduced operational overhead
                                   ↓
                   Safer data retention management
                                   ↓
                   Less risk of missing partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of maintaining partition logic manually in application code or scripts, pg_partman handles this inside the database.&lt;/p&gt;
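
&lt;p&gt;Retention, for example, is configured per parent table in pg_partman's &lt;code&gt;part_config&lt;/code&gt; table; a sketch with illustrative values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Keep 30 days of data; drop (not just detach) anything older
UPDATE partman.part_config
SET    retention            = '30 days',
       retention_keep_table = false
WHERE  parent_table = 'public.events';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;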




&lt;h1&gt;
  
  
  When pg_partman Is Useful
&lt;/h1&gt;

&lt;p&gt;pg_partman is particularly useful for workloads involving &lt;strong&gt;large append-only datasets&lt;/strong&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event logs&lt;/li&gt;
&lt;li&gt;analytics data&lt;/li&gt;
&lt;li&gt;application activity tracking&lt;/li&gt;
&lt;li&gt;time-series data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these scenarios, new data continuously arrives while older data eventually becomes less important.&lt;/p&gt;

&lt;p&gt;Partition automation helps manage this lifecycle efficiently.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;PostgreSQL’s native partitioning provides powerful capabilities for managing large datasets.&lt;/p&gt;

&lt;p&gt;However, operating partitioned tables manually can introduce additional operational complexity.&lt;/p&gt;

&lt;p&gt;Extensions like pg_partman simplify this process by automating partition creation and maintenance.&lt;/p&gt;

&lt;p&gt;By combining PostgreSQL’s native partitioning features with pg_partman’s automation, teams can manage large time-based datasets more reliably and with less manual intervention.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>database</category>
      <category>backend</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
