<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Mohamed Hussain S</title>
    <description>The latest articles on Forem by Mohamed Hussain S (@mohhddhassan).</description>
    <link>https://forem.com/mohhddhassan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3228500%2Fc2e87a6e-70e3-4023-a3d3-5adddef072f2.jpeg</url>
      <title>Forem: Mohamed Hussain S</title>
      <link>https://forem.com/mohhddhassan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/mohhddhassan"/>
    <language>en</language>
    <item>
      <title>Why My S3 Backup Setup Broke: Buckets, “Folders”, and Scheduling Misconceptions</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 07:04:20 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/why-my-s3-backup-setup-broke-buckets-folders-and-scheduling-misconceptions-315e</link>
      <guid>https://forem.com/mohhddhassan/why-my-s3-backup-setup-broke-buckets-folders-and-scheduling-misconceptions-315e</guid>
      <description>&lt;p&gt;&lt;em&gt;Another lesson in building reliable systems - not just configuring them.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I thought I had everything set up correctly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backups configured&lt;/li&gt;
&lt;li&gt;S3-compatible storage connected&lt;/li&gt;
&lt;li&gt;Backup triggered via cron jobs during testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet nothing showed up where I expected.&lt;/p&gt;

&lt;p&gt;What looked like a simple configuration issue turned out to be a wrong mental model of how S3 actually works.&lt;/p&gt;

&lt;p&gt;This post breaks down what went wrong and what fixed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was working with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An S3-compatible object storage (not AWS directly)&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;A system that allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuring a bucket&lt;/li&gt;
&lt;li&gt;Setting a backup path&lt;/li&gt;
&lt;li&gt;Defining backup frequency&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Everything seemed straightforward.&lt;/p&gt;

&lt;p&gt;But the problem started with one assumption:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Buckets can behave like folders.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The First Mistake: Treating Buckets Like Folders
&lt;/h2&gt;

&lt;p&gt;In a traditional file system, you think like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;backups/
  app1/
    db.sql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So it felt natural to assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a “folder” in object storage&lt;/li&gt;
&lt;li&gt;Then create buckets inside it for different use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8owl9w7awnhyhx94611.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8owl9w7awnhyhx94611.png" alt=" " width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my case, I had something like a folder already created in the object storage UI, and I assumed:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;That is my base, and I can create buckets under it&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connecting to that “folder” as a bucket&lt;/li&gt;
&lt;li&gt;Then creating another bucket inside it (for vector DB backups)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kept failing.&lt;/p&gt;

&lt;p&gt;At first, I thought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maybe it is a permission issue&lt;/li&gt;
&lt;li&gt;Maybe my user does not have enough access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that was not the real problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Actually Going Wrong
&lt;/h2&gt;

&lt;p&gt;I was effectively trying to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat a bucket like a parent directory&lt;/li&gt;
&lt;li&gt;And create another bucket inside it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not how S3 works.&lt;/p&gt;

&lt;p&gt;In S3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buckets are top-level containers&lt;/li&gt;
&lt;li&gt;You cannot nest buckets inside other buckets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So when I tried to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connect to an existing bucket&lt;/li&gt;
&lt;li&gt;And then create another bucket under it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It failed because the concept itself is invalid.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Correct Mental Model
&lt;/h2&gt;

&lt;p&gt;This is how S3 actually works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backups&lt;/span&gt;
&lt;span class="na"&gt;object key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app1/2026-04-15/db.sql&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are only two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket (top-level)&lt;/li&gt;
&lt;li&gt;Object key (full path as a string)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is no real folder hierarchy.&lt;/p&gt;




&lt;h2&gt;
  
  
  Organizing Data the Right Way
&lt;/h2&gt;

&lt;p&gt;The fix was not about creating folders.&lt;/p&gt;

&lt;p&gt;It was about changing how I name objects.&lt;/p&gt;

&lt;p&gt;Instead of trying to structure things at the bucket level, I moved that structure into the object key.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;object_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives a structure like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;bucket&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backups&lt;/span&gt;

&lt;span class="s"&gt;qdrant/&lt;/span&gt;
  &lt;span class="s"&gt;collection_1/&lt;/span&gt;
    &lt;span class="s"&gt;snapshot_001&lt;/span&gt;
  &lt;span class="s"&gt;collection_2/&lt;/span&gt;
    &lt;span class="s"&gt;snapshot_002&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even though S3 is flat internally, most UIs render this as a folder-like structure.&lt;/p&gt;

&lt;p&gt;This is the correct way to organize data.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Second Mistake: Mixing Bucket and Path
&lt;/h2&gt;

&lt;p&gt;Another issue was passing paths as part of the bucket name.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;backups/qdrant&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is invalid.&lt;/p&gt;

&lt;p&gt;Correct approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;bucket = backups&lt;/li&gt;
&lt;li&gt;object key = qdrant/collection_name/snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;S3 APIs expect a valid bucket name, not a path.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Finally Clicked
&lt;/h2&gt;

&lt;p&gt;The breakthrough was realizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I was not dealing with folders at all&lt;/li&gt;
&lt;li&gt;I was dealing with string prefixes inside object keys&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once I stopped trying to create hierarchy at the bucket level and moved everything into object naming, the entire setup started working as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Correct configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bucket:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  backups
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Object naming:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;  &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;snapshot_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This alone was enough to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organize backups cleanly&lt;/li&gt;
&lt;li&gt;Avoid bucket-related errors&lt;/li&gt;
&lt;li&gt;Make the storage layout intuitive in the UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Buckets are not folders&lt;/li&gt;
&lt;li&gt;You cannot create a bucket inside another bucket&lt;/li&gt;
&lt;li&gt;S3 is a flat object store&lt;/li&gt;
&lt;li&gt;Folder-like structures come from object key prefixes&lt;/li&gt;
&lt;li&gt;Always keep bucket and path separate&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;The issue was not with permissions or configuration.&lt;/p&gt;

&lt;p&gt;It was a mismatch between how I expected storage to behave and how it actually works.&lt;/p&gt;

&lt;p&gt;Once the mental model changed, the implementation became simple.&lt;/p&gt;

&lt;p&gt;If something feels unnecessarily complicated in S3, it is often a sign that the model being used is incorrect.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>s3</category>
      <category>cloudstorage</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Debugging a Broken Metrics Pipeline: What Actually Went Wrong</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 03:29:50 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o</link>
      <guid>https://forem.com/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 4 of a series on building a metrics pipeline into ClickHouse&lt;br&gt;
Read &lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 3: Understanding Vector Transforms&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  When Things Still Don’t Work
&lt;/h2&gt;

&lt;p&gt;At this point, the pipeline looked correct.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sources were defined&lt;/li&gt;
&lt;li&gt;Transforms were working&lt;/li&gt;
&lt;li&gt;Data structure matched expectations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yet, something was still off.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Data wasn’t behaving the way it should.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where debugging became the main task.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Only Way Forward: Logs
&lt;/h2&gt;

&lt;p&gt;When dealing with ingestion issues in ClickHouse, logs become your best source of truth.&lt;/p&gt;

&lt;p&gt;I started monitoring the error logs directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo tail&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/log/clickhouse-server/clickhouse-server.err.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This immediately surfaced issues that were not visible from the pipeline configuration.&lt;/p&gt;




&lt;h2&gt;
  
  
  An Error That Didn’t Make Sense
&lt;/h2&gt;

&lt;p&gt;At one point, I started seeing this error repeatedly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;There exists no table monitoring.cpu in database monitoring
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was confusing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I hadn’t created a table named &lt;code&gt;cpu&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;It wasn’t part of my current setup&lt;/li&gt;
&lt;li&gt;My Vector configuration didn’t reference it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So where was it coming from?&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Actually Happening
&lt;/h2&gt;

&lt;p&gt;After digging deeper, the issue had nothing to do with my current pipeline.&lt;/p&gt;

&lt;p&gt;It turned out that a previously used Telegraf process was still running in the background.&lt;/p&gt;

&lt;p&gt;Even though I had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed configurations&lt;/li&gt;
&lt;li&gt;Switched tools&lt;/li&gt;
&lt;li&gt;Rebuilt the pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The old process was still active and sending data using an outdated setup.&lt;/p&gt;

&lt;p&gt;That’s why ClickHouse was reporting errors for a table I never intended to use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;This wasn’t a configuration issue.&lt;/p&gt;

&lt;p&gt;It was a &lt;strong&gt;runtime issue&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The system I was debugging was not the only system running.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That realization changed how I approached debugging.&lt;/p&gt;




&lt;h2&gt;
  
  
  Fixing It
&lt;/h2&gt;

&lt;p&gt;The solution was simple - but easy to miss.&lt;/p&gt;

&lt;p&gt;First, I checked for any running Telegraf processes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ps aux | &lt;span class="nb"&gt;grep &lt;/span&gt;telegraf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then stopped them explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop telegraf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the old process was stopped, the errors disappeared.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Teaches
&lt;/h2&gt;

&lt;p&gt;This led to an important lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Always validate the runtime environment - not just the configuration.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When working with pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Old processes may still be running&lt;/li&gt;
&lt;li&gt;Multiple agents may write to the same destination&lt;/li&gt;
&lt;li&gt;Previous setups can interfere with new ones&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you don’t account for this, you may end up debugging the wrong problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Debugging Loop
&lt;/h2&gt;

&lt;p&gt;Most of the pipeline development ended up looking like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write → Run → Fail → Check logs → Fix → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each iteration helped refine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transform logic&lt;/li&gt;
&lt;li&gt;Data structure&lt;/li&gt;
&lt;li&gt;Schema alignment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This loop is where real progress happens.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Finally Worked
&lt;/h2&gt;

&lt;p&gt;Once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transforms were correct&lt;/li&gt;
&lt;li&gt;Timestamps were fixed&lt;/li&gt;
&lt;li&gt;Old processes were stopped&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline stabilized.&lt;/p&gt;

&lt;p&gt;Data started flowing consistently into ClickHouse, and queries returned expected results.&lt;/p&gt;




&lt;h2&gt;
  
  
  Series Recap
&lt;/h2&gt;

&lt;p&gt;This series covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 4: Debugging and making the pipeline reliable (this post)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Building data pipelines is rarely about getting things right on the first try.&lt;/p&gt;

&lt;p&gt;It’s about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Observing how the system behaves&lt;/li&gt;
&lt;li&gt;Identifying where it breaks&lt;/li&gt;
&lt;li&gt;Iterating until it stabilizes&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Debugging is not a side task - it is the process.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>vector</category>
      <category>debugging</category>
    </item>
    <item>
      <title>From Pipelines to Transforms: Making Vector Work with ClickHouse</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 16 Apr 2026 01:13:59 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho</link>
      <guid>https://forem.com/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 3 of a series on building a metrics pipeline into ClickHouse&lt;br&gt;
Read &lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector Pipelines&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Where Things Got Real
&lt;/h2&gt;

&lt;p&gt;By this point, the pipeline structure made sense.&lt;/p&gt;

&lt;p&gt;I understood:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sources&lt;/li&gt;
&lt;li&gt;Transforms&lt;/li&gt;
&lt;li&gt;Sinks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the pipeline still wasn’t working reliably.&lt;/p&gt;

&lt;p&gt;That’s when it became clear:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The hardest part wasn’t collecting data.&lt;br&gt;
It was transforming it correctly.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  Why Transforms Matter
&lt;/h2&gt;

&lt;p&gt;Raw metrics are rarely usable as-is.&lt;/p&gt;

&lt;p&gt;When sending data into ClickHouse, even small inconsistencies can break ingestion.&lt;/p&gt;

&lt;p&gt;Some common issues encountered were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong data types&lt;/li&gt;
&lt;li&gt;Unexpected field structures&lt;/li&gt;
&lt;li&gt;Missing values&lt;/li&gt;
&lt;li&gt;Incorrect timestamp formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even if everything else is correct, these issues cause inserts to fail.&lt;/p&gt;


&lt;h2&gt;
  
  
  Enter VRL
&lt;/h2&gt;

&lt;p&gt;In Vector, transformations are written using &lt;strong&gt;Vector Remap Language&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;At first, VRL feels simple.&lt;/p&gt;

&lt;p&gt;But in practice, it’s strict.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Types must be explicit&lt;/li&gt;
&lt;li&gt;Fields must be handled carefully&lt;/li&gt;
&lt;li&gt;Errors are not ignored&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That strictness is what makes pipelines reliable - but also harder to get right.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Timestamp Problem
&lt;/h2&gt;

&lt;p&gt;One of the biggest issues I faced was timestamp handling.&lt;/p&gt;

&lt;p&gt;ClickHouse expects timestamps in a specific format.&lt;/p&gt;

&lt;p&gt;The raw data didn’t match that format.&lt;/p&gt;

&lt;p&gt;Even when everything else was correct, inserts would fail silently because of this.&lt;/p&gt;

&lt;p&gt;The fix looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.timestamp = to_unix_timestamp!(parse_timestamp!(.timestamp, "%+"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This line did three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsed the incoming timestamp&lt;/li&gt;
&lt;li&gt;Converted it into a Unix format&lt;/li&gt;
&lt;li&gt;Made it compatible with ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It seems simple - but this was a major blocker.&lt;/p&gt;




&lt;h2&gt;
  
  
  Normalizing Metrics
&lt;/h2&gt;

&lt;p&gt;Another challenge was aligning the data structure with what ClickHouse expects.&lt;/p&gt;

&lt;p&gt;For both host and GPU metrics, this required:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Converting values into numeric types&lt;/li&gt;
&lt;li&gt;Standardizing field names&lt;/li&gt;
&lt;li&gt;Adding metadata like host and source&lt;/li&gt;
&lt;li&gt;Ensuring consistent structure across all metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without this step, ingestion would fail even if the pipeline looked correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Raw Data to Queryable Format
&lt;/h2&gt;

&lt;p&gt;One important transformation was changing how metrics were structured.&lt;/p&gt;

&lt;p&gt;Instead of storing multiple values in a single record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cpu, memory, disk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data was reshaped into a row-based format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;metric_name = "cpu", value = ...
metric_name = "memory", value = ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made it easier to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query data in ClickHouse&lt;/li&gt;
&lt;li&gt;Aggregate metrics&lt;/li&gt;
&lt;li&gt;Maintain a consistent schema&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why This Was Hard
&lt;/h2&gt;

&lt;p&gt;Most of the time spent on this pipeline wasn’t on setup.&lt;/p&gt;

&lt;p&gt;It was here:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write transform → Run → Fail → Fix → Repeat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each iteration revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A type mismatch&lt;/li&gt;
&lt;li&gt;A missing field&lt;/li&gt;
&lt;li&gt;A formatting issue&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the pipeline actually gets built.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Changed After This
&lt;/h2&gt;

&lt;p&gt;Once the transforms were correct:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data started flowing reliably&lt;/li&gt;
&lt;li&gt;Inserts into ClickHouse succeeded&lt;/li&gt;
&lt;li&gt;Queries started returning meaningful results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the pipeline finally felt stable.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;Even after fixing transformations, one major challenge remained:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Debugging unexpected failures.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the next part, I’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How I debugged pipeline issues&lt;/li&gt;
&lt;li&gt;What ClickHouse logs revealed&lt;/li&gt;
&lt;li&gt;And a mistake that cost me time&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector Pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 3: Writing transforms and handling data correctly (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Transforms are where pipelines either succeed or fail.&lt;/p&gt;

&lt;p&gt;Understanding how data needs to be shaped is more important than the tool itself.&lt;/p&gt;

&lt;p&gt;Once that becomes clear, everything else starts to fall into place.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>observability</category>
      <category>vector</category>
    </item>
    <item>
      <title>Understanding Vector Pipelines: From Config Files to Data Flow</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 09 Apr 2026 19:29:55 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k</link>
      <guid>https://forem.com/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 2 of a series on building a metrics pipeline into ClickHouse&lt;/em&gt;&lt;br&gt;
&lt;em&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Read Part 1: Why my metrics pipeline with Telegraf didn’t work&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Picking Up Where Things Broke
&lt;/h2&gt;

&lt;p&gt;In the previous part, I talked about trying to build a metrics pipeline using Telegraf - and why that approach didn’t work for my use case.&lt;/p&gt;

&lt;p&gt;The biggest issue wasn’t just tooling.&lt;/p&gt;

&lt;p&gt;It was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I didn’t have enough control over how data moved through the system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s what led me to explore a different approach.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Vector
&lt;/h2&gt;

&lt;p&gt;I came across Vector while looking for something more flexible.&lt;/p&gt;

&lt;p&gt;At a glance, it felt different.&lt;/p&gt;

&lt;p&gt;Instead of thinking in terms of plugins and configs, Vector is built around a &lt;strong&gt;pipeline model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And that changes everything.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Core Idea: Pipelines
&lt;/h2&gt;

&lt;p&gt;At the center of Vector is a simple concept:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sources → Transforms → Sinks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it.&lt;/p&gt;

&lt;p&gt;But this model makes the flow of data explicit.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sources&lt;/strong&gt; → where data comes from&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transforms&lt;/strong&gt; → how data is modified&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sinks&lt;/strong&gt; → where data is sent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compared to my earlier approach, this immediately felt clearer.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Actually Means
&lt;/h2&gt;

&lt;p&gt;Instead of writing a config and hoping everything connects correctly, you define:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What data you are collecting&lt;/li&gt;
&lt;li&gt;How that data should be shaped&lt;/li&gt;
&lt;li&gt;Where that data should go&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That shift sounds small - but it changes how you think about the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  From Config Files to Data Flow
&lt;/h2&gt;

&lt;p&gt;With Telegraf, my thinking looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write config → Run → Debug errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Vector, it started becoming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Collect → Transform → Route → Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The focus moved from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“What config do I write?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“How does data move through each stage?”&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The New Learning Curve
&lt;/h2&gt;

&lt;p&gt;Of course, switching tools didn’t magically solve everything.&lt;/p&gt;

&lt;p&gt;There were new challenges.&lt;/p&gt;

&lt;p&gt;Vector uses YAML for configuration, which was different from the TOML I was used to.&lt;/p&gt;

&lt;p&gt;And more importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pipeline only works if every stage is defined correctly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Some of the early issues I ran into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Incorrect source definitions&lt;/li&gt;
&lt;li&gt;Misconfigured sinks&lt;/li&gt;
&lt;li&gt;Data not flowing as expected&lt;/li&gt;
&lt;li&gt;Silent failures when something didn’t connect properly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At times, it felt like nothing was happening-even though everything looked “correct.”&lt;/p&gt;




&lt;h2&gt;
  
  
  First Realization: Everything Is Connected
&lt;/h2&gt;

&lt;p&gt;One important thing I learned quickly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If one stage breaks, the entire pipeline breaks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unlike simpler setups, you can’t treat components independently.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A bad transform can stop data entirely&lt;/li&gt;
&lt;li&gt;A misconfigured sink can drop everything silently&lt;/li&gt;
&lt;li&gt;A source that doesn’t emit correctly makes debugging harder&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This forced me to start thinking in terms of &lt;strong&gt;end-to-end flow&lt;/strong&gt;, not individual pieces.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Improved Immediately
&lt;/h2&gt;

&lt;p&gt;Despite the challenges, a few things became better compared to before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clear visibility into how data moves&lt;/li&gt;
&lt;li&gt;Better control over transformations&lt;/li&gt;
&lt;li&gt;More flexibility in shaping data before sending it to ClickHouse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though things weren’t fully working yet, I finally felt like I was closer to solving the actual problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Was Still Missing
&lt;/h2&gt;

&lt;p&gt;At this stage, the pipeline structure made sense.&lt;/p&gt;

&lt;p&gt;But one part was still unclear-and turned out to be the hardest:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How to correctly transform the data so that ClickHouse would accept it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is where most of the complexity showed up.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;In the next part, I’ll dive into the most challenging part of this setup:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing transforms using &lt;strong&gt;Vector Remap Language (VRL)&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Handling strict data types&lt;/li&gt;
&lt;li&gt;Fixing timestamp issues&lt;/li&gt;
&lt;li&gt;And shaping metrics into a format that ClickHouse can actually ingest&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;p&gt;This post is part of a series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h"&gt;Part 1: Why the Telegraf approach didn’t work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Part 2: Understanding Vector pipelines (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data correctly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;Switching tools didn’t solve the problem immediately.&lt;/p&gt;

&lt;p&gt;But it did something more important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It made the system visible.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I could see how data moved through each stage, debugging stopped being guesswork-and started becoming structured.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>observability</category>
      <category>backend</category>
    </item>
    <item>
      <title>Why My Metrics Pipeline with Telegraf Didn’t Work (and What I Learned)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Tue, 07 Apr 2026 10:18:51 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h</link>
      <guid>https://forem.com/mohhddhassan/why-my-metrics-pipeline-with-telegraf-didnt-work-and-what-i-learned-4g6h</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of a series on building a metrics pipeline into ClickHouse&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Collecting metrics is easy.&lt;/p&gt;

&lt;p&gt;Shipping them to an analytical database without losing your mind is the hard part.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;At one point, the task seemed straightforward:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Collect system metrics (CPU, memory, GPU) and store them in ClickHouse for analysis.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is a common observability use case.&lt;br&gt;
You collect metrics, send them somewhere, and run queries on top.&lt;/p&gt;

&lt;p&gt;Simple enough.&lt;/p&gt;

&lt;p&gt;But in practice, it didn’t go as planned.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Initial Approach: Telegraf
&lt;/h2&gt;

&lt;p&gt;I started with Telegraf.&lt;/p&gt;

&lt;p&gt;It’s widely used for collecting system metrics and has a plugin-based architecture, which makes it a natural first choice.&lt;/p&gt;

&lt;p&gt;This was also where I first came across TOML.&lt;/p&gt;

&lt;p&gt;At first, it felt like I just needed to “write a config and run it.”&lt;br&gt;
But very quickly, I realized:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Configuration isn’t just syntax-it defines how your system behaves.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  What I Was Trying to Build
&lt;/h2&gt;

&lt;p&gt;The idea was simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect host-level metrics (CPU, memory, etc.)&lt;/li&gt;
&lt;li&gt;Collect GPU metrics&lt;/li&gt;
&lt;li&gt;Push everything into ClickHouse&lt;/li&gt;
&lt;li&gt;Run analytical queries on top&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essentially, a basic observability pipeline.&lt;/p&gt;


&lt;h2&gt;
  
  
  Where Things Started Breaking
&lt;/h2&gt;

&lt;p&gt;On paper, Telegraf looked like it should work.&lt;/p&gt;

&lt;p&gt;In reality, I ran into a few issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No straightforward way to push data into ClickHouse&lt;/li&gt;
&lt;li&gt;Lack of a native ClickHouse output plugin&lt;/li&gt;
&lt;li&gt;Debugging wasn’t very intuitive&lt;/li&gt;
&lt;li&gt;Configurations became rigid as complexity increased&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point, I was spending more time trying to make the tool fit the use case than actually solving the problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  A Shift in Perspective
&lt;/h2&gt;

&lt;p&gt;This is where something important clicked.&lt;/p&gt;

&lt;p&gt;Up until this point, I was thinking in terms of:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write config → Run tool → Expect output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But that approach wasn’t working.&lt;/p&gt;

&lt;p&gt;What I needed instead was a clearer understanding of how data actually flows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data source → Transformation → Destination
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem wasn’t just the tool-it was the lack of control over how data moved through the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Decided to Move Away
&lt;/h2&gt;

&lt;p&gt;At this stage, it became clear that I needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More control over data transformations&lt;/li&gt;
&lt;li&gt;Better visibility into how data flows&lt;/li&gt;
&lt;li&gt;A system that is easier to debug&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Telegraf, while powerful, didn’t give me that level of flexibility for this use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;That’s when I decided to try a different approach using Vector.&lt;/p&gt;

&lt;p&gt;Instead of treating configuration as static setup, Vector treats it as a &lt;strong&gt;pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the next part, I’ll walk through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How Vector pipelines work&lt;/li&gt;
&lt;li&gt;Why the &lt;em&gt;sources → transforms → sinks&lt;/em&gt; model made a difference&lt;/li&gt;
&lt;li&gt;And what changed when I adopted that approach&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Series Overview
&lt;/h2&gt;

&lt;p&gt;This post is part of a series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Part 1: Why the Telegraf approach didn’t work (this post)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/understanding-vector-pipelines-from-config-files-to-data-flow-i8k"&gt;Part 2: Understanding Vector pipelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/from-pipelines-to-transforms-making-vector-work-with-clickhouse-5ho"&gt;Part 3: Writing transforms and handling data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/mohhddhassan/debugging-a-broken-metrics-pipeline-what-actually-went-wrong-4d3o"&gt;Part 4: Debugging and making the pipeline reliable&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;What started as a simple setup turned into a deeper lesson:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Tools don’t solve problems-understanding systems does.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once that became clear, the direction forward was much easier.&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>devops</category>
      <category>observability</category>
    </item>
    <item>
      <title>Full Text Search in ClickHouse: What Works in 2026</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 01 Apr 2026 06:46:21 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/full-text-search-in-clickhouse-what-works-and-what-doesnt-241c</link>
      <guid>https://forem.com/mohhddhassan/full-text-search-in-clickhouse-what-works-and-what-doesnt-241c</guid>
      <description>&lt;p&gt;ClickHouse is the undisputed heavyweight champion of analytics famed for fast aggregations, massive columnar storage, and processing trillions of rows. Historically, however, if you wanted "real" full-text search, the engineering consensus was clear: &lt;strong&gt;Don't use ClickHouse.&lt;/strong&gt; You had to pay the "architectural tax" of syncing your data to a dedicated engine like Elasticsearch or OpenSearch.&lt;/p&gt;

&lt;p&gt;But as of 2026, that consensus has shifted. With the General Availability of &lt;strong&gt;Inverted Indices&lt;/strong&gt; and native ranking functions, the question is no longer &lt;em&gt;if&lt;/em&gt; ClickHouse can do search, but how much of your infrastructure you can now simplify by moving it all into one place.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Do We Mean by "Full-Text Search"?
&lt;/h2&gt;

&lt;p&gt;Full-text search is fundamentally different from simple string filtering. In a dedicated search ecosystem, it typically requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization:&lt;/strong&gt; Breaking sentences into individual, searchable words.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inverted Indexing:&lt;/strong&gt; A specialized data structure that maps tokens to row IDs so the engine doesn't have to scan the entire table.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relevance Scoring:&lt;/strong&gt; Ranking results using algorithms like &lt;strong&gt;BM25&lt;/strong&gt; so the best matches appear first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the past, ClickHouse only handled basic filtering. Today, it handles the entire stack.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Actually Works: The 2026 Reality
&lt;/h2&gt;

&lt;p&gt;ClickHouse now provides a tiered approach to text search. Depending on your performance needs, you have three primary tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Heavyweight: Native Inverted Indices
&lt;/h3&gt;

&lt;p&gt;This is the single biggest update to the ClickHouse ecosystem. You no longer need to rely on brute-force &lt;code&gt;LIKE&lt;/code&gt; patterns that scan every byte of data. By defining an Inverted Index, ClickHouse creates a mapping that allows it to jump directly to the relevant data blocks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Creating a high-performance inverted index&lt;/span&gt;
&lt;span class="k"&gt;ALTER&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;logs&lt;/span&gt; &lt;span class="k"&gt;ADD&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;inv_idx&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="k"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;inverted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;GRANULARITY&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The performance impact is massive. For datasets in the billions of rows, an indexed search can be &lt;strong&gt;10x to 100x faster&lt;/strong&gt; than a standard query because it narrows the search space to a few "granules" of data.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Industry-Standard Ranking (BM25)
&lt;/h3&gt;

&lt;p&gt;A search engine is only as good as its sorting. ClickHouse now supports &lt;strong&gt;BM25 scoring&lt;/strong&gt; natively. This allows you to find "connection errors" and ensure the rows where those terms appear most prominently are at the top of your result set.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Precision Tokenization
&lt;/h3&gt;

&lt;p&gt;Using functions like &lt;code&gt;hasToken()&lt;/code&gt;, ClickHouse understands word boundaries. It knows that a search for the word "log" should not return results for "logger" or "biological." This brings a level of precision previously reserved for dedicated search engines.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where ClickHouse Excels
&lt;/h2&gt;

&lt;p&gt;In the current landscape, ClickHouse is the "sweet spot" for several specific high-growth use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log Analytics &amp;amp; Observability:&lt;/strong&gt; This is the primary "Elasticsearch killer." You can search billions of logs for a specific error message and, in the same query, calculate the average latency or error rate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architectural Simplicity:&lt;/strong&gt; Managing a ClickHouse cluster &lt;em&gt;and&lt;/em&gt; a search cluster is an operational nightmare. Moving both workloads to ClickHouse reduces your infrastructure footprint, simplifies your ingestion pipelines, and slashes your cloud bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hybrid Queries:&lt;/strong&gt; ClickHouse allows you to join search results with structured metadata (like user IDs or pricing tables) instantly - something that is notoriously difficult in traditional search engines.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Still "Doesn't Work"
&lt;/h2&gt;

&lt;p&gt;Despite these massive strides, ClickHouse is not a magic bullet for every search problem. There are still areas where dedicated engines hold the lead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complex Linguistics:&lt;/strong&gt; If you need deep morphological analysis (e.g., matching "mice" to "mouse" or handling complex compounding in German), dedicated engines still have more mature language plugins.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fuzzy Matching &amp;amp; Auto-Correct:&lt;/strong&gt; While ClickHouse can calculate &lt;code&gt;levenshteinDistance()&lt;/code&gt;, it isn't yet optimized for high-concurrency "did you mean?" style suggestions found on major e-commerce sites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-tenant Search Products:&lt;/strong&gt; If you are building a consumer-facing product where search &lt;em&gt;is&lt;/em&gt; the entire product, the fine-grained tuning of a search-first engine is still superior.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  ClickHouse vs. Search Engines: The 2026 Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;ClickHouse (2026)&lt;/th&gt;
&lt;th&gt;Elasticsearch / OpenSearch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Primary Strength&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analytics + Search&lt;/td&gt;
&lt;td&gt;High-Relevance Search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Very Low (Columnar)&lt;/td&gt;
&lt;td&gt;High (Index Overhead)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aggregation Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Best-in-class&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Relevance (BM25)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fully Supported&lt;/td&gt;
&lt;td&gt;Industry Standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Operational Effort&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Low (Single System)&lt;/td&gt;
&lt;td&gt;High (Multiple Systems)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;The boundary between "Analytics" and "Search" has officially blurred. &lt;/p&gt;

&lt;p&gt;If you are analyzing logs, building internal observability tools, or need to search across massive datasets where cost and aggregation speed matter most, &lt;strong&gt;ClickHouse is now a full-text search engine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Choosing ClickHouse in 2026 means opting for a simpler architecture and better performance without sacrificing the core search capabilities your team needs.&lt;/p&gt;




</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>backend</category>
      <category>opensource</category>
    </item>
    <item>
      <title>When Synthetic Data Lies: A Hidden Correlation Problem I Didn’t Expect</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Thu, 26 Mar 2026 10:06:38 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/when-synthetic-data-lies-a-hidden-correlation-problem-i-didnt-expect-3f8b</link>
      <guid>https://forem.com/mohhddhassan/when-synthetic-data-lies-a-hidden-correlation-problem-i-didnt-expect-3f8b</guid>
      <description>&lt;p&gt;While working on a small analytics setup using ClickHouse and Superset, I generated some synthetic data to test queries and dashboards.&lt;/p&gt;

&lt;p&gt;Initially, everything looked fine. The distributions seemed reasonable, and the dashboards behaved as expected.&lt;/p&gt;

&lt;p&gt;But as I increased the dataset size, a few patterns started to look off.&lt;/p&gt;

&lt;p&gt;Revenue seemed to concentrate in a single country.&lt;br&gt;
In some cases, certain countries had no purchases at all.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kaac2iu9fs1sfqn44t6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kaac2iu9fs1sfqn44t6.png" alt=" " width="652" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At first, it looked like a simple distribution issue. But the patterns were too consistent to ignore.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Checking the Usual Suspects&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first assumption was that something was wrong with the queries or aggregations.&lt;/p&gt;

&lt;p&gt;So I checked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;query logic&lt;/li&gt;
&lt;li&gt;filters&lt;/li&gt;
&lt;li&gt;materialized views&lt;/li&gt;
&lt;li&gt;dashboard configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everything seemed correct.&lt;/p&gt;

&lt;p&gt;Which pointed to a different possibility:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The issue wasn’t in how the data was queried - it was in how the data was generated.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Looking at the Data More Closely&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Instead of relying on dashboards, I went back to the raw data.&lt;/p&gt;

&lt;p&gt;A simple aggregation made things clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one country dominating purchases&lt;/li&gt;
&lt;li&gt;another missing entirely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hdhnwpeki8684ehk29f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4hdhnwpeki8684ehk29f.png" alt=" " width="646" height="304"&gt;&lt;/a&gt;&lt;br&gt;
Overall event distribution looks reasonable - the issue isn’t obvious here&lt;/p&gt;

&lt;p&gt;At this point, it was clear that the data itself wasn’t behaving as expected.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The First Issue: Randomness That Wasn’t Quite Random&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The initial data generation logic used &lt;code&gt;rand()&lt;/code&gt; like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;multiIf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'India'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'US'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'UK'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'Germany'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'UAE'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At a glance, this looks reasonable.&lt;/p&gt;

&lt;p&gt;But each &lt;code&gt;rand()&lt;/code&gt; call is evaluated independently.&lt;/p&gt;

&lt;p&gt;So instead of generating a single random value and assigning a category, the logic evaluates a new random value at each step.&lt;/p&gt;

&lt;p&gt;This leads to unintended distributions and subtle bias in the data.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Fixing That… and Introducing Another Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make the data more stable, I switched to a deterministic approach using &lt;code&gt;number&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;country_rand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This made the distributions predictable and easier to reason about.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cfoboq9mi0t8drjt87w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cfoboq9mi0t8drjt87w.png" alt=" " width="654" height="300"&gt;&lt;/a&gt;&lt;br&gt;
After fixing randomness - a different issue appears (some countries have zero purchases)&lt;/p&gt;

&lt;p&gt;But it introduced a different issue.&lt;/p&gt;

&lt;p&gt;Both &lt;code&gt;event_type&lt;/code&gt; and &lt;code&gt;country&lt;/code&gt; were now derived from the same base value: &lt;code&gt;number&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The Real Issue: Hidden Correlation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Even with different multipliers, these values were not independent.&lt;/p&gt;

&lt;p&gt;They were mathematically related.&lt;/p&gt;

&lt;p&gt;This meant:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;certain rows could only produce certain combinations&lt;/li&gt;
&lt;li&gt;some combinations would never occur&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this case, the rows that produced &lt;code&gt;"purchase"&lt;/code&gt; never aligned with the rows that produced &lt;code&gt;"UAE"&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Which resulted in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;UAE having zero purchases&lt;/li&gt;
&lt;li&gt;other countries showing skewed distributions&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;What Was Actually Wrong&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The issue wasn’t randomness.&lt;/p&gt;

&lt;p&gt;It was lack of independence.&lt;/p&gt;

&lt;p&gt;The variables in the synthetic dataset were not independent of each other.&lt;/p&gt;

&lt;p&gt;And that’s enough to produce misleading analytics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;Fixing the Data Generation&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To resolve this, I changed how the values were generated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;used different transformations&lt;/li&gt;
&lt;li&gt;added offsets&lt;/li&gt;
&lt;li&gt;ensured each variable had its own distribution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;event_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;country_rand&lt;/span&gt;
&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;47&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;device_rand&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This breaks the alignment between variables and restores independence.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpw35esilqryhqhncsx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmpw35esilqryhqhncsx4.png" alt=" " width="650" height="302"&gt;&lt;/a&gt;&lt;br&gt;
After this change, the distributions behaved as expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why This Matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;At smaller scales, the issue wasn’t obvious.&lt;/p&gt;

&lt;p&gt;The data looked fine, and the dashboards seemed reasonable.&lt;/p&gt;

&lt;p&gt;But as the dataset grew:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;patterns became more consistent&lt;/li&gt;
&lt;li&gt;biases became more visible&lt;/li&gt;
&lt;li&gt;incorrect assumptions started to look like real insights&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Takeaway&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Synthetic data can look correct while still producing misleading results.&lt;/p&gt;

&lt;p&gt;The problem wasn’t query performance.&lt;/p&gt;

&lt;p&gt;It was data correctness.&lt;/p&gt;

&lt;p&gt;Scaling the data didn’t create the issue.&lt;/p&gt;

&lt;p&gt;It revealed it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This was a good reminder that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;data generation deserves as much attention as querying&lt;/li&gt;
&lt;li&gt;small assumptions can lead to large inconsistencies&lt;/li&gt;
&lt;li&gt;and “reasonable-looking” data isn’t always reliable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When working with analytics systems, it’s easy to trust what the data shows.&lt;/p&gt;

&lt;p&gt;But sometimes, it’s worth questioning how that data was created in the first place.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Related&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This came up while building an analytics setup using ClickHouse and Superset, where I was comparing raw tables and materialized views.&lt;/p&gt;

&lt;p&gt;If you're interested in that setup, you can read about it here:&lt;br&gt;
👉 &lt;a href="https://dev.to/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n"&gt;link to Blog 1&lt;/a&gt;&lt;/p&gt;




</description>
      <category>dataengineering</category>
      <category>clickhouse</category>
      <category>analytics</category>
      <category>debugging</category>
    </item>
    <item>
      <title>How ClickHouse + Superset Work Together for Analytics (And What Actually Matters)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 25 Mar 2026 00:59:24 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n</link>
      <guid>https://forem.com/mohhddhassan/how-clickhouse-superset-work-together-for-analytics-and-what-actually-matters-597n</guid>
      <description>&lt;p&gt;Modern analytics systems require more than just fast databases - they need a complete workflow from data storage to visualization.&lt;/p&gt;

&lt;p&gt;I set up a small analytics pipeline using ClickHouse and Apache Superset to understand how dashboards are built end to end.&lt;/p&gt;

&lt;p&gt;The setup itself was straightforward, but while testing it, one question kept coming up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Does query optimization actually matter at smaller scales?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To explore this, I compared queries on a raw table with queries on a materialized view. The difference wasn’t huge - but it was enough to reveal how things behave as data grows.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why I Built This&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The goal wasn’t to simulate a production system, but to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;understand how ClickHouse works in an analytics workflow&lt;/li&gt;
&lt;li&gt;explore how Superset interacts with a database&lt;/li&gt;
&lt;li&gt;observe how query performance changes with different data models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was more of a hands-on exploration than a benchmark.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why a BI Tool?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Running SQL queries directly is sufficient for basic analysis. However, as requirements grow, teams need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reusable datasets&lt;/li&gt;
&lt;li&gt;interactive dashboards&lt;/li&gt;
&lt;li&gt;faster exploration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A BI tool provides a structured way to bridge raw data and decision-making.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Apache Superset Instead of Grafana&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Both tools serve different purposes:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Apache Superset&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;SQL-first analytics workflow&lt;/li&gt;
&lt;li&gt;rich visualization capabilities&lt;/li&gt;
&lt;li&gt;designed for OLAP use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Grafana&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;strong in monitoring and observability&lt;/li&gt;
&lt;li&gt;optimized for time-series metrics&lt;/li&gt;
&lt;li&gt;less flexible for ad-hoc analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For analytics workloads on ClickHouse, Superset provides greater flexibility and control.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why ClickHouse + Superset?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;ClickHouse and Superset complement each other in a typical analytics stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse handles large-scale aggregations efficiently&lt;/li&gt;
&lt;li&gt;Superset enables exploration and visualization on top of SQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ClickHouse performs the computation, while Superset exposes it for analysis.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The overall architecture follows a simple flow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data → ClickHouse → Materialized View → Superset → Dashboard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This separation makes it easier to control performance - heavy computation stays in ClickHouse, while Superset focuses on visualization.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Dataset Design&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A simple events table was created in ClickHouse using synthetic data.&lt;/p&gt;

&lt;p&gt;The goal was not to simulate production-scale workloads, but to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;validate the integration&lt;/li&gt;
&lt;li&gt;build dashboards&lt;/li&gt;
&lt;li&gt;observe query behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpdqto00t8k3h1q07qlw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcpdqto00t8k3h1q07qlw.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Dashboard Creation in Superset&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After establishing the connection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;datasets were defined on ClickHouse tables&lt;/li&gt;
&lt;li&gt;charts were built using SQL queries&lt;/li&gt;
&lt;li&gt;dashboards were assembled with filters for interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Superset acts as a visualization layer while still relying heavily on SQL for data definition.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33zl7admx15nv5bl8jrc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33zl7admx15nv5bl8jrc.png" alt=" " width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explore View&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwembd2vw0vmm477ajihl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwembd2vw0vmm477ajihl.png" alt="Final Dashboard" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Final Dashboard&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Raw Table vs Materialized View&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To understand performance behavior, queries were executed on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the raw table&lt;/li&gt;
&lt;li&gt;a materialized view with pre-aggregated data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Raw table → ~281 ms&lt;/li&gt;
&lt;li&gt;Materialized view → ~222 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0uvvn8jmutthrcxnlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffo0uvvn8jmutthrcxnlt.png" alt=" " width="800" height="218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Raw Table&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezt2z03c345vuwjptad4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fezt2z03c345vuwjptad4.png" alt=" " width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;MV Table&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Why Materialized Views Improve Performance&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Materialized views:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reduce the volume of data scanned&lt;/li&gt;
&lt;li&gt;pre-compute aggregations&lt;/li&gt;
&lt;li&gt;simplify query logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though the dataset is small, the improvement is measurable.&lt;/p&gt;

&lt;p&gt;At this scale, the difference is minor - but it highlights something important:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As data grows, these small optimizations compound significantly.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Insight&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The performance difference is small at low scale, but the pattern is clear.&lt;/p&gt;

&lt;p&gt;As datasets grow, query performance becomes less about the BI tool and more about how the data is modeled.&lt;/p&gt;

&lt;p&gt;Materialized views, pre-aggregation, and query design matter far more than visualization tooling.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Challenges Faced&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Driver Not Detected by Superset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Could not load database driver: ClickHouseConnectEngineSpec
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Root Cause
&lt;/h4&gt;

&lt;p&gt;Superset runs inside its own internal virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/app/.venv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The package was installed using system pip instead of the venv pip, making it invisible to Superset.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/app/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; ensurepip
/app/.venv/bin/python &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;clickhouse-connect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;ClickHouse Not Visible in UI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ClickHouse did not appear in the database dropdown.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Use manual connection string:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;clickhousedb://default:password@clickhouse:8123/default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Authentication Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Authentication failures occurred due to existing volumes storing old credentials.&lt;/p&gt;

&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Reset the ClickHouse volume and restart containers.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;SQLite Migration Errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table ab_permission already exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Fix
&lt;/h4&gt;

&lt;p&gt;Rebuild containers and allow Superset to handle initialization automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Key Learnings&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data modeling plays a critical role in analytics performance&lt;/li&gt;
&lt;li&gt;Materialized views are essential for scalable query performance&lt;/li&gt;
&lt;li&gt;Superset relies on a properly optimized backend&lt;/li&gt;
&lt;li&gt;Docker environment isolation can introduce subtle issues&lt;/li&gt;
&lt;li&gt;Understanding internal environments (like virtualenvs) is crucial&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A Note on Synthetic Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One interesting issue I ran into during this process was with synthetic data generation.&lt;/p&gt;

&lt;p&gt;At first, everything looked correct - but as the dataset grew, some unexpected patterns started to appear in the results.&lt;/p&gt;

&lt;p&gt;It turned out to be a subtle problem related to how the data was being generated, not queried.&lt;/p&gt;

&lt;p&gt;I’ll cover that in a follow-up post.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This setup was a good way to understand how modern analytics systems are put together - combining storage, computation, and visualization.&lt;/p&gt;

&lt;p&gt;Even with a small dataset, experimenting with different query strategies shows how systems behave as they scale.&lt;/p&gt;

&lt;p&gt;The tools themselves are powerful, but performance ultimately depends on how the data is structured and queried.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;References&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://superset.apache.org/user-docs/intro/" rel="noopener noreferrer"&gt;Apache Superset Documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://clickhouse.com/docs/integrations/superset" rel="noopener noreferrer"&gt;Superset to ClickHouse&lt;/a&gt;&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>superset</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>Managing Large PostgreSQL Tables with Native Partitioning and pg_partman</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Mon, 16 Mar 2026 08:31:49 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/managing-large-postgresql-tables-with-native-partitioning-and-pgpartman-59ak</link>
      <guid>https://forem.com/mohhddhassan/managing-large-postgresql-tables-with-native-partitioning-and-pgpartman-59ak</guid>
      <description>&lt;p&gt;As databases grow, tables that store large volumes of time-based data can quickly become difficult to manage.&lt;/p&gt;

&lt;p&gt;Over time, this leads to several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queries become slower&lt;/li&gt;
&lt;li&gt;indexes grow larger&lt;/li&gt;
&lt;li&gt;maintenance operations like &lt;code&gt;VACUUM&lt;/code&gt; take longer&lt;/li&gt;
&lt;li&gt;managing old data becomes complicated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PostgreSQL provides &lt;strong&gt;native table partitioning&lt;/strong&gt; to help address these problems.&lt;/p&gt;

&lt;p&gt;However, while partitioning improves performance and data management, operating partitioned tables manually can introduce operational complexity.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how &lt;strong&gt;native PostgreSQL partitioning&lt;/strong&gt; works&lt;/li&gt;
&lt;li&gt;the operational challenges of managing partitions manually&lt;/li&gt;
&lt;li&gt;how pg_partman automates partition management&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Native PostgreSQL Partitioning
&lt;/h1&gt;

&lt;p&gt;PostgreSQL supports table partitioning, allowing a large logical table to be split into multiple smaller physical tables called &lt;strong&gt;partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;PostgreSQL currently supports three partitioning methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Range
List
Hash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For time-based data, &lt;strong&gt;range partitioning&lt;/strong&gt; is the most common approach.&lt;/p&gt;

&lt;p&gt;For example, imagine storing application events with a timestamp.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;BIGSERIAL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;data&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;RANGE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;code&gt;events&lt;/code&gt; becomes the &lt;strong&gt;parent table&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Actual data is stored in child tables called &lt;strong&gt;partitions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example partitions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_03_20&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-20'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2026_03_21&lt;/span&gt;
&lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-22'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each partition is a &lt;strong&gt;physical table inside PostgreSQL&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Data Is Inserted into Partitions
&lt;/h1&gt;

&lt;p&gt;Applications still insert data into the &lt;strong&gt;parent table&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-03-21 10:15:00'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'{"event":"login"}'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL automatically routes the row to the correct partition based on the partition key.&lt;/p&gt;

&lt;p&gt;In this example, the row would be stored inside:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events_2026_03_21
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Queries still run against the parent table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'7 days'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL internally performs &lt;strong&gt;partition pruning&lt;/strong&gt;, meaning only the relevant partitions are scanned.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Partitioning Improves Performance
&lt;/h1&gt;

&lt;p&gt;Partitioning helps in several ways.&lt;/p&gt;

&lt;p&gt;Instead of scanning a very large table, PostgreSQL only scans the partitions that match the query conditions.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          Query for last 7 days
                                   ↓
                 PostgreSQL scans only recent partitions
                                   ↓
                        Older partitions are skipped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Partitioning also simplifies operations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;archiving historical data&lt;/li&gt;
&lt;li&gt;dropping old data quickly&lt;/li&gt;
&lt;li&gt;managing index sizes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, dropping old data can be done instantly by dropping a partition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;events_2024_03_01&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is far faster than deleting millions of rows from a single large table.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Operational Challenge with Native Partitioning
&lt;/h1&gt;

&lt;p&gt;While native partitioning is powerful, managing partitions manually introduces operational challenges.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;new partitions must be &lt;strong&gt;created ahead of time&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;old partitions must be &lt;strong&gt;removed manually&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;retention policies must be implemented manually&lt;/li&gt;
&lt;li&gt;missing partitions can cause insert failures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example scenario:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  application inserts event at midnight
                                   ↓
                        new partition does not exist
                                   ↓
                              insert fails
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As systems scale and tables grow, manually managing partitions becomes increasingly difficult.&lt;/p&gt;

&lt;p&gt;This is where automation becomes valuable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Introducing pg_partman
&lt;/h1&gt;

&lt;p&gt;pg_partman is a PostgreSQL extension designed to automate partition management.&lt;/p&gt;

&lt;p&gt;It builds on top of PostgreSQL’s native partitioning and helps manage partitioned tables more efficiently.&lt;/p&gt;

&lt;p&gt;pg_partman can automatically handle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;creation of future partitions&lt;/li&gt;
&lt;li&gt;retention and removal of old partitions&lt;/li&gt;
&lt;li&gt;partition maintenance operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This reduces the operational overhead of managing partitioned tables manually.&lt;/p&gt;




&lt;h1&gt;
  
  
  How pg_partman Works
&lt;/h1&gt;

&lt;p&gt;pg_partman works by managing a &lt;strong&gt;parent partitioned table&lt;/strong&gt; and automatically maintaining its partitions.&lt;/p&gt;

&lt;p&gt;A simplified workflow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              Parent table
                                   ↓
                         pg_partman configuration
                                   ↓
                   Automatic creation of future partitions
                                   ↓
               Optional retention policies for old partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, if a table is partitioned by &lt;strong&gt;day&lt;/strong&gt;, pg_partman can automatically create upcoming partitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;events_p2026_03_22
events_p2026_03_23
events_p2026_03_24
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that new inserts always have a valid partition.&lt;/p&gt;




&lt;h1&gt;
  
  
  Benefits of Using pg_partman
&lt;/h1&gt;

&lt;p&gt;Compared to managing partitions manually, pg_partman provides several advantages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    Automatic partition creation
                                   ↓
                    Reduced operational overhead
                                   ↓
                   Safer data retention management
                                   ↓
                   Less risk of missing partitions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of maintaining partition logic manually in application code or scripts, pg_partman handles this inside the database.&lt;/p&gt;




&lt;h1&gt;
  
  
  When pg_partman Is Useful
&lt;/h1&gt;

&lt;p&gt;pg_partman is particularly useful for workloads involving &lt;strong&gt;large append-only datasets&lt;/strong&gt;, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;event logs&lt;/li&gt;
&lt;li&gt;analytics data&lt;/li&gt;
&lt;li&gt;application activity tracking&lt;/li&gt;
&lt;li&gt;time-series data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these scenarios, new data continuously arrives while older data eventually becomes less important.&lt;/p&gt;

&lt;p&gt;Partition automation helps manage this lifecycle efficiently.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;PostgreSQL’s native partitioning provides powerful capabilities for managing large datasets.&lt;/p&gt;

&lt;p&gt;However, operating partitioned tables manually can introduce additional operational complexity.&lt;/p&gt;

&lt;p&gt;Extensions like pg_partman simplify this process by automating partition creation and maintenance.&lt;/p&gt;

&lt;p&gt;By combining PostgreSQL’s native partitioning features with pg_partman’s automation, teams can manage large time-based datasets more reliably and with less manual intervention.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>database</category>
      <category>backend</category>
      <category>devops</category>
    </item>
    <item>
      <title>PostgreSQL Backups and Point-in-Time Recovery with pgBackRest</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:10:54 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/postgresql-backups-and-point-in-time-recovery-with-pgbackrest-13gp</link>
      <guid>https://forem.com/mohhddhassan/postgresql-backups-and-point-in-time-recovery-with-pgbackrest-13gp</guid>
      <description>&lt;p&gt;This article is part of a series on PostgreSQL WAL.&lt;/p&gt;

&lt;p&gt;Part 1 - &lt;a href="https://dev.to/mohhddhassan/how-postgresql-wal-actually-works-and-why-everything-depends-on-it-2g81"&gt;How PostgreSQL WAL Actually Works&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Part 2 - &lt;a href="https://dev.to/mohhddhassan/how-debezium-uses-postgresql-wal-for-change-data-capture-3l19"&gt;How Debezium Uses PostgreSQL WAL for CDC&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Part 3 - PostgreSQL Backups with pgBackRest and PITR&lt;/p&gt;

&lt;p&gt;In the previous articles, we've explored how PostgreSQL records every change inside the database using &lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We also saw how tools like Debezium can read WAL to stream database changes in real time.&lt;/p&gt;

&lt;p&gt;But WAL has another very important role.&lt;/p&gt;

&lt;p&gt;It enables &lt;strong&gt;database backups and recovery&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;By combining a base backup with archived WAL segments, PostgreSQL can restore a database to &lt;strong&gt;any specific moment in time&lt;/strong&gt;. This capability is known as &lt;strong&gt;Point-in-Time Recovery (PITR)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore how this works and how to implement it using pgBackRest.&lt;/p&gt;


&lt;h1&gt;
  
  
  Why Backups Alone Are Not Enough
&lt;/h1&gt;

&lt;p&gt;A common backup strategy is to take periodic full backups of a database.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;02:00 → Full backup taken
14:00 → Accidental DELETE executed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we restore the 02:00 backup, the database returns to that state.&lt;/p&gt;

&lt;p&gt;But everything between &lt;strong&gt;02:00 and 14:00 is lost&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          restore backup
                                 ↓
                      database returns to 02:00
                                 ↓
                        12 hours of data lost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where WAL archiving becomes important.&lt;/p&gt;

&lt;p&gt;Instead of relying only on periodic backups, PostgreSQL can archive WAL segments continuously.&lt;/p&gt;

&lt;p&gt;This allows the database to replay changes until a specific moment in time.&lt;/p&gt;




&lt;h1&gt;
  
  
  How WAL Enables Point-in-Time Recovery
&lt;/h1&gt;

&lt;p&gt;Point-in-Time Recovery works by combining two components:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              Base Backup
                                   +
                        Continuous WAL Archiving
                                   =
                         Point-in-Time Recovery
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recovery flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
                               Base Backup
                                    │
                                    ▼
                              Restore Backup
                                    │
                                    ▼
                           Replay WAL Segments
                                    │
                                    ▼
                            Reach Target Time
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because WAL contains every change in chronological order, PostgreSQL can replay those changes to reconstruct the database state at any moment.&lt;/p&gt;




&lt;h1&gt;
  
  
  Introducing pgBackRest
&lt;/h1&gt;

&lt;p&gt;pgBackRest is a popular backup and restore solution designed specifically for PostgreSQL.&lt;/p&gt;

&lt;p&gt;It provides several capabilities including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;managing full, differential, and incremental backups&lt;/li&gt;
&lt;li&gt;archiving WAL segments&lt;/li&gt;
&lt;li&gt;enforcing backup retention policies&lt;/li&gt;
&lt;li&gt;restoring databases efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;pgBackRest integrates directly with PostgreSQL’s WAL archiving system and stores backups in a repository.&lt;/p&gt;

&lt;p&gt;This repository can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local storage&lt;/li&gt;
&lt;li&gt;remote servers&lt;/li&gt;
&lt;li&gt;object storage&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For testing in this article, we will use &lt;strong&gt;MinIO&lt;/strong&gt; as an object storage backend.&lt;/p&gt;




&lt;h1&gt;
  
  
  Testing Architecture
&lt;/h1&gt;

&lt;p&gt;For local testing, the setup consists of two containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         PostgreSQL + pgBackRest
                                   │
                                   ▼
                                 MinIO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The PostgreSQL container runs both the database and pgBackRest.&lt;/li&gt;
&lt;li&gt;MinIO acts as an object storage service where backups and WAL segments are stored.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture makes it easy to test backup and restore workflows locally.&lt;/p&gt;




&lt;h1&gt;
  
  
  Enabling WAL Archiving
&lt;/h1&gt;

&lt;p&gt;To enable WAL archiving in PostgreSQL, the following parameters must be configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;archive_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;on&lt;/span&gt;
&lt;span class="py"&gt;archive_command&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'pgbackrest --stanza=demo archive-push %p'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is what happens internally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         WAL segment reaches 16MB
                                   ↓
                   PostgreSQL executes archive_command
                                   ↓
                      pgBackRest uploads WAL segment
                                   ↓
                      WAL stored in backup repository
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures WAL segments are continuously archived.&lt;/p&gt;




&lt;h1&gt;
  
  
  Running a Base Backup
&lt;/h1&gt;

&lt;p&gt;Once pgBackRest is configured, a base backup can be taken using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo backup &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pgBackRest supports multiple backup types.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Complete backup of the database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Differential&lt;/td&gt;
&lt;td&gt;Backup of changes since last full backup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Incremental&lt;/td&gt;
&lt;td&gt;Backup of changes since last backup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These options allow flexible backup strategies.&lt;/p&gt;




&lt;h1&gt;
  
  
  Simulating Data Loss
&lt;/h1&gt;

&lt;p&gt;To verify the backup pipeline, a sample table was created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;serial&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;text&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sample data was inserted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Bob'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Charlie'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After confirming the data, a full backup was taken.&lt;/p&gt;

&lt;p&gt;Next, the table was intentionally removed to simulate a failure scenario.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DROP&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, the table and its data no longer existed.&lt;/p&gt;




&lt;h1&gt;
  
  
  Restoring the Database
&lt;/h1&gt;

&lt;p&gt;To restore the database, PostgreSQL must first be stopped.&lt;/p&gt;

&lt;p&gt;However, in this setup PostgreSQL and pgBackRest are running inside the &lt;strong&gt;same container&lt;/strong&gt;.&lt;br&gt;
If the PostgreSQL process stops, the container also stops, which means we can no longer access pgBackRest inside that container.&lt;/p&gt;

&lt;p&gt;To work around this, a &lt;strong&gt;temporary container&lt;/strong&gt; was started using the same data volume. This allowed us to run the restore operation while PostgreSQL remained stopped.&lt;/p&gt;

&lt;p&gt;The restore process then involved the following steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            Restore base backup
                                   ↓
                         Replay archived WAL segments
                                   ↓
                      Database reaches consistent state
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the temporary container, the restore command was executed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo restore &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;immediate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the restore completed, the original PostgreSQL container was started again.&lt;/p&gt;

&lt;p&gt;At first glance, it seemed like the restore had worked.&lt;br&gt;
However, the result was not exactly what we expected - something interesting was happening behind the scenes.&lt;/p&gt;

&lt;p&gt;We'll see why in a moment.&lt;/p&gt;


&lt;h1&gt;
  
  
  A PITR Behaviour Observed During Testing
&lt;/h1&gt;

&lt;p&gt;The following sequence occurred:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                             table created
                                   ↓
                              backup taken
                                   ↓
                             table deleted
                                   ↓
                           restore base backup
                                   ↓
                               WAL replay
                                   ↓
                           table still deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Even after restoring the backup, the table remained deleted. This happened because PostgreSQL continued &lt;strong&gt;replaying WAL segments&lt;/strong&gt; during recovery. Since the delete operation was recorded in WAL, it was replayed again during restore.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; PostgreSQL recovery is "greedy" by default—it will try to replay every piece of WAL it can find in the archive to bring you to the most recent state possible. By using &lt;code&gt;--type=immediate&lt;/code&gt;, we tell pgBackRest to tell PostgreSQL: "Stop as soon as you reach a consistent state," effectively ignoring any changes (like our accidental DELETE) that happened after the backup was finalized.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To prevent this, the restore command used:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the temporary container&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo restore &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;immediate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Understanding how WAL replay works is important when performing PITR.&lt;/p&gt;




&lt;h1&gt;
  
  
  Automating Backups with Cron
&lt;/h1&gt;

&lt;p&gt;Backups are usually automated using scheduled jobs.&lt;/p&gt;

&lt;p&gt;In this setup, a full backup is taken weekly, and differential backups are taken on the remaining days.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 0 pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo backup &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full
0 2 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1-6 pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;demo backup &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;diff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration creates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one &lt;strong&gt;full backup per week&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;daily &lt;strong&gt;differential backups&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This strategy balances storage efficiency and recovery speed.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Throughout this series, we’ve seen how &lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt; forms the foundation for several critical PostgreSQL features.&lt;/p&gt;

&lt;p&gt;The same WAL mechanism powers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;crash recovery&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;change data capture&lt;/li&gt;
&lt;li&gt;backups&lt;/li&gt;
&lt;li&gt;point-in-time recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding WAL makes it easier to understand how PostgreSQL maintains reliability and data safety in production environments.&lt;/p&gt;

&lt;p&gt;Tools like pgBackRest build on top of this mechanism to provide robust backup and restore workflows.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How Debezium Uses PostgreSQL WAL for Change Data Capture</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Wed, 11 Mar 2026 15:10:41 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/how-debezium-uses-postgresql-wal-for-change-data-capture-3l19</link>
      <guid>https://forem.com/mohhddhassan/how-debezium-uses-postgresql-wal-for-change-data-capture-3l19</guid>
      <description>&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;This article is part of a series on PostgreSQL WAL.

Part 1 — How PostgreSQL WAL Actually Works  
Part 2 — How Debezium Uses PostgreSQL WAL for CDC  
Part 3 — PostgreSQL Backups with pgBackRest and PITR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;In the previous article, we've explored how PostgreSQL uses &lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt; to ensure durability and recover from crashes.&lt;/p&gt;

&lt;p&gt;If you haven't read it yet, you can check it out here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/mohhddhassan/how-postgresql-wal-actually-works-and-why-everything-depends-on-it-2g81"&gt;How PostgreSQL WAL Actually Works (And Why Everything Depends on It)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But WAL is not just an internal recovery mechanism.&lt;/p&gt;

&lt;p&gt;Because WAL records &lt;strong&gt;every change that happens inside the database&lt;/strong&gt;, it also enables systems to stream those changes in real time.&lt;/p&gt;

&lt;p&gt;This is exactly how &lt;strong&gt;Change Data Capture (CDC)&lt;/strong&gt; systems work.&lt;/p&gt;

&lt;p&gt;Instead of polling tables repeatedly, CDC tools read database changes directly from WAL and turn them into events.&lt;/p&gt;

&lt;p&gt;One of the most widely used CDC platforms for PostgreSQL is Debezium.&lt;/p&gt;

&lt;p&gt;In this article, we’ll explore how Debezium reads PostgreSQL WAL and converts database operations into a stream of change events.&lt;/p&gt;




&lt;h1&gt;
  
  
  What is Change Data Capture (CDC)
&lt;/h1&gt;

&lt;p&gt;Applications often need to react to changes happening inside a database.&lt;/p&gt;

&lt;p&gt;For example, imagine a simple update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;UPDATE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Alice Cooper'&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That change might need to trigger other systems.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;updating a search index&lt;/li&gt;
&lt;li&gt;syncing data into a data warehouse&lt;/li&gt;
&lt;li&gt;updating analytics pipelines&lt;/li&gt;
&lt;li&gt;triggering downstream workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A traditional approach would be to &lt;strong&gt;periodically query the database&lt;/strong&gt; to detect changes.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;poll database every few seconds
compare results
detect differences
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But this approach has several problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;inefficient&lt;/li&gt;
&lt;li&gt;high database load&lt;/li&gt;
&lt;li&gt;delayed updates&lt;/li&gt;
&lt;li&gt;difficult to scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Change Data Capture solves this by streaming database changes &lt;strong&gt;as they happen&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why WAL Makes CDC Possible
&lt;/h1&gt;

&lt;p&gt;PostgreSQL writes every change into &lt;strong&gt;WAL&lt;/strong&gt; before modifying the actual table files.&lt;/p&gt;

&lt;p&gt;Because of this, WAL effectively contains a &lt;strong&gt;chronological record of all database operations&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of scanning tables, CDC systems can simply read this log.&lt;/p&gt;

&lt;p&gt;The idea looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            Application Query
                                    ↓
                                PostgreSQL
                                    ↓
                            WAL record created
                                    ↓
                            CDC tool reads WAL
                                    ↓
                             Event generated
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes CDC both efficient and reliable.&lt;/p&gt;

&lt;p&gt;The CDC system simply consumes changes in the order they occurred.&lt;/p&gt;

&lt;p&gt;Debezium works exactly this way.&lt;/p&gt;

&lt;p&gt;It does not read tables directly.&lt;br&gt;
Instead, it reads &lt;strong&gt;PostgreSQL WAL&lt;/strong&gt;.&lt;/p&gt;


&lt;h1&gt;
  
  
  Enabling Logical Decoding in PostgreSQL
&lt;/h1&gt;

&lt;p&gt;For CDC systems to read WAL, PostgreSQL must provide enough information about row-level changes.&lt;/p&gt;

&lt;p&gt;This is controlled by the &lt;code&gt;wal_level&lt;/code&gt; configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight conf"&gt;&lt;code&gt;&lt;span class="n"&gt;wal_level&lt;/span&gt; = &lt;span class="n"&gt;logical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL supports three WAL levels:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;wal_level&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;minimal&lt;/td&gt;
&lt;td&gt;crash recovery only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;replica&lt;/td&gt;
&lt;td&gt;physical replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;logical&lt;/td&gt;
&lt;td&gt;logical decoding (CDC tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When &lt;code&gt;wal_level&lt;/code&gt; is set to &lt;strong&gt;logical&lt;/strong&gt;, PostgreSQL records additional metadata needed for decoding row-level changes.&lt;/p&gt;

&lt;p&gt;This makes it possible for external systems to interpret WAL records and reconstruct database events.&lt;/p&gt;




&lt;h1&gt;
  
  
  Replication Slots
&lt;/h1&gt;

&lt;p&gt;When Debezium connects to PostgreSQL, it creates something called a &lt;strong&gt;replication slot&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A replication slot is managed by PostgreSQL and tracks the progress of a consumer reading WAL.&lt;/p&gt;

&lt;p&gt;Its job is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                            Debezium reads WAL
                                    ↓
                   Postgres tracks last consumed position
                                    ↓
                        WAL not deleted until consumed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents PostgreSQL from removing WAL segments that Debezium still needs.&lt;/p&gt;

&lt;p&gt;The position in WAL is identified using something called an &lt;strong&gt;LSN (Log Sequence Number)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;An LSN represents a specific position inside the WAL stream.&lt;/p&gt;

&lt;p&gt;Because the replication slot tracks the last processed LSN, Debezium can resume streaming from the correct position even if it restarts.&lt;/p&gt;




&lt;h1&gt;
  
  
  Logical Decoding Plugins
&lt;/h1&gt;

&lt;p&gt;WAL records are stored in a &lt;strong&gt;binary format&lt;/strong&gt; that is not directly readable by external systems.&lt;/p&gt;

&lt;p&gt;PostgreSQL solves this using &lt;strong&gt;logical decoding plugins&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;These plugins translate WAL records into a logical representation of database changes.&lt;/p&gt;

&lt;p&gt;Some commonly used plugins include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;pgoutput&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;wal2json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;decoderbufs&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default plugin used by PostgreSQL is &lt;strong&gt;pgoutput&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It converts WAL records into logical change events that CDC tools like Debezium can consume.&lt;/p&gt;




&lt;h1&gt;
  
  
  How Debezium Streams Database Changes
&lt;/h1&gt;

&lt;p&gt;When everything is configured, the end-to-end flow looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          Application writes data
                                    ↓
                      PostgreSQL writes change to WAL
                                    ↓
                    Logical decoding plugin interprets WAL
                                    ↓
                Debezium reads changes using replication protocol
                                    ↓
                     Debezium publishes events to Kafka
                                    ↓
                    Downstream systems consume the events
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This architecture allows applications to react to database changes in near real time.&lt;/p&gt;

&lt;p&gt;A simplified architecture might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                                PostgreSQL
                                    │
                                    ▼
                                 Debezium
                                    │
                                    ▼ 
                                  Kafka
                                    │
                                    ▼
                                Consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Consumers could include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;microservices&lt;/li&gt;
&lt;li&gt;analytics pipelines&lt;/li&gt;
&lt;li&gt;search indexing systems&lt;/li&gt;
&lt;li&gt;data warehouses&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Example Change Event
&lt;/h1&gt;

&lt;p&gt;When Debezium captures a change from WAL, it converts it into a structured event.&lt;/p&gt;

&lt;p&gt;A simplified example might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"table"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"users"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"before"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Alice"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;op&lt;/code&gt; field indicates the type of operation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;c&lt;/td&gt;
&lt;td&gt;create&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;u&lt;/td&gt;
&lt;td&gt;update&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;d&lt;/td&gt;
&lt;td&gt;delete&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These events can then be consumed by downstream systems to trigger further processing.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why WAL Is So Powerful
&lt;/h1&gt;

&lt;p&gt;What makes this architecture powerful is that &lt;strong&gt;multiple systems can reuse the same WAL stream&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The same WAL records that power CDC can also be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;backups&lt;/li&gt;
&lt;li&gt;point-in-time recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, WAL acts as a &lt;strong&gt;universal source of truth for database changes&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Change Data Capture allows systems to react to database changes without constantly querying tables.&lt;/p&gt;

&lt;p&gt;PostgreSQL makes this possible because every change is recorded in WAL.&lt;/p&gt;

&lt;p&gt;Tools like Debezium simply read that stream of changes and convert them into events that other systems can consume.&lt;/p&gt;

&lt;p&gt;This makes it possible to build event-driven architectures directly on top of a relational database.&lt;/p&gt;




&lt;h1&gt;
  
  
  Next Article in This Series
&lt;/h1&gt;

&lt;p&gt;In the next article, we’ll explore how PostgreSQL WAL is used for &lt;strong&gt;database backups and point-in-time recovery&lt;/strong&gt; using tools like pgBackRest.&lt;/p&gt;

&lt;p&gt;We’ll walk through how WAL archiving enables restoring a PostgreSQL database to a specific moment in time.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>backend</category>
      <category>database</category>
      <category>devops</category>
    </item>
    <item>
      <title>How PostgreSQL WAL Actually Works (And Why Everything Depends On It)</title>
      <dc:creator>Mohamed Hussain S</dc:creator>
      <pubDate>Mon, 09 Mar 2026 15:01:18 +0000</pubDate>
      <link>https://forem.com/mohhddhassan/how-postgresql-wal-actually-works-and-why-everything-depends-on-it-2g81</link>
      <guid>https://forem.com/mohhddhassan/how-postgresql-wal-actually-works-and-why-everything-depends-on-it-2g81</guid>
      <description>&lt;p&gt;Every change in PostgreSQL is written twice.&lt;/p&gt;

&lt;p&gt;First to &lt;strong&gt;WAL&lt;/strong&gt;.&lt;br&gt;
Then later to the &lt;strong&gt;actual table files&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This mechanism is called &lt;strong&gt;Write-Ahead Logging (WAL)&lt;/strong&gt; and it’s one of the most important parts of PostgreSQL’s architecture.&lt;/p&gt;

&lt;p&gt;Without WAL, PostgreSQL would not be able to safely recover from crashes, replicate data, or support tools like CDC pipelines and backup systems.&lt;/p&gt;

&lt;p&gt;In this Post we’ll break down:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how PostgreSQL stores data internally&lt;/li&gt;
&lt;li&gt;what WAL is and why it exists&lt;/li&gt;
&lt;li&gt;what WAL files actually contain&lt;/li&gt;
&lt;li&gt;how PostgreSQL recovers from crashes&lt;/li&gt;
&lt;li&gt;why replication, CDC, and backups depend on WAL&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  How Data Is Stored in PostgreSQL (Quick Glance)
&lt;/h1&gt;

&lt;p&gt;Before understanding WAL, it helps to know how PostgreSQL actually stores data on disk.&lt;/p&gt;

&lt;p&gt;When you create a table in PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;SERIAL&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL creates a &lt;strong&gt;table file on disk&lt;/strong&gt; inside its data directory.&lt;/p&gt;

&lt;p&gt;But rows are not stored randomly inside that file.&lt;/p&gt;

&lt;p&gt;Instead, PostgreSQL organizes table storage into &lt;strong&gt;fixed-size pages&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                               Table
                                 │
                                 ▼
                          Pages (8 KB each)
                                 │
                                 ▼
                       Rows stored inside pages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each page has a default size of &lt;strong&gt;8 KB&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Inside a page PostgreSQL stores multiple rows (also called &lt;strong&gt;tuples&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;Example simplified view:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Page (8 KB)

+---------------------+
| Row: (1, Alice)     |
| Row: (2, Bob)       |
| Row: (3, Charlie)   |
| Row: (4, David)     |
+---------------------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When new data is inserted, PostgreSQL finds a page with enough free space and places the row there.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why PostgreSQL Writes to WAL Before Writing Data
&lt;/h1&gt;

&lt;p&gt;PostgreSQL does not directly write changes to the table files.&lt;/p&gt;

&lt;p&gt;Instead it first records the change in &lt;strong&gt;WAL (Write-Ahead Log)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The rule PostgreSQL follows is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                        Write change to WAL
                                 ↓
                         Flush WAL to disk
                                 ↓
                 Apply change to table pages later
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This guarantees that if the database crashes, PostgreSQL still knows &lt;strong&gt;what changes were supposed to happen&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;WAL acts as a durable record of database operations.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Problem Databases Need to Solve
&lt;/h1&gt;

&lt;p&gt;Consider a simple query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Alice'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For PostgreSQL to execute this, it must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;modify a page inside the table file&lt;/li&gt;
&lt;li&gt;write the updated page to disk&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But what happens if the database crashes &lt;strong&gt;mid-write&lt;/strong&gt;?&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         write page to disk
                                 ↓
                           power failure
                                 ↓
                       page partially written
                                 ↓
                     database becomes corrupted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Databases must guarantee &lt;strong&gt;durability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Once a transaction commits, the data should survive crashes.&lt;/p&gt;

&lt;p&gt;This is exactly the problem &lt;strong&gt;Write-Ahead Logging solves&lt;/strong&gt;.&lt;/p&gt;




&lt;h1&gt;
  
  
  What WAL Actually Contains
&lt;/h1&gt;

&lt;p&gt;A common misconception is that WAL stores full rows.&lt;/p&gt;

&lt;p&gt;It doesn’t.&lt;/p&gt;

&lt;p&gt;Instead, WAL stores &lt;strong&gt;instructions describing what changed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A simplified idea of a WAL record might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;page&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;123&lt;/span&gt;
&lt;span class="na"&gt;offset&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;45&lt;/span&gt;
&lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;insert&lt;/span&gt;
&lt;span class="na"&gt;tuple&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;('Alice')&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In other words:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;WAL records describe &lt;em&gt;how to reproduce the change&lt;/em&gt;, not the entire dataset.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This makes WAL much smaller and faster to write than copying full data pages.&lt;/p&gt;




&lt;h1&gt;
  
  
  Where WAL Is Stored
&lt;/h1&gt;

&lt;p&gt;All WAL files are stored in the PostgreSQL directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pg_wal/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside this directory you’ll see files like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;000000010000000000000001
000000010000000000000002
000000010000000000000003
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These files are called &lt;strong&gt;WAL segments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Each segment has a fixed size.&lt;/p&gt;

&lt;p&gt;Default size:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;16 MB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a segment fills up, PostgreSQL automatically creates a new one.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                           segment fills
                                 ↓
                       new WAL segment created
                                 ↓
                         logging continues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This process happens continuously while the database is running.&lt;/p&gt;




&lt;h1&gt;
  
  
  When Table Pages Are Actually Written
&lt;/h1&gt;

&lt;p&gt;Once WAL is safely flushed to disk, PostgreSQL can update the actual table pages.&lt;/p&gt;

&lt;p&gt;These writes happen through background processes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;background writer&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;checkpoints&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows PostgreSQL to batch writes efficiently and reduce disk I/O.&lt;/p&gt;

&lt;p&gt;Even if the database crashes before the page write finishes, WAL already contains enough information to recover.&lt;/p&gt;




&lt;h1&gt;
  
  
  Crash Recovery
&lt;/h1&gt;

&lt;p&gt;When PostgreSQL starts after a crash, it performs &lt;strong&gt;WAL replay&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The process looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         database crashes
                                 ↓
                       postgres starts again
                                 ↓
                          read WAL records
                                 ↓
                           replay changes
                                 ↓
                     database becomes consistent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that committed transactions are not lost.&lt;/p&gt;

&lt;p&gt;WAL acts as the &lt;strong&gt;source of truth&lt;/strong&gt; for rebuilding the database state.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why WAL Is More Than Crash Recovery
&lt;/h1&gt;

&lt;p&gt;Although WAL was originally designed for crash recovery, it powers several important PostgreSQL capabilities today.&lt;/p&gt;




&lt;h2&gt;
  
  
  Replication
&lt;/h2&gt;

&lt;p&gt;Standby servers replicate data by &lt;strong&gt;replaying WAL records&lt;/strong&gt; from the primary server.&lt;/p&gt;

&lt;p&gt;This allows replicas to stay synchronized with the primary database.&lt;/p&gt;




&lt;h2&gt;
  
  
  Change Data Capture (CDC)
&lt;/h2&gt;

&lt;p&gt;Tools like Debezium read WAL using logical decoding.&lt;/p&gt;

&lt;p&gt;Instead of scanning database tables, they stream changes directly from WAL.&lt;/p&gt;

&lt;p&gt;This enables architectures like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                              Postgres
                                 │
                                 ▼
                              Debezium
                                 │
                                 ▼
                               Kafka
                                 │
                                 ▼
                             Consumers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applications can react to database changes in near real time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Backups and Point-in-Time Recovery
&lt;/h2&gt;

&lt;p&gt;Backup tools such as pgBackRest continuously archive WAL segments.&lt;/p&gt;

&lt;p&gt;This enables &lt;strong&gt;Point-in-Time Recovery (PITR)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                          base backup taken
                                 ↓
                      WAL archived continuously
                                 ↓
                          restore database
                                 ↓
                   replay WAL until target timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without WAL archiving, restoring a database to an exact moment in time would not be possible.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Simple Mental Model
&lt;/h1&gt;

&lt;p&gt;One way to think about PostgreSQL is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tables store current state
WAL stores history of changes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whenever PostgreSQL needs to recover, replicate, or stream changes, it uses that history.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Write-Ahead Logging is one of the core mechanisms that makes PostgreSQL reliable.&lt;/p&gt;

&lt;p&gt;It enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;crash recovery&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;change data capture&lt;/li&gt;
&lt;li&gt;backup systems&lt;/li&gt;
&lt;li&gt;point-in-time recovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many ways, WAL is the &lt;strong&gt;foundation that many PostgreSQL features are built on&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Understanding WAL makes it much easier to understand how the rest of the PostgreSQL ecosystem works.&lt;/p&gt;




&lt;h1&gt;
  
  
  Next Article in This Series
&lt;/h1&gt;

&lt;p&gt;In the next article, we’ll look at how tools like Debezium read PostgreSQL WAL to stream database changes in real time using logical decoding.&lt;/p&gt;

&lt;p&gt;Later in the series, we’ll also explore how WAL is used for &lt;strong&gt;database backups and point-in-time recovery&lt;/strong&gt; with tools like pgBackRest.&lt;/p&gt;




</description>
      <category>postgres</category>
      <category>database</category>
      <category>backend</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
