<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Sasi Kumar T</title>
    <description>The latest articles on Forem by Sasi Kumar T (@sasikumart).</description>
    <link>https://forem.com/sasikumart</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F978974%2F6a851114-1acf-45c2-8190-780987312174.jpg</url>
      <title>Forem: Sasi Kumar T</title>
      <link>https://forem.com/sasikumart</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/sasikumart"/>
    <language>en</language>
    <item>
      <title>Tracing the Express Middleware Nobody Talks About: Compression</title>
      <dc:creator>Sasi Kumar T</dc:creator>
      <pubDate>Thu, 05 Mar 2026 15:19:12 +0000</pubDate>
      <link>https://forem.com/sasikumart/tracing-the-express-middleware-nobody-talks-about-compression-5g7b</link>
      <guid>https://forem.com/sasikumart/tracing-the-express-middleware-nobody-talks-about-compression-5g7b</guid>
      <description>&lt;p&gt;When we talk about &lt;strong&gt;observability in Node.js&lt;/strong&gt;, we usually trace things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP requests&lt;/li&gt;
&lt;li&gt;Database queries&lt;/li&gt;
&lt;li&gt;External API calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there’s a &lt;strong&gt;hidden performance layer most tracing setups completely ignore&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Response compression&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Many Express applications use the popular &lt;code&gt;compression&lt;/code&gt; middleware to reduce response size and improve latency.&lt;/p&gt;

&lt;p&gt;But here’s the problem.&lt;br&gt;
&lt;strong&gt;Nobody traces it.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why Compression Deserves Tracing
&lt;/h2&gt;

&lt;p&gt;Compression happens &lt;strong&gt;after your application logic finishes&lt;/strong&gt; but &lt;strong&gt;before the response is sent to the client&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP Request
     │
     ▼
Express Route
     │
     ▼
Application Logic
     │
     ▼
Compression (zlib)
     │
     ▼
Response Sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If compression is slow, you may see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;slower response times&lt;/li&gt;
&lt;li&gt;increased CPU usage&lt;/li&gt;
&lt;li&gt;delayed response delivery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But in most observability setups, &lt;strong&gt;this time is invisible&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Your traces might show something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP Request
  └── Route Handler
        └── DB Query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But you &lt;strong&gt;never see compression time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s exactly the gap I wanted to close.&lt;/p&gt;




&lt;h2&gt;
  
  
  Introducing Compression Instrumentation
&lt;/h2&gt;

&lt;p&gt;I built a small OpenTelemetry instrumentation package to trace compression operations.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://www.npmjs.com/package/@sasikumart/compression-instrumentation" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/@sasikumart/compression-instrumentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This package instruments &lt;strong&gt;Node.js zlib compression calls&lt;/strong&gt; and exposes them as OpenTelemetry spans.&lt;/p&gt;

&lt;p&gt;It allows you to observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;compression duration&lt;/li&gt;
&lt;li&gt;compression algorithm used&lt;/li&gt;
&lt;li&gt;impact of compression on request latency&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What the Trace Looks Like
&lt;/h2&gt;

&lt;p&gt;Without compression instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP GET /users
  └── controller.getUsers
        └── mongodb query
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With instrumentation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP GET /users
  └── controller.getUsers
        └── mongodb query
  └── gzip compression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now compression becomes a &lt;strong&gt;first-class span in your trace&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installing the Instrumentation
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @sasikumart/compression-instrumentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Basic Setup
&lt;/h2&gt;

&lt;p&gt;Register the instrumentation with OpenTelemetry.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ZlibInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@sasikumart/compression-instrumentation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ZlibInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once enabled, any &lt;strong&gt;gzip compression performed via Node.js zlib&lt;/strong&gt; will automatically generate spans.&lt;/p&gt;
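&lt;p&gt;To see the operation being wrapped, here is a direct zlib call of the kind the instrumentation hooks. This is a minimal sketch: the &lt;code&gt;compression&lt;/code&gt; middleware itself uses zlib's streaming API, and the exact span names the package emits may differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import zlib from "node:zlib";

// One-shot gzip of a response-sized buffer. With the instrumentation
// registered, zlib calls like this show up as spans in the active trace.
const body = Buffer.from(JSON.stringify({ message: "hello".repeat(500) }));
const compressed = zlib.gzipSync(body);
const restored = zlib.gunzipSync(compressed);

console.log(body.length, compressed.length); // compressed is far smaller
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;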




&lt;h2&gt;
  
  
  Example Express Application
&lt;/h2&gt;

&lt;p&gt;Now let’s use Express with the compression middleware.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;express&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;express&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;compression&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;compression&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;express&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;use&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;compression&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;OpenTelemetry is awesome&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Server running on port 3000&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a request is served, compression will trigger a &lt;strong&gt;zlib gzip operation&lt;/strong&gt;, which will now appear in your traces.&lt;/p&gt;
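&lt;p&gt;You can estimate how much that span is worth for this particular route: the repeated-string payload compresses extremely well. A quick sketch (the exact ratio depends on zlib defaults):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import zlib from "node:zlib";

// Same payload the /data route above serializes.
const payload = JSON.stringify(Array(1000).fill("OpenTelemetry is awesome"));
const gzipped = zlib.gzipSync(Buffer.from(payload));

console.log(payload.length); // roughly 27 KB of JSON
console.log(gzipped.length); // dramatically smaller after gzip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;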




&lt;h2&gt;
  
  
  Jaeger Trace Example
&lt;/h2&gt;

&lt;p&gt;Here’s an example trace captured in &lt;strong&gt;Jaeger&lt;/strong&gt; where the gzip compression span appears alongside the HTTP request span.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futvchcqjofunzoig53vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Futvchcqjofunzoig53vn.png" alt="Compression Span in Jaeger" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The trace typically shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP request span&lt;/li&gt;
&lt;li&gt;route handler span&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gzip compression span generated by the instrumentation&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Gets Captured
&lt;/h2&gt;

&lt;p&gt;Each compression operation becomes its own span, allowing you to observe:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how long compression takes&lt;/li&gt;
&lt;li&gt;how much latency compression adds to the request&lt;/li&gt;
&lt;li&gt;whether compression becomes a CPU bottleneck&lt;/li&gt;
&lt;/ul&gt;
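&lt;p&gt;The span duration corresponds to wall-clock time spent inside zlib. As a rough sanity check, you can measure the same thing by hand; this sketch is not part of the package, just an illustration of the cost being captured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import zlib from "node:zlib";

// 5 MB of highly compressible data.
const input = Buffer.alloc(5 * 1024 * 1024, "a");

const start = process.hrtime.bigint();
zlib.gzipSync(input);
const durationMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(durationMs); // CPU cost grows with payload size
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;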




&lt;h2&gt;
  
  
  Current Limitation
&lt;/h2&gt;

&lt;p&gt;At the moment, the instrumentation &lt;strong&gt;only captures gzip compression&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is because most Express applications using the &lt;code&gt;compression&lt;/code&gt; middleware default to &lt;strong&gt;gzip via Node's zlib module&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Future versions may add support for additional algorithms such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deflate&lt;/li&gt;
&lt;li&gt;brotli&lt;/li&gt;
&lt;/ul&gt;
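&lt;p&gt;Both algorithms are also exposed by Node's &lt;code&gt;zlib&lt;/code&gt; module (brotli since Node 11.7), so the same wrapping approach should extend naturally. A sketch of the underlying calls; nothing here is part of the current package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import zlib from "node:zlib";

const input = Buffer.from("OpenTelemetry is awesome".repeat(100));

// deflate: same zlib codec as gzip, different wrapper format.
const deflated = zlib.deflateSync(input);
const inflated = zlib.inflateSync(deflated);

// brotli: a separate codec, also shipped in node:zlib.
const brotli = zlib.brotliCompressSync(input);
const debrotli = zlib.brotliDecompressSync(brotli);

console.log(deflated.length, brotli.length);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;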




&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Compression is one of those &lt;strong&gt;performance optimizations everyone enables but rarely measures&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Observability should show &lt;strong&gt;everything that affects latency&lt;/strong&gt;, including middleware and runtime operations.&lt;/p&gt;

&lt;p&gt;Tracing compression helps you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify CPU bottlenecks&lt;/li&gt;
&lt;li&gt;understand hidden latency in responses&lt;/li&gt;
&lt;li&gt;gain deeper insight into the full request lifecycle&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real Use Cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;High traffic APIs: Compression CPU overhead can become significant under heavy load.&lt;/li&gt;
&lt;li&gt;Large JSON responses: You can verify whether compression is actually helping reduce payload sizes.&lt;/li&gt;
&lt;li&gt;Performance tuning: You can determine whether compression should happen at the &lt;strong&gt;application layer or a reverse proxy&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Observability often focuses on &lt;strong&gt;business logic&lt;/strong&gt;, but real latency comes from &lt;strong&gt;every layer of the request lifecycle&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Compression is one of those layers that has remained invisible for too long.&lt;/p&gt;

&lt;p&gt;Now it doesn’t have to be.&lt;/p&gt;




&lt;p&gt;If you're interested in Node.js observability and OpenTelemetry, you might also like my previous deep dive:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://dev.to/sasikumart/tracing-the-part-of-mongoose-nobody-talks-about-3gj2"&gt;https://dev.to/sasikumart/tracing-the-part-of-mongoose-nobody-talks-about-3gj2&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.&lt;/p&gt;

&lt;p&gt;Feel free to connect with me on &lt;a href="https://www.linkedin.com/in/sasi-kumar-thangavel/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to learn more about my professional journey and projects.&lt;/p&gt;

</description>
      <category>node</category>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>express</category>
    </item>
    <item>
      <title>Tracing the part of Mongoose nobody talks about</title>
      <dc:creator>Sasi Kumar T</dc:creator>
      <pubDate>Fri, 27 Feb 2026 10:11:38 +0000</pubDate>
      <link>https://forem.com/sasikumart/tracing-the-part-of-mongoose-nobody-talks-about-3gj2</link>
      <guid>https://forem.com/sasikumart/tracing-the-part-of-mongoose-nobody-talks-about-3gj2</guid>
      <description>&lt;p&gt;I have published an OpenTelemetry plugin that instruments the Mongoose hydration lifecycle.&lt;br&gt;
&lt;a href="https://www.npmjs.com/package/@sasikumart/mongoose-hydrate-instrumentation" rel="noopener noreferrer"&gt;&lt;code&gt;@sasikumart/mongoose-hydrate-instrumentation&lt;/code&gt;&lt;/a&gt; &lt;/p&gt;
&lt;h2&gt;
  
  
  What's the problem?
&lt;/h2&gt;

&lt;p&gt;If you're already using OTEL with Mongoose, you're probably tracing your queries — &lt;code&gt;find&lt;/code&gt;, &lt;code&gt;insertOne&lt;/code&gt;, &lt;code&gt;updateMany&lt;/code&gt;, and so on. That covers a lot, but it misses something that happens &lt;em&gt;after&lt;/em&gt; the query returns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document hydration.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When Mongoose gets raw BSON/JSON back from MongoDB, it doesn't just hand it to you as-is. It runs it through a whole initialization process, applying defaults, attaching virtuals, setting up getters/setters, populating nested subdocuments. For simple schemas with small result sets this is negligible. But for complex schemas or queries returning hundreds of documents, hydration can be a real cost that's completely invisible in your traces.&lt;/p&gt;
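&lt;p&gt;To make that concrete, here is a plain-JavaScript caricature of the per-document work. This is not Mongoose's actual internals, just the shape of it: defaults and virtuals applied to every row, so the cost scales with result-set size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical mini-schema: one default and one virtual.
const defaults = { active: true };

function hydrate(raw) {
  const doc = Object.assign({}, defaults, raw);
  Object.defineProperty(doc, "displayName", {
    get: function () { return String(doc.name).toUpperCase(); },
  });
  return doc;
}

// For a query returning hundreds of rows, work like this runs per document.
const rows = [{ name: "ada" }, { name: "grace" }];
const docs = rows.map(hydrate);

console.log(docs[0].active, docs[0].displayName);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;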

&lt;p&gt;That blind spot is what this plugin fixes.&lt;/p&gt;
&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;It patches Mongoose's internal hydration lifecycle and wraps it with OTEL spans, so you get visibility into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;mongoose.document.init&lt;/code&gt; — which is invoked by &lt;code&gt;mongoose.hydrate&lt;/code&gt; to initialize a &lt;code&gt;new Document&lt;/code&gt; from raw data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each span includes attributes like the model name and document count.&lt;/p&gt;
&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @sasikumart/mongoose-hydrate-instrumentation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Register it with your Node SDK before your models load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;NodeSDK&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/sdk-node&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;MongooseHydrateInstrumentation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@sasikumart/mongoose-hydrate-instrumentation&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sdk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NodeSDK&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instrumentations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MongooseHydrateInstrumentation&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Your existing OTEL backend (Jaeger, Datadog, Honeycomb, OTLP — whatever you use) will start receiving the hydration spans automatically.&lt;/p&gt;




&lt;p&gt;The package is on npm and the source is on &lt;a href="https://github.com/Sasikumar3096/mongoose-hydrate-instrumentation" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. If you run into any issues or have ideas for improvement, PRs and issues are very welcome.&lt;/p&gt;




&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.&lt;/p&gt;

&lt;p&gt;Feel free to connect with me on &lt;a href="https://www.linkedin.com/in/sasi-kumar-thangavel/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to learn more about my professional journey and projects.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>mongoose</category>
      <category>node</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>Postgres “almost” Outage Postmortem: The Hidden Dangers of Replication Slots and Autovacuum</title>
      <dc:creator>Sasi Kumar T</dc:creator>
      <pubDate>Sat, 01 Mar 2025 17:20:16 +0000</pubDate>
      <link>https://forem.com/sasikumart/postgres-almost-outage-postmortem-the-hidden-dangers-of-replication-slots-and-autovacuum-2nem</link>
      <guid>https://forem.com/sasikumart/postgres-almost-outage-postmortem-the-hidden-dangers-of-replication-slots-and-autovacuum-2nem</guid>
      <description>&lt;p&gt;One day, I noticed something odd with our Postgres instance. It was experiencing intermittent performance spikes, load averages were hitting unusually high levels, and I/O utilization occasionally maxed out at 100%. Most workloads were running smoothly, but these spikes felt off. So, I put on my debugging hat and got to work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn5g3wlof0h5zvz5duvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcn5g3wlof0h5zvz5duvx.png" alt="CPU Usage" width="467" height="291"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcclx8tss3pbbfa95luo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcclx8tss3pbbfa95luo.png" alt="System Load" width="440" height="280"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxw18jpt1z0y83y16rxp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuxw18jpt1z0y83y16rxp.png" alt="Memory Usage" width="453" height="290"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Investigating the Spikes
&lt;/h3&gt;

&lt;p&gt;While analyzing the running queries, I discovered that &lt;strong&gt;Autovacuum&lt;/strong&gt; was actively running on the &lt;em&gt;tasks&lt;/em&gt; database. This database stores pre-scheduled tasks for one of our services, which meant it had a high volume of write activity. Given that churn, seeing Autovacuum run wasn’t surprising, and I suspected it was the main contributor to the performance issue.&lt;/p&gt;

&lt;p&gt;To mitigate the problem, we decided to &lt;strong&gt;move the &lt;em&gt;tasks&lt;/em&gt; database to a new VM&lt;/strong&gt; so that the other systems wouldn’t be impacted.&lt;/p&gt;

&lt;p&gt;Once the migration was complete, the results were immediate and significant. The screenshots below show the improvement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facv0w6melvlgjvcpjg7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facv0w6melvlgjvcpjg7f.png" alt="System Load after moving *tasks* DB out" width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzotw0lms4l81sz0ns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzotw0lms4l81sz0ns.png" alt="Disk IOps" width="800" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5araqy85ndfnks99zv40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5araqy85ndfnks99zv40.png" alt="IO Utilisation" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graphs showed a significant improvement, &lt;strong&gt;I/O utilization no longer spiked beyond 40%&lt;/strong&gt;, and the overall average had dropped below 15%. A big win, right? Well, &lt;strong&gt;yes and no&lt;/strong&gt;. While the system was in much better shape, I still wasn’t fully satisfied.&lt;/p&gt;

&lt;p&gt;I also noticed occasional &lt;strong&gt;Autovacuum processes&lt;/strong&gt; running during business hours. While they weren’t as aggressive as they had been for the &lt;em&gt;tasks&lt;/em&gt; database, they still contributed to some system load. This got me thinking, &lt;strong&gt;could I optimize things even further?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;With the immediate issue resolved, I swapped my &lt;strong&gt;debugging hat&lt;/strong&gt; for my &lt;strong&gt;analyzing glasses&lt;/strong&gt; and dug deeper.&lt;/p&gt;
&lt;h3&gt;
  
  
  Optimizing Storage with Manual Vacuum
&lt;/h3&gt;

&lt;p&gt;Since the &lt;em&gt;tasks&lt;/em&gt; database had previously triggered frequent Autovacuum processes, I decided to run a manual &lt;code&gt;VACUUM FULL&lt;/code&gt; on its tables. With the actual database now running on a different machine and the copy on this VM no longer in use, this seemed like a safe and logical step.&lt;/p&gt;

&lt;p&gt;Before running the vacuum, I took note of the storage usage across all databases on the VM:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;database_name&lt;/th&gt;
&lt;th&gt;size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tasks&lt;/td&gt;
&lt;td&gt;42 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wickets&lt;/td&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;leash&lt;/td&gt;
&lt;td&gt;6833 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;toolbox&lt;/td&gt;
&lt;td&gt;5857 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bases&lt;/td&gt;
&lt;td&gt;1126 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That adds up to roughly 67.8 GB in total. &lt;/p&gt;
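&lt;p&gt;For reference, per-database sizes like these can be pulled from the catalog with a query along these lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT datname AS database_name,
       pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;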
&lt;h3&gt;
  
  
  Vacuuming &lt;em&gt;tasks&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s a before-and-after comparison of the vacuum operation on the &lt;em&gt;tasks&lt;/em&gt; database:&lt;/p&gt;

&lt;p&gt;Before&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;relname&lt;/th&gt;
&lt;th&gt;n_live_tup&lt;/th&gt;
&lt;th&gt;n_dead_tup&lt;/th&gt;
&lt;th&gt;total_size&lt;/th&gt;
&lt;th&gt;table_size&lt;/th&gt;
&lt;th&gt;indexes_size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p1&lt;/td&gt;
&lt;td&gt;1,026,187&lt;/td&gt;
&lt;td&gt;176,574&lt;/td&gt;
&lt;td&gt;13 GB&lt;/td&gt;
&lt;td&gt;4956 MB&lt;/td&gt;
&lt;td&gt;8211 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p2&lt;/td&gt;
&lt;td&gt;1,023,375&lt;/td&gt;
&lt;td&gt;57,710&lt;/td&gt;
&lt;td&gt;13 GB&lt;/td&gt;
&lt;td&gt;4666 MB&lt;/td&gt;
&lt;td&gt;7748 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p3&lt;/td&gt;
&lt;td&gt;1,020,129&lt;/td&gt;
&lt;td&gt;12369&lt;/td&gt;
&lt;td&gt;12 GB&lt;/td&gt;
&lt;td&gt;4562 MB&lt;/td&gt;
&lt;td&gt;7782 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;After&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;relname&lt;/th&gt;
&lt;th&gt;n_live_tup&lt;/th&gt;
&lt;th&gt;n_dead_tup&lt;/th&gt;
&lt;th&gt;total_size&lt;/th&gt;
&lt;th&gt;table_size&lt;/th&gt;
&lt;th&gt;indexes_size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p1&lt;/td&gt;
&lt;td&gt;1,026,187&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1298 MB&lt;/td&gt;
&lt;td&gt;1138 MB&lt;/td&gt;
&lt;td&gt;140 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p2&lt;/td&gt;
&lt;td&gt;1,023,375&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1296 MB&lt;/td&gt;
&lt;td&gt;1137 MB&lt;/td&gt;
&lt;td&gt;140 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tasks_p3&lt;/td&gt;
&lt;td&gt;1,020,129&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;1297 MB&lt;/td&gt;
&lt;td&gt;1136 MB&lt;/td&gt;
&lt;td&gt;140 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
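&lt;p&gt;Numbers like the ones in these tables come from &lt;code&gt;pg_stat_user_tables&lt;/code&gt;; a query along these lines reproduces them:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       pg_size_pretty(pg_relation_size(relid)) AS table_size,
       pg_size_pretty(pg_indexes_size(relid)) AS indexes_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;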
&lt;h3&gt;
  
  
  The Shocking Discovery
&lt;/h3&gt;

&lt;p&gt;Manual vacuuming &lt;strong&gt;cut our storage usage by 32 GB&lt;/strong&gt;, reducing &lt;em&gt;tasks&lt;/em&gt; from 36 GB to just &lt;strong&gt;4 GB,&lt;/strong&gt; an &lt;strong&gt;89% reduction&lt;/strong&gt;. Seeing such a drastic change made me wonder: &lt;strong&gt;What if I applied the same process to the other databases on this Postgres instance?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, I did exactly that. As a result, disk usage dropped from roughly 86% to 76%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ien8sl7g1v22lhx3yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F14ien8sl7g1v22lhx3yy.png" alt="Disk usage after FULL Vacuum" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s when I stumbled upon something strange. All the databases combined used around &lt;strong&gt;70 GB&lt;/strong&gt;, but the metrics showed the cluster consuming nearly &lt;strong&gt;1.06 TB&lt;/strong&gt; of storage.&lt;/p&gt;

&lt;p&gt;Something wasn’t adding up. &lt;strong&gt;Where was all that extra space being used?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So, I started digging into PostgreSQL’s &lt;strong&gt;data directory&lt;/strong&gt; and quickly found a directory full of &lt;strong&gt;WAL (Write-Ahead Log) files&lt;/strong&gt; consuming the majority of the disk space. This didn’t make sense at all; our cluster wasn’t configured to retain WAL files indefinitely.&lt;/p&gt;
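&lt;p&gt;On PostgreSQL 10 and later, you can measure WAL usage from SQL instead of poking at the filesystem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;SELECT pg_size_pretty(sum(size)) AS wal_size
FROM pg_ls_waldir();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;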

&lt;blockquote&gt;
&lt;p&gt;Write-Ahead Logging (WAL) is a critical feature in PostgreSQL that ensures data integrity and durability. It records changes to the database before they are applied, allowing PostgreSQL to recover data and restore the database to its most recent state in case of a crash or failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Curious about why so many WAL files were piling up, I turned to &lt;strong&gt;Google&lt;/strong&gt; for possible explanations. One potential culprit stood out: &lt;strong&gt;inactive replication slots&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A &lt;strong&gt;replication slot&lt;/strong&gt; is a feature that ensures WAL (Write-Ahead Log) files are retained until a &lt;strong&gt;replica (standby server) or a logical consumer&lt;/strong&gt; has processed them. It prevents WAL files from being removed before they are received by the subscriber, ensuring smooth replication.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To confirm this, I ran the following query to check for active replication slots:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_replication_slots&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
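
&lt;p&gt;Beyond just listing the slots, it’s worth estimating how much WAL each slot is pinning. A sketch using &lt;code&gt;pg_wal_lsn_diff&lt;/code&gt; (PostgreSQL 10+ naming):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Distance between the current WAL position and each slot's restart_lsn,
-- i.e. roughly how much WAL the slot is forcing the server to keep
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;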



&lt;p&gt;Ideally, there should have been &lt;strong&gt;no replication slots at all&lt;/strong&gt;, but instead I found &lt;strong&gt;two logical replication slots,&lt;/strong&gt; remnants of an old experiment that were never removed. Since these slots were inactive, the cluster was &lt;strong&gt;holding onto WAL files indefinitely&lt;/strong&gt;, waiting for the corresponding consumers to process them. But those consumers were &lt;strong&gt;never coming back online&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This was the root cause of the massive storage bloat.&lt;/p&gt;

&lt;p&gt;Next, I took the logical step of &lt;strong&gt;removing the inactive replication slots&lt;/strong&gt;. As soon as I did, the disk usage &lt;strong&gt;dropped to just 40 GB&lt;/strong&gt;, freeing up a massive amount of space instantly.&lt;/p&gt;
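
&lt;p&gt;Dropping a slot is a single function call. The slot name below is illustrative; take the real one from &lt;code&gt;pg_replication_slots&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Permanently removes the slot and releases the WAL it was retaining
SELECT pg_drop_replication_slot('old_logical_slot');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;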

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmm6iryvh3ys9fp3zifi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmm6iryvh3ys9fp3zifi.png" alt="Disk Usage after removing replication slots" width="800" height="336"&gt;&lt;/a&gt;&lt;br&gt;
I had achieved something great: &lt;strong&gt;a massive reduction in storage&lt;/strong&gt;. But that wasn’t my primary goal. My real objective was to &lt;strong&gt;reduce system load&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Strangely, after removing the replication slots, &lt;strong&gt;Autovacuum didn’t run as frequently as before&lt;/strong&gt;. While the outcome was exactly what I wanted, the &lt;strong&gt;"why"&lt;/strong&gt; behind it kept bothering me. I needed to understand &lt;strong&gt;how&lt;/strong&gt; this change had affected Autovacuum’s behavior.&lt;/p&gt;

&lt;p&gt;The following shows the CPU usage and load average for the week:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtgwcmqisflmj9sh0i72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtgwcmqisflmj9sh0i72.png" alt="CPU Usage for the week" width="800" height="144"&gt;&lt;/a&gt;&lt;br&gt;
After &lt;strong&gt;removing the replication slots&lt;/strong&gt;, CPU usage dropped &lt;strong&gt;from &lt;code&gt;~80%&lt;/code&gt; to &lt;code&gt;&amp;lt;10%&lt;/code&gt;&lt;/strong&gt; during business hours and &lt;strong&gt;from &lt;code&gt;~35%&lt;/code&gt; to &amp;lt; &lt;code&gt;1.5%&lt;/code&gt;&lt;/strong&gt; during off-hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  How and Why?
&lt;/h3&gt;

&lt;p&gt;It was only later that I realized &lt;strong&gt;why Autovacuum had been running so aggressively before&lt;/strong&gt;: it wasn't actually cleaning up &lt;strong&gt;dead tuples&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To break it down, let’s go through the &lt;strong&gt;5 Whys&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Why wasn’t Autovacuum removing dead tuples, even though it was running frequently?
&lt;/h3&gt;

&lt;p&gt;The problem was caused by &lt;strong&gt;inactive replication slots&lt;/strong&gt;. Replication had initially been set up on the &lt;code&gt;toolbox&lt;/code&gt; database, and its leftover slots &lt;strong&gt;prevented Autovacuum from cleaning up dead tuples across the entire cluster&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. How do replication slots impact Autovacuum?
&lt;/h3&gt;

&lt;p&gt;Replication slots track a value called &lt;strong&gt;&lt;code&gt;catalog_xmin&lt;/code&gt;&lt;/strong&gt;, which represents the oldest transaction ID affecting system catalogs that must be retained.&lt;/p&gt;

&lt;p&gt;When a replication slot is inactive, PostgreSQL &lt;strong&gt;retains the WAL files from the slot’s position onward&lt;/strong&gt; and, crucially, keeps the slot’s &lt;code&gt;catalog_xmin&lt;/code&gt; horizon pinned at that old transaction ID. Because the horizon cannot advance, Autovacuum &lt;strong&gt;cannot remove dead tuples&lt;/strong&gt; newer than it, and they keep accumulating until the replication slot catches up or is removed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. How does a replication slot in one database block Autovacuum on another?
&lt;/h3&gt;

&lt;p&gt;Replication slots are &lt;strong&gt;cluster-wide objects&lt;/strong&gt;, and the &lt;code&gt;catalog_xmin&lt;/code&gt; horizon they pin &lt;strong&gt;applies to the entire cluster&lt;/strong&gt;, not just the database the slot was created in. So an inactive replication slot on &lt;strong&gt;one database&lt;/strong&gt; can prevent Autovacuum from removing dead tuples &lt;strong&gt;in all databases&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This meant that Autovacuum &lt;strong&gt;was running&lt;/strong&gt; but &lt;strong&gt;skipping cleanup&lt;/strong&gt;, as the replication slot still "needed" those old tuples.&lt;/p&gt;
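
&lt;p&gt;This “running but not cleaning” state shows up in &lt;code&gt;pg_stat_user_tables&lt;/code&gt;: dead-tuple counts stay high even though &lt;code&gt;last_autovacuum&lt;/code&gt; is recent. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Tables with the most dead tuples, next to their last (auto)vacuum times
SELECT relname, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;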

&lt;h3&gt;
  
  
  4. If Autovacuum couldn't remove dead tuples, how did a manual vacuum reclaim space?
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;VACUUM FULL&lt;/strong&gt; (manual full vacuum) works differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It creates &lt;strong&gt;a new physical table file&lt;/strong&gt; and copies only &lt;strong&gt;live tuples&lt;/strong&gt; into it.&lt;/li&gt;
&lt;li&gt;The old table (containing dead tuples) is &lt;strong&gt;completely discarded&lt;/strong&gt;, freeing up space.&lt;/li&gt;
&lt;li&gt;Unlike Autovacuum, &lt;strong&gt;VACUUM FULL&lt;/strong&gt; rewrites the table wholesale rather than pruning tuple by tuple against &lt;code&gt;catalog_xmin&lt;/code&gt;, so in this case it reclaimed space that the replication slots were causing Autovacuum to leave behind.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This explains why &lt;strong&gt;manual vacuuming instantly freed up disk space&lt;/strong&gt;, while Autovacuum had been ineffective.&lt;/p&gt;
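
&lt;p&gt;For reference, a manual full vacuum looks like the following (table name illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Rewrites the table into a new file containing only live tuples.
-- Takes an ACCESS EXCLUSIVE lock for the duration of the rewrite.
VACUUM (FULL, VERBOSE, ANALYZE) my_table;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;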

&lt;h3&gt;
  
  
  5. If VACUUM FULL is so effective, why isn't it recommended?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;VACUUM FULL&lt;/strong&gt; is a &lt;strong&gt;blocking operation&lt;/strong&gt;: it takes an &lt;code&gt;ACCESS EXCLUSIVE&lt;/code&gt; lock on the table for the duration of the rewrite, which can cause serious &lt;strong&gt;performance issues&lt;/strong&gt; on production systems. This is why it's not a standard solution for routine maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Key Takeaway
&lt;/h3&gt;

&lt;p&gt;Inactive replication slots had been &lt;strong&gt;blocking Autovacuum from doing its job&lt;/strong&gt;, causing a buildup of dead tuples and high system load. Once the replication slots were removed, Autovacuum could work properly again, leading to &lt;strong&gt;drastically lower CPU usage and a much healthier database&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.&lt;/p&gt;

&lt;p&gt;Feel free to connect with me on &lt;a href="https://www.linkedin.com/in/sasi-kumar-thangavel/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to learn more about my professional journey and projects.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>backend</category>
      <category>database</category>
    </item>
    <item>
      <title>Boosting PostgreSQL Performance: Optimising Queries with the != Operator</title>
      <dc:creator>Sasi Kumar T</dc:creator>
      <pubDate>Wed, 25 Sep 2024 15:27:20 +0000</pubDate>
      <link>https://forem.com/sasikumart/boosting-postgresql-performance-optimising-queries-with-the-operator-3db6</link>
      <guid>https://forem.com/sasikumart/boosting-postgresql-performance-optimising-queries-with-the-operator-3db6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;PostgreSQL is known for its flexibility and efficiency in handling a variety of queries. However, certain query patterns, such as those involving the &lt;code&gt;!=&lt;/code&gt; operator, can lead to performance challenges. &lt;/p&gt;

&lt;p&gt;In this post, we'll discuss how to optimize queries involving &lt;code&gt;!=&lt;/code&gt; and demonstrate the performance improvements achieved by creating a partial index.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Performance Challenge with &lt;code&gt;!=&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Using the &lt;code&gt;!=&lt;/code&gt; operator can make it difficult for PostgreSQL to utilize indexes efficiently. This often results in slower query performance because PostgreSQL may need to scan a larger portion of the table or perform more complex operations.&lt;/p&gt;

&lt;p&gt;Consider the query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"ticket"&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
      &lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
      &lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this query seems simple, it presents a challenge to PostgreSQL’s query planner. The &lt;code&gt;!=&lt;/code&gt; operator cannot efficiently use traditional B-tree indexes because it involves excluding a value, rather than narrowing down to a specific match. When indexes can’t be used effectively, PostgreSQL may have to resort to a full table scan, which can severely degrade performance, especially on large datasets.&lt;/p&gt;

&lt;p&gt;The execution plan of the above query is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Aggregate&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;045&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;047&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Only&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ticket_businessId_assignedto_statusid_idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;282&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;039&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;040&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25617&lt;/span&gt;
&lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Fetches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;23630&lt;/span&gt;

&lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;394&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;230&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;119&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you’re not familiar with what’s happening here, don’t worry; the breakdown below walks through it step by step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Plan Breakdown:
&lt;/h3&gt;

&lt;h3&gt;
  
  
  1. Aggregate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Aggregate&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;83&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;045&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;047&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt;: This represents the final step where the aggregation (counting rows in this case) takes place.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: The estimated cost to execute is &lt;code&gt;34899.83&lt;/code&gt; to &lt;code&gt;34899.84&lt;/code&gt;. The cost is an internal PostgreSQL unit representing the estimated computational effort.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows&lt;/strong&gt;: The planner estimates that 1 row will be output after aggregation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Width&lt;/strong&gt;: The width of the output row is 8 bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual Time&lt;/strong&gt;: The actual time taken is &lt;code&gt;21912.045 ms&lt;/code&gt; to &lt;code&gt;21912.047 ms&lt;/code&gt;, indicating the duration to perform the aggregation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loops&lt;/strong&gt;: This is executed once, as it is the final aggregation step.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Index Only Scan
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Only&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ticket_businessId_assignedto_statusid_idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;34899&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;282&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;039&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;21912&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;040&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Index Only Scan&lt;/strong&gt;: This indicates that an index scan is used to access the data. Since the scan is “only” on the index, it means no heap (table) access is needed initially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Name&lt;/strong&gt;: &lt;code&gt;ticket_businessId_assignedto_statusid_idx&lt;/code&gt; is the index being used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt;: The cost to perform the index scan ranges from &lt;code&gt;0.56&lt;/code&gt; to &lt;code&gt;34899.13&lt;/code&gt;. This range estimates the computational effort required to access the index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows&lt;/strong&gt;: The planner estimates that 282 rows will match the index condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual Time&lt;/strong&gt;: The actual time taken for the index scan is &lt;code&gt;21912.039 ms&lt;/code&gt; to &lt;code&gt;21912.040 ms&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Actual Rows&lt;/strong&gt;: Despite the estimate, no rows matched the index condition in reality (&lt;code&gt;rows=0&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Index Conditions and Filters
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;Rows&lt;/span&gt; &lt;span class="n"&gt;Removed&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;Filter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;25617&lt;/span&gt;
&lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Fetches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;23630&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Index Cond&lt;/strong&gt;: The index condition is that both &lt;code&gt;businessId&lt;/code&gt; and &lt;code&gt;assignedTo&lt;/code&gt; match the given values. This is used to narrow down the potential rows from the index.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter&lt;/strong&gt;: After applying the index condition, the query filters out rows where &lt;code&gt;statusId&lt;/code&gt; is not equal to 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows Removed by Filter&lt;/strong&gt;: 25,617 rows were removed because they did not meet the filter condition &lt;code&gt;statusId &amp;lt;&amp;gt; 2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap Fetches&lt;/strong&gt;: Even though the scan is index-only, PostgreSQL sometimes needs to fetch additional data from the heap to validate rows or retrieve data not included in the index. Here, 23,630 heap fetches occurred.&lt;/li&gt;
&lt;/ul&gt;
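
&lt;p&gt;As a side note, heap fetches during an index-only scan happen when pages aren’t marked all-visible in the visibility map. A plain (non-&lt;code&gt;FULL&lt;/code&gt;) vacuum updates that map, so running one typically shrinks this number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;-- Refreshes statistics and the visibility map, letting
-- index-only scans skip most heap visibility checks afterwards
VACUUM (ANALYZE) "ticket";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;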

&lt;h3&gt;
  
  
  Key Points
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Index Usage&lt;/strong&gt;: The query uses an index-only scan to access rows based on &lt;code&gt;businessId&lt;/code&gt; and &lt;code&gt;assignedTo&lt;/code&gt;. This is efficient as it avoids accessing the table's heap initially.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt;: The filter (&lt;code&gt;statusId &amp;lt;&amp;gt; 2&lt;/code&gt;) is applied after retrieving rows from the index. Since the filter was not selective enough, a large number of rows (25,617) were filtered out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heap Fetches&lt;/strong&gt;: The number of heap fetches (23,630) suggests that many rows needed additional data from the heap, despite the use of an index-only scan.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One thing I want you, the reader, to know here: the &lt;code&gt;statusId&lt;/code&gt; field has a cardinality of 5, with values ranging from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;4&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Based on this information, we can try the following approaches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Approach 1: Query Rewrite
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"ticket"&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
           &lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
           &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The filter is rewritten to match &lt;code&gt;statusId &amp;gt; 2&lt;/code&gt; or &lt;code&gt;statusId &amp;lt; 2&lt;/code&gt;, which is logically equivalent to &lt;code&gt;statusId != 2&lt;/code&gt; but expresses the condition as two ranges that a B-tree index can scan directly.&lt;/p&gt;

&lt;p&gt;The execution plan of the above query is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;Aggregate&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1142&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;07&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;1142&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;08&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;691&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;691&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;204&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;1141&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;281&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;691&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;195&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;691&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;197&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;Recheck&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(((&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;574&lt;/span&gt;
&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BitmapOr&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;282&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;966&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;968&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Bitmap&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ticket_businessId_assignedto_statusid_idx&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;00&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;37&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;145&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;864&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;864&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;628&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;249&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;881&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Summary of Execution
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bitmap Index Scan&lt;/strong&gt;: The query starts with scanning the index &lt;code&gt;ticket_businessId_assignedto_statusid_idx&lt;/code&gt; to find rows where the &lt;code&gt;businessId&lt;/code&gt; and &lt;code&gt;assignedTo&lt;/code&gt; match, and then the &lt;code&gt;statusId&lt;/code&gt; condition is applied. This index scan is efficient in finding rows but does not directly give the full row data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BitmapOr&lt;/strong&gt;: This combines the results from multiple bitmap index scans. Here, it combines results where &lt;code&gt;statusId&lt;/code&gt; is either greater than 2 or less than 2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bitmap Heap Scan&lt;/strong&gt;: Using the results from the &lt;code&gt;BitmapOr&lt;/code&gt;, it reads the actual rows from the heap and applies the conditions again. This step verifies the rows from the index scan and ensures they meet the &lt;code&gt;statusId&lt;/code&gt; conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt;: Finally, it aggregates the results from the heap scan, counting the number of rows that satisfy the conditions.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;You might have noticed that the first query returned 1 row while the second returned 0 rows. That's not a mistake: these queries were run against a live database, so the underlying data changed between runs.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Approach 2: Using Partial Index
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;&lt;em&gt;partial index&lt;/em&gt;&lt;/strong&gt; is an index that only includes rows meeting a specific condition defined in the &lt;code&gt;WHERE&lt;/code&gt; clause of the &lt;code&gt;CREATE INDEX&lt;/code&gt; statement. This allows you to create an index that is smaller and more efficient by excluding rows that don't match the condition.&lt;/p&gt;

&lt;h3&gt;
  
  
  Characteristics of Partial Indexes:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Selective Indexing&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;A partial index is created with a condition, which means it only indexes the rows that satisfy this condition. For example, you might only index rows where a column value is greater than a certain threshold.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Reduced Index Size&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Because it only indexes a subset of the table, a partial index is usually smaller and consumes less disk space than a full index. This can make index scans faster and reduce memory usage.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Improved Performance for Specific Queries&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Partial indexes are beneficial for queries that frequently filter on specific conditions. They can speed up these queries by avoiding the overhead of indexing all rows in the table.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;In our case, we need to create an index on (&lt;code&gt;businessId&lt;/code&gt;, &lt;code&gt;assignedTo&lt;/code&gt;) covering only the rows where &lt;code&gt;statusId&lt;/code&gt; ≠ 2. The index creation command is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;CONCURRENTLY&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt;
&lt;span class="n"&gt;ticket_businessId_assignedto_statusid_ne_2_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"assignedTo"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;"statusId"&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="nv"&gt;"ticket"&lt;/span&gt; 
    &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt; 
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;
    &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="nv"&gt;"statusId"&lt;/span&gt;&lt;span class="o"&gt;!=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;Aggregate&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;68&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;599&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;69&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;022&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;023&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="k"&gt;Only&lt;/span&gt; &lt;span class="n"&gt;Scan&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;ticket_businessId_assignedto_statusid_ne_2_idx&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;ticket&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;598&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;97&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;283&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;019&lt;/span&gt;&lt;span class="p"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;019&lt;/span&gt; &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="n"&gt;loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;Index&lt;/span&gt; &lt;span class="n"&gt;Cond&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nv"&gt;"businessId"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'B00154'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"assignedTo"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'sasikumar'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;Heap&lt;/span&gt; &lt;span class="n"&gt;Fetches&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;Planning&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;198&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Execution&lt;/span&gt; &lt;span class="nb"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;052&lt;/span&gt; &lt;span class="n"&gt;ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execution Plan Breakdown:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Aggregate&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actual Time&lt;/strong&gt;: &lt;code&gt;0.022..0.023 ms&lt;/code&gt; — The actual time taken for the aggregate operation to complete is between 0.022 and 0.023 milliseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows&lt;/strong&gt;: &lt;code&gt;1&lt;/code&gt; — The aggregate operation returned 1 row.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loops&lt;/strong&gt;: &lt;code&gt;1&lt;/code&gt; — This aggregate operation ran once.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Index Only Scan&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Actual Time&lt;/strong&gt;: &lt;code&gt;0.019..0.019 ms&lt;/code&gt; — The actual time taken for the index scan is 0.019 milliseconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rows&lt;/strong&gt;: &lt;code&gt;283&lt;/code&gt; — The planner estimates that 283 rows would be returned by the index scan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Width&lt;/strong&gt;: &lt;code&gt;0&lt;/code&gt; — The width of the rows (data returned) is 0, indicating that only the index entries are being scanned, not the table rows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Index Cond&lt;/strong&gt;: &lt;code&gt;("businessId" = 'B00154'::text) AND ("assignedTo" = 'sasikumar'::text)&lt;/code&gt; — The condition used to filter rows in the index scan.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The index-only scan uses the &lt;code&gt;ticket_businessId_assignedto_statusid_ne_2_idx&lt;/code&gt; index to locate rows where &lt;code&gt;businessId&lt;/code&gt; and &lt;code&gt;assignedTo&lt;/code&gt; match the specified values. It retrieves the index entries without needing to access the actual table rows because all necessary data is available in the index.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Heap Fetches&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Heap Fetches&lt;/strong&gt;: &lt;code&gt;0&lt;/code&gt; — Indicates that no heap fetches were needed. Since the index contains all required data, PostgreSQL did not need to access the actual table rows (the heap) for additional information.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Planning Time&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Planning Time&lt;/strong&gt;: &lt;code&gt;0.198 ms&lt;/code&gt; — The time taken by PostgreSQL to plan the query execution, including generating the execution plan.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execution Time&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Execution Time&lt;/strong&gt;: &lt;code&gt;0.052 ms&lt;/code&gt; — The total time taken to execute the query and retrieve the result.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Originally, the query used a full index &lt;code&gt;ticket_businessId_assignedto_statusid_idx&lt;/code&gt;, which was 727 MB. Now, it's using a partial index that's only 3.17 MB.&lt;br&gt;
In database terms, this means faster index scans, reduced memory usage when loading the index, and improved query performance.&lt;/p&gt;
&lt;/blockquote&gt;
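&lt;p&gt;If you want to measure this yourself, PostgreSQL can report index sizes directly. A quick check (note that unquoted identifiers in &lt;code&gt;CREATE INDEX&lt;/code&gt; are folded to lowercase, so the catalog names are all-lowercase):&lt;/p&gt;

```sql
-- Compare the on-disk sizes of the full index and the partial index
SELECT pg_size_pretty(pg_relation_size('ticket_businessid_assignedto_statusid_idx'))      AS full_index,
       pg_size_pretty(pg_relation_size('ticket_businessid_assignedto_statusid_ne_2_idx')) AS partial_index;
```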

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;In the world of PostgreSQL, optimizing queries can sometimes feel like navigating a maze, especially when dealing with operators like &lt;code&gt;!=&lt;/code&gt;. As we explored, such queries can often lead to performance hurdles due to inefficient index utilization and the need for table scans.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Takeaways:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Challenges with &lt;code&gt;!=&lt;/code&gt;&lt;/strong&gt;: Queries that use the &lt;code&gt;!=&lt;/code&gt; operator can be less efficient because they force PostgreSQL to perform broader scans and potentially filter out a significant number of rows. This was evident in our initial example, where the query needed to filter out many rows, leading to a high number of heap fetches and considerable execution time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Rewrite Efficiency&lt;/strong&gt;: By reformulating the query to use &lt;code&gt;statusId &amp;gt; 2 OR statusId &amp;lt; 2&lt;/code&gt;, we leveraged PostgreSQL's ability to optimize these conditions more effectively. This approach helped reduce the execution time significantly, demonstrating that a slight change in query logic can yield substantial performance improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Power of Partial Indexes&lt;/strong&gt;: The use of partial indexes can be a game-changer. By creating an index specifically for rows where &lt;code&gt;statusId != 2&lt;/code&gt;, we reduced the index size and optimized the scan process. This specialized index not only sped up the query but also eliminated the need for additional heap fetches, showcasing how targeted indexing can enhance performance.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  About Me
&lt;/h2&gt;

&lt;p&gt;I’m a backend developer with extensive experience in designing and optimizing scalable backend systems. My expertise includes tackling complex performance challenges. I’ve led numerous database performance initiatives and have also been deeply involved in system design and revamping existing systems. My focus is on enhancing efficiency, ensuring reliability, and delivering robust solutions that scale effectively.&lt;/p&gt;

&lt;p&gt;Feel free to connect with me on &lt;a href="https://www.linkedin.com/in/sasi-kumar-thangavel/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to learn more about my professional journey and projects.&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>sql</category>
      <category>performance</category>
    </item>
    <item>
      <title>Overcoming Weak Consistency Using a Cache</title>
      <dc:creator>Sasi Kumar T</dc:creator>
      <pubDate>Fri, 25 Nov 2022 10:42:54 +0000</pubDate>
      <link>https://forem.com/sasikumart/overcoming-weak-consistency-using-cache-2i2h</link>
      <guid>https://forem.com/sasikumart/overcoming-weak-consistency-using-cache-2i2h</guid>
      <description>&lt;p&gt;Recently I faced weak consistency for the first time in my professional career (&amp;lt; 2 yrs). I didn’t know that the DB we were using in our PROD env was weakly consistent when reads go to a secondary replica (how silly of me 😓).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Weak/Eventual Consistency&lt;/strong&gt; is a guarantee that when an update is made in a distributed database, that update will eventually be reflected in all nodes that store the data, so that eventually every query for it returns the same response.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;What was I doing?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider two tables&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Users&lt;/strong&gt; table, having the below columns&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;th&gt;name&lt;/th&gt;
&lt;th&gt;age&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Relationship&lt;/strong&gt; table, which has the below columns&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;contactId&lt;/th&gt;
&lt;th&gt;id&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;id is the primary key of the &lt;strong&gt;users&lt;/strong&gt; table&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Usually, a relationship is created whenever some info needs to be persisted for a particular user. When no relationship exists yet, a new row is created in the users table, and a new entry mapping to that user is saved in the relationship table.&lt;/p&gt;

&lt;p&gt;When new details are to be saved, the &lt;strong&gt;user's &lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; is fetched from the relationship table using the &lt;code&gt;contactId&lt;/code&gt;, and the details are persisted.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What was the problem?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider this scenario: when someone lands on a webpage, details like &lt;code&gt;ip, timezone, email&lt;/code&gt; can be collected automatically.&lt;/p&gt;

&lt;p&gt;When trying to save the first field, &lt;code&gt;email&lt;/code&gt;, we check the &lt;strong&gt;Relationship&lt;/strong&gt; table. If no entry is present, a new row is created in the users table and a new &lt;code&gt;relationship&lt;/code&gt; entry is created as well.&lt;/p&gt;

&lt;p&gt;When the next update, &lt;code&gt;ip&lt;/code&gt;, is to be saved, we check the &lt;strong&gt;Relationship&lt;/strong&gt; table, get the &lt;code&gt;id&lt;/code&gt;, and apply the update.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createUser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;updateUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dataToUpdate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;forEachUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dataToUpdate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;relationship&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;newUser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createUser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nx"&gt;relationship&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nf"&gt;updateUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;relationship&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dataToUpdate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// main&lt;/span&gt;
&lt;span class="nf"&gt;forEachUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sasi@hotmail.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sasi@hotmail.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;forEachUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sasi@hotmail.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;ip&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;127.0.0.1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem happens when we don’t find a &lt;code&gt;relationship&lt;/code&gt; on a check made just a few milliseconds after it was created. &lt;em&gt;But how? A new entry was added previously!&lt;/em&gt; What happened was: writes go to the primary replica, while reads go to a secondary replica. Reads and writes were hitting different servers, which do not hold the exact same data at any given moment. This is called &lt;em&gt;replication lag&lt;/em&gt;. Because of this, duplicate user entries were created in the users table.&lt;/p&gt;
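&lt;p&gt;To make the race concrete, here is a toy, self-contained simulation (plain Maps stand in for the primary and the replica, and replication is applied only when an explicit &lt;code&gt;sync()&lt;/code&gt; runs, mimicking the lag):&lt;/p&gt;

```javascript
// Toy replication-lag simulation: writes land on the primary, reads hit
// the replica, and changes propagate only when sync() is called.
const primary = new Map();
const replica = new Map();
const sync = () => { for (const [k, v] of primary) replica.set(k, v); };

let nextUserId = 1;
const usersCreated = [];

function forEachUpdate(contactId, dataToUpdate) {
  let relationship = replica.get(contactId);      // read goes to the replica
  if (!relationship) {
    const newUser = { id: nextUserId++ };
    usersCreated.push(newUser);
    primary.set(contactId, { contactId, id: newUser.id }); // write goes to the primary
    relationship = primary.get(contactId);
  }
  // updateUser(relationship, dataToUpdate) would run here
}

forEachUpdate('sasi@hotmail.com', { email: 'sasi@hotmail.com' });
forEachUpdate('sasi@hotmail.com', { ip: '127.0.0.1' }); // replica is still stale
console.log(usersCreated.length); // 2: a duplicate user was created
```

&lt;p&gt;Had the write been visible to the second read (i.e. had &lt;code&gt;sync()&lt;/code&gt; run between the two updates), only one user would exist.&lt;/p&gt;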

&lt;p&gt;&lt;strong&gt;What is the intermediate fix?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pointing the read queries to the primary solved the issue. But read replicas exist for a reason: they keep the primary from being overloaded with reads so it can respond quickly to writes. So this was not a permanent fix. A permanent fix would have meant switching to a different database, for which I didn’t have the time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The updates for a given contact are always done sequentially. So, whenever a &lt;code&gt;relationship&lt;/code&gt; is created, a copy of it is also stored in Redis (an in-memory cache).&lt;/p&gt;

&lt;p&gt;The pseudo-code will look something like the below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;saveInRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;saveInDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;createRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;saveInDB&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;saveInRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Therefore, before every update, Redis is checked first for an existing relationship; only if there is none does the lookup fall through to the DB.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;getRelationship&lt;/code&gt; function has been transformed as below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getDataFromRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getRelationship&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;relationShip&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getDataFromRedis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;contactId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;relationShip&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// check in DB&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;relationShip&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
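&lt;p&gt;Putting the two pieces together, a minimal runnable sketch of the cache-aside flow might look like this (plain Maps stand in for Redis and the DB; the names are illustrative):&lt;/p&gt;

```javascript
// Cache-aside sketch: writes go to the DB and are mirrored into the cache;
// reads check the cache first and fall back to the DB only on a miss.
const cache = new Map(); // stands in for Redis
const db = new Map();    // stands in for the primary replica

function createRelationship(contactId, user) {
  db.set(contactId, user);    // write to the DB first
  cache.set(contactId, user); // then mirror into the cache
  return user;
}

function getRelationship(contactId) {
  let relationship = cache.get(contactId); // a cache hit avoids the stale replica
  if (!relationship) {
    relationship = db.get(contactId);      // fall back to the DB on a miss
  }
  return relationship;
}

createRelationship('sasi@hotmail.com', { id: 1 });
console.log(getRelationship('sasi@hotmail.com').id); // 1, served from the cache
```

&lt;p&gt;In production you would also want a TTL on the Redis key so stale entries eventually expire instead of living in the cache forever.&lt;/p&gt;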



</description>
      <category>redis</category>
      <category>database</category>
      <category>node</category>
    </item>
  </channel>
</rss>
