Data systems such as Postgres make strong durability guarantees. The aim is to reduce the chances of data loss as much as possible. The fsync and fdatasync system calls are often used to guarantee that data has been safely flushed to the storage device, so the user can be certain that a power failure will not cause them to lose their data.
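As a quick, minimal sketch of how the two calls can be issued directly (this example is mine, not from the original post; it assumes Linux and the golang.org/x/sys/unix package, and the file path is arbitrary):

package main

import (
    "log"

    "golang.org/x/sys/unix"
)

func main() {
    // Open (or create) a scratch file; the path is only an example.
    fd, err := unix.Open("/tmp/durable.txt", unix.O_CREAT|unix.O_WRONLY, 0644)
    if err != nil {
        log.Fatalf("open: %v", err)
    }
    defer unix.Close(fd)

    if _, err := unix.Write(fd, []byte("hello world")); err != nil {
        log.Fatalf("write: %v", err)
    }

    // fsync flushes both file data and metadata to the storage device.
    if err := unix.Fsync(fd); err != nil {
        log.Fatalf("fsync: %v", err)
    }

    // fdatasync flushes the data but may skip metadata (such as mtime)
    // that is not needed to read the data back, so it can be cheaper.
    if err := unix.Fdatasync(fd); err != nil {
        log.Fatalf("fdatasync: %v", err)
    }
}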
However, these calls are not cheap: an application blocks while waiting for the system call to return. Napkin-math latency numbers put it at around 1ms. I ran a test of the fsync system call on my laptop to see whether I'd get similar latency values. The bpftrace script I used looks like this:
BEGIN
{
    printf("Tracing fsync latency... Hit Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    @start[tid] = nsecs; // Record the entry timestamp per thread
}

tracepoint:syscalls:sys_exit_fsync,
tracepoint:syscalls:sys_exit_fdatasync
/@start[tid]/
{
    cat("/proc/%d/cmdline", pid); // Who is making the call?
    printf(" ==> %d\n", (nsecs - @start[tid])); // Latency in nanoseconds
    delete(@start, tid);
}
The program that makes the fsync call is written in Go:
package main

import (
    "log"
    "os"
)

func main() {
    buf := []byte("hello world")

    file, err := os.OpenFile("/tmp/testfile.txt", os.O_CREATE|os.O_WRONLY, 0777)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    if _, err := file.Write(buf); err != nil {
        log.Fatalf("Write: %v", err)
    }
    if err := file.Sync(); err != nil { // Flush to disk; fsync(2) under the hood on Linux
        log.Fatalf("Sync: %v", err)
    }
}
After running the Go program a couple of times, the results I got were as follows.
Run bpftrace --unsafe fsync_lat.bt
Attaching 5 probes...
Tracing fsync latency... Hit Ctrl-C to end.
==> 1095741
==> 967711
==> 1086844
==> 1095359
The measured values are in nanoseconds, so each call took roughly 1ms, which is not too far from the napkin value.
Having done this, I wanted to know how often Postgres flushes data to disk. My assumption at this point was that a single transaction makes at least one fsync call to ensure durability (for every commit, a WAL record is generated and flushed to disk), and I wanted to find out whether that holds in Postgres. If one fsync call takes 1ms, then you'd expect around 1000 transactions per second in a single-threaded scenario. This is an over-simplification because transactions do more than make fsync calls, so the latency would be over 1ms in a real system.
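To sanity-check that back-of-the-envelope ceiling independently of Postgres, a rough single-threaded microbenchmark along these lines could be used (my own sketch; the file path and iteration count are arbitrary). It writes and syncs in a loop and reports the implied fsyncs per second:

package main

import (
    "fmt"
    "log"
    "os"
    "time"
)

func main() {
    const iterations = 1000 // arbitrary sample size

    // The path is just an example; pick a file on the disk you care about.
    file, err := os.OpenFile("/tmp/fsync_bench.txt", os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    buf := []byte("hello world")
    start := time.Now()
    for i := 0; i < iterations; i++ {
        if _, err := file.Write(buf); err != nil {
            log.Fatalf("Write: %v", err)
        }
        if err := file.Sync(); err != nil { // one fsync per iteration
            log.Fatalf("Sync: %v", err)
        }
    }
    elapsed := time.Since(start)

    perSync := elapsed / iterations
    fmt.Printf("avg per fsync: %v, implied ceiling: %.0f syncs/sec\n",
        perSync, float64(time.Second)/float64(perSync))
}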
The bpftrace script to track fsync calls made by Postgres:
BEGIN
{
    printf("Tracing Postgres fsync calls... Hit Ctrl-C to end.\n");
    printf("%-6s %6s %s\n", "PID", "FD", "ProcName < - > Filename");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
/comm == "postgres"/
{
    @procs[pid] = count();
    @sum["total"] = count();
    if (!@files_fsyncd[args->fd]) {
        @files_fsyncd[args->fd] = 1; // Mark FD as seen
        printf("\n%-6d %6d ", pid, args->fd);
        cat("/proc/%d/cmdline", pid);
        printf(" < - > ");
        system("readlink /proc/%d/fd/%d", pid, args->fd); // Get filename from fd
    }
}

END
{
    clear(@files_fsyncd);
}
The script fires when a Postgres process makes an fsync or fdatasync call. It displays the process name (for example the checkpointer, walwriter, etc.), the process ID, the total number of calls made by all Postgres processes, the name of the file, and the file descriptor number.
The stats from running a single client with sysbench, using a write-only workload for about 600 seconds, are as follows:
SQL statistics:
    queries performed:
        read:            0
        write:           1435372
        other:           717686
        total:           2153058
    transactions:        358843 (598.07 per sec.)
    queries:             2153058 (3588.42 per sec.)
    ignored errors:      0 (0.00 per sec.)
    reconnects:          0 (0.00 per sec.)
The client performed 358843 transactions while issuing about 2153058 queries, which is around 6 queries per transaction.
Here is the result from running the bpftrace script with bpftrace --unsafe fsync_files.bt:
Attaching 4 probes...
Tracing fsync calls... Hit Ctrl-C to end.
PID FD ProcName < - > Filename
6202 14 postgres: sbtest sbtest 192.168.33.10(50984) idle < - > /usr/local/pgsql/data/pg_wal/0000000100000010000000ED
3094 6 postgres: walwriter < - > /usr/local/pgsql/data/pg_wal/0000000100000010000000ED
3091 7 postgres: checkpointer < - >
6212 8 postgres: autovacuum worker sbtest < - > /usr/local/pgsql/data/pg_wal/000000010000001100000001
3091 12 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2841
3091 16 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24741
3091 9 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840
3091 13 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24730
3091 11 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840_vm
3091 18 postgres: checkpointer < - > /usr/local/pgsql/data/pg_wal
3091 17 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24759
3091 10 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840_fsm
3091 15 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24730_vm
3091 19 postgres: checkpointer < - >
^C
@procs[6253]: 1
@procs[6257]: 1
@procs[6255]: 1
@procs[6221]: 1
@procs[6223]: 1
@procs[6251]: 1
@procs[6259]: 1
@procs[6212]: 1
@procs[6216]: 1
@procs[6262]: 1
@procs[3091]: 152
@procs[3094]: 253
@procs[6202]: 358898
@sum[total]: 359313
The client process with PID 6202 is reported as having made 358898 fsync calls, which is around the same number of transactions reported by sysbench above: every transaction made an fsync call. But the actual number of transactions per second is about 598 when the ideal number is 1000. What could be the reason for the difference? I had a few thoughts (there could be more...)
One reason could be the number of queries performed in a single transaction (6 on average), which adds query-processing overhead. Also, there are other processes competing for shared resources with this client. The checkpointer, walwriter, autovacuum, and bgwriter are the background processes that Postgres runs to perform other essential services. The checkpointer and walwriter (PIDs 3091 and 3094) made a combined 405 fsync calls. Checkpointing flushes all the dirty data that is in memory to disk. This can be a lot of data, especially if there are many dirty buffers still in memory, and flushing lots of data at once can be very slow and increase response times in the system.
Other factors, such as query processing (which depends on the number of queries in the transaction), locks, memory, IO, and the network, can increase transaction processing times. The transaction processing time is therefore not just the cost of an fsync call and would likely exceed 1ms. In a highly concurrent system, one optimization a database can use is to group multiple commits and issue a single fsync call to flush all of their data to disk at once, which reduces the number of IOs.
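As a rough sketch of that group-commit idea (this illustrates the general technique only, not Postgres's actual implementation; the types, channel size, and file path are made up), pending commits can be queued and a single fsync can acknowledge a whole batch:

package main

import (
    "fmt"
    "log"
    "os"
    "sync"
)

// commitRequest represents one transaction waiting for its WAL record
// to become durable; done is closed once an fsync covers it.
type commitRequest struct {
    payload []byte
    done    chan struct{}
}

// groupCommitter drains queued commits and issues one fsync per batch.
func groupCommitter(wal *os.File, requests <-chan *commitRequest) {
    for req := range requests {
        batch := []*commitRequest{req}
        // Pick up anything else already queued, without blocking.
    drain:
        for {
            select {
            case r, ok := <-requests:
                if !ok {
                    break drain
                }
                batch = append(batch, r)
            default:
                break drain
            }
        }

        // Write every commit's payload, then make them all durable
        // with a single fsync.
        for _, r := range batch {
            if _, err := wal.Write(r.payload); err != nil {
                log.Fatalf("Write: %v", err)
            }
        }
        if err := wal.Sync(); err != nil { // one fsync for the whole batch
            log.Fatalf("Sync: %v", err)
        }
        for _, r := range batch {
            close(r.done) // wake up every transaction in the batch
        }
    }
}

func main() {
    wal, err := os.OpenFile("/tmp/fake_wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer wal.Close()

    requests := make(chan *commitRequest, 128)
    go groupCommitter(wal, requests)

    // Simulate a few concurrent commits; each returns only after an
    // fsync has covered its payload.
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            req := &commitRequest{
                payload: []byte(fmt.Sprintf("commit %d\n", i)),
                done:    make(chan struct{}),
            }
            requests <- req
            <-req.done
        }(i)
    }
    wg.Wait()
    close(requests)
}

Each waiting transaction still observes a durable commit, but the fsync cost is amortized across the whole batch instead of being paid once per transaction.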