Data systems such as Postgres make strong durability guarantees. The aim is to reduce the chances of data loss as much as possible. The fsync and fdatasync system calls are often used to guarantee that data has been safely flushed to the storage device, so the user can be certain that a power failure will not cause them to lose their data.
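As a quick, minimal sketch of how the two calls can be issued directly (this example is mine, not from the original post; it assumes Linux and the golang.org/x/sys/unix package, and the file path is arbitrary):

package main

import (
    "log"

    "golang.org/x/sys/unix"
)

func main() {
    // Open (or create) a scratch file; the path is only an example.
    fd, err := unix.Open("/tmp/durable.txt", unix.O_CREAT|unix.O_WRONLY, 0644)
    if err != nil {
        log.Fatalf("open: %v", err)
    }
    defer unix.Close(fd)

    if _, err := unix.Write(fd, []byte("hello world")); err != nil {
        log.Fatalf("write: %v", err)
    }

    // fsync flushes both file data and metadata to the storage device.
    if err := unix.Fsync(fd); err != nil {
        log.Fatalf("fsync: %v", err)
    }

    // fdatasync flushes the data but may skip metadata (such as mtime)
    // that is not needed to read the data back, so it can be cheaper.
    if err := unix.Fdatasync(fd); err != nil {
        log.Fatalf("fdatasync: %v", err)
    }
}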
However, these calls are not cheap: an application blocks while waiting for the system call to return. Napkin-math latency numbers put it at around 1ms. I ran a test of the fsync system call on my laptop to see whether I'd get similar latency values. The bpftrace script I used looks like this:
BEGIN
{
    printf("Tracing fsync latency... Hit Ctrl-C to end.\n");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
{
    @start[tid] = nsecs; // Record the entry timestamp per thread
}

tracepoint:syscalls:sys_exit_fsync,
tracepoint:syscalls:sys_exit_fdatasync
/@start[tid]/
{
    cat("/proc/%d/cmdline", pid); // Who is making the call?
    printf(" ==> %d\n", (nsecs - @start[tid])); // Latency in nanoseconds
    delete(@start, tid);
}
The program that makes the fsync call is written in Go:
package main

import (
    "log"
    "os"
)

func main() {
    buf := []byte("hello world")

    file, err := os.OpenFile("/tmp/testfile.txt", os.O_CREATE|os.O_WRONLY, 0777)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    if _, err := file.Write(buf); err != nil {
        log.Fatalf("Write: %v", err)
    }
    if err := file.Sync(); err != nil { // Flush to disk; fsync(2) under the hood on Linux
        log.Fatalf("Sync: %v", err)
    }
}
After running the Go program a couple of times, the results I got were as follows.
Run bpftrace --unsafe fsync_lat.bt
Attaching 5 probes...
Tracing fsync latency... Hit Ctrl-C to end.
==> 1095741
==> 967711
==> 1086844
==> 1095359
The measured values are in nanoseconds, so each call took roughly 1ms, which is not too far from the napkin value.
Having done this, I wanted to know how often Postgres flushes data to disk. My assumption at this point was that a single transaction makes at least one fsync call to ensure durability (for every commit, a WAL record is generated and flushed to disk), and I wanted to find out whether that holds in Postgres. If one fsync call takes 1ms, then you'd expect around 1000 transactions per second in a single-threaded scenario. This is an over-simplification because transactions do more than make fsync calls, so the latency would be over 1ms in a real system.
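To sanity-check that back-of-the-envelope ceiling independently of Postgres, a rough single-threaded microbenchmark along these lines could be used (my own sketch; the file path and iteration count are arbitrary). It writes and syncs in a loop and reports the implied fsyncs per second:

package main

import (
    "fmt"
    "log"
    "os"
    "time"
)

func main() {
    const iterations = 1000 // arbitrary sample size

    // The path is just an example; pick a file on the disk you care about.
    file, err := os.OpenFile("/tmp/fsync_bench.txt", os.O_CREATE|os.O_WRONLY, 0644)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer file.Close()

    buf := []byte("hello world")
    start := time.Now()
    for i := 0; i < iterations; i++ {
        if _, err := file.Write(buf); err != nil {
            log.Fatalf("Write: %v", err)
        }
        if err := file.Sync(); err != nil { // one fsync per iteration
            log.Fatalf("Sync: %v", err)
        }
    }
    elapsed := time.Since(start)

    perSync := elapsed / iterations
    fmt.Printf("avg per fsync: %v, implied ceiling: %.0f syncs/sec\n",
        perSync, float64(time.Second)/float64(perSync))
}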
The bpftrace script to track fsync calls made by Postgres:
BEGIN
{
    printf("Tracing Postgres fsync calls... Hit Ctrl-C to end.\n");
    printf("%-6s %6s %s\n", "PID", "FD", "ProcName < - > Filename");
}

tracepoint:syscalls:sys_enter_fsync,
tracepoint:syscalls:sys_enter_fdatasync
/comm == "postgres"/
{
    @procs[pid] = count();
    @sum["total"] = count();
    if (!@files_fsyncd[args->fd]) {
        @files_fsyncd[args->fd] = 1; // Mark FD as seen
        printf("\n%-6d %6d ", pid, args->fd);
        cat("/proc/%d/cmdline", pid);
        printf(" < - > ");
        system("readlink /proc/%d/fd/%d", pid, args->fd); // Get filename from fd
    }
}

END
{
    clear(@files_fsyncd);
}
The script fires when a Postgres process makes an fsync or fdatasync call. It displays the process name (for example the checkpointer, walwriter, etc.), the process ID, the total number of calls made by all Postgres processes, the name of the file, and the file descriptor number.
The stats from running a single client with sysbench, using a write-only workload for about 600 seconds, are as follows:
SQL statistics:
    queries performed:
        read:            0
        write:           1435372
        other:           717686
        total:           2153058
    transactions:        358843 (598.07 per sec.)
    queries:             2153058 (3588.42 per sec.)
    ignored errors:      0 (0.00 per sec.)
    reconnects:          0 (0.00 per sec.)
The client performed 358843 transactions while issuing about 2153058 queries, which is around 6 queries per transaction.
Here is the result from running the bpftrace script with bpftrace --unsafe fsync_files.bt:
Attaching 4 probes...
Tracing fsync calls... Hit Ctrl-C to end.
PID FD ProcName < - > Filename
6202 14 postgres: sbtest sbtest 192.168.33.10(50984) idle < - > /usr/local/pgsql/data/pg_wal/0000000100000010000000ED
3094 6 postgres: walwriter < - > /usr/local/pgsql/data/pg_wal/0000000100000010000000ED
3091 7 postgres: checkpointer < - >
6212 8 postgres: autovacuum worker sbtest < - > /usr/local/pgsql/data/pg_wal/000000010000001100000001
3091 12 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2841
3091 16 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24741
3091 9 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840
3091 13 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24730
3091 11 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840_vm
3091 18 postgres: checkpointer < - > /usr/local/pgsql/data/pg_wal
3091 17 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24759
3091 10 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/2840_fsm
3091 15 postgres: checkpointer < - > /usr/local/pgsql/data/base/16416/24730_vm
3091 19 postgres: checkpointer < - >
^C
@procs[6253]: 1
@procs[6257]: 1
@procs[6255]: 1
@procs[6221]: 1
@procs[6223]: 1
@procs[6251]: 1
@procs[6259]: 1
@procs[6212]: 1
@procs[6216]: 1
@procs[6262]: 1
@procs[3091]: 152
@procs[3094]: 253
@procs[6202]: 358898
@sum[total]: 359313
The client process with PID 6202 is reported as having made 358898 fsync calls, which is around the same number of transactions reported by sysbench above: every transaction made an fsync call. But the actual number of transactions per second is about 598 when the ideal number is 1000. What could be the reason for the difference? I had a few thoughts (there could be more...)
One reason could be the number of queries performed in a single transaction (6 on average), which adds query-processing overhead. Also, there are other processes competing for shared resources with this client. The checkpointer, walwriter, autovacuum, and bgwriter are the background processes that Postgres runs to perform other essential services. The checkpointer and walwriter (PIDs 3091 and 3094) made a combined 405 fsync calls. Checkpointing flushes all the dirty data that is in memory to disk. This can be a lot of data, especially if there are many dirty buffers still in memory, and flushing lots of data at once can be very slow and increase response times in the system.
Other factors, such as query processing (which depends on the number of queries in the transaction), locks, memory, IO, and the network, can increase transaction processing times. The transaction processing time is therefore not just the cost of an fsync call and would likely exceed 1ms. In a highly concurrent system, one optimization a database can use is to group multiple commits and issue a single fsync call to flush all of their data to disk at once, which reduces the number of IOs.
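As a rough sketch of that group-commit idea (this illustrates the general technique only, not Postgres's actual implementation; the types, channel size, and file path are made up), pending commits can be queued and a single fsync can acknowledge a whole batch:

package main

import (
    "fmt"
    "log"
    "os"
    "sync"
)

// commitRequest represents one transaction waiting for its WAL record
// to become durable; done is closed once an fsync covers it.
type commitRequest struct {
    payload []byte
    done    chan struct{}
}

// groupCommitter drains queued commits and issues one fsync per batch.
func groupCommitter(wal *os.File, requests <-chan *commitRequest) {
    for req := range requests {
        batch := []*commitRequest{req}
        // Pick up anything else already queued, without blocking.
    drain:
        for {
            select {
            case r, ok := <-requests:
                if !ok {
                    break drain
                }
                batch = append(batch, r)
            default:
                break drain
            }
        }

        // Write every commit's payload, then make them all durable
        // with a single fsync.
        for _, r := range batch {
            if _, err := wal.Write(r.payload); err != nil {
                log.Fatalf("Write: %v", err)
            }
        }
        if err := wal.Sync(); err != nil { // one fsync for the whole batch
            log.Fatalf("Sync: %v", err)
        }
        for _, r := range batch {
            close(r.done) // wake up every transaction in the batch
        }
    }
}

func main() {
    wal, err := os.OpenFile("/tmp/fake_wal.log", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0644)
    if err != nil {
        log.Fatalf("OpenFile: %v", err)
    }
    defer wal.Close()

    requests := make(chan *commitRequest, 128)
    go groupCommitter(wal, requests)

    // Simulate a few concurrent commits; each returns only after an
    // fsync has covered its payload.
    var wg sync.WaitGroup
    for i := 0; i < 5; i++ {
        wg.Add(1)
        go func(i int) {
            defer wg.Done()
            req := &commitRequest{
                payload: []byte(fmt.Sprintf("commit %d\n", i)),
                done:    make(chan struct{}),
            }
            requests <- req
            <-req.done
        }(i)
    }
    wg.Wait()
    close(requests)
}

Each waiting transaction still observes a durable commit, but the fsync cost is amortized across the whole batch instead of being paid once per transaction.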