
Troubleshooting SeaTunnel Cluster Split-Brain: A Deep Dive into Hazelcast Configuration and GC-Induced Failures

Cluster Configuration

| Item | Description |
| --- | --- |
| Quantity | 3 servers |
| Specifications | Alibaba Cloud ECS 16C64G |
| Slot Mode | Static, 50 slots |
| ST Memory Config | `-Xms32g -Xmx32g -XX:MaxMetaspaceSize=8g` |
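
For reference, the Hazelcast cluster networking behind Zeta is defined in hazelcast.yaml. A minimal sketch of a static three-node member list looks roughly like the following; the addresses and port here are illustrative, not our production values:

```yaml
hazelcast:
  cluster-name: seatunnel
  network:
    join:
      tcp-ip:
        enabled: true
        member-list:          # illustrative addresses; "node 198" below refers to one such member
          - 10.0.0.197
          - 10.0.0.198
          - 10.0.0.199
    port:
      auto-increment: false
      port: 5801
```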

Exception Issues

Since April, the cluster has experienced three split-brain incidents, each involving a node either splitting off from the cluster or shutting down automatically.

Core logs are as follows:

Master Node

The Hazelcast monitoring thread prints a Slow Operation log.


After the 60-second Hazelcast heartbeat timeout expires, we see that node 198 has left the cluster.


198 Worker Node

We can see that the node is no longer able to receive heartbeats from the other Hazelcast cluster members, with timeouts exceeding 60000 ms.


It then attempts to reconnect to the cluster.


Afterward, any requests sent to this node—such as status queries or job submissions—become stuck and return no status.


At this point the entire cluster is unavailable and effectively deadlocked: the health-check interfaces we wrote for the nodes are all unreachable. Because the outage hit during peak morning hours, as soon as the logs confirmed the cluster had entered a split-brain state, we restarted the cluster.


Even after parameter tuning, an incident of a node shutting down automatically still occurred.


Problem Analysis

The issue may lie in a failure of Hazelcast cluster networking, with the following possible causes:

  • NTP time of the ECS instances in the cluster is inconsistent;
  • Network jitter on the ECS instances in the cluster causes access to be unavailable;
  • SeaTunnel experiences Full GC, stalling the JVM and causing cluster networking to fail.

The first two issues were ruled out by our operations colleagues: no network problems were identified, Alibaba Cloud NTP service is functioning normally, and the server clocks are synchronized across all three servers.

Regarding the third possibility: in the last Hazelcast health-check log on node 198 before the anomaly, the cluster time offset was only about -100 milliseconds, which should have a limited impact.


So on subsequent startups we added JVM GC logging parameters to observe Full GC durations, and found that the worst Full GC lasted 27 s. Since the three nodes monitor each other via heartbeat pings, a pause like that easily pushes Hazelcast past the 60 s heartbeat timeout.

We also discovered that when synchronizing a 1.4-billion-row ClickHouse table, Full GC is likely to occur after the job has been running for some time.


Solution

Increase ST Cluster Heartbeat Timeout

Hazelcast’s cluster failure detector is responsible for determining whether a cluster member is unreachable or has crashed.

But according to the well-known FLP result, in asynchronous systems, it’s impossible to distinguish between a crashed member and a slow member. A solution to this limitation is to use an unreliable failure detector. An unreliable detector allows members to suspect others of failure, typically based on liveness criteria, though it may make mistakes.

Hazelcast provides the following built-in failure detectors: Deadline Failure Detector and Phi Accrual Failure Detector.

By default, Hazelcast uses the Deadline Failure Detector.

There is also a Ping Failure Detector which, if enabled, works in parallel with the above detectors and can detect failures at OSI Layer 3 (network layer). This detector is disabled by default.
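
For completeness, the ping detector is also enabled through properties in hazelcast.yaml. The sketch below shows roughly how it could be turned on; we did not enable it, and the values are illustrative rather than something we tested:

```yaml
hazelcast:
  properties:
    hazelcast.icmp.enabled: true          # turn on the ICMP ping failure detector
    hazelcast.icmp.parallel.mode: true    # run it alongside the heartbeat-based detectors
    hazelcast.icmp.interval: 1000         # milliseconds between pings
    hazelcast.icmp.timeout: 1000          # per-ping timeout in milliseconds
    hazelcast.icmp.max.attempts: 3        # consecutive failed pings before suspecting the member
```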

Deadline Failure Detector

This detector uses an absolute timeout for missed or lost heartbeats. Once the timeout is exceeded, the member is considered crashed or unreachable and is marked as suspected.

Relevant parameters and descriptions:

```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: deadline
    hazelcast.heartbeat.interval.seconds: 5
    hazelcast.max.no.heartbeat.seconds: 120
```
| Configuration Item | Description |
| --- | --- |
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: `deadline`. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout after which a member is suspected if no heartbeat is received. |
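
For example, with the settings above a member sends a heartbeat every 5 seconds, so a peer is suspected only after 120 s / 5 s = 24 consecutive heartbeat intervals pass without a heartbeat being received.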

Phi-accrual Failure Detector

This detector tracks the intervals between heartbeats in a sliding time window, measures the mean and variance of these samples, and computes a suspicion level (phi).

As the time since the last heartbeat increases, the phi value increases. If the network becomes slow or unreliable, leading to increased mean and variance, members are suspected after a longer time without a heartbeat.
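
Roughly speaking (following the original phi-accrual failure detection paper), phi is the negative logarithm of the probability that a heartbeat which is merely late will still arrive:

```
phi(t_now) = -log10( P_later(t_now - t_last_heartbeat) )
```

So the default threshold of 10 corresponds to about a one-in-10^10 chance of wrongly suspecting a member whose heartbeat is only delayed.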

Relevant parameters and descriptions:

```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 60
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```
| Configuration Item | Description |
| --- | --- |
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: `phi-accrual`. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout for suspecting a member due to missed heartbeats. With an adaptive detector, this can be shorter than the timeout used for the deadline detector. |
| hazelcast.heartbeat.phiaccrual.failuredetector.threshold | Phi threshold above which a member is considered unreachable. A lower value detects failures more aggressively but increases false positives; a higher value is more conservative but detects real failures more slowly. Default is 10. |
| hazelcast.heartbeat.phiaccrual.failuredetector.sample.size | Number of heartbeat samples to retain in the history window. Default is 200. |
| hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis | Minimum standard deviation used in the phi calculation (assuming normally distributed intervals). |


To improve accuracy, we adopted community recommendations to use phi-accrual in hazelcast.yml, and set the timeout to 180s:

```yaml
hazelcast:
  properties:
    # Newly added parameters
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```

GC Configuration Optimization

SeaTunnel uses the G1 garbage collector by default. The larger the heap, the more likely it is that Young GC and mixed GC cannot reclaim enough memory efficiently (even though they run multithreaded), eventually triggering frequent Full GCs.

(On Java 8, G1 runs Full GC single-threaded, which is very slow; a parallel Full GC for G1 only arrived in JDK 10.)

If several cluster nodes trigger a Full GC at the same time, the chance of cluster networking failures increases greatly.

Our goal, therefore, is to let the multithreaded Young GC and mixed GC phases reclaim as much memory as possible, so that Full GC rarely, if ever, has to run.

Unoptimized parameters:

```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
```

So, we tried increasing the allowed GC pause duration:

```
# This parameter sets the target maximum pause time; the default is 200 milliseconds.
-XX:MaxGCPauseMillis=5000
```

Mixed garbage collections use this parameter together with historical GC durations to estimate how many regions can be collected within the target pause time (200 ms by default). If only a few regions are collected per pause and the desired effect is not achieved, G1 has a special strategy: after a stop-the-world (STW) pause and collection it resumes application threads, then pauses again and runs another mixed GC to collect another portion of the regions. This can repeat up to -XX:G1MixedGCCountTarget times (the default is 8).

For example: if 400 Regions need to be collected, and the 200ms limit allows only 50 Regions to be collected at a time, then repeating the process 8 times will collect them all. This avoids a long STW pause from a single collection.

JVM parameters after the first optimization:

```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
```

In the mixed GC logs, the observed pause times were all within the expected range (the pause target is only a goal, not a hard limit). The Full GC logs, however, showed that Full GCs still could not be avoided, and each one took about 20 seconds: these additional parameters only marginally improved GC performance.

By observing the logs, we noticed that during MixedGC scenarios, the old generation wasn’t being properly collected — a large amount of data remained in the old generation without being cleaned up.


So we tried tuning the old generation memory and several performance-related G1 GC parameters.

Optimized parameters were as follows:

```
-Xms42g
-Xmx42g
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```

Heap memory (-Xms / -Xmx) increased from 32G to 42G, indirectly increasing the upper limit of the old generation, which should theoretically reduce the frequency of Full GCs.

The share of CPU time budgeted for GC (-XX:GCTimeRatio) was raised from 10% to 20%: the GC share is 100 / (1 + GCTimeRatio), so lowering GCTimeRatio from G1's default of 9 to 4 doubles the CPU time the collector is allowed to use.

Reserved space (-XX:G1ReservePercent) was increased from 10% to 15%. An evacuation failure occurs when G1 cannot find free regions to evacuate live objects into, which forces a Full GC; increasing the reserve helps avoid this, though it reduces the space usable by the old generation, which is another reason we also grew the heap. This adjustment helps in the following cases:

  • No free regions available when copying live objects from the young generation.
  • No free regions available when evacuating live objects from the old generation.
  • No contiguous space available in the old generation for allocating large objects.

Heap region size (-XX:G1HeapRegionSize) was set to 32 MB to optimize the handling of large objects: with 32 MB regions, only objects larger than 16 MB (half a region) are allocated as humongous objects.

JVM parameters after the second optimization:

```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```

After the optimization, we observed a noticeable decrease in the number of FullGCs that day, but we still didn't reach the ideal state of having zero FullGCs.


Further log analysis showed that the concurrent marking phase was taking a long time and was frequently aborted.


We applied the following parameter tuning:

```
-XX:ConcGCThreads=12
-XX:InitiatingHeapOccupancyPercent=50
```

The number of GC threads running concurrently with the application (-XX:ConcGCThreads) was increased from 4 to 12. The lower this value, the more CPU is left for application threads and the higher the throughput, but if it is too low the concurrent GC cycle takes too long; in that case, increasing the thread count helps. The trade-off is fewer threads for application logic and therefore lower throughput, but for offline data sync scenarios avoiding Full GCs matters more, which makes this parameter crucial.

The concurrent marking threshold (-XX:InitiatingHeapOccupancyPercent) was raised from the default 45% to 50%, so concurrent marking now starts only at higher heap occupancy; each cycle is triggered less often, and the subsequent mixed GCs can reclaim more per cycle.
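
In concrete terms, with a 42 GB heap, marking now kicks in once overall heap occupancy reaches roughly 21 GB (42 GB × 50%), versus about 18.9 GB at the default 45%.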


JVM parameters after the third optimization:

```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:InitiatingHeapOccupancyPercent=50
-XX:+UseStringDeduplication
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:ConcGCThreads=12
-XX:G1HeapRegionSize=32M
```


Optimization Results

Since the configuration was optimized on April 26, no cluster split-brain incidents have occurred, and service availability monitoring shows all nodes have returned to normal.

After the JVM tuning on April 30, over the May Day holiday we achieved zero Full GCs across all three nodes, and no further stalling or anomalies were detected by the system's health-check interfaces.

Although this came at the cost of some application thread throughput, we ensured the stability of the cluster, laying a solid foundation for the internal large-scale rollout of Zeta.

