Cluster Configuration
| Item | Description |
|---|---|
| Quantity | 3 servers |
| Specifications | Alibaba Cloud ECS, 16 cores / 64 GB RAM each |
| Slot Mode | Static, 50 slots |
| ST Memory Config | -Xms32g -Xmx32g -XX:MaxMetaspaceSize=8g |
Exception Issues
Since April, the cluster has experienced three split-brain incidents, each involving one node either splitting off from the cluster or shutting itself down.
Core logs are as follows:
Master Node
The Hazelcast monitoring thread printed Slow Operation warnings, and after the 60 s Hazelcast heartbeat timeout the master logged that node 198 had left the cluster.
198 Worker Node
The logs show that node 198 could no longer obtain heartbeats from the other Hazelcast cluster members, with timeouts exceeding 60000 ms, after which it kept attempting to reconnect to the cluster.
Afterward, any request sent to this node, such as a status query or job submission, hangs and returns nothing.
At this point the entire cluster becomes unavailable and effectively deadlocks: the health-check interfaces we wrote for the nodes are all unreachable. Because the outage hit during the morning peak, as soon as the logs confirmed the cluster had split-brained we restarted it.
After an initial round of parameter tuning, we even saw a node shut itself down automatically.
Problem Analysis
The issue appeared to be a breakdown of Hazelcast cluster membership, with the following possible causes:
- NTP time of the ECS instances in the cluster is out of sync;
- Network jitter on the ECS instances makes nodes temporarily unreachable;
- SeaTunnel hits a Full GC that stalls the JVM long enough for cluster membership to break down.
The first two issues were ruled out by our operations colleagues: no network problems were identified, Alibaba Cloud NTP service is functioning normally, and the server clocks are synchronized across all three servers.
Regarding the third cause: in the last Hazelcast health-check log printed on node 198 before the anomaly, the reported cluster time offset was only about -100 milliseconds, so clock skew seems to have had a limited impact.
So on subsequent startups we added JVM GC logging parameters to observe Full GC durations, and found that the worst Full GC lasted 27 s. Since the three nodes monitor one another through heartbeats, a pause like that easily pushes Hazelcast past the 60 s heartbeat timeout.
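For reference, these are the Java 8 GC logging flags involved (they also appear in the full parameter lists below); the log path is specific to our deployment:

```
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
```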
We also discovered that when synchronizing a 1.4 billion-row ClickHouse table, full GC exceptions are likely to occur after the job has been running for some time.
Solution
Increase ST Cluster Heartbeat Timeout
Hazelcast’s cluster failure detector is responsible for determining whether a cluster member is unreachable or has crashed.
But according to the well-known FLP result, in asynchronous systems, it’s impossible to distinguish between a crashed member and a slow member. A solution to this limitation is to use an unreliable failure detector. An unreliable detector allows members to suspect others of failure, typically based on liveness criteria, though it may make mistakes.
Hazelcast provides the following built-in failure detectors: the Deadline Failure Detector and the Phi Accrual Failure Detector.
By default, Hazelcast uses the Deadline Failure Detector.
There is also a Ping Failure Detector which, if enabled, works in parallel with the above detectors and can detect failures at OSI Layer 3 (network layer). This detector is disabled by default.
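We left the ping detector disabled, but for completeness, here is a minimal hazelcast.yml sketch of how it could be enabled; the hazelcast.icmp.* property names come from the Hazelcast documentation, and the values shown are illustrative rather than something we tuned or validated:

```yaml
hazelcast:
  properties:
    # Turn on ICMP ping checks at the network layer
    hazelcast.icmp.enabled: true
    # Run the ping detector in parallel with the heartbeat-based detectors
    hazelcast.icmp.parallel.mode: true
    # Per-ping timeout and interval, in milliseconds
    hazelcast.icmp.timeout.milliseconds: 1000
    hazelcast.icmp.interval.milliseconds: 1000
    # Number of consecutive failed pings before the member is suspected
    hazelcast.icmp.max.attempts: 3
```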
Deadline Failure Detector
Uses absolute timeout for missed/lost heartbeats. After the timeout, a member is considered crashed/unreachable and marked as suspect.
Relevant parameters and descriptions:
```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: deadline
    hazelcast.heartbeat.interval.seconds: 5
    hazelcast.max.no.heartbeat.seconds: 120
```
| Configuration Item | Description |
|---|---|
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: deadline. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout after which a cluster member is suspected if no heartbeat is received. |
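A quick sanity check on the numbers in the snippet above (simple arithmetic, not a Hazelcast guarantee):

$$\frac{\texttt{max.no.heartbeat}}{\texttt{heartbeat.interval}} = \frac{120\ \text{s}}{5\ \text{s}} = 24$$

so a member is suspected only after roughly 24 consecutive heartbeats go missing, i.e. two full minutes of silence.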
Phi-accrual Failure Detector
Tracks intervals between heartbeats within a sliding time window, measures the average and variance of these samples, and calculates a suspicion level (Phi).
As the time since the last heartbeat increases, the phi value increases. If the network becomes slow or unreliable, leading to increased mean and variance, members are suspected after a longer time without a heartbeat.
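For intuition, the suspicion level in the standard phi-accrual formulation (Hazelcast's implementation may differ in details) is:

$$\varphi(t) = -\log_{10}\bigl(1 - F(t - t_{\text{last}})\bigr)$$

where $F$ is the cumulative distribution of heartbeat inter-arrival times estimated from the recent sample window (its mean and variance), and $t_{\text{last}}$ is when the last heartbeat arrived. A member is suspected once $\varphi$ crosses the configured threshold; a threshold of 10 roughly corresponds to a $10^{-10}$ probability that the missing heartbeat is merely late.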
Relevant parameters and descriptions:
```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 60
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```
| Configuration Item | Description |
|---|---|
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: phi-accrual. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout for suspecting a member due to missed heartbeats. With an adaptive detector, this can be shorter than the timeout used for the deadline detector. |
| hazelcast.heartbeat.phiaccrual.failuredetector.threshold | Phi threshold above which a member is considered unreachable. A lower value detects failures more aggressively but increases false positives; a higher value is more conservative but detects real failures more slowly. Default is 10. |
| hazelcast.heartbeat.phiaccrual.failuredetector.sample.size | Number of heartbeat samples to keep in history. Default is 200. |
| hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis | Minimum standard deviation used in the normal distribution for the phi calculation. |
Reference docs: the Hazelcast failure detector documentation.
To improve accuracy, we followed the community recommendation to switch to phi-accrual in hazelcast.yml and set the timeout to 180 s:
```yaml
hazelcast:
  properties:
    # Newly added parameters
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```
GC Configuration Optimization
SeaTunnel uses the G1 garbage collector by default. The larger the heap, the harder it is for YoungGC/MixedGC to reclaim memory fast enough (even though they run multi-threaded); when they fall behind, objects pile up in the old generation until frequent Full GCs are triggered.
(On Java 8, G1 runs Full GC single-threaded, which is very slow; parallel Full GC for G1 only arrived in JDK 10 with JEP 307.)
If several cluster nodes hit a Full GC at the same time, the chance of cluster membership breaking down increases greatly.
Our goal was therefore to let the multi-threaded YoungGC/MixedGC phases reclaim enough memory that Full GCs are avoided as much as possible.
Unoptimized parameters:
```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
```
So we first tried increasing the allowed GC pause target:
```
# Sets the target maximum GC pause time; the default is 200 milliseconds
-XX:MaxGCPauseMillis=5000
```
Mixed garbage collections use this pause target together with historical GC durations to estimate how many regions can be collected per pause. If a single collection can only cover part of the candidate regions and the desired result isn't reached, G1 splits the work: after a stop-the-world (STW) pause and collection it resumes the application threads, then pauses again and runs another mixed GC on the next portion of regions, repeating up to -XX:G1MixedGCCountTarget times (the default is 8).
For example, if 400 regions need to be collected and the pause limit only allows 50 regions per collection, spreading the work over 8 mixed collections gets through all of them while avoiding one long STW pause.
JVM parameters after the first optimization:
```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
```
The Mixed GC logs showed that pause times stayed within expectations: MaxGCPauseMillis is only a target value, but the observed pauses were all within the expected range.
The Full GC logs, however, showed that Full GCs still could not be avoided, each taking about 20 seconds; these parameters only marginally improved GC behavior.
Looking further at the logs, we noticed that mixed GCs were not reclaiming the old generation effectively: a large amount of data stayed in the old generation without being cleaned up.
So we tried tuning the old generation memory and several performance-related G1 GC parameters.
Optimized parameters were as follows:
```
-Xms42g
-Xmx42g
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```
Heap memory (-Xms / -Xmx) was increased from 32 GB to 42 GB, which indirectly raises the ceiling of the old generation and should, in theory, reduce the frequency of Full GCs.
The share of CPU time targeted for GC (-XX:GCTimeRatio) was raised from 10% to 20%: the target is 100/(1+GCTimeRatio) percent, so the G1 default of 9 corresponds to 10%, while GCTimeRatio=4 corresponds to 100/(1+4) = 20%. This gives GC more CPU time to keep up with allocation.
Reserved space (-XX:G1ReservePercent) was increased from 10% to 15%. An evacuation failure occurs when G1 cannot find a free region to copy surviving objects into, and it typically ends in a Full GC; a larger reserve helps avoid that, at the cost of shrinking the heap available for normal allocation, which is another reason we also enlarged the heap. This adjustment helps in the following cases:
- No free regions available when copying live objects from the young generation.
- No free regions available when evacuating live objects from the old generation.
- No contiguous space available in the old generation for allocating large objects.
Heap region size (-XX:G1HeapRegionSize) was set to 32 MB to make large-object (humongous) allocation and collection cheaper; see the arithmetic below.
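To make these numbers concrete, here is the simple arithmetic implied by the settings above (the rule that allocations larger than half a region are treated as humongous is standard G1 behavior):

$$\frac{42\ \text{GB}}{32\ \text{MB}} = 1344\ \text{regions}, \qquad \text{humongous threshold} = \frac{32\ \text{MB}}{2} = 16\ \text{MB}, \qquad 15\% \times 1344 \approx 202\ \text{regions} \approx 6.3\ \text{GB reserved}$$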
JVM parameters after the second optimization:
```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```
After this round of optimization, the number of Full GCs that day dropped noticeably, but we still had not reached the ideal state of zero Full GCs.
Further log analysis showed that the concurrent marking phase consumed a lot of time and was frequently aborted.
We applied the following parameter tuning:
```
-XX:ConcGCThreads=12
-XX:InitiatingHeapOccupancyPercent=50
```
The number of GC threads that run concurrently with the application (-XX:ConcGCThreads) was increased from 4 to 12. A lower value leaves more CPU for the application and thus higher throughput, but if it is too low the concurrent marking cycle drags on; when that happens, adding concurrent GC threads helps marking finish in time. The trade-off is fewer threads for application logic, which costs some throughput, but for offline data-sync workloads avoiding Full GCs matters more, so this parameter is crucial.
The old-generation occupancy threshold that starts a concurrent marking cycle (-XX:InitiatingHeapOccupancyPercent) was raised from the default 45% to 50%, so marking starts later, once more garbage has accumulated in the old generation, and each marking/mixed-GC round reclaims more per cycle.
JVM parameters after the third optimization:
```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:InitiatingHeapOccupancyPercent=50
-XX:+UseStringDeduplication
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:ConcGCThreads=12
-XX:G1HeapRegionSize=32M
```
JVM Tuning References:
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc_tuning.html#sthref56
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html#pause_time_goal
- https://zhuanlan.zhihu.com/p/181305087
- https://blog.csdn.net/qq_32069845/article/details/130594667
Optimization Results
Since the configuration was optimized on April 26, no cluster split-brain incidents have occurred, and service availability monitoring shows all nodes have returned to normal.
After the JVM tuning on April 30, we went through the May Day holiday with zero Full GCs across all three nodes, and the health-check interfaces detected no further stalls or anomalies.
Although this came at the cost of some application thread throughput, we ensured the stability of the cluster, laying a solid foundation for the internal large-scale rollout of Zeta.