Cluster Configuration
| Item | Description |
|---|---|
| Quantity | 3 servers |
| Specifications | Alibaba Cloud ECS, 16 cores / 64 GB RAM each |
| Slot Mode | Static, 50 slots |
| ST Memory Config | -Xms32g -Xmx32g -XX:MaxMetaspaceSize=8g |
Exception Issues
Since April, the cluster has experienced three split-brain incidents, each involving one node either splitting off from the cluster or shutting itself down.
Core logs are as follows:
Master Node
The Hazelcast monitoring thread printed Slow Operation warnings, and after the 60 s Hazelcast heartbeat timeout the master logged that node 198 had left the cluster.
198 Worker Node
The logs show that node 198 could no longer obtain heartbeats from the other Hazelcast cluster members, with timeouts exceeding 60000 ms, after which it kept attempting to reconnect to the cluster.
Afterward, any request sent to this node, such as a status query or job submission, hangs and returns nothing.
At this point the entire cluster becomes unavailable and effectively deadlocks: the health-check interfaces we wrote for the nodes are all unreachable. Because the outage hit during the morning peak, as soon as the logs confirmed the cluster had split-brained we restarted it.
After an initial round of parameter tuning, we even saw a node shut itself down automatically.
Problem Analysis
The issue appeared to be a breakdown of Hazelcast cluster membership, with the following possible causes:
- NTP time of the ECS instances in the cluster is out of sync;
- Network jitter on the ECS instances makes nodes temporarily unreachable;
- SeaTunnel hits a Full GC that stalls the JVM long enough for cluster membership to break down.
The first two issues were ruled out by our operations colleagues: no network problems were identified, Alibaba Cloud NTP service is functioning normally, and the server clocks are synchronized across all three servers.
Regarding the third cause: in the last Hazelcast health-check log printed on node 198 before the anomaly, the reported cluster time offset was only about -100 milliseconds, so clock skew seems to have had a limited impact.
So on subsequent startups we added JVM GC logging parameters to observe Full GC durations, and found that the worst Full GC lasted 27 s. Since the three nodes monitor one another through heartbeats, a pause like that easily pushes Hazelcast past the 60 s heartbeat timeout.
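For reference, these are the Java 8 GC logging flags involved (they also appear in the full parameter lists below); the log path is specific to our deployment:

```
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
```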
We also discovered that when synchronizing a 1.4 billion-row ClickHouse table, full GC exceptions are likely to occur after the job has been running for some time.
Solution
Increase ST Cluster Heartbeat Timeout
Hazelcast’s cluster failure detector is responsible for determining whether a cluster member is unreachable or has crashed.
But according to the well-known FLP result, in asynchronous systems, it’s impossible to distinguish between a crashed member and a slow member. A solution to this limitation is to use an unreliable failure detector. An unreliable detector allows members to suspect others of failure, typically based on liveness criteria, though it may make mistakes.
Hazelcast provides the following built-in failure detectors: the Deadline Failure Detector and the Phi Accrual Failure Detector.
By default, Hazelcast uses the Deadline Failure Detector.
There is also a Ping Failure Detector which, if enabled, works in parallel with the above detectors and can detect failures at OSI Layer 3 (network layer). This detector is disabled by default.
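We left the ping detector disabled, but for completeness, here is a minimal hazelcast.yml sketch of how it could be enabled; the hazelcast.icmp.* property names come from the Hazelcast documentation, and the values shown are illustrative rather than something we tuned or validated:

```yaml
hazelcast:
  properties:
    # Turn on ICMP ping checks at the network layer
    hazelcast.icmp.enabled: true
    # Run the ping detector in parallel with the heartbeat-based detectors
    hazelcast.icmp.parallel.mode: true
    # Per-ping timeout and interval, in milliseconds
    hazelcast.icmp.timeout.milliseconds: 1000
    hazelcast.icmp.interval.milliseconds: 1000
    # Number of consecutive failed pings before the member is suspected
    hazelcast.icmp.max.attempts: 3
```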
Deadline Failure Detector
Uses absolute timeout for missed/lost heartbeats. After the timeout, a member is considered crashed/unreachable and marked as suspect.
Relevant parameters and descriptions:
```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: deadline
    hazelcast.heartbeat.interval.seconds: 5
    hazelcast.max.no.heartbeat.seconds: 120
```
| Configuration Item | Description |
|---|---|
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: deadline. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout after which a cluster member is suspected if no heartbeat is received. |
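A quick sanity check on the numbers in the snippet above (simple arithmetic, not a Hazelcast guarantee):

$$\frac{\texttt{max.no.heartbeat}}{\texttt{heartbeat.interval}} = \frac{120\ \text{s}}{5\ \text{s}} = 24$$

so a member is suspected only after roughly 24 consecutive heartbeats go missing, i.e. two full minutes of silence.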
Phi-accrual Failure Detector
Tracks intervals between heartbeats within a sliding time window, measures the average and variance of these samples, and calculates a suspicion level (Phi).
As the time since the last heartbeat increases, the phi value increases. If the network becomes slow or unreliable, leading to increased mean and variance, members are suspected after a longer time without a heartbeat.
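For intuition, the suspicion level in the standard phi-accrual formulation (Hazelcast's implementation may differ in details) is:

$$\varphi(t) = -\log_{10}\bigl(1 - F(t - t_{\text{last}})\bigr)$$

where $F$ is the cumulative distribution of heartbeat inter-arrival times estimated from the recent sample window (its mean and variance), and $t_{\text{last}}$ is when the last heartbeat arrived. A member is suspected once $\varphi$ crosses the configured threshold; a threshold of 10 roughly corresponds to a $10^{-10}$ probability that the missing heartbeat is merely late.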
Relevant parameters and descriptions:
```yaml
hazelcast:
  properties:
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 60
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```
| Configuration Item | Description |
|---|---|
| hazelcast.heartbeat.failuredetector.type | Cluster failure detector mode: phi-accrual. |
| hazelcast.heartbeat.interval.seconds | Interval at which members send heartbeat messages to each other. |
| hazelcast.max.no.heartbeat.seconds | Timeout for suspecting a member due to missed heartbeats. With an adaptive detector, this can be shorter than the timeout used for the deadline detector. |
| hazelcast.heartbeat.phiaccrual.failuredetector.threshold | Phi threshold above which a member is considered unreachable. A lower value detects failures more aggressively but increases false positives; a higher value is more conservative but detects real failures more slowly. Default is 10. |
| hazelcast.heartbeat.phiaccrual.failuredetector.sample.size | Number of heartbeat samples to keep in history. Default is 200. |
| hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis | Minimum standard deviation used in the normal distribution for the phi calculation. |
Reference docs: the Hazelcast failure detector documentation.
To improve accuracy, we followed the community recommendation to switch to phi-accrual in hazelcast.yml and set the timeout to 180 s:
```yaml
hazelcast:
  properties:
    # Newly added parameters
    hazelcast.heartbeat.failuredetector.type: phi-accrual
    hazelcast.heartbeat.interval.seconds: 1
    hazelcast.max.no.heartbeat.seconds: 180
    hazelcast.heartbeat.phiaccrual.failuredetector.threshold: 10
    hazelcast.heartbeat.phiaccrual.failuredetector.sample.size: 200
    hazelcast.heartbeat.phiaccrual.failuredetector.min.std.dev.millis: 100
```
GC Configuration Optimization
SeaTunnel uses the G1 garbage collector by default. The larger the heap, the harder it is for YoungGC/MixedGC to reclaim memory fast enough (even though they run multi-threaded); when they fall behind, objects pile up in the old generation until frequent Full GCs are triggered.
(On Java 8, G1 runs Full GC single-threaded, which is very slow; parallel Full GC for G1 only arrived in JDK 10 with JEP 307.)
If several cluster nodes hit a Full GC at the same time, the chance of cluster membership breaking down increases greatly.
Our goal was therefore to let the multi-threaded YoungGC/MixedGC phases reclaim enough memory that Full GCs are avoided as much as possible.
Unoptimized parameters:
```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
```
So we first tried increasing the allowed GC pause target:
```
# Sets the target maximum GC pause time; the default is 200 milliseconds
-XX:MaxGCPauseMillis=5000
```
Mixed garbage collections use this pause target together with historical GC durations to estimate how many regions can be collected per pause. If a single collection can only cover part of the candidate regions and the desired result isn't reached, G1 splits the work: after a stop-the-world (STW) pause and collection it resumes the application threads, then pauses again and runs another mixed GC on the next portion of regions, repeating up to -XX:G1MixedGCCountTarget times (the default is 8).
For example, if 400 regions need to be collected and the pause limit only allows 50 regions per collection, spreading the work over 8 mixed collections gets through all of them while avoiding one long STW pause.
JVM parameters after the first optimization:
```
-Xms32g
-Xmx32g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
```
The Mixed GC logs showed that pause times stayed within expectations: MaxGCPauseMillis is only a target value, but the observed pauses were all within the expected range.
The Full GC logs, however, showed that Full GCs still could not be avoided, each taking about 20 seconds; these parameters only marginally improved GC behavior.
Looking further at the logs, we noticed that mixed GCs were not reclaiming the old generation effectively: a large amount of data stayed in the old generation without being cleaned up.
So we tried tuning the old generation memory and several performance-related G1 GC parameters.
Optimized parameters were as follows:
```
-Xms42g
-Xmx42g
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```
Heap memory (-Xms / -Xmx) was increased from 32 GB to 42 GB, which indirectly raises the ceiling of the old generation and should, in theory, reduce the frequency of Full GCs.
The share of CPU time targeted for GC (-XX:GCTimeRatio) was raised from 10% to 20%: the target is 100/(1+GCTimeRatio) percent, so the G1 default of 9 corresponds to 10%, while GCTimeRatio=4 corresponds to 100/(1+4) = 20%. This gives GC more CPU time to keep up with allocation.
Reserved space (-XX:G1ReservePercent) was increased from 10% to 15%. An evacuation failure occurs when G1 cannot find a free region to copy surviving objects into, and it typically ends in a Full GC; a larger reserve helps avoid that, at the cost of shrinking the heap available for normal allocation, which is another reason we also enlarged the heap. This adjustment helps in the following cases:
- No free regions available when copying live objects from the young generation.
- No free regions available when evacuating live objects from the old generation.
- No contiguous space available in the old generation for allocating large objects.
Heap region size (-XX:G1HeapRegionSize) was set to 32 MB to make large-object (humongous) allocation and collection cheaper; see the arithmetic below.
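To make these numbers concrete, here is the simple arithmetic implied by the settings above (the rule that allocations larger than half a region are treated as humongous is standard G1 behavior):

$$\frac{42\ \text{GB}}{32\ \text{MB}} = 1344\ \text{regions}, \qquad \text{humongous threshold} = \frac{32\ \text{MB}}{2} = 16\ \text{MB}, \qquad 15\% \times 1344 \approx 202\ \text{regions} \approx 6.3\ \text{GB reserved}$$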
JVM parameters after the second optimization:
```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:G1HeapRegionSize=32M
```
After this round of optimization, the number of Full GCs that day dropped noticeably, but we still had not reached the ideal state of zero Full GCs.
Further log analysis showed that the concurrent marking phase consumed a lot of time and was frequently aborted.
We applied the following parameter tuning:
```
-XX:ConcGCThreads=12
-XX:InitiatingHeapOccupancyPercent=50
```
The number of GC threads that run concurrently with the application (-XX:ConcGCThreads) was increased from 4 to 12. A lower value leaves more CPU for the application and thus higher throughput, but if it is too low the concurrent marking cycle drags on; when that happens, adding concurrent GC threads helps marking finish in time. The trade-off is fewer threads for application logic, which costs some throughput, but for offline data-sync workloads avoiding Full GCs matters more, so this parameter is crucial.
The old-generation occupancy threshold that starts a concurrent marking cycle (-XX:InitiatingHeapOccupancyPercent) was raised from the default 45% to 50%, so marking starts later, once more garbage has accumulated in the old generation, and each marking/mixed-GC round reclaims more per cycle.
JVM parameters after the third optimization:
```
-Xms42g
-Xmx42g
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/seatunnel/dump/zeta-server
-XX:MaxMetaspaceSize=8g
-XX:+UseG1GC
-XX:+PrintGCDetails
-Xloggc:/alidata1/za-seatunnel/logs/gc.log
-XX:+PrintGCDateStamps
-XX:MaxGCPauseMillis=5000
-XX:InitiatingHeapOccupancyPercent=50
-XX:+UseStringDeduplication
-XX:GCTimeRatio=4
-XX:G1ReservePercent=15
-XX:ConcGCThreads=12
-XX:G1HeapRegionSize=32M
```
JVM Tuning References:
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc_tuning.html#sthref56
- https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/g1_gc.html#pause_time_goal
- https://zhuanlan.zhihu.com/p/181305087
- https://blog.csdn.net/qq_32069845/article/details/130594667
Optimization Results
Since the configuration was optimized on April 26, no cluster split-brain incidents have occurred, and service availability monitoring shows all nodes have returned to normal.
After the JVM tuning on April 30, we went through the May Day holiday with zero Full GCs across all three nodes, and the health-check interfaces detected no further stalls or anomalies.
Although this came at the cost of some application thread throughput, we ensured the stability of the cluster, laying a solid foundation for the internal large-scale rollout of Zeta.