Akka Cluster Sharding - long GC pauses causing unreachable nodes

Hello everyone,

We are running a cluster with Akka Cluster Sharding (Java) operating on about 30 million sharded entities.
The cluster itself has 5 Akka nodes with 80 GB RAM each and uses 3 Cassandra instances as its persistence layer.

Right now we are having severe issues with long stop-the-world GC pauses of between 30 and 50 seconds during RAM-intensive operations (e.g. passivation), causing the cluster to frequently DOWN a node because it is frozen.
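To make pauses like these visible from inside the JVM, one option is a small probe thread that sleeps for a fixed interval and records how much longer than requested the sleep actually took. The following is only a minimal sketch using the JDK alone; the class name `PauseProbe`, the method `maxStallMillis` and the chosen intervals are hypothetical and not part of Akka. A stall here can also come from OS scheduling, so it is an indicator rather than proof of a GC pause.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Akka: measures how long the JVM "overslept".
class PauseProbe {

    // Sleeps `iterations` times for `intervalMillis` and returns the largest
    // overshoot (actual elapsed time minus requested sleep) in milliseconds.
    static long maxStallMillis(int iterations, long intervalMillis) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop measuring.
                Thread.currentThread().interrupt();
                break;
            }
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, Math.max(0, elapsed - intervalMillis));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Assumed values: 100 samples of 10 ms each.
        System.out.println("worst stall (ms): " + maxStallMillis(100, 10));
    }
}
```

If the worst stall regularly exceeds the configured “acceptable-heartbeat-pause”, the node is likely to be marked unreachable by the other members.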

We tried changing from G1GC to Shenandoah with no real success. We know it’s possible to change “heartbeat-interval”, “acceptable-heartbeat-pause” and “threshold”, but this feels like just fighting symptoms.

Has anyone had this issue before? Are there any Akka-specific (not JVM-specific) solutions for this? Any pitfalls or best practices? Or is it just necessary to tweak the garbage collector until it behaves well enough?

Any help would be appreciated.

Thanks in advance and best regards
Thomas

Have you tried using more Akka nodes? Each with smaller heap size.

Do you passivate the entities that are not active? In new APIs that is automatic, but might need some tweaking of config.

Thanks for your reply Patrick.

>Have you tried using more Akka nodes? Each with smaller heap size.
Not yet. This sounds quite easy, but unfortunately it is not an easy option for us because of operating expenses. Do you have any suggestions for an “acceptable” heap size per node, to get a healthy cluster with “normal” cluster settings?

>Do you passivate the entities that are not active? In new APIs that is automatic, but might need some tweaking of config.
No. We are also using “remember-entities”. The main idea behind the solution is to reduce request times when accessing the entities during high-load working hours. Passivating all the entities would mean restoring all of them from persistence in a very short time, which would result in the opposite of what we wanted to achieve. Also, passivation and freeing RAM (during the night hours) seem to be the biggest issue.

After tuning Shenandoah for low latency via “ShenandoahGuaranteedGCInterval” and “ConcGCThreads”, the garbage collection pauses went away and the cluster stays healthy.
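For anyone who wants to check the effect of such GC tuning on their own nodes, here is a minimal sketch that reads the standard JDK management beans and reports, per collector, how many collections ran and how much time they took in total. The class name `GcSummary` and the method `describeCollectors` are hypothetical. The actual flag values used above are not given in this thread, so none are assumed here.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: summarises the collectors of the running JVM.
class GcSummary {

    // Returns one line per registered collector with its collection count
    // and accumulated collection time in milliseconds (-1 means "undefined").
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", totalMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Note that these counters are accumulated totals, not individual pause lengths, so comparing two snapshots taken before and after a passivation run gives only a rough picture. GC logging is the more precise tool for per-pause durations.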
