We are running a cluster with Akka Cluster Sharding (Java) operating on about 30 million sharded entities.
The cluster itself has 5 Akka nodes with 80 GB RAM each and uses 3 Cassandra instances as the persistence layer.
Right now we are having severe issues with long stop-the-world GC pauses of between 30 and 50 seconds during RAM-intensive operations (e.g. passivation), causing the cluster to frequently DOWN a node because it is frozen.
We tried changing from G1GC to Shenandoah with no real success. We know it’s possible to change “heartbeat-interval”, “acceptable-heartbeat-pause” and “threshold”, but this feels like just fighting symptoms.
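For reference, the settings we mean are the cluster failure detector ones. A minimal sketch of where they live in `application.conf`; the values are illustrative only (close to the defaults, as far as we know) and are not our actual configuration:

```hocon
akka.cluster.failure-detector {
  # How often heartbeat messages are sent to the monitored nodes
  heartbeat-interval = 1 s

  # Phi accrual threshold; a higher value tolerates more delay
  # before a node is marked unreachable
  threshold = 8.0

  # Longest gap between heartbeats (e.g. a GC pause) that is
  # still accepted before the node is considered unreachable
  acceptable-heartbeat-pause = 3 s
}
```

With pauses of 30 to 50 seconds, `acceptable-heartbeat-pause` would have to be raised well beyond that range to keep a frozen node from being marked unreachable, which is why tuning it feels like treating the symptom rather than the cause.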
Has anyone had this issue before? Are there any Akka-specific (not JVM-specific) solutions for this? Any pitfalls or best practices? Or is it just necessary to tweak the garbage collector until it behaves well enough?
Any help would be appreciated.
Thanks in advance and best regards