Akka Cluster Sharding Performance Issue with Many Active Entities (>10 MM)


(Mihajlo J.) #1

Hi, I wrote a small prototype app using the Cluster Sharding module and observed some surprising (at least to me) results when testing with a large number of active persistent entities (over 10 million). I understand that typically you’d want to passivate entities whenever possible, but I wanted to see whether it was possible to keep a large number of them active across several (up to 8) fairly good-sized nodes (running on AWS EC2 c5.2xlarge instances).

What I observed was that, as the number of entities grows, so do the memory footprint and CPU utilisation, to the point where the entire system stops responding due to resource starvation. While I do expect resource usage to grow in this case, I wasn’t expecting it to grow this fast; I was under the impression that the Akka dispatcher model is fairly efficient, with very small per-actor overhead (~350 bytes), and capable of scaling to millions of actors in a single JVM process. Here I had what I thought were more than enough resources, but the system was not scaling well. Perhaps there’s a setting or a flag that I missed?

I took thread and heap dumps but could find nothing obvious, other than to speculate that there must be a large overhead related to the intra-cluster communication required for cluster sharding. Any help or explanation of what might be happening would be greatly appreciated. Thanks!
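(For reference, the automatic passivation mentioned above can be enabled via configuration; a minimal sketch, where the timeout value is purely illustrative:)

```hocon
akka.cluster.sharding {
  # Passivate entities that have received no message for this long.
  # The 2-minute value here is an example, not a recommendation.
  passivate-idle-entity-after = 2 minutes
}
```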


(Johan Andrén) #2

As far as I recall, the biggest recent benchmark we did went up to somewhere around 200M actors across a cluster, so 10M doesn’t sound like it should be a problem.

Intra-cluster communication shouldn’t grow much with a larger number of sharded actors; it would, however, grow with the number of messages sent between actors on different nodes, so maybe something in your messaging patterns is causing it?

Can you share the sources of this prototype?


(Mihajlo J.) #3

Thanks for your response, Johan. Your benchmark result is more in line with what I was expecting, which is why I was surprised by the poor resource utilisation of my prototype. Most likely it is something stupid that I’m doing. Unfortunately, with my code being in AWS CodeCommit, it is not easy for me to share, but it really is rather simple and heavily based on the examples of how to do cluster sharding. My sharded actor’s state is just a single Long encapsulated in a case class; all I do in my test is set that state and keep the actor in memory. I am using Akka HTTP with JSON; the request is unmarshalled and simply passed on to the shard region actor, which uses consistent hashing of a single String id to create the sharded entity.
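(The hash-based routing I described is essentially what Akka’s default message extractor does; a minimal, self-contained sketch of that shard-id computation, with illustrative names and shard count:)

```scala
// Sketch of hash-based shard-id extraction, mirroring the logic of
// Akka's default HashCodeMessageExtractor (names are illustrative).
object ShardIdExtractor {
  // Map an entity's String id onto one of `numberOfShards` shard ids.
  // The same entity id always maps to the same shard, so the entity
  // is always routed to the same node (until shards are rebalanced).
  def shardId(entityId: String, numberOfShards: Int): String =
    math.abs(entityId.hashCode % numberOfShards).toString
}
```

(A common rule of thumb is to set the number of shards to roughly ten times the maximum planned number of nodes, and never to change it afterwards, since that would change the id-to-shard mapping.)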

May I ask what type/size of cluster you used in your large benchmark?


(Johan Andrén) #4

It was 215M persistent actors using Cassandra as the backend on a 50-node Akka cluster; I think we also did something like 40M “lightweight” actors on a 10-node cluster.


(Ismael Hamed) #5

@johanandren out of curiosity, are those benchmarks (code) public? I’m interested in how you guys managed to make Cassandra survive a cold start with a cluster of that size (215M persistent actors are going to hammer your persistent store really hard during recovery).


(Johan Andrén) #6

The sources are public, https://github.com/akka/apps, but not amazingly accessible (not much documentation, etc.), and I’m not sure which branch/PR that specific scaling run was (or whether it was even pushed there).

I think it did some form of incremental start, rather than starting all 215M at once, so as not to overwhelm Cassandra.
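(One way to pace such an incremental start, purely a hypothetical sketch and not the code from that run, is to wake entities in fixed-size batches on a timer instead of all at once:)

```scala
// Hypothetical sketch of pacing a cold start: wake persistent entities
// in fixed-size batches (e.g. one batch per scheduler tick) so the
// journal isn't hit by all recoveries simultaneously. Names and
// numbers are illustrative.
object IncrementalStart {
  // Split entity ids into batches; a scheduler would send a wake-up
  // message to each batch's entities on every tick.
  def batches[A](ids: Seq[A], batchSize: Int): Iterator[Seq[A]] =
    ids.grouped(batchSize)

  // Rough ramp-up duration in ticks (rounded up) for a given entity
  // count and batch size.
  def rampUpTicks(totalEntities: Long, batchSize: Long): Long =
    (totalEntities + batchSize - 1) / batchSize
}
```

(For example, waking 215M entities at 100k per second would take about 2150 seconds, i.e. roughly 36 minutes, spreading recovery load evenly over that window.)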