Akka-Cluster: Decreasing system performance having many active actors

motmot80 · September 24, 2020, 9:51am

Hello everyone,

we are running an actor system (cluster sharding) having about 5 Million per node. With a constant inbound message rate we are seeing the system slowing down having more and more actors in memory.

few actors: 1 ms
5M actors: > 200 ms

Does the amount of active but unused actors have a significant negative impact on the latency?

System:

5 node cluster
akka-cluster-sharding
akka-persistence-cassandra

F.e. does the dispatcher need to check 5M mailboxes when only 1K mailboxes have messages?

Thanks in advance and best regards
Thomas S.

manuelbernhardt · September 24, 2020, 10:41am

Do you observe the increased latency also after a period of stability (when the 5 million actors have been created, rebalanced and the shard regions have all had a chance to cache the entities’ shards whereabouts)? I would need to check but my gut feeling tells me this has something to do with the communication path more than with individual actors replying to messages.

Wondering what you’re doing with 5 million actors per node :) I mean at this scale it may make sense to have a few more nodes around

manuelbernhardt · September 24, 2020, 10:54am

One thing just crossed my mind: < 1ms means you didn’t cross a network boundary - so the additional latency you see is possibly related to the actors not all being local anymore but instead rebalanced onto other nodes

Scratch that - I was thinking in μs.

To get a better sense of what is goin on could you share more details about the nodes, mainly how many CPU cores they have (or vCPUs if this runs on virtualized hardware)?

johanandren · September 24, 2020, 12:54pm

The dispatcher only acts on actors with messages, it does not go through all actor mailboxes.

The JVM heap is a shared resource though, so GC times could be one thing to look into.

motmot80 · September 29, 2020, 7:33pm

12 CPUs and 64 GB RAM each. We already changed from G1C to Shenandoah because of GC stop-the-world problems.

In our use case we constantly are spawning (empty recovery) and passivating about 20% of these actors (sharded entities). The cassandra write and read latency seems to be fine.

Just for testing purposes we changed the AbstractPersistentActor to an AbstractActor and implemented the recovery and persist ourselfs by loading and saving the state snapshot using the cassandra directly (blocking i/o *).

I know this is evil magic. It was just for testing … We won’t do it again : )

Test results:

The throughput doubled. Which is confusing me because in this case we are saving the entire state to cassandra every time instead of using AbstractPersistentActor.persist to save the commands only (CQRS), which should be much more efficient. (We already checked the setting max-concurrent-recoveries)
The AbstractPersistentActor implementation itself seems to consume about +600 bytes heap per actor instance (AbstractActor = 400 bytes ).

Maybe the whole problem is caused by GC collections. Right now it seems like the AbstractPersistentActor implementation is heap expensive but unfortunately not very fast…

Thanks for your support
Thomas S.

johanandren · September 30, 2020, 8:45am

That the actors are persistent actors are relevant, because that means they also share the connection pool to the database and database as a resource. There is a limit on concurrent replays configured through akka.persistence.max-concurrent-recoveries that you could try tweaking. It is also possible to trim Cassandra plugin replay by turning off features of the plugin you do not use (deletions for example).

motmot80 · October 3, 2020, 11:34am

As I mentioned we cheched max-concurrent-recoveries and increased it. Which was an improvement of about 5-8%.

akka.persistence.cassandra: Do you mean something like disalbe events-by-tag.disable and journal.support-deletes? This seems to be very promising. We’ll definitely try this.

Also we’ll try to reduce the heap size per node by using more nodes.

johanandren · October 6, 2020, 6:10pm

Yes, at least support-deletes leads to an extra query on when a persistent actor recovers (since it needs to check to what if any offset the events were deleted). Not entirely sure events by tag causes any overhead unless you actually use tags.

motmot80 · October 28, 2020, 7:44am

By optimizing the heap size, solving a message serialization size issue and using smaller nodes the system is now performing as expected.

Thanks for your help,
Thomas

Topic		Replies	Views
Akka Cluster Sharding Performance Issue with Many Active Entities (>10 MM) Akka Cluster	5	1943	October 21, 2018
Actor Cluster Slowness when Large size messages produced by a ShardRegion proxy Akka Cluster akka , akka-typed , scala , akka-cluster	6	696	September 26, 2022
Actor slowness in Akka Cluster when multiple threads send messages to ShardRegion with large Message size Akka Cluster akka , akka-typed , akka-cluster	5	609	September 19, 2022
Akka Cluster slowness when multiple threads sends messages to entity actor from a ShardRegion Proxy Akka akka-cluster	1	392	September 13, 2022
[PERF] Cluster Sharding perf issue Akka Cluster akka-cluster	0	628	August 19, 2021

Akka-Cluster: Decreasing system performance having many active actors

Related Topics