Inefficiencies in recovering with remember entities


(Justin Peel) #1

I run an Akka Cluster Sharding app with a large number of entities, using Distributed Data for storing the remembered entity ids. The recovery strategy is constant. I’ve noticed when rolling the app that a bunch of StartEntityAck messages go to dead letters. This happens because the acks don’t come back within 2 seconds, so the StartEntity messages are resent. Personally I don’t see the point of resending the StartEntity messages, but that is more a symptom that led me to the underlying inefficiency. This app has many messages being sent to the shards right after the cluster is joined.
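For context, the setup described above roughly corresponds to configuration along these lines (key names are from Akka’s classic Cluster Sharding reference config; the rate values are illustrative, not taken from my app):

```hocon
akka.cluster.sharding {
  remember-entities = on
  # remembered entity ids stored in Distributed Data rather than persistence
  state-store-mode = ddata
  # restart remembered entities at a constant rate instead of all at once
  entity-recovery-strategy = "constant"
  entity-recovery-constant-rate-strategy {
    frequency = 100 ms        # illustrative value
    number-of-entities = 5    # illustrative value
  }
}
```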

I looked in DDataShard and found something that I think is majorly slowing down the processing of messages by the shard actors. When a shard receives a message for an entity that hasn’t been started, the message is buffered and the entity id is added to the ddata. The shard then enters a state where it waits for acknowledgement that the ddata update has completed and stashes all incoming messages; when the acknowledgement arrives, everything is unstashed. The main inefficiency I see is that there is no check whether the entity id is already in the remembered entities before attempting to add it to the ddata. This means a lot of unnecessary stashing and unstashing, with delays in between waiting for ddata to be updated. With a large number of messages in the mailbox, this can be very inefficient.
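To make the proposed check concrete, here is a minimal sketch (plain Scala, not the actual DDataShard internals; `State`, `deliver`, and the update counter are hypothetical names for illustration). The idea is simply: consult the locally known remembered-entities set first, and only pay the ddata update plus stash/unstash round-trip for ids that are genuinely new.

```scala
// Hypothetical model of the membership check suggested above.
object ShardBufferSketch {
  // remembered: entity ids the shard already knows are in ddata
  // ddataUpdates: how many simulated ddata write round-trips were paid
  final case class State(remembered: Set[String], ddataUpdates: Int)

  // Handle a message for `entityId`, returning the new shard state.
  def deliver(state: State, entityId: String): State =
    if (state.remembered.contains(entityId))
      state // already remembered: no ddata write, no stashing needed
    else
      // genuinely new id: add to ddata (one round-trip with stashing)
      State(state.remembered + entityId, state.ddataUpdates + 1)

  def main(args: Array[String]): Unit = {
    // 1000 messages for the same not-yet-started entity:
    // with the check, only the first one costs a ddata update.
    val end = Seq.fill(1000)("entity-1").foldLeft(State(Set.empty, 0))(deliver)
    println(end.ddataUpdates) // prints 1
  }
}
```

In the real shard the check would be against the shard’s already-replicated remembered-entities set before initiating the `Replicator.Update`, so messages for known ids skip the waiting-for-update state entirely.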

On a separate note, I’m not sure that it is necessary to wait for the ddata to be updated before processing more messages in the shard. Is it necessary? Is it possible to at least have a config that allows the shards to not wait for the ddata to be updated?


(Patrik Nordwall) #2

In general the remember entities facility is rather heavy because it must keep track of each and every entity. I would explore other solutions before using it on a large number of entities that are frequently started, stopped or rebalanced.

That said, performance improvements are welcome. Please create an issue with a reproducer example that illustrates the problem, or a pull request with a reproducer test.