Akka - Actor Cluster Inconsistent Shard rebalance while Rolling restart of nodes

Issue

When a shard holds a large number of actors, rebalancing does not happen properly during a rolling restart of the nodes.
When the nodes are restarted one by one in sequence, some shards are not rebalanced to another node properly.
Other cluster sharding operations also stop working after the sequential restart, although the nodes remain connected to each other in the cluster.

Cluster Setup

Using Akka Cluster 2.5.32. The cluster is formed with 3 nodes.
Each node has 2 shards, and each shard has 600 entity actors.
So in total there are 6 shards → 3600 entity actors in the cluster.
Enabled settings:

akka.cluster.sharding.remember-entities = on
akka.cluster.sharding.remember-entities-store = ddata
akka.cluster.sharding.distributed-data.durable.keys = []

akka.remote.artery {
  enabled = on
  transport = tcp
}

At Start

  1. Node 1 Shard 4,3 with [ 600,600 ] entities
  2. Node 2 Shard 1,5 with [ 600,600 ] entities
  3. Node 3 Shard 0,2 with [ 600,600 ] entities

Restart & Rebalance Flow

When restarting, the node leaves the cluster by invoking cluster.leave(cluster.selfAddress());

i) Node 1 Restarts

14:59:45:024 Node 1 Leaves the Cluster.
14:59:48:800 In Node 1 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
14:59:48:800 In Node 1 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:00:48:805 In Node 3 there is a log: Rebalance shard [Shard_4] done [false] & Rebalance shard [Shard_3] done [false]
15:00:48:980 In Node 3, Shard_4 is rebalanced and all 600 entity actors are recreated.
15:00:51:935 In Node 2, Shard_3 is rebalanced and all 600 entity actors are recreated.

After Node 1 Restarts Now

  1. Node 1 Down
  2. Node 2 Shard 1,5,3 with [ 600,600,600 ] entities
  3. Node 3 Shard 0,2,4 with [ 600,600,600 ] entities

ii) Node 2 Restarts

15:01:23:209 Node 1 Rejoins the Cluster.
15:01:24:052 Node 2 Leaves the Cluster.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_3 to terminate 600 entities.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_5 to terminate 600 entities.
15:01:26:804 In Node 2 : Starting HandOffStopper for shard Shard_1 to terminate 600 entities.
15:02:26:794 In Node 3 there is a log: Rebalance shard [Shard_3] done [false], Rebalance shard [Shard_5] done [false] & Rebalance shard [Shard_1] done [false]
15:02:27:028 In Node 1, Shard_5 is rebalanced and only 580 entity actors are recreated.
15:02:28:237 In Node 1, Shard_1 is rebalanced and only 61 entity actors are recreated.
15:02:32:967 In Node 1, Shard_3 is rebalanced and only 35 entity actors are recreated.
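
One pattern in the timestamps so far: the gap between "Starting HandOffStopper" on the leaving node and "Rebalance shard [...] done [false]" on Node 3 is roughly 60 seconds in both restarts. The sketch below computes that gap from the logged values. `HandOffGap` and `gapMillis` are hypothetical names, not Akka APIs, and the two log lines in each pair come from different nodes, so clock skew may shift the result slightly. A gap this close to 60 seconds may correspond to the `akka.cluster.sharding.handoff-timeout` setting (60 s by default, as far as I know), which would suggest the hand-off timed out instead of completing, but that reading is an assumption the logs alone do not confirm.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Akka: measures the time between two log timestamps.
final class HandOffGap {
    // The logs in this thread use HH:mm:ss:SSS, with a colon before the milliseconds.
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss:SSS");

    // Returns the elapsed milliseconds between two same-day log timestamps.
    static long gapMillis(String from, String to) {
        LocalTime start = LocalTime.parse(from, LOG_TIME);
        LocalTime end = LocalTime.parse(to, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // HandOffStopper start on the leaving node vs. "done [false]" on Node 3.
        System.out.println("Node 1 restart: " + gapMillis("14:59:48:800", "15:00:48:805") + " ms");
        System.out.println("Node 2 restart: " + gapMillis("15:01:26:804", "15:02:26:794") + " ms");
    }
}
```

With the timestamps from the logs this gives 60005 ms for the Node 1 restart and 59990 ms for the Node 2 restart.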

After Node 2 Restarts Now

  1. Node 1 Shard 1,5,3 with [ 61,580,35 ] entities
  2. Node 2 Down
  3. Node 3 Shard 0,2,4 with [ 600,600,600 ] entities

iii) Node 3 Restarts

15:02:51:133 Node 2 Rejoins the Cluster.
15:02:51:799 Node 3 Leaves the Cluster.
15:02:55:621 In Node 3 : Starting HandOffStopper for shard Shard_0 to terminate 600 entities.
15:02:55:621 In Node 3 : Starting HandOffStopper for shard Shard_2 to terminate 600 entities.
15:02:55:622 In Node 3 : Starting HandOffStopper for shard Shard_4 to terminate 600 entities.
15:03:55:772 In Node 2, Shard_0 is rebalanced and only 116 entity actors are recreated.
Shard_2 & Shard_4 are not rebalanced to Node 2.
15:04:27:943 Node 3 Rejoins the Cluster

After Node 3 Restarts Now

  1. Node 1 Shard 1,5,3 with [ 61,580,35 ] entities
  2. Node 2 Shard 0 with [ 116 ] entities
  3. Node 3 No Shards.
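
To put the loss in numbers, here is a small tally of the per-shard counts from the log lines above against the expected 6 × 600 entities. It is plain arithmetic on the reported values; `EntityTally` and its methods are hypothetical names used only for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical tally of the counts reported in the logs; not an Akka API.
final class EntityTally {
    static final int SHARDS = 6;
    static final int ENTITIES_PER_SHARD = 600;

    // Entities recreated per shard after all three restarts, taken from the log lines.
    static Map<String, Integer> recreatedByShard() {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("Shard_5", 580);
        counts.put("Shard_1", 61);
        counts.put("Shard_3", 35);
        counts.put("Shard_0", 116);
        counts.put("Shard_2", 0); // not rebalanced to any node
        counts.put("Shard_4", 0); // not rebalanced to any node
        return counts;
    }

    static int expectedTotal() {
        return SHARDS * ENTITIES_PER_SHARD;
    }

    static int recreatedTotal() {
        int sum = 0;
        for (int count : recreatedByShard().values()) {
            sum += count;
        }
        return sum;
    }

    public static void main(String[] args) {
        int expected = expectedTotal();
        int recreated = recreatedTotal();
        System.out.println("expected=" + expected
                + " recreated=" + recreated
                + " missing=" + (expected - recreated));
    }
}
```

So only 792 of the 3600 entities are running after the three restarts, and 2808 are missing.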

We have a singleton actor in the cluster that logs the current cluster member status at a fixed time interval. It also logs the number of actors in the cluster by invoking

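// Request cluster-wide sharding stats, allowing 5 s for the regions to answer.
GetClusterShardingStats getStats = new ShardRegion.GetClusterShardingStats(FiniteDuration.create(5000, TimeUnit.MILLISECONDS));
// `timeout` is the ask timeout defined elsewhere in the singleton.
Future<Object> ack = Patterns.ask(region, getStats, timeout).toCompletableFuture();
// Block for at most 5 s waiting for the reply.
ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);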

With these cluster stats, the singleton logs the active actor count with a shard-wise split-up.

Before this sequential Restart it logs

Active Actor Count 3600  &  Members [ Node1:Up,Node2:Up,Node3:Up ] 

After this sequential Restart it logs

Members [ Node1:Up,Node2:Up,Node3:Up ] 

And it is unable to fetch the active actor count from

ClusterShardingStats ss = (ClusterShardingStats) ack.get(5000, TimeUnit.MILLISECONDS);
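
Since this blocking `get` is where the singleton fails, it may help to turn the failure into an explicit "no stats" result that can be logged, instead of letting the exception escape. A minimal sketch follows. `StatsQuery` and `awaitReply` are hypothetical names, and a plain `CompletableFuture` stands in for the Akka ask so the example runs without Akka on the classpath.

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper around the blocking get; not an Akka API.
final class StatsQuery {
    // Waits for the reply and returns it only if it arrived in time and has the expected type.
    static <T> Optional<T> awaitReply(CompletableFuture<?> ack, Class<T> type, long timeoutMillis) {
        try {
            Object reply = ack.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return type.isInstance(reply) ? Optional.of(type.cast(reply)) : Optional.empty();
        } catch (TimeoutException e) {
            // No reply within the timeout: report "no stats" to the caller.
            return Optional.empty();
        } catch (ExecutionException e) {
            // The future completed with a failure (for an ask, typically a timeout on the Akka side).
            return Optional.empty();
        } catch (InterruptedException e) {
            // Restore the interrupt flag so the calling thread can react to it.
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<Object> answered = CompletableFuture.completedFuture("stats");
        CompletableFuture<Object> unanswered = new CompletableFuture<>();
        System.out.println(awaitReply(answered, String.class, 100).isPresent());
        System.out.println(awaitReply(unanswered, String.class, 100).isPresent());
    }
}
```

In the singleton, an empty result could be logged as "stats unavailable" next to the member list, which separates a query that timed out from a cluster that reports zero entities.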

Queries

  1. In my use case the nodes will be restarted frequently like this.
  2. Why is the cluster sharding rebalance not done properly here, and is there any clue about what the problem may be?
  3. Is any more information needed to debug this issue further?

You are using a very old version of Akka. We have made many fixes and improvements to the remember entities feature in later versions. I recommend that you update to Akka 2.7.0.

This issue occurs even with 500 entities.
Moving to the latest version is fine; we will do it.
But we need to know whether this issue occurs because of the old version we are using, or because of something we are missing in configuration or design.

Can we be sure it won't occur after upgrading to the latest version?

@patriknw @leviramsey @davidogren

If you see the same with the latest version we would be happy to investigate logs and such.

See my thoughts on this in the thread you started on Stack Overflow: Akka - Actor Cluster Inconsistent Shard rebalance while Rolling restart of nodes
