We are facing a problem when doing rolling updates in our open source project, Eclipse Ditto: https://github.com/eclipse/ditto
We are using Akka 2.5.17 and rely on Akka Persistence together with Sharding.
During rolling updates we see many (> 150) WARN messages like this one:
The ShardCoordinator was unable to get an initial state within 'waiting-for-state-timeout': 5000 millis (retrying). Has ClusterSharding been started on all nodes?
We have about 40 service instances running when doing the rolling update. The services for which this fails run with different numbers of instances:
- one has 4 instances
- one has 3
- two have only 2 instances
I assume that with only 2 instances of a cluster-sharding role, updating 1 of them causes that warning, because the 1 remaining instance has no majority.
But what about the cluster-sharding roles with 4 and 3 instances? If we update 1 instance at a time, the remaining instances should still have a majority, right?
I just stumbled upon akka.cluster.min-nr-of-members, which we currently don’t have set. We do, however, set the cluster role: https://github.com/eclipse/ditto/blob/master/services/things/starter/src/main/resources/things.conf#L112
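For reference, here is a minimal sketch of where that setting lives in the Akka configuration, in both its cluster-wide and per-role form. The role name "things" and the member counts are assumptions for illustration only, not our actual values, and I don't know whether any of this influences the warnings above:

```hocon
akka.cluster {
  # Assumed role name, for illustration; the real one is set in things.conf.
  roles = ["things"]

  # Cluster-wide: the leader keeps members in 'Joining' until this many
  # nodes (of any role) have joined. The Akka default is 1.
  min-nr-of-members = 1

  # Per-role variant: additionally require this many nodes of the given role.
  # The value 2 is a placeholder.
  role {
    things.min-nr-of-members = 2
  }

  sharding {
    # Restrict shard regions to nodes with this role (same assumed name).
    role = "things"

    # The timeout named in the WARN message above (5000 millis).
    waiting-for-state-timeout = 5 s
  }
}
```

As far as I understand, min-nr-of-members only gates the move from 'Joining' to 'Up' when the cluster forms, so it may not matter while an already running cluster is being updated.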
We also see WARN messages from other services which try to send messages to the affected shard regions:
Retry request for shard  homes from coordinator at [Actor[akka.tcp://email@example.com:2552/system/sharding/thingCoordinator/singleton/coordinator#-1696191213]].  buffered messages.
Could we have misconfigured something, or do we need a different number of instances per role in order to do rolling updates correctly?
Thanks in advance and best regards