"The ShardCoordinator was unable to get an initial state" during rolling update

Hi there.

We are facing a problem when doing rolling updates in our OpenSource project, Eclipse Ditto: https://github.com/eclipse/ditto
We are using Akka 2.5.17 and rely on Akka Persistence together with Sharding.

During rolling updates we see many (> 150) WARN messages from the DDataShardCoordinator saying:

The ShardCoordinator was unable to get an initial state within 'waiting-for-state-timeout': 5000 millis (retrying). Has ClusterSharding been started on all nodes?
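For reference, the timeout named in this warning is configurable. A hedged sketch of the relevant setting (5s is Akka's default; raising it only buys more retries, the underlying cause is usually that the coordinator cannot read its Distributed Data state from a majority of nodes):

```hocon
# Illustrative value – tune only after understanding why the majority read fails.
akka.cluster.sharding.waiting-for-state-timeout = 10s
```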

We have about 40 service instances running when doing the rolling update. The service types for which this fails run different numbers of instances:

  • one has 4 instances
  • one has 3
  • two have only 2 instances

I assume that with only 2 instances of a cluster-sharding role, rolling over 1 of them could cause that error message, as the 1 remaining instance has no majority.
But what about the cluster-sharding roles with 4 and 3 instances? If we roll over 1 instance at a time, the remaining instances should still form a majority, right?

I just stumbled upon akka.cluster.min-nr-of-members, which we currently don’t have set. We do, however, set the cluster role: https://github.com/eclipse/ditto/blob/master/services/things/starter/src/main/resources/things.conf#L112
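For what it’s worth, min-nr-of-members can also be set per role; a sketch of both forms (the role name `things` is only an assumption based on the linked config, and the values are illustrative):

```hocon
akka.cluster {
  # Don't move the cluster to "up" before this many members have joined.
  min-nr-of-members = 3

  # Per-role variant, e.g. for the role used by the things service.
  role {
    things.min-nr-of-members = 2
  }
}
```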

We also see WARN messages from other services which try to send messages to the affected shard regions:

Retry request for shard [52] homes from coordinator at [Actor[akka.tcp://ditto-cluster@]]. [1] buffered messages.

Could we have misconfigured something, or do we need a different number of instances per role in order to do rolling updates correctly?

Thanks in advance and best regards

That sounds strange. Rolling updates should work with all of these instance counts.

How do you shut down the nodes? It is best to leave the cluster gracefully via CoordinatedShutdown, e.g. triggered by SIGTERM.

If the leaving isn’t successful, what downing strategy do you use to remove the stopped node from the cluster membership?

Also, it is best to start the rolling update with the youngest node and leave the oldest until last.

The nodes are shut down via SIGTERM (triggered by Docker when it stops a container), so CoordinatedShutdown should be used. We also hook into CoordinatedShutdown in order to stop HTTP routes, etc.
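To make sure SIGTERM actually runs CoordinatedShutdown to completion, these Akka 2.5 settings are worth double-checking; the values here are illustrative, not a recommendation:

```hocon
akka.coordinated-shutdown {
  # Run CoordinatedShutdown from the JVM shutdown hook (installed for SIGTERM).
  run-by-jvm-shutdown-hook = on
  # Exit the JVM once all shutdown phases have completed.
  exit-jvm = on
  # Give the cluster "leaving" phase enough time to finish.
  phases.cluster-leave.timeout = 15s
}
```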

Now that you mention it: we don’t really have a strategy for removing a node whose leaving was not successful. As a downing strategy we have a simple “majority” mechanism which, in case of a network partition, downs the minority. But that wouldn’t be triggered during a rolling update, and I just noticed that it only runs its minority/majority check every 30s.

So I guess this is where we would have to improve.
How can we find out if leaving wasn’t successful?

Also, it is best to start the rolling update with the youngest node and leave the oldest until last.

Is this possible in a Kubernetes managed cluster?
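For what it’s worth, a Kubernetes Deployment can at least be forced to replace one pod at a time; a hedged sketch (the name and replica count are illustrative, not taken from the Ditto repo):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: things-service   # hypothetical name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never kill a pod before its replacement is Ready
      maxSurge: 1        # roll one pod at a time
  # selector and pod template omitted for brevity
```

If strict ordering matters, a StatefulSet may be the better fit: its default RollingUpdate strategy replaces pods in reverse ordinal order, which often matches “youngest first” when pods joined the cluster in ordinal order.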

It would be unreachable, but still have status Up.

That is the order used by Kubernetes Deployments.