Node stuck in Leaving after deploy

Hey guys, we have a cluster with 3 nodes and use rolling update as our deployment strategy. The current Akka version is 2.6.3 and Akka Management is 1.0.5.

The situation goes like this: we have the cluster with all three nodes healthy and responding to incoming requests. At a certain moment we trigger a deploy, and the old nodes get stuck in Leaving, which prevents the new nodes from joining.

Disclaimer: this behaviour doesn’t occur on every deploy. Some deploys go OK and others cause this situation.

We are using akka-dns and deploying through Kubernetes. I’m going to attach logs from Kibana and the cluster status from the Akka Management API.
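For context, our bootstrap/discovery setup looks roughly like this (the service name below is a placeholder, not our real one):

```hocon
akka.management.cluster.bootstrap {
  contact-point-discovery {
    # placeholder service name for illustration only
    service-name = "catalog-domain"
    discovery-method = akka-dns
  }
}
```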

In the following gist you have the logs from Kibana. I first made a deploy at 18:34:22.487Z and it finished OK at 18:38:12.333Z. Then at 19:31:15.179Z I made another deploy, which caused the problem.

Also in this gist is the response of the Akka Management API /cluster/members/ endpoint.

```
The ShardCoordinator was unable to update a distributed state within 'updating-state-timeout': 5000 millis (retrying). Perhaps the ShardRegion has not started on all active nodes yet? event=ShardRegionTerminated(Actor[akka://Catalog-Domain@])
```
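For reference, the timeout mentioned in that log line corresponds to this sharding setting (the 5 s shown is the default; raising it would only mask the underlying distribution problem):

```hocon
akka.cluster.sharding {
  # default value; the log above shows this limit being hit
  updating-state-timeout = 5 s
}
```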

I think I’ve found where the root problem is, but I’m not 100% sure. When new nodes are trying to join the cluster, the sharded actors are rebalanced across the new nodes, and updating that state on the new nodes is what is failing.

As stated in the docs, the default state-store mode is DistributedData and the other is Persistence. All of our sharded actors are persistent; is it advisable to use sharding.state-store-mode = persistence in our scenario?
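If we were to switch, I understand the setting would look like this (a sketch, not verified in our setup; it also requires a configured persistence journal):

```hocon
akka.cluster.sharding {
  # switch coordinator state from ddata (the default) to persistence
  state-store-mode = persistence
}
```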

I’ve also read in docs:

> Cluster sharding will not be active on members with status WeaklyUp if that feature is enabled.

and I can see some WeaklyUp members in my /cluster/members response, so one idea that comes to mind is: when a new member’s state is WeaklyUp, the sharding state of our cluster won’t be able to be distributed, causing the problem described; and when none of the new members is in WeaklyUp state, the cluster reaches convergence and the deploy goes fine.
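If WeaklyUp turns out to be the trigger, I believe the feature could be disabled with the following setting (I haven’t tried this yet):

```hocon
akka.cluster {
  # WeaklyUp is enabled by default in Akka 2.6; turning it off makes
  # joining nodes wait for gossip convergence before becoming Up
  allow-weakly-up-members = off
}
```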

As I mentioned, I’m not sure about what I’m saying.

The gist from management shows that there are several unreachable members. Those most probably must be downed before the cluster can continue with removals and joining.

In clean rolling updates there shouldn’t be any such situation. I’m not sure what happens in your case. Maybe nodes were killed before they completed the leaving/removal process?
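If the unreachable members are not removed automatically, they can be downed manually through the same management HTTP API, along these lines (host, port, and node address below are placeholders for illustration):

```shell
# mark an unreachable member as Down via the Akka Management HTTP API
# (placeholder host/port and node address; adjust to your cluster)
curl -X PUT \
  -F operation=down \
  http://localhost:8558/cluster/members/akka://Catalog-Domain@10.0.0.12:25520
```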