Sick Cluster - WeaklyUp Leader

nathanmbrown · May 14, 2020, 1:33pm

On a number of occasions now we have encountered a situation where cluster leadership has moved to a WeaklyUp node. I guess this is reasonable if none of the Up nodes are reachable because of a partition, which is what we have observed, and in that case there is no other choice. However, this scenario doesn’t seem to be handled at all. The Leader does not move itself or any other node to Up, so even if new nodes are brought in and old ones removed, the cluster cannot recover: it is effectively ‘sick’ and needs a full restart. There is no management API to bring the Leader Up either, so manual intervention is not an option.
On top of this, the DowningProvider doesn’t seem to kick in unless the node is Up, so the Leader cannot automatically perform any downing that is needed.
Has this scenario been considered at all? Is there some further configuration that needs to be done to avoid this happening or allow recovery?

johanandren · May 15, 2020, 9:22am

What downing provider are you using?

nathanmbrown · May 15, 2020, 12:17pm

We are using a majority/minority-based one that we’ve coded ourselves, but we don’t see it being activated on a node until that node is Up, so it doesn’t seem able to influence this.

johanandren · May 15, 2020, 12:38pm

It should be started as soon as the cluster extension starts up (in preStart of the ClusterDaemon). It is not tied to the state of the node in any other way, not even to whether it has tried to Join yet, so that sounds very surprising.

nathanmbrown · May 15, 2020, 2:01pm

Thank you for this, you’re right. I now see the issue is that the DowningProvider only subscribes to the Cluster once the member node is Up, which in this case it never will be. We’ll subscribe more eagerly and look at having our DowningProvider help resolve the situation when it occurs.
Thanks again for helping me see this.
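
A minimal sketch of a majority-based decision that does not depend on the deciding node's own status, written as plain Java with no Akka dependency. All names here (`MajorityDowning`, `Member`, `Status`, `Decision`, `decide`) are hypothetical, and counting only Up members toward the majority is an assumption made for illustration, not something stated in this thread or taken from Akka's API. A real provider would feed such a function from the cluster events it subscribes to at startup.

```java
/** Hypothetical, Akka-free model of a majority-based downing decision. */
final class MajorityDowning {

    /** Subset of member states used in this sketch (names mirror the thread's wording). */
    enum Status { JOINING, WEAKLY_UP, UP }

    /** Possible outcomes of one evaluation of the membership view. */
    enum Decision { DOWN_UNREACHABLE, DOWN_SELF, WAIT }

    /** One entry of the membership view as seen by the deciding node. */
    static final class Member {
        final String name;
        final Status status;
        final boolean reachable;

        Member(String name, Status status, boolean reachable) {
            this.name = name;
            this.status = status;
            this.reachable = reachable;
        }
    }

    /**
     * Decides from the full view only. The deciding node's own status is
     * deliberately not a precondition, so a WeaklyUp node can still decide.
     * Assumption: only UP members count toward the majority.
     */
    static Decision decide(Member... view) {
        int up = 0;
        int reachableUp = 0;
        int unreachable = 0;
        for (Member m : view) {
            if (!m.reachable) {
                unreachable++;
            }
            if (m.status == Status.UP) {
                up++;
                if (m.reachable) {
                    reachableUp++;
                }
            }
        }
        if (unreachable == 0) {
            return Decision.WAIT; // no partition observed, nothing to down
        }
        // Strict majority of UP members must be reachable to keep this side alive.
        return 2 * reachableUp > up ? Decision.DOWN_UNREACHABLE : Decision.DOWN_SELF;
    }
}
```

With this rule, a WeaklyUp node that can reach none of the Up members ends up on the minority side and gets `DOWN_SELF`. Whether that is the right outcome for a WeaklyUp-only partition is a policy choice; the point of the sketch is only that the decision runs from the first membership view onward rather than waiting for the node to become Up.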

chbatey · May 18, 2020, 1:50pm

@nathanmbrown we’ll soon have a downing provider as part of Cluster: https://github.com/akka/akka/issues/29085
