Sick Cluster - WeaklyUp Leader

On a number of occasions now we have encountered a situation where the cluster Leader has moved to a WeaklyUp node. I guess this is reasonable if all other Up nodes are inaccessible due to a partition (which is what we have observed, and in that case there is no other choice), but the scenario doesn’t seem to be handled at all. The Leader does not bring itself Up or bring any further nodes Up, so even if new nodes are brought in and old ones removed, the cluster cannot recover: it is effectively ‘sick’ and needs to be fully restarted. There is no management API to bring the Leader Up either, so manual intervention is not an option.
On top of this, the DowningProvider doesn’t seem to kick in unless the node is Up, so the Leader cannot automatically perform any downing needed.
Has this scenario been considered at all? Is there some further configuration that needs to be done to avoid this happening or allow recovery?

What downing provider are you using?

We are using a majority/minority based one we’ve coded ourselves, but we don’t see it activated on a node until that node is Up, so it doesn’t seem able to influence this.

It should be started as soon as the cluster extension starts up (preStart of the ClusterDaemon). It is not tied to the state of the node in any other way, not even to whether the node has tried to Join yet, so that sounds very surprising.

Thank you for this, you’re right. I now see the issue is that our DowningProvider only subscribes to the Cluster once the member node is Up, which in this case it never will be. We’ll subscribe more eagerly and look at having our DowningProvider help resolve the situation when it occurs.
Thanks again for helping me see this.
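
For anyone hitting the same problem, here is a minimal sketch of what "subscribing more eagerly" might look like. It is not the poster's actual provider: the names `EagerDowningProvider` and `EagerDowningActor`, the package in the config comment, and the removal margin are all hypothetical, and the majority/minority decision itself is deliberately left out. The only point it illustrates is subscribing in `preStart` of the downing actor rather than waiting for the local member to become Up.

```scala
import akka.actor.{Actor, ActorLogging, ActorSystem, Props}
import akka.cluster.{Cluster, DowningProvider}
import akka.cluster.ClusterEvent._

import scala.concurrent.duration._

// Hypothetical provider, assumed to be registered with something like:
//   akka.cluster.downing-provider-class = "com.example.EagerDowningProvider"
// Akka instantiates the provider reflectively, so it takes an ActorSystem.
final class EagerDowningProvider(system: ActorSystem) extends DowningProvider {

  // Illustrative value only, not a recommendation.
  override def downRemovalMargin: FiniteDuration = 10.seconds

  override def downingActorProps: Option[Props] =
    Some(Props(new EagerDowningActor))
}

// Hypothetical downing actor: it subscribes as soon as it starts,
// regardless of whether the local member is Joining, WeaklyUp or Up.
final class EagerDowningActor extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(
      self,
      InitialStateAsSnapshot,
      classOf[MemberEvent],
      classOf[ReachabilityEvent])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  override def receive: Receive = {
    case state: CurrentClusterState =>
      log.info("Initial view: {} members, {} unreachable",
        state.members.size, state.unreachable.size)

    case MemberWeaklyUp(member) =>
      log.info("Member is WeaklyUp: {}", member.address)

    case UnreachableMember(member) =>
      log.info("Member unreachable: {}", member.address)
      // A majority/minority decision would go here (not shown);
      // it would end in a call such as cluster.down(member.address).

    case ReachableMember(member) =>
      log.info("Member reachable again: {}", member.address)

    case _: MemberEvent =>
      // Other membership transitions are ignored in this sketch.
  }
}
```

Because the subscription happens in `preStart`, the actor receives the initial snapshot and subsequent reachability events even on a node that stays WeaklyUp, which is the case described above.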

@nathanmbrown we’ll soon have a downing provider as part of Cluster: https://github.com/akka/akka/issues/29085