Cluster.get(context().system()).state() returns unreachable members in both state.getUnreachable() and state.getMembers() sets

darshan · April 23, 2018, 3:29am

My understanding is that Cluster.get(context().system()).state() should return the cluster state having down/unreachable members only in the state.getUnreachable() set but not in state.getMembers() set.

This is true for all clusters we have except one.

What I am trying to find out is which configuration / setting could cause this?

Most likely that setting does not match the value in other clusters but I could not find any difference in any of the settings between the clusters - any help / pointers highly appreciated.

Thanks,
Darshan.

johanandren · April 23, 2018, 2:31pm

getMembers() can contain both unreachable and reachable nodes. When a node is downed or leaves the cluster gracefully, it becomes removed, and after that it is not in getMembers() anymore.

If a node is unreachable, it will also end up in the getUnreachable() set.
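A minimal sketch of that relationship, assuming only that `getMembers()` yields an `Iterable` and `getUnreachable()` a `java.util.Set` of the same element type: the reachable nodes are the members minus the unreachable set. The class name, the helper name `reachable`, and the node names are hypothetical and chosen for illustration; the helper is generic, so it is exercised here with plain strings rather than real cluster members.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReachableMembers {

    // Returns the members that are not currently in the unreachable set.
    // Generic on purpose: with Akka it could be called as
    // reachable(state.getMembers(), state.getUnreachable()).
    public static <T> List<T> reachable(Iterable<T> members, Set<T> unreachable) {
        List<T> result = new ArrayList<>();
        for (T member : members) {
            if (!unreachable.contains(member)) {
                result.add(member);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical node names standing in for cluster members.
        List<String> members = Arrays.asList("node-a", "node-b", "node-c");
        // node-c is unreachable but has not been removed, so it is in both collections.
        Set<String> unreachable = new HashSet<>(Arrays.asList("node-c"));

        System.out.println(reachable(members, unreachable));
    }
}
```

A node that shows up in both collections, like `node-c` here, is a member that has been marked unreachable but has not yet been removed.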

Note that Cluster.state is the node's own view of the cluster, so if there is a network partition, for example, the views on the different nodes will differ until the partition heals (or one side of the partition is downed). It is also driven by the cluster gossip, meaning that it is eventually consistent: there is no guarantee that, at a given point in time, all nodes will perceive exactly the same state.

This should not be affected by any settings. When a node is part of the cluster, it will get information about all the other members of the cluster. It can be affected by a “split brain”: if you use auto-downing, for example, you may end up with a cluster split into two clusters that both think that they are the cluster and that the other side was shut down.
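For reference, in Akka 2.4 the auto-downing mentioned above is controlled by `akka.cluster.auto-down-unreachable-after`. The fragment below is a sketch only; the `10s` value is an illustrative assumption, not a recommendation.

```
akka.cluster {
  # Default: auto-downing is disabled. An unreachable member is not
  # marked Down automatically; it has to be downed explicitly or
  # become reachable again.
  auto-down-unreachable-after = off

  # Illustrative value only: mark a member as Down after it has been
  # unreachable for 10 seconds. This carries the split-brain risk
  # described above.
  # auto-down-unreachable-after = 10s
}
```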

darshan · April 23, 2018, 3:08pm

Thanks for the details - I agree that Cluster.state is not guaranteed to be correct at any given time but will be eventually consistent and also about the split brain problem.

In the scenario I mentioned, the down node is showing up in the unreachable set and the member.status() is still showing as “Up”.

Is there any configuration/setting that can cause delay in removing a down node from the cluster?
Is there any configuration/setting that can cause a down node to still have the status as “Up” for a long time?

I see that in an older cluster, for a down node, member.status() would show “Down” and that node would not be present in the getMembers() set. So one of the developers must have changed some setting in the new cluster that caused this behavior to change, and that is what I am trying to figure out!

Thanks
P.S.: I am using akka 2.4.8
