Cluster node shuts down automatically with "Shutting down myself" message

akka-cluster

#1

Hi, we use Akka 2.4.10 and have a 14-member cluster.
Frequently, a cluster node shuts itself down with a "Shutting down myself" message, without any notable exception.
We are not using "auto-down-unreachable-after".
Could you please help us find the possible cause of this issue?

Log content:

 INFO 2018-03-27 19:16:32,269 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-24] akka.cluster.Cluster(akka://ESA_FMCluster-afgha-slet) - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.42:9810] is JOINING, roles [NON-MASTER]
 INFO 2018-03-28 02:00:25,054 [ESA_FMCluster-afgha-slet-FaultResolutionActor-35] com.ericsson.esa.cluster.actor.FaultResolutionActor - Unreachable member found UnreachableMember(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down))
 INFO 2018-03-28 02:00:25,055 [ESA_FMCluster-afgha-slet-FaultResolutionActor-35] com.ericsson.esa.cluster.actor.FaultResolutionActor - Current cluster size: 7, Unreachable members: Set(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down))
 WARN 2018-03-28 02:00:25,256 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-2] akka.cluster.ClusterCoreDaemon - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down)]. Node roles [NON-MASTER]
 INFO 2018-03-28 02:00:28,573 [ESA_FMCluster-afgha-slet-MemberInfoActor-36] com.ericsson.esa.cluster.ActorServiceImpl - Another member left the cluster ESA_FMCluster-afgha-slet@10.117.110.34:9810
 WARN 2018-03-28 02:00:28,574 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-25] akka.remote.Remoting - Association to [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810] having UID [-947022606] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
 INFO 2018-03-28 04:04:05,022 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-15] akka.cluster.Cluster(akka://ESA_FMCluster-afgha-slet) - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Shutting down myself

Regards,
Makesh


(Johan Andrén) #2

This happens when a node is downed by another node and that information reaches it via gossip. If you do not have auto-downing enabled and are not using the commercial Split Brain Resolver (SBR), there must be some other logic in your application doing this.


#3

Hi, we use Akka 2.4.10 and have a 14-member cluster.
We are facing an issue recovering a node after its cluster node has shut itself down.

We are not using auto-down-unreachable-after.

In our application, when a node receives an UnreachableMember cluster event, we down that unreachable member from the current cluster.
The gossip then reaches the unreachable member itself, resulting in its cluster node shutting itself down.
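For context, the logic described above is roughly the following kind of cluster-event listener. This is a hypothetical sketch against the Akka 2.4 Cluster API (the actor and its name are not from our actual code), showing why it behaves like auto-down-unreachable-after:

```scala
import akka.actor.Actor
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.UnreachableMember

// Sketch of a listener that downs any member reported unreachable.
// Note: this reproduces auto-down-unreachable-after, downing based on
// a single node's view of reachability, with the same split-brain risk.
class DownUnreachableListener extends Actor {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, classOf[UnreachableMember])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive: Receive = {
    case UnreachableMember(member) =>
      // Down the member as soon as this node alone sees it unreachable.
      cluster.down(member.address)
  }
}
```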

On Node 10.117.110.48:

Line 1713: INFO 2018-04-10 12:35:50,768 [ESA_FMCluster-afgha-slet-FaultResolutionActor-49] com.ericsson.esa.cluster.actor.FaultResolutionActor - Unreachable member found UnreachableMember(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.44:9810, status = Down))

On Node 10.117.110.44:

Line 8470: DEBUG 2018-04-10 12:35:51,360 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-15] akka.cluster.ClusterCoreDaemon - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.44:9810] - Receiving gossip from [UniqueAddress(akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.48:9810,1649935648)]

Node 10.117.110.44 is marked down, and when the gossip reaches that same node, it shuts itself down.

We now have the following queries:

  1. Is there any procedure available to automatically recover node 10.117.110.44 after it has 'shut itself down'?

  2. Why does a node (10.117.110.48) keep sending gossip to a node (10.117.110.44) that it has already declared 'unreachable'?

  3. Why does a node have to shut down its own cluster node upon receiving gossip that it has been declared 'down' by some other node?

  4. Is there an option available to skip the behavior described in question 3?

  5. When should a node be downed?
    a) as soon as even one cluster member sees it as unreachable (we follow this approach), or
    b) only after all cluster members have declared it 'unreachable'.

  6. Does auto-down-unreachable-after follow option 5.a or 5.b?

Please suggest a way forward for our problem.

Regards,
Makesh


(Johan Andrén) #4

You should try to upgrade to the latest stable release, the 2.5 branch. 2.4 has already reached end of life and is unlikely to see any further updates or bug fixes (see https://akka.io/blog/news/2018/01/11/akka-2.5.9-released-2.4.x-end-of-life for the announcement).

After the system has left the cluster, it needs to terminate and restart for the node to rejoin; this is usually done by stopping the JVM and starting a new one. In Akka 2.5, the default graceful shutdown is triggered when a node is downed and ends with the JVM terminating, allowing some external logic to restart it.
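A minimal configuration sketch for that in Akka 2.5 (setting names as I recall them from the 2.5 docs; do double-check against your version):

```hocon
# Run CoordinatedShutdown when this node is downed or removed
# from the cluster (on by default in Akka 2.5):
akka.cluster.run-coordinated-shutdown-when-down = on

# Exit the JVM at the end of CoordinatedShutdown, so an external
# supervisor (systemd, Kubernetes, etc.) can restart the process:
akka.coordinated-shutdown.exit-jvm = on
```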

Having logic that reacts to unreachability by downing nodes merely duplicates auto-down-unreachable-after, and it gives you the same problem of a potential split brain: during a network partition there are two separate parts of the cluster that each think the other side has shut down.

You must make sure that when a partition happens, both sides make the exact same decision about which nodes are downed; this is what the commercial Split Brain Resolver does for you. If you cannot use that, I'd recommend that you have operations do manual downing of nodes.
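Manual downing can be done via JMX, or programmatically from a tool an operator runs once they have confirmed which side of the partition should survive. A hypothetical sketch (system name and address taken from the logs above for illustration):

```scala
import akka.actor.Address
import akka.cluster.Cluster

// Operator-triggered downing: after a human confirms the node at
// 10.117.110.34 is really gone, down it from a surviving node.
val cluster = Cluster(system) // `system` is your running ActorSystem
val crashedNode =
  Address("akka.tcp", "ESA_FMCluster-afgha-slet", "10.117.110.34", 9810)
cluster.down(crashedNode)
```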

Also, note that whenever there isn't a machine crash, you should prefer gracefully leaving the cluster over any form of unreachability resolution. Graceful shutdown in 2.5 deals with this as well, to some extent: it is triggered by a JVM shutdown hook and tries to leave the cluster before letting the JVM exit.
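If you stay on 2.4, where CoordinatedShutdown does not exist, graceful leaving can be sketched like this (a simplified example, not production-ready shutdown handling):

```scala
import akka.cluster.Cluster

// Initiate a graceful leave of this node before stopping the JVM.
// The node moves through Leaving/Exiting and is removed cleanly,
// so no other member ever needs to mark it unreachable and down it.
val cluster = Cluster(system) // `system` is your running ActorSystem
cluster.leave(cluster.selfAddress)
```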