Cluster goes down automatically with "Shutting down myself" message

akka-cluster

#1

Hi, we use Akka version 2.4.10 and have a 14-member cluster.
Frequently, the cluster goes down automatically with a “Shutting down myself” message, without any notable exception.
We are not using “auto-down-unreachable-after”.
Could you please help us identify the possible cause of this issue?

Log content:

 INFO 2018-03-27 19:16:32,269 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-24] akka.cluster.Cluster(akka://ESA_FMCluster-afgha-slet) - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.42:9810] is JOINING, roles [NON-MASTER]
 INFO 2018-03-28 02:00:25,054 [ESA_FMCluster-afgha-slet-FaultResolutionActor-35] com.ericsson.esa.cluster.actor.FaultResolutionActor - Unreachable member found UnreachableMember(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down))
 INFO 2018-03-28 02:00:25,055 [ESA_FMCluster-afgha-slet-FaultResolutionActor-35] com.ericsson.esa.cluster.actor.FaultResolutionActor - Current cluster size: 7, Unreachable members: Set(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down))
 WARN 2018-03-28 02:00:25,256 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-2] akka.cluster.ClusterCoreDaemon - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810, status = Down)]. Node roles [NON-MASTER]
 INFO 2018-03-28 02:00:28,573 [ESA_FMCluster-afgha-slet-MemberInfoActor-36] com.ericsson.esa.cluster.ActorServiceImpl - Another member left the cluster ESA_FMCluster-afgha-slet@10.117.110.34:9810
 WARN 2018-03-28 02:00:28,574 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-25] akka.remote.Remoting - Association to [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.34:9810] having UID [-947022606] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
 INFO 2018-03-28 04:04:05,022 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-15] akka.cluster.Cluster(akka://ESA_FMCluster-afgha-slet) - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.40:9810] - Shutting down myself

Regards,
Makesh


(Johan Andrén) #2

This happens when a node is downed by another node and that information reaches it. If you do not have auto-downing enabled and are not using the commercial Split Brain Resolver (SBR), there must be some other logic in your application doing this.


#3

Hi, we use Akka version 2.4.10 and have a 14-member cluster.
We are facing an issue recovering a node after it has shut itself down.

We are not using auto-down-unreachable-after.

In our application, when a node receives an UnreachableMember cluster message, we down that unreachable member in the current cluster.
The gossip then reaches the downed member, which results in it shutting itself down.
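
The flow described above can be illustrated with a small stand-alone model. This is only a sketch of the behaviour as described in this thread, not Akka code: the `Node` class, its methods, and the simplified "Down wins" merge are hypothetical stand-ins for the real gossip protocol.

```python
from dataclasses import dataclass, field

UP, DOWN = "Up", "Down"


@dataclass
class Node:
    """Hypothetical model of one cluster node (not an Akka class)."""
    address: str
    # This node's view of the cluster: address -> status
    view: dict = field(default_factory=dict)
    running: bool = True

    def on_unreachable(self, member: str) -> None:
        # Application policy from the post: down a member as soon as
        # it is reported unreachable by this node.
        self.view[member] = DOWN

    def receive_gossip(self, sender_view: dict) -> None:
        # Simplified merge: a Down status always wins over the local one.
        for addr, status in sender_view.items():
            if status == DOWN or addr not in self.view:
                self.view[addr] = status
        # A node that learns it has been downed shuts itself down.
        if self.view.get(self.address) == DOWN:
            self.running = False


members = ["10.117.110.44:9810", "10.117.110.48:9810"]
node_44 = Node(members[0], {m: UP for m in members})
node_48 = Node(members[1], {m: UP for m in members})

node_48.on_unreachable(node_44.address)   # .48 downs .44
node_44.receive_gossip(node_48.view)      # gossip from .48 reaches .44
print(node_44.running)  # False
```

In this model a single unreachability report on one node is enough to take the other node out of the cluster, which matches the log excerpts in this post.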

On Node 10.117.110.48:

Line 1713: INFO 2018-04-10 12:35:50,768 [ESA_FMCluster-afgha-slet-FaultResolutionActor-49] com.ericsson.esa.cluster.actor.FaultResolutionActor - Unreachable member found UnreachableMember(Member(address = akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.44:9810, status = Down))

On Node 10.117.110.44:

Line 8470: DEBUG 2018-04-10 12:35:51,360 [ESA_FMCluster-afgha-slet-akka.actor.default-dispatcher-15] akka.cluster.ClusterCoreDaemon - Cluster Node [akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.44:9810] - Receiving gossip from [UniqueAddress(akka.tcp://ESA_FMCluster-afgha-slet@10.117.110.48:9810,1649935648)]

Node 10.117.110.44 is marked down, and when the gossip reaches it, node 10.117.110.44 shuts itself down.

We now have the following questions:

  1. Is there a procedure to automatically recover node 10.117.110.44 after it has shut itself down?

  2. Why is gossip sent to a node (10.117.110.44) that another node (10.117.110.48) has already declared ‘unreachable’?

  3. Why does a node have to shut itself down upon receiving gossip that some other, arbitrary node has declared it ‘down’?

  4. Is there an option to disable the behavior described in question 3?

  5. When should a node be downed?
    a) as soon as a single cluster member sees it as unreachable (the approach we follow), or
    b) only once all cluster members have declared it ‘unreachable’?

  6. Does auto-down-unreachable-after follow option 5.a or 5.b?
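
To make the difference between options 5.a and 5.b concrete, here is a minimal sketch of the two downing policies. It is a plain model, not Akka API: the function names and the `observations` mapping (observer address to the set of addresses that observer currently sees as unreachable) are assumptions made for illustration only.

```python
def should_down_any(observations: dict, target: str) -> bool:
    """Option 5.a: down `target` if at least one other member reports it unreachable."""
    return any(
        target in unreachable
        for observer, unreachable in observations.items()
        if observer != target
    )


def should_down_all(observations: dict, target: str) -> bool:
    """Option 5.b: down `target` only if every other member reports it unreachable."""
    observers = [o for o in observations if o != target]
    return bool(observers) and all(target in observations[o] for o in observers)


# Hypothetical three-member cluster: only A has lost contact with C.
observations = {"A": {"C"}, "B": set(), "C": set()}
print(should_down_any(observations, "C"))  # True
print(should_down_all(observations, "C"))  # False

# Once B also reports C as unreachable, both policies agree.
observations["B"] = {"C"}
print(should_down_all(observations, "C"))  # True
```

Under 5.a, a transient network problem between just two members is enough to down a node that the rest of the cluster can still reach; 5.b avoids that at the cost of waiting for every observer to agree.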

Please suggest a way forward for our problem.

Regards,
Makesh