Quarantined node does not rejoin the cluster even after multiple restarts


(Muthukumaran Kothandaraman) #1

I’m facing an issue where a quarantined node does not rejoin the cluster even after multiple restarts.

Auto-down is enabled in akka.conf and is set to 300s.
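
For reference, a minimal sketch of how that setting would look in akka.conf, assuming the standard `akka.cluster.auto-down-unreachable-after` key (the real file contains other settings that are not shown here):

```
akka {
  cluster {
    # Assumed to correspond to the 300s auto-down timeout described above;
    # unreachable members are automatically marked DOWN after this period.
    auto-down-unreachable-after = 300s
  }
}
```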

Please refer to the sequence below, which shows what happens after I restart the quarantined node.

The Akka version used here is 2.4.7.

The key observations are as follows:

  1. A healthy 3-node cluster is formed with Node-1, Node-2 and Node-3 as members
  2. Node 2 is shut down - this is the test scenario
  3. Node 1 detects this and correctly moves the member to the UNREACHABLE state; after the auto-down period of 5 minutes, the member moves to the DOWN state
  4. Node 2 is fully restarted (and hence so is its ActorSystem)
  5. Node 2 joins the cluster and Node 1 identifies the node
  6. Node 2 is now a full member of the cluster and is visible to all other nodes
  7. Within seconds, Node 2 gets quarantined again
  8. When the quarantined Node 2 is restarted, it joins the cluster again, and (7) and (8) keep repeating regardless of how many times Node 2 is rebooted

A few points checked so far:

  • Are there any heavy GC or non-GC pauses on any node of the cluster? Ans: No
  • Are there any network issues between the cluster nodes? Ans: No firewalls, no dynamic port-blocking rules, and ping latency between nodes is normal
  • Is remoting used explicitly? Ans: No, only Akka Cluster
  • Is auto-down on unreachable enabled? Ans: Yes, with a 300s timeout
  • Is persistence used? Ans: Yes
  • Is persistence querying used? Ans: No

Questions:

  • Is this a reported issue in Akka 2.4.7?
  • Would disabling auto-down help? I see a contrary case in https://github.com/akka/akka/issues/20296, which is reported to have persisted up to 2.4.12, and hence chose not to disable auto-down

Please let me know if any additional details are required.

Regards
Muthu