Downed member is not removed from cluster and becomes reachable again


#1

Hi,
I’m using akka-cluster 2.5.2 and running 6 cluster nodes: 1 [proxy] node and 5 [computeMaster] nodes. The proxy node has an actor that watches the compute actors residing on the computeMaster nodes. For convenience, I’ll refer to each node by its port number (each node has a different port). The proxy node (the leader) is 41919; the other nodes are 18143, 31813, 34674, 19878, and 28216.
I’ll describe what I’ve seen in the Akka logs, in chronological order.

[07:36:19]         18143 starting full-gc
[07:36:58.472]  41919 marking 18143 as UNREACHABLE
[07:37:15.063]  18143 finished gc after 55 seconds, and immediately marking its three neighbours (monitored-by-nr-of-members = 3) as UNREACHABLE (19878, 28216, 31813)
[07:37:16.626]  18143 marking 31813 as REACHABLE
[07:37:18.491]  41919 marking 18143 as DOWN (no REACHABLE event for 18143 within 20s of the UNREACHABLE event, so the proxy downs the node; this is our fault-tolerance logic)
[07:37:18.827]  41919 receives UnreachableMember event for 19878
[07:37:18.828]  41919 receives UnreachableMember event for 28216
[07:37:20.281]  41919 ignored gossip from unreachable[18143]
[07:37:21.472]  41919 marking 18143 as REACHABLE(status = down)
[07:37:21.578]  18143 Cluster: shutting down myself
[07:37:21.827]  18143 RemotingTerminator: Remoting shut down
[07:37:38.841]  41919 marking 19878 as DOWN
[07:37:38.844]  41919 marking 28216 as DOWN
[07:37:41.474]  41919 receives Terminated from actor on 19878
[07:37:41.474]  41919 receives Terminated from actor on 28216
[07:38:01.471]  41919 marking 18143 as UNREACHABLE
[07:38:03.472]  41919 Leader is removing unreachable node[18143]
[07:38:03.473]  41919 receives Terminated from actor on 18143
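
For readers following the timeline, the 20-second downing rule mentioned at [07:37:18.491] could be modelled roughly as in the sketch below. This is only an illustration under stated assumptions, not the actual code running on the proxy: `UnreachableDowningTracker` and its methods are hypothetical names, nodes are identified by their port number as in this post, and timestamps are plain milliseconds. In a real application the two event methods would presumably be driven by `ClusterEvent.UnreachableMember` and `ClusterEvent.ReachableMember` subscriptions, and each returned node would be passed to `Cluster.get(system).down(address)`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Akka): tracks how long each node has
// been unreachable and reports the ones that exceeded the timeout.
final class UnreachableDowningTracker {
    private final long timeoutMillis;
    // node id -> time (ms) at which the node was first reported unreachable
    private final Map<String, Long> unreachableSince = new LinkedHashMap<>();

    UnreachableDowningTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Call on an UnreachableMember-style event; keeps the earliest timestamp.
    void onUnreachable(String node, long nowMillis) {
        unreachableSince.putIfAbsent(node, nowMillis);
    }

    // Call on a ReachableMember-style event; cancels the pending down.
    void onReachable(String node) {
        unreachableSince.remove(node);
    }

    // Returns the nodes whose timeout has expired and stops tracking them.
    List<String> nodesToDown(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : unreachableSince.entrySet()) {
            if (nowMillis - entry.getValue() >= timeoutMillis) {
                expired.add(entry.getKey());
            }
        }
        unreachableSince.keySet().removeAll(expired);
        return expired;
    }
}
```

With a 20000 ms timeout this matches the gaps in the timeline: 18143 was marked UNREACHABLE at 07:36:58.472 and DOWN at 07:37:18.491, while 19878 and 28216 were reported unreachable at about 07:37:18.8 and marked DOWN at about 07:37:38.8. Note that a rule like this acts only on the reachability events it receives; it cannot tell whether the report about 19878 and 28216 came from a healthy observer.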

I also see many messages like the following, up until [07:38:03.472]:

a.remote.ReliableDeliverySupervisor - Association with remote system
[akka.tcp://xqlCluster@192.168.18.27:18143] has failed, address is now gated for [5000] ms. Reason: 
[Association failed with [akka.tcp://xqlCluster@192.168.18.27:18143]] 
Caused by: [Connection refused: /192.168.18.27:18143]

I’m a little bit confused here:

  1. After 41919 marked 18143 as DOWN for the first time (it should go from unreachable to down), why does 41919 NOT ignore the UNREACHABLE gossip from 18143? The two nodes (19878, 28216) are actually healthy.
  2. After 18143 was marked as DOWN for the first time, why can it become REACHABLE again?
  3. After 18143 terminated itself, why does 41919 NOT receive Terminated from the actor on 18143 the first time, and only receive it the second time?

So could anyone help me figure out what really happened in my cluster? Thanks a lot.


(Patrik Nordwall) #2

First, update to the latest patch version, 2.5.14, and then report back if you see the same thing. (I think you will, but just to make sure that we are not hunting ghosts.)