First seed node not rejoining cluster

Hello,

We’re using Akka version 2.5.11.

In a 3-node cluster setup, upon killing and restarting the first seed node (as listed in akka.conf), we sometimes see that the node does not, or is not allowed to, rejoin the cluster. After the seed-node timeout, which we have set to 12s, it joins itself and forms a cluster island. This is automated in a robot test environment.
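
For reference, the relevant cluster settings in our akka.conf look roughly like this (a sketch; the addresses are the ones from the example below):

akka {
  cluster {
    seed-nodes = [
      "akka.tcp://opendaylight-cluster-data@10.30.170.94:2550",
      "akka.tcp://opendaylight-cluster-data@10.30.170.26:2550",
      "akka.tcp://opendaylight-cluster-data@10.30.170.84:2550"]
    # after this timeout the first seed node stops waiting for the
    # others and joins itself, forming the island described above
    seed-node-timeout = 12s
  }
}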

As an example, we have node1 (the first seed node, address 10.30.170.94), node2 (10.30.170.26), and node3 (10.30.170.84). Initially all the nodes are started and connected, and the cluster is working correctly. Then a kill -9 is issued to node1. Both node2 and node3 lose the connection and mark node1 as unreachable, as expected. On node2, we see:

2018-07-16T11:17:14,609 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-20 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]

2018-07-16T11:17:18,229 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Leader can currently not perform its duties, reachability status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1), akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 -> akka.tcp://opendaylight-cluster-data@10.30.170.94:2550: Unreachable [Unreachable] (1)], member status: [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.84:2550 Up seen=true, akka.tcp://opendaylight-cluster-data@10.30.170.94:2550 Up seen=false]

On node3:

2018-07-16T11:18:14,859 | WARN  | opendaylight-cluster-data-akka.actor.default-dispatcher-3 | ReliableDeliverySupervisor       | 41 - com.typesafe.akka.slf4j - 2.5.11 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]] Caused by: [Connection refused: /10.30.170.94:2550]

After restarting node1, we see that it sends InitJoin messages to node2 and node3, and it actually reports an InitJoinAck from node2, but it eventually joins itself after 22s:

2018-07-16T11:18:20,251 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-21 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Received InitJoinAck message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.26:2550/system/cluster/core/daemon#-1596937295]] to [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]

2018-07-16T11:18:42,585 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] is JOINING, roles [member-1, dc-default]

2018-07-16T11:18:42,614 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] to [Up]

On node2, we see the InitJoin messages coming in and InitJoinAcks going out. We also see that it noticed node1 was restarted with a new incarnation, marked the old incarnation as Down, and stated that it was removing the previous unreachable incarnation:

2018-07-16T11:18:20,105 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:20,106 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-1#-2094984711]]

2018-07-16T11:18:20,291 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - New incarnation of existing member [Member(address = akka.tcp://opendaylight-cluster-data@10.30.170.94:2550, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

2018-07-16T11:18:20,292 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-18 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Marking unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550] as [Down]

2018-07-16T11:18:21,226 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-22 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Leader is removing unreachable node [akka.tcp://opendaylight-cluster-data@10.30.170.94:2550]

2018-07-16T11:18:30,564 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Received InitJoin message from [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]] to [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550]

2018-07-16T11:18:30,565 | INFO  | opendaylight-cluster-data-akka.actor.default-dispatcher-2 | Cluster(akka://opendaylight-cluster-data) | 41 - com.typesafe.akka.slf4j - 2.5.11 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] - Sending InitJoinAck message from node [akka.tcp://opendaylight-cluster-data@10.30.170.26:2550] to [Actor[akka.tcp://opendaylight-cluster-data@10.30.170.94:2550/system/cluster/core/daemon/firstSeedNodeProcess-2#2006731653]]

followed by several more InitJoin/InitJoinAck sequences…

node3 shows similar InitJoin/InitJoinAck sequences.

I can’t see any reason in this output why node1 was not allowed to rejoin. Clearly there was network communication between node1 and node2/node3 after the restart, and node2 and node3 remained connected to each other (i.e. there were no disconnect/disassociation log messages reported). So the old incarnation of node1 should be removed (as was reported in the logs) and the new incarnation should be allowed to join.

Note that this happens sporadically; most of the time it works as expected.

Also note that this only seems to happen when the node is killed; we haven’t seen it when the node is gracefully shut down.

Thanks,
Tom

This looks strange. After the InitJoinAck, the Join and Welcome messages should be sent. You can try enabling more debug logging:

akka {
  loglevel = DEBUG
  remote {
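    # log all inbound and outbound remote messages at DEBUG level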
    log-received-messages = on
    log-sent-messages = on
  }
}

The latest version is 2.5.14. It wouldn’t hurt to try that, but I don’t think we have had any bugs in this area.

Do you override the Netty dependency? We have seen issues when using an older Netty version than the one that ships with Akka by default.

Otherwise I think you should open an issue for this, and include full logs.

Thanks, Patrik. We’ll enable debug logging. We’ve seen this issue for quite some time, across several Akka versions; we upgraded to 2.5.11 a few months ago. We’re near the end of a release and it’s not a trivial process for us to bump third-party library versions, so we wouldn’t be able to upgrade to 2.5.14 for a few months.

I’ll have to check on Netty - we do use it in other places, independent of Akka - but if we are modifying the Netty version used by Akka, it would be the newest or a relatively new version, as Netty is critical. Besides, this seems like an Akka clustering issue rather than a remoting-layer one, because the logs indicate that messages are getting through bi-directionally. Also, this issue is pretty sporadic.

I have already opened https://github.com/akka/akka/issues/25361. I was told to also post a topic here to get a wider audience…

Please check that the Netty version is 3.10.6.

The extra debug logging might reveal what happens with the missing messages.

Long term, I would recommend switching to the new remoting implementation, Artery with TCP, but I can understand that you want this solved with classic remoting for now. It could anyway be interesting to see if you can reproduce the problem with Artery, as a way to narrow down where the issue might be.
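
Enabling it for a test run could look something like this (a sketch of the Artery settings in 2.5; note that with Artery the address scheme is akka:// rather than akka.tcp://, so the seed-nodes list needs to be updated as well):

akka {
  remote {
    artery {
      enabled = on
      # TCP transport instead of the default Aeron/UDP
      transport = tcp
      # each node's own externally reachable address
      canonical.hostname = "10.30.170.94"
      canonical.port = 2550
    }
  }
}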

The org.jboss.netty version is 3.10.6.Final, so we aren’t modifying that.

We’re planning to move to Artery as well, most likely in our next release. But we can enable it for testing now; we’ll try that as well.

I should clarify the issue further. We have robot tests that automate HA testing. A test stops and restarts a node multiple times in quick succession, within about 90 seconds. It’s after multiple iterations that we sometimes see the node fail to rejoin before a deadline. Of course, for the first seed node the deadline is governed by the seed-node-timeout (12s); for that reason we mostly see the issue when the first seed node is chosen for the stop/restart. The other nodes can’t join themselves and have a much longer deadline - the test fails after 5 minutes if the node hasn’t rejoined - and we’ve seen that once or twice.
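
As far as I can tell from the reference config, nodes other than the first seed node never join themselves; they just keep retrying at a fixed interval (a sketch, assuming we haven’t overridden the default):

akka {
  cluster {
    # a failed join attempt is retried after this period; only the
    # first seed node has the seed-node-timeout fallback of joining itself
    retry-unsuccessful-join-after = 10s
  }
}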

Also note that we’ve only seen this issue when the node is killed, not when it is gracefully shut down. In the former case the TCP socket isn’t closed cleanly, so maybe that’s a factor. I’m not sure whether Akka does anything special on shutdown, e.g. notifying the rest of the cluster that the node is shutting down.
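
Checking the reference config, it does look like a graceful stop notifies the cluster: Akka 2.5 installs a JVM shutdown hook by default that runs CoordinatedShutdown, whose cluster-leave phase makes the node leave gracefully before the JVM exits, whereas kill -9 bypasses all of that. The relevant default, if I’m reading it correctly:

akka {
  coordinated-shutdown {
    # on by default: a normal JVM exit runs CoordinatedShutdown,
    # including the phase that leaves the cluster cleanly;
    # kill -9 never reaches this path
    run-by-jvm-shutdown-hook = on
  }
}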