First seed node couldn't join other seed nodes (artery)

Hi

I use akka cluster version 2.6.3 with artery tcp remoting.

In a 2-node cluster setup with raspberry pi’s, the first seed node (=node1) “Couldn’t join other seed nodes” after device reboot although before reboot they successfully formed a cluster.

On node1: after reboot a new cluster is formed and then a connection is established to node2:

[2020-06-22 09:46:38,537] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - Successfully bound to /192.168.178.219:2551
[2020-06-22 09:46:39,488] [INFO ] [a.r.artery.tcp.ArteryTcpTransport] [ArteryTcpTransport(akka://SELServer)] - Remoting started with transport [Artery tcp]; listening on address [akka://SELServer@192.168.178.219:2551] with UID [1980465099243489092]
[2020-06-22 09:46:39,697] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Starting up, Akka version [2.6.3] ...
[2020-06-22 09:46:41,800] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Registered cluster JMX MBean [akka:type=Cluster]
[2020-06-22 09:46:41,802] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Started up successfully
[2020-06-22 09:46:42,164] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - No downing-provider-class configured, manual cluster downing required, see https://doc.akka.io/docs/akka/current/typed
/cluster.html#downing
[2020-06-22 09:46:47,453] [DEBUG] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Couldn't join other seed nodes, will join myself. seed-nodes=[akka://SELServer@192.168.178.219:2551, akka://SELServer@
192.168.178.220:2551]
[2020-06-22 09:46:47,707] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Node [akka://SELServer@192.168.178.219:2551] is JOINING itself (with roles [dc-default]) and forming new cluster
[2020-06-22 09:46:47,751] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - is the new leader among reachable nodes (more leaders may exist)
[2020-06-22 09:46:47,882] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Leader is moving node [akka://SELServer@192.168.178.219:2551] to [Up]
[2020-06-22 09:46:50,523] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/2] - Attempting connection to [/192.168.178.220:2551]
[2020-06-22 09:46:50,543] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/1] - Attempting connection to [/192.168.178.220:2551]
[2020-06-22 09:46:50,545] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/2] - Connection established to [192.168.178.220:2551]
[2020-06-22 09:46:50,555] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/1] - Connection established to [192.168.178.220:2551]
[2020-06-22 09:46:50,816] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-06-22 09:46:52,349] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted

On node2: becomes new leader and then InitJoin and InitJoinAck is received and sent:

[2020-06-22 09:45:53,294] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - is the new leader among reachable nodes (more leaders may exist)
[2020-06-22 09:45:53,306] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Leader is removing confirmed Exiting node [akka://SELServer@192.168.178.219:2551]
[2020-06-22 09:46:01,375] [WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.219:2551], message stream] Upstream failed, cause: StreamTcpExcepti
on: The connection has been aborted
[2020-06-22 09:46:01,376] [WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.219:2551], control stream] Upstream failed, cause: StreamTcpExcepti
on: The connection has been aborted
[2020-06-22 09:46:50,308] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-06-22 09:46:50,317] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-06-22 09:46:50,585] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/22] - Attempting connection to [/192.168.178.219:2551]
[2020-06-22 09:46:50,587] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/22] - Connection established to [192.168.178.219:2551]
[2020-06-22 09:46:51,638] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] to [akka://SELServer@192.168.178.220:2551]
[2020-06-22 09:46:51,640] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] (version [2.6.3])
[2020-06-22 09:46:52,069] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] to [akka://SELServer@192.168.178.220:2551]
[2020-06-22 09:46:52,072] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] (version [2.6.3])
[2020-06-22 09:46:52,119] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/23] - Attempting connection to [/192.168.178.219:2551]
[2020-06-22 09:46:52,121] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/23] - Connection established to [192.168.178.219:2551]
[2020-06-22 09:46:52,240] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] to [akka://SELServer@192.168.178.220:2551]
[2020-06-22 09:46:52,241] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] (version [2.6.3])
[2020-06-22 09:46:52,349] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] to [akka://SELServer@192.168.178.220:2551]
[2020-06-22 09:46:52,349] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] (version [2.6.3])
[2020-06-22 09:46:52,467] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] to [akka://SELServer@192.168.178.220:2551]
[2020-06-22 09:46:52,468] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#-338750855]] (version [2.6.3])

Note that when restarting the application on node1 instead of rebooting the device, the same scenario occurs but not always. Sometimes it just works and thus node1 joins node2.

Also note that node1 and node2 use the same static seed-nodes configuration:

seed-nodes = [
            "akka://SELServer@192.168.178.219:2551"
            "akka://SELServer@192.168.178.220:2551"
        ]

Things I don’t get

  1. Why do I have duplicated logs like “New connection accepted” on node1 or “Connection established to [192.168.178.219:2551]” on node2?
  2. Why could node1 not join node2 but instead formed a new cluster?

Thanks,
Simon

Do you have the logs on node 2 for between 09:46:42 and 09:46:47, that’s when node1 will have tried to communicate with node2, at 09:46:47 node1 timed out and created its own cluster.

Why do I have duplicated logs like “New connection accepted” on node1 or “Connection established to [192.168.178.219:2551]” on node2?

Artery uses multiple TCP connections

1 Like

Hi @chbatey

Thanks for your answers!
I don’t have the log but I will reproduce the scenario in the next 4-5 weeks and then provide full logs.
I’m sorry for the delay but currently we have so much work to do so I cannot reproduce the case right now.

Looking forward to get into this soon.

So I reproduced the scenario:

On node1 : after reboot a new cluster is formed and then a connection is established to node2:

[2020-07-27 09:04:28,687] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - Successfully bound to /192.168.178.219:2551
[2020-07-27 09:04:29,576] [INFO ] [a.r.artery.tcp.ArteryTcpTransport] [ArteryTcpTransport(akka://SELServer)] - Remoting started with transport [Artery tcp]; listening on address [akka://SELServer@192.168.178.219:2551] with UID [-684174873603442119]
[2020-07-27 09:04:29,805] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Starting up, Akka version [2.6.6] ...
[2020-07-27 09:04:31,597] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Registered cluster JMX MBean [akka:type=Cluster]
[2020-07-27 09:04:31,599] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Started up successfully
[2020-07-27 09:04:31,898] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - No downing-provider-class configured, manual cluster downing required, see https://doc.akka.io/docs/akka/current/typed/cluster.html#downing
[2020-07-27 09:04:37,035] [DEBUG] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Couldn't join other seed nodes, will join myself. seed-nodes=[akka://SELServer@192.168.178.219:2551, akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:37,222] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Node [akka://SELServer@192.168.178.219:2551] is JOINING itself (with roles [dc-default]) and forming new cluster
[2020-07-27 09:04:37,260] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - is the new leader among reachable nodes (more leaders may exist)
[2020-07-27 09:04:37,433] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.219:2551] - Leader is moving node [akka://SELServer@192.168.178.219:2551] to [Up]
[2020-07-27 09:04:39,372] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/2] - Resolving 192.168.178.220 before connecting
[2020-07-27 09:04:39,387] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/1] - Resolving 192.168.178.220 before connecting
[2020-07-27 09:04:39,668] [DEBUG] [akka.io.SimpleDnsManager] [akka://SELServer@192.168.178.219:2551/system/IO-DNS] - Resolution request for 192.168.178.220 from Actor[akka://SELServer/system/IO-TCP/selectors/$a/1#117820292]
[2020-07-27 09:04:39,725] [DEBUG] [akka.io.SimpleDnsManager] [akka://SELServer@192.168.178.219:2551/system/IO-DNS] - Resolution request for 192.168.178.220 from Actor[akka://SELServer/system/IO-TCP/selectors/$a/2#534810224]
[2020-07-27 09:04:39,742] [DEBUG] [akka.io.InetAddressDnsResolver] [akka://SELServer@192.168.178.219:2551/system/IO-DNS/inet-address/$a] - Request for [192.168.178.220] was not yet cached
[2020-07-27 09:04:39,854] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/1] - Attempting connection to [/192.168.178.220:2551]
[2020-07-27 09:04:39,873] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/2] - Attempting connection to [/192.168.178.220:2551]
[2020-07-27 09:04:39,879] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/1] - Connection established to [192.168.178.220:2551]
[2020-07-27 09:04:39,881] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.219:2551/system/IO-TCP/selectors/$a/2] - Connection established to [192.168.178.220:2551]
[2020-07-27 09:04:40,250] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-07-27 09:04:40,987] [DEBUG] [a.s.Serialization(akka://SELServer)] [akka.serialization.Serialization(akka://SELServer)] - Using serializer [akka.remote.serialization.ArteryMessageSerializer] for message [akka.remote.artery.OutboundHandshake$HandshakeRsp]
[2020-07-27 09:04:41,058] [DEBUG] [a.s.Serialization(akka://SELServer)] [akka.serialization.Serialization(akka://SELServer)] - Using serializer [akka.cluster.protobuf.ClusterMessageSerializer] for message [akka.cluster.InternalClusterAction$InitJoin]
[2020-07-27 09:04:41,903] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-07-27 09:04:42,118] [DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] - Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#175269773] doesn't match an active actor. It has probably been stopped, using deadLetters.
[2020-07-27 09:04:42,261] [DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] - Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#175269773] doesn't match an active actor. It has probably been stopped, using deadLetters.
[2020-07-27 09:04:42,618] [DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] - Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#175269773] doesn't match an active actor. It has probably been stopped, using deadLetters.
[2020-07-27 09:04:43,000] [DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] - Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#175269773] doesn't match an active actor. It has probably been stopped, using deadLetters.
[2020-07-27 09:04:43,273] [DEBUG] [a.a.L.Deserialization] [akka.actor.LocalActorRefProvider.Deserialization] - Resolve (deserialization) of path [system/cluster/core/daemon/firstSeedNodeProcess-1#175269773] doesn't match an active actor. It has probably been stopped, using deadLetters.

On node2 : becomes new leader and then InitJoin and InitJoinAck is received and sent:

[2020-07-27 09:03:40,617] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - is the new leader among reachable nodes (more leaders may exist)
[2020-07-27 09:03:40,688] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Leader is removing confirmed Exiting node [akka://SELServer@192.168.178.219:2551]
[2020-07-27 09:03:45,718] [INFO ] [akka.remote.artery.Association] [Association(akka://SELServer)] - Association to [akka://SELServer@192.168.178.219:2551] having UID [-4266372872133291128] has been stopped. All messages to this UID will be delivered to dead letters. Reason: Cluster member removed, previous status [Exiting]
[2020-07-27 09:03:48,847] [DEBUG] [akka.io.TcpIncomingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/2] - Closing connection due to IO error java.io.IOException: Connection reset by peer
[2020-07-27 09:03:48,849] [DEBUG] [akka.io.TcpIncomingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/1] - Closing connection due to IO error java.io.IOException: Connection reset by peer
[2020-07-27 09:03:48,851] [DEBUG] [akka.actor.RepointableActorRef] [akka://SELServer/system/Materializers/StreamSupervisor-1/remote-8-0-ignoreSink] - Aborting tcp connection to 192.168.178.219:2551 because of upstream failure: akka.remote.artery.Association$OutboundStreamStopQuarantinedSignal$: 
[2020-07-27 09:03:48,862] [DEBUG] [akka.actor.RepointableActorRef] [akka://SELServer/system/Materializers/StreamSupervisor-1/remote-7-0-ignoreSink] - Aborting tcp connection to 192.168.178.219:2551 because of upstream failure: akka.remote.artery.Association$OutboundStreamStopQuarantinedSignal$: 
[2020-07-27 09:03:48,911] [WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.219:2551], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2020-07-27 09:03:48,912] [WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.219:2551], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2020-07-27 09:04:35,963] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-07-27 09:04:35,979] [DEBUG] [akka.io.TcpListener] [akka://SELServer/system/IO-TCP/selectors/$a/0] - New connection accepted
[2020-07-27 09:04:36,333] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/7] - Resolving 192.168.178.219 before connecting
[2020-07-27 09:04:36,334] [DEBUG] [akka.io.SimpleDnsManager] [akka://SELServer@192.168.178.220:2551/system/IO-DNS] - Resolution request for 192.168.178.219 from Actor[akka://SELServer/system/IO-TCP/selectors/$a/7#-1976066586]
[2020-07-27 09:04:36,336] [DEBUG] [akka.io.InetAddressDnsResolver] [akka://SELServer@192.168.178.220:2551/system/IO-DNS/inet-address/$c] - Request for [192.168.178.219] was not yet cached
[2020-07-27 09:04:36,339] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/7] - Attempting connection to [/192.168.178.219:2551]
[2020-07-27 09:04:36,341] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/7] - Connection established to [192.168.178.219:2551]
[2020-07-27 09:04:37,319] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] to [akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:37,322] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] (version [2.6.6])
[2020-07-27 09:04:37,939] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] to [akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:37,940] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] (version [2.6.6])
[2020-07-27 09:04:37,986] [DEBUG] [a.s.Serialization(akka://SELServer)] [akka.serialization.Serialization(akka://SELServer)] - Using serializer [akka.cluster.protobuf.ClusterMessageSerializer] for message [akka.cluster.InternalClusterAction$InitJoinAck]
[2020-07-27 09:04:37,994] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/8] - Resolving 192.168.178.219 before connecting
[2020-07-27 09:04:37,995] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/8] - Attempting connection to [/192.168.178.219:2551]
[2020-07-27 09:04:37,999] [DEBUG] [akka.io.TcpOutgoingConnection] [akka://SELServer@192.168.178.220:2551/system/IO-TCP/selectors/$a/8] - Connection established to [192.168.178.219:2551]
[2020-07-27 09:04:38,337] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] to [akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:38,338] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] (version [2.6.6])
[2020-07-27 09:04:38,696] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] to [akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:38,697] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] (version [2.6.6])
[2020-07-27 09:04:39,081] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Received InitJoin message from [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] to [akka://SELServer@192.168.178.220:2551]
[2020-07-27 09:04:39,082] [INFO ] [akka.cluster.Cluster] [Cluster(akka://SELServer)] - Cluster Node [akka://SELServer@192.168.178.220:2551] - Sending InitJoinAck message from node [akka://SELServer@192.168.178.220:2551] to [Actor[akka://SELServer@192.168.178.219:2551/system/cluster/core/daemon/firstSeedNodeProcess-1#175269773]] (version [2.6.6])
[2020-07-27 09:04:57,177] [DEBUG] [a.s.Serialization(akka://SELServer)] [akka.serialization.Serialization(akka://SELServer)] - Using serializer [akka.remote.serialization.ArteryMessageSerializer] for message [akka.remote.artery.compress.CompressionProtocol$ClassManifestCompressionAdvertisement]
[2020-07-27 09:04:57,207] [DEBUG] [a.s.Serialization(akka://SELServer)] [akka.serialization.Serialization(akka://SELServer)] - Using serializer [akka.remote.serialization.ArteryMessageSerializer] for message [akka.remote.artery.compress.CompressionProtocol$ActorRefCompressionAdvertisement]

Questions and hint

@chbatey you said

  1. Is this timeout configurable and how much is it by default?
  2. Does it take influence if the nodes do not have the exact same system time? In this case the nodes have drift of approx 2 seconds.

I reproduced the error again with akka version 2.6.9 (note I use classic clustering)

Why is the first seed node not able to join the other seed node (which is running at that time) ?
There is no network issue as it always works as expected if I restart both nodes at the same time or if I just restart node2. The issue is only with node1.

If I leave the nodes in the state where the first seed node joined itself and then established a connection to the other node but they’re not part of the same cluster, then after a while there are logs like:

On node1

[INFO ] [akka.remote.artery.Association] [Association(akka://SELServer)] - Stopping idle outbound control stream to [akka://SELServer@192.168.178.220:2551]
[WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.220:2551], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted

On node2

[INFO ] [akka.remote.artery.Association] [Association(akka://SELServer)] - Stopping idle outbound control stream to [akka://SELServer@192.168.178.219:2551]
[WARN ] [akka.stream.Materializer] [akka.stream.Log(akka://SELServer/system/Materializers/StreamSupervisor-1)] - [outbound connection to [akka://SELServer@192.168.178.219:2551], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted

This may be wrong but here goes…

The first seed node is a special seed node that must be joined first when starting cluster. This is used to avoid split clusters. When you start cluster nodes the first cluster must have this node running.

In essence, the first seed node joins itself and assigns this cluster a “master token” of sorts. Any other node (seed or not) will join this cluster with success even if the first seed node goes down. The only requirement is that at least one of the seed nodes are still up.

If you stop the first seed node and the cluster is running, then the first seed node will rejoin the cluster if and only if: a) we still have seed nodes that are up and b) the cluster was originally started by the seed node (holds the “master token”)

If you down and remove all seed nodes from a cluster, the you have effectively created a split cluster. More concretely, if you start up any node (first seed, other seed node or standard node), these will not have an entry point and will start a new cluster. In other words, we need a first seed node to kick off a cluster.

Note that the split brain resolver will try and determine if a cluster is split and if so if it should be downed. Check if you have configured this and what type of strategy you are employing.

HTHs

1 Like

Thanks @hmf for you inputs!

Nice, I didn’t know SBR has been open sourced - love it!

Enabling SBR solves the issue :grinning:

But still I don’t understand why it’s needed.

The scenario is as follows:

  1. There are two nodes in the cluster, both are seed nodes
  2. I restart node1 which is the first (special) seed node
  3. a and b from your post are met since node2 is still up and the cluster was originally started by the seed node

Also, node1 does not restart due to power loss but instead it runs coordinated shutdown and then is started again.

Why do we need downing in that case, how do we get the split cluster?

@SF Say you have the cluster with seed nodes 1 and 2 and standard nodes 3 and 4. If you down nodes 1 and 2, either node 3 or 4 will become the leader. At this point you have a working cluster with no seeds.

If you now start any node (seed or not), how can they find the entry point? They cannot because no seeds addresses can be connected to, so effectively you have a split cluster.

Any starts of seed nodes will start a new cluster.

@hmf I agree to your example, it is a different one though.

In my case there are only two nodes, both are seed nodes from which just one (the special one, node1) is restarted.
Why should the special seed node then start a new cluster? From the logs it can be seen that node1 “Couldn’t join other seed nodes” for an unknown reason. That’s the reason it starts its own new cluster but I don’t get why it cannot join the cluster through node2.

If there’s anything on that, I appreciate to know it.
Thankfully the problem is solved by applying SBR - thanks again @hmf for pointing to that!

I think node 1 should rejoin. I will have to test to confirm this.

From my point of view, node2 becomes the leader in the cluster as soon as node1 is unreachable.
Because node1 is unreachable, the leader cannot perform its duties (like accepting new nodes).
Thus the only possibility to get to a valid state is shutting down.
But I don’t get why SBR comes to that conclusion - it is not a split brain in my understanding thus it should not be the SBR who does the job.
In what sense is this a split brain problem?