New incarnation of existing member is trying to join. Existing will be removed from the cluster and then new member will be allowed to join

Hi,

I was running a cluster of four nodes, one client node (port 4504) and three worker nodes (port 4501 (seed), 4502, 4503) in my testing.

I took the following steps:

  1. A cluster.down(…) was issued to the client node.
  2. The client node registers the callback below:
    Cluster(system).registerOnMemberRemoved {
      cluster.leave(cluster.selfAddress)
      // run coordinated shutdown
    }
    The actorSystem reference of the client node is then reset to null.
  3. After that, when the client node receives new requests, it tries to reinitialize the actorSystem.
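
The steps above can be sketched roughly as follows. This is only an illustration of the flow, not our actual code: the `ClientNode` object, the `systemFor` method and the way the port is passed in are made up for the example, and it assumes Akka 2.5.8 or later for `CoordinatedShutdown.UnknownReason`.

```scala
import akka.actor.{ActorSystem, CoordinatedShutdown}
import akka.cluster.Cluster
import com.typesafe.config.ConfigFactory

// Hypothetical holder for the client node's ActorSystem (illustrative names).
object ClientNode {
  private var current: Option[ActorSystem] = None

  // Returns the running ActorSystem, or creates a new one if the previous
  // one was shut down (step 3: reinitialize on the next request).
  def systemFor(port: Int): ActorSystem = ClientNode.synchronized {
    current.getOrElse {
      val config = ConfigFactory
        .parseString(s"akka.remote.netty.tcp.port = $port")
        .withFallback(ConfigFactory.load())
      val system = ActorSystem("TheServer", config)
      val cluster = Cluster(system)

      // Step 2: runs once this node has been removed from the cluster.
      cluster.registerOnMemberRemoved {
        cluster.leave(cluster.selfAddress) // mirrors the callback in the post
        CoordinatedShutdown(system).run(CoordinatedShutdown.UnknownReason)
      }

      // Clear the reference after termination so the next request starts
      // a fresh ActorSystem (the "reset to null" part of step 2).
      system.registerOnTermination {
        ClientNode.synchronized {
          if (current.contains(system)) current = None
        }
      }

      current = Some(system)
      system
    }
  }
}
```

Calling `ClientNode.systemFor(4504)` again after the shutdown corresponds to the same-port case that shows the "New incarnation" message.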

In my testing, if the client node tries to reinitialize the actor system using the same port 4504, I see this “New incarnation” message for 5 minutes.

New incarnation of existing member [Member(address = akka.tcp://abc@machineName:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.

After 5 mins, the cluster does become healthy again.

If the client instead reinitializes the actorSystem on a different port, the cluster seems to return to a healthy state much more quickly.

My question is: why does it take 5 minutes when the client node left cleanly? The server log is below.

Thanks,
Grace

11:24:26.756 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4503, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:24:26.856 ERROR [TheServer-akka.actor.default-dispatcher-4] akka.remote.EndpointWriter:67 - AssociationError [akka.tcp://TheServer@MACHINENAME:4501] <- [akka.tcp://TheServer@MACHINENAME:4504]: Error [Shut down address: akka.tcp://TheServer@MACHINENAME:4504] [
akka.remote.ShutDownAssociation: Shut down address: akka.tcp://TheServer@MACHINENAME:4504
Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system terminated the association because it is shutting down.
]
11:24:27.916 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: SeenChanged(true,Set(akka.tcp://TheServer@MACHINENAME:4503, akka.tcp://TheServer@MACHINENAME:4504, akka.tcp://TheServer@MACHINENAME:4502, akka.tcp://TheServer@MACHINENAME:4501))
11:24:27.916 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4503, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:24:27.916 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged()
11:24:27.916 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4503, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:24:29.572 INFO  [TheServer-akka.actor.default-dispatcher-32] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged()
11:24:29.572 INFO  [TheServer-akka.actor.default-dispatcher-32] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4503, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:24:35.228 WARN  [TheServer-akka.actor.default-dispatcher-2] a.remote.ReliableDeliverySupervisor:75 - Association with remote system [akka.tcp://TheServer@MACHINENAME:4504] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://TheServer@MACHINENAME:4504]] Caused by: [Connection refused: no further information: MACHINENAME/144.203.109.177:4504]

(More log lines here, removed to save space.)
11:27:09.216 INFO  [TheServer-akka.actor.default-dispatcher-3] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:27:09.216 INFO  [TheServer-akka.actor.default-dispatcher-3] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged()
11:27:09.216 INFO  [TheServer-akka.actor.default-dispatcher-3] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:27:17.147 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Received InitJoin message from [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-12#2077907629]] to [akka.tcp://TheServer@MACHINENAME:4501]
11:27:17.147 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-12#2077907629]]
11:27:17.147 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - New incarnation of existing member [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
11:27:27.158 INFO  [TheServer-akka.actor.default-dispatcher-35] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Received InitJoin message from [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-13#-1480625418]] to [akka.tcp://TheServer@MACHINENAME:4501]
11:27:27.158 INFO  [TheServer-akka.actor.default-dispatcher-35] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-13#-1480625418]]

(More messages here, removed to save space.)

11:29:59.154 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-27#1103656718]]
11:29:59.154 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - New incarnation of existing member [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
11:30:09.162 INFO  [TheServer-akka.actor.default-dispatcher-21] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Received InitJoin message from [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-28#-1811863137]] to [akka.tcp://TheServer@MACHINENAME:4501]
11:30:09.162 INFO  [TheServer-akka.actor.default-dispatcher-21] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-28#-1811863137]]
11:30:09.162 INFO  [TheServer-akka.actor.default-dispatcher-21] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - New incarnation of existing member [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
11:30:20.145 INFO  [TheServer-akka.actor.default-dispatcher-4] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Received InitJoin message from [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-29#-1763983524]] to [akka.tcp://TheServer@MACHINENAME:4501]
11:30:20.145 INFO  [TheServer-akka.actor.default-dispatcher-4] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-29#-1763983524]]
11:30:20.155 INFO  [TheServer-akka.actor.default-dispatcher-4] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - New incarnation of existing member [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
11:30:27.949 INFO  [TheServer-akka.actor.default-dispatcher-4] c.m.d.i.s.c.p.f.TheClusterUnReachableNodeRemover:82 - Member unreachable detected: Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)
11:30:27.949 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Reachability event: UnreachableMember(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down))
11:30:27.949 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:27.949 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged(akka.tcp://TheServer@MACHINENAME:4502 -> akka.tcp://TheServer@MACHINENAME:4504: Unreachable [Unreachable] (1))
11:30:27.949 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:28.199 WARN  [TheServer-akka.actor.default-dispatcher-21] akka.cluster.ClusterCoreDaemon:75 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)]. Node roles [TheSeedNode, dc-default]
11:30:28.199 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: SeenChanged(false,Set(akka.tcp://TheServer@MACHINENAME:4501))
11:30:28.199 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:28.199 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged(akka.tcp://TheServer@MACHINENAME:4501 -> akka.tcp://TheServer@MACHINENAME:4504: Unreachable [Unreachable] (1), akka.tcp://TheServer@MACHINENAME:4502 -> akka.tcp://TheServer@MACHINENAME:4504: Unreachable [Unreachable] (1))
11:30:28.199 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:29.199 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: SeenChanged(true,Set(akka.tcp://TheServer@MACHINENAME:4501, akka.tcp://TheServer@MACHINENAME:4502))
11:30:29.199 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:29.199 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Other event: ReachabilityChanged(akka.tcp://TheServer@MACHINENAME:4501 -> akka.tcp://TheServer@MACHINENAME:4504: Unreachable [Unreachable] (1), akka.tcp://TheServer@MACHINENAME:4502 -> akka.tcp://TheServer@MACHINENAME:4504: Unreachable [Unreachable] (1))
11:30:29.199 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterStateListener:41 - Cluster state: members=TreeSet(Member(address = akka.tcp://TheServer@MACHINENAME:4501, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4502, status = Up), Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Up)), unreachable=Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)), leader=Some(akka.tcp://TheServer@MACHINENAME:4501)
11:30:29.959 INFO  [TheServer-akka.actor.default-dispatcher-34] c.m.d.i.s.c.p.f.TheClusterUnReachableNodeRemover:66 - downing unreachable node: Set(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down))
11:30:31.149 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Received InitJoin message from [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-30#-1978041379]] to [akka.tcp://TheServer@MACHINENAME:4501]
11:30:31.149 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Sending InitJoinAck message from node [akka.tcp://TheServer@MACHINENAME:4501] to [Actor[akka.tcp://TheServer@MACHINENAME:4504/system/cluster/core/daemon/joinSeedNodeProcess-30#-1978041379]]
11:30:31.149 INFO  [TheServer-akka.actor.default-dispatcher-34] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - New incarnation of existing member [Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
11:30:39.199 INFO  [TheServer-akka.actor.default-dispatcher-21] a.c.Cluster(akka://TheServer):80 - Cluster Node [akka.tcp://TheServer@MACHINENAME:4501] - Leader is removing unreachable node [akka.tcp://TheServer@MACHINENAME:4504]
11:30:39.199 INFO  [TheServer-akka.actor.default-dispatcher-35] c.m.d.i.s.c.p.f.TheClusterUnReachableNodeRemover:85 - Member removed detected: Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Removed)
11:30:39.199 INFO  [TheServer-akka.actor.default-dispatcher-21] c.m.d.i.s.c.p.f.TheClusterStateListener:40 - Member event: MemberRemoved(Member(address = akka.tcp://TheServer@MACHINENAME:4504, status = Removed),Down)

That looks strange. After the node has been downed, the leader (here 4501) should remove it, unless there are other unreachable nodes that haven’t been downed (but that is not the case here).

Have you changed any of the configuration? Which version is this?

I’d probably turn on debug logging and also

akka.cluster.debug.verbose-gossip-logging = on

Do you see any log messages with “Leader can currently not perform its duties” ?
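
For reference, a minimal sketch of both settings together in application.conf. It assumes the logging backend (for example the SLF4J configuration) also lets DEBUG through, and that the Akka version in use supports the verbose gossip flag:

```hocon
akka {
  # Let Akka emit debug-level log events
  loglevel = DEBUG
  # Log gossip details from the cluster core
  cluster.debug.verbose-gossip-logging = on
}
```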

Thanks Patrik.

That’s my understanding also. At some point after 4504 was downed, node 4503 did exit, but it left the cluster gracefully and did a coordinated shutdown. So 4503 was never in the unreachable list, as the log shows. I am therefore not sure why there is a delay before 4504 is removed and allowed to rejoin, since there is nothing else in the unreachable list.

We are using Akka 2.5.9.

This is our Akka configuration:

    akka {
      coordinated-shutdown.phases {
        cluster-leave.timeout = 60 s
        cluster-exiting.timeout = 60 s
        actor-system-terminate.timeout = 60 s
      }
      actor {
        allow-java-serialization = off
        provider = "akka.cluster.ClusterActorRefProvider"
        serializers {
          kryo = "com.twitter.chill.akka.AkkaSerializer"
        }
        serialization-bindings {
          "java.io.Serializable" = kryo
        }
      }
      extensions = [ "akka.cluster.metrics.ClusterMetricsExtension" ]
      remote = {
        maximum-payload-bytes = "500000000 bytes"
        transport-failure-detector {
          heartbeat-interval = 30 s
          acceptable-heartbeat-pause = 300 s
        }
        watch-failure-detector {
          heartbeat-interval = 30 s
          threshold = 12.0
          acceptable-heartbeat-pause = 300 s
          unreachable-nodes-reaper-interval = 5 s
          expected-response-after = 5 s
        }
        netty.tcp {
          hostname = "0.0.0.0"
          bind-hostname = "0.0.0.0"
          message-frame-size = "500000000b"
          send-buffer-size = "500000000b"
          receive-buffer-size = "500000000b"
          maximum-frame-size = "500000000b"
        }
      }
      cluster {
        roles = [theRunner]
        auto-down-unreachable-after = off
        periodic-tasks-initial-delay = 3 s
        gossip-interval = 10 s
        gossip-time-to-live = 6 s
        leader-actions-interval = 10 s
        allow-weakly-up-members = off
        failure-detector {
          heartbeat-interval = 30 s
          threshold = 12.0
          acceptable-heartbeat-pause = 300 s
        }
        metrics.enabled = off
        cluster-dispatcher {
          type = "Dispatcher"
          executor = "fork-join-executor"
          fork-join-executor {
            parallelism-min = 2
            parallelism-max = 8
          }
        }
        use-dispatcher = akka.cluster.cluster-dispatcher
      }
      loggers = ["akka.event.slf4j.Slf4jLogger"]
      loglevel = DEBUG
      logging-filter = "akka.event.slf4j.Slf4jLoggingFilter"
      log-dead-letters = off
      log-dead-letters-during-shutdown = off
      actor.deployment {
        default.cluster {
          // max-nr-of-instances-per-node = 1
        }
        /TheRouter/remoteGroup {
          router = round-robin-pool
          routees.paths = ["/user/TheSupervisor/TheRunner"]
          cluster {
            enabled = on
            allow-local-routees = off
            use-role = TheRunner
            max-nr-of-instances-per-node = 1
          }
          pool-dispatcher {
            executor = "thread-pool-executor"
            thread-pool-executor {
              core-pool-size-max = 2
            }
            throughput = 1
          }
        }
      }
    }

No, there is no message about “Leader can currently not perform its duties”.

I will turn on the debug logging as you suggested and will post more if I see anything.

Thanks,
Grace

There are a lot of changes to the configuration that will cause really bad behavior. Please revert to the defaults unless you are sure, and have confirmed, that a change from the defaults is a good one. More specifically, I think this issue is caused by your settings for gossip-interval, leader-actions-interval, heartbeat-interval, and acceptable-heartbeat-pause.

If you want to make the failure detector less sensitive you should only adjust the acceptable-heartbeat-pause, and 300s is too much. 10s should be enough for any somewhat healthy system.
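
As a rough sketch of that advice, the cluster section would shrink to a single override. The 10 s value is the one suggested above; all other settings are simply omitted so that the Akka defaults apply (the stated default of 3 s is for Akka 2.5.x):

```hocon
akka.cluster {
  # gossip-interval, leader-actions-interval, failure-detector.heartbeat-interval
  # and failure-detector.threshold are intentionally not set here, so the
  # library defaults are used.
  failure-detector {
    # Only override: tolerate longer pauses (default is 3 s)
    acceptable-heartbeat-pause = 10 s
  }
}
```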

Thanks for the suggestion. I will talk to teammates to find out why they were customized this way originally.

From what I have been able to gather so far, the configuration was changed from the defaults for a couple of reasons.

  1. Long GC pauses (possibly a few minutes, although I have heard this has improved since). That was the reason for the increased acceptable-heartbeat-pause, as we don’t want to down the actorSystem because of a GC pause.

  2. We have 16-core machines with 23-33 processes running on each, and each process has at least 10-20 threads. The cluster can have 450+ nodes. There was a concern about resource contention: the thread doing gossip convergence has to compete for resources, and with 450+ nodes there would also be a lot of gossip chatter. Hence some of the gossip interval properties were increased.

Given these general concerns, which properties, in your opinion, are the only ones we should consider changing when needed?
It would be good to know a sensible starting point.

Thanks,
Grace

Also, you are right: once I use the default settings in the test, I don’t get this “New incarnation…” message anymore. Thanks.

GC pauses of minutes sound like a very unhealthy system that requires tuning or major changes. If you can’t send a few messages per second (the gossip and heartbeat intervals) you have serious problems and can’t expect that Akka Cluster or anything else will work as expected.

100s of active threads on 16 cores sounds like you are overloading the hardware.

I would only change acceptable-heartbeat-pause, and to no more than 10-20 seconds.