Cluster losing all singletons

Hi guys,

we have an Akka-based application in production and had an outage a couple of days ago because for some unknown reason our Akka cluster lost all its cluster singletons. Our application of course couldn’t handle that and basically became non-responsive. Restarting the entire cluster fixed the issue.

Background: we have a 4 node cluster (all nodes have exactly the same function, same roles, same code, same version) and make use of a couple of cluster singletons. Each node creates a cluster singleton manager and a cluster singleton proxy for each of the cluster singletons we use. We use Akka version 2.4.20.

The scenario was as follows:
(1) One node in our 4 node cluster became unreachable
(2) After a while our split brain resolver decides to down that node
(3) Just after that decision is made and the sbr downs the node, the node becomes reachable again
(4) As a result, the node still gets the cluster event that he’s removed.
(5) Because of this it starts to stop its cluster singleton manager / proxy actors
(6) Then out of the blue, the node that is currently hosting the cluster singletons decides to hand them over to the node that is being removed from the cluster.
(7) Eventually the node that is downed terminates its actor system.

But the cluster singletons NEVER come back and we end up with a cluster without singletons, which is unacceptable for our application.

Could this be caused by a bug in Akka or are we somehow using Akka cluster in the wrong way? Note that this is the very first time we see this, so I can imagine that this is some kind of race condition.

thanks,

Bert

PS: Here are the logs

April 16th 2018, 19:13:47.719	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:47.719	Cluster Node [akka.tcp://NaSystem@10.38.174.91:16940] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up)]. Node roles [...]	akka.tcp://NaSystem@10.38.174.91:16940/system/cluster/core/daemon
April 16th 2018, 19:13:47.939	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:47.939	Cluster Node [akka.tcp://NaSystem@10.38.45.141:9630] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up)]. Node roles [...]	akka.tcp://NaSystem@10.38.45.141:9630/system/cluster/core/daemon
April 16th 2018, 19:13:47.952	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:48.125	Cluster Node [akka.tcp://NaSystem@10.38.174.92:24183] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Up)]. Node roles [...]	akka.tcp://NaSystem@10.38.174.92:24183/system/cluster/core/daemon

our split brain resolver decides to down akka.tcp://NaSystem@10.38.45.142:12763

April 16th 2018, 19:13:54.978	method=onTick downing UniqueAddress(akka.tcp://NaSystem@10.38.45.142:12763,-1651164603)	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:54.980	Cluster Node [akka.tcp://NaSystem@10.38.174.91:16940] - Marking unreachable node [akka.tcp://NaSystem@10.38.45.142:12763] as [Down]	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:55.721	Cluster Node [akka.tcp://NaSystem@10.38.174.91:16940] - Marking node(s) as REACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down)]. Node roles [...]	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:55.939	Cluster Node [akka.tcp://NaSystem@10.38.45.141:9630] - Marking node(s) as REACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down)]. Node roles [...]	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:56.122	Cluster Node [akka.tcp://NaSystem@10.38.174.92:24183] - Marking node(s) as REACHABLE [Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down)]. Node roles [...]	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:56.123	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:56.128	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:56.943	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.45.142:12763, status = Down))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:58.258	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://NaSystem@10.38.174.91:16940, status = Up), Member(address = akka.tcp://NaSystem@10.38.174.92:24183, status = Up), Member(address = akka.tcp://NaSystem@10.38.45.141:9630, status = Up)]. Node roles [...]	akka.tcp://NaSystem@10.38.45.142:12763/system/cluster/core/daemon
April 16th 2018, 19:13:58.398	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.174.91:16940, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:58.401	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.45.141:9630, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:58.401	method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@10.38.174.92:24183, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:58.628	method=onLeaderChanged leader=akka.tcp://NaSystem@10.38.45.142:12763	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:59.173	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip status from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.174.91:16940,1008714780)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.173	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip status from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.174.91:16940,1008714780)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.173	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip status from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.174.92:24183,1800987370)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.522	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.174.91:16940,1008714780)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.522	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.45.141:9630,1012418164)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.522	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Ignoring received gossip from unreachable [UniqueAddress(akka.tcp://NaSystem@10.38.174.92:24183,1800987370)] 	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.627	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Marking node(s) as REACHABLE [Member(address = akka.tcp://NaSystem@10.38.174.91:16940, status = Up), Member(address = akka.tcp://NaSystem@10.38.174.92:24183, status = Up), Member(address = akka.tcp://NaSystem@10.38.45.141:9630, status = Up)]. Node roles [...]	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:13:59.662	method=onLeaderChanged leader=akka.tcp://NaSystem@10.38.174.91:16940	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:59.662	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.174.91:16940, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:59.662	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.174.92:24183, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:13:59.662	method=onReachableMember member=ReachableMember(Member(address = akka.tcp://NaSystem@10.38.45.141:9630, status = Up))	akka://NaSystem/system/cluster/core/daemon/downingProvider
April 16th 2018, 19:14:00.080	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Shutting down...	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:14:00.080	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Shutting down myself	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:14:00.141	method=onMemberRemoved address=akka.tcp://NaSystem@10.38.45.142:12763 I was removed from the cluster. Waiting 10000 milliseconds before self-termination.	akka://NaSystem/user/$a/gracefulShutdownActor
April 16th 2018, 19:14:00.169	Cluster Node [akka.tcp://NaSystem@10.38.45.142:12763] - Successfully shut down	akka.cluster.Cluster(akka://NaSystem)
April 16th 2018, 19:14:00.317	method=onMemberRemoved about to remove akka.tcp://NaSystem@10.38.174.91:16940 from current members	akka://NaSystem/user/clusterMonitor
April 16th 2018, 19:14:00.491	method=onMemberRemoved about to remove akka.tcp://NaSystem@10.38.174.92:24183 from current members	akka://NaSystem/user/clusterMonitor
April 16th 2018, 19:14:00.535	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/channel%401Coordinator
April 16th 2018, 19:14:00.615	method=onMemberRemoved about to remove akka.tcp://NaSystem@10.38.45.141:9630 from current members	akka://NaSystem/user/clusterMonitor
April 16th 2018, 19:14:00.638	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnDraftReadModel%401.0_aCoordinator
April 16th 2018, 19:14:00.638	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnReadModel%401.0_aCoordinator
April 16th 2018, 19:14:00.638	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/validationRuleManager@1
April 16th 2018, 19:14:00.638	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/pacs002Receiver%401.0Coordinator
April 16th 2018, 19:14:00.696	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturnDraft%401Coordinator
April 16th 2018, 19:14:00.697	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturn%401Coordinator
April 16th 2018, 19:14:00.711	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:00.711	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturnBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:00.711	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/ccti%401Coordinator
April 16th 2018, 19:14:00.717	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiFinalBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:00.721	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchCustomerCreditTransferInitiationDraftReadModel%401.0_aCoordinator
April 16th 2018, 19:14:00.748	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturnFinalBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:00.748	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/bpmnCoverageActor@1
April 16th 2018, 19:14:00.748	Younger observed OldestChanged: [Some(akka.tcp://NaSystem@10.38.45.141:9630) -> myself]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.038	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/mpManifestManager@1
April 16th 2018, 19:14:01.038	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchCustomerCreditTransfersReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.038	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/memberIdManager@1
April 16th 2018, 19:14:01.038	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/prRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.199	I'm in a and leaving the cluster and no longer sending events	akka://NaSystem/user/clusterMonitor
April 16th 2018, 19:14:01.216	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/idempotencyKey%401.0Coordinator
April 16th 2018, 19:14:01.216	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/sequenceActorSingleton@1
April 16th 2018, 19:14:01.216	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/customerCreditTransferInitiationDraft%401Coordinator
April 16th 2018, 19:14:01.216	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/user/clearingAndSettlementJmsBlacklistActor@1
April 16th 2018, 19:14:01.321	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:01.339	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/pacs002Receiver%401.0Coordinator
April 16th 2018, 19:14:01.382	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/channel%401Coordinator
April 16th 2018, 19:14:01.382	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.390	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnDraftReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.390	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.390	ClusterSingletonManager state change [Younger -> BecomingOldest]	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturnDraft%401Coordinator
April 16th 2018, 19:14:01.416	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:01.579	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.599	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/searchPaymentReturnDraftReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.601	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/paymentReturnDraft%401Coordinator
April 16th 2018, 19:14:01.603	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/cctiRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.604	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/pacs002Receiver%401.0Coordinator
April 16th 2018, 19:14:01.604	Self removed, stopping ClusterSingletonManager	akka.tcp://NaSystem@10.38.45.142:12763/system/sharding/channel%401Coordinator

Singleton is moved to node that is being removed from the cluster. We never recover from this

April 16th 2018, 19:14:01.796	Singleton terminated, hand-over done [akka.tcp://NaSystem@10.38.45.141:9630 -> Some(akka.tcp://NaSystem@10.38.45.142:12763)]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/cctiBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/searchPaymentReturnDraftReadModel%401.0_aCoordinator
April 16th 2018, 19:14:01.796	Singleton terminated, hand-over done [akka.tcp://NaSystem@10.38.45.141:9630 -> Some(akka.tcp://NaSystem@10.38.45.142:12763)]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/cctiRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/cctiBusinessEvents%401.0Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/paymentReturnDraft%401Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/pacs002Receiver%401.0Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/cctiRetryFailedFraudCheckNotificationReadModel%401.0Coordinator
April 16th 2018, 19:14:01.796	ClusterSingletonManager state change [Oldest -> HandingOver]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/channel%401Coordinator
April 16th 2018, 19:14:01.796	Singleton terminated, hand-over done [akka.tcp://NaSystem@10.38.45.141:9630 -> Some(akka.tcp://NaSystem@10.38.45.142:12763)]	akka.tcp://NaSystem@10.38.45.141:9630/system/sharding/searchPaymentReturnDraftReadModel%401.0_aCoordinator

What is the split brain resolver you are using? (mostly out of curiosity).

I think this may be a bug, I thought I remembered a reported issue with something like this, but I couldn’t find it doing a quick issue search right now.

We have our own custom SplitBrainResolver. Makes use of ZooKeeper to ensure consensus is reached.

We found


that seems to be in the neighbourhood. Would that fix this issue?

Looks like the issue is the one fixed in the commit that you found.
I’d recommend to update to latest version, which is 2.5.12. The migration from 2.4.x to 2.5.x should be rather simple, somewhat depending on what parts of Akka you are using. See details in migration guide.

Thanks for the quick reply!
Happy that this is already fixed. We’ll update to the latest version.