Erroneous split-brain situation in cluster (with properly working SBR)


(Bert Robben) #1

Dear Akka-team,

A couple of days ago our Akka-cluster-based application went into a bad split-brain situation in production. After a deep dive into the logs, I get the impression that this might be caused by an issue in Akka Cluster itself. The basic problem is that a node was removed from one part of the cluster during split-brain resolution, and this removal was not caused by us downing that node. I'd appreciate it a lot if you could have a look at my story below and indicate what the problem could be. Unfortunately, I can't reproduce the issue. We have various actors that listen to cluster events and log them, so that information is available.

In production, our application is running on Akka 2.4.20 with Scala 2.11 (the latest version of our app, which is not yet in production, runs on Akka 2.5.x).

Here is the detailed explanation.

  1. The cluster application is running fine with 4 members (A-B-C-D). Then, due to external causes, the network gets saturated.

  2. Because of the saturation, cluster members start seeing UnreachableMember events. In this particular case two partitions form: A and B can still see each other but flag C and D as unreachable, while C and D likewise see each other fine but can no longer see A and B.

  3. After a short time, our split brain resolver (a custom implementation based on ZooKeeper) on node A kicks in and decides to down C and D.

// logs from A
p0440p0l9989.log:[15:50:36,712] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,-838580702),UniqueAddress(akka.tcp://NaSystem@,-9842112)

  4. Up to this point everything is fine and works as expected. The issue, however, occurs on node B. As B is not the leader of the A-B partition, it does not actively remove members but learns about removals through Akka cluster gossip. Just before this information reaches B, A also becomes unreachable from B's point of view.

// logs from B
p0440p0l5705.log:[15:50:38,010] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@, status = Down))

p0440p0l5705.log:[15:50:38,106] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)
p0440p0l5705.log:[15:50:38,108] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)

===> Up to there all is fine. But in addition to the two expected MemberRemoved events for C and D (the two log lines above), this node also receives, at exactly the same time, an event indicating that node A has been removed. I am completely at a loss as to where this message is coming from; our application has no logic for removing a node other than downing through the SBR.

p0440p0l5705.log:[15:50:38,106] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)

The end result of these messages is that node B now forms a cluster on its own and is perfectly happy with it.

  5. Milliseconds later, node B also becomes unreachable for node A, after which A's split brain resolver decides to remove B as well.

p0440p0l9989.log:[15:50:43,193] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@, status = Up))

p0440p0l9989.log:[15:50:50,462] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,980338586)

  6. Two seconds later, the split brain resolver of node C is triggered. It wants to down A and B, but sees in ZooKeeper that the decision has already been made to down B, C and D, so it complies and downs those. As B is not reachable, the net effect is that C and D go down.

p0440p0l9014.log:[15:50:52,743] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,980338586),UniqueAddress(akka.tcp://NaSystem@,-838580702),UniqueAddress(akka.tcp://NaSystem@,-9842112)

End result: a real split-brain situation where nodes A and B form separate clusters. A and B then live on side by side for quite some time (>> minutes) and wreak havoc in the system.
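For readers following along, the "first decision wins" arbitration described in steps 3, 5 and 6 can be sketched in a few lines. This is only an illustrative model, not our actual resolver: the `on_tick` function, the dict standing in for ZooKeeper, and all names here are assumptions.

```python
# Toy model of the ZooKeeper-based arbitration described above.
# ZooKeeper is modelled as a plain dict; every name is an illustrative
# assumption, not the real resolver's API.

def on_tick(zk: dict, me: str, want_to_down: set) -> set:
    """One resolver tick: if a recorded decision already downs us, comply
    with it; otherwise record our own victims (possibly extending an
    earlier decision made by our own partition)."""
    downed = zk.setdefault("downed", set())
    if me in downed:
        return set(downed)        # we lost the race: comply, even against ourselves
    downed |= want_to_down        # we are (still) a survivor: record our victims
    return set(downed)

zk = {}
# Step 3: A's resolver fires first and downs C and D.
assert on_tick(zk, "A", {"C", "D"}) == {"C", "D"}
# Step 5: B has become unreachable too, so A extends the decision with B.
assert on_tick(zk, "A", {"B"}) == {"B", "C", "D"}
# Step 6: C's resolver wants to down A and B, but finds itself in the
# recorded decision and complies, downing B, C and D instead.
assert on_tick(zk, "C", {"A", "B"}) == {"B", "C", "D"}
```

Note that the anomaly in step 4 is orthogonal to this arbitration: B never consulted ZooKeeper at all, it simply processed an unexplained MemberRemoved for A, which is why the arbitration above could work as designed and the cluster still ended up split.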



Quarantine breaks cluster abstraction
(Bert Robben) #2

After further digging, we found that the node was downed after it had been quarantined.

Marking node as TERMINATED [akka.tcp://NaSystem@], due to quarantine.

I’m opening up a separate topic to follow-up on this.