Erroneous split-brain situation in cluster (with properly working sbr)

Dear Akka-team,

A couple of days ago our Akka-cluster-based application went into a bad split-brain situation in production. After a deep dive into the logs I get the impression that this might be caused by an issue in Akka Cluster itself. The basic problem is that a node was removed from one part of the cluster during split brain resolution, and this removal was not caused by us downing the node. I would appreciate it a lot if you could have a look at my story below and indicate what the problem could be. Unfortunately I can't reproduce the issue. We have various actors that listen to cluster events and log them, so that information is available.

In production, our application is running on Akka 2.4.20 for Scala 2.11 (the latest version of our app, which is not yet in production, runs on Akka 2.5.x for Scala 2.11).

Here is the detailed explanation.

  1. The cluster application is running fine with 4 members (A-B-C-D). Then, due to external causes, the network gets saturated.

  2. Because of the saturation, cluster members start seeing UnreachableMember events. In this particular case two partitions are created: A and B can see each other but flag C and D as unreachable. For C and D the situation is mirrored: they see each other fine but can no longer see A and B.
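To make the two one-sided views in step 2 concrete, here is a small sketch (hypothetical; the node names and reachability map are invented for illustration and this is not the application's code) that derives each side's perceived partition from what it flags as unreachable:

```scala
// Hypothetical model: each node's partition is every member it can still
// reach, i.e. the full member set minus what it flags as unreachable.
def partitions(nodes: Set[String], unreachable: Map[String, Set[String]]): Set[Set[String]] =
  nodes.map { n =>
    nodes -- unreachable.getOrElse(n, Set.empty)
  }

// The situation from step 2: A and B flag C and D, and vice versa.
val view = Map(
  "A" -> Set("C", "D"), "B" -> Set("C", "D"),
  "C" -> Set("A", "B"), "D" -> Set("A", "B"))
```

Feeding `view` to `partitions` yields the two camps described above: `{A, B}` and `{C, D}`, each side convinced the other half is gone.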

  3. After a short time our split brain resolver (custom implementation based on ZooKeeper) on node A kicks in and decides to down C and D.

// logs from A
p0440p0l9989.log:[15:50:36,712] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,-838580702),UniqueAddress(akka.tcp://NaSystem@,-9842112)

  4. Up to this point everything is fine and works as expected. The issue, however, occurs on node B. As B is not the leader in the A-B partition, it does not actively remove members but instead learns of removals through Akka cluster gossip. Just before this gossip reaches B, A also becomes unreachable from B's point of view.

// logs from B
p0440p0l5705.log:[15:50:38,010] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@, status = Down))

p0440p0l5705.log:[15:50:38,106] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)
p0440p0l5705.log:[15:50:38,108] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)

===> Up to there all is fine. But in addition to the two expected MemberRemoved events for C and D (the two log lines above), this node also receives, at exactly the same time, an event indicating that node A has been removed. I am completely at a loss as to where this message is coming from. Our application has no logic for removing a node (other than downing through the SBR).

p0440p0l5705.log:[15:50:38,106] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onMemberRemoved member=MemberRemoved(Member(address = akka.tcp://NaSystem@, status = Removed),Down)

The end result of these messages is that node B now forms a cluster on its own and is perfectly happy with it.

  5. Milliseconds later, node B has also become unreachable from node A's point of view, after which the split brain resolver of A decides to down B as well.

p0440p0l9989.log:[15:50:43,193] [INFO ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onUnreachableMember member=UnreachableMember(Member(address = akka.tcp://NaSystem@, status = Up))

p0440p0l9989.log:[15:50:50,462] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,980338586)

  6. Two seconds later, the split brain resolver of node C is triggered. It wants to down A and B, but sees in ZooKeeper that the decision has already been made to down B, C and D, so it complies and downs those. As B is not reachable, the net effect is that C and D go down.

p0440p0l9014.log:[15:50:52,743] [WARN ] [] [c.c.n.d.s.i.ZooKeeperBasedSplitBrainResolver] [akka://NaSystem/system/cluster/core/daemon/downingProvider] - method=onTick downing UniqueAddress(akka.tcp://NaSystem@,980338586),UniqueAddress(akka.tcp://NaSystem@,-838580702),UniqueAddress(akka.tcp://NaSystem@,-9842112)
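The "first decision wins" behaviour in step 6 can be sketched as follows (a hypothetical model, not the poster's implementation: an `AtomicReference` stands in for the ZooKeeper znode, and the node sets are illustrative):

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch: the first partition to record which nodes must go
// wins; any resolver that fires later reads the existing decision and
// complies with it instead of publishing its own.
final class DecisionStore {
  private val decision = new AtomicReference[Set[String]](null)

  /** Returns the nodes this caller must down: its own proposal if it won
    * the race, otherwise the decision already recorded by another node. */
  def proposeDowning(toDown: Set[String]): Set[String] =
    if (decision.compareAndSet(null, toDown)) toDown
    else decision.get()
}

val store = new DecisionStore
// Node A's resolver fires first; over steps 3-5 it decides to down B, C and D.
val aDowns = store.proposeDowning(Set("B", "C", "D"))
// Node C's resolver fires later and wants to down A and B, but finds A's
// decision already recorded and complies: so C, D (and unreachable B) go down.
val cDowns = store.proposeDowning(Set("A", "B"))
```

This compresses A's two downing rounds into one proposal for brevity, but it captures why C abandons its own plan and downs itself along with D.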

End result: a real split-brain situation where node A and node B form separate clusters. A and B then live side by side for quite some time (well over several minutes) and wreak havoc in the system.



After further digging, we found out that the node was downed after it was quarantined.

Marking node as TERMINATED [akka.tcp://NaSystem@], due to quarantine.

I’m opening up a separate topic to follow up on this.

A few remarks:

  • What strategy have you implemented for split brain resolution? Having 4 nodes seems to limit you to an arbiter-like approach (fixed IP or oldest node), since you can't apply any quorum-based strategy with an even number of members.
  • Implementing your own split brain resolver means you can potentially run into quite a few issues with the resolver itself; I would not recommend doing this unless you have spent the necessary time researching the topic. You mentioned that you base it on ZooKeeper, which sounds as though you rely on yet another distributed system to make a decision while the network is unstable, and I think that might backfire. The split brain resolver should not be interacting with anything across the wire, since it has to assume that, well, the wire is split :slight_smile:
  • Without looking at your implementation it's hard to tell, but I think you might not have a mechanism in the split brain resolver to account for post-partition instability.
  • Correct, we don’t apply any quorum-based strategies. Instead, we use ZooKeeper to reach consensus amongst the different parts of the cluster to decide on which one is staying and which ones have to go.

  • Note that our application runs on DC/OS and, as such, ZooKeeper is already part of the solution. Also, if the wire were really split and no external communication were possible at all, it would be perfectly fine for us to go down. Our application is a pass-through processing system, so it does not really have a reason to live if it can't talk to the outside world (or the outside world can't talk to us).

  • That’s an interesting comment on post-partition instability. Can you elaborate a bit further on what you actually mean by that?
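For reference, one common way to account for post-partition instability is a "stable-after" window, similar in spirit to the stable-after setting of Akka's own split brain resolver: the resolver only acts once no reachability change has been observed for a configured interval, so it never acts on a picture that is still shifting (as happened on node B above, where A flickered unreachable just as the downing gossip arrived). A hypothetical sketch, with invented names and API:

```scala
// Hypothetical sketch of a stable-after guard: every cluster event resets
// the clock, and downing decisions are only allowed once the membership
// view has been quiet for the configured window.
final class StableAfterGuard(stableAfterMillis: Long) {
  private var lastChangeAt: Long = 0L

  /** Call on every unreachable/reachable/member event. */
  def onClusterEvent(nowMillis: Long): Unit = lastChangeAt = nowMillis

  /** True only when the view has been stable long enough to act on. */
  def mayAct(nowMillis: Long): Boolean = nowMillis - lastChangeAt >= stableAfterMillis
}

val guard = new StableAfterGuard(stableAfterMillis = 5000)
guard.onClusterEvent(nowMillis = 1000) // C and D become unreachable
guard.onClusterEvent(nowMillis = 2000) // A also flickers unreachable
```

With this guard, a resolver polling `mayAct` at 3000 ms would still hold off (the last change was only a second ago), and would only down anything once the picture has stayed unchanged for the full window.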