Quarantine breaks cluster abstraction

bert · September 17, 2018, 7:37am

After analyzing a severe production incident in an Akka Cluster based application, we came to the conclusion that quarantining (as implemented by Akka remoting) breaks the Akka Cluster abstraction.

See "Erroneous split-brain situation in cluster (with properly working sbr)" for a detailed explanation of one example of this.

Our conclusions are:

(1) Quarantine can cause nodes to be removed from the cluster outside of the regular downing process.
(2) Because of this, an Akka cluster can get into a split-brain situation WITHOUT any intervention by the split-brain resolver (the downing provider is NOT consulted at all). This breaks the Akka Cluster abstraction, e.g. the guarantee that only a single instance of a ClusterSingleton runs in the cluster.
(3) This behavior is NOT documented.

We now have the following questions:

(1) Is this the behavior of Akka 2.4 only (in which we had the incident), or can this also occur in 2.5? From our understanding this behavior was introduced by https://github.com/akka/akka/commit/dc9fe4f19c4d4e23b9b8a7b9142e212d37e4f176#diff-21188d53fb6ed0402295be650067d3d1R548 and is still there in Akka 2.5.

(2) How can we avoid the resulting split brain? In our understanding this is a plain bug in Akka Cluster that should be fixed. Ideally we would also get some information on how to avoid this on 2.4, because upgrading our production version to 2.5 in the short term is very hard (it incurs too many changes).

Dear Akka team, can you help us with this?

thanks,

Bert

PS: Note that we’re not using Artery, but I don’t think that this is important.

patriknw · September 17, 2018, 1:05pm

Thanks for reporting @bert. I agree with you that this is wrong (old code that survived later changes such as the introduction of the downing provider). The decision to down a node should be completely in the control of the downing provider.

There is one exception to that: when a node with the same hostname:port is joining again (new UID). Then we have evidence that the previous one has been shut down.
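
Editorial sketch of the rule described above, as a toy model only. This is not Akka's code: the class `RemovalRule`, the `Trigger` enum and the method `mayRemove` are hypothetical names invented for illustration, and the example addresses and UIDs are made up. It only encodes the intended policy: a downing-provider decision may remove a member, a rejoin from the same hostname:port with a new UID may remove the old member, and quarantine on its own may not.

```java
/** Toy model of the removal policy discussed in this thread; not Akka's implementation. */
public class RemovalRule {

    /** A member identity: hostname:port plus a UID that changes on every restart. */
    static final class UniqueAddress {
        final String hostPort;
        final long uid;

        UniqueAddress(String hostPort, long uid) {
            this.hostPort = hostPort;
            this.uid = uid;
        }
    }

    /** Hypothetical reasons for which a member removal could be requested. */
    enum Trigger { DOWNING_PROVIDER_DECISION, QUARANTINE, NEW_UID_JOINING }

    /**
     * Returns true if the member may be removed under the intended policy.
     * 'joining' is only consulted for NEW_UID_JOINING and may be null otherwise.
     */
    static boolean mayRemove(UniqueAddress member, Trigger trigger, UniqueAddress joining) {
        switch (trigger) {
            case DOWNING_PROVIDER_DECISION:
                // The downing provider is the one component allowed to decide this.
                return true;
            case NEW_UID_JOINING:
                // Same hostname:port with a different UID is evidence that the
                // previous process behind this address has been shut down.
                return joining != null
                        && joining.hostPort.equals(member.hostPort)
                        && joining.uid != member.uid;
            default:
                // Quarantine alone must not bypass the downing provider.
                return false;
        }
    }

    public static void main(String[] args) {
        UniqueAddress old = new UniqueAddress("10.0.0.1:2552", 1L);
        UniqueAddress restarted = new UniqueAddress("10.0.0.1:2552", 2L);

        System.out.println(mayRemove(old, Trigger.QUARANTINE, null));                // false
        System.out.println(mayRemove(old, Trigger.NEW_UID_JOINING, restarted));      // true
        System.out.println(mayRemove(old, Trigger.DOWNING_PROVIDER_DECISION, null)); // true
    }
}
```

In this model the behavior reported in this thread corresponds to the `QUARANTINE` case returning true, which is what allows a removal that the downing provider never saw.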

Please create an issue and we’ll continue there.

bert · September 17, 2018, 1:14pm

Thanks for the quick reply!

I created https://github.com/akka/akka/issues/25632 for further follow-up.
