Quarantine breaks cluster abstraction


(Bert Robben) #1

After analyzing a severe production incident in an application based on Akka Cluster, we came to the conclusion that quarantining (as implemented by Akka Remoting) breaks the Akka Cluster abstraction.

See "Erroneous split-brain situation in cluster (with properly working sbr)" for a detailed explanation of one example of this.

Our conclusions are:

(1) Quarantine can cause nodes to be removed from the cluster outside of the regular downing process.
(2) Because of this, an Akka cluster can get into a split-brain situation WITHOUT any intervention of the split-brain resolver (the downing provider is NOT consulted at all). This breaks the guarantees of the Akka Cluster abstraction, e.g. that there is only a single instance of a cluster singleton in the cluster.
(3) This behavior is NOT documented.

We now have the following questions:

(1) Is this the behavior of Akka 2.4 only (in which we had the incident), or can this also occur in 2.5? From our understanding this behavior was introduced by https://github.com/akka/akka/commit/dc9fe4f19c4d4e23b9b8a7b9142e212d37e4f176#diff-21188d53fb6ed0402295be650067d3d1R548 and is still there in Akka 2.5.

(2) How can we avoid the resulting split brain? In our understanding this is a plain bug in Akka Cluster that should be fixed. Ideally we would also get some information on how to avoid this on 2.4, because upgrading our production version to 2.5 in the short term is very hard (it requires too many changes).

Dear Akka team, can you help us with this?

thanks,

Bert

PS: Note that we’re not using Artery, but I don’t think that this is important.


(Patrik Nordwall) #2

Thanks for reporting, @bert. I agree with you that this is wrong (old code that survived later changes such as the introduction of the downing provider). The decision to down a node should be entirely under the control of the downing provider.

There is one exception to that: when a node with the same hostname:port joins again (with a new UID). Then we have evidence that the previous incarnation has been shut down.
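To make that exception concrete, here is a minimal sketch in plain Java. It is only an illustrative model, not Akka's implementation: `NodeIdentity`, `RejoinCheck` and `provesPreviousIncarnationGone` are hypothetical names, and the only assumption taken from the text is that a restarted node keeps its hostname:port but gets a new UID.

```java
import java.util.Objects;

/** Hypothetical stand-in for a member identity: hostname, port and a per-incarnation UID. */
final class NodeIdentity {
    final String host;
    final int port;
    final long uid; // assumed to change every time the node is restarted

    NodeIdentity(String host, int port, long uid) {
        this.host = Objects.requireNonNull(host);
        this.port = port;
        this.uid = uid;
    }

    /** True when both identities refer to the same hostname:port, regardless of UID. */
    boolean sameAddress(NodeIdentity other) {
        return host.equals(other.host) && port == other.port;
    }
}

/** Hypothetical helper that models the exception described above. */
final class RejoinCheck {
    private RejoinCheck() {}

    /**
     * A joining node with the same hostname:port but a different UID is a new
     * incarnation, which is evidence that the previously known one has stopped.
     */
    static boolean provesPreviousIncarnationGone(NodeIdentity known, NodeIdentity joining) {
        return known.sameAddress(joining) && known.uid != joining.uid;
    }
}
```

In this model, only the case "same address, different UID" counts as evidence. An unreachable or quarantined node on its own proves nothing about whether the other side has stopped, which is why that decision belongs to the downing provider.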

Please create an issue and we’ll continue there.


(Bert Robben) #3

Thanks for the quick reply!

I created https://github.com/akka/akka/issues/25632 for further follow-up.