After analysis of a severe production event in an Akka-cluster based application, we came to the conclusion that quarantining (as implemented by Akka remote) breaks the Akka cluster abstraction.
See "Erroneous split-brain situation in cluster (with properly working sbr)" for a detailed explanation of one example of this.
Our conclusions are:
(1) Quarantine can cause nodes to be removed from the cluster outside of the regular downing process.
(2) Because of this, an Akka cluster can get into a split-brain situation WITHOUT any intervention of the split-brain resolver (the downing provider is NOT consulted at all). This breaks the Akka cluster abstraction, e.g. the guarantee that only a single instance of a cluster singleton exists in the cluster.
(3) This behavior is NOT documented.
We now have the following questions:
(1) Is this the behavior of Akka 2.4 only (in which we had the incident), or can this also occur in 2.5? As far as we understand, this behavior was introduced by https://github.com/akka/akka/commit/dc9fe4f19c4d4e23b9b8a7b9142e212d37e4f176#diff-21188d53fb6ed0402295be650067d3d1R548 and is still there in Akka 2.5.
(2) How can we avoid the resulting split brain? In our understanding this is a plain bug in Akka cluster that should be fixed. Ideally we would also get some information on how to avoid this on 2.4, since upgrading our production version to 2.5 in the short term is very hard because it requires too many changes.
Dear Akka team, can you help us with this?
Thanks,
Bert
PS: Note that we’re not using Artery, but I don’t think that this is important.