Actor duplication while a split brain is being resolved

Hello. If a network partition occurs within a cluster, is there a chance of having a duplicated actor (one residing in each of the partitions) while the Split Brain Resolver is resolving the situation?
I mean, while Akka is analyzing the cluster events and checking heartbeats to deal with the issue, is it possible (even a small chance) that an actor “A”, hosted on the left side of the partition, is recreated on the other side due to a request made by a client to one of its nodes, causing some inconsistency in the system? Or is the Split Brain Resolver able to deal with this situation and guarantee (100%) that this scenario will never happen?


Cluster Singleton and Cluster Sharding will not create duplicate instances within one cluster, but if two separate clusters are formed there can be duplicates, one in each cluster. That is the risk of using the auto-down-unreachable-after setting, as we warn about in the documentation.
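For reference, the risky legacy setting being warned about looks like this in `application.conf` (a sketch; the 10-second timeout is an arbitrary illustrative value):

```hocon
akka {
  cluster {
    # Legacy automatic downing: each side of a partition downs the nodes
    # it considers unreachable after the timeout, so BOTH sides can
    # survive as separate clusters — allowing duplicate singletons
    # and sharded entities, one per cluster.
    auto-down-unreachable-after = 10s
  }
}
```

This is exactly the setting SBR is meant to replace: it makes the downing decision with no coordination between the partitions.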

If this is a question about Lightbend’s Split Brain Resolver I think we should discuss it in Lightbend’s customer support channel.

Thanks for your answer, Patrik!

Yes, this question is about Lightbend’s Split Brain Resolver. Although my company has been in touch with Lightbend, we’re not (yet) a Lightbend customer, so I don’t think I can use the support channel.

What I’d like to achieve is to get to know this product better and what benefits it can provide. For this question, in other words, what I’d like to know is whether Split Brain Resolver acts like a “vaccine” (if there’s a split brain, it provides some mechanism for preventing the creation of the same actor twice at the factory level) or like a medication (if there’s a split brain, it will eventually handle it, but with a small chance of actor duplication while it’s analyzing the health of the cluster and taking countermeasures).

Thank you

Split Brain Resolver (SBR) does its very best to avoid split brain (creating several separate Akka clusters), and it’s very unlikely that it will make the wrong decision. There are a few scenarios that can cause it to happen, namely when at least two things happen “at the same time”, such as certain changes in cluster membership occurring at the same time as a network partition. The documentation mentions these scenarios. Lately we have thought some more about those scenarios and realized that we can detect them and avoid those issues too. We will soon implement those improvements.

Another type of problem that is difficult to completely protect against is frequent network instability, typically combined with indirectly connected nodes (non-clean partitions). The failure detection information is constantly changing. We have mechanisms to detect and act on these situations too, but there is a small risk that it can go wrong.

Recently we have actually implemented a new strategy that uses a distributed lease (lock) to decide which nodes are allowed to survive. Only one SBR instance can acquire the lease and make the decision to remain up. The other side will not be able to acquire the lease and will therefore down itself. The lease is backed by Kubernetes. We are doing final testing and this is not released yet, but you can read about it in the snapshot docs.
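As a sketch of how such a lease-based strategy might be configured (the feature is unreleased at the time of writing, so the exact setting names below are illustrative and may differ in the final docs):

```hocon
akka.cluster.split-brain-resolver {
  # Lease-based strategy: only the side that acquires the distributed
  # lease is allowed to stay up; the other side downs itself.
  active-strategy = lease-majority

  lease-majority {
    # Lease implementation backed by Kubernetes (a custom resource
    # serves as the lock). Module name is illustrative.
    lease-implementation = "akka.coordination.lease.kubernetes"
  }
}
```

The appeal of this approach is that the tie-break is delegated to an external, strongly consistent store, so even a perfectly symmetric partition cannot end with both sides staying up.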

Happy to talk more with you in the context of a Lightbend subscription.

But to follow up on your original question, SBR acts as a vaccine, in your terms. It prevents split brain from ever happening, and therefore never allows the duplicates to be created; it does not try to treat split brain after the fact. I don’t want to write an entire treatise on clustering and split brain, but what Split Brain Resolver does is define the rules under which failover is allowed to happen. (Because the safe thing to do is to shut down, but the ideal outcome from an availability perspective is to fail over.)

This often involves tradeoffs between flexibility, safety, and availability. The beauty of SBR is that it makes all of those tradeoffs easy to configure. Totally paranoid? Use down-all. Mostly paranoid? Use keep-majority and down-all-when-unstable. Don’t need the flexibility to change cluster size dynamically? Use static-quorum and avoid a lot of edge cases. But, as Patrik points out, it’s nearly impossible to rule out all edge cases when you are talking about distributed computing.
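As a rough sketch, those choices map onto SBR configuration along these lines (one strategy is active at a time; the durations and quorum size are illustrative values, not recommendations):

```hocon
akka.cluster.split-brain-resolver {
  # Pick exactly one strategy:
  # active-strategy = down-all        # "totally paranoid": down every node on a partition
  active-strategy = keep-majority     # "mostly paranoid": the majority side survives
  # active-strategy = static-quorum   # fixed cluster size, fewer edge cases

  # Down the whole cluster if membership keeps flapping for longer
  # than this (illustrative duration), instead of risking a bad decision:
  down-all-when-unstable = 15s

  static-quorum {
    # Only used with static-quorum: illustrative value for a 5-node cluster,
    # so a partition needs at least 3 nodes to survive.
    quorum-size = 3
  }
}
```

The general pattern: the more you can pin down about your topology ahead of time (static-quorum), the fewer ambiguous situations SBR has to reason about at failure time.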

I’d recommend looking through the docs, because they go through all of the details.


Thanks for the clarifications!