Actor duplication while a split brain is being resolved

fabiogouw · January 26, 2019, 1:07pm

Hello. If a network partition occurs within a cluster, is there a chance of having a duplicated actor (one residing in each partition) while Split Brain Resolver is resolving the situation?
I mean, while Akka is analyzing the cluster events and checking heartbeats to deal with the issue, is it possible (even with a small chance) that an actor “A”, hosted on the left side of the partition, can be recreated on the other side due to a request made by a client to one of its nodes, causing some inconsistency in the system? Or is Split Brain Resolver able to deal with this situation and guarantee (100%) that this scenario will never happen?

Thanks

patriknw · January 27, 2019, 9:45am

Cluster Singleton and Cluster Sharding will not create duplicate instances within one cluster, but if two separate clusters are formed there can be duplicates, one in each cluster. That is the risk of using the auto-down-unreachable-after setting, as we warn in the documentation.
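To make the risk concrete, here is a toy sketch in Python (not Akka code). It assumes a clean partition in which each side auto-downs every node it cannot reach and then carries on as its own cluster. The function names and the "singleton runs on the oldest surviving member" shortcut are simplifications chosen for illustration.

```python
# Toy model (not Akka code) of auto-downing during a network partition.

def auto_down(side, all_nodes):
    """Each side drops every node it cannot reach and continues as its own cluster."""
    return [node for node in all_nodes if node in side]

def singleton_host(members):
    """Illustrative assumption: the singleton runs on the oldest surviving member,
    and the list is ordered from oldest to youngest."""
    return members[0]

nodes = ["n1", "n2", "n3", "n4", "n5"]          # ordered oldest to youngest
left, right = {"n1", "n2", "n3"}, {"n4", "n5"}  # the two sides of the partition

left_cluster = auto_down(left, nodes)
right_cluster = auto_down(right, nodes)

# Both sides believe they are "the" cluster, so each one hosts the singleton.
hosts = [singleton_host(left_cluster), singleton_host(right_cluster)]
print(hosts)  # ['n1', 'n4']
```

In this model neither side ever shuts down, so the singleton ends up with two hosts, one per cluster.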

If this is a question about Lightbend’s Split Brain Resolver I think we should discuss it in Lightbend’s customer support channel.

fabiogouw · February 9, 2019, 6:12pm

Thanks for your answer, Patrik!

Yes, this question is about Lightbend’s Split Brain Resolver. Although my company has been in touch with Lightbend, we’re not (yet) a Lightbend customer, so I don’t think I can use the support channel.

What I’d like to achieve is to get to know this product better and understand what benefits it can provide. To put the question in other words, I’d like to know whether Split Brain Resolver acts like a “vaccine” (if there’s a split brain, it provides some mechanism that prevents the same actor from being created twice, at the factory level) or like a medication (if there’s a split brain, it will eventually handle it, but with a small chance of actor duplication while it’s analyzing the health of the cluster and taking countermeasures).

Thank you

patriknw · February 12, 2019, 4:45pm

Split Brain Resolver (SBR) does its very best to avoid split brain (creating several separate Akka clusters) and it’s very unlikely that it will make the wrong decision. There are a few scenarios that can cause it to happen. It can happen if at least two things occur “at the same time”, like certain changes in cluster membership at the same time as a network partition. The documentation mentions these scenarios. Lately we have thought some more about those scenarios and realized that we can detect them and avoid those issues too. We will soon implement those improvements.

Another type of problem that is difficult to completely protect against is frequent network instability, typically combined with indirectly connected nodes (non-clean partitions). In that case the failure detection information is constantly changing. We have mechanisms to detect and act on these situations too, but there is a small risk that it can go wrong.

Recently we have actually implemented a new strategy that uses a distributed lease (lock) to decide which nodes are allowed to survive. Only one SBR instance can acquire the lease and make the decision to remain up. The other side will not be able to acquire the lease and will therefore down itself. The lease is backed by Kubernetes. We are doing final testing and this is not released yet, but you can read about it in the snapshot docs.
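The lease idea can be sketched roughly as follows. This is a toy Python model, not the actual SBR implementation: the `Lease` class is a hypothetical in-memory stand-in for the Kubernetes-backed lease, and it leaves out timeouts, retries and lease expiry.

```python
import threading

class Lease:
    """Hypothetical in-memory stand-in for a distributed lease (lock)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._owner = None

    def acquire(self, owner):
        """Return True only for the first caller (and for repeat calls by that same owner)."""
        with self._guard:
            if self._owner is None:
                self._owner = owner
            return self._owner == owner

def decide(side_name, lease):
    """A side stays up only if it holds the lease; otherwise it downs itself."""
    return "survive" if lease.acquire(side_name) else "down-self"

lease = Lease()
decisions = {side: decide(side, lease) for side in ("left", "right")}
print(decisions)  # {'left': 'survive', 'right': 'down-self'}
```

The point of this design is that the decision no longer relies on both sides reaching the same conclusion from their own view of the membership. Whichever side gets the lease first wins, and the other side downs itself.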

Happy to talk more with you in the context of a Lightbend subscription.

davidogren · February 12, 2019, 10:55pm

But to follow up on your original question, SBR acts as a vaccine in your terms. It works by preventing split brain from happening in the first place, so that duplicates are never created, rather than by trying to treat split brain after the fact. I don’t want to write an entire treatise on clustering and split brain, but what Split Brain Resolver does is define the rules under which failover is allowed to happen. (The safe thing to do is to shut down, but the ideal situation from an availability perspective is to fail over.)

This often involves tradeoffs between flexibility, safety, and availability. The beauty of SBR is that it makes all of those tradeoffs easy to configure. Totally paranoid? Use down-all. Mostly paranoid? Use keep-majority and down-all-when-unstable. Don’t need the flexibility to change cluster size dynamically? Use static-quorum and avoid a lot of edge cases. But, as Patrik points out, it’s nearly impossible to rule out all edge cases when you are talking about distributed computing.
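As a rough illustration of how these rules differ, here is a simplified Python model of the decision each side of a partition would make. It is a sketch under stated assumptions, not SBR’s actual code: it assumes a clean two-way partition and that both sides agree on the full member list (exactly the assumption that breaks in the edge cases Patrik describes). The function names are made up for illustration, and down-all-when-unstable is not modeled.

```python
# Simplified model: should the nodes in `side` stay up after a partition?

def keep_majority(side, all_members):
    """The larger side survives; on a tie, the side holding the lowest address survives."""
    other = len(all_members) - len(side)
    if len(side) != other:
        return len(side) > other
    return min(all_members) in side

def static_quorum(side, quorum_size):
    """A side survives only if it still has at least `quorum_size` members."""
    return len(side) >= quorum_size

def down_all(side):
    """Every node shuts down, whichever side it is on."""
    return False

members = {"n1", "n2", "n3", "n4", "n5"}
left, right = {"n1", "n2", "n3"}, {"n4", "n5"}

results = {
    "keep-majority": (keep_majority(left, members), keep_majority(right, members)),
    "static-quorum": (static_quorum(left, 3), static_quorum(right, 3)),
    "down-all": (down_all(left), down_all(right)),
}
print(results)
# {'keep-majority': (True, False), 'static-quorum': (True, False), 'down-all': (False, False)}
```

With static-quorum the cluster must never grow beyond `quorum_size * 2 - 1` members, otherwise both sides could reach the quorum at the same time. That restriction is the flexibility you give up in exchange for fewer edge cases.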

I’d recommend looking through the docs, because they go through all of the details: https://developer.lightbend.com/docs/akka-commercial-addons/current/split-brain-resolver.html

David

fabiogouw · February 13, 2019, 12:12am

Thanks for the clarifications!
