Is it possible to implement a peer-to-peer network of Akka nodes that can come and go?

Hi,

We are trying to implement a system where a set of Akka system nodes form a peer-to-peer network. The maximum number of nodes and their addresses are known to each node in advance. All nodes can come and go whenever they want. Note that we don’t use Akka Cluster, only Artery Remoting.

Each node has a dedicated NodeObserver actor that periodically checks whether the other nodes are reachable, using an ActorSelection whose path points to the remote node’s NodeObserver actor. Once a NodeObserver actor detects that a remote NodeObserver is reachable, it starts death-watching it instead. If a watched remote NodeObserver actor terminates, the watching NodeObserver falls back to periodically checking for reachability with the ActorSelection approach outlined above.

If we shut down or start a node, all other running nodes react as expected. So this approach works great until a network partition is induced, for example by disabling the network adapter. In that case, the now unreachable nodes become quarantined, and only a restart of all(!) nodes fixes the problem. From what I have read, restarting seems to be the only way to recover from the quarantined state.

This is clearly the worst case, because we chose Akka in order to get a resilient solution, and this situation is the exact opposite.

Is it possible to adjust some configuration parameters and make our current implementation work? I played around with some configuration parameters related to the quarantining mechanism, with no success. Please help, what can we do? (I hope we don’t need to go for a completely different solution, because the release date is close.)

If you have a good reason for not using Akka Cluster for this, you shouldn’t use remote watch (or remote deployment). That’s why, as of version 2.6.0, we have disabled that feature by default when Akka Cluster is not used.
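
For reference, here is a sketch of the setting that, as far as I know, controls this in Akka 2.6, shown with its default value. Leaving it `off` matches the advice above; the surrounding `akka.remote` block is only there to show where the key lives.

```hocon
akka.remote {
  # Default since Akka 2.6.0: remote watch and remote deployment are
  # disabled when Akka Cluster is not used.
  use-unsafe-remote-features-outside-cluster = off
}
```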


Thank you very much for your fast and helpful answer 🙂
Although this is bad news, because we need to implement a watching mechanism ourselves, I now clearly know what I shouldn’t do. Instead of wasting time looking for a solution where there is none, we can concentrate on an alternative approach. Thanks a lot for that.
Maybe it will suffice to just NOT switch to Akka’s remote watching once a node becomes reachable but to continue checking for reachability using ActorSelection instead.

Exactly. Implementing your own failure detection by periodically sending request-response heartbeat messages isn’t difficult. Those can be sent with actorSelection or the ActorRef that you have discovered.
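
The bookkeeping behind such a heartbeat scheme could look roughly like the sketch below. It is plain Java with no Akka dependency, and `HeartbeatTracker`, its method names, and the timeout value are hypothetical, not part of any Akka API. The assumption is that the NodeObserver sends a heartbeat request to every known peer on each timer tick, calls `onReply` whenever a response arrives, and treats a peer as unreachable once no reply has been seen within the timeout.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not part of Akka): remembers when each peer last
// answered a heartbeat and derives reachability from a fixed timeout.
final class HeartbeatTracker {
    private final long timeoutMillis;
    // Peer address -> timestamp (ms) of the most recent heartbeat reply.
    private final Map<String, Long> lastReply = new HashMap<>();

    HeartbeatTracker(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    // Call when a heartbeat response from `peer` arrives.
    void onReply(String peer, long nowMillis) {
        lastReply.put(peer, nowMillis);
    }

    // A peer counts as reachable if it replied within the timeout window.
    boolean isReachable(String peer, long nowMillis) {
        Long last = lastReply.get(peer);
        return last != null && nowMillis - last <= timeoutMillis;
    }

    // All peers that are currently considered reachable, in sorted order.
    Set<String> reachablePeers(long nowMillis) {
        Set<String> result = new TreeSet<>();
        for (String peer : lastReply.keySet()) {
            if (isReachable(peer, nowMillis)) {
                result.add(peer);
            }
        }
        return result;
    }
}
```

With this shape, a peer that returns after a partition is marked reachable again as soon as its next reply is recorded, so the observer never has to switch between a probing mode and a watching mode.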