Fault tolerance best practices


(Joa23) #1

Hi,
I want to run a stream in an actor and make that fault tolerant on a cluster. What are the best practices for that?
Akka sharding sounds promising but then the automatic rebalancing is an issue as this would interrupt my stream.
At this point I have a hand-crafted master/worker approach, where I create workers with a ClusterRouterPool / AdaptiveLoadBalancingPool, the idea being to pick a node in the cluster with low load. When one of my stream nodes is terminated, I increase the pool size and assign the failed stream to the new worker. However, I ran into a big issue (https://github.com/akka/akka/issues/25385) and learned that AdaptiveLoadBalancingPool changes the size of the pool dynamically, which is not documented.

Any suggestions are welcome.
Thank you.
Stefan


(Johan Andrén) #2

Sorry, my bad stating that: AdaptiveLoadBalancingPool doesn’t do any dynamic resizing, except when nodes are added to or removed from the cluster.

If automatic rebalancing would be problematic for your stream, what do you do when a new node is added to your cluster, when you need to remove the node a stream is running on, or when a node has crashed?

Additionally, choosing the right node™ to run a stream on is trickier than choosing the right node for a small-ish workload, since a stream is in general long-lived. Imagine five requests arriving at roughly the same time, all seeing node N having low load and all starting long-lived streams on that node.

The best approach might be to make sure you have a way to shard requests fairly on some identifier/property of the request.
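To illustrate the idea, here is a minimal sketch in plain Scala (no Akka dependency; the node names and the `nodeFor` helper are made up for this example) of routing by a stable identifier of the request instead of by current load, so the same entity always lands on the same node and a long-lived stream stays put:

```scala
object RequestSharding {
  // Hypothetical static node list; in a real cluster this would come
  // from cluster membership and need to handle nodes joining/leaving.
  val nodes: Vector[String] = Vector("node-1", "node-2", "node-3")

  // Deterministically map an entity id to a node. floorMod avoids the
  // negative-index problem that math.abs(hashCode) % size can have.
  def nodeFor(entityId: String): String =
    nodes(java.lang.Math.floorMod(entityId.hashCode, nodes.size))
}
```

The key property is determinism: two requests for the same entity id always go to the same node, regardless of load, which is what Cluster Sharding gives you in a more complete (and rebalancing-aware) form.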

There’s a singleton + sharding cluster tool available in Lagom which is used to essentially keep singletons alive (and consume persistence queries sharded by tag) while spreading them out across nodes; maybe this could be something to look into. We have some plans to port this to a pure Akka cluster tool, but for now you’d have to mimic it yourself. (Sources are here: https://github.com/lagom/lagom/blob/master/persistence/core/src/main/scala/com/lightbend/lagom/internal/persistence/cluster/ClusterDistribution.scala)

I think I mentioned this in the Gitter discussion a couple of days ago, but you will also have to think about back pressure and about not overwhelming the worker nodes. Whatever solution you choose, make sure you try a higher inflow of work than you can complete and see how that affects the solution. (This is usually why you should avoid routers for distributing workloads.)


(Joa23) #3

@johanandren,

Really good point - thank you. Let me think about this a little more. I also plan to read some of the GearPump code you mentioned in your talk, to understand how they handled these kinds of scenarios.

Do you happen to know if there is any way to switch off or override the Akka sharding redistribution?


(Johan Andrén) #4

You can write your own ShardAllocationStrategy to make your own decisions about where to allocate new shards and when to rebalance.
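To make the two decisions concrete, here is a sketch of the core logic in plain Scala over simple Maps (node → allocated shard ids), without the actual Akka `ShardAllocationStrategy` API; the object and method names here are illustrative, not Akka’s:

```scala
object AllocationLogic {
  type Node = String
  type ShardId = String

  // Allocate a new shard to the node currently holding the fewest shards.
  def allocateShard(allocations: Map[Node, Vector[ShardId]]): Node =
    allocations.minBy(_._2.size)._1

  // Rebalance nothing: returning an empty set means running shards
  // (and the streams inside them) are never moved between nodes.
  def rebalance(allocations: Map[Node, Vector[ShardId]]): Set[ShardId] =
    Set.empty
}
```

A real implementation would extend Akka’s `ShardAllocationStrategy` and return `Future`s, but the shape is the same: one decision for where new shards go, one for which shards (if any) to move.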

I’m not sure there is much happening in Gearpump land; looking at their GitHub repo, the last change was two years ago, but maybe they are developing it elsewhere. I don’t think I have mentioned it in talks for quite a while, so maybe look for a newer talk if you are using one as a reference. :)


(Joa23) #5

Thanks @johanandren
Yes - the talk where you mentioned GearPump is a little older. :slight_smile:
I’m still trying to solve the problem described above. Now that using a ClusterRouterPool is not possible, I’m trying to deploy remote actors manually and then create a group router from them. However, I can’t find an example of how to create an actor manually on a different cluster node. Do you happen to have a snippet or pointer?
Thanks.
S


(Johan Andrén) #6

I’d recommend that you try to turn the problem around: have nodes create their own actors and then have those actors take part in coordinated work, rather than using central management of remotely deployed actors. In addition to making the distributed app more tightly coupled, that also introduces a single point of failure, as all remotely deployed actors will have a parent on the node that spawned them.

If you look at pretty much all the cluster tools, you will notice that this is how they are designed.

If you still want to go with remote deployment, there is a section in the docs here: https://doc.akka.io/docs/akka/current/remoting.html#creating-actors-remotely
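For reference, remote deployment is configured per actor path in the deployment section; a minimal sketch (the actor name, system name, host, and port below are placeholders, not values from this thread):

```hocon
akka {
  actor {
    deployment {
      # Actors created under the path /sampleActor will be
      # deployed on the remote system at the given address.
      /sampleActor {
        remote = "akka://sampleActorSystem@127.0.0.1:2553"
      }
    }
  }
}
```

With this in place, a plain `system.actorOf(props, "sampleActor")` on the configured system spawns the actor on the remote node, but as noted above, its parent stays on the spawning node.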