Slow down shard handoff for Akka Cluster rolling update

Hi,

I’ve encountered some actor recovery issues recently during Akka Cluster rolling updates. Is there a way to slow down the actor shard handoff? Say, if it took seconds before, can we make it hours? My problem was that some actors had a large number of replayable journal events. Although I have made changes to ensure many of the actors have recent snapshots, I can’t do that for the inactive ones. So I am wondering if I can slow down the actor shard turnover. Say I have 10 shards: I would hand off 1 shard at a time, with a long delay in between shards. This way, I would spread out the need for recovering actors over a longer period of time and minimize the load on the persistence DB.

So, there are a couple of straightforward answers to your question, but I’m not sure that what you are asking for is really what you want.

There are two ways that a shard could be moved. (At least that I can recall off the top of my head.) One is a failover event. If the Akka node where a shard is located goes away (for any reason), that shard will immediately be placed on another node. There’s not really any way to control or tune this: if the shard isn’t immediately placed on a node, then the cluster would be unable to route messages for actors in that shard! The second way is rebalancing. From time to time the cluster is going to look for an unequal distribution of shards and move shards from one node to another in order to make the distribution more equal. There are a lot of knobs and dials on this rebalancing process and you can, indeed, slow it down.
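
For reference, those knobs live under akka.cluster.sharding in your application.conf. Here’s a minimal sketch, assuming the classic LeastShardAllocationStrategy and roughly Akka 2.5/2.6-era key names (the values are purely illustrative, and newer Akka versions replace the threshold settings with rebalance-absolute-limit / rebalance-relative-limit, so check the reference.conf for your version):

    akka.cluster.sharding {
      # How often the coordinator checks whether a rebalance is needed
      # (the default is on the order of 10s).
      rebalance-interval = 120s

      least-shard-allocation-strategy {
        # How large the difference between the most- and least-loaded nodes
        # must be before a rebalance is started.
        rebalance-threshold = 1
        # Hand off at most this many shards per rebalance round.
        max-simultaneous-rebalance = 1
      }
    }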

But if you are doing a rolling update, all of the shards are going to move anyway! They won’t have a choice because every Akka node is going to be restarted. There’s all kinds of magic in the recent versions of Akka to try and minimize the number of moves required, but everything is going to move, pretty much by definition.

So, instead, what I think you want to consider is actually slowing down your rolling updates rather than changing configuration at the Akka level. If Kubernetes kills a Pod, then Akka has no choice but to relocate the shards. But if you slow down Kubernetes, then Akka will generally do the smart thing to handle the shards. (I’m assuming you have deployed on K8S; if you deployed on VMs or ECS or anything like that, the principles remain the same, but how you implement them will have to change.)

Your key tunables are going to be:

(all on the Deployment object; a combined sketch of these settings follows the list below):
.spec.minReadySeconds: A Pod won’t be “ready” until it is ready to receive shards. So when we say a Pod shouldn’t be considered “available” (for rolling update purposes) until some period of time after it is ready, we are saying: “this Akka node is fully ready, go ahead and send it traffic, but wait an additional X amount of time as a grace period before K8S really counts it as ‘available’ for purposes of progressing the rolling update.” This tunable exists for exactly these types of scenarios, where we want to give new Pods a warm-up time.

.spec.strategy.rollingUpdate.maxUnavailable: This effectively determines how many old nodes K8S will shut down at a time. If you are trying to slow down the rolling update, set this to 1. Combined with the previous tunable, you can essentially have your rolling update restart 1 Pod at a time with a pause of minReadySeconds between Pod restarts. (Because the previous tunable says “don’t count a new node as available until it has been ready for X time,” if we also say that only one node can be unavailable at a time, then K8S has to wait X time after each Pod becomes ready in order to progress the rolling deployment.)

.spec.strategy.rollingUpdate.maxSurge: How many new Pods K8S can start above the desired replica count. You might want to experiment with this number. If you are afraid of stress on the new Pods as they do replays, then you might set this large. Essentially you could start all of your new Akka nodes at once and slowly move shards onto them as you drain the old Pods. On the other hand, if you are afraid of stress on the DB rather than stress on the Pods, that wouldn’t help and you’d likely be better off adding only one new Pod at a time.
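
Putting those three together, a sketch of the relevant parts of the Deployment might look like this (the names, image, and readiness probe are placeholders I made up; the probe shown assumes you expose the Akka Management readiness endpoint on its default port 8558, but any health check you already have works the same way):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: my-akka-app                 # placeholder name
    spec:
      replicas: 4
      minReadySeconds: 900              # count a Pod as "available" only 15 minutes after it is "ready"
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 1             # shut down at most one old Pod at a time
          maxSurge: 1                   # start at most one extra new Pod at a time
      selector:
        matchLabels:
          app: my-akka-app
      template:
        metadata:
          labels:
            app: my-akka-app
        spec:
          containers:
            - name: my-akka-app
              image: registry.example.com/my-akka-app:1.2.3   # placeholder image
              readinessProbe:           # e.g. the Akka Management readiness check
                httpGet:
                  path: /ready
                  port: 8558

With this combination, K8S restarts one Pod at a time and waits minReadySeconds between each step of the rollout.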

In short, however, you probably want to tune your K8S deployments rather than your Akka config.

Thank you David. Shutting down old nodes one at a time with a delay in between looks promising. I’ll try this out.

@davidogren let me make sure I understand the difference between “ready” and “available” in your statement about minReadySeconds. Before minReadySeconds comes into play, a new node is deemed “ready” to receive network traffic, including receiving a shard. However, there is a pause of minReadySeconds before Kubernetes counts it as “available” in terms of the rolling deployment, at which point it can kill an old Pod and then start a new one. So minReadySeconds doesn’t directly delay shard handoff to the new node/Pod, right?

If the above is correct, and we combine it with LeastShardAllocationStrategy(rebalanceThreshold: Int, maxSimultaneousRebalance: Int), we can control the rate of shard handoff to the new node. If I have a 2:1 shard-to-node ratio, set both the threshold and simultaneous parameters to 1, and set minReadySeconds to 15 minutes, will I then have 15 minutes per shard handoff?
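
To make sure I have the wiring right, I’m imagining something like this with the classic sharding API (the entity and names here are just placeholders I made up for the example):

    import akka.actor.{ Actor, ActorSystem, Props }
    import akka.cluster.sharding.{ ClusterSharding, ClusterShardingSettings, ShardCoordinator, ShardRegion }

    // Placeholder entity, only to make the wiring concrete.
    object MyEntity {
      case object Stop
      final case class Envelope(entityId: String, payload: String)

      val extractEntityId: ShardRegion.ExtractEntityId = {
        case Envelope(id, payload) => (id, payload)
      }

      // 8 shards, as in the scenario below.
      val extractShardId: ShardRegion.ExtractShardId = {
        case Envelope(id, _) => (math.abs(id.hashCode) % 8).toString
      }
    }

    class MyEntity extends Actor {
      def receive: Receive = {
        case MyEntity.Stop => context.stop(self)
        case _             => // handle application messages here
      }
    }

    object ShardingSetup {
      def main(args: Array[String]): Unit = {
        // Assumes the usual cluster configuration (akka.actor.provider = cluster, seed nodes, etc.).
        val system = ActorSystem("my-cluster")

        // Move at most one shard per rebalance round, and only when the shard
        // distribution is uneven enough to cross the threshold.
        val allocationStrategy =
          new ShardCoordinator.LeastShardAllocationStrategy(
            rebalanceThreshold = 1,
            maxSimultaneousRebalance = 1
          )

        ClusterSharding(system).start(
          typeName = "MyEntity",
          entityProps = Props(new MyEntity),
          settings = ClusterShardingSettings(system),
          extractEntityId = MyEntity.extractEntityId,
          extractShardId = MyEntity.extractShardId,
          allocationStrategy = allocationStrategy,
          handOffStopMessage = MyEntity.Stop
        )
      }
    }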

Say I have 4 nodes, 8 shards, maxUnavailable=1, maxSurge=1, rebalanceThreshold=1, maxSimultaneousRebalance=1. During the rolling deployment:

  1. New node 5 is started
  2. Shard-0 is handed off to new node 5
  3. Wait 15 minutes (minReadySeconds)
  4. Old node 1 is killed; hand-off of shard-1 and shard-2 begins
  5. New node 5 receives shard-1 and shard-2

Is this correct? New node 5 will then have 3 shards and would have to transfer one to the next new node when it is “ready”? Should I set maxSurge to 2 to avoid a shard being handed off more than once? But that would prevent me from doing 1 shard handoff per minReadySeconds, at least for the new node. Is there a way to do 1 shard handoff to the first new node, with a sufficient delay to allow for canary-type testing before the second new node is started and the other shards are handed off, while still avoiding extra shard handoffs?