So, there’s a couple of straightforward answers to your question, but I’m not sure what you are asking for is really what you want.
There are two ways that a shard could be moved. (At least that I can recall off the top of my head.) One is a failover event. If the Akka node where a shard is located goes away (for any reason) that shard will be immediately placed on another node. There’s not really any way to control or tune this: if the shard isn’t immediately placed on a node then the cluster would be unable to route messages for actors in that shard! The second way is rebalancing. From time to time the cluster is going to look for an unequal distribution of shards and move shards from one node to another in order to make the distribution more equal. There are a lot of knobs and dials on this rebalancing process and you can, indeed, slow it down.
But if you are doing a rolling update, all of the shards are going to move anyway! They won’t have a choice because every Akka node is going to be restarted. There’s all kinds of magic in the recent versions of Akka to try and minimize the number of moves required, but everything is going to move, pretty much by definition.
So, instead, what I think you want to consider is slowing down your rolling updates rather than changing configuration at the Akka level. If Kubernetes kills a Pod, then Akka has no choice but to relocate the shards. But if you slow down Kubernetes then Akka will generally do the smart thing to handle the shards. (I’m assuming you have deployed on K8S. If you deployed on VMs or ECS, or anything like that, the principles remain the same, but how you implement them will have to change.)
Your key tunables are going to be (all on the Deployment object):
.spec.minReadySeconds: A Pod won’t be “ready” until it is ready to receive shards. So when we say a Pod shouldn’t be considered “available” (for rolling update purposes) for a period of time after it is ready, we are saying: this Akka node is fully ready, go ahead and send it traffic, but wait an additional X amount of time as a grace period before K8S really counts it as “available” for purposes of progressing the rolling update. This tunable exists for these exact types of scenarios: where we want to give new Pods a “warm up” time.
.spec.strategy.rollingUpdate.maxUnavailable: This effectively determines how many old nodes K8S will shut down at a time. If you are trying to slow down the rolling update, set this to 1. Combined with the previous tunable you can essentially have your rolling update restart 1 Pod at a time with a pause of minReadySeconds between Pod restarts. (Because the previous tunable says “don’t count a new node as ‘available’ until it has been ready for X time”, if we also say that only one node can be ‘unavailable’ at a time then K8S has to wait X time after each Pod is ready in order to progress the rolling deployment.)
.spec.strategy.rollingUpdate.maxSurge: How many extra new Pods K8S may start on top of the desired replica count. You might want to experiment with this number. If you are afraid of stress on the new Pods as they do replays then you might set this large. Essentially you could start all of your new Akka nodes at once and slowly move shards into them as you drain the old Pods. On the other hand, if you are afraid of stress on the DB rather than stress on the Pods, that wouldn’t help and you’d likely be better off adding only one new Pod at a time.
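To make the three tunables above concrete, here is a rough sketch of where they sit on a Deployment. The name, replica count, and the 60 second value are placeholders I made up for illustration, not recommendations; the selector and Pod template are left out, so this is a fragment rather than something you can apply as-is.

```yaml
# Sketch only: a Deployment fragment showing the three rollout tunables.
# "akka-app", replicas: 5, and the 60s grace period are hypothetical values.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: akka-app              # hypothetical name
spec:
  replicas: 5                 # hypothetical cluster size
  # A new Pod must stay ready this long before it counts as "available",
  # which is what paces the rollout.
  minReadySeconds: 60
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # Take down at most one old Pod at a time.
      maxUnavailable: 1
      # Start at most one extra new Pod beyond the replica count.
      # Raise this if Pod load during replay worries you more than DB load.
      maxSurge: 1
  # selector and template omitted for brevity
```

With values like these, K8S should replace roughly one Pod per minReadySeconds window, so the whole rollout takes on the order of replicas times minReadySeconds plus Pod startup time. Note that maxUnavailable and maxSurge cannot both be 0.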
In short, however, you probably want to tune your K8S deployments rather than your Akka config.