Dropping buffered messages to shard during hand off to avoid re-ordering

Hi folks,

I’m having the following issue during deployments. It seems that when shards are rebalanced, some of the messages for a particular shard are being dropped and, because of that, never answered.

WARN akka.cluster.sharding.ShardRegion CategoryTreeSupervisor: Dropping [89] buffered messages to shard [14] during hand off to avoid re-ordering

Does anyone have an idea why this is happening or how I can fix it?

Message delivery isn’t guaranteed, so there can be cases where messages are dropped. It should still be a good best effort, though, so if you have a small example application we can use to reproduce the problem, it would be interesting to look into.

It’s important that nodes leave the cluster gracefully when performing the rolling updates. That should be the default via the JVM exit, but it would be good to confirm; it should be visible in the logs. No downing.
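
For reference, this is roughly what I’d check in application.conf (a minimal sketch; run-by-jvm-shutdown-hook is already the default, and exit-jvm only matters if you want the process to terminate once Coordinated Shutdown has run):

# Run Coordinated Shutdown (including graceful leave from the cluster)
# when the JVM receives SIGTERM during the rolling update.
akka.coordinated-shutdown.run-by-jvm-shutdown-hook = on
# Exit the JVM after Coordinated Shutdown has completed.
akka.coordinated-shutdown.exit-jvm = on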

Make sure you use the latest Akka 2.6.x patch version, and enable the new allocation strategy: Cluster Sharding • Akka Documentation
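
As a sketch, enabling the new allocation strategy (available from Akka 2.6.10) in application.conf looks roughly like this; the limit values below are only examples, not recommendations:

# A non-zero absolute limit switches sharding to the new rebalance algorithm.
akka.cluster.sharding.least-shard-allocation-strategy.rebalance-absolute-limit = 20
# Upper bound on the fraction of shards rebalanced in one round.
akka.cluster.sharding.least-shard-allocation-strategy.rebalance-relative-limit = 0.1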

Also, look into using the app-version property: Rolling Updates • Akka Documentation
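
app-version is set per deployment in application.conf (the version string below is just an example); it lets sharding avoid allocating shards to nodes that still run the old version during a rolling update:

akka.cluster.app-version = "1.2.3"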

Hey, I’ve also seen this issue in our application.

We are using the latest Akka version and the new allocation strategy.

DeadLetter message=Dropped(com.xxx.na.sharding.internal.ShardedMessage@4a142d5f,Avoiding reordering of buffered messages at shard handoff,Actor[akka://test-product@10.129.6.229:25520/temp/akkaPersistenceTestService$zf],null) from=Actor[akka://test-product/deadLetters] to=Actor[akka://test-product/deadLetters]

It seems to be very rare, though. Only one message was lost during an elaborate test consisting of thousands of sharded entities with many messages sent between them. The message loss occurred when the cluster was scaled up (in the middle of the test) and a shard consequently migrated from an existing node to the node that had freshly joined the cluster.

We do all we can to make such a migration graceful. For instance, if a sharded entity needs to schedule a message to be delivered to it at a later moment in time, that message is not sent directly to the sharded entity’s ActorRef but rather to the ActorRef of the shard. That gives the shard the possibility to buffer the message in case it is in the middle of a shard migration, and that works nicely in 99.99% of the cases.
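
To make that concrete, here is a minimal classic-actors sketch of the pattern, assuming shardRegion is the ActorRef obtained from ClusterSharding(system).shardRegion(...) and that extractEntityId unwraps the envelope; ShardedMessage, ScheduleWakeUp and WakeUp are made-up names for illustration, not our actual code:

import scala.concurrent.duration.FiniteDuration
import akka.actor.{ Actor, ActorRef }

// Hypothetical envelope; extractEntityId/extractShardId on the region are
// assumed to route it by entityId and hand the payload to the entity.
final case class ShardedMessage(entityId: String, payload: Any)
final case class ScheduleWakeUp(delay: FiniteDuration)
case object WakeUp

class MyShardedEntity(shardRegion: ActorRef) extends Actor {
  import context.dispatcher // ExecutionContext for the scheduler

  // Classic sharding names the entity actor after its entity id.
  private val entityId = self.path.name

  def receive: Receive = {
    case ScheduleWakeUp(delay) =>
      // Schedule the delayed message via the shard region instead of `self`,
      // so that if the shard is being handed off when the timer fires, the
      // region can buffer the message and re-route it to the entity's new home.
      context.system.scheduler.scheduleOnce(delay, shardRegion, ShardedMessage(entityId, WakeUp))

    case WakeUp =>
      // handle the scheduled work here
  }
}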

It would be good to get a bit more explanation on this particular case, though, and whether there is anything we can do to avoid it. The source (ShardRegion.scala) has a comment that says:

// must drop requests that came in between the BeginHandOff and now,
// because they might be forwarded from other regions and there
// is a risk or message re-ordering otherwise

That suggests to me that there is a small window of vulnerability during which message loss can occur.

Does this make sense, and is there something we can do about it?