Hey, I’ve also seen this issue in our application.
We are using the latest Akka version together with the new allocation strategy. The dropped message showed up in our logs as a dead letter:
```
DeadLetter message=Dropped(com.xxx.na.sharding.internal.ShardedMessage@4a142d5f,Avoiding reordering of buffered messages at shard handoff,Actor[akka://email@example.com:25520/temp/akkaPersistenceTestService$zf],null) from=Actor[akka://test-product/deadLetters] to=Actor[akka://test-product/deadLetters]
```
It seems to be very rare, though: only one message was lost during an elaborate test consisting of thousands of sharded entities with many messages sent between them. The loss occurred when the cluster was scaled up in the middle of the test and a shard consequently migrated from an existing node to the node that had freshly joined the cluster.
We do all we can to make such a migration graceful. For instance, if a sharded entity needs to schedule a message to itself for delivery at a later moment in time, we send that message not directly to the sharded entity's ActorRef but to the ActorRef of the shard. That gives the shard the opportunity to buffer the message in case it is in the middle of a shard migration at that point. And that works nicely in 99.99% of the cases (a sketch of the pattern follows below).
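For illustration, here is a minimal sketch of that pattern, assuming classic actors. The names (`MyEntity`, `ShardedMessage`, `Reminder`) are made up for the example, the message-extractor setup is omitted, and I have written it against the shard region's ActorRef, which is the usual entry point; we address the shard, but the buffering idea is the same:

```scala
import akka.actor.{Actor, ActorRef}
import scala.concurrent.duration._

// Hypothetical envelope; the region's message extractor (not shown)
// would route it to the entity identified by entityId.
final case class ShardedMessage(entityId: String, payload: Any)
case object Reminder

class MyEntity(shardRegion: ActorRef, entityId: String) extends Actor {
  import context.dispatcher // ExecutionContext for the scheduler

  def receive: Receive = {
    case "scheduleReminder" =>
      // Schedule via the sharding infrastructure rather than via `self`:
      // if the shard is handed off before the timer fires, the message
      // can be buffered or routed to the entity's new incarnation,
      // whereas a message sent to the old `self` would go to dead letters.
      context.system.scheduler.scheduleOnce(
        30.seconds, shardRegion, ShardedMessage(entityId, Reminder))

    case Reminder =>
      // Handle the reminder in whichever incarnation currently hosts
      // the entity.
  }
}
```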
It would be good to get a bit more explanation of this particular case, though, and to learn whether there is anything we can do to avoid it. The source (ShardRegion.scala) contains a comment that says:
```scala
// must drop requests that came in between the BeginHandOff and now,
// because they might be forwarded from other regions and there
// is a risk or message re-ordering otherwise
```
That suggests to me that there is a small window of vulnerability during which message loss can occur.
Does this make sense, and is there something we can do about it?