Various questions about failures with Akka Persistence and Cluster Sharding

Dear hakkers,

I was wondering how to handle failures when using Akka Persistence and Cluster Sharding.

When the client sends a command, the command hits a node, is routed to the right shard, and is then handled by the persistent actor.

On the client side, I want a response that tells the end user what happened: command invalid, failure …

When something bad happens in the cluster, I think the caller will end up with an AskTimeoutException. In that case, how can I know whether the command has been handled or not, so that I can notify the end user?
For example, if the command is handled by the persistent actor but the node handling the sharding crashes at the same moment, the caller will end up with an AskTimeoutException even though the command was handled.
If the persistent actor crashes, the caller will also end up with an AskTimeoutException, but the command was not handled.

What is the best way to handle those cases?

Thanks,
Alex.

This is one of the good old classic problems with distributed systems: when an ask fails with a timeout, you cannot tell the difference between three cases. The sent message got lost and was never received; the message was received and processed but caused the process to fail before a response was sent (potentially leaving the change partially applied); or the message was received and processed but the response was lost.

To get a delivery guarantee you will have to resend. There is a component for at-least-once delivery in Akka (https://doc.akka.io/docs/akka/current/persistence.html#at-least-once-delivery), but it requires a round trip to persistence on both sides. That is pretty much the only way to guarantee 100% delivery in the face of nodes crashing etc.

However, if you want something simpler with somewhat weaker guarantees, you can for example make the action idempotent, that is, make sure that receiving the same command several times only applies the change once, and then resend on timeout. We are hoping to introduce something more lightweight to cover such requirements in Akka Typed (see this PR: https://github.com/akka/akka/pull/25099)
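
To make the idempotent-plus-resend idea a bit more concrete, here is a minimal sketch in plain Java with no Akka dependency. Every name in it (Deposit, IdempotentAccount, ResendingClient, the commandId field) is made up for illustration and is not an Akka API. The assumptions are that the client picks a unique command id and reuses it on every resend, and that a timeout is modelled as an empty Optional.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical command: the client chooses commandId and reuses it on resends.
final class Deposit {
    final String commandId;
    final long amount;

    Deposit(String commandId, long amount) {
        this.commandId = commandId;
        this.amount = amount;
    }
}

// Hypothetical receiver: remembers the reply for each command id it has applied,
// so a duplicate command returns the earlier reply without changing state again.
final class IdempotentAccount {
    private long balance = 0;
    private final Map<String, String> processed = new HashMap<>();

    String handle(Deposit cmd) {
        String earlierReply = processed.get(cmd.commandId);
        if (earlierReply != null) {
            return earlierReply; // duplicate: do not apply the change twice
        }
        balance += cmd.amount;
        String reply = "accepted:" + cmd.commandId + ":balance=" + balance;
        processed.put(cmd.commandId, reply);
        return reply;
    }

    long balance() {
        return balance;
    }
}

// Hypothetical sender: resends the same command until a reply arrives.
final class ResendingClient {
    // The transport returns Optional.empty() to model an ask that timed out.
    static String sendWithRetry(Deposit cmd,
                                Function<Deposit, Optional<String>> transport,
                                int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Optional<String> reply = transport.apply(cmd);
            if (reply.isPresent()) {
                return reply.get();
            }
        }
        throw new IllegalStateException("no reply after " + maxAttempts + " attempts");
    }
}

final class IdempotentResendDemo {
    public static void main(String[] args) {
        IdempotentAccount account = new IdempotentAccount();
        boolean[] firstReplyDropped = {false};
        // The command always reaches the account, but the first reply is lost.
        Function<Deposit, Optional<String>> lossyTransport = cmd -> {
            String reply = account.handle(cmd);
            if (!firstReplyDropped[0]) {
                firstReplyDropped[0] = true;
                return Optional.empty();
            }
            return Optional.of(reply);
        };
        String reply = ResendingClient.sendWithRetry(new Deposit("cmd-1", 100), lossyTransport, 3);
        System.out.println(reply + " / balance " + account.balance());
    }
}
```

In the demo the first reply is lost, the client resends the same command, and the deposit is still applied only once. In a real persistent actor the set of processed command ids would presumably have to be persisted together with the state change, otherwise it is gone after recovery and duplicates slip through.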

OK, that's what I thought.

Thanks for your answers!