Consistency of shard coordinator

dricketts · March 30, 2020, 10:37pm

How does the shard coordinator (in distributed data mode) guarantee that an entity runs on at most one node at a time, given that distributed data mode is only eventually consistent?

The documentation mentions that the coordinator runs with Write Majority/Read Majority consistency. But I can’t figure out how that helps in partial failure scenarios. Here’s one example:

1. Coordinator C1 starts to allocate a new shard S to shard region SR1, writes the mapping to a minority of nodes, and dies.
2. Coordinator C2 starts up and reads from some nodes; one of them has the mapping of S to SR1.
3. C2 tells SR1 about the mapping for S. This starts S on that region.
4. C2 dies.
5. Coordinator C3 starts up and reads from some nodes; none of them has the mapping of S to SR1 (because it was only written to a minority of nodes).
6. C3 allocates S on a different shard region SR2.
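
To make the scenario concrete, here is a toy model of it in Python. It is only a sketch under my own assumptions, not Akka's Replicator: five hypothetical replicas (`n1` to `n5`), each holding a plain dict, a majority read that merges whatever the contacted replicas hold, and no write-back of what a read observed. The helper names (`write`, `read_majority`, `run_scenario`) are made up for illustration.

```python
# Toy model of majority reads after a minority write; not Akka's actual Replicator.
NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical replica names
MAJORITY = len(NODES) // 2 + 1  # 3 of 5


def write(replicas, targets, shard, region):
    """Store shard -> region on the given replicas only."""
    for node in targets:
        replicas[node][shard] = region


def read_majority(replicas, contacted):
    """Merge the shard maps of the contacted replicas (must be a majority)."""
    assert len(contacted) >= MAJORITY
    merged = {}
    for node in contacted:
        merged.update(replicas[node])
    return merged


def run_scenario():
    replicas = {node: {} for node in NODES}
    hosting = {}  # region -> set of shards that region has started

    # C1 reaches only one replica (a minority) before dying.
    write(replicas, ["n1"], "S", "SR1")

    # C2's majority read happens to include n1, so it sees S -> SR1 and tells SR1.
    seen_by_c2 = read_majority(replicas, ["n1", "n2", "n3"])
    hosting.setdefault(seen_by_c2["S"], set()).add("S")

    # C3's majority read misses n1, so S looks unallocated and goes to SR2.
    seen_by_c3 = read_majority(replicas, ["n3", "n4", "n5"])
    if "S" not in seen_by_c3:
        write(replicas, ["n3", "n4", "n5"], "S", "SR2")
        hosting.setdefault("SR2", set()).add("S")

    return hosting


print(run_scenario())  # both SR1 and SR2 end up hosting S
```

In this model the two read quorums do overlap (at `n3`), but that overlap only helps for writes that actually reached a majority. C1's write never did, so whether a later coordinator sees it depends on which replicas it happens to contact.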

What prevents this from happening? More generally, is there a place to read more algorithmic details about the shard coordinator in distributed data mode, beyond just reading the code?

dricketts · April 2, 2020, 3:46pm

This scenario can happen, and it’s a bug. See https://github.com/akka/akka/issues/28856#issuecomment-607750168 for details.
