Why does Akka recommend having 10x as many shards as nodes?

I was trying out an experiment with Persistent actors on a cluster, backed by Cassandra, with a million actors. I had 3 nodes with 9 shards, and the throughput was lower than what I was expecting. Until, I found this line in the documentation:

As a rule of thumb, the number of shards should be a factor ten greater than the planned maximum number of cluster nodes. It doesn’t have to be exact

After I changed the number of shards to 30, the throughput improved significantly (almost 5 times).

My question is, what is happening under the hood that caused this, and where does this 10x guideline come from?

The exact number is only a starting point, there are so many kinds of system that can be built with Akka, tweaking it after benchmarking your use case definitely makes sense.

The number is based on the following:

  1. You don’t want to few, because the shards are how sharding keeps the sharded actors balanced, it can only move full shards between nodes when the cluster topology changes. If you have to few, let’s say worst case, fewer than cluster nodes, you will have nodes with no sharded actors on them at all. In addition to this messages to entities in a shard will pass through the shard inbox so with very high numbers of entities it makes sense to have more shards.

  2. You don’t want too many shards, because there is some level of overhead for every shard and at some point that will start showing up.

There is an additional aspect to think about and that is how evenly your shard id extractor hashes the entities into the shards. If it does not have a good distribution you could end up with a lot of entities in one shard and very few in others.

I think the concept of shard inbox is what I was missing. That means having more shards would increase the concurrency, which would improve the throughput up to a point.

I understand that it would require some benchmarking, as the right balance would probably depend on the use-case that is implemented, and 10x is only a starting point.

Thank you!