ShardRegion bottleneck

Hi all,

I am a bit concerned that the ShardRegion actor may be a bottleneck in an application I am building. From what I understand, all messages to a specific shard region passes through the ShardRegion actor before forwarding the message to the appropriate shard and finally delivered to the entity (ShardRegion -> Shard -> Entity). Are messages to the ShardRegion serialized (consumed sequentially)? Or is the ShardRegion actor a “special” actor like a router? If the former is true, it seems like this would be a huge limit to scalability if there are hundreds or thousands of entities under a ShardRegion actor. How should shards be built to avoid the sequential bottleneck in that case?

There is another thread on this topic here What are the possible bottlenecks of using Sharding?

Hi, unfortunately, that thread doesn’t really give me the answer I’m looking for. That thread discusses the ShardCoordinator. I understand that the ShardCoordinator is used to look up the location of a shard if it is unresolved, otherwise it is bypassed.

What I need to know is how messages are passed through the ShardRegion actor before being delivered to a specific entity. Are messages that arrive at a ShardRegion actor consumed sequentially?

1 Like

The shard region doesn’t do much for each message so it should be able to handle a high throughput.

It’s actually two actors there. First the ShardRegion actor which extracts the shardId from the message and delegates to the Shard actor. This is pretty much like a HashMap lookup. Then the Shard actor delegates to the entity actor by another lookup.

What is your throughput requirements and have you measured that the bottleneck for that is in the sharding actors and not in your entity actors?

Yes, the ShardRegion and Shard actors may not do much more than some lookups. Just wondering if messages to these actors are consumed sequentially (i.e. they are “real” actors, not pseudo-actors like routers). If so, throughput will always be limited, regardless of actual throughput requirements. This is for a thesis project in parallel and concurrent programming so it’s more of an academic exercise of pointing out any limits to scalability in the system.

Yes, Shard and ShardRegions have their own actors. You can increase the throughput parameter on the dispatcher for those actors so that more messages are processed at a time from these actors (and any actor associated with the dispatcher) before switching to a different actor. You can also increase the number of shard regions and instances of your app, but there are trade-offs with doing that.

Also, if you are using remember entities, there are some possible potential performance issues if you have a lot of different entities. These performance issues would happen during the initial start-up, during a restart, or during a rebalance. I talked about these issues in a different topic, but no one has replied yet.

Yes, adjusting the throughput parameter helps avoid unnecessary context switching, and may contribute to better throughput due to prefetching and caching (guess it depends on the underlying implementation of the queue as well as the architecture).

Ultimately, we don’t really care in which shard the entities reside, we just want to send messages to them. Extract the entityId first and if unresolved, extract the shardId second. If the shard is is unresolved, get a ref to the shard via the shard coordinator. If the shard is resolved, ask the shard for a ref to the entity.

This would allow senders to bypass the Shard actor if an entity is resolved.

Thanks for clarifying that it’s a theoretical question.

If it would be a problem you could define more shards (less entities per shard) and scale to more nodes.

You could also use more entity types, and thereby more region actors. Before sending the client/sender would decide which region actor to use for the specific target entity.