Understanding "caching" performance in a sharded cluster

I’m using Akka Cluster Sharding in an event-sourced application with ~300 million persistent entities. Obviously not all of those entities can be in memory at once, so I’m relying on automatic passivation to evict entities to make space. So basically, I have a cache of in-memory entities, and I’d like to get a better understanding of how many hits and misses I’m getting.

I dug through the code a bit and I couldn’t find anything that would notify me:

  • when a message was received for an in-memory entity (this is a hit)
  • when a message was received for an out-of-memory entity (this is a miss). This entity will need to be rehydrated from the event log.

Can you provide any advice on how I might implement such counters?

I don’t know the answer, and I’d love it if someone jumped in with one, because it does seem like there should be a telemetry event for restoring a sharded entity. But, at least as far as I know, there isn’t one. Maybe open a ticket with Lightbend, if you have a subscription, because this seems like a logical metric to me.

But let me tell you what I’ve done instead. It may not be perfect, but it’s what I’ve done.

Mostly it just revolved around tracking recovery. Rehydrating the actor isn’t what’s important; it’s the work of recovering the state that’s important. So you can look at your time spent in recovery from the built-in metrics. And you can also log/track the RecoveryCompleted signal, which fires once each time an entity finishes replaying its events, if you want something closer to “# of misses”.

You can get “total messages” from Telemetry, so if you care about the hit ratio it would be simple to calculate hits by subtracting misses from the total. On one hand it feels like a hack, but on the other hand “time spent in recovery” is really what you care about.
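To make the arithmetic concrete, here is a minimal sketch of the bookkeeping in plain Java. The class name EntityCacheStats and its methods are hypothetical, not part of Akka; the assumption is that you call recordMiss() from your RecoveryCompleted signal handler and feed recordMessage() from whatever gives you the total message count. It also assumes each recovery corresponds to exactly one miss, so messages that arrive while an entity is still recovering get counted as hits, which makes this an approximation.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper, not an Akka API: tracks approximate hits and misses.
final class EntityCacheStats {
    // LongAdder keeps contention low when many entities update concurrently.
    private final LongAdder totalMessages = new LongAdder();
    private final LongAdder misses = new LongAdder();

    // Call once per message delivered to any entity.
    void recordMessage() {
        totalMessages.increment();
    }

    // Call once per completed recovery (assumed: one recovery == one miss).
    void recordMiss() {
        misses.increment();
    }

    long misses() {
        return misses.sum();
    }

    // hits = total - misses, clamped at zero because the two counters
    // are read at slightly different moments.
    long hits() {
        return Math.max(0L, totalMessages.sum() - misses.sum());
    }

    // Fraction of messages that found the entity already in memory.
    double hitRatio() {
        long total = totalMessages.sum();
        return total == 0 ? 0.0 : (double) hits() / total;
    }
}
```

If the total comes from Telemetry rather than from your own counter, the same subtraction applies; only the source of the total changes.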

Hmm, I wasn’t aware that “time spent in recovery” is available from Telemetry (I assume you mean Akka Insights?). In fact, I’ve never used Insights at all. I’ll look into it.

Thanks for the tip.