Performance of recovery of persistent actors (Cassandra backend)


Recent stress tests I ran on the recovery of persistent actors reveal that recovering an actor with persistence ID “A” takes significantly longer if another instance of that actor with persistence ID “B” has also persisted some events. Furthermore, while an actor is alive, the more events have been persisted, the longer an individual persist call takes to complete.

I am using a relatively simple ReplayActor that increments a counter every time a command comes in and persists an event every time the counter changes. Source code is here: The test I run simply sends 10k UpdateCounterCommands, waits until the actor’s ReceiveTimeout makes the actor stop itself, then recreates an actor with the same persistence ID and sends it a new UpdateCounterCommand. Logs of my tests, which show the timings of persisting 500 events and the recovery of each actor, are here

Below is a quick rundown of the recovery times:

  • Actor 1: Replaying eventlog of 10000 messages took 24857ms PersistenceId: 2b922de1-d088-49b4-953f-1fc1b11195f8
  • Actor 2: Replaying eventlog of 10000 messages took 45666ms PersistenceId: 592119ae-3ae6-494b-898c-0e0ef868508b
  • Actor 3: Replaying eventlog of 10000 messages took 75792ms PersistenceId: 868d437b-6e9d-4e1a-8fe1-9bb8750dd714
  • Actor 4: Replaying eventlog of 10000 messages took 105311ms PersistenceId: 12aeeb05-7c87-4c27-9206-8f7081bc0e3c
  • Actor 5: Replaying eventlog of 10000 messages took 136015ms PersistenceId: 44cf1824-8a88-4f4d-b9a0-4b081f743264
  • Actor 6: Replaying eventlog of 10000 messages took 169328ms PersistenceId: b56624f9-0c17-444b-9ac6-51e302ae08a4
  • Actor 7: Replaying eventlog of 10000 messages took 198782ms PersistenceId: d4192d8e-2ee6-4e7e-9e6a-758f59340a8a
  • Actor 8: Replaying eventlog of 10000 messages took 245571ms PersistenceId: eb1c0a14-8594-4b36-8cb1-de28d827c077
  • Actor 9: Replaying eventlog of 10000 messages took 276393ms PersistenceId: 863c5ee2-80eb-4b70-8517-84ae9c40982e

Please note that each actor uses a different persistence ID.
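To quantify the trend in the numbers above, the following minimal sketch (plain Java, no Akka required) computes how much each actor's recovery time grows compared to the previous one. The class and method names are hypothetical and exist only for this illustration; the timings are copied from the list above. Each additional actor adds roughly 20 to 47 seconds, about 31 seconds on average, which suggests the cost scales with the total number of events persisted so far rather than with the 10000 events of the actor being recovered.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original test code.
final class RecoveryDeltas {

    // Recovery times in milliseconds for actors 1 to 9, copied from the list above.
    static final long[] RECOVERY_MS = {
        24857, 45666, 75792, 105311, 136015, 169328, 198782, 245571, 276393
    };

    // Returns the increase in recovery time from each actor to the next.
    static long[] deltas(long[] times) {
        long[] result = new long[times.length - 1];
        for (int i = 1; i < times.length; i++) {
            result[i - 1] = times[i] - times[i - 1];
        }
        return result;
    }

    // Arithmetic mean of the given values (NaN for an empty array).
    static double mean(long[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        long[] d = deltas(RECOVERY_MS);
        for (int i = 0; i < d.length; i++) {
            System.out.printf("Actor %d -> %d: +%d ms%n", i + 1, i + 2, d[i]);
        }
        System.out.printf("Mean increase per additional actor: %.0f ms%n", mean(d));
    }
}
```

If recovery were independent of other persistence IDs, these increases would hover around zero instead of around 30 seconds.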

I have run these tests in Java with Akka Persistence Cassandra 0.60 and 0.89. Both show similar results. In fact, actors that recover in v0.89 are even slower than those that recover in v0.60.

This came as a surprise to me. I expected the number of events persisted by an actor with one persistence ID to be irrelevant to the recovery speed of an actor with another persistence ID.

Were my expectations wrong? Is this intended behavior?

Thanks in advance!

It turns out that lidalia’s in-memory logger was on the classpath and was keeping all log messages in memory, which slowed everything down.

Replacing it with Logback solved the issue.
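For anyone who wants to rule out the same cause, here is a minimal sketch of a classpath check using only the JDK. The class name `uk.org.lidalia.slf4jtest.TestLoggerFactory` is an assumption about which class the in-memory logger ships; verify it against the dependency actually present in your build. The helper name `isOnClasspath` is hypothetical.

```java
// Hypothetical diagnostic, not part of the original test code.
final class LoggerClasspathCheck {

    // ASSUMPTION: fully qualified name of the in-memory logger's factory class.
    static final String ASSUMED_IN_MEMORY_LOGGER =
        "uk.org.lidalia.slf4jtest.TestLoggerFactory";

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, LoggerClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("In-memory test logger on classpath: "
            + isOnClasspath(ASSUMED_IN_MEMORY_LOGGER));
    }
}
```

Running this with the same classpath as the stress test shows whether a test-only logging binding has leaked into it.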