Akka Persistence + Cassandra + Recovery + Circuit Breaker Timeout + Multiple Cassandra Contact Points

We are using Akka Persistence with Cassandra 3.x. During application startup (a pod in OpenShift), we have noticed occasional circuit breaker timeouts. A few questions:

  1. When we have multiple Cassandra contact points and the circuit breaker condition is met for one node, will recovery resume or restart using another Cassandra node?

  2. When recovery fails, is there a way to get the Cassandra connection information, i.e. which connection (node) recovery was using when it failed?

  3. If there is no self-healing, is there a way for the application to restart recovery and force it to use another Cassandra node?

Library Details

  • akka-persistence_2.12: 2.5.12
  • akka-persistence-cassandra_2.12: 0.85

Circuit Breaker Configuration

circuit-breaker {
  max-failures = 5
  call-timeout = 1s
  reset-timeout = 1s
}
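
To make the settings above concrete, here is a small, simplified model of how a circuit breaker with these three parameters behaves. It is only an illustration of the general pattern, not Akka's actual `CircuitBreaker` implementation; the class `BreakerModel` and its methods are hypothetical names, and time is passed in explicitly instead of being read from a clock.

```java
import java.time.Duration;

// Simplified, single-threaded model of a circuit breaker (illustration only).
final class BreakerModel {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;
    private final long callTimeoutMillis;
    private final long resetTimeoutMillis;
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAtMillis = 0;

    public BreakerModel(int maxFailures, Duration callTimeout, Duration resetTimeout) {
        this.maxFailures = maxFailures;
        this.callTimeoutMillis = callTimeout.toMillis();
        this.resetTimeoutMillis = resetTimeout.toMillis();
    }

    /**
     * Models one call that starts at nowMillis and takes tookMillis.
     * Returns false if the breaker rejected the call without attempting it.
     */
    public boolean call(long nowMillis, long tookMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAtMillis < resetTimeoutMillis) {
                return false; // still open: fail fast
            }
            state = State.HALF_OPEN; // reset-timeout elapsed: allow a trial call
        }
        boolean timedOut = tookMillis > callTimeoutMillis;
        if (!timedOut) {
            failures = 0;
            state = State.CLOSED;
            return true;
        }
        failures++;
        // A failed trial call, or too many consecutive failures, opens the breaker.
        if (state == State.HALF_OPEN || failures >= maxFailures) {
            state = State.OPEN;
            openedAtMillis = nowMillis; // simplification: open at call start time
        }
        return true;
    }

    public State state() {
        return state;
    }
}
```

With `max-failures = 5`, `call-timeout = 1s` and `reset-timeout = 1s`, any journal call that takes longer than one second counts as a failure, and five such failures in a row open the breaker for one second.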

Error Logs

2018-11-06T14:59:26,455 ERROR [csm-akka.actor.default-dispatcher-4](akka://csm/user/network) -- system.NetworkSupervisor - Persistence failure when replaying events for persistenceId [network]. Last known sequence number [196]
akka.pattern.CircuitBreaker$$anon$1: Circuit Breaker Timed out.

2018-11-06T14:59:26,466 ERROR [csm-akka.actor.default-dispatcher-24](akka://csm/user/network) -- actor.OneForOneStrategy - Recovery Failed: for the Event::None::Circuit Breaker Timed out.
csm.exceptions.RecoveryFailedException: Recovery Failed: for the Event::None::Circuit Breaker Timed out.
	at csm.actors.system.AbstractClientStateSupervisingActor.onRecoveryFailure(AbstractClientStateSupervisingActor.java:56) ~[main/:?]
	at akka.persistence.Eventsourced$$anon$4.stateReceive(Eventsourced.scala:623) ~[akka-persistence_2.12-2.5.13.jar:2.5.13]
	at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:222) ~[akka-persistence_2.12-2.5.13.jar:2.5.13]
	at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:221) ~[akka-persistence_2.12-2.5.13.jar:2.5.13]
	at akka.persistence.AbstractPersistentActor.aroundReceive(PersistentActor.scala:454) ~[akka-persistence_2.12-2.5.13.jar:2.5.13]
	at csm.actors.system.AbstractClientStateSupervisingActor.aroundReceive(AbstractClientStateSupervisingActor.java:49) ~[main/:?]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.actor.ActorCell.invoke(ActorCell.scala:557) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [akka-actor_2.12-2.5.13.jar:2.5.13]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [akka-actor_2.12-2.5.13.jar:2.5.13]

Retries are handled internally by the Cassandra driver; see the plugin's reference.conf: https://github.com/akka/akka-persistence-cassandra/blob/master/core/src/main/resources/reference.conf#L622

The circuit breaker timeout indicates that the journal isn’t responding, so I’d look at the driver logs and server-side metrics to find out why.
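
If those logs and metrics suggest that Cassandra is just slow while the pod starts up, rather than unavailable, one thing to try is giving the journal's circuit breaker more headroom. The fragment below is a sketch, not a recommendation. It assumes the plugin is configured under `cassandra-journal` (the path used by the 0.x plugin versions), the host names are placeholders, and the circuit breaker values are Akka's defaults from `akka.persistence.journal-plugin-fallback.circuit-breaker`.

```
cassandra-journal {
  # Placeholder host names; list every contact point the driver may use.
  contact-points = ["cassandra-1.example", "cassandra-2.example"]

  circuit-breaker {
    max-failures = 10
    call-timeout = 10s
    reset-timeout = 30s
  }
}
```

The `call-timeout = 1s` in the question is much tighter than the 10s default, so a single slow replay query during startup is enough to fail recovery.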