Kubernetes: Cassandra timeout during CAS write query at consistency SERIAL

Hi,

Running Lagom against a Cassandra instance deployed on minikube fails with a timeout.

Lagom version: 1.5.4
minikube: 1.5.2
helm version: v3.0.0
Cassandra Chart: bitnami/cassandra - 4.1.11
Cassandra: 3.11.5

The Cassandra write request timeout in cassandra.yaml (write_request_timeout_in_ms) is set to 2000; even raising it to 10000 doesn’t change much.
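For reference, these are the cassandra.yaml knobs in play; a sketch with the stock Cassandra 3.11 defaults, noting that CAS writes also pass through the Paxos contention timeout:

# cassandra.yaml (stock 3.11 defaults)
write_request_timeout_in_ms: 2000     # regular and CAS writes
cas_contention_timeout_in_ms: 1000    # extra allowance for Paxos contention during CAS
request_timeout_in_ms: 10000          # default for requests without a dedicated timeout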

Steps to reproduce:

  • deploy the Cassandra chart to a minikube instance using helm
  • forward the Cassandra pod port to localhost (kubectl port-forward pod-name 9042:9042 &)
  • in the project’s build.sbt, disable the embedded Cassandra and point it to localhost with the related auth properties, i.e. lagomCassandraEnabled in ThisBuild := false (see the sketch after this list)
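For completeness, that last step could look like the sketch below; lagomUnmanagedServices and the cas_native service name come from the Lagom documentation on using an external Cassandra, and the address assumes the port-forward above:

// build.sbt: turn off the embedded Cassandra and register the forwarded instance
lagomCassandraEnabled in ThisBuild := false
lagomUnmanagedServices in ThisBuild := Map("cas_native" -> "tcp://localhost:9042")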

During init, the following warning appears:

12:13:34.600 [info] com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor [sourceThread=hello-impl-application-akka.actor.default-dispatcher-3, akkaTimestamp=11:13:34.600UTC, akkaSource=akka.tcp://hello-impl-application@127.0.0.1:63213/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare, sourceActorSystem=hello-impl-application] - Cluster start task cassandraOffsetStorePrepare done.

12:13:44.593 [warn] akka.persistence.cassandra.journal.CassandraJournal [sourceThread=hello-impl-application-lagom.persistence.dispatcher-29, akkaTimestamp=11:13:44.592UTC, akkaSource=akka.tcp://hello-impl-application@127.0.0.1:63213/system/cassandra-journal, sourceActorSystem=hello-impl-application] - Failed to connect to Cassandra and initialize. It will be retried on demand. Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (1 replica were required but only 0 acknowledged the write)

Calling any REST API then throws:

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#-1282607955]] after [5000 ms]. Message of type [com.lightbend.lagom.scaladsl.persistence.CommandEnvelope]. A typical reason for AskTimeoutException is that the recipient actor didn’t send a reply.
at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:648)
at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:669)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:202)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:874)
at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)
at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:872)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:334)
at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:285)
at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:289)
at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:241)
at java.lang.Thread.run(Thread.java:748)

Running against a Cassandra database deployed directly on localhost doesn’t hit any timeout… Any clue or documentation would be much appreciated, as we would like to see how far Lagom can be deployed on a local Kubernetes cluster, including Cassandra within the cluster.

Also, the weirdest thing about it is that the app keyspace and the related CQRS tables do get created, as can be seen through cqlsh on the Cassandra instance deployed on the minikube cluster.

This is during initialization; are any queries succeeding? Often when you get a timeout with no replica acknowledging the write, it is because your keyspace refers to data centers that don’t exist in the cluster.
Describe the keyspace and, if it is using NetworkTopologyStrategy, check that the datacenter names match the ones nodetool reports with nodetool status.
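Concretely, assuming the app keyspace is named hello, that check could be:

$ nodetool status                       # prints a "Datacenter: …" header per data center
$ cqlsh -e "DESCRIBE KEYSPACE hello"    # shows the keyspace’s replication settings

If the keyspace uses NetworkTopologyStrategy, every data center named in its replication map must match one reported by nodetool status.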

Most of the initialisation is succeeding, as I can see that Lagom creates the app keyspace and the related tables. Which information exactly should be checked in the nodetool output?

Meanwhile, here is the output of a cqlsh session against the Cassandra instance on minikube, my app keyspace being hello:

Connected to cassandra at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.5 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh> describe keyspaces
stream         system_auth system_traces  **hello**
system_schema  system       system_distributed  authtest     

Here is the app configuration file:

play.application.loader = com.micro.xp.poste.impl.PosteLoader

poste.cassandra.keyspace = hello

cassandra-journal.keyspace = ${poste.cassandra.keyspace}
cassandra-snapshot-store.keyspace = ${poste.cassandra.keyspace}
lagom.persistence.read-side.cassandra.keyspace = ${poste.cassandra.keyspace}

cassandra.default {
  ## list the contact points here
  contact-points = ["127.0.0.1"]
  ## override Lagom’s ServiceLocator-based ConfigSessionProvider
  session-provider = akka.persistence.cassandra.ConfigSessionProvider
  ## shared credentials for the three Cassandra sections below
  authentication {
    username = "user"
    password = "pwd"
  }
}

cassandra-journal {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

cassandra-snapshot-store {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

lagom.persistence.read-side.cassandra {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

akka.loglevel = DEBUG

I am still getting a timeout error from the Akka actor (PersistentEntity) when calling the hello REST API:

13:50:14.375 [error] akka.cluster.sharding.PersistentShardCoordinator [sourceThread=poste-impl-application-akka.actor.default-dispatcher-17, akkaTimestamp=12:50:14.373UTC, akkaSource=akka.tcp://poste-impl-application@127.0.0.1:62618/system/sharding/PosteEntityCoordinator/singleton/coordinator, sourceActorSystem=poste-impl-application] - Persistence failure when replaying events for persistenceId [/sharding/PosteEntityCoordinator]. Last known sequence number [0]
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [/127.0.0.1:9042] Timed out waiting for server response
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:513)
	at akka.persistence.cassandra.package$$anon$1.$anonfun$run$1(package.scala:18)
	at scala.util.Try$.apply(Try.scala:213)
	at akka.persistence.cassandra.package$$anon$1.run(package.scala:18)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/127.0.0.1:9042] Timed out waiting for server response
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:954)
	at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1575)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:682)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:757)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:485)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 common frames omitted

The weirdest thing about it is that running the Hello project (the same sample of code) using the Java seed project from Lagom (https://www.lagomframework.com/get-started-java-maven.html) just works, unlike the Scala implementation with sbt… :confounded:

All in all, the timeout exception isn’t self-explanatory and looks way too generic, making it difficult to find the root cause.

We are currently using SimpleStrategy, and even forcing it within the configuration doesn’t solve it.
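By “forcing it” I mean setting the keyspace replication explicitly; a sketch using akka-persistence-cassandra’s keyspace auto-creation settings (key names from its reference.conf, values assumed to match the existing keyspace):

cassandra-journal {
  replication-strategy = "SimpleStrategy"
  replication-factor = 1
}

Either way, here is what the cluster reports: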

$ nodetool describecluster

Cluster Information:
        Name: cassandra
        Snitch: org.apache.cassandra.locator.SimpleSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                UNREACHABLE: [172.17.0.9]

cassandra@cqlsh> select * from system_schema.keyspaces ;

 keyspace_name      | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
….
              hello |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
              poste |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

cassandra@cqlsh> select data_center from system.local;

 data_center
-------------
 datacenter1

Any hint?