Kubernetes: Cassandra timeout during CAS write query at consistency SERIAL

Hi,

Running Lagom against a Cassandra instance deployed on minikube fails with a timeout.

Lagom version: 1.5.4
minikube: 1.5.2
helm version: v3.0.0
Cassandra Chart: bitnami/cassandra - 4.1.11
Cassandra: 3.11.5

The Cassandra write request timeout in cassandra.yaml (write_request_timeout_in_ms) is set to 2000; even raising it to 10000 doesn’t change much.
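For reference, these are the cassandra.yaml knobs in play; a sketch with the stock Cassandra 3.11 defaults, noting that CAS writes also pass through the Paxos contention timeout:

# cassandra.yaml (stock 3.11 defaults)
write_request_timeout_in_ms: 2000     # regular and CAS writes
cas_contention_timeout_in_ms: 1000    # extra allowance for Paxos contention during CAS
request_timeout_in_ms: 10000          # default for requests without a dedicated timeout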

Steps to reproduce:

  • deploy the Cassandra chart to a minikube instance using helm
  • forward the Cassandra pod port to localhost (kubectl port-forward pod-name 9042:9042 &)
  • in the project’s build.sbt, disable the embedded Cassandra and point it to localhost with the related auth properties, i.e. lagomCassandraEnabled in ThisBuild := false (see the sketch after this list)
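For completeness, that last step could look like the sketch below; lagomUnmanagedServices and the cas_native service name come from the Lagom documentation on using an external Cassandra, and the address assumes the port-forward above:

// build.sbt: turn off the embedded Cassandra and register the forwarded instance
lagomCassandraEnabled in ThisBuild := false
lagomUnmanagedServices in ThisBuild := Map("cas_native" -> "tcp://localhost:9042")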

During init, the following warning appears:

12:13:34.600 [info] com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor [sourceThread=hello-impl-application-akka.actor.default-dispatcher-3, akkaTimestamp=11:13:34.600UTC, akkaSource=akka.tcp://hello-impl-application@127.0.0.1:63213/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare, sourceActorSystem=hello-impl-application] - Cluster start task cassandraOffsetStorePrepare done.

12:13:44.593 [warn] akka.persistence.cassandra.journal.CassandraJournal [sourceThread=hello-impl-application-lagom.persistence.dispatcher-29, akkaTimestamp=11:13:44.592UTC, akkaSource=akka.tcp://hello-impl-application@127.0.0.1:63213/system/cassandra-journal, sourceActorSystem=hello-impl-application] - Failed to connect to Cassandra and initialize. It will be retried on demand. Caused by: com.datastax.driver.core.exceptions.WriteTimeoutException: Cassandra timeout during CAS write query at consistency SERIAL (1 replica were required but only 0 acknowledged the write)

Calling any REST API then throws:

akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://hello-impl-application/system/sharding/HelloEntity#-1282607955]] after [5000 ms]. Message of type [com.lightbend.lagom.scaladsl.persistence.CommandEnvelope]. A typical reason for AskTimeoutException is that the recipient actor didn’t send a reply.
at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:648)
at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:669)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:202)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:874)
at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:113)
at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:872)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:334)
at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:285)
at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:289)
at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:241)
at java.lang.Thread.run(Thread.java:748)

Running against a Cassandra database deployed directly on localhost doesn’t hit any timeout… Any clue or documentation would be much appreciated, as we would like to see how far Lagom can be deployed on a local Kubernetes cluster, including Cassandra within the cluster.

Also, the weirdest thing about it is that the app keyspace and the related CQRS tables do get created, as can be seen through cqlsh on the Cassandra instance deployed on the minikube cluster.

This is during initialization; are any queries succeeding? Often when you get a timeout with no replica acknowledging the write, it is because your keyspace refers to data centers that don’t exist in the cluster.
Describe the keyspace and, if it is using NetworkTopologyStrategy, check that the datacenter names match the ones nodetool reports with nodetool status.
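Concretely, assuming the app keyspace is named hello, that check could be:

$ nodetool status                       # prints a "Datacenter: …" header per data center
$ cqlsh -e "DESCRIBE KEYSPACE hello"    # shows the keyspace’s replication settings

If the keyspace uses NetworkTopologyStrategy, every data center named in its replication map must match one reported by nodetool status.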

Most of the initialisation is succeeding, as I can see that Lagom creates the app keyspace and the related tables. Which information exactly should be checked in the nodetool output?

Meanwhile, here is the output of a cqlsh session against the Cassandra instance on minikube, my app keyspace being hello:

Connected to cassandra at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.5 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cassandra@cqlsh> describe keyspaces
stream         system_auth system_traces  **hello**
system_schema  system       system_distributed  authtest     

Here is the app configuration file:

play.application.loader = com.micro.xp.poste.impl.PosteLoader

poste.cassandra.keyspace = hello

cassandra-journal.keyspace = ${poste.cassandra.keyspace}
cassandra-snapshot-store.keyspace = ${poste.cassandra.keyspace}
lagom.persistence.read-side.cassandra.keyspace = ${poste.cassandra.keyspace}

cassandra.default {
  ## list the contact points here
  contact-points = ["127.0.0.1"]
  ## override Lagom’s ServiceLocator-based ConfigSessionProvider
  session-provider = akka.persistence.cassandra.ConfigSessionProvider
  ## shared credentials for the three Cassandra sections below
  authentication {
    username = "user"
    password = "pwd"
  }
}

cassandra-journal {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

cassandra-snapshot-store {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

lagom.persistence.read-side.cassandra {
  contact-points = ${cassandra.default.contact-points}
  session-provider = ${cassandra.default.session-provider}
  authentication = ${cassandra.default.authentication}
}

akka.loglevel = DEBUG

I am still getting a timeout error from the Akka actor (PersistentEntity) when calling the hello REST API:

13:50:14.375 [error] akka.cluster.sharding.PersistentShardCoordinator [sourceThread=poste-impl-application-akka.actor.default-dispatcher-17, akkaTimestamp=12:50:14.373UTC, akkaSource=akka.tcp://poste-impl-application@127.0.0.1:62618/system/sharding/PosteEntityCoordinator/singleton/coordinator, sourceActorSystem=poste-impl-application] - Persistence failure when replaying events for persistenceId [/sharding/PosteEntityCoordinator]. Last known sequence number [0]
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.OperationTimedOutException: [/127.0.0.1:9042] Timed out waiting for server response
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:552)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:513)
	at akka.persistence.cassandra.package$$anon$1.$anonfun$run$1(package.scala:18)
	at scala.util.Try$.apply(Try.scala:213)
	at akka.persistence.cassandra.package$$anon$1.run(package.scala:18)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.OperationTimedOutException: [/127.0.0.1:9042] Timed out waiting for server response
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onTimeout(RequestHandler.java:954)
	at com.datastax.driver.core.Connection$ResponseHandler$1.run(Connection.java:1575)
	at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:682)
	at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:757)
	at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:485)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 common frames omitted

The weirdest thing about it is that running the Hello project (the same sample of code) using the Java seed project from Lagom (https://www.lagomframework.com/get-started-java-maven.html) just works, unlike the Scala implementation with sbt… :confounded:

All in all, the timeout exception isn’t self-explanatory and looks way too generic, making it difficult to find the root cause.

We are currently using SimpleStrategy, and even forcing it within the configuration doesn’t solve it.
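By “forcing it” I mean setting the keyspace replication explicitly; a sketch using akka-persistence-cassandra’s keyspace auto-creation settings (key names from its reference.conf, values assumed to match the existing keyspace):

cassandra-journal {
  replication-strategy = "SimpleStrategy"
  replication-factor = 1
}

Either way, here is what the cluster reports: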

$ nodetool describecluster

Cluster Information:
        Name: cassandra
        Snitch: org.apache.cassandra.locator.SimpleSnitch
        DynamicEndPointSnitch: enabled
        Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
        Schema versions:
                UNREACHABLE: [172.17.0.9]

cassandra@cqlsh> select * from system_schema.keyspaces ;

 keyspace_name      | durable_writes | replication
--------------------+----------------+-------------------------------------------------------------------------------------
….
              hello |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
              poste |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '1'}
 system_distributed |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '3'}
             system |           True | {'class': 'org.apache.cassandra.locator.LocalStrategy'}
      system_traces |           True | {'class': 'org.apache.cassandra.locator.SimpleStrategy', 'replication_factor': '2'}

cassandra@cqlsh> select data_center from system.local;

 data_center
-------------
 datacenter1

Any hint?