Cassandra Schema Creation

franz · November 26, 2018, 5:30pm

Hi,

When I start my microservices against a clean Cassandra, most of them fail almost every time. That’s because they start creating the Cassandra schema, and it takes a while for that schema to be created (and propagated across the Cassandra cluster).

The stack trace of my microservice looks something like this:

2018-11-26 11:27:45,522 ERROR a.a.OneForOneStrategy com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
	at com.google.common.util.concurrent.AbstractFuture.getDoneValue(AbstractFuture.java:503)
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:462)
	at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2.$anonfun$run$2(package.scala:25)
	at scala.util.Try$.apply(Try.scala:209)
	at akka.persistence.cassandra.package$ListenableFutureConverter$$anon$2.run(package.scala:25)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
	at com.datastax.driver.core.exceptions.ReadTimeoutException.copy(ReadTimeoutException.java:115)
	at com.datastax.driver.core.Responses$Error.asException(Responses.java:136)
	at com.datastax.driver.core.RequestHandler$SpeculativeExecution.onSet(RequestHandler.java:507)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:1075)
	at com.datastax.driver.core.Connection$Dispatcher.channelRead0(Connection.java:998)
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:647)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:582)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:461)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	... 1 common frames omitted
Caused by: com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency QUORUM (2 responses were required but only 1 replica responded)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:63)
	at com.datastax.driver.core.Responses$Error$1.decode(Responses.java:38)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:289)
	at com.datastax.driver.core.Message$ProtocolDecoder.decode(Message.java:269)
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:88)
	... 22 common frames omitted

Any ideas how to fix this?

Thanks,
Franz

aklikic · November 26, 2018, 6:06pm

@franz
Check replication factor for your keyspace.
You have 3 nodes and the consistency level (QUORUM) expects 2 to respond.
Replication factor should be, in this case, 3.
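
The arithmetic behind this can be sketched as follows. Cassandra's QUORUM is floor(RF / 2) + 1 replicas, so with a replication factor of 3, two replicas must respond, which matches the "2 responses were required" in the stack trace. The class and method names below are illustrative only and are not part of any driver API.

```java
public final class QuorumMath {
    private QuorumMath() {}

    /** Number of replicas that must respond at QUORUM for a given replication factor. */
    public static int quorum(int replicationFactor) {
        if (replicationFactor < 1) {
            throw new IllegalArgumentException("replication factor must be >= 1");
        }
        // Integer division rounds down, giving floor(RF / 2) + 1.
        return replicationFactor / 2 + 1;
    }

    /** Number of replicas that can be unavailable while QUORUM still succeeds. */
    public static int tolerableReplicasDown(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }
}
```

With a replication factor of 3, QUORUM reads and writes tolerate one unresponsive replica; with a replication factor of 1 or 2 they tolerate none.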

franz · November 26, 2018, 6:09pm

Thanks @aklikic. Yes. In my configuration, replication factor is set to 3

cassandra-journal {
  replication-factor = 3
  max-message-batch-size = 100
}

cassandra-snapshot-store {
  replication-factor = 3
  write-consistency = "QUORUM"
  read-consistency = "QUORUM"
}

aklikic · November 26, 2018, 8:27pm

Did you check the Cassandra logs?

franz · November 26, 2018, 10:08pm

Yes. The Cassandra logs say that the schema is being propagated, which makes sense.

The problem is that propagating the schema takes a while, so the microservice fails to start up.

aklikic · November 26, 2018, 10:51pm

But if it fails, it should retry.
Are you creating the table in the read-side processor's globalPrepare?
This stack trace is related to a query. Do you get any error on createTable?

franz · November 26, 2018, 10:53pm

@aklikic this happens on a simple command being passed to the entity.

Some of our microservices eventually recover. Others I have to restart.

Is this not something you often get as well on a fresh Cassandra database?

aklikic · November 26, 2018, 11:28pm

I never had problems that would block application start. It always eventually recovers.

franz · November 27, 2018, 12:00am

But you do see that error on the microservice while the Cassandra schema is still being created?

TimMoore · November 27, 2018, 3:56am

@franz I would recommend creating the schema explicitly in Cassandra (using cqlsh for example) before deploying your services in production (or a production-like environment). The auto-creation feature is really intended only as a convenience for development and testing, and is not recommended for production.
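
As a rough sketch of what switching auto-creation off might look like once the schema has been created by hand: the fragment below assumes the `keyspace-autocreate` and `tables-autocreate` settings from the akka-persistence-cassandra plugin configuration. Check the setting names against the reference.conf of the plugin version you are running, and note that any read-side keyspace may need an equivalent setting.

```
# Assumed setting names; verify against your plugin version's reference.conf.
cassandra-journal {
  keyspace-autocreate = false
  tables-autocreate = false
}

cassandra-snapshot-store {
  keyspace-autocreate = false
  tables-autocreate = false
}
```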

franz · November 27, 2018, 6:48am

Thanks @TimMoore.

I was thinking of something similar as well, but how can I safely do that?

For example, I can probably export the current schema and use that for the schema creation, but is that safe to do? If I update Lagom and there’s a schema update, would it still run against my manually created schema?

Thanks

TimMoore · November 27, 2018, 7:09am

When there is a schema update in future versions of Lagom, the details will be published in the migration guide.

The schema auto-creation doesn’t do any migration of existing data, it only creates the schema if it does not already exist. This works well on single-node development environments, but can have problems when running in a cluster. There is some attempt to coordinate this and prevent concurrent updates, but it is not 100% bulletproof. If you have requests making queries against the Cassandra schema running concurrently with the auto-creation, this can cause errors.

franz · November 27, 2018, 7:24am

Hmm, understood.

Btw, another question. Here’s what I did in the other microservices.

Upon start-up of the microservice, I execute an Initialize command on my entity. This Initialize command does nothing except create a dummy Initialized event. This is to force the schema creation if the Cassandra schema does not exist yet. If it already exists, the command executes without error. If it fails, it is retried after x seconds.

Now this works fine, but it pollutes my logs with stack traces on the initial execution. That’s because the request to Cassandra times out (schema creation and propagation to the rest of the Cassandra cluster take a while).
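
The retry-after-x-seconds approach described above could be sketched roughly like this. It is a generic helper built only on the JDK, not a Lagom API: `InitRetry`, `retry` and their parameters are hypothetical names, and the one-line warning stands in for whatever logger the service uses, so that failed attempts do not print full stack traces.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public final class InitRetry {
    private InitRetry() {}

    /**
     * Runs the asynchronous attempt up to maxAttempts times, waiting delayMillis
     * between failures. The returned stage completes with the first success, or
     * with the last error once all attempts are used up.
     */
    public static <T> CompletionStage<T> retry(
            Supplier<CompletionStage<T>> attempt,
            int maxAttempts,
            long delayMillis,
            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        runAttempt(attempt, 1, maxAttempts, delayMillis, scheduler, result);
        return result;
    }

    private static <T> void runAttempt(
            Supplier<CompletionStage<T>> attempt,
            int attemptNo,
            int maxAttempts,
            long delayMillis,
            ScheduledExecutorService scheduler,
            CompletableFuture<T> result) {
        CompletionStage<T> stage;
        try {
            stage = attempt.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure like a failed stage.
            CompletableFuture<T> failed = new CompletableFuture<>();
            failed.completeExceptionally(e);
            stage = failed;
        }
        stage.whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            } else if (attemptNo >= maxAttempts) {
                result.completeExceptionally(error);
            } else {
                // One short line per failed attempt instead of a full stack trace.
                System.err.println("Attempt " + attemptNo + " failed (" + error
                        + "), retrying in " + delayMillis + " ms");
                scheduler.schedule(
                        () -> runAttempt(attempt, attemptNo + 1, maxAttempts,
                                delayMillis, scheduler, result),
                        delayMillis, TimeUnit.MILLISECONDS);
            }
        });
    }
}
```

Assuming the entity's ask call returns a CompletionStage, it could be wrapped as `InitRetry.retry(() -> entityRef.ask(new MyCommand.Initialize()), 10, 5000, scheduler)`, where the attempt count and delay are arbitrary example values.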

One way I think this can be improved is by increasing the timeout for this particular entityRef.ask(new MyCommand.Initialize()), because I know that this call can take a while to execute (while the rest of my commands should be pretty fast). Is there a way I can do that?

Thanks

TimMoore · November 28, 2018, 12:16am

Yes, PersistentEntityRef has a method called withAskTimeout that takes a FiniteDuration:

import scala.concurrent.duration._  // provides the 30.seconds syntax

entityRef
  .withAskTimeout(30.seconds)
  .ask(new MyCommand.Initialize())

aditya · November 28, 2018, 4:49am

Have you also checked the Cassandra node logs on each of the hosts? We have observed timeouts in Play applications in the past.

Possible causes:

1.) Too many tombstones, resulting in large SSTables, with no compactions for a long time.
2.) One or more nodes out of sync and requiring a full repair.

HTH
Aditya
