Cassandra: "Column family ID mismatch" at startup

I’ve seen a report about similar errors here: Lagom errors on startup

My situation is different.

My lagom 1.4.11 application has about a dozen micro-services in Scala 2.12, each are fairly simple.
Unlike the other case, there is no code that creates tables in cassandra. There are also no read-side processors. It’s just vanilla service descriptors. All interactions w/ Cassandra are due to Lagom’s persistence.

I use an external cassandra 3.11.4 & kafka services.

So, when I execute sbt runAll, lagom starts the service locator + all my micro services; nothing else.

With a clean cassandra server, sbt runAll results in several “Column family ID mismatch” errors in the cassandra log. From the various reports about this error, it seems to be related to vulnerabilities in concurrent schema operations in cassandra; an on-going topic here: https://issues.apache.org/jira/browse/CASSANDRA-9424

On my laptop (Dell 7530, Ubuntu 18.04 LTS, Xeon E2186M, 64Gb RAM), there’s enough horsepower that, statistically speaking, starting a dozen micro-services, each of which needs to initialize a schema in cassandra, may indeed put some stress on a single node cassandra deployment, also running on that machine. At least, this is what my experience seems to suggest.

Recently, I wrote a simple sbt task to start all my micro-services sequentially with Def.sequential(lagomRun in project1, lagomRun in project2, ....).

At least, this avoids stressing cassandra during schema initialization.
What’s annoying though is that I can’t write: Def.sequential(lagomServiceLocatorStart, lagomRun in project1, ....) as SBT says that lagomServiceLocatorStart is undefined.

Indeed, looking in the LagomPlugin source code, this task key is set only in the private project, lagom-internal-meta-project-service-locator. Ok, I have to manually issue lagomServiceLocatorStart and then my sequential task. At least, this makes for a somewhat clean start in dev mode.

Surprisingly, I didn’t used to experience this problem. Is this a scale issue?
I"m not sure, one other factor that may be relevant: recently, I turned on logging for each micro service.
Before, everything went to the sbt output; that might have indirectly forced some sequencing of the micro services start. With logging enabled, It seems that there is more concurrency at startup; perhaps enough to trip some vulnerabilities in cassandra.

I’m wondering whether others have experienced this kind of problem.

For production, I’m wondering about doing something similar, that is, forcing the micro services to start one after the other just to prevent concurrent schema operations on cassandra. After all schemas are created, I’m not worried about concurrent operations hitting cassandra because this is a well known territory whereas schema initialization is more tricky as indicated by the on-going issue.

  • Nicolas.
1 Like

@NicolasRouquette thanks for this Nicolas. I am on the same Lagom version as yours and yes, we are seeing the same problem even with the services that does not use any customer readside tables (i.e. no CreateTables in globalPrepare).

I’ve experienced this issue in my previous project as well. Still couldn’t figure out what exactly is wrong. I guess I will try your sequential deployment and see if that still causes an issue.

What I found was that when I deployed the services without any custom readside tables, the column family exception started as soon as the Lagom was populating the eventsbytag1 materialized view (along with the MAT-View-not-safe logging).

Tried it many times, and experienced it almost all attempts.

@TimMoore Just tagging you for your attention, I think quite a few Lagom users are experiencing and suffering from this Column Family Exception in Cassandra.

1 Like

Some logs…

This is from the service that does NOT create any customer readside table:

At first, in the middle of operation, when I started issueing some mutating commands to this service about 20mins after the deployment was completed. You can see that the service is trying to create this snapshots table adhocly like this:

INFO [MigrationStage:1] 2019-04-22 09:04:05,438 ColumnFamilyStore.java:411 - Initializing fee.snapshots\n

[Native-Transport-Requests-2] 2019-04-22 09:04:05,447 MigrationManager.java:376 - Create new table: org.apache.cassandra.config.CFMetaData@6b2bf1fe[cfId=948f2570-64dd-11e9-947d-038f8cea8ac5,ksName=fee,cfName=snapshots,flags=[COMPOUND],params=TableParams{comment=, read_repair_chance=0.0, dclocal_read_repair_chance=0.1, bloom_filter_fp_chance=0.01, crc_check_chance=1.0, gc_grace_seconds=864000, default_time_to_live=0, memtable_flush_period_in_ms=0, min_index_interval=128, max_index_interval=2048, speculative_retry=99PERCENTILE, caching={‘keys’ : ‘ALL’, ‘rows_per_partition’ : ‘NONE’}, compaction=CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, options={max_threshold=32, min_threshold=4, tombstone_threshold=0.2, bucket_high=1.5, bucket_low=0.5, tombstone_compaction_interval=86400, unchecked_tombstone_compaction=false, min_sstable_size=50, enabled=true}}, compression=org.apache.cassandra.schema.CompressionParams@cc702f3c, extensions={}, cdc=false},comparator=comparator(org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)),partitionColumns=[[] | [meta meta_ser_id meta_ser_manifest ser_id ser_manifest snapshot snapshot_data timestamp]],partitionKeyColumns=[persistence_id],clusteringColumns=[sequence_nr],keyValidator=org.apache.cassandra.db.marshal.UTF8Type,columnMetadata=[meta_ser_manifest, persistence_id, ser_manifest, sequence_nr, meta, ser_id, snapshot_data, timestamp, snapshot, meta_ser_id],droppedColumns={},triggers=[],indexes=[]]\n

And then almost immediately, within 0.12 sec, this starts to happen:

Could it be related to this cassandra issue?

https://issues.apache.org/jira/browse/CASSANDRA-9424

Also, is there a way we can create snaphshot table during the startup instead of let Lagom creates it when it needs to?

@lejoow that Cassandra issue is pretty general “Schema Improvements” but it is well known that Cassandra doesn’t correctly handle concurrent schema updates. Lagom takes some measures to try to avoid concurrent schema updates from multiple cluster nodes, but it is not perfect.

There are some recent investigation notes in this issue:

Also, is there a way we can create snaphshot table during the startup instead of let Lagom creates it when it needs to?

I’ve never tried this, but it looks like you can trigger eager loading of the snapshot plugin with a configuration like this:

snapshot-store.auto-start-snapshot-stores = ["cassandra-snapshot-store"]

See the Akka Persistence reference.conf:

Ah, this slipped under my radar. I’ll link back to this thread from the Issue @TimMoore mentioned above since there’s relevant cases adding more light on how to reproduce and potential causes of the exception. Also, I get the feeling the problem happens more often than in previous versions of Lagom.

Thanks for your feedback!

I’ve been trying to build a reproducer with very little success. My current work is at https://github.com/ignasi35/multi-project-lagom-app where I built a lagom app with 9 hello-world services. I was hopping I would hit the issue as @NicolasRouquette described above but I’m trying it out in two separate laptops and never hit the “Column family ID mismatch” error.

I tried doing a fresh git clone of https://github.com/ignasi35/multi-project-lagom-app and then running sbt run in two different machines:

  • MBP 13" 2016 - macOS 10.12.6 - 3.3GHz i7 - 16 GB 2133MHz LPDDR3 RAM
  • MBP retina 13" 2015 - macOS 10.13.2 - 2.7GHz i5 - 16 GB 1867MHz DDR3 RAM

Any help building a reproducer will be very welcome.

@ignasi35 This is a good start. When I experienced the bug, I looked into the issue @lejoow mentioned and got the sense that the problem pertains to vulnerabilities around table creation.

In my case, there were about a dozen micro services, a couple of which involved a lot of event registrations which adds more work for creating tables at service startup. So, I wonder if one way to probe this problem would be to parameterize the multi-project with increasingly larger tables to create, as if we were changing some messages like this:

case class GreetingMessage(message: String)
case class GreetingMessage(message: String, dummy1: String)
case class GreetingMessage(message: String, dummy1: String, dummy2: String)
case class GreetingMessage(message: String, dummy1: String, dummy2: String, dummy3: String)
case class GreetingMessage(message: String, dummy1: String, dummy2: String, dummy3: String, dummy4: String)

Since this bug, I moved from Cassandra persistence to MySQL persistence (so that we can use AWS RDS) and migrated to lagom 1.5 for deploying to kubernetes. If I can find spare cycles, i’ll try to try to revert to the 1.4/cassandra version that had this problem to see if I can reproduce it.

  • Nicolas.

Hi @NicolasRouquette,

the number of different persistent entities is irrelevant, there is only one Journal table and one SnapshotStore table.

I’m beginning to suspect the problem is caused by concurrent KEYSPACE creation. Just noticed this bit in akka-persistence-cassandra and I think the issue is Lagom doesn’t honor the serialization of statements:

I still don’t have a reproducer so :woman_shrugging:

Cheers,

@ignasi35 Thanks for the clarification. The comment you found is relevant. I had a couple of read-side persistence monitors. I wonder then if we could trip the bug by adding a few of those. - Nicolas.

Hi @lejoow and @NicolasRouquette,

I’ve raised a PR with a potential fix. Unfortunately, I couldn’t find a reproducer so it’d be great if you can clone/checkout that branch and build Lagom locally to test the change.

I’m not 100% sure the solution is correct. I’m working under the hypothesis there are 2 problems:

  • multiple threads invoking CREATE KEYSPACE (which the PR tries to solve)
  • multiple threads altering the KEYSPACE (Lagom’s read side tries to create the offsetStore table and user tables in parallel). Note: I have some local changes to serialize these statements but I’m less convinced that they are the cause of the issue.

Thanks for your great help!

Hi @lejoow and @NicolasRouquette,

did you have a chance to test the changes on the PR? I’ve released:

 addSbtPlugin("com.lightbend.lagom" % "lagom-sbt-plugin" % "1.4.13-a5bd71c7")

with those changes so they are more easily tested.

Note this release is not meant for production and was not thoroughly tested since it introduces changes to fix for a race condition we can’t reproduce.

Please let us know what you find.

@ignasi35 thanks so much for this. I will test it next week and revert back to you! Thanks

Well, I can say that I managed to verify that the bug was still present w/ Lagom 1.4.11 with my services (there are 11 + 1 web front end).

One service always failed like before:

2019-05-21 21:29:14,175 [ERROR] from akka.remote.Remoting in services-integrationSessionExecution-application-akka.actor.default-dispatcher-3 - Remoting error: [Startup timed out. This is usually related to actor system host setting or host name resolution misconfiguration.] [
akka.remote.RemoteTransportException: Startup timed out. This is usually related to actor system host setting or host name resolution misconfiguration.
	at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:149)
	at akka.remote.Remoting.start(Remoting.scala:212)
	at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:222)
	at akka.cluster.ClusterActorRefProvider.init(ClusterActorRefProvider.scala:32)
	at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:874)
	at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:870)
	at akka.actor.ActorSystemImpl._start(ActorSystem.scala:870)
	at akka.actor.ActorSystemImpl.start(ActorSystem.scala:891)
	at akka.actor.ActorSystem$.apply(ActorSystem.scala:246)
	at play.api.libs.concurrent.ActorSystemProvider$.start(Akka.scala:187)
	at play.api.libs.concurrent.ActorSystemProvider$.start(Akka.scala:149)
	at com.lightbend.lagom.scaladsl.server.ActorSystemProvider$.start(LagomApplicationLoader.scala:264)
	at com.lightbend.lagom.scaladsl.server.LagomApplication.actorSystem$lzycompute(LagomApplicationLoader.scala:238)
	at com.lightbend.lagom.scaladsl.server.LagomApplication.actorSystem(LagomApplicationLoader.scala:237)
	at play.api.BuiltInComponents.tempFileReaper(Application.scala:334)
	at play.api.BuiltInComponents.tempFileReaper$(Application.scala:334)
	at play.api.BuiltInComponentsFromContext.tempFileReaper$lzycompute(ApplicationLoader.scala:122)
	at play.api.BuiltInComponentsFromContext.tempFileReaper(ApplicationLoader.scala:122)
	at play.api.BuiltInComponents.tempFileCreator(Application.scala:335)
	at play.api.BuiltInComponents.tempFileCreator$(Application.scala:335)
	at play.api.BuiltInComponentsFromContext.tempFileCreator$lzycompute(ApplicationLoader.scala:122)
	at play.api.BuiltInComponentsFromContext.tempFileCreator(ApplicationLoader.scala:122)
	at play.api.BuiltInComponents.playBodyParsers(Application.scala:280)
	at play.api.BuiltInComponents.playBodyParsers$(Application.scala:279)
	at play.api.BuiltInComponentsFromContext.playBodyParsers$lzycompute(ApplicationLoader.scala:122)
	at play.api.BuiltInComponentsFromContext.playBodyParsers(ApplicationLoader.scala:122)
	at com.lightbend.lagom.scaladsl.server.LagomServerComponents.lagomServerBuilder(LagomServer.scala:85)
	at com.lightbend.lagom.scaladsl.server.LagomServerComponents.lagomServerBuilder$(LagomServer.scala:85)
	at com.lightbend.lagom.scaladsl.server.LagomApplication.lagomServerBuilder$lzycompute(LagomApplicationLoader.scala:221)
	at com.lightbend.lagom.scaladsl.server.LagomApplication.lagomServerBuilder(LagomApplicationLoader.scala:221)
	at org.open.caesar.service.integrationSessionExecution.CaesarIntegrationSessionExecutionApplication.lagomServer$lzycompute(CaesarIntegrationSessionExecutionLoader.scala:71)
	at org.open.caesar.service.integrationSessionExecution.CaesarIntegrationSessionExecutionApplication.lagomServer(CaesarIntegrationSessionExecutionLoader.scala:70)
	at com.lightbend.lagom.scaladsl.server.LagomApplication.<init>(LagomApplicationLoader.scala:243)
	at org.open.caesar.service.integrationSessionExecution.CaesarIntegrationSessionExecutionApplication.<init>(CaesarIntegrationSessionExecutionLoader.scala:60)
	at org.open.caesar.service.integrationSessionExecution.CaesarIntegrationSessionExecutionLoader$$anon$1.<init>(CaesarIntegrationSessionExecutionLoader.scala:52)
	at org.open.caesar.service.integrationSessionExecution.CaesarIntegrationSessionExecutionLoader.loadDevMode(CaesarIntegrationSessionExecutionLoader.scala:52)
	at com.lightbend.lagom.scaladsl.server.LagomApplicationLoader.load(LagomApplicationLoader.scala:76)
	at play.core.server.LagomReloadableDevServerStart$$anon$1.$anonfun$get$5(LagomReloadableDevServerStart.scala:176)
	at play.utils.Threads$.withContextClassLoader(Threads.scala:21)
	at play.core.server.LagomReloadableDevServerStart$$anon$1.$anonfun$get$3(LagomReloadableDevServerStart.scala:173)
	at scala.Option.map(Option.scala:146)
	at play.core.server.LagomReloadableDevServerStart$$anon$1.$anonfun$get$2(LagomReloadableDevServerStart.scala:149)
	at scala.util.Success.flatMap(Try.scala:247)
	at play.core.server.LagomReloadableDevServerStart$$anon$1.$anonfun$get$1(LagomReloadableDevServerStart.scala:147)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:654)
	at scala.util.Success.$anonfun$map$1(Try.scala:251)
	at scala.util.Success.map(Try.scala:209)
	at scala.concurrent.Future.$anonfun$map$1(Future.scala:288)
	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:29)
	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:29)
	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)
	at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1402)
	at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
	at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
	at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
	at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:255)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:259)
	at scala.concurrent.Await$.$anonfun$result$1(package.scala:215)
	at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
	at scala.concurrent.Await$.result(package.scala:142)
	at akka.remote.Remoting.start(Remoting.scala:195)
	... 54 more

And another one failed too, like before:

2019-05-21 21:29:36,266 [ERROR] from akka.actor.OneForOneStrategy in services-integrationSessions-application-akka.actor.default-dispatcher-21 - Ask timed out on [Actor[akka://services-integrationSessions-application/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare#1476725431]] after [20000 ms]. Message of type [com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$] was sent by [Actor[akka://services-integrationSessions-application/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare#1476725431]]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://services-integrationSessions-application/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare#1476725431]] after [20000 ms]. Message of type [com.lightbend.lagom.internal.persistence.cluster.ClusterStartupTaskActor$Execute$] was sent by [Actor[akka://services-integrationSessions-application/user/cassandraOffsetStorePrepare-singleton/singleton/cassandraOffsetStorePrepare#1476725431]]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
	at akka.pattern.PromiseActorRef$.$anonfun$defaultOnTimeout$1(AskSupport.scala:635)
	at akka.pattern.PromiseActorRef$.$anonfun$apply$1(AskSupport.scala:650)
	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
	at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:870)
	at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:109)
	at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:103)
	at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:868)
	at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.executeBucket$1(LightArrayRevolverScheduler.scala:279)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.nextTick(LightArrayRevolverScheduler.scala:283)
	at akka.actor.LightArrayRevolverScheduler$$anon$3.run(LightArrayRevolverScheduler.scala:235)
	at java.lang.Thread.run(Thread.java:748)

Same code, clean, compile using lagom 1.4.13-a5bd71c7. runAll ran just fine; none of these exceptions.

Good job!

Sorry for the delay in confirming your excellent work!

  • Nicolas.
1 Like