Cassandra: "Column family ID mismatch" at startup

(Nicolas Rouquette) #1

I’ve seen a report about similar errors here: Lagom errors on startup

My situation is different.

My Lagom 1.4.11 application has about a dozen microservices in Scala 2.12, each of which is fairly simple.
Unlike the other case, there is no code that creates tables in Cassandra. There are also no read-side processors. It’s just vanilla service descriptors. All interactions with Cassandra are due to Lagom’s persistence.

I use external Cassandra 3.11.4 and Kafka services.

So, when I execute sbt runAll, lagom starts the service locator + all my micro services; nothing else.

With a clean cassandra server, sbt runAll results in several “Column family ID mismatch” errors in the cassandra log. From the various reports about this error, it seems to be related to vulnerabilities in concurrent schema operations in cassandra; an on-going topic here: https://issues.apache.org/jira/browse/CASSANDRA-9424

On my laptop (Dell 7530, Ubuntu 18.04 LTS, Xeon E2186M, 64 GB RAM), there’s enough horsepower to start a dozen microservices nearly simultaneously. Since each of them needs to initialize a schema in Cassandra, this may indeed put some stress on a single-node Cassandra deployment that is also running on that machine. At least, this is what my experience seems to suggest.

Recently, I wrote a simple sbt task to start all my micro-services sequentially with Def.sequential(lagomRun in project1, lagomRun in project2, ....).

At least, this avoids stressing cassandra during schema initialization.
What’s annoying though is that I can’t write: Def.sequential(lagomServiceLocatorStart, lagomRun in project1, ....) as SBT says that lagomServiceLocatorStart is undefined.

Indeed, looking in the LagomPlugin source code, this task key is set only in the private project, lagom-internal-meta-project-service-locator. Ok, I have to manually issue lagomServiceLocatorStart and then my sequential task. At least, this makes for a somewhat clean start in dev mode.
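
For reference, the sequential task looks roughly like the sketch below. The project names (project1, project2) and the task key name runAllSequentially are placeholders for illustration, not my actual build, and the service locator still has to be started by hand first.

```scala
// build.sbt (sketch). `project1`, `project2` and `runAllSequentially`
// are placeholder names; substitute the real service projects.
lazy val project1 = (project in file("project1")).enablePlugins(LagomScala)
lazy val project2 = (project in file("project2")).enablePlugins(LagomScala)

lazy val runAllSequentially =
  taskKey[Unit]("Start the Lagom services one after the other")

runAllSequentially := {
  // Def.sequential invokes each task only after the previous one has
  // returned, so the services are not all started at the same time.
  val _ = Def.sequential(
    lagomRun in project1,
    lagomRun in project2
  ).value
}
```

In the sbt shell I first run lagomServiceLocatorStart and then runAllSequentially. Note that this only orders the calls to lagomRun; it does not by itself guarantee that a service has finished its Cassandra schema initialization before the next one starts.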

Surprisingly, I didn’t use to experience this problem. Is this a scale issue?
I’m not sure. One other factor that may be relevant: recently, I turned on logging for each microservice.
Before, everything went to the sbt output; that might have indirectly forced some sequencing of the microservices’ startup. With logging enabled, it seems that there is more concurrency at startup; perhaps enough to trip some vulnerabilities in Cassandra.

I’m wondering whether others have experienced this kind of problem.

For production, I’m wondering about doing something similar, that is, forcing the microservices to start one after the other just to prevent concurrent schema operations on Cassandra. Once all schemas are created, I’m not worried about concurrent operations hitting Cassandra, because that is well-known territory, whereas schema initialization is trickier, as indicated by the ongoing issue.

  • Nicolas.
(Joo) #2

@NicolasRouquette thanks for this, Nicolas. I am on the same Lagom version as you and yes, we are seeing the same problem, even with services that do not use any custom read-side tables (i.e. no CreateTables in globalPrepare).

I experienced this issue in my previous project as well, and still couldn’t figure out what exactly is wrong. I guess I will try your sequential deployment and see whether the issue still occurs.

What I found was that when I deployed the services without any custom read-side tables, the column family exception started as soon as Lagom was populating the eventsbytag1 materialized view (along with the MAT-View-not-safe logging).

I tried it many times and hit the error on almost every attempt.

@TimMoore Just tagging you for your attention, I think quite a few Lagom users are experiencing and suffering from this Column Family Exception in Cassandra.

(Joo) #3

Some logs…

This is from the service that does NOT create any custom read-side table:

This happened in the middle of operation, when I started issuing some mutating commands to this service about 20 minutes after the deployment had completed. You can see that the service tries to create the snapshots table ad hoc, like this:

INFO [MigrationStage:1] 2019-04-22 09:04:05,438 ColumnFamilyStore.java:411 - Initializing fee.snapshots

[Native-Transport-Requests-2] 2019-04-22 09:04:05,447 MigrationManager.java:376 - Create new table: org.apache.cassandra.config.CFMetaData@6b2bf1fe[cfId=948f2570-64dd-11e9-947d-038f8cea8ac5,ksName=fee,cfName=snapshots,flags=[COMPOUND],params=TableParams{comment=, read_repair_chance=0.0, dclocal_read_repair_chance=0.1, bloom_filter_fp_chance=0.01, crc_check_chance=1.0, gc_grace_seconds=864000, default_time_to_live=0, memtable_flush_period_in_ms=0, min_index_interval=128, max_index_interval=2048, speculative_retry=99PERCENTILE, caching={'keys' : 'ALL', 'rows_per_partition' : 'NONE'}, compaction=CompactionParams{class=org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy, options={max_threshold=32, min_threshold=4, tombstone_threshold=0.2, bucket_high=1.5, bucket_low=0.5, tombstone_compaction_interval=86400, unchecked_tombstone_compaction=false, min_sstable_size=50, enabled=true}}, compression=org.apache.cassandra.schema.CompressionParams@cc702f3c, extensions={}, cdc=false},comparator=comparator(org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.LongType)),partitionColumns=[[] | [meta meta_ser_id meta_ser_manifest ser_id ser_manifest snapshot snapshot_data timestamp]],partitionKeyColumns=[persistence_id],clusteringColumns=[sequence_nr],keyValidator=org.apache.cassandra.db.marshal.UTF8Type,columnMetadata=[meta_ser_manifest, persistence_id, ser_manifest, sequence_nr, meta, ser_id, snapshot_data, timestamp, snapshot, meta_ser_id],droppedColumns={},triggers=[],indexes=[]]

And then almost immediately, within 0.12 sec, this starts to happen:

(Joo) #4

Could it be related to this Cassandra issue?

https://issues.apache.org/jira/browse/CASSANDRA-9424

Also, is there a way to create the snapshot table during startup instead of letting Lagom create it when it needs to?