Can Coordinator histories be blown away?

Context: my application (Querki) is one of the older production Akka applications (once upon a time, I was Lightbend’s smallest customer), and is currently running on an unforgivably ancient version of the Akka stack. I’m starting to upgrade that, but in the meantime I just had a major production crash – the system is completely down, so I’m trying to puzzle out how to get it back up again.

The errors in the logs manifest in two forms, one like this:

[akka.tcp://querki-server-2@10.64.6.107:10007/system/cassandra-journal/$f] Invalid replayed event [sequenceNr=5126, writerUUID=aa42b54e-1ee0-4f6f-b753-e504ec1cea19] from a new writer. An older writer already sent an event [sequenceNr=5126, writerUUID=5ad395b7-6928-425c-a091-3dc774555921] whose sequence number was equal or greater for the same persistenceId [/sharding/IdentityCacheCoordinator]. Perhaps, the new writer journaled the event out of sequence, or duplicate persistentId for different entities?

and the other like this:

[akka.tcp://querki-server-2@10.64.6.107:10007/system/sharding/IdentityCacheCoordinator/singleton/coordinator] Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number [5128] for persistenceId [/sharding/IdentityCacheCoordinator].

From digging around, it sounds like the problem has to have been a split-brain that corrupted the histories of the Coordinators for four of my sharding regions. (Querki is heavily cluster sharded, with a bunch of different entity types.)

I have no idea how it got that split-brain (my homebrew system tends to be over-conservative specifically to avoid that), but that’s arguably a lesser concern: once I get the stack up to modern snuff, I’ll be switching over to the SBR, now that that has been open-sourced. For now, I’m just trying to get things back up and running.

Based on comments here, I get the impression that it is reasonable, while the system is down, to simply blow away the corrupted histories – that Coordinator histories don’t really matter across full shutdowns. Is this correct? And if so, am I correct that this is probably the easiest way to get things up and running again?

Thanks in advance for any insights you might be able to provide.

Yes, or even better, if you are using a recent Akka version: shut down the cluster, delete the storage, and also switch to the newer, default, Distributed Data based store for shard coordinator state, which does not put anything in a journal (Cluster Sharding • Akka Documentation).

The only scenario where you need a journal/persistent store for sharding nowadays is if you use remember entities, whose state needs to survive a full cluster stop.
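
For what it’s worth, a minimal sketch of that configuration change (double-check the exact settings against the documentation for your Akka version; the actor system name here is just a placeholder):

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    object DdataShardingConfig {
      def main(args: Array[String]): Unit = {
        // Sketch only: keep shard coordinator state in Distributed Data instead
        // of the persistence journal.
        val shardingConfig = ConfigFactory.parseString(
          """
          akka.cluster.sharding.state-store-mode = ddata
          # remember-entities is the one sharding feature that still needs state
          # to survive a full cluster stop.
          akka.cluster.sharding.remember-entities = off
          """)

        // "MyClusterSystem" is a placeholder; fall back to your real application.conf.
        val system = ActorSystem("MyClusterSystem", shardingConfig.withFallback(ConfigFactory.load()))
      }
    }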

Okay, cool. Eventually we’ll be on a modern version of Akka, but this is still (checks) 2.4.18 – not quite the dawn of time, but seriously antique.

Thanks! Now off to learn enough CQL to figure out how to delete the Coordinator histories…

Slow long-term memory access times on my part, but I just remembered we have a tool for cleaning that up: akka/RemoveInternalClusterShardingData.scala at d703a2afe0e0c6dc8f1c0321b0f4652600e2ae99 · akka/akka · GitHub

Oh, neat – that might save me a lot of hassle. Thanks again!

Took a little while to figure out, but that did the trick.

For any future readers who might come across this thread (keeping in mind that this is mainly relevant for older versions of Akka), the recipe I used is:

  • Make sure the application is completely shut down.
  • In an sbt-capable environment that can talk to the production Cassandra cluster, open sbt in your fully-configured project. (That is, it’s easiest if your local application.conf has all the needed connectivity information.)
  • In sbt, do run-main akka.cluster.sharding.RemoveInternalClusterShardingData [entity1] [entity2]..., where each [entity] is the entity type name of one of the fouled-up Coordinators, as in the sketch below. (For example, I was getting errors on my IdentityCacheCoordinator, so the parameter for that was IdentityCache.)
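
For concreteness, a minimal sketch of that invocation (IdentityCache is the only real name here, taken from my errors above; the other type name is just a placeholder for whichever Coordinators are fouled up in your own cluster):

    // From the sbt shell:
    //   run-main akka.cluster.sharding.RemoveInternalClusterShardingData IdentityCache SomeOtherEntityType
    //
    // Or, equivalently, invoke the tool's main method from a tiny wrapper:
    import akka.cluster.sharding.RemoveInternalClusterShardingData

    object CleanCorruptedCoordinators {
      def main(args: Array[String]): Unit =
        RemoveInternalClusterShardingData.main(Array("IdentityCache", "SomeOtherEntityType"))
    }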

That should clear out all of the bollixed-up data, and allow things to boot.

Thanks again for the pointers! Once I understood the necessary bits (TIL that run-main is a thing), it fixed things right up.