Offset stores reset when a Cassandra cluster goes down and then comes back up

Hello Guys,

We recently had a big production issue. All the nodes in our Cassandra cluster were restarted at the same time.

When the Cassandra nodes recovered, very strange behaviour appeared in our Lagom applications: most (if not all) of our active read-side processors had their timeuuidoffset column reset back to the beginning. We verified this by seeing old events in our logs, and also by checking the offsetstore table of each of the Lagom service keyspaces:

SELECT toTimestamp(timeuuidoffset), eventprocessorid, timeuuidoffset FROM <keyspace>.offsetstore;

We verified that the offsets went as far back as 2019.

A lot of our events got replayed, and many of the processors have side effects (sending emails).

Anyway, here’s the summary of what happened:

- lagom services are online
- production C* went down
- revived production C*
- production C* had a lot of commitlogs, processing took a lot of time
- production C* is up
- offset stores are reset

We managed to replicate this issue in our staging environment:

- lagom services are online
- mounted prod disk snapshots to staging cluster
- started staging C*
- staging C* had a lot of commitlogs, processing took a lot of time
- staging C* is up
- offset stores are reset

Initially, we thought it was purely a Cassandra issue, and we were very concerned about the integrity of our messages table.

However, when we did the following:

- lagom services are taken offline
- mounted prod disk snapshots to staging cluster (same snapshot)
- started staging C*
- staging C* had a lot of commitlogs, processing took a lot of time
- staging C* is up
- offset stores are not reset
- lagom services are revived
- offset stores are not reset

Sadly, our Akka logs are set to log level WARN, so we weren't able to see what the application logged when it reconnected to the offset store, or what it did during that time. We will follow up on that (redeploy our apps with the appropriate log level and do another restoration), but it's not something we can do right away.
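For reference, raising the Akka log level should mostly be a matter of something like the following in application.conf (assuming our Logback configuration doesn't also cap the akka logger at WARN):

# application.conf (sketch): let Akka emit INFO-level log events instead of only WARN and above
akka.loglevel = "INFO"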

In this case, is it possible to get pointers on how to debug this?
I created a bare-minimum project to try to replicate the issue, but to no avail.
Here it is:

We’re running on:

- Lagom 1.6.4
- Cassandra 3.11.3

I tried publishing many messages, but I think my dataset is limited. Any insights on how to replicate this issue with minimal code are highly appreciated.

Thanks!

Clyde


Hi @clydeespeno,

I can't remember off the top of my head how

had a lot of commitlogs, processing took a lot of time

appears to a user querying the database, but I suspect the application did not find any offset_store table in the schema and just created a new one (which was empty). As a consequence, events were replayed.

offset_store is a table in the database where each topic producer and read-side processor saves the last event it consumed; when no value is found, Lagom assumes the projection was never run and starts from scratch.
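For reference, the offsetstore table that Lagom creates looks roughly like this (a sketch from memory of CassandraOffsetStore, consistent with the columns in your query; please double-check with DESCRIBE TABLE against one of your keyspaces):

CREATE TABLE IF NOT EXISTS offsetstore (
    eventprocessorid text,
    tag text,
    timeuuidoffset timeuuid,
    sequenceoffset bigint,
    PRIMARY KEY (eventprocessorid, tag)
);

There is one row per projection (read-side processor or topic producer) and tag, and a missing row is what makes Lagom start that projection from scratch.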

I would suggest you disable the flag to auto-create tables so that, if/when you stumble upon the same issue again, the read-side processors won't be as eager.
Still, I suspect disabling table creation will not be enough: I suspect that C* restores the database in a non-transactional way, so as soon as the application sees the offset_store with just a few rows (while the C* restore is still running), the application would run the projections.
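For completeness, disabling the auto-creation would be something along these lines in application.conf (the keys below are from memory, so please verify them against the Lagom 1.6 / akka-persistence-cassandra reference configuration):

# Disable automatic keyspace/table creation (keys from memory, double-check before relying on them)
cassandra-journal.keyspace-autocreate = false
cassandra-journal.tables-autocreate = false
cassandra-snapshot-store.keyspace-autocreate = false
cassandra-snapshot-store.tables-autocreate = false
lagom.persistence.read-side.cassandra.keyspace-autocreate = false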

Summing up, the safest solution would be to stop the application until the C* cluster has fully recovered.

Cheers,


Hi @ignasi35,

Thanks for your response.

Based on our Cassandra logs, it's unlikely that the tables were re-created. The curious thing, though, is that only the active producers and processors were updated.

The WRITETIME of the rows also dates back to when they originally stopped updating, so at least that part (the tables being re-created) seems quite unlikely.
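For reference, this is roughly how we checked it (WRITETIME returns the write timestamp of the column in microseconds since the epoch):

SELECT eventprocessorid, timeuuidoffset,
       toTimestamp(timeuuidoffset) AS offset_time,
       WRITETIME(timeuuidoffset) AS last_written
FROM <keyspace>.offsetstore;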

Indeed, one of the takeaways here is to ensure that Lagom apps are stopped during a C* shutdown.

But still, if we could find out exactly on what conditions this happens, that would be great.

We're quite concerned about situations where there's a sudden network partition between the Lagom apps and C*, and then the network suddenly recovers (hopefully that would not trigger the same behaviour).