Shard region not getting registered to coordinator

We are using persistent actors to store actor state in PostgreSQL. We enabled the rememberEntities feature (with persistence as the state store for maintaining shard data) so that all actors stay in memory. After enabling this feature, the shard coordinator pod got terminated due to a memory spike. After this, the cluster forms, but the shard region is not able to register with the coordinator.
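
Roughly, the sharding settings we enabled look like this (a simplified sketch, not our full config):

akka.cluster.sharding {
    # remember started entities and restart them automatically after a rebalance or restart
    remember-entities = on
    # keep shard/coordinator state in the persistence journal instead of distributed data
    state-store-mode = "persistence"
}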

We are getting the following errors continuously, and no events are getting processed:
WARNING : Trying to register to coordinator at [ActorSelection[Anchor(akka://actor-system/), Path(/system/sharding/ActorSystemCoordinator/singleton/coordinator)]], but no acknowledgement. Total [3] buffered messages. [Coordinator [Member(address = akka://actor-system@ip:port, status = Up)] is reachable.]
ERROR : Exception in receiveRecover when replaying event type
[akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number [12980] for persistenceId [/sharding/DeviceActorCoordinator].
Shard [-20] not allocated: State(Map())

As of now, we have disabled rememberEntities and the shard state store to make the cluster stable.
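
Concretely, "disabled" here means we went back to roughly the 2.6 defaults:

akka.cluster.sharding {
    remember-entities = off
    # back to the default distributed-data based state store
    state-store-mode = "ddata"
}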

Akka version used: 2.6.6
Split brain resolver: akka.cluster.sbr.SplitBrainResolverProvider

Created this as an issue on GitHub (Shard region not getting registered to coordinator · Issue #30154 · akka/akka · GitHub).

The cluster sharding state in the persistence store got corrupted somehow, and it is not recovering after this.

It is not possible to enable remember entities in a rolling upgrade, so maybe trying to do that caused the problem?

It is probably best to stop the cluster, clear out the cluster sharding state from your database and start the cluster anew with remember entities enabled.

@johanandren In our case, rememberEntities works fine with a rolling update. The state gets corrupted when the pod does not shut down gracefully.

It is probably best to stop the cluster, clear out the cluster sharding state from your database and start the cluster anew with remember entities enabled.

We cannot do this every time we face this issue, right? If we do this, then the entities created before the restart will not be remembered until we get a request for those entities.

Is there a way to make the cluster self-heal in these scenarios?

If you can reproduce this not while doing a rolling upgrade that enables remember entities, but during normal operations after remember entities was enabled following a full cluster stop, then it is a bug and we are interested in details about the steps to repeat it and, if possible, a minimal reproducer project.

If it happens during, or because of, a rolling upgrade where the new version enables remember entities, that is not expected to work safely.

@johanandren
Sample project: https://github.com/GoushikaaMoorthi/akka-cluster-shard.git

Steps to reproduce the issue:

  1. Create a large number of actors
  2. Send a large number of requests
  3. Terminate the shard-coordinator pod (not a graceful shutdown) immediately after sending the requests
  4. Repeat steps 2 and 3 until the “Trying to register to coordinator” logs appear

I noticed now that you are running Akka 2.6.6, which is quite old; in 2.6.7 we did a considerable rework of the remember entities implementation. Can you try with the latest Akka (2.6.14) and see if you can repeat the problem there as well?
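
For reference, after that rework the remembered entity ids have a separate store setting; roughly like this (do check the reference config of the version you end up on):

akka.cluster.sharding {
    remember-entities = on
    # coordinator state store; ddata is the recommended default
    state-store-mode = "ddata"
    # where the remembered entity ids themselves are kept: "ddata" or "eventsourced"
    remember-entities-store = "eventsourced"
}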

@johanandren
I am facing the exact same issue, but with Akka version 2.6.17.
We can replicate the issue exactly as described by @Goushikaa.

The Akka config I use:

akka {
    ### extensions ###
    extensions = [akka.persistence.Persistence]

    ### persistence ###
    persistence {

        journal {
            plugin = "akka-contrib-mongodb-persistence-journal"
            auto-start-journals = ["akka-contrib-mongodb-persistence-journal"]
        }
    }

    contrib.persistence.mongodb {
        casbah {
            mongo.journal-write-concern = "Acknowledged"
            #this setting is specific to azure see http://docs.mlab.com/connecting/#known-issues
            maxidletime = 60seconds
            socketkeepalive = true
        }

        mongo {
            database = "xxxxxx"
            journal-write-concern = "Acknowledged"
        }
    }

    loggers = ["akka.event.slf4j.Slf4jLogger"]
    ### logging ###
    loglevel = "INFO"

    ### actor ###
    actor {
        provider = "cluster"

        serializers {
            kryo = "com.twitter.chill.akka.AkkaSerializer"
        }

        serialization-bindings {
            "com.projectpackage.BaseMessage" = kryo
        }
    }

    ### remoting ###
    remote {
        log-remote-lifecycle-events = on
        netty.tcp {
            hostname = ${clustering.ip}
            port = ${clustering.port}
        }
        artery.canonical {
            hostname = ${clustering.ip}
            port = ${clustering.port}
        }
    }

    ### cluster ###
    cluster {
        seed-nodes = [
            "akka://"${clustering.cluster.name}"@"${clustering.seed-ip-1}":"${clustering.seed-port}
            "akka://"${clustering.cluster.name}"@"${clustering.seed-ip-2}":"${clustering.seed-port}
        ]
        downing-provider-class = "akka.cluster.sbr.SplitBrainResolverProvider"
        split-brain-resolver {
            active-strategy = keep-oldest
            keep-oldest {
                # Enable downing of the oldest node when it is partitioned from all other nodes
                down-if-alone = on

                # if the 'role' is defined the decision is based only on members with that 'role',
                # i.e. using the oldest member (singleton) within the nodes with that role
                role = ""
            }
        }
    }

    # these lines are inside the akka { } block, so the "akka." prefix must not be repeated
    actor.default-dispatcher.default-executor.fallback = "thread-pool-executor"
    persistence.max-concurrent-recoveries = 50
}

19cy-dispatcher {
    mailbox-type = "com.projectpackage.EventsPriorityMailBox"
}

clustering {
    ip = "127.0.0.1"
    ip = ${?CLUSTER_IP}
    port = 2551
    port = ${?CLUSTER_PORT}
    seed-ip-1 = "127.0.0.1"
    seed-ip-1 = ${?CLUSTER_IP}
    seed-ip-1 = ${?SEED_1_IP}
    seed-ip-2 = "127.0.0.1"
    seed-ip-2 = ${?CLUSTER_IP}
    seed-ip-2 = ${?SEED_2_IP}
    seed-port = 2551
    seed-port = ${?SEED_PORT}
    cluster.name = DcmSysActor
}


@Goushikaa did you find any workaround or solution for this?

Thanks