In a singleton actor, the scheduler's schedule method stops running after 1 or 2 runs

Akka version is 2.5.23

From the singleton actor we are calling the scheduler's schedule method to send a self message every hour.

getContext().getSystem().scheduler().schedule(initialDelay, interval, self(), CronActivity.WAKEUP, executionContext, ActorRef.noSender())

The executionContext is the default one.

After 1 or 2 runs the scheduling stops working; please help with this.

Also, I am not sure what executionContext means in the context of the schedule method. Does it mean that the thread pool of the executionContext will be used while sending the message?
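For context, the role of such an executor can be illustrated with a plain JDK scheduled executor: the pool you pass in is where the scheduled task itself runs (a stdlib sketch of the general idea, not Akka's internals):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SchedulerThreadDemo {
    public static void main(String[] args) throws InterruptedException {
        // This pool plays the role of the executionContext argument:
        // the scheduled task runs on one of its threads.
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "scheduler-pool"));

        AtomicReference<String> taskThread = new AtomicReference<>();
        CountDownLatch done = new CountDownLatch(1);

        scheduler.schedule(() -> {
            // In the Akka case this task would be the "tell the message to self()" step;
            // the message is then processed on the receiving actor's own dispatcher.
            taskThread.set(Thread.currentThread().getName());
            done.countDown();
        }, 10, TimeUnit.MILLISECONDS);

        done.await();
        System.out.println("task ran on: " + taskThread.get()); // scheduler-pool
        scheduler.shutdown();
    }
}
```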

The singleton can move between nodes on a cluster topology change (a rolling upgrade, or something else stopping the oldest node). The scheduler is a node-local facility, so if that node stops, the scheduled task is lost. Perhaps that is the explanation.

If not: Akka 2.5.23 was released in mid 2019 and the entire 2.5 line reached end-of-life back in 2020, so one thing to try would be to upgrade to a newer version and see if this is something that has been fixed since then. A lot of work has gone into Akka since.

We had shipped a release moving all IO operations to a different thread pool, and by mistake we used this blockingExecutionContext for scheduling the self message to the singleton. The singleton's scheduled message did not trigger, so I thought the issue was that there were no available threads to send the self message.
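That starvation theory can be reproduced with plain JDK executors: if every thread in the pool the scheduler hands its task to is blocked, the scheduled "send" simply waits until a thread frees up (a stdlib sketch under that assumption, not Akka code):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class StarvedSchedulerDemo {
    public static void main(String[] args) throws Exception {
        // A single-threaded pool standing in for an exhausted blockingExecutionContext.
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch blocker = new CountDownLatch(1);

        long start = System.nanoTime();
        // Occupy the only thread, like a long-running blocking IO task.
        pool.submit(() -> {
            try { blocker.await(); } catch (InterruptedException ignored) { }
        });
        // Scheduled for 50 ms, but it cannot run until the blocking task finishes.
        ScheduledFuture<Long> scheduled =
            pool.schedule(() -> System.nanoTime() - start, 50, TimeUnit.MILLISECONDS);

        Thread.sleep(300);   // keep the pool starved for ~300 ms
        blocker.countDown(); // release the blocking task

        long elapsedMs = scheduled.get() / 1_000_000;
        System.out.println("delayed beyond 50ms: " + (elapsedMs > 200));
        pool.shutdown();
    }
}
```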

Later we shipped a fix release making the singleton's schedule use the default executionContext. In this case the singleton's schedule did work 2 times, then stopped for 3 hours.

We then shipped a release with all changes reverted, and now the singleton's schedule is working fine.

We have 8 Akka seed nodes on 8 bare-metal machines; the release steps are to stop and then start the application on each node, one by one.

So do you also feel that I was just unlucky, and that after scheduling, the singleton got moved to a different node?

Stopping the app node by node means that you could have only one move of the singleton, if you stop the nodes so that the oldest cluster node is the last one you stop. The worst-case scenario is stopping them from oldest to youngest, having the singleton migrate between each of the nodes.

To be sure a scheduled task is triggered, you would need to either be able to calculate the time until the next scheduled task, or persist the fact that it should happen, and always reschedule on singleton actor start.
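Calculating the time until the next run can be done with java.time; the resulting Duration would then be used as the initialDelay when rescheduling in the singleton's preStart (a sketch assuming the hourly schedule mentioned earlier; the top-of-the-hour alignment is an illustrative assumption):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

public class NextRunDelay {
    /** Delay from 'now' until the next top of the hour. */
    static Duration untilNextHour(LocalDateTime now) {
        LocalDateTime nextHour = now.truncatedTo(ChronoUnit.HOURS).plusHours(1);
        return Duration.between(now, nextHour);
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2024, 1, 1, 10, 45, 30);
        // 14 minutes 30 seconds until 11:00:00 = 870 seconds
        System.out.println(untilNextHour(now).getSeconds()); // 870
    }
}
```

Because the delay is recomputed from the clock on every start, a singleton that moves to a new node still fires at the right wall-clock time instead of restarting its full interval.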

I did observe that after the whole release process the singleton was somehow on the same node pre and post release; there was one singleton jump, but it came back to the pre-release singleton node.

Even now we reschedule every time on singleton actor start, but we have a boolean variable in the singleton to check if the singleton is already present (if yes, we do not reschedule, assuming it is already scheduled).

Since the singleton lives on the oldest node, doing a rolling upgrade that is not from youngest to oldest means that if there are old nodes left when you stop the oldest, the singleton will jump to another node running the old version of your application.

If you are on Kubernetes we have recently added functionality to akka-management that makes sure rolling upgrades happen in an optimal order: Rolling Updates • Akka Management

Still, I am confused how the singleton moved back to the pre-release singleton node.

Bare metal instances: 1, 2, 3, 4, 5, 6, 7, 8

Pre release singleton: 5
During release singleton: 5 → 8 → 5
Post release singleton: 5

In the above scenario, if there is a boolean variable in the singleton class, let's say isSystemUp, will it ever reset after any number of deployments?

Also, is there any documentation around the constructor, preStart and postStop of a singleton?


The actor you run as a singleton will have a new instance created, on a new JVM cluster node, each time the oldest node in the cluster stops (and a newer node, still in the cluster, becomes the oldest node in the cluster).

Being the oldest node in the cluster cannot happen twice, with another node in between, for the same actor system; a system stops being oldest when it shuts down.

A new system started on the same server/IP could become oldest in the cluster and host the singleton again, though. But since it is a new JVM/system, you would not have state from the previous JVM/system, unless you have persisted it to some storage/database and read it back on start.
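The consequence for a field like the isSystemUp boolean mentioned earlier can be seen with any plain Java class: each move of the singleton creates a fresh instance, so instance fields go back to their defaults (a minimal sketch, not Akka code):

```java
public class SingletonFieldDemo {
    // Stand-in for the singleton actor class from this thread.
    static class MySingleton {
        boolean isSystemUp = false; // default value on every new instance

        void start() { isSystemUp = true; }
    }

    public static void main(String[] args) {
        MySingleton onNode5 = new MySingleton();
        onNode5.start();
        System.out.println("before move: " + onNode5.isSystemUp);

        // The singleton moves: the old instance is stopped and a new one is
        // created on the new oldest node. Nothing is carried over automatically.
        MySingleton onNode8 = new MySingleton();
        System.out.println("after move: " + onNode8.isSystemUp);
    }
}
```

This is why guarding the reschedule with such a flag is unreliable: the flag is reset on every singleton move, while a schedule persisted outside the JVM would not be.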

Okay, thanks. I also wanted to check if the deployment strategy below will work instead of the Kubernetes rolling-updates one.

Stop and then start all nodes (except the oldest) with the new release, one by one; finally, stop the oldest and start it with the new release.

Yes, that would minimise the number of times the singleton has to move between nodes (one time).

Okay thanks, will go with this deployment strategy.

But is it correct to say the singleton will move only 1 time? Or may the singleton move a second time, back to the node that hosted the singleton before the release, since the IP and server are the same for all nodes?

Yes, only one time, if you keep the oldest around and stop it last. The oldest node is the first of the current members to have joined the cluster; it is not really related to what IP address the node has.

Okay thanks… will try it out.

I tried this out in our latest release.

Release change: removal of the cassandra journal config from the conf file (we have not been using it for a very long time).

Faced this error on a few nodes: Shard region not getting registered to coordinator · Issue #30154 · akka/akka · GitHub ("Trying to register to coordinator at …, but no ack").

To resolve this, I had to kill all Akka nodes and then make a fresh deployment.

So was this due to the change in the conf file, and the Akka nodes running with different settings for some time?

When is it safe to perform a deployment by killing all nodes?

@johanandren please guide.

The linked issue was likely about the next part of the error message, where replaying remember-entities state saw an unexpected event; it would be logged as Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number [12980] .... We haven't seen more reports of that, so if you saw it, it would be interesting to know how you ended up with that stored in the journal. It should not be possible to resolve it by restarting the cluster, as sharding will try to start that persistent actor again and would see the same error each time.

The shard coordinator not being able to register, on the other hand, can happen for a multitude of reasons: network issues, not starting sharding correctly on all the nodes, the home node of the singleton being overloaded. You should look at logs from all nodes, especially the oldest node at that point in time, and see if you can find why the coordinator did not start/run there.

When is it safe to perform a deployment by killing all nodes?

In general, with Akka that is always safe, but it is possible to build services with your own logic that cannot handle a shutdown (e.g. storing important state only in memory, or storing references to actors that are no longer there after a restart).

I am able to see only the WARN Trying to register to coordinator but no acknowledgement. error, not the 'Exception in receiveRecover' one.

On the app side, the change was removing the config below (and related config), and after a fresh cluster deployment the problem got resolved; I am not sure why.

persistence {
    max-concurrent-recoveries = 100
    journal {
      plugin = "cassandra-journal"
    }
    snapshot-store {
      plugin = "cassandra-snapshot-store"
    }
    journal-plugin-fallback {
      plugin-dispatcher = "akka.persistence.dispatchers.default-plugin-dispatcher"
      replay-dispatcher = "akka.persistence.dispatchers.default-replay-dispatcher"
      circuit-breaker {
        max-failures = 20
        call-timeout = 20s
        reset-timeout = 3s
      }
      recovery-event-timeout = 60s
    }
    snapshot-store-plugin-fallback {
      plugin-dispatcher = "akka.persistence.dispatchers.default-plugin-dispatcher"
      circuit-breaker {
        max-failures = 20
        call-timeout = 20s
        reset-timeout = 3s
      }
      recovery-event-timeout = 60s
    }
}
That config change should be unrelated; unless you use remember entities, that is the only part of sharding that relies on persistence for storing state. But if it was that, it would be permanently broken once you removed the journal config.
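For reference, whether sharding touches the journal is governed by settings along these lines (a config sketch; check the actual values in your own configuration, as the defaults differ between Akka versions):

```
akka.cluster.sharding {
  # With remember-entities = on, sharding persists entity start/stop
  # events through the configured akka-persistence journal.
  remember-entities = off
  # The coordinator's state store can also be persistence-backed:
  state-store-mode = "ddata"   # or "persistence"
}
```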

I have checked the logs (including the oldest node) and am not able to find any error, just the specified WARN message.

So I wanted to check: is this WARN message just a warning, and would it have auto-recovered?