The singleton can move between nodes when the cluster topology changes (a rolling upgrade, or anything else that stops the oldest node). The scheduler is a node-local facility, so if that node stops, the scheduled task is lost with it. Perhaps that is the explanation.
If not: Akka 2.5.23 was released in mid-2019 and the entire 2.5 line reached end-of-life back in 2020, so one thing to try would be to upgrade to a newer version and see if this is something that has been addressed by a fix since then; a lot of work has gone into Akka since back then.
We had shipped a release that moved all IO operations onto a separate thread pool, and by mistake we used this blockingExecutionContext for scheduling the self message to the singleton. The singleton's scheduled message did not trigger, so I thought the issue was that there were no available threads to send the self message.
Later we shipped a fix release that made the singleton schedule use the default execution context; in that case the singleton schedule fired twice and then stopped for 3 hours.
We then shipped another release with all the changes reverted, and now the singleton schedule is working fine.
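For illustration, a minimal sketch of the kind of self-message scheduling described above, using a classic Akka actor (the actor, message name, and interval are made up, not our actual code). The execution context passed to the scheduler is what runs the send, so a starved blocking pool there can delay the tick indefinitely:

```scala
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import akka.actor.Actor

case object Tick

class SingletonActor extends Actor {
  // The execution context given to scheduleOnce runs the task that sends the
  // message; use the default dispatcher here rather than a blocking IO pool.
  implicit val ec: ExecutionContext = context.dispatcher

  override def preStart(): Unit =
    context.system.scheduler.scheduleOnce(1.hour, self, Tick)

  def receive: Receive = {
    case Tick =>
      // do the periodic work, then schedule the next tick
      context.system.scheduler.scheduleOnce(1.hour, self, Tick)
  }
}
```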
We have 8 Akka seed nodes on 8 bare-metal machines. Our release procedure is to stop and then start the application on each node, one node at a time.
So do you also think I was just unlucky, and that after scheduling, the singleton got moved to a different node?
Stopping the app node by node means that you could have only one move of the singleton, if you order the stops so that the oldest cluster node is the last one you stop. The worst-case scenario is stopping them from oldest to youngest, which has the singleton migrate between each of the nodes.
To be sure a scheduled task is triggered, you would need to either be able to calculate the time until the next scheduled task, or persist the fact that it should happen, and always reschedule when the singleton actor starts.
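A minimal sketch of that idea, assuming a classic actor and a hypothetical ScheduleStore that persists the next planned run time in a database:

```scala
import java.time.Instant
import scala.concurrent.ExecutionContext
import scala.concurrent.duration._
import akka.actor.Actor

// Hypothetical storage abstraction for the next planned run; a real
// implementation would read and write a database table or similar.
trait ScheduleStore {
  def loadNextRun(): Instant
  def saveNextRun(at: Instant): Unit
}

case object RunTask

// On every start (including after the singleton has moved to a new node),
// compute the remaining delay from the persisted state and reschedule.
class ScheduledSingleton(store: ScheduleStore, interval: FiniteDuration) extends Actor {
  implicit val ec: ExecutionContext = context.dispatcher

  override def preStart(): Unit = {
    val remainingMs =
      java.time.Duration.between(Instant.now(), store.loadNextRun()).toMillis.max(0L)
    context.system.scheduler.scheduleOnce(remainingMs.millis, self, RunTask)
  }

  def receive: Receive = {
    case RunTask =>
      // perform the actual work here, then persist and schedule the next run
      store.saveNextRun(Instant.now().plusMillis(interval.toMillis))
      context.system.scheduler.scheduleOnce(interval, self, RunTask)
  }
}
```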
I did observe that after the whole release process the singleton was somehow on the same node before and after the release; there was one singleton jump, but it came back to the pre-release singleton node.
Even now we reschedule every time the singleton actor starts, but we have a boolean variable in the singleton to check whether the schedule is already present (if yes, we do not reschedule, assuming it is already scheduled).
Since the singleton lives on the oldest node, doing a rolling upgrade that is not ordered from youngest to oldest means that, if there are old nodes left when you stop the oldest, the singleton will jump to another node still running the old version of your application.
If you are on Kubernetes, we have recently added functionality to akka-management that makes sure rolling updates happen in an optimal order: Rolling Updates • Akka Management
The actor you run as a singleton will have a new instance created, on a new JVM/cluster node, each time the oldest node in the cluster stops (and a newer node, still in the cluster, becomes the new oldest).
Being the oldest node in the cluster cannot happen twice, with another node in between, for the same actor system; the system stops being the oldest when it shuts down.
A new system started on the same server/IP could become the oldest in the cluster and host the singleton again, but since it is a new JVM/system it would not have any state from the previous JVM/system, unless you have persisted that state to some storage/database and read it on start.
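For reference, a minimal sketch of how such a singleton is typically started with the classic Cluster Singleton API (this reuses the hypothetical ScheduledSingleton from the sketch above; the actor name is made up):

```scala
import scala.concurrent.duration._
import akka.actor.{ActorSystem, PoisonPill, Props}
import akka.cluster.singleton.{ClusterSingletonManager, ClusterSingletonManagerSettings}

object SingletonSetup {
  // Every node starts the ClusterSingletonManager, but the singleton actor
  // itself only runs on the oldest node; a fresh instance is created on the
  // new oldest node whenever the current oldest leaves the cluster.
  def start(system: ActorSystem, store: ScheduleStore): Unit =
    system.actorOf(
      ClusterSingletonManager.props(
        singletonProps = Props(new ScheduledSingleton(store, interval = 1.hour)),
        terminationMessage = PoisonPill,
        settings = ClusterSingletonManagerSettings(system)),
      name = "scheduled-singleton")
}
```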
Yes, only one move if you keep the oldest node around and stop it last. The oldest node is the one among the current members that joined the cluster first; it is not really related to what IP address the node has.
The linked error was likely about the next part of the error message, where replaying the remember-entities state hit some unexpected event; that would be logged as Exception in receiveRecover when replaying event type [akka.cluster.sharding.ShardCoordinator$Internal$ShardHomeDeallocated] with sequence number .... We haven't seen more reports of that, so if you did see it, it would be interesting to know how you ended up with that stored in the journal. It should not be possible to resolve by restarting the cluster, as sharding will try to start that persistent actor again and would see the same error each time.
The shard coordinator not being able to register, on the other hand, can happen for a multitude of reasons: network issues, not starting sharding correctly on all the nodes, or the node hosting the singleton being overloaded. You should look at the logs from all nodes, especially the oldest node at that point in time, and see if you can find why the coordinator did not start/run there.
When is it safe to perform a deployment by killing all nodes?
In general with Akka that is always safe, but it is possible to build services with your own logic that cannot handle a full shutdown (for example, only storing important state in memory, or keeping references to actors that are no longer there after the restart).
That config change should be unrelated unless you use remember entities; that is the only part of sharding that relies on persistence for storing state. But if that were the cause, it would be permanently broken once you removed the journal config.
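For context, a rough sketch of the sharding settings involved (illustrative values, only relevant if remember entities is actually in use; shown here as a parsed config snippet):

```scala
import com.typesafe.config.ConfigFactory

// Remember entities is the one sharding feature that stores state through the
// persistence journal when state-store-mode is set to persistence (Akka 2.5).
val shardingConfig = ConfigFactory.parseString(
  """
  akka.cluster.sharding {
    remember-entities = on
    state-store-mode = persistence
  }
  """)
```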