ClusterSingletonManager restart on same node causes ClusterSingletonProxy failure


(Oliver Wickham) #1

I have a fairly niche situation in which the codebase I am working on restarts the ClusterSingletonManager (CSM) itself, not just the singleton it manages. This can happen because the CSM's parent has restarted. Importantly, in this case the cluster topology does not change. I expected the ClusterSingletonProxy, which watches the singleton once it has found it, to try to identify the actor again after receiving the Terminated message and then resume forwarding messages. In my tests with this codebase, it does not.

Here is a snippet of code in ClusterSingletonProxy:

  def receive = {
    ...
    case Terminated(ref) ⇒
      if (singleton.contains(ref)) {
        // buffering mode, identification of new will start when old node is removed
        singleton = None
      }

It appears to wait explicitly for a member event before restarting the identification process, which rules out recovery when the ClusterSingletonManager restarts on the same node. Superficially, the code could easily be modified to start identification independently of any cluster membership event. The advantage is that the proxy would be robust to CSM failure. The cost is that, in the general case where the oldest node has gone down, Identify messages would be sent while the topology is still changing.
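
To make the difference concrete, here is a minimal, Akka-free sketch of the two behaviours as a plain state machine. It is not the real ClusterSingletonProxy: `ProxyModel`, `State`, `step`, the `identifyOnTerminated` flag and the refs "csm-1"/"csm-2" are all hypothetical names, and an `ActorIdentity` reply is assumed to arrive only if identification was started.

```scala
// Simplified model of the proxy's state handling; every name here is hypothetical
// and nothing below calls the Akka API.
object ProxyModel {
  sealed trait Event
  final case class Terminated(ref: String) extends Event
  final case class ActorIdentity(ref: Option[String]) extends Event
  case object MemberRemoved extends Event
  final case class Send(msg: String) extends Event

  final case class State(
      singleton: Option[String] = None,
      identifying: Boolean = false,                    // true while Identify requests are being sent
      buffer: Vector[String] = Vector.empty,           // messages held while no singleton is known
      delivered: Vector[(String, String)] = Vector.empty)

  // identifyOnTerminated = false models the behaviour in the snippet above,
  // true models the proposed change.
  def step(identifyOnTerminated: Boolean)(s: State, e: Event): State = e match {
    case Terminated(ref) if s.singleton.contains(ref) =>
      s.copy(singleton = None, identifying = s.identifying || identifyOnTerminated)
    case Terminated(_) => s
    case MemberRemoved => s.copy(identifying = true)
    case ActorIdentity(Some(ref)) if s.identifying =>
      // A reply is only meaningful if a request was sent: adopt the ref and flush the buffer.
      val flushed = s.buffer.map(m => ref -> m)
      s.copy(singleton = Some(ref), identifying = false,
        buffer = Vector.empty, delivered = s.delivered ++ flushed)
    case ActorIdentity(_) => s
    case Send(msg) =>
      s.singleton match {
        case Some(ref) => s.copy(delivered = s.delivered :+ (ref -> msg))
        case None      => s.copy(buffer = s.buffer :+ msg)
      }
  }

  def run(identifyOnTerminated: Boolean, events: Seq[Event], init: State): State =
    events.foldLeft(init)(step(identifyOnTerminated))

  def main(args: Array[String]): Unit = {
    // CSM restarts on the same node: no MemberRemoved is ever seen.
    val events = Seq(Terminated("csm-1"), Send("a"), ActorIdentity(Some("csm-2")), Send("b"))
    val init = State(singleton = Some("csm-1"))

    val current = run(identifyOnTerminated = false, events, init)
    assert(current.singleton.isEmpty && current.buffer == Vector("a", "b"))

    val proposed = run(identifyOnTerminated = true, events, init)
    assert(proposed.delivered == Vector("csm-2" -> "a", "csm-2" -> "b"))
  }
}
```

In this model the current behaviour buffers indefinitely when no member event follows the Terminated message, while the proposed behaviour re-identifies and flushes the buffer. It says nothing about question 1 below, since the model has no notion of Identify messages racing with membership changes.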

So, with that said, here are my questions:

  1. What bad things could happen if Identify messages were sent shortly after the Terminated message is received, and throughout the member state changes?
  2. Is reconnecting after a CSM failure desirable behaviour in general?
  3. Would there be any interest in a PR to fix this, assuming that point 1 has no show-stoppers?

Thanks in advance