I have found another source of unstable behavior in our production cluster.
- I have two running nodes, A and B. Node A has been appointed leader, which we detect by evaluating the RoleLeaderChanged event.
- Node B (the one that is not the leader) gets restarted.
- During start-up, node B gets appointed leader via a RoleLeaderChanged event. Node A remains leader during this time and receives no notification. Some actors now cause damage because they are running on both nodes at once.
- After a short period, node B receives another RoleLeaderChanged event and recognizes node A as the leader. Everything is consistent again, but the leader on node A cannot repair the damage node B caused, because it never even learns that there was a second leader for a while.
Here are the relevant log lines. Node B considers itself leader for about 5 seconds, until the other member is seen with Up status.
2020-09-22T22:08:34.077Z INFO myown - Handle RoleLeaderChanged, selfAddress=akka://ClusterSystem@100.64.4.57:2551, leaderAddress=akka://ClusterSystem@100.64.4.57:2551, isLeader=true
2020-09-22T22:08:39.477Z INFO akka.cluster.Cluster - Cluster Node [akka://ClusterSystem@100.64.4.57:2551] - Marking node as REACHABLE [Member(address = akka://ClusterSystem@100.64.0.45:2551, status = Up)].
2020-09-22T22:08:39.478Z INFO akka.cluster.Cluster - Cluster Node [akka://ClusterSystem@100.64.4.57:2551] - is no longer leader
2020-09-22T22:08:39.479Z INFO myown - Handle RoleLeaderChanged, selfAddress=akka://ClusterSystem@100.64.4.57:2551, leaderAddress=akka://ClusterSystem@100.64.0.45:2551, isLeader=false
Is this expected behavior? I can certainly evaluate Member.Up events as well, but that makes it much harder to rely on RoleLeaderChanged events alone. I would have expected Akka not to send any RoleLeaderChanged events until a decision can be made, or, if it does, to send them without a leader set.
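For what it's worth, a minimal, framework-free sketch of the workaround I'm describing (gating leader-only work on both signals): only treat RoleLeaderChanged(isLeader=true) as authoritative once this node has also seen its own member reach Up. The class and method names below are illustrative, not Akka API; in a real system the two handlers would be wired to the corresponding cluster event subscriptions.

```java
// Hypothetical guard combining the two cluster signals. Until our own
// member is Up, any self-appointment as leader is treated as transient
// start-up noise and ignored.
public class LeaderGuard {
    private boolean selfUp = false;          // set once our own member reaches Up
    private boolean appointedLeader = false; // verdict of the last RoleLeaderChanged

    // Called from the MemberUp handler when the event concerns our own address.
    public void onSelfMemberUp() {
        selfUp = true;
    }

    // Called from the RoleLeaderChanged handler with its isLeader flag.
    public void onRoleLeaderChanged(boolean isLeader) {
        appointedLeader = isLeader;
    }

    // Leader-only actors may start only when both conditions hold, which
    // filters out the 5-second false leadership window seen in the logs.
    public boolean mayActAsLeader() {
        return selfUp && appointedLeader;
    }

    public static void main(String[] args) {
        LeaderGuard guard = new LeaderGuard();
        guard.onRoleLeaderChanged(true);            // start-up: node B sees itself as leader
        System.out.println(guard.mayActAsLeader()); // false: we are not Up yet
        guard.onRoleLeaderChanged(false);           // gossip converges: node A is leader
        guard.onSelfMemberUp();
        System.out.println(guard.mayActAsLeader()); // false: leadership was withdrawn
    }
}
```

This avoids the damage window on node B, but it does not help node A notice that a second leader briefly existed, which is the part I see no clean signal for.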