Hook into cluster leader changes

Hello,

Is it possible to hook into cluster leader changes, especially when the leader becomes able, or is no longer able, to perform its duties? I would like to publish metrics and plug them into our monitoring framework so that we are notified when it happens.

Thanks!

I think https://doc.akka.io/docs/akka/current/cluster-usage.html?language=scala#subscribe-to-cluster-events and this https://doc.akka.io/api/akka/current/akka/cluster/ClusterEvent$.html should give you what you need.
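As a rough illustration of the subscription approach in those docs, here is a minimal sketch of a classic actor that listens for LeaderChanged. The class name LeaderListener, the publishGauge callback, and the metric name are hypothetical placeholders for whatever your monitoring framework provides; only the Cluster and ClusterEvent calls come from Akka.

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent.{CurrentClusterState, InitialStateAsSnapshot, LeaderChanged}

// Hypothetical listener: publishGauge stands in for your own metrics client.
class LeaderListener(publishGauge: (String, Long) => Unit) extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  // Ask for the current state first, then for every subsequent LeaderChanged event.
  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsSnapshot, classOf[LeaderChanged])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case state: CurrentClusterState =>
      // Snapshot delivered once, right after subscribing.
      report(state.leader.isDefined)
    case LeaderChanged(leader) =>
      log.info("Cluster leader is now {}", leader)
      report(leader.isDefined)
  }

  // Publishes 1 when this node sees a leader and 0 when it sees none (assumed metric name).
  private def report(hasLeader: Boolean): Unit =
    publishGauge("cluster.leader.present", if (hasLeader) 1L else 0L)
}
```

It could be started with something like system.actorOf(Props(new LeaderListener(myGauge))), where myGauge is your own function.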

If you are a Lightbend subscriber, these changes are published via Lightbend Telemetry as well.

David

It seems I can tell when the leader changes, but not when it is no longer able to perform its duties, which is what I really want.

What do you mean by “not able to perform its duties anymore”? I don’t think I understand the case you want to listen for.

Do you mean temporarily unable to perform duties because a member is unreachable? You should be able to do that by listening to unreachable member messages.
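A minimal sketch of that idea, assuming a classic actor and a hypothetical publishGauge callback and metric name (neither is part of Akka): it tracks the set of unreachable addresses and publishes its size on every change.

```scala
import akka.actor.{Actor, ActorLogging, Address}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Hypothetical listener: publishGauge stands in for your own metrics client.
class ReachabilityListener(publishGauge: (String, Long) => Unit) extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)
  private var unreachable = Set.empty[Address]

  // ReachabilityEvent covers both UnreachableMember and ReachableMember.
  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsSnapshot, classOf[ReachabilityEvent], classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case state: CurrentClusterState => update(state.unreachable.map(_.address))
    case UnreachableMember(member)  => update(unreachable + member.address)
    case ReachableMember(member)    => update(unreachable - member.address)
    case MemberRemoved(member, _)   => update(unreachable - member.address) // downed or left
  }

  private def update(next: Set[Address]): Unit = {
    unreachable = next
    publishGauge("cluster.unreachable.members", next.size.toLong) // assumed metric name
    if (next.nonEmpty) log.warning("Unreachable members: {}", next.mkString(", "))
  }
}
```

An alert on this gauge staying above zero for longer than your usual downing timeout would be one way to catch the long-lasting cases.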

Permanently? I don’t think there’s a case where a leader permanently “stops doing their duties” outside of a leader change.

Yes, I mean exactly that. We have had several occurrences over the past few months where it lasted a long time (between a few hours and a few days). We want to be aware when it happens before deploying code, because we have noticed that movements in the cluster while the leader is not fully working make it a lot harder to recover.

I think you need to figure out what the issue is with your leaders, then. If you are using Lightbend’s SBR or a custom SBR implementation and the leader isn’t able to act, then you need to figure out why your leader isn’t acting. If you don’t have an SBR implementation in place, this could be normal, and you probably just need to monitor unreachable node events so that you can resolve the situation manually. But if you do have SBR, then I don’t think there is any situation where leaders “don’t fulfill their duties” for any extended period of time.

Does that make sense? I’m kind of struggling to figure out whether you are experiencing a normal situation for a network without SBR, or some kind of bug/situation I don’t understand.

Yes we definitely need to understand what the issue is.
We are using the auto-down feature + custom code to ensure quorum on singletons.

Thanks to the information contained in the log “The leader can no longer perform its duties”, we were able to see that there were pairs of nodes in an unreachable status, and killing the ones at the source of this fixed the issue. We’ll need to dig deeper to understand what is going on. In the meantime, we would love to be notified sooner rather than later when this happens, but looking at the Akka source code (ClusterDaemon and MembershipState), it looks hard to reuse the code that checks convergence in order to publish metrics accordingly.
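One possible stopgap, sketched under assumptions: rather than reusing the internal convergence check, sample the public Cluster(system).state periodically and count unreachable members that are not already Down or Exiting. This is only a rough proxy and not Akka's actual check, which also takes into account which members have seen the latest gossip. ConvergenceProxy, publishGauge, and the metric name are hypothetical.

```scala
import akka.actor.ActorSystem
import akka.cluster.{Cluster, Member, MemberStatus}

// Hypothetical helper; an approximation, not Akka's internal convergence logic.
object ConvergenceProxy {
  // Assumption: unreachable members in these statuses do not hold the leader back.
  private val skipped: Set[MemberStatus] = Set(MemberStatus.Down, MemberStatus.Exiting)

  // Unreachable members that are likely to prevent the leader from acting.
  def blockingUnreachable(unreachable: Set[Member]): Set[Member] =
    unreachable.filterNot(m => skipped(m.status))

  // Takes one sample of this node's view of the cluster and publishes it.
  def sample(system: ActorSystem, publishGauge: (String, Long) => Unit): Unit = {
    val blocking = blockingUnreachable(Cluster(system).state.unreachable)
    publishGauge("cluster.convergence.blocking-unreachable", blocking.size.toLong)
  }
}
```

Calling sample from a periodic timer on every node and alerting when the value stays above zero for a while should flag the situation well before a deployment, though it may also fire during short network blips.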