Hook into cluster leader changes

epot · March 22, 2019, 8:28pm

Hello,

Is it possible to hook into cluster leader changes, especially when the leader becomes able, or is no longer able, to perform its duties? I would like to publish metrics and plug them into our monitoring framework so that we are notified when it happens.

Thanks!

davidogren · March 22, 2019, 9:52pm

I think https://doc.akka.io/docs/akka/current/cluster-usage.html?language=scala#subscribe-to-cluster-events and this https://doc.akka.io/api/akka/current/akka/cluster/ClusterEvent$.html should give you what you need.

If you are a Lightbend subscriber, these changes are published via Lightbend Telemetry as well.

David

epot · March 23, 2019, 5:55am

It seems I can find out when the leader changes, but not when it is no longer able to perform its duties, which is what I really want.

davidogren · March 23, 2019, 1:15pm

What do you mean “not able to perform its duties anymore”? I think I’m not understanding the case you want to listen for.

Do you mean temporarily unable to perform duties because a member is unreachable? You should be able to do that by listening to unreachable member messages.

Permanently? I don’t think there’s a case where a leader permanently “stops doing their duties” outside of a leader change.
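As a rough illustration of listening for those messages, here is a minimal sketch using the classic Akka cluster API. It subscribes to `LeaderChanged`, `UnreachableMember`, `ReachableMember` and `MemberRemoved`, and pushes gauges whenever the picture changes. The names `HealthView`, `ClusterHealthListener`, the metric names and the `publishGauge` callback are all made up for the example; wire in whatever your monitoring framework actually provides.

```scala
import akka.actor.{Actor, ActorLogging, Address, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Plain bookkeeping, kept outside the actor so it is easy to test.
final case class HealthView(leader: Option[Address] = None,
                            unreachable: Set[Address] = Set.empty) {
  def markUnreachable(a: Address): HealthView = copy(unreachable = unreachable + a)
  def markGone(a: Address): HealthView = copy(unreachable = unreachable - a)
  // True while this node considers at least one member unreachable.
  def degraded: Boolean = unreachable.nonEmpty
}

// `publishGauge` is a hypothetical hook into your monitoring framework.
class ClusterHealthListener(publishGauge: (String, Long) => Unit)
    extends Actor with ActorLogging {

  private val cluster = Cluster(context.system)
  private var view = HealthView()

  // With InitialStateAsSnapshot the first message is a CurrentClusterState.
  override def preStart(): Unit =
    cluster.subscribe(self, InitialStateAsSnapshot,
      classOf[LeaderChanged], classOf[UnreachableMember],
      classOf[ReachableMember], classOf[MemberRemoved])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive: Receive = {
    case s: CurrentClusterState =>
      update(HealthView(s.leader, s.unreachable.map(_.address)))
    case LeaderChanged(leader) =>
      log.info("Cluster leader is now {}", leader)
      update(view.copy(leader = leader))
    case UnreachableMember(m) => update(view.markUnreachable(m.address))
    case ReachableMember(m)   => update(view.markGone(m.address))
    // A member removed while unreachable never produces ReachableMember.
    case MemberRemoved(m, _)  => update(view.markGone(m.address))
  }

  private def update(next: HealthView): Unit = {
    view = next
    publishGauge("cluster.unreachable.members", view.unreachable.size.toLong)
    publishGauge("cluster.leader.known", if (view.leader.isDefined) 1L else 0L)
  }
}

object ClusterHealthListener {
  def props(publishGauge: (String, Long) => Unit): Props =
    Props(new ClusterHealthListener(publishGauge))
}
```

Starting it is something like `system.actorOf(ClusterHealthListener.props((name, value) => println(s"$name=$value")))`. Keep in mind that each node reports its own view of the cluster, so it is probably worth running the listener on every node and alerting when the unreachable gauge stays above zero for longer than you are comfortable with.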

epot · March 23, 2019, 7:27pm

Yes, I mean exactly that. We have had several occurrences over the past few months where it lasted a long time (between a few hours and a few days). We want to be aware when it happens before doing a code deployment, as we have noticed that movements in the cluster while the leader is not fully working make it a lot harder to recover.

davidogren · March 23, 2019, 11:43pm

I think you need to figure out what the issue is with your leaders, then. If you are using Lightbend’s SBR or a custom SBR implementation and the leader isn’t able to act, then you need to figure out why your leader isn’t acting. If you don’t have an SBR implementation in place, this could be normal, and you probably just need to monitor unreachable node events so that you can resolve the situation manually. But if you do have SBR, then I don’t think there is any situation where leaders “don’t fulfill their duties” for any extended period of time.

Does that make sense? I’m kind of struggling to figure out whether you are experiencing a normal situation for a network without SBR, or if you are experiencing some kind of bug/situation I don’t understand.

epot · March 25, 2019, 8:44am

Yes, we definitely need to understand what the issue is.
We are using the auto-down feature plus custom code to ensure quorum on singletons.

Thanks to the information contained in the log “The leader can no longer perform its duties”, we were able to see that there were pairs of nodes in an unreachable status, and killing the ones at the source of this fixed the issue. We’ll need to dig deeper to understand what is going on. But in the meantime we would love to be notified sooner rather than later when this happens, and looking at the Akka source code (ClusterDaemon and MembershipState) it looks hard to re-use the code that checks convergence and to publish metrics accordingly.
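One rough workaround, sketched below, is to approximate the convergence check from the public `Cluster(system).state` snapshot instead of re-using the internals. It rests on an assumption about how convergence works: unreachable members block it unless they are `Down` or `Exiting`, and every `Up` or `Leaving` member must have seen the latest gossip. That is an approximation and not the actual Akka check, and `MemberView` and `ConvergenceProbe` are names invented for the example.

```scala
import akka.actor.Address
import akka.cluster.{Cluster, MemberStatus}
import akka.cluster.ClusterEvent.CurrentClusterState

// Simplified per-member view, so the check itself needs no running cluster.
final case class MemberView(address: Address, status: MemberStatus,
                            unreachable: Boolean, seenLatestGossip: Boolean)

object ConvergenceProbe {
  // Assumption: unreachable members in these states do not block convergence.
  private val ignorableWhenUnreachable: Set[MemberStatus] =
    Set(MemberStatus.Down, MemberStatus.Exiting)
  // Assumption: members in these states must have seen the latest gossip.
  private val mustHaveSeenGossip: Set[MemberStatus] =
    Set(MemberStatus.Up, MemberStatus.Leaving)

  // Addresses that, under the assumptions above, keep the leader from acting.
  def blockers(members: Iterable[MemberView]): Set[Address] =
    members.collect {
      case m if m.unreachable && !ignorableWhenUnreachable(m.status) => m.address
      case m if !m.seenLatestGossip && mustHaveSeenGossip(m.status)  => m.address
    }.toSet

  // Adapts the public cluster state snapshot to the simplified view.
  def fromState(state: CurrentClusterState): List[MemberView] = {
    val unreachable = state.unreachable.map(_.address)
    (state.members.toList ++ state.unreachable).distinct.map { m =>
      MemberView(m.address, m.status, unreachable(m.address), state.seenBy(m.address))
    }
  }

  // Samples this node's current view of the cluster.
  def sample(cluster: Cluster): Set[Address] = blockers(fromState(cluster.state))
}
```

Calling `ConvergenceProbe.sample(Cluster(system))` on a timer and publishing the size of the result as a gauge would give a number to alert on. The set will be briefly non-empty during normal gossip, so it probably only makes sense to alert when it stays non-empty for a minute or more.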
