Timeout in coordinated shutdown phase "cluster-exiting"

thjaeckle · June 16, 2021, 1:16pm

Hello.

In the Eclipse Ditto project we make use of Coordinated Shutdown in an Akka setup deployed in Kubernetes.
There we experience that the shutdown phase “cluster-exiting” is always failing, even after increasing the default timeout of 10s to 20s.
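
For readers following along, a timeout increase like the one described is normally set per phase in application.conf. The fragment below is a minimal sketch of that standard setting, not a copy of Ditto's actual configuration:

```hocon
akka.coordinated-shutdown.phases.cluster-exiting {
  # Akka's default for this phase is 10 s; raised here to 20 s as described above
  timeout = 20 s
}
```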

Here are the DEBUG logs of one node which was stopped (all other nodes in the cluster were not stopped):

It seems that the message “Resending system message [Unwatch] [2]” is logged once per second during the “cluster-exiting” phase, but the message never reaches its destination?

I found a similar topic, “Akka coordinated shutdown is timing out”, but that topic was not resolved either.

thjaeckle · June 17, 2021, 3:13pm

Our latest findings suggest that this could be related to the Kubernetes lifecycle when removing pods:

- If we manually send a “kill” command to the java process running in the container, the graceful shutdown works as expected.
- If, however, we delete the Kubernetes pod with kubectl delete pod <our-pod> (so that k8s sends a SIGTERM to pid 1, which forwards the TERM signal to the java process):
  - the Akka cluster immediately can no longer reach the node running in this pod (maybe because k8s already adjusted iptables or ingress controllers?)
  - from that point in time, we also no longer get any logs pushed out from our service via a Logstash TCP appender

So it seems that once k8s starts removing a pod, the pod is immediately isolated: not only the “service” ports are cut off, but also the “remoting” port used for Akka’s TCP remoting via Artery.
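
One way to check this kind of isolation independently of Akka is a small probe that starts testing TCP reachability of another cluster node as soon as it receives SIGTERM. The following is only a sketch under stated assumptions: the function names are hypothetical, and the peer host and port are placeholders for another node's remoting endpoint. It prints to stdout so the result stays visible in the container log even if a TCP log appender is cut off.

```python
import signal
import socket
import time


def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe_after_sigterm(host: str, port: int, attempts: int = 10,
                        interval: float = 1.0) -> None:
    """Install a SIGTERM handler that reports peer reachability once per interval."""

    def on_sigterm(signum, frame):
        for attempt in range(1, attempts + 1):
            reachable = can_connect(host, port)
            print(f"attempt {attempt}: {host}:{port} reachable={reachable}",
                  flush=True)
            time.sleep(interval)

    signal.signal(signal.SIGTERM, on_sigterm)
```

To use it, call probe_after_sigterm with the address of another node (hypothetical example: probe_after_sigterm("peer-node.example.internal", 25520)) and then block with signal.pause() until the signal arrives. If every attempt after SIGTERM reports reachable=False while the same probe succeeds before the pod is deleted, that would be consistent with the pod being cut off from the network at termination.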

Does anyone have similar experiences when running Akka cluster in Kubernetes?

thjaeckle · June 17, 2021, 7:54pm

Our very latest findings: this seems to be caused by a bug in the Calico networking for Kubernetes, which is used e.g. on AWS- and Azure-hosted Kubernetes in versions greater than 1.19.

See issue #4518 in projectcalico/calico on GitHub: “With non-Calico CNI, Calico networkpolicy enforcement does not allow Terminating pods to gracefully shut down”

Pods that are in state Terminating immediately lose all network connectivity. Applications that are still handling in-flight network connections or applications that might want to reach out to the network for a graceful shutdown can not do so.

That is obviously not good for gracefully leaving an Akka cluster.

patriknw · June 18, 2021, 7:45am

Thanks for sharing your findings. That is indeed not good for graceful leaving. Hope it can be configured or fixed in upstream network components.
