Warnings about scheduled sending of heartbeat

Hello,

I’m experiencing this type of errors, originating from random nodes in our cluster:

  Scheduled sending of heartbeat was delayed. Previous heartbeat was sent
   [15799] ms ago, expected interval is [1000] ms. This may cause failure
   detection to mark members as unreachable. The reason can be thread
   starvation, e.g. by running blocking tasks on the default dispatcher, CPU
   overload, or GC.

These messages appear even though I do not feed the cluster with any data to process (cpu < 1.0%).

In my knowledge, we do not run any blocking tasks in the default dispatcher,
but, just in case, I’ve tried to isolate gossip-related task with a dedicated
dispatcher, as specified in the documentation:

  akka.cluster.use-dispatcher = cluster-dispatcher

  cluster-dispatcher {
    type = "Dispatcher"
    executor = "fork-join-executor"
    fork-join-executor {
      parallelism-min = 2
      parallelism-max = 4
    }
  }

But the WARNING messages still occur.

I’m using akka 2.5.16.

Could this issue be related to the akka’s Scheduler?
Any insight to help understanding the cause of this problem would be appreciated!

Have you enabled GC logging and verified that there are no long pauses that correlate with these warnings?

What ”hardware” are you running on and is it shared with others?

Take a few thread dumps and look for blocked threads.

Thanks for your answer.

We use physical machines (no VMs).
I will double check the GC and blocked threads.
About blocked threads, I was expecting that specifying another dispatcher in akka.cluster.use-dispatcher would exclude this hypothesis.
Perhaps the scheduler in charge of emitting heartbeat is still using the default dispatcher?

No GC pauses, and no blocked threads.
Furthermore, I checked in the akka codebase, and, if I understand it well, the dispatcher used by the cluster’s scheduler is specified by akka.cluster.use-dispatcher.
So, using a cluster-specific dispatcher should have prevented such warning messages; or, am I wrong?

PS. I’m using PersistentActor and PersistentQuery with the cassandra plugin, and kamon to get some business metrics.

Sounds very strange, and it’s a very long delay.
The Sceduler has one Thread that triggers all scheduled tasks, like this one. The task is running on another dispatcher so should’t block the Scheduler.

Might have to hook up a profiler to get more insights of what is going on.

Can you reproduce on other machines? How often does it occur?

Yes, it happens on several machines (same datacenter), ~1 time per hour.
I tried JProfiler, and didn’t see anything suspicious.

I’m using PersistentQuery wrapped in actor to emulate the old PersistentView behavior. Perhaps it might be related to akka-stream/cassandra/netty?

Does it also trigger unreachability events, if you have acceptable-heartbeat-pause default value or less than 13 seconds?

@apm have you solved the issue?
I’m experiencing the same…
Akka Cassandra Persistence plugin.

Originally posted on https://stackoverflow.com/questions/58015699/akka-cluster-heartbeat-delays-on-kubernetes


Our Scala application (Kubernetes deployment) constantly experience Akka Cluster
heartbeat delays of ≈3s.

Once we even had a 200s delay which also manifested itself in the following graph:

grafana-200s

Can someone suggest things to investigate further?

Java Flight Recording

Some example:

timestamp    delay_ms
06:24:55.743 2693
06:30:01.424 3390
07:31:07.495 2487
07:36:12.775 3758

There were 4 suspicious time points where lots of Java Thread Park events were
registered simultaneously for Akka threads (actors & remoting)
and all of them correlate to heartbeat issues:

jfr-thread-park-1
jfr-thread-park-2

Around 07:05:39 there were no “heartbeat was delayed” logs, but was this one:

07:05:39,673 WARN PhiAccrualFailureDetector heartbeat interval is growing too large for address SOME_IP: 3664 millis

No correlation with halt events or blocked threads were found during
Java Flight Recording session, only two Safepoint Begin events
in proximity to delays:

jfr-safepoint-begin

CFS throttling

The application CPU usage is low, so we thought it could be related to
how K8s schedule our application node for CPU.
But turning off CPU limits haven’t improved things much,
though kubernetes.cpu.cfs.throttled.second metric disappeared.

Separate dispatcher

Using a separate dispatcher seems to be unnecessary since delays happen even when
there is no load, we also built an explicit application similar to our own which
does nothing but heartbeats and it still experience these delays.

K8s cluster

From our observations it happens way more frequently on a couple of K8s nodes in
a large K8s cluster shared with many other apps when our application doesn’t loaded much.

A separate
dedicated K8s cluster where our app is load tested almost have no issues with
heartbeat delays.

Are you sure you have defined limits and requests correctly? See:

I am also facing the issues with the delayed heartbeats. It is very strange because it is happening even without load. I also tried to use a custom dispatcher for the cluster with no luck. I created a repo, which you can use to reproduce it.

Link : https://github.com/CostasChaitas/Akka-Demo

Any help would be much appreciated. Thanks

I commented on StackOverflow as well, but I was unable to reproduce from your repro. Are you saying it is happening even without load? Because I thought perhaps the missing piece was the load generator.

I tried on minishift rather than minikube, but that seems unlikely to be the difference.

Yep, we have following in our K8s manifest:

requests.cpu = 16
# limits.cpu not set

After reading this I’m still not sure, does it make sense to add -XX:+UseContainerSupport or not?

Oh, I see -XX:+UseContainerSupport is enabled by default in Java 11, so no need to define it explicitly:

I tried running the Kubernetes cluster on AWS and it works perfectly without the heartbeat delays. I guess something is wrong with my configuration on my local minikube setup.

If the request.cpu is not specified and only the limit is, how can you use the auto-scale features of Kubernetes ? Because in this case, the resource cpu on pods is not specified. E.g :

resource cpu on pods (as a percentage of request): / 60%