One Akka node using far more CPU and memory than all the other nodes

We are running an Akka cluster with 1000 nodes. One node among these 1000 (most probably the leader) is using far more CPU and memory than the other nodes. To put it into numbers: while all the other nodes are at about 0.5 cores of usage, the odd one out is at 15 cores. Even if we kill that node, the usage of another node then shoots up. When running the cluster on local k8s, the node with the very high usage gets OOMKilled.

What could be the possible reasons, and what is the solution for this situation? Any lead would be of great help.

Akka version: 2.6.19

A few thoughts:

  • If you have a stable cluster (meaning nodes are not constantly entering and leaving), it’s unlikely to be the leader. The leader actually has very few responsibilities. And while 1000 nodes is quite a large cluster, and I don’t have experience running a cluster that big in production, I still think the leader’s responsibilities would be an absolutely negligible amount of compute.
  • A more likely theory is that the “high CPU” node is the singleton node (see Cluster Singleton in the Akka documentation). The question is what work is running as a singleton that is using so much CPU. It could be something from Akka, such as the Shard Coordinator, but in a stable cluster I’m more suspicious of your own code deliberately running some workload on the singleton.
  • Your best bet is to validate my theory. The singleton runs on the oldest node in the cluster. Is the high-CPU node the oldest? (And if you kill the oldest, does the problem move to the next oldest?) And/or can you see the high-CPU node getting promoted to oldest in the logs? I forget the exact log message when a node is promoted to oldest, but it’s pretty obvious. If you can confirm that the high CPU is related to the singleton, you can then either look at your code for what work you are running on the singleton, or use something like Akka Telemetry to observe the work running on the singleton. (A rough sketch of such a check is included after this list.)
  • You are running an outdated version of Akka. I don’t know of anything related to this issue in that version, but I always like to mention it when I see people running old versions of Akka. I believe there are known CVEs in 2.6.x at this point, because it hasn’t had any patches in a year and a half.
  • 1000 nodes in a cluster is pretty big. And you are running pretty tiny nodes (if they are only using 0.5 cores). This is just an opinion, but if there isn’t any compelling reason to do otherwise, I’d rather run 150 nodes at 3.3 cores than 1000 nodes at 0.5 cores. Or even 70 nodes at 7.1 cores. It kind of depends on how big your K8S worker nodes are. I generally want my Akka nodes to be as big as possible (to minimize overhead) without being so big that losing a node is problematic or so big that the K8S scheduler has difficulty scheduling them. Thus “how many nodes should I run” is a bit of a balancing act, but 1000 nodes at 0.5 cores seems out of balance. It’s not that 1000 nodes is impossible, but you are introducing a lot of complexity by going to 1000 nodes, and for an application that only uses 500 cores of work, I don’t really see any need to introduce that complexity. (Although obviously there may be requirements I don’t know about.)
  • Speaking of which, I hope this gets you going in the right direction. But I’m a bit concerned with trying to troubleshoot a 1000 node Akka cluster in an Internet forum. While I suspect that maybe you are running 2.6 in order to avoid the BSL licensing, if you are running an application this big (500+ cores), it probably behooves you to get some kind of support contract to help with issues like this. (Disclaimer: I do not work for Lightbend and/or have any stake in the company. I just, as someone with a lot of experience in open source, get the heebie jeebies when people rely on community support for mission critical applications.)
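
If it helps, here is a rough sketch of how you could check this programmatically (Scala, classic Cluster API; the object name is just illustrative). It logs the oldest member, which is where singletons such as the Shard Coordinator should be running, along with the current leader for comparison:

import akka.actor.ActorSystem
import akka.cluster.Cluster

// Illustrative helper: logs the oldest member (optionally restricted to a role),
// which is where cluster singletons such as the Shard Coordinator should run.
object SingletonLocationCheck {
  def logOldest(system: ActorSystem, role: Option[String] = None): Unit = {
    val state = Cluster(system).state
    val candidates = state.members.filter(m => role.forall(m.hasRole))
    // Member.isOlderThan compares members by join order (up-number).
    val oldest = candidates.reduceOption((a, b) => if (a.isOlderThan(b)) a else b)
    system.log.info(
      "Oldest member for role [{}]: [{}], current leader: [{}]",
      role.getOrElse("any"),
      oldest.map(_.address).getOrElse("none"),
      state.leader.getOrElse("none"))
  }
}

If the address logged as the oldest matches the high-CPU pod (and moves to the next oldest when you kill that pod), that’s a strong sign the singleton is the culprit.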

Addendum:

You should definitely try to avoid singleton workloads, for the reasons listed in the documentation I linked. But larger nodes also mitigate the impact of a singleton workload, since whichever node the singleton moves to will have more resources to handle it.

But the first step should be to figure out where that CPU and memory are being used; then you can figure out the best solution afterwards.

Thanks man for the great explanation.

Yes, initially we suspected the same: that the oldest pod in the cluster was acting as the shard coordinator and hence taking on additional responsibilities, resulting in high CPU and memory usage.
To rectify this, we assigned roles to the cluster pods, dividing them into two different roles:

  1. Frontend pods:
    – These pods are assigned the frontend role, which makes them act as the coordinator nodes.
  2. Backend pods:
    – These pods are assigned the backend role, which makes them act as the worker nodes.

We achieved this by tweaking the following set of configs:
play {
  akka {
    cluster {
      roles = []
      roles += ${?AKKA_CLUSTER_ROLE} // configured as `frontend` for the frontend pods and `backend` for the backend pods
      sharding {
        role = "backend"
        coordinator-singleton-role-override = on
      }
      role {
        frontend.min-nr-of-members = 3
        backend.min-nr-of-members = 2
      }
      singleton {
        role = "frontend"
      }
    }
  }
}

However, when we deployed this and ran the load test to verify the changes, we still saw one of the backend pods taking a huge amount of CPU and memory. We verified the cluster members:
– The oldest pod in the cluster belonged to the frontend role (this was expected and, per our understanding, it should act as the shard coordinator).
– The leader of the cluster belonged to the backend role. (This is the pod that was taking huge CPU and memory.)
– And the cluster is very stable; members are not leaving and rejoining at all.
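
(For reference, a check like this can also be done from inside the application. Below is a minimal sketch using the classic Cluster API; the object name is illustrative.)

import akka.actor.ActorSystem
import akka.cluster.Cluster

// Illustrative sketch: log every member with its status and roles, plus the
// current leader, so the high-usage pod can be matched against them.
object ClusterStateDump {
  def log(system: ActorSystem): Unit = {
    val state = Cluster(system).state
    state.members.foreach { m =>
      system.log.info("Member [{}] status=[{}] roles=[{}]",
        m.address, m.status, m.roles.mkString(", "))
    }
    system.log.info("Current leader: [{}]", state.leader.getOrElse("none"))
  }
}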

Have you profiled the problematic node to see what is happening? (e.g. with async-profiler)

I ran a 10,000-node cluster once (albeit by switching out Akka’s Cluster membership implementation), and along the way to that scale I ended up having to profile a lot to get it to work. It’s a bit difficult to guess what is happening without looking at the system up close; back when I did this experiment I found a bug in Akka that only manifested at larger scale.


A few more thoughts.

  • I almost mentioned trying to create a couple of roles in my earlier reply. But I didn’t, because I really don’t think it solves any problems here: it adds complexity but doesn’t have any benefit in a situation like this.

  • It sounds like you have verified that the high CPU node is the leader? Correct? I still don’t understand why the leader would be doing any work at all if there are no members entering or leaving.

    • Start monitoring cluster domain events. Perhaps there are lots of transitory failures and your cluster isn’t as stable as you think? If the leader is processing lots of reachability events, that would explain the high CPU. (There’s a sketch of a simple event-logging actor at the end of this post.)

    • I think you need to do some traditional profiling. What exactly is the CPU doing? What actors? What methods?

    • I still think you need to reduce the size of your cluster. 1000 nodes is between bleeding edge and experimental. I’m glad Manuel jumped in; I remember Manuel talking about his large-cluster work, but I had forgotten it when I wrote my post about cluster size. Not to be overly blunt, but running a 1000-node cluster without help from Lightbend (or being knee deep in the Akka source code yourself, like Manuel) is probably suicide. But if you are only using 0.5 cores of work per Akka node, there’s no reason you can’t reduce the cluster size, especially if it turns out that it actually is the leader duties causing the CPU load.
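
To make the “monitor domain events” point concrete, here is a minimal sketch (Scala, classic actors; the class name is illustrative) of an actor that logs cluster member and reachability events, so you can see whether the cluster is really as stable as it looks:

import akka.actor.{Actor, ActorLogging, Props}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Illustrative event logger: subscribes to member and reachability events.
// If the hot node is the leader and these logs are busy, leader work is a
// plausible explanation; if they stay quiet, look elsewhere (profiling).
class ClusterEventLogger extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberEvent], classOf[ReachabilityEvent])

  override def postStop(): Unit =
    cluster.unsubscribe(self)

  def receive: Receive = {
    case UnreachableMember(member) => log.warning("Member detected as unreachable: {}", member)
    case ReachableMember(member)   => log.info("Member is reachable again: {}", member)
    case event: MemberEvent        => log.info("Member event: {}", event)
  }
}

object ClusterEventLogger {
  // Start one instance per node, e.g.:
  //   system.actorOf(ClusterEventLogger.props, "cluster-event-logger")
  def props: Props = Props(new ClusterEventLogger)
}

If these logs stay quiet on the hot node while its CPU is pegged, you can pretty much rule out leader/reachability work and go back to profiling.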