Load Balancing in an Akka Streams application

I have an application based on Akka Streams. It consumes data from one Kafka topic, processes it, and produces the processed data to another Kafka topic based on some property of the messages.
The application is running in a Kubernetes cluster and we are currently using 2 pods. Whenever load is generated, CPU utilisation for one pod is 70-80%, whereas for the other it is 20-30%.

We have used a load balancer with a round-robin algorithm.

Can anyone suggest the best approach to fix the uneven CPU utilisation (one pod high, the other low), and explain the reason for this behaviour?

Also, we have tried a single pod with autoscaling at 60%, but even after scaling out there is still more than a 30% difference in CPU utilisation across the pods.

Hi @Euqinu,

It’s tough to say why you’re seeing these CPU numbers. Assuming both pods are running the same app, belong to the same consumer group, and use the default consumer group partition assignment strategy, you should expect roughly equal numbers of partitions assigned to each consumer group member (each pod). For example, with 10 partitions you should expect 5 assigned to each pod. You can confirm this is actually happening using Kafka ecosystem tools, or the kafka-consumer-groups.sh script that ships with Apache Kafka.
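For example, something like the following will show which partitions each group member owns, along with its current offsets and lag (the broker address and group name here are placeholders; substitute your own):

```shell
# Describe the consumer group: the output lists each topic-partition,
# the consumer (pod) it is currently assigned to, and the lag.
# Replace localhost:9092 and my-app-group with your actual values.
bin/kafka-consumer-groups.sh \
  --bootstrap-server localhost:9092 \
  --describe \
  --group my-app-group
```

If one pod shows up as the owner of far more partitions than the other, that alone would explain the CPU imbalance.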

Even if partitions are well distributed, it’s possible that your message distribution is not. For example, you may have some “hot” partitions that contain more messages than others. You mentioned that you’re using a “round-robin” algorithm to distribute messages. Are you doing this for messages that are produced to the topic you’re consuming?
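To illustrate how hot partitions arise with keyed messages: Kafka’s default partitioner hashes the message key (murmur2) modulo the partition count, so a skewed key distribution lands unevenly on partitions no matter how evenly the partitions themselves are assigned to pods. A sketch (Python’s built-in `hash` stands in for murmur2, and the key distribution is invented):

```python
# Illustrative only: hash() is a stand-in for Kafka's murmur2
# partitioner, and the workload below is made up. One "hot" key pins
# most of the traffic to a single partition.
from collections import Counter

NUM_PARTITIONS = 10

def partition_for(key: str) -> int:
    # Kafka's default keyed partitioning: hash(key) % num_partitions.
    return hash(key) % NUM_PARTITIONS

# Skewed workload: one customer produces 900 of 1000 messages.
keys = ["customer-1"] * 900 + [f"customer-{i}" for i in range(2, 102)]

counts = Counter(partition_for(k) for k in keys)
for p in sorted(counts):
    print(f"partition {p}: {counts[p]} messages")
```

Whichever partition `customer-1` hashes to will carry at least 900 messages, so whichever pod owns that partition does most of the work. Producing with round-robin (or null keys) avoids this, but only if you don’t need key-based ordering.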

Finally, even with good partition and message distribution, you may see different processing characteristics if messages from some partitions require more processing or take longer to process for other reasons (more complicated workflow, blocking network calls to other systems, databases, etc.).
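A toy calculation (all figures invented) shows why equal message counts can still mean unequal CPU: if one pod’s partitions happen to hold the expensive messages, it does an order of magnitude more work than the other.

```python
# Hypothetical numbers: both pods own 2 partitions with 1000 messages
# each ("balanced"), but pod A's messages cost 10 ms apiece (say, a
# blocking network call) while pod B's cost 1 ms.
msgs_per_partition = 1000
cost_ms = {"p0": 10, "p1": 10, "p2": 1, "p3": 1}

pod_a = ["p0", "p1"]
pod_b = ["p2", "p3"]

work_a = sum(cost_ms[p] * msgs_per_partition for p in pod_a)
work_b = sum(cost_ms[p] * msgs_per_partition for p in pod_b)
print(work_a, work_b)  # 20000 ms vs 2000 ms of processing work
```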

I would recommend ensuring that you have good partition and message distribution, as well as looking at consumer rate metrics per partition to see how quickly partitions are being processed. Partitions that take longer to process should give you clues as to what kind of message content is causing the spike in load.

Hope that helps,