Aggregating a million records using Akka Streams

Hi Team,

I am running a simulation which generates a million records every second. I’m writing them to Kafka and reading them through Akka Streams. I’m performing a few aggregations on this data and writing the output back to Kafka.

Each record contains a timestamp, which the aggregations are grouped by. Using these timestamps, I create windows of data and perform the aggregation on each window. Since a million records arrive every second, the aggregation takes about 40 seconds per million records. This is far too slow, because new data keeps being generated and written to Kafka every second.
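For reference, the kind of windowing described above can be modeled as tumbling-window bucketing: floor each timestamp to its window start and aggregate per bucket. This is a minimal plain-Java sketch outside Akka Streams; the record shape (timestamp, value), the 5-second window, and summing as the aggregation are all assumptions for illustration:

```java
import java.util.*;

public class WindowSketch {
    // Assumed window length; the real window size isn't stated in the post.
    static final long WINDOW_MILLIS = 5_000;

    // Each record is a {timestampMillis, value} pair. Floor the timestamp to
    // its tumbling-window start and sum the values within each window.
    static Map<Long, Long> aggregate(List<long[]> records) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] r : records) {
            long windowStart = (r[0] / WINDOW_MILLIS) * WINDOW_MILLIS;
            sums.merge(windowStart, r[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<long[]> records = Arrays.asList(
            new long[]{1_000, 2},   // window starting at 0
            new long[]{4_999, 3},   // window starting at 0
            new long[]{5_000, 7});  // window starting at 5_000
        System.out.println(aggregate(records)); // {0=5, 5000=7}
    }
}
```

Because each window's aggregation only depends on records within that window, the buckets are independent and can in principle be processed in parallel.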

I referred to this blog post on performing window aggregations.

Is there a better way to perform these aggregations in less time (preferably under one second) using Akka Streams?


I'm not a pro in this area, but I would go with a different approach. Right now you have one big pipe of events, and your solution seems hard to scale. I would write a small app with streams that consumes the original stream (perhaps also attaching metadata) and pushes the records back into windowed subtopics (like 12:05_5min), and then a second app that consumes and aggregates the data from those subtopics. The first app is scalable because it doesn't do any aggregation, and the second app is easy to scale with a "stream reservation" mechanism, which can be implemented with cluster actors or Redis. Once you can scale easily, you can fine-tune your CPU usage and measure the bottlenecks more easily.
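The windowed-subtopic idea can be sketched as a pure function that maps a record's event timestamp to its target topic name. This is a minimal sketch: the 12:05_5min naming comes from the example above, while the UTC zone, the method name, and everything else are assumptions:

```java
import java.time.*;

public class SubtopicSketch {
    // Derive the windowed subtopic name (e.g. "12:05_5min") for a record,
    // given its event timestamp and the window length in minutes.
    static String subtopic(long epochMillis, int windowMinutes) {
        long windowMillis = windowMinutes * 60_000L;
        // Floor the timestamp to the start of its tumbling window.
        long windowStart = (epochMillis / windowMillis) * windowMillis;
        LocalTime t = LocalDateTime
            .ofInstant(Instant.ofEpochMilli(windowStart), ZoneOffset.UTC)
            .toLocalTime();
        return String.format("%02d:%02d_%dmin", t.getHour(), t.getMinute(), windowMinutes);
    }

    public static void main(String[] args) {
        // 12:07:30 UTC falls into the 5-minute window starting at 12:05.
        long ts = LocalDateTime.of(2024, 1, 1, 12, 7, 30)
            .toInstant(ZoneOffset.UTC).toEpochMilli();
        System.out.println(subtopic(ts, 5)); // 12:05_5min
    }
}
```

In the first app this function would pick the Kafka topic to publish each record to; since it is stateless, that stage can be parallelized freely, and each instance of the second app can then aggregate one subtopic's records without coordinating with the others.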