Akka Kafka source: batch processing & back-pressure / ACK


(Firenow) #1


I discovered Scala / Akka a few months ago and figured that actors / reactive streams could be the right toolbox to create an IoT data integration application.
To summarize briefly, gigabytes of IoT data are streamed continuously to a Kafka cluster.
All this data needs to be aggregated in batches and transformed into a fast-read storage format.

I figured the actor model (for batch job notifications) and Akka reactive streams (for guaranteed delivery and back-pressure) should be the right tools:
Consumer: consumes a batch of data and sends a BatchNotification to a JobManager with guaranteed delivery.
JobManager: builds the appropriate jobs and sends them all to a Worker Master actor, which broadcasts JobAvailable to a pool of Worker actors.

I figured out how to consume data from Kafka (using “akka.kafka.scaladsl.Consumer”) and send a job to the JobManager using “toMat” (I have to admit it’s still unclear to me what these materialized views exactly mean).

One thing remains to be done, and I have trouble implementing it: back-pressure and ACK.
In other words, the consumer should wait for an ACK from the JobManager before committing Kafka offsets and sending it a new job.

I’ve no idea how to implement this, and can’t find any example of this.
Is this easily implementable using Akka actors / streams ?
Any hint on the design pattern, doc and package to use ?

(Aditya Athalye) #2

Hi @firenow

If you are continuously receiving GBs of streaming IoT data, I am not so sure why you would want to slow down the entire processing chain by artificially introducing a back-pressure mechanism (unless there are other restrictions in terms of hardware and such).

There are several ways of doing this; one of them would be:

1.) Introduce a data store (an RDBMS such as MySQL) into the picture.
2.) The Kafka consumer reads a batch of data and sends a message to a remote actor (JobManager).
3.) The JobManager creates a Job definition out of this batch and writes this definition to the MySQL store (since you are batching, the number of writes to the DB should be manageable).
4.) Once the write to the DB is successful, the JobManager actor sends an ACK to the original consumer, which then commits the Kafka offsets. This ensures that your job is created, but the consumer need not wait for the aggregations on the workers to finish before pulling in the next batch of data.
5.) As soon as the workers are done executing a batch, the JobManager updates the status (Completed, Failed etc.) of this job definition in your DB.
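
Steps 3 and 4 could be sketched roughly as below with a classic Akka actor. This is only an illustration under assumptions: JobStore, JobDefinition, BatchNotification and Ack are hypothetical names that do not come from any library, and the actual MySQL access is left behind the JobStore interface.

```scala
import java.util.UUID

import akka.actor.{Actor, ActorRef}
import akka.pattern.pipe

import scala.concurrent.Future

// Hypothetical protocol and model, invented for this sketch.
final case class BatchNotification(records: Seq[String])
final case class JobDefinition(id: String, recordCount: Int, status: String)
case object Ack

// Hypothetical persistence interface; a real one would write to MySQL.
trait JobStore {
  def save(job: JobDefinition): Future[Unit]
}

class JobManager(store: JobStore, workerMaster: ActorRef) extends Actor {
  import context.dispatcher // execution context for the Future callbacks

  def receive: Receive = {
    case BatchNotification(records) =>
      val job = JobDefinition(UUID.randomUUID().toString, records.size, status = "Pending")
      val replyTo = sender() // capture now: sender() must not be called inside a Future

      store
        .save(job)
        .map { _ =>
          workerMaster ! job // hand the job over only once it is persisted
          Ack
        }
        .pipeTo(replyTo) // Ack on success, akka.actor.Status.Failure if the write failed
  }
}
```

The consumer only sees the Ack after the write has succeeded, so committing the offsets on Ack means a batch is never marked as consumed without a persisted job definition. Step 5 (status updates coming back from the workers) is left out here.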

Introducing the DB gives you two advantages:

1.) You can build some auditing around your jobs (status, time to completion etc.), with the possibility of retrying failed jobs.
2.) It frees the Kafka consumer from having to wait for a batch to complete before pulling in the next one. With back-pressure in place, a JobManager that starts lagging far behind would slow down the system as a whole.

Try to design the system so that each component (consumers, job managers, workers) can be scaled independently.



(Johan Andrén) #3

The Kafka connector docs have examples for doing this: https://doc.akka.io/docs/akka-stream-kafka/current/atleastonce.html

Interaction with an actor can be introduced into the flow through .mapAsync (together with the ask pattern) or ActorFlow.ask.
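
A minimal sketch of the .mapAsync variant, assuming Akka 2.6+ and a recent Alpakka Kafka release. The BatchNotification and Ack messages, the stand-in JobManager, the topic name, group id, bootstrap servers and batch sizes are all placeholders invented for this example, not something prescribed by the connector.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.kafka.ConsumerMessage.CommittableOffsetBatch
import akka.kafka.scaladsl.{Committer, Consumer}
import akka.kafka.{CommitterSettings, ConsumerSettings, Subscriptions}
import akka.pattern.ask
import akka.stream.scaladsl.Keep
import akka.util.Timeout
import org.apache.kafka.common.serialization.StringDeserializer

import scala.concurrent.ExecutionContext
import scala.concurrent.duration._

// Hypothetical protocol between the stream and the JobManager.
final case class BatchNotification(records: Seq[String])
case object Ack

// Stand-in JobManager: a real one would build jobs before replying.
class JobManager extends Actor {
  def receive: Receive = {
    case _: BatchNotification => sender() ! Ack
  }
}

object AckBeforeCommit {
  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("iot-consumer")
    implicit val ec: ExecutionContext = system.dispatcher
    implicit val askTimeout: Timeout = 30.seconds // how long to wait for an Ack

    val jobManager: ActorRef = system.actorOf(Props(new JobManager), "job-manager")

    val consumerSettings =
      ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
        .withBootstrapServers("localhost:9092") // assumed broker address
        .withGroupId("iot-batcher")             // assumed consumer group

    val control: Consumer.Control =
      Consumer
        .committableSource(consumerSettings, Subscriptions.topics("iot-data"))
        .groupedWithin(1000, 5.seconds) // a batch: up to 1000 records or 5 seconds
        .mapAsync(parallelism = 1) { batch =>
          // Collect the offsets of the whole batch, across partitions.
          val offsets = batch.foldLeft(CommittableOffsetBatch.empty) { (acc, msg) =>
            acc.updated(msg.committableOffset)
          }
          // Ask the JobManager; offsets flow downstream only after the Ack.
          (jobManager ? BatchNotification(batch.map(_.record.value())))
            .mapTo[Ack.type]
            .map(_ => offsets)
        }
        .toMat(Committer.sink(CommitterSettings(system)))(Keep.left)
        .run()

    // Stop consuming when the JVM shuts down.
    sys.addShutdownHook(control.shutdown())
  }
}
```

With parallelism = 1, mapAsync does not pull the next batch until the previous ask has completed, which is where the back-pressure comes from. If the Ack does not arrive within the timeout, the stream fails without committing, so the batch is read again after a restart (at-least-once delivery).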

toMat means “to”, plus the option to decide which materialized value to keep; it has nothing to do with “materialized views”. I’d recommend that you go through the stream docs and get a good grasp of the basics of streams before jumping onto something more advanced, like streaming from Kafka with offset commits.
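
A small sketch of the difference, assuming Akka 2.6+ (where an implicit ActorSystem is enough to run a stream). The object and value names are made up for the example.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

object ToMatSketch {
  // The source materializes NotUsed, the sink materializes the Future of the sum.
  val numbers: Source[Int, NotUsed] = Source(1 to 5)
  val sumSink: Sink[Int, Future[Int]] = Sink.fold[Int, Int](0)(_ + _)

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("to-mat-sketch")

    // `to` always keeps the left (source) materialized value.
    val ignored: NotUsed = numbers.to(sumSink).run()

    // `toMat` lets you choose: Keep.right keeps the sink's Future[Int].
    val sum: Future[Int] = numbers.toMat(sumSink)(Keep.right).run()

    // Keep.both returns both values as a tuple.
    val both: (NotUsed, Future[Int]) = numbers.toMat(sumSink)(Keep.both).run()

    println(Await.result(sum, 3.seconds)) // prints 15
    system.terminate()
  }
}
```

With a Kafka source the left value is a Consumer.Control, which is why the Kafka examples typically use toMat: it lets you keep that control handle, the sink's completion Future, or both.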