Resource management and task scheduling with akka.cluster

Hi,

I’m trying to handle a “simple” scheduling problem involving tasks executed on a cluster of several machines. Each task consumes some CPU, memory and files stored on the local disk (we know in advance the resources a task will consume), and a task can last minutes, hours or even days, which we don’t know in advance. Note that a task can wait until the cluster is ready to execute it. So I have the scheduler logic (more or less); where do I put it in Akka?
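To make the resource model concrete, a task declaration in my current prototype looks roughly like this (all names are just illustrative):

    // Rough sketch of what a task declares up front (names are illustrative).
    // Resources are known in advance; duration is not, so it is simply absent.
    final case class TaskSpec(
      id: String,
      cpuCores: Double,            // e.g. 1.5 cores
      memoryMb: Long,              // resident memory the task may use
      requiredFiles: Set[String])  // files that must be present on the local disk

    final case class NodeResources(
      freeCpuCores: Double,
      freeMemoryMb: Long,
      localFiles: Set[String]) {

      // A task fits on a node if there is enough CPU/memory and its files are local.
      def canRun(task: TaskSpec): Boolean =
        task.cpuCores <= freeCpuCores &&
          task.memoryMb <= freeMemoryMb &&
          task.requiredFiles.subsetOf(localFiles)
    }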

I started with a simple scheduler/router actor that node actors register with. A task is an actor that ends when the task is done. Each node actor reports the status of the resources (local files, CPU, memory…) consumed by all the tasks running on it. When a task is done, the scheduler is notified and other tasks can start (or not, if there is not enough room). It’s a very simple cluster implementation without any seed node. It works, but I miss all the good things Akka Cluster could give me (such as all the logic to join/leave a cluster…).
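In case it helps to see the shape of it, the protocol between my scheduler and the node actors is roughly the following (plain classic actors, reusing the TaskSpec/NodeResources sketch above; the greedy FIFO placement is just what I have today):

    import akka.actor.{Actor, ActorRef, Terminated}

    // Messages exchanged between the scheduler/router and the node actors (my names).
    final case class RegisterNode(resources: NodeResources)    // node actor joins the scheduler
    final case class ResourceUpdate(resources: NodeResources)  // periodic status report
    final case class SubmitTask(task: TaskSpec)                 // a task asking to be placed
    final case class RunTask(task: TaskSpec)                    // scheduler -> node
    final case class TaskDone(taskId: String)                   // node -> scheduler

    class SchedulerActor extends Actor {
      private var nodes   = Map.empty[ActorRef, NodeResources]
      private var pending = Vector.empty[TaskSpec]  // tasks waiting until there is room

      def receive: Receive = {
        case RegisterNode(res) =>
          context.watch(sender())
          nodes += sender() -> res

        case ResourceUpdate(res) =>
          nodes += sender() -> res
          tryDispatch()

        case SubmitTask(task) =>
          pending :+= task
          tryDispatch()

        case TaskDone(_) =>
          tryDispatch()  // a task finished, so some room may have freed up

        case Terminated(node) =>
          nodes -= node
      }

      // FIFO: dispatch the task at the head of the queue if some node has room for it.
      private def tryDispatch(): Unit =
        pending.headOption.foreach { task =>
          nodes.collectFirst { case (node, res) if res.canRun(task) => node }.foreach { node =>
            node ! RunTask(task)
            pending = pending.tail
          }
        }
    }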

So, of course, I would like to reuse the cluster package, but I don’t know where to start with the router. I really don’t see where the scheduler and the resource manager would fit.
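For example, what I imagine is that the scheduler could at least subscribe to cluster membership events instead of relying on my hand-rolled registration, so machines joining or leaving are handled by Akka Cluster itself. A rough sketch of what I mean (the “worker” role name is just an example):

    import akka.actor.{Actor, ActorLogging}
    import akka.cluster.Cluster
    import akka.cluster.ClusterEvent._

    // Sketch: learn about machines joining/leaving from cluster membership events
    // instead of explicit registration messages.
    class ClusterAwareScheduler extends Actor with ActorLogging {
      private val cluster = Cluster(context.system)

      override def preStart(): Unit =
        cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
          classOf[MemberEvent], classOf[UnreachableMember])

      override def postStop(): Unit = cluster.unsubscribe(self)

      def receive: Receive = {
        case MemberUp(member) if member.hasRole("worker") =>
          log.info("Worker node is up: {}", member.address)
          // here I would contact the node actor on that member and start tracking its resources

        case MemberRemoved(member, _) =>
          log.info("Worker node removed: {}", member.address)
          // drop it from the resource table and put its running tasks back in the queue

        case UnreachableMember(member) =>
          log.warning("Worker node unreachable: {}", member.address)
      }
    }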

A few pointers at a higher level:

  • I’m currently looking at the next step, where I might also involve Kubernetes in the loop to do something similar to this: https://www.youtube.com/watch?v=OOXRgd5yUQo
  • It’s like what YARN does for Hadoop (at a much simpler level in my case, but the idea is the same).
  • We can also think of reusing the Kubernetes API, as the Spark team is planning to do, to schedule containers that each execute a specific task (a container joins the cluster when it is created), since the goal is to replace YARN’s capabilities: apache-spark-with-native-kubernetes/
    There is still some work to do for data locality (sorry, I cannot post more than 2 links).

Any ideas?

Thanks!

Maybe the good old distributed workers sample could give you some inspiration and ideas?
https://developer.lightbend.com/guides/akka-distributed-workers-scala/
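In that sample the master (which plays roughly the role of your scheduler/resource manager) runs as a Cluster Singleton, so there is exactly one of it in the cluster and it moves to another node if its current host leaves. Something along these lines, where SchedulerActor stands in for your existing scheduler actor:

    import akka.actor.{ActorRef, ActorSystem, PoisonPill, Props}
    import akka.cluster.singleton.{
      ClusterSingletonManager, ClusterSingletonManagerSettings,
      ClusterSingletonProxy, ClusterSingletonProxySettings}

    object SchedulerSingleton {

      // Run exactly one scheduler in the whole cluster (hosted on the oldest node).
      def start(system: ActorSystem): Unit =
        system.actorOf(
          ClusterSingletonManager.props(
            singletonProps     = Props[SchedulerActor](),
            terminationMessage = PoisonPill,
            settings           = ClusterSingletonManagerSettings(system)),
          name = "scheduler")

      // Node actors on every machine talk to the scheduler through a proxy
      // that always knows where the singleton currently lives.
      def proxy(system: ActorSystem): ActorRef =
        system.actorOf(
          ClusterSingletonProxy.props(
            singletonManagerPath = "/user/scheduler",
            settings             = ClusterSingletonProxySettings(system)),
          name = "schedulerProxy")
    }

The node actors on the worker machines would then send their resource reports and task-done notifications to the proxy instead of a fixed scheduler address, and cluster membership takes care of the join/leave logic you are currently missing.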

It seems to me that the main benefit of putting the Akka cluster in charge of Kubernetes via its API (https://github.com/ticofab/akka-cluster-kubernetes), adding and deleting nodes itself, is cost savings: cluster nodes only exist (from a billing standpoint, not just in software) while they are needed. I’m wondering if this is the general understanding of this idea?