Setting up RoundRobinGroup with BalancingPool

Hi guys,

I’m distributing a fairly heavy computation with Akka Cluster. An orchestrating actor spawns subtasks, which send their results to a reducing actor that collects the sub-results and produces the final result. Pretty simple.

I use round-robin-group configured as follows:

    deployment {
      /parent/child {
        router = round-robin-group
        routees.paths = ["/user/child"]

        cluster {
          enabled = on
          allow-local-routees = on
          use-role = compute
        }
      }
    }

On each node the workers are started as a local pool:

    system.actorOf(ChildWorker.props(...).withRouter(new RoundRobinPool(workerInstances)), "child");

It works as expected, but to gain more speed I thought about using a BalancingPool for the local ChildWorker, to get more flexibility in handling subtasks of quite different durations. The results were a bit surprising to me.

    system.actorOf(ChildWorker.props(...).withRouter(new BalancingPool(workerInstances)), "child");

It turned out that rebalancing works fine on the machine that initiates the processing, but every remote machine always uses a single actor to process the subtasks.

This is how it looks. The table below has three columns: the number of tasks processed, the hostname, and the UUID of the actor that processed them. Rebalancing clearly works nicely on tmp06, but all other machines use only one actor, which handles a lot of tasks and makes the processing way slower.

   7 tmp06  UUID[-1966118007]
   8 tmp06  UUID[-1198559308]
   8 tmp06  UUID[1380304217]
   8 tmp06  UUID[-1393713980]
   8 tmp06  UUID[-231652641]
   8 tmp06  UUID[954268528]
   9 tmp06  UUID[1575392504]
   9 tmp06  UUID[241401640]
   9 tmp06  UUID[253473252]
   9 tmp06  UUID[662565581]
  10 tmp06  UUID[-1021666439]
  10 tmp06  UUID[-1740072156]
  10 tmp06  UUID[-1944934763]
  11 tmp06  UUID[1233386530]
  11 tmp06  UUID[-1461303810]
  11 tmp06  UUID[-358932747]
  11 tmp06  UUID[-375831361]
  11 tmp06  UUID[-381997833]
  11 tmp06  UUID[973420023]
  12 tmp06  UUID[-1595089979]
  12 tmp06  UUID[1620412350]
  12 tmp06  UUID[1882766899]
  12 tmp06  UUID[1978863154]
  12 tmp06  UUID[218773515]
  12 tmp06  UUID[-29776901]
  12 tmp06  UUID[695670443]
  13 tmp06  UUID[-1682686364]
  13 tmp06  UUID[-1899851456]
  13 tmp06  UUID[307375540]
  14 tmp06  UUID[1316354448]
  14 tmp06  UUID[-1375544402]
  14 tmp06  UUID[488103872]
  14 tmp06  UUID[-768640047]
  15 tmp06  UUID[1613489745]
  15 tmp06  UUID[290922216]
  18 tmp06  UUID[1049063036]

 349 tmp01  UUID[-1177891195]
 360 tmp03  UUID[-1159146073]
 367 tmp04  UUID[-2058093493]
 369 tmp02  UUID[982203959]
 383 tmp05  UUID[-988775137]
 407 lhch03 UUID[1661774721]
 407 lhch04 UUID[1801962384]

So, coming to my question: are there any restrictions on using a local BalancingPool together with a clustered RoundRobinGroup that could lead to such behavior?

(I’m using Akka 2.6.4 with Java and JDK 1.8)

Any help appreciated

That’s interesting, and strange. The group router is only sending the messages with actorSelection to the destination path. That is completely independent of the actor that is located at that path.

What happens if you replace the balancing pool with an ordinary round-robin pool?

Could it be that because the bottleneck is the remote transfer the balancing pool can stick to one routee? What happens if you simulate slow destination routees?
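
The hypothesis in the last question can be illustrated with a toy model in plain Java. This is not Akka code and does not reproduce how Akka's dispatcher picks a routee; the class name, the method, and the rule "the lowest-numbered idle worker takes the next task" are assumptions made purely for illustration. It only shows that when tasks arrive more slowly than a single worker can process them, a shared-queue pool may never need a second worker.

```java
import java.util.Arrays;

/** Toy model of a pool whose workers pull from one shared queue. Not Akka code. */
final class SharedQueueToyModel {

    /**
     * Tasks arrive every arrivalGap time units and each takes serviceTime units.
     * An arriving task is taken by the lowest-numbered idle worker (an assumption
     * of this model), or by the worker that frees up first if all are busy.
     * Returns how many tasks each worker handled.
     */
    static int[] tasksPerWorker(int tasks, int workers, long arrivalGap, long serviceTime) {
        long[] freeAt = new long[workers];
        int[] handled = new int[workers];
        for (int i = 0; i < tasks; i++) {
            long arrival = i * arrivalGap;
            int chosen = 0;
            for (int w = 0; w < workers; w++) {
                if (freeAt[w] <= arrival) { chosen = w; break; }   // first idle worker
                if (freeAt[w] < freeAt[chosen]) chosen = w;        // else earliest to free up
            }
            freeAt[chosen] = Math.max(freeAt[chosen], arrival) + serviceTime;
            handled[chosen]++;
        }
        return handled;
    }

    public static void main(String[] args) {
        // Slow delivery, fast workers: one worker keeps up on its own.
        System.out.println(Arrays.toString(tasksPerWorker(12, 4, 5, 2)));
        // Fast delivery, slow workers: work spreads over the pool.
        System.out.println(Arrays.toString(tasksPerWorker(12, 4, 1, 4)));
    }
}
```

With these made-up numbers the first call prints `[12, 0, 0, 0]` and the second prints `[3, 3, 3, 3]`, so in this model the "stuck on one routee" pattern only appears when delivery is slower than processing.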

Thanks for your response, Patrik. I interpret it as “It should work”, which is a good sign :)

Regarding your question:

“What happens if you replace the balancing pool with an ordinary round-robin pool?”

This is my default setup, which works fine. Subtasks are distributed to all machines and to many actors on each machine. But, as I wrote, they are sometimes not balanced optimally because subtask durations differ. That’s what I’m trying to optimize.
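
To make the motivation concrete, here is a small plain-Java toy model (not Akka code) comparing the two strategies. It assumes round-robin binds task i to worker i % n up front, while a shared queue hands the next task to whichever worker frees up first. The class name, method names, and task durations are made up for illustration.

```java
import java.util.PriorityQueue;

/** Toy comparison of round-robin assignment and a shared work queue. Not Akka code. */
final class PoolComparison {

    /** Round-robin: task i is bound to worker i % workers up front. Returns the finish time. */
    static long roundRobinMakespan(long[] durations, int workers) {
        long[] busyUntil = new long[workers];
        for (int i = 0; i < durations.length; i++) {
            busyUntil[i % workers] += durations[i];
        }
        long max = 0;
        for (long t : busyUntil) {
            max = Math.max(max, t);
        }
        return max;
    }

    /** Shared queue: the next task goes to whichever worker becomes free first. */
    static long sharedQueueMakespan(long[] durations, int workers) {
        PriorityQueue<Long> freeAt = new PriorityQueue<>();
        for (int w = 0; w < workers; w++) {
            freeAt.add(0L);
        }
        long max = 0;
        for (long d : durations) {
            long finish = freeAt.poll() + d;   // earliest-free worker takes the task
            max = Math.max(max, finish);
            freeAt.add(finish);
        }
        return max;
    }

    public static void main(String[] args) {
        long[] durations = {4, 1, 4, 1, 4, 1, 4, 1};   // hypothetical task durations
        System.out.println("round-robin:  " + roundRobinMakespan(durations, 2));
        System.out.println("shared queue: " + sharedQueueMakespan(durations, 2));
    }
}
```

For this alternating workload on two workers, round-robin finishes at 16 time units because one worker receives every long task, while the shared queue finishes at 10.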

“Could it be that because the bottleneck is the remote transfer the balancing pool can stick to one routee?”

Destination routees need 1 to 4 seconds to handle each task. A task definition is a few kilobytes (serialized with Kryo) and the result is slightly bigger, up to 100 kB. To be honest, I doubt that the network is a bottleneck, but I could dig in.