Possible to have too many Futures per request

Is it possible to have too many Futures per request regardless of dispatcher tuning?

To illustrate, I have an async action returning a Future. Within the function, 20 Futures are composed through a for expression. Each Future depends on the previous one, so I expect the for expression to desugar into a chain of flatMap / map transforms. Each Future performs a service "task"; each task is very lightweight and non-blocking. One Future in the middle is the result of a REST request made with WSClient.

for {
  r1  <- Future(...)
  r2  <- f(r1)
  // ...
  r10 <- postCallToOtherService(...)
  // ...
  r20 <- f(r19)
} yield r20
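For reference, the for expression above desugars into nested flatMap calls, and each step's continuation is submitted to the dispatcher as a separate task. A minimal, self-contained sketch of that shape, with a hypothetical step function `f` standing in for the real service tasks:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical lightweight step, standing in for the real service tasks.
def f(n: Int): Future[Int] = Future(n + 1)

// What the for expression compiles to: each flatMap registers a callback
// that is scheduled on the execution context when the previous step completes.
val result: Future[Int] =
  Future(0)
    .flatMap(r1 => f(r1))
    .flatMap(r2 => f(r2))
    // ... further steps elided ...
    .flatMap(r19 => f(r19))

println(Await.result(result, 5.seconds))
```

Every arrow in the for expression is therefore at least one dispatcher hop, independent of how cheap the work inside each step is.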

The dispatcher uses a fork-join executor on a quad-core server:

example-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 4
    parallelism-max = 4
  }
}

I noticed something interesting under a load of "many" requests per second. The total time spent executing the functions inside each Future, plus the REST call's response time, was at least an order of magnitude shorter than the time to construct the Futures and yield the final result; more specifically, the span from just before calling Future(...) to yielding the value r20. During this scenario CPU usage was between 50% and 75%.

In this scenario there seems to be a considerable delay between creating a Future and completing it, as if the dispatcher could not start executing it for some time. But then why was the CPU not close to 100% load?
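One way to check whether this delay is tasks queueing in the dispatcher is to measure the gap between submitting a Future and the moment its body actually starts running. A rough sketch (the names and the deliberately small fixed pool are illustrative, not from the original code):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// A small pool so the queueing effect is easy to provoke under load.
val pool = Executors.newFixedThreadPool(2)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

val submitted = System.nanoTime()
val queueDelay: Future[Long] = Future {
  // Time between scheduling the task and the task actually starting.
  System.nanoTime() - submitted
}

val delayMillis = Await.result(queueDelay, 5.seconds) / 1000000
pool.shutdown()
// If this delay grows under load while CPU stays below 100%, tasks are
// waiting in the dispatcher queue rather than executing.
```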

Finally I decided to minimize the number of Futures used for the "steps" of the request. The refactoring would have looked something like:

for {
  a <- Future(/* compute r1, r2 ... r9 */)
  b <- postCallToOtherService(a)
  c <- f(b)
} yield c
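Collapsing the synchronous steps into a single Future cuts the number of tasks the dispatcher has to schedule. As a rough illustration (a sketch with a hypothetical counting wrapper, not the original code), an execution context can count how many tasks each version dispatches:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Wrapper that counts every task submitted to the underlying context.
class CountingEc(underlying: ExecutionContext) extends ExecutionContext {
  val dispatched = new AtomicInteger(0)
  def execute(task: Runnable): Unit = {
    dispatched.incrementAndGet()
    underlying.execute(task)
  }
  def reportFailure(cause: Throwable): Unit = underlying.reportFailure(cause)
}

// 20 Futures chained with flatMap: many dispatcher hops.
val chainedEc = new CountingEc(ExecutionContext.global)
val chained = {
  implicit val ec: ExecutionContext = chainedEc
  (1 to 19).foldLeft(Future(0))((acc, _) => acc.flatMap(n => Future(n + 1)))
}
Await.result(chained, 10.seconds)

// The same work folded into a single Future: one dispatched task.
val collapsedEc = new CountingEc(ExecutionContext.global)
val collapsed = {
  implicit val ec: ExecutionContext = collapsedEc
  Future((1 to 19).foldLeft(0)((n, _) => n + 1))
}
Await.result(collapsed, 10.seconds)

println(s"chained: ${chainedEc.dispatched.get}, collapsed: ${collapsedEc.dispatched.get}")
```

Both versions compute the same value, but the chained version submits a task for the initial Future, for each flatMap continuation, and for each inner Future, while the collapsed version submits essentially one.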

However, I opted to manage the state by refactoring this into an Actor:

def receive: Receive = {
  case r1: Request =>
    s = sender() // stored as actor state so the second case can reply
    r2 = ...
    r9 = ...

    postCallToOtherService(...).onComplete {
      case Success(r10) => self ! r10
      case Failure(e)   => ... // propagate the failure, e.g. s ! Status.Failure(e)
    }

  case r10: R =>
    r11 = ...
    r20 = ...
    s ! r20 // send the final value to the original sender
}

In the final scenario CPU load decreased and latency was roughly halved.

Through these scenarios I observed that concurrency utilities carry real overhead and should be used carefully. There is plenty of documentation on managing blocking code inside Futures, but how much work is too little work, considering the cost of dispatching a Future? Any advice is appreciated.