async-profiler shows that my akka-http server is spending 16% of CPU time in ForkJoinPool.scan (10% in signalWork and 6% in Unsafe.park). The server gets thousands of requests per second and usually responds with 300ms latency. My fork-join-pool configuration looks like this
The machine has 16 CPUs and 16GB of memory. The application runs with -Xmx12g.
Total JVM threads include:
16 threads - akka.default-blocking-io-dispatcher (used to load a file every minute; barely used and doesn't really show up in profiling)
8 threads - logback (Typesafe logger)
Why is ForkJoinPool.scan taking so much CPU? I have played around with all the fork-join-executor configuration options but can't get the ForkJoinPool.scan CPU share below 16%. Any suggestions on how I can improve performance?
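For context, a fork-join-executor dispatcher of the kind being tuned here is configured roughly like this; the values below are illustrative, not the poster's actual settings:

```hocon
# Illustrative only, not the poster's actual configuration
akka.actor.default-dispatcher {
  executor = "fork-join-executor"
  fork-join-executor {
    # pin the pool to the 16 available cores
    parallelism-min = 16
    parallelism-factor = 1.0
    parallelism-max = 16
  }
  # messages an actor processes per scheduling slot
  throughput = 5
}
```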
ForkJoinPool.scan is (or rather, used to be) a mechanism whereby an idle pool thread keeps spinning for a while, actively polling for more work. That helps with latency: a park/unpark cycle takes quite some time compared to the nearly instantaneous execution of a task that was found while polling. In that sense, scan is a trade-off between latency and efficiency during idle times.
If you don’t overprovision fork join pool threads, i.e. every fork join pool thread mostly has a dedicated core to run on, scanning shouldn’t affect throughput. However, it does affect fairness and power consumption.
So, are you running at 100% CPU (across all cores) when you see that behavior?
In any case, with Akka 2.6 we switched from an internal copy of the ForkJoinPool to using the JDK variant directly. The JDK disabled scanning in more recent versions (or at least drastically decreased its impact). Could you try rerunning your test with Akka 2.6.6 and see if the problem persists?
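For reference, the upgrade itself is just a dependency bump in the build; the module list below is illustrative and should be matched to your actual build:

```scala
// build.sbt sketch: module list is illustrative, adjust to your project
val akkaVersion = "2.6.6"
libraryDependencies ++= Seq(
  "com.typesafe.akka" %% "akka-actor"  % akkaVersion,
  "com.typesafe.akka" %% "akka-stream" % akkaVersion
)
```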
No, the application is not running at 100% CPU when I see this behavior. We have autoscaling that never lets the nodes go beyond 60% CPU. So the behavior is quite common with the way we are using akka-http.
Yes, I will try running with Akka 2.6.6 and see if that improves the situation
Our fork-join-pool uses 16 threads and we have 16 CPUs. It would be helpful if you could elaborate on why it affects fairness and power consumption. If I increase the number of threads, the CPU share of ForkJoinPool.scan increases.
In that case, the behavior you see can be quite problematic, as it can make the machine look busier than it actually is.
E.g. if you have a load of 16 concurrent tasks that each execute for a very short time and then asynchronously wait for something for the same amount of time, then, if the pauses are short enough, all threads might still appear continuously at 100% even though the actual load is really only around 50%.
Typical actor use cases can actually fall into this kind of pattern: if you have two actors that communicate, and actor A is just sending a message to actor B, then actor A is still active when it sends the message, so processing the message on actor B may start concurrently with actor A still running. In many cases, however, actor A will be idle soon afterwards. If the pool threads running those actors have no other work to do and the idle times between activations of those actors are small enough, then the threads will keep busy-spinning (for a while) waiting for more work to arrive. IIRC the spinning time is on the order of microseconds, but depending on the work patterns that can introduce lots of extra CPU load (where the CPU would otherwise be idle).
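The arithmetic behind this pattern can be sketched as follows; the numbers are made up for illustration, with the spin window taken to be "on the order of microseconds" as described above:

```scala
// Illustrative numbers: 16 tasks that each run for `busyMicros`,
// then wait asynchronously for `idleMicros` before the next burst.
val busyMicros = 20.0
val idleMicros = 20.0
val spinWindowMicros = 100.0 // assumed spin budget before parking

// Real utilization: busy time as a fraction of the busy/idle cycle.
val actualUtilization = busyMicros / (busyMicros + idleMicros) // 0.5

// If the spin window covers the idle gap, the pool threads never
// park, so from the outside every core looks 100% busy.
val apparentUtilization =
  if (spinWindowMicros >= idleMicros) 1.0
  else actualUtilization

println(f"actual: ${actualUtilization * 100}%.0f%%, apparent: ${apparentUtilization * 100}%.0f%%")
```

This is exactly the gap between what htop reports and what the machine is actually doing.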
If you overprovision threads (more threads than cores / hyper-threads), you get an even worse problem because after bursts of work there are even more threads battling to find more work, but the spinning might take longer because some of those threads will be preempted.
Effect on power consumption: instead of going idle, threads actively look for more work (for a while), so, as said before, depending on the workload a thread might use much more CPU than strictly necessary.
Effect on fairness: this shows up at several levels. Within the process, since pool threads are less often idle, other threads of the same process get fewer chances to run. At the OS level, since there is less idle time, processes/threads with lower priority get fewer chances to run.
All that said, it shouldn’t be (much of) an issue in Akka 2.6 any more!
I did upgrade to Akka 2.6.6 and ran async-profiler. The good news is that ForkJoinPool.signalWork dropped from 16% to just under 2%. But I didn't see any performance improvement (overall CPU% remained the same).
Comparing the flame graphs before and after the Akka 2.6 upgrade, I can certainly see the drop in ForkJoinPool.signalWork, but it seems all that CPU time got redistributed across the other tasks and didn't result in an overall performance improvement.
I see an increase in CPU time in FSM (45%), ActorGraphInterpreter (20%), and BatchingExecutor (10%), but all of these run some application code, so I can't say whether it's the way we are using akka-http that's not as efficient as it could be.
How do you measure CPU time? async-profiler only gives you relative percentages, so it's expected that those freed-up percentages get redistributed until everything adds up to 100% again. The change isn't supposed to get you extra performance; it will only increase the idle time of your CPU, i.e. save energy.
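The renormalization can be made concrete with a quick calculation; the percentages below mirror the ones in this thread, and the scaling factor is the point:

```scala
// Profiler percentages are shares of total samples. If signalWork
// shrinks from 16% to 2% of samples while the remaining work is
// unchanged, everything else scales up to fill the gap.
val beforeOther = 84.0              // everything except signalWork, before
val afterSignalWork = 2.0
val scale = (100.0 - afterSignalWork) / beforeOther
val afterOther = beforeOther * scale // back to 98%: same work, bigger share
```

So every other frame's percentage grows even though its absolute CPU cost is identical, which matches what the flame graph comparison showed.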
If you want to increase performance, then the FJP issue is mostly a red herring. What kind of performance do you see, and what do you need? Can you share the flame graphs?
I measure CPU usage of the JVM process using htop. We also have a Datadog metric that subtracts CPU idle time from 100 to trigger alerts.
We are getting 27000 requests per minute, and it takes around 43 nodes (16 CPUs and 16GB each) to handle that with 300ms latency. This is odd because we have other servers written with akka-http that handle more than this.
I'm not at liberty to share the flame graphs publicly yet. If there's a way to DM, I can probably get them over (or perhaps screenshots?).
In general, I think the discussion on this topic has given us the much-needed impetus to upgrade all of our servers to 2.6.
Thanks for sharing the flame graphs privately with me. I identified a couple of optimization opportunities in akka-http. Overall it looks like akka and akka-http could be a bit more optimized but there’s nothing in there that should make a very big difference (> 10%).
Future processing has been much improved in Scala 2.13. E.g. Future.failed is quite inefficient in Scala 2.12 because it creates an Exception with a stack trace when the future is successful (that's something I found in your flame graph).
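To make that concrete, here is a small sketch of the semantics involved: `.failed` is the failed projection of a future, so when the original future succeeds, the projection itself has to fail, and on Scala 2.12 that failure path constructs a fresh exception, stack trace and all:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.Failure

// `.failed` succeeds with the Throwable if the original future failed,
// and fails if the original succeeded.
val succeeded = Future.successful(42)
val projection = succeeded.failed

Await.ready(projection, 1.second)
projection.value match {
  case Some(Failure(e: NoSuchElementException)) =>
    // On 2.12 this NoSuchElementException is built (with a stack
    // trace) every time this path is hit, which is the hot-path cost
    // that showed up in the flame graph.
    println(s"projection of a successful future fails: $e")
  case other =>
    println(s"unexpected: $other")
}
```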
I’ll create tickets for the things I found in Akka / Akka HTTP:
- discardEntity on Strict entities should be a no-op but is really slow instead
- NewHostConnectionPool does expensive debug logging even if debug logging is turned off
- Source.single is surprisingly slow
- toStrict will access the config if you don't pass in a byte limit
That sounds quite slow indeed but, of course, I don't know the actual processing load your application has. Those numbers boil down to 27000 requests per minute / 60 seconds per minute / 43 nodes ≈ 10.5 requests per second per node. That doesn't sound like a load that should be able to saturate all cores. So something else seems off here.
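That estimate, spelled out:

```scala
// Per-node throughput from the numbers in this thread.
val requestsPerMinute = 27000.0
val nodes = 43.0
val perNodeRps = requestsPerMinute / 60.0 / nodes
println(f"$perNodeRps%.1f requests/second per node") // ≈ 10.5
```

At ~10.5 requests/second and 300ms latency, only about 3 requests are in flight per 16-core node at any moment, which is why this looks far from saturation.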
So I would definitely recommend updating to the latest versions of Scala, Akka, and Akka HTTP, rerunning the tests, and only then starting to micro-optimize things.