[PERF] Cluster Sharding perf issue

Problem Encountered

ASKing a local actor achieves >120k req/s throughput, but ASKing a remote actor drops throughput to < 15k req/s.

With serialisation on, local messaging still achieves >70k req/s, so the drastic drop in throughput does not seem to be caused by serialisation alone.

Cross-machine communication in Cluster Sharding was expected to be faster.

Environment
Java™ SE Runtime Environment (build 16.0.2+7-67)
Java HotSpot™ 64-Bit Server VM (build 16.0.2+7-67, mixed mode, sharing)
Windows Server 2019
n2-standard-8 machine in GCP (8 CPU @ 2.80 GHz and 32GB RAM)

A minimal repo that reproduces the issue is on GitHub: carl-camilleri-uom/benchmark-akka-scala-cluster

In this repo, compiled.zip contains a compiled version using sbt universal:packageBin. Use bin\run.bat to execute.

Details of Experiment

Two n2-standard-8 nodes in GCP (8 CPU @ 2.80 GHz and 32GB RAM) running Windows Server 2019 (“instance-1” and “instance-2”) form the cluster, and a third machine runs the benchmarks.

Check 1
curl http://instance-1:8080/1

Response:
Pong(hello from instance-1,Actor[akka://ping-pong-cluster-system/system/sharding/Ping/49/1#-1050428529])

Therefore the actor for entity id 1 is hosted on instance-1 (the actor path in the reply has no host part, i.e. it is a local reference).

Check 2
curl http://instance-1:8080/2

Response:
Pong(hello from instance-2,Actor[akka://ping-pong-cluster-system@instance-2:2551/system/sharding/Ping/50/2#-995701206])

Therefore the actor for entity id 2 is hosted on instance-2, so a request for it that arrives at instance-1 has to cross the network.

Benchmark 1 - “Local” Actor

wrk -t64 -c64 -d30s http://instance-1:8080/1

Running 30s test @ http://instance-1:8080/1
  64 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   688.79us    1.36ms  31.40ms   96.97%
    Req/Sec     2.00k   415.62     2.69k    88.02%
  3839568 requests in 30.10s, 798.25MB read
Requests/sec: 127563.34
Transfer/sec:     26.52MB

Benchmark 2 - “Remote” Actor
wrk -t64 -c64 -d30s http://instance-1:8080/2

Running 30s test @ http://instance-1:8080/2
  64 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.93ms    3.67ms  31.20ms   82.61%
    Req/Sec   230.09     30.70   464.00     83.37%
  440164 requests in 30.10s, 97.81MB read
Requests/sec:  14623.61
Transfer/sec:      3.25MB
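
As a rough sanity check on these two runs: wrk keeps one request in flight per connection, so with 64 connections the throughput should be close to 64 divided by the mean latency (Little's law). The sketch below applies that to the averages reported above. It is my own back-of-envelope estimate in Python, not part of the repo, and wrk's averaged latency is only an approximation of the true mean, so the match is expected to be loose.

```python
CONNECTIONS = 64  # from `wrk -t64 -c64`: one outstanding request per connection

def estimated_throughput(mean_latency_s: float, connections: int = CONNECTIONS) -> float:
    """Little's law for a closed-loop client: throughput = concurrency / latency."""
    return connections / mean_latency_s

# Average latency (seconds) and Requests/sec as reported by wrk above.
runs = {
    "local (Benchmark 1)": (688.79e-6, 127563.34),
    "remote (Benchmark 2)": (4.93e-3, 14623.61),
}

for name, (latency_s, measured) in runs.items():
    estimate = estimated_throughput(latency_s)
    print(f"{name}: estimated {estimate:,.0f} req/s, measured {measured:,.0f} req/s")
```

For the remote run the estimate is about 13k req/s against a measured 14.6k, which suggests the remote figure is largely explained by the ~5 ms average latency at a fixed concurrency of 64.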

Benchmark 3 - “Local” Actor with serialisation
This benchmark is the same as Benchmark 1, but with akka.actor.serialize-messages = on.

wrk -t64 -c64 -d30s http://instance-1:8080/1

Running 30s test @ http://instance-1:8080/1
  64 threads and 64 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.09ms    1.42ms  28.50ms   94.87%
    Req/Sec     1.15k   305.61     2.19k    89.11%
  2196409 requests in 30.10s, 452.45MB read
Requests/sec:  72972.05
Transfer/sec:     15.03MB
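
To put the three results side by side, the sketch below simply recomputes the ratios from the Requests/sec figures reported above. It is written in Python for convenience and is not part of the benchmark repo; the variable and function names are my own.

```python
# Requests/sec as reported by wrk in the three benchmarks above.
local = 127563.34             # Benchmark 1: local actor
remote = 14623.61             # Benchmark 2: remote actor
local_serialized = 72972.05   # Benchmark 3: local actor, serialize-messages = on

def drop_percent(baseline: float, measured: float) -> float:
    """Throughput lost relative to the baseline, in percent."""
    return (1.0 - measured / baseline) * 100.0

serialization_drop = drop_percent(local, local_serialized)
remote_drop = drop_percent(local, remote)
remote_vs_serialized = local_serialized / remote

print(f"local -> local with serialisation: -{serialization_drop:.1f}%")
print(f"local -> remote:                   -{remote_drop:.1f}%")
print(f"local with serialisation / remote: {remote_vs_serialized:.1f}x")
```

Serialisation alone costs roughly 43% of the local throughput, while going remote costs about 88.5%, and the remote case is still about 5x slower than the serialised local case.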