Akka gRPC performance in benchmarks

Hi.

I recently saw this gRPC benchmark (GitHub - LesnyRumcajs/grpc_bench: Various gRPC benchmarks) on Twitter, and the akka-grpc results are very weak compared with Java.

Do you have any explanation for the results? The code of the benchmark seems very simple and I don’t see anything obviously wrong…


So I had a closer look at this, and I think there are two different issues here:

  • single-core server results
  • multi-core server results

Single-core server

The single-core results are bad because there are a lot of failure responses from the server:

Status code distribution:
  [OK]                 68817 responses    
  [Unavailable]        220868 responses   
  [DeadlineExceeded]   293 responses      

Error distribution:
  [219868]   rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: timed out waiting for server handshake   
  [227]      rpc error: code = DeadlineExceeded desc = context deadline exceeded                                                                          
  [66]       rpc error: code = DeadlineExceeded desc = latest connection error: timed out waiting for server handshake                                    
  [1000]     rpc error: code = Unavailable desc = transport is closing  

while for the java benchmark I see

Status code distribution:
  [OK]            1523989 responses   
  [Unavailable]   1000 responses      

Error distribution:
  [1000]   rpc error: code = Unavailable desc = transport is closing  

and I see the following errors in the server logs

22:37:36.470 [GreeterServer-akka.actor.default-dispatcher-13] ERROR akka.actor.ActorSystemImpl - Unhandled error: [Stream with ID [27] was closed by peer with code CANCEL(0x08)].
akka.http.scaladsl.model.http2.PeerClosedStreamException: Stream with ID [27] was closed by peer with code CANCEL(0x08)

However, after increasing the parallelism setting these errors are gone, but the results are not much better. I suspect the problem is that the dispatcher settings are not well suited for a single-core server…
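For reference, this kind of dispatcher tuning goes into application.conf (or is passed as -D system properties, as done later in this thread). The setting names are Akka’s standard fork-join-executor keys; the values below are just an illustrative sketch for a single-core container, not the ones used in the benchmark:

```
# Size the default dispatcher explicitly instead of deriving it from the
# (host) CPU count, which can be misleading inside a CPU-limited container.
akka.actor.default-dispatcher {
  fork-join-executor {
    parallelism-min = 1
    parallelism-factor = 1.0
    parallelism-max = 1
  }
}
```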

3-core server

Here I no longer see those errors:

Status code distribution:
  [OK]            186355 responses   
  [Unavailable]   966 responses      

Error distribution:
  [966]   rpc error: code = Unavailable desc = transport is closing  

but there is a lot of GC activity, which I can see when running with JFR. Checking the TLAB Allocations tab, akka.http.scaladsl.model.Uri$.parseHttp2AuthorityPseudoHeader(ParserInput, Charset, Uri$ParsingMode) shows up as a bottleneck. UriParser was already reported as a bottleneck in Allocation hotspots in akka-http-client · Issue #2739 · akka/akka-http · GitHub

Future steps

I hope these findings can help improve the performance of akka-http and akka-grpc. Meanwhile I will try to contribute some JMH tests to akka-http and, if my skills allow, contribute some improvements.
In the meantime, if someone has any tips about the single-core problems, that would be helpful…

I opened this PR, but other than removing that outside-TLAB allocation, the results are still similar

There are several other comments and tips about how to improve that benchmark’s performance on the r/scala subreddit.
Concretely, there is a comment about the use of MergeHub and another about adding a delay to make the tests more realistic.

FYI: I opened a PR with a benchmark of fs2-grpc: scala fs2-grpc implementation by jtjeferreira · Pull Request #143 · LesnyRumcajs/grpc_bench · GitHub

Concretely, there is a comment about the use of MergeHub

That code is not used (I guess they copy-pasted it from a hello-world example), so that is not the problem.

and another about adding a delay to make the tests more realistic.

Still, that does not explain the big difference between the Scala and the Java implementations.

Thanks for these investigations!

The benchmark doesn’t have a warmup phase, so running on a single core means that the single core will be busy JIT-compiling all the libraries for quite a while (on my fast machine for >60s). This means that a) the single core spends a significant amount of time executing the JIT compiler and b) the actual code is not yet optimized for a long time. That leads to performance that is orders of magnitude worse for the first tens of seconds.

All that extra contention also means potentially more context switching introducing even more cost.

I can get to about 30-50% of the Java throughput when I don’t constrain the number of cores for the akka version. That seems reasonable given the architectural constraints of the Akka implementation.

Note that the results as presented don’t make much sense. You cannot measure both latency and throughput at the same time and get meaningful results. (I also doubt the ghz latency reporting in general; it seems the numbers can be way off when the benchmarking tool gets overloaded itself.)

Hi @jrudolph

Sorry that I was not clear, but what I meant is that the server errors in the logs are gone, while the “Status code distribution” is still the same, i.e. there are still a lot of errors returned from the server. So I think we should have a closer look at this, but honestly I am out of ideas…

The benchmark doesn’t have a warmup phase

It has a warmup phase of 5s. Maybe it is not enough, but at least it is the same for all the benchmarks.

I can get to about 30-50% of the Java throughput when I don’t constrain the number of cores for the akka version.

These are the numbers I get when running with a 3-core server on my 12-core laptop:

Benchmark info:
37a7f8b Mon, 17 May 2021 16:06:05 +0100 João Ferreira scala zio-grpc implementatio
Benchmarks run: scala_fs2_bench scala_akka_bench scala_zio_bench java_hotspot_grpc_pgc_bench
GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=5s
GRPC_SERVER_CPUS=3
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211705T162018
--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   59884 |       16.19 ms |       40.65 ms |       54.12 ms |       88.15 ms |  256.21% |     204.7 MiB |
| scala_akka         |    7031 |      141.70 ms |      281.35 ms |      368.74 ms |      592.53 ms |  294.91% |    175.44 MiB |
| scala_fs2          |    7005 |      142.20 ms |      231.57 ms |      266.35 ms |      357.07 ms |  274.57% |    351.34 MiB |
| scala_zio          |    6835 |      145.74 ms |      207.45 ms |      218.25 ms |      266.37 ms |  242.61% |    241.43 MiB |
--------------------------------------------------------------------------------------------------------------------------------

@jrudolph if you are interested I can share some flamegraphs… (unfortunately this forum does not allow uploading SVG files)

Interesting, that was only added last week, after I looked at it. 5 seconds is probably too little; 30-60 seconds would be good, but that is still probably not enough for the 1-core version.

I see numbers in a similar ballpark. I think part of what I saw was related to the benchmark client being saturated at around 50-60k rps with the default 9-core client setting.

Thanks, I also got some flamegraphs. I don’t see any particularly low-hanging fruit, unfortunately. Right now, with HTTP/2 and gRPC, everything needs to be handled with streams, even the small non-streaming requests of the benchmark. Creating lots of streams is somewhat expensive, so we lose quite some performance in these areas. For HTTP/1.1 we have Strict entities to avoid this cost, but replicating that in HTTP/2 is more challenging. I created Investigate HTTP/2 performance improvements · Issue #3815 · akka/akka-http · GitHub to track what we could do to improve the situation.

@jrudolph thanks for your comments; indeed the use of streams seems to be a source (no pun intended) of problems. Meanwhile I also opened an issue in akka-grpc regarding marshalling and streams: Marshalling performance · Issue #1349 · akka/akka-grpc · GitHub

Regarding the single-core errors, do you have any ideas or hints on how to tweak the configuration?

I only tested without Docker, and there I don’t see that problem on my machine. Here’s the command I use for running the example:

taskset -c 0 /usr/lib/jvm/adoptopenjdk-16-hotspot-amd64/bin/java -XX:+PreserveFramePointer -XX:+UnlockDiagnosticVMOptions -XX:+DebugNonSafepoints -XX:+PrintCompilation -Dakka.actor.default-dispatcher.fork-join-executor.parallelism-max=1 -jar target/scala-2.13/akka-grpc-quickstart-scala-assembly-1.0.jar

and here for the benchmarker:

taskset -c 4-15 ./ghz --proto=proto/helloworld/helloworld.proto --call=helloworld.Greeter.SayHello --insecure --concurrency=50 --connections=5 --duration 300s --cpus=7 --data-file payload/100B 127.0.0.1:50051

Ah, this is using much lower concurrency and fewer connections.

Just using the setup as given, increasing GRPC_BENCHMARK_WARMUP to 20s seems to avoid the extra failed requests. But I think you can just ignore the single-core benchmark; it’s not a feasible setup.

So after “wasting” all these hours profiling, I noticed that the heap settings were not being applied. After changing that, the results are a bit better.
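For context: a container memory limit like GRPC_SERVER_RAM=512m does not by itself set the JVM heap; unless heap flags actually reach the java process, the JVM falls back to its own default sizing. One way to make the limit explicit is a Dockerfile line along these lines (JAVA_TOOL_OPTIONS is standard JVM behavior; the concrete values are illustrative, not the ones from the benchmark repo):

```
# Picked up automatically by any JVM started in the container; sizes the heap
# explicitly below the 512m container limit instead of relying on defaults.
ENV JAVA_TOOL_OPTIONS="-Xms400m -Xmx400m"
```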

Single-core server

GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=20s
GRPC_SERVER_CPUS=1
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211805T232758
--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   38695 |       25.73 ms |       45.49 ms |       51.22 ms |       71.34 ms |  100.23% |    189.41 MiB |
| scala_zio          |    7306 |      136.40 ms |      199.65 ms |      217.33 ms |      365.58 ms |   98.02% |     261.0 MiB |
| scala_fs2          |    5959 |      167.18 ms |      200.19 ms |      212.46 ms |      286.31 ms |   100.2% |    219.81 MiB |
| scala_akka         |    3063 |      324.07 ms |      501.57 ms |      599.05 ms |      811.57 ms |   98.75% |     252.9 MiB |
--------------------------------------------------------------------------------------------------------------------------------

3-core server

GRPC_BENCHMARK_DURATION=50s
GRPC_BENCHMARK_WARMUP=20s
GRPC_SERVER_CPUS=3
GRPC_SERVER_RAM=512m
GRPC_CLIENT_CONNECTIONS=50
GRPC_CLIENT_CONCURRENCY=1000
GRPC_CLIENT_QPS=0
GRPC_CLIENT_CPUS=9
GRPC_REQUEST_PAYLOAD=100B
-----
Benchmark finished. Detailed results are located in: results/211805T233358
--------------------------------------------------------------------------------------------------------------------------------
| name               |   req/s |   avg. latency |        90 % in |        95 % in |        99 % in | avg. cpu |   avg. memory |
--------------------------------------------------------------------------------------------------------------------------------
| java_hotspot_grpc_pgc |   60200 |       15.97 ms |       32.56 ms |       44.28 ms |       72.06 ms |  223.35% |    199.76 MiB |
| scala_akka         |   16948 |       58.71 ms |       99.88 ms |      112.57 ms |      182.43 ms |  298.79% |    264.26 MiB |
| scala_fs2          |   15886 |       62.76 ms |       69.74 ms |       83.58 ms |      138.01 ms |  299.52% |     243.0 MiB |
| scala_zio          |   15837 |       62.96 ms |       97.95 ms |      114.88 ms |      191.66 ms |  300.35% |    293.31 MiB |
--------------------------------------------------------------------------------------------------------------------------------

So after this simple fix, Akka, fs2 and zio all perform comparably? That’s great to know, thanks for your work.