Benchmarking problem

Hi everyone,

My colleagues (Paweł Dolega & Marcin Zagórski) from VirtusLab and I are preparing a presentation for the ReactSphere conference in which we try to showcase the differences between traditional synchronous microservices and reactive microservices. We’ve selected Scalatra (built on the Servlet API) as the sync platform and Akka HTTP as the reactive platform, and built a small stack of 3 microservices in each paradigm. Our whole codebase resides at https://github.com/VirtusLab/ReactSphere-reactive-beyond-hype

We’re currently load testing the whole stack in a Tectonic Kubernetes cluster on AWS using Gatling, and we’ve encountered several strange issues with Akka HTTP that don’t show up with Scalatra. We’ve been able to scale the load tests from 700 users over 300 seconds up to 2300 users over 300 seconds with Scalatra, and only past that point did the sync stack start to crash. Akka HTTP, on the other hand, has nicer latencies at the 75th percentile, but starts to yield timeouts at 1500 users over 300 seconds.

Here’s an excerpt from Gatling’s log:

---- Errors --------------------------------------------------------------------
> j.n.ConnectException: connection timed out: identity-service-t    509 (85.98%)
ertiary.microservices.svc.cluster.local/10.3.240.9:80
> status.find.in(200), but actually found 400                        33 ( 5.57%)
> j.u.c.TimeoutException: Request timeout to identity-service-te     27 ( 4.56%)
rtiary.microservices.svc.cluster.local/10.3.240.9:80 after 600...
> status.find.in(200), but actually found 503                         9 ( 1.52%)
> status.find.is(201), but actually found 503                         8 ( 1.35%)
> status.find.is(200), but actually found 503                         5 ( 0.84%)
> status.find.in(201,409), but actually found 503                     1 ( 0.17%)
================================================================================

The 400s are actually expected in the results, as they are related to Cassandra’s eventual consistency.

We’re trying to understand what might be causing this problem. Do you have any clues?

Łukasz

It’s really hard to say anything without running the benchmark and profiling it, which I don’t have time for right now. So let me just venture some wild guesses, probably all in the wrong direction, but who knows :slight_smile:

-> This probably doesn’t have much to do with the issue, but I’d recommend using the ActorSystem’s dispatcher (system.dispatcher) anyway and not mixing thread pools when possible.
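A minimal sketch of what I mean - the names here are made up, and I haven’t checked how you actually wire your execution contexts:

import akka.actor.ActorSystem
import scala.concurrent.{ExecutionContext, Future}

object DispatcherUsage {
  implicit val system: ActorSystem = ActorSystem("identity-service")

  // Reuse the actor system's dispatcher for Future work instead of
  // creating a separate thread pool next to it.
  implicit val ec: ExecutionContext = system.dispatcher

  def lookupToken(token: String): Future[String] =
    Future(s"user-for-$token") // placeholder for real non-blocking work
}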

I see you are using com.typesafe.scalalogging. I don’t know this logger, but does it provide any asynchronous capability? If not, then judging from the logback config at e.g. https://github.com/VirtusLab/ReactSphere-reactive-beyond-hype/blob/master/codebase/auction-house-primary-async/src/main/resources/logback.xml you don’t log asynchronously, which might get in the way of performance. You can configure async appenders in logback.
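Something along these lines - untested against your setup, and the appender names are just placeholders:

<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Wrap the console appender so logging doesn't block request threads -->
  <appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
    <queueSize>8192</queueSize>
    <neverBlock>true</neverBlock>
    <appender-ref ref="STDOUT"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="ASYNC"/>
  </root>
</configuration>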

Hope this helps,

Manuel

Thanks Manuel,

We’ll take your insight into consideration. I think we’ve been able to pinpoint the culprit - in the billing service we’re performing an HTTP call to the payment system, which models a slow, external service that processes, well, payments. The payment system simply sleeps for 1 second before answering, and apparently this long HTTP call was causing systemic trouble - when we replaced it with
onComplete(after(1.second, system.scheduler)(Future("ok")))
the problem seemed to disappear (at least locally), so we figure it must somehow be related to Akka HTTP client connection pools and connection starvation. We will deploy full load tests today with the solution suggested by the docs - a separate connection pool for that long request - and hopefully everything will go fine this time.
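For reference, roughly what we plan to deploy - this is only a sketch with placeholder host names and pool sizes, not our actual code:

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.http.scaladsl.settings.ConnectionPoolSettings
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.util.Try

object PaymentClient {
  implicit val system: ActorSystem = ActorSystem("billing-service")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Dedicated pool for the slow payment system, so its 1-second responses
  // don't hold up connections needed for fast inter-service calls.
  private val slowPoolSettings = ConnectionPoolSettings(system)
    .withMaxConnections(32)
    .withMaxOpenRequests(256)

  private val paymentPool =
    Http().cachedHostConnectionPool[Unit]("payment-system", 80, settings = slowPoolSettings)

  def callPaymentSystem(request: HttpRequest): Future[Try[HttpResponse]] =
    Source.single((request, ()))
      .via(paymentPool)
      .map(_._1)
      .runWith(Sink.head)
}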

Łukasz

If you do Thread.sleep anywhere in the sample app without taking special care to put it on a separate dispatcher, you are likely starving the default fork-join pool, meaning that there are no threads available to handle other work. So it is very likely a culprit. If you need to do blocking, you should isolate it onto a separate thread-pool-based dispatcher.
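Roughly like this - the dispatcher name and pool size are just examples:

// In application.conf:
// blocking-io-dispatcher {
//   type = Dispatcher
//   executor = "thread-pool-executor"
//   thread-pool-executor {
//     fixed-pool-size = 16
//   }
//   throughput = 1
// }

import akka.actor.ActorSystem
import scala.concurrent.{ExecutionContext, Future}

object BlockingIsolation {
  implicit val system: ActorSystem = ActorSystem("example")

  // A dedicated dispatcher looked up by its config path, so that
  // Thread.sleep never ties up the default fork-join pool.
  val blockingEc: ExecutionContext = system.dispatchers.lookup("blocking-io-dispatcher")

  def slowPayment(): Future[String] = Future {
    Thread.sleep(1000) // simulated slow legacy call
    "ok"
  }(blockingEc)
}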

@johanandren payment-system app is scalatra-based, so Thread.sleep(1000) is at home there :smiley:

It’s basically a way for us to introduce the variable of integrating with a slow, legacy system and see how it impacts a microservices stack and how the different paradigms cope with it - it’s a huge pain for a sync stack because it blocks threads, but it shouldn’t be a big problem for a reactive stack. It seems that using a single HTTP connection pool for all outgoing connections, including connections to the payment-service that just sleeps for 1 second, was a bad decision.
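For context, the mock payment endpoint boils down to something like this (an illustrative sketch, not the literal code - the route is made up):

import org.scalatra.{Ok, ScalatraServlet}

class PaymentSystemServlet extends ScalatraServlet {

  // Simulates a slow legacy payment provider: every request blocks
  // the servlet thread for one second before answering.
  post("/payments") {
    Thread.sleep(1000)
    Ok("ok")
  }
}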

Ah, I thought that was in the Akka HTTP app you were benching.

If you are doing 2300 concurrent users, every one of those requests triggers an HTTP request to the slower system, and that slower system can handle many concurrent requests, then you may want to tweak the HTTP client connection pool configuration. I think the default max-connections per host is something like 4, so that could well be a bottleneck.

We’ve been able to pinpoint the culprit. Apparently the slowest part of the equation is Alpakka’s S3 integration, which we are using to put dummy invoices in S3. That S3 driver caused inter-service 503s due to request timeouts.

We have tuned the configuration to limit errors, as we want a cleaner overview of latencies, and ended up with this:

akka.http.server.request-timeout = "infinite"
akka.http.client.connecting-timeout = 15 min
akka.http.host-connection-pool.max-open-requests = 256
akka.http.host-connection-pool.max-connections = 20

@johanandren we already increased max-connections some time ago - do you happen to know why the default limit is set to 4? It’s quite small.