Random 502 and 504 errors during load testing

Under a load of 200 requests per second, around 2% to 10% of requests return HTTP status 502, and around 1% return 504.

We run the Play server in a K8s cluster in a private cloud, with 3 replicas. K8s is configured with an nginx ingress proxy that directs requests to Play. We use OpenJDK JRE 1.8.

The timeout settings are as follows:
play.server.http.idleTimeout = 180s
play.server.akka.requestTimeout = 300s

We tried both the default server backend and Netty and see the same issue with both. The issue does not occur even when we load 1000 concurrent requests against a development VM. We have figured out that the private cloud provider's network is quite slow, and that appears to be causing the issue. How do we customise the Play configuration for such a slow-network scenario so that we can reduce the error rate in the JMeter load tests?

The Play server log reports no errors. However, the nginx proxy server log shows the error below:
recv() failed (104: Connection reset by peer) while reading response header from upstream.

Any suggestion or help is appreciated.

Hi @dilipmys,

What about the nginx proxy_pass timeout configuration? Do you have proxy_connect_timeout and proxy_read_timeout configured, or are you using the default values? Also, do you have any timeouts configured on the JMeter side?

What is the application doing? Accessing another service, a database, etc.? Can you confirm the requests are reaching the application, or are they timing out somewhere else?

Best.

Thanks for your response, @marcospereira.

The proxy_connect_timeout and proxy_read_timeout in nginx are set to 180 seconds.
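
For reference, the relevant nginx directives look roughly like this (an illustrative snippet, not our exact configuration; with the Kubernetes nginx ingress these values are normally driven by the proxy-connect-timeout / proxy-read-timeout / proxy-send-timeout annotations rather than edited by hand):

# Illustrative nginx proxy timeouts only, not our exact config.
location / {
    proxy_connect_timeout 180s;
    proxy_read_timeout    180s;
    proxy_send_timeout    180s;
    proxy_pass http://play-backend;   # placeholder upstream name
}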

The JMeter timeout is set to 300 seconds.

The application is reading from and writing to a DB, which could be a bottleneck. Any idea how to tune for such a condition to avoid the 502 errors? Appreciate your help on it.

Hey @dilipmys,

Which version of Play are you using? Since it is not yet clear where the bottleneck is, I recommend using a profiler to check where your application is slowing down. Then you can tune it from there.

Best.

Thanks @marcospereira,

We will try that to find the exact root cause. But even assuming the response is slower for some reason, can't we tune for that to avoid the 502s? Is there a parameter available for this on either the Play side or the proxy server side? We are using Play 2.6.x.

@dilipmys,

Well, I prefer to go with the profiler and see what it shows, since I don’t want to speculate on what is causing the slowdown and how a configuration change could magically resolve it. The way I see it, increasing limits and timeouts will only postpone the problem a little.

But still, I can ask some questions:

  1. Is the project using Java or Scala?
  2. How is the connection pool configured?
  3. How is the thread pool configured?
  4. Is there something blocking the default execution context (for example, a get on a CompletableFuture)? See the sketch after this list.
  5. Are there calls to external services, or just to the database?
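
To illustrate question 4, here is a minimal sketch of the blocking pattern to look for versus the asynchronous alternative. The names (PushController, KafkaService) are made up for illustration; only the shape of the code matters:

// Sketch only: PushController and KafkaService are made-up names.
import play.mvc.Controller;
import play.mvc.Result;
import java.util.concurrent.CompletionStage;

public class PushController extends Controller {

    private final KafkaService kafkaService; // hypothetical wrapper around the producer

    public PushController(KafkaService kafkaService) {
        this.kafkaService = kafkaService;
    }

    // Blocking pattern to look for: join()/get() parks a thread of the
    // default execution context until the send is acknowledged.
    public Result pushBlocking() {
        kafkaService.send(request().body().asJson().toString())
                    .toCompletableFuture().join();
        return ok();
    }

    // Non-blocking alternative: return the CompletionStage and let Play
    // complete the HTTP response when the send completes.
    public CompletionStage<Result> pushAsync() {
        return kafkaService.send(request().body().asJson().toString())
                           .thenApply(metadata -> ok());
    }
}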

Best.

Thanks @marcospereira,

  1. It is using Java with OpenJDK 8.
  2. The connection pool and thread pool are not changed from the default settings (the config sketch after this list shows where we would tune them).
  3. Not sure whether something is blocking the default execution context; we will take a look at that.
  4. The JSON message received is less than 1 KB and is simply pushed to a Kafka topic, which completes the request/response cycle; a sketch of this flow is at the end of this post. Other services pull the data from the Kafka topic.
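
For reference, these are the places in application.conf we would touch if we do tune the pools (the values below are illustrative placeholders, not our current settings or a recommendation):

# Illustrative values only, not recommendations.
# Default execution context (Play's default thread pool, backed by Akka):
akka.actor.default-dispatcher {
  fork-join-executor {
    parallelism-min    = 8
    parallelism-factor = 2.0   # threads per CPU core
    parallelism-max    = 32
  }
}

# JDBC connection pool (HikariCP) for the default database:
play.db.default.hikaricp.maximumPoolSize = 20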

The 502 issue appears when the load crosses 300 concurrent requests. There are 3 replicas of the Play service, load balanced by the K8s Ingress.
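
To make the flow in item 4 concrete, here is a simplified sketch of how such a push-to-Kafka step can be wrapped so the controller stays non-blocking, using the standard Kafka Java client. The topic name, bootstrap address, serializers, and class name are assumptions, not our actual code:

// Illustrative sketch only; topic name, addresses, and wiring are assumptions.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class KafkaService {

    private final KafkaProducer<String, String> producer;

    public KafkaService() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092"); // placeholder address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        this.producer = new KafkaProducer<>(props);
    }

    // Wrap the producer callback in a CompletableFuture so a controller can
    // return a CompletionStage<Result> instead of blocking on the send.
    public CompletionStage<RecordMetadata> send(String json) {
        CompletableFuture<RecordMetadata> promise = new CompletableFuture<>();
        producer.send(new ProducerRecord<>("incoming-json", json), (metadata, exception) -> {
            if (exception != null) {
                promise.completeExceptionally(exception);
            } else {
                promise.complete(metadata);
            }
        });
        return promise;
    }
}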