40 second delay within Play framework when receiving a request

I’m experiencing an issue that seems very similar to 60 second delay within Play framework when receiving a request · Issue #6943 · playframework/playframework · GitHub and How to debug the delay between Netty and Akka Default Dispatcher.

Play Version

2.8.5

API

Scala (2.12.12)

Operating System

Linux 4.14.238-125.422.amzn1.x86_64 #1 SMP

JDK

openjdk version “1.8.0_302”
OpenJDK Runtime Environment Corretto-8.302.08.1 (build 1.8.0_302-b08)
OpenJDK 64-Bit Server VM Corretto-8.302.08.1 (build 25.302-b08, mixed mode)

Expected Behavior

Play Framework promptly calls the controller for processing the request.

Actual Behavior

We are observing what looks like Play getting into a “bad state” for some period of time (several minutes up to several hours) where many requests that are being made to that Play instance time out from the perspective of the caller. We have verified via (caller and Play) logs as well as packet captures on the target Play instance/VM, sequences which look like the following:

  • Server where Play is running receives the request (verified via packet capture)
  • 10 seconds later, caller times out waiting for a response
  • 30 seconds after the previous step (40 seconds since the request was sent), our logging in Play (a Filter which logs before and after request processing) makes a log which shows that the request was processed in something like 5 msec. (By this time, the caller has long since timed out)

We have observed instances with a high uptime and a low uptime go into this state. At the same time, other identical instances are seemingly unaffected. There may be a period of days/hours before the “bad state” where the instance is just fine, and days/hours after the “bad state” where the instance is just fine… processing hundreds of thousands / millions of requests without any issues. We have engaged AWS support as well and they have ruled out any infrastructure / networking / hardware issues during these times.

Dependency versions

dependencyOverrides += "com.typesafe.akka" %% "akka-actor"                  % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-actor-typed"            % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-slf4j"                  % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-cluster"                % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-cluster-typed"          % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-cluster-sharding"       % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-cluster-sharding-typed" % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-cluster-tools"          % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-serialization-jackson"  % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-testkit"                % "2.6.11"
dependencyOverrides += "com.typesafe.akka" %% "akka-http-core"              % "10.2.3"
dependencyOverrides += "com.typesafe.akka" %% "akka-http"                   % "10.2.3"
dependencyOverrides += "com.typesafe.play" %% "play-ahc-ws"                 % "2.8.5"
dependencyOverrides += "com.typesafe.play" %% "play-caffeine-cache"         % "2.8.5"
dependencyOverrides += "com.typesafe.play" %% "play-guice"                  % "2.8.5"
dependencyOverrides += "com.typesafe.play" %% "play-jdbc-api"               % "2.8.5"
dependencyOverrides += "com.typesafe.play" %% "play-json"                   % "2.8.1"

looks like thread starvation to me… I’d check if you have blocking code in your codebase using up all the request processing threads (e.g. jdbc)

One performance issue that I’ve run into the past is with random number generation. If you are using cryptographic operations, eg TLS/SSL, then this might affect you. Haveged - ArchWiki can be installed and can help in scenarios where random number generation is very slow.

Thank you for the responses.

This service isn’t doing any SSL. The issue was indeed thread starvation. Very unexpected and a bit difficult to suss out from the logs, but that was absolutely it.