Improving Resiliency of Akka IO DNS

We recently encountered an interesting failure mode with some of our critical infrastructure that is based on Akka and Akka HTTP. At the time of the failure, there was resource contention on the host, and the container running the Akka application had recently restarted. When the application attempted to make its first HTTP request, we encountered the following failure:

2023-04-01 20:46:25,731 ERROR [hello-world-akka.actor.default-dispatcher-19] OneForOneStrategy: unable to create native thread: possibly out of memory or process/resource limits reached
akka.actor.ActorInitializationException: akka://hello-world/system/IO-DNS: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:196)
at akka.actor.ActorCell.create(ActorCell.scala:664)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:514)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:536)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:295)
at akka.dispatch.Mailbox.run(Mailbox.scala:230)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290)
at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020)
at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
Caused by: java.lang.reflect.InvocationTargetException: null
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at akka.util.Reflect$.instantiate(Reflect.scala:73)
at akka.actor.ArgsReflectConstructor.produce(IndirectActorProducer.scala:101)
at akka.actor.Props.newActor(Props.scala:226)
at akka.actor.ActorCell.newActor(ActorCell.scala:616)
at akka.actor.ActorCell.create(ActorCell.scala:643)
… 10 common frames omitted
Caused by: java.lang.OutOfMemoryError: unable to create native thread: possibly out of memory or process/resource limits reached
at java.base/java.lang.Thread.start0(Native Method)
at java.base/java.lang.Thread.start(Thread.java:798)
at java.base/java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:937)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1343)
at akka.dispatch.ExecutorServiceDelegate.execute(ThreadPoolBuilder.scala:219)
at akka.dispatch.ExecutorServiceDelegate.execute$(ThreadPoolBuilder.scala:219)
at akka.dispatch.Dispatcher$LazyExecutorServiceDelegate.execute(Dispatcher.scala:43)
at akka.dispatch.Dispatcher.registerForExecution(Dispatcher.scala:127)
at akka.dispatch.MessageDispatcher.attach(AbstractDispatcher.scala:152)
at akka.actor.dungeon.Dispatch.start(Dispatch.scala:121)
at akka.actor.dungeon.Dispatch.start$(Dispatch.scala:119)
at akka.actor.ActorCell.start(ActorCell.scala:411)
at akka.actor.LocalActorRef.start(ActorRef.scala:384)
at akka.actor.dungeon.Children.makeChild(Children.scala:325)
at akka.actor.dungeon.Children.actorOf(Children.scala:46)
at akka.actor.dungeon.Children.actorOf$(Children.scala:45)
at akka.actor.ActorCell.actorOf(ActorCell.scala:411)
at akka.routing.Pool.newRoutee(RouterConfig.scala:204)
at akka.routing.Pool.newRoutee$(RouterConfig.scala:203)
at akka.routing.ConsistentHashingPool.newRoutee(ConsistentHashing.scala:283)
at akka.routing.RoutedActorCell.$anonfun$start$1(RoutedActorCell.scala:113)
at scala.collection.StrictOptimizedSeqFactory.fill(Factory.scala:330)
at scala.collection.StrictOptimizedSeqFactory.fill$(Factory.scala:325)
at scala.collection.immutable.Vector$.fill(Vector.scala:34)
at akka.routing.RoutedActorCell.start(RoutedActorCell.scala:113)
at akka.routing.RoutedActorCell.start(RoutedActorCell.scala:40)
at akka.actor.RepointableActorRef.point(RepointableActorRef.scala:116)
at akka.actor.RepointableActorRef.initialize(RepointableActorRef.scala:87)
at akka.actor.LocalActorRefProvider.actorOf(ActorRefProvider.scala:738)
at akka.actor.dungeon.Children.makeChild(Children.scala:312)
at akka.actor.dungeon.Children.actorOf(Children.scala:48)
at akka.actor.dungeon.Children.actorOf$(Children.scala:47)
at akka.actor.ActorCell.actorOf(ActorCell.scala:411)
at akka.io.SimpleDnsManager.<init>(SimpleDnsManager.scala:27)
… 19 common frames omitted

For the remainder of the life of the application, we received the following log:

2023-04-01 20:46:25,739 INFO [hello-world-akka.actor.default-dispatcher-19] RepointableActorRef: Message [akka.io.dns.DnsProtocol$Resolve] from Actor[akka://hello-world/system/IO-TCP/selectors/$a/76#-810619897] to Actor[akka://hello-world/system/IO-DNS#-770742294] was not delivered. [10] dead letters encountered, no more dead letters will be logged in next [5.000 min]. If this is not an expected behavior then Actor[akka://hello-world/system/IO-DNS#-770742294] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings ‘akka.log-dead-letters’ and ‘akka.log-dead-letters-during-shutdown’.

I believe I understand the core of what is happening here: the IO-DNS actor fails to start up, and from then on Akka is no longer able to create TcpOutgoingConnection objects because DNS resolution never responds [Source: akka/io/TcpOutgoingConnection.scala#L65-L73]. I have two threads [pun intended] I am trying to pull on:

  1. Is there a way to make this actor more resilient to failure? It would be ideal if we could either supervise this actor or have the application crash if it is unable to start successfully…

  2. These requests are actually made to a specific IP and port combination: 127.0.0.1:4315. Is there some way to avoid DNS in this scenario?

A few details about our setup:

  • com.typesafe.akka:akka-actor_2.13:2.6.18
  • com.typesafe.akka:akka-http_2.13:10.2.9
  • OpenJDK Runtime Environment Zulu11.62+17-CA (build 11.0.18+10-LTS)

We start up our Actor system by doing something like:

implicit val actorSystem: ActorSystem = ActorSystem("hello-world", config)

And the HTTP server like:

  logger.info(s"Akka-HTTP binding to port: $port and interface: $interface...")
  private val bindingFuture = Http(actorSystem).newServerAt(interface, port).bind(routes)

  // for simplicity, block until the HTTP server is actually started
  private val tryBinding = Try(Await.result(bindingFuture, 10.seconds))

  private val binding = tryBinding match {
    case Success(b) =>
      logger.info(s"Successfully bound to port: ${b.localAddress.getPort} on interface: $interface")
      logger.info("HttpServer started")
      b
    case Failure(e) =>
      logger.error(s"Failed to bind to port, exception: $e")
      throw e
  }

Any tips or thoughts would be greatly appreciated.

If you are hitting out-of-memory or process limits like that, which block the JVM from starting threads, you would very likely hit more problems further down the road, even if the DNS actor startup were to retry and possibly succeed (though it would likely just hit the same limit over and over again).
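
On the second question (avoiding DNS for 127.0.0.1:4315), here is a minimal JDK-only sketch of the distinction that seems to matter. As far as I can tell, the TcpOutgoingConnection lines referenced in the question only ask the DNS actor when the target address is unresolved, so the thing to check is which kind of InetSocketAddress ends up being passed in. The object name ResolvedAddressSketch is made up for illustration, and whether the Akka HTTP client hands over a resolved or an unresolved address in your setup is an assumption you would need to verify yourself.

```scala
import java.net.InetSocketAddress

// Hypothetical illustration only: compares the two ways the JDK can
// represent the 127.0.0.1:4315 target mentioned in the question.
object ResolvedAddressSketch {
  val host = "127.0.0.1"
  val port = 4315

  // An IP literal is parsed locally by the JDK, so this address is
  // already resolved and needs no name-service lookup.
  val resolved: InetSocketAddress = new InetSocketAddress(host, port)

  // createUnresolved never attempts resolution, even for an IP literal,
  // so whoever receives this address still has to resolve it.
  val unresolved: InetSocketAddress = InetSocketAddress.createUnresolved(host, port)

  def main(args: Array[String]): Unit = {
    println(s"resolved.isUnresolved   = ${resolved.isUnresolved}")   // false
    println(s"unresolved.isUnresolved = ${unresolved.isUnresolved}") // true
  }
}
```

If the address reaching the TCP layer is of the second kind, the lookup still goes through IO-DNS even though the host is an IP literal, which would match the dead letters in the log above.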
