Artery Errors: ConductorServiceTimeoutException and Aeron client conductor is closed

Hello,
I recently switched from netty.tcp to Artery.
I ran into this error: Insufficient usable storage for new log of length=50335744 in /dev/shm (tmpfs).
After searching for a solution, I understood that this happens because I was using the defaults to start the media driver, which runs it inside the same JVM as the ActorSystem. As described in the Akka documentation, it is better to start the driver externally and have it shared.
I followed the instructions in the documentation and even used the same example configuration.
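
For reference, this is roughly how I start the shared driver (a minimal sketch; the only thing that really matters is that the directory matches the aeron-dir setting in my configuration below, /dev/shm/aeron in my case):

import io.aeron.driver.MediaDriver

object SharedMediaDriver {
  def main(args: Array[String]): Unit = {
    // Must match akka.remote.artery.advanced.aeron-dir in every ActorSystem
    val ctx = new MediaDriver.Context()
    ctx.aeronDirectoryName("/dev/shm/aeron")
    val driver = MediaDriver.launch(ctx)
    sys.addShutdownHook(driver.close())
  }
}

Alternatively, I believe the driver's own main class io.aeron.driver.MediaDriver can be run directly with -Daeron.dir=/dev/shm/aeron, which should amount to the same thing.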

Now the problem is that my actors start and then, after a short time, they all fail with an exception:

[ERROR] [04/24/2018 20:49:49.606] [aeron-client-conductor] [akka.remote.artery.aeron.ArteryAeronUdpTransport(akka://ActorClusterSystemName)] Fatal Aeron error ConductorServiceTimeoutException. Have to terminate ActorSystem because it lost contact with the external Aeron media driver. Possible configuration properties to mitigate the problem are 'client-liveness-timeout' or 'driver-timeout'. io.aeron.exceptions.ConductorServiceTimeoutException: Exceeded (ns): 5000000000

io.aeron.exceptions.ConductorServiceTimeoutException: Exceeded (ns): 5000000000
	at io.aeron.ClientConductor.checkServiceInterval(ClientConductor.java:745)
	at io.aeron.ClientConductor.onCheckTimeouts(ClientConductor.java:720)
	at io.aeron.ClientConductor.service(ClientConductor.java:659)
	at io.aeron.ClientConductor.doWork(ClientConductor.java:151)
	at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:233)
	at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:159)
	at java.lang.Thread.run(Thread.java:748)

[ERROR] [04/24/2018 20:49:49.963] [aeron-client-conductor] [akka.remote.artery.aeron.ArteryAeronUdpTransport(akka://ActorClusterSystemName)] Aeron error, org.agrona.concurrent.AgentTerminationException
org.agrona.concurrent.AgentTerminationException
	at io.aeron.ClientConductor.doWork(ClientConductor.java:148)
	at org.agrona.concurrent.AgentRunner.doDutyCycle(AgentRunner.java:233)
	at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:159)
	at java.lang.Thread.run(Thread.java:748)

and then after many undelivered messages:

[INFO] [04/24/2018 20:49:52.270] [ActorClusterSystemName-akka.remote.default-remote-dispatcher-8] [akka://ActorClusterSystemName@xx.xx.xx.xx:2559/system/remoting-terminator] Remote daemon shut down; proceeding with flushing remote transports.

Other actors in the cluster display this error message:

[ERROR] [04/24/2018 20:52:27.468] [ActorClusterSystemName-akka.actor.default-dispatcher-18] [akka://ActorClusterSystemName@xx.xx.xx.xx:2557/] swallowing exception during message send
java.lang.IllegalStateException: Aeron client conductor is closed
	at io.aeron.ClientConductor.ensureOpen(ClientConductor.java:635)
	at io.aeron.ClientConductor.addPublication(ClientConductor.java:367)
	at io.aeron.Aeron.addPublication(Aeron.java:247)
	at akka.remote.artery.aeron.AeronSink$$anon$1.<init>(AeronSink.scala:103)
	at akka.remote.artery.aeron.AeronSink.createLogicAndMaterializedValue(AeronSink.scala:100)
	at akka.stream.impl.GraphStageIsland.materializeAtomic(PhasedFusingActorMaterializer.scala:630)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:450)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:415)
	at akka.stream.impl.PhasedFusingActorMaterializer.materialize(PhasedFusingActorMaterializer.scala:406)
	at akka.stream.scaladsl.RunnableGraph.run(Flow.scala:588)
	at akka.remote.artery.Association.runOutboundOrdinaryMessagesStream(Association.scala:710)
	at akka.remote.artery.Association.$anonfun$runOutboundOrdinaryMessagesStream$3(Association.scala:720)
	at akka.remote.artery.Association.$anonfun$attachOutboundStreamRestart$1(Association.scala:814)
	at akka.remote.artery.Association$LazyQueueWrapper.runMaterialize(Association.scala:89)
	at akka.remote.artery.Association$LazyQueueWrapper.offer(Association.scala:93)
	at akka.remote.artery.Association$LazyQueueWrapper.offer(Association.scala:84)
	at akka.remote.artery.Association.send(Association.scala:379)
	at akka.remote.artery.ArteryTransport.send(ArteryTransport.scala:714)
	at akka.remote.RemoteActorRef.$bang(RemoteActorRefProvider.scala:574)
	at akka.actor.ActorRef.tell(ActorRef.scala:124)
	at akka.actor.ActorSelection$.rec$1(ActorSelection.scala:265)
	at akka.actor.ActorSelection$.deliverSelection(ActorSelection.scala:269)
	at akka.actor.ActorSelection.tell(ActorSelection.scala:46)
	at akka.actor.ScalaActorSelection.$bang(ActorSelection.scala:280)
	at akka.actor.ScalaActorSelection.$bang$(ActorSelection.scala:280)
	at akka.actor.ActorSelection$$anon$1.$bang(ActorSelection.scala:198)
	at akka.cluster.ClusterCoreDaemon.gossipTo(ClusterDaemon.scala:1285)
	at akka.cluster.ClusterCoreDaemon.gossip(ClusterDaemon.scala:1009)
	at akka.cluster.ClusterCoreDaemon.gossipTick(ClusterDaemon.scala:972)
	at akka.cluster.ClusterCoreDaemon$$anonfun$initialized$1.applyOrElse(ClusterDaemon.scala:484)
	at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
	at akka.actor.Actor.aroundReceive(Actor.scala:517)
	at akka.actor.Actor.aroundReceive$(Actor.scala:515)
	at akka.cluster.ClusterCoreDaemon.aroundReceive(ClusterDaemon.scala:288)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
	at akka.actor.ActorCell.invoke_aroundBody0(ActorCell.scala:557)
	at akka.actor.ActorCell$AjcClosure1.run(ActorCell.scala:1)
	at org.aspectj.runtime.reflect.JoinPointImpl.proceed(JoinPointImpl.java:149)
	at akka.kamon.instrumentation.ActorMonitors$$anon$1.$anonfun$processMessage$1(ActorMonitor.scala:123)
	at kamon.Kamon$.withContext(Kamon.scala:120)
	at akka.kamon.instrumentation.ActorMonitors$$anon$1.processMessage(ActorMonitor.scala:123)
	at akka.kamon.instrumentation.ActorCellInstrumentation.aroundBehaviourInvoke(ActorInstrumentation.scala:45)
	at akka.actor.ActorCell.invoke(ActorCell.scala:550)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
	at akka.dispatch.Mailbox.run(Mailbox.scala:225)
	at kamon.executors.Executors$InstrumentedExecutorService$$anon$7.run(Executors.scala:270)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

The media driver seems to be working fine! I checked the log files in the /dev/shm directory, and there are no errors in loss-report.dat or the other files.

I have increased both net.core.rmem_max and net.core.wmem_max to 4194304, and I set the JVM -Xms to 1024M. I am running 7 actors in the cluster.

This is my Artery configuration:

akka.remote {
    log-remote-lifecycle-events = off
    maximum-payload-bytes = 15 MiB
    artery {
      enabled = on
      transport = aeron-udp
      canonical.hostname = "127.0.0.1"
      canonical.hostname = ${?HOST}
      canonical.port = ${PORT}
      advanced {
        maximum-large-frame-size = 15 MiB
        send-buffer-size = 15 MiB
        receive-buffer-size = 15 MiB
        maximum-frame-size = 15 MiB
        outbound-message-queue-size = 2480000
        aeron-dir = /dev/shm/aeron
        embedded-media-driver = off
      }
    }
}

Akka version: 2.5.12. Scala version: 2.12.5. Linux distribution: Debian 9. I added aeron-driver-1.7.0.jar, aeron-client-1.7.0.jar, and agrona-0.9.12.jar to the classpath when starting the external shared MediaDriver.

I would appreciate it if you could point me to how I can fix these errors.
Thanks

Troubleshooting for the /dev/shm issue is described here: https://github.com/real-logic/aeron#troubleshooting

Perhaps you are trying to run too much on a single machine (virtual machine, Docker, or whatever) so that it is overloaded?
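
If the machine is only transiently overloaded, you can also try raising the timeouts that the error message itself points at. Something along these lines (values are illustrative; note that the external media driver's own client liveness timeout, the aeron.client.liveness.timeout property, should be raised to match, otherwise one side gives up before the other):

akka.remote.artery.advanced {
  client-liveness-timeout = 30 seconds
  driver-timeout = 30 seconds
}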

If you have a constrained environment, it might be worth trying Artery with TCP instead; see the docs.
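
The switch itself is small; a sketch (the rest of your artery block can stay as it is):

akka.remote.artery {
  enabled = on
  transport = tcp   # instead of aeron-udp; no media driver or /dev/shm involved
}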

By the way, 15 MiB messages are not going to work well, not even with Artery.

Thanks Patrik. I will check the troubleshooting page.
Yes, these errors started appearing when I increased the number of actors running on bare-metal machines (I am not using any VMs or containers) to more than 5. I assumed that would be OK, especially since I used to run a larger number of actors on the same machines when I was using netty.tcp.
Regarding the message sizes, I used the same maximum message sizes with netty.tcp, so I did not expect errors to show up when I switched to Artery.
Thanks

You say number of actors, but I guess you mean number of ActorSystems. In each ActorSystem you can run many Actors.

I think Aeron limits the message size to 1/8th of the term buffer size, so you would have to increase that. I wouldn't recommend > 2-3 MiB. Smaller is better.
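
To make that concrete (my arithmetic, based on Aeron's defaults): the default term buffer is 16 MiB, so the default limit is 16 MiB / 8 = 2 MiB per message. A publication log is 3 terms plus a metadata page (4 KiB here), which matches the length in the error at the top of this thread exactly:

3 × 16,777,216 B + 4,096 B = 50,335,744 B

A 15 MiB maximum-frame-size would therefore need a term buffer of at least 8 × 15 MiB = 120 MiB, rounded up to the next power of two, 128 MiB, and three of those per log in /dev/shm.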

Yes, I meant ActorSystem. This drew my attention to something that I might be doing wrong.
In the cluster example in the Akka documentation, only one actor was created for each actor system, so I was under the impression that this is what is recommended when running an Akka cluster.
I will check that as well.
Thanks

Ouch, there is apparently a danger with too simplistic examples. That is definitely not recommended. Actors are lightweight; you can have thousands or even millions of them within one ActorSystem, which is heavyweight. Here is some more reading: Actor Systems • Akka Documentation
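
For illustration, one ActorSystem hosting many actors (names here are placeholders):

import akka.actor.{Actor, ActorSystem, Props}

class Worker extends Actor {
  def receive = {
    case msg => // handle the message
  }
}

object Main extends App {
  // One ActorSystem per JVM/node: heavyweight, owns thread pools and remoting
  val system = ActorSystem("ActorClusterSystemName")
  // Actors are cheap: create as many as the domain calls for
  (1 to 1000).foreach(i => system.actorOf(Props[Worker], s"worker-$i"))
}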

Thank you very much for pointing this out. Once I started the actors from within the same actor system, the Aeron error was gone.

Using routers is still not working (no errors, though), but it could be related to my router configuration; again, it used to work fine with netty.tcp.
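
For reference while I dig into it, a cluster-aware group router is configured along these lines (router type, paths, and names are placeholders from the docs, not my actual setup):

akka.actor.deployment {
  /parent/workerRouter {
    router = round-robin-group
    routees.paths = ["/user/worker"]
    cluster {
      enabled = on
      allow-local-routees = on
    }
  }
}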