Why does my cluster break after a while due to quarantined nodes?

(Tolga Cakiroglu) #1

We have a messaging application, and our cluster sometimes fails. Some nodes stop communicating with each other, and eventually the application stops working correctly.

We are seeing lots of warnings about quarantined nodes in the Akka logs. We increased system-message-buffer-size to 40000 to try to solve the issue. After updating this parameter the system is more stable, but we still hit the issue. We get the following exception:

akka.remote.ResendBufferCapacityReachedException: Resend buffer capacity of [40000] has been reached.
	at akka.remote.AckedSendBuffer.buffer(AckedDelivery.scala:124) ~[com.typesafe.akka.akka-remote_2.11-2.5.19.jar:2.5.19]
	at akka.remote.ReliableDeliverySupervisor.akka$remote$ReliableDeliverySupervisor$$tryBuffer(Endpoint.scala:436) ~[com.typesafe.akka.akka-remote_2.11-2.5.19.jar:2.5.19]
	at akka.remote.ReliableDeliverySupervisor.akka$remote$ReliableDeliverySupervisor$$handleSend(Endpoint.scala:419) ~[com.typesafe.akka.akka-remote_2.11-2.5.19.jar:2.5.19]
	at akka.remote.ReliableDeliverySupervisor$$anonfun$receive$1.applyOrElse(Endpoint.scala:300) ~[com.typesafe.akka.akka-remote_2.11-2.5.19.jar:2.5.19]
	at akka.actor.Actor$class.aroundReceive(Actor.scala:517) ~[com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.remote.ReliableDeliverySupervisor.aroundReceive(Endpoint.scala:207) ~[com.typesafe.akka.akka-remote_2.11-2.5.19.jar:2.5.19]
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.actor.ActorCell.invoke(ActorCell.scala:557) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.Mailbox.run(Mailbox.scala:225) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [com.typesafe.akka.akka-actor_2.11-2.5.19.jar:2.5.19]
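
For reference, the buffer size change mentioned above was made roughly along these lines. This is only a sketch: it assumes classic remoting (which the ReliableDeliverySupervisor frames in the stack trace suggest), where the setting lives under akka.remote. The nesting shown here is illustrative and the rest of our configuration is omitted.

```hocon
akka {
  remote {
    # Classic remoting: capacity of the resend buffer for system messages.
    # As far as I know the default is 20000; we doubled it.
    system-message-buffer-size = 40000
  }
}
```

The value in the exception message ([40000]) matches this setting, so the new limit is being applied and is still being reached.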

Do I need to increase system-message-buffer-size further? What is the optimal value for it?

Or is there an error in our implementation of Akka remoting?

Our system currently has 7 nodes (containers) in Rancher 1.6, with 3K socket connections per node. Memory and CPU usage are at normal levels.

I would really appreciate any help.
Best regards,
Tolga