[Akka 2.5.x][Remoting] - Recovering from quarantined nodes

Dear list members,

My company runs a large-scale deployment (hundreds of JVMs) based on Akka, spread across different regions globally. Some of the services communicate using Akka Remoting (TCP, not Artery).

As it goes, global cloud deployments suffer from occasional disconnections between regions, ranging from severe packet loss to total disconnection. We expect things to be shaky while a network disruption is happening, but we also expect everything to go back to normal when the storm passes.

Observing the logs we see many instances of the following:

Tried to associate with unreachable remote address [akka.tcp://systemName@192.168.236.12:2558]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

AssociationError [akka.tcp://com-company-resource-sip@192.168.222.36:2558] -> [akka.tcp://systemName@192.168.236.11:2558]: Error [Invalid address: akka.tcp://systemName@192.168.236.11:2558] [ akka.remote.InvalidAssociation: Invalid address: akka.tcp://systemName@192.168.236.11:2558 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted. ]

Reading the Akka Remoting documentation, those errors mean that the two remote actor systems in question will never be able to communicate with each other again unless the systems are restarted.

What is the proper way of recovering from these situations? It does not sound logical to me that I need to restart all nodes of the system every time a network disconnection occurs. What am I missing here?

Thanks in advance for your replies.

Regards,

Dima Gutzeit

You don’t mention using Cluster which is rather worrying – remoting should not be used stand-alone, it is an implementation detail of the cluster. The cluster provides resilience mechanisms that can survive more issues than plain remoting.

Can you confirm if you’re using Cluster or not?

We started using Akka way before Clustering was introduced, so yes, it's pure Remoting.

Hi Dima,

You should investigate what is causing the quarantine; it should not be caused by connection issues alone. Read the section about Quarantine in the documentation. Even though it is written in the Artery documentation, pretty much the same applies to classic remoting.

The usual suspect is watch or remote deployment when using Remoting without Cluster. As soon as the failure detector triggers, it will quarantine the other system. Therefore I recommend against using these features when Cluster isn’t used.

Thank you Patrik.

Looking at the documentation, I can see the following statement:

“Quarantine usually does not happen if remote watch or remote deployment is not used”.

In our code, we don’t use either explicitly: we are not doing remote actor deployments and we are not watching remote actors. Still, the problem happens. Any idea why?

Good, thanks for confirming that.

Next would be to try to find more information in the logs around the time of the first mention of “quarantine”. Note that the log you supplied mentions “The remote system has quarantined this system”.

Then it’s more interesting to look at the log on the other side to find out what was causing the quarantine.
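
To make the quarantine easier to trace in the logs on both sides, something like the following could go into application.conf on each node. This is only a sketch, assuming classic (akka.tcp) remoting on Akka 2.5.x; the values are illustrative rather than recommendations, and DEBUG logging is assumed to be acceptable only temporarily while investigating.

```
akka {
  # Assumption: raised temporarily while investigating; revert afterwards.
  loglevel = "DEBUG"

  remote {
    # Log remoting lifecycle events (association, disassociation, quarantine).
    # Usually already on by default; stated explicitly here.
    log-remote-lifecycle-events = on

    # How long an address stays gated after a failed association attempt.
    # 5 s corresponds to the "gated for 5000 ms" message in the log above.
    retry-gate-closed-for = 5 s
  }
}
```

With this in place, the first log entry mentioning quarantine on the remote side (here the 192.168.236.x nodes) is the one to look at, since that is the system that decided to quarantine.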