[Akka 2.5.x][Remoting] - Recovering from quarantined nodes

gutzeit · May 17, 2018, 10:45am

Dear list members,

My company runs a large-scale deployment (hundreds of JVMs) based on Akka, spread across regions globally. Some of the services communicate using Akka Remoting (TCP, not Artery).

As it goes, global cloud deployments suffer from occasional disconnections between regions, ranging from severe packet loss to total disconnection. We expect things to be shaky while a network disruption is happening, but we also expect everything to go back to normal once the storm passes.

Observing the logs we see many instances of the following:

Tried to associate with unreachable remote address [akka.tcp://systemName@192.168.236.12:2558]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted.]

AssociationError [akka.tcp://com-company-resource-sip@192.168.222.36:2558] -> [akka.tcp://systemName@192.168.236.11:2558]: Error [Invalid address: akka.tcp://systemName@192.168.236.11:2558] [ akka.remote.InvalidAssociation: Invalid address: akka.tcp://systemName@192.168.236.11:2558 Caused by: akka.remote.transport.Transport$InvalidAssociationException: The remote system has quarantined this system. No further associations to the remote system are possible until this system is restarted. ]

According to the Akka Remoting documentation, these errors mean that the two remote actor systems in question will never be able to communicate with each other again unless the systems are restarted.
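When many such warnings accumulate, it can help to pull out which remote address was gated and whether the stated reason is a quarantine. The following is a minimal sketch that only assumes the message shape of the gating warning quoted in this post; `GatedLogLine` and its members are hypothetical names, not part of Akka, and other Akka versions may word the message differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Akka): parses the "gated" warning quoted in this thread.
final class GatedLogLine {
    // Shape copied from the quoted log line; this is an assumption, not a stable format.
    private static final Pattern GATED = Pattern.compile(
        "Tried to associate with unreachable remote address \\[([^\\]]+)\\]\\. "
            + "Address is now gated for (\\d+) ms.*Reason: \\[(.*)\\]\\s*$");

    final String address;    // remote address that was gated
    final long gatedMillis;  // how long the address stays gated
    final String reason;     // text inside "Reason: [...]"

    private GatedLogLine(String address, long gatedMillis, String reason) {
        this.address = address;
        this.gatedMillis = gatedMillis;
        this.reason = reason;
    }

    /** Returns the parsed fields, or empty if the line is not a gating warning. */
    static Optional<GatedLogLine> parse(String line) {
        Matcher m = GATED.matcher(line);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new GatedLogLine(m.group(1), Long.parseLong(m.group(2)), m.group(3)));
    }

    /** True when the stated reason is that the remote side has quarantined this system. */
    boolean remoteQuarantinedUs() {
        return reason.contains("The remote system has quarantined this system");
    }

    public static void main(String[] args) {
        // Pass one log line as the first argument.
        parse(args.length > 0 ? args[0] : "").ifPresent(g ->
            System.out.println(g.address + " gated for " + g.gatedMillis
                + " ms, remote quarantined us: " + g.remoteQuarantinedUs()));
    }
}
```

Grouping the parsed lines by `address` would show which remote systems have quarantined this one, which is the set of peers whose logs are worth reading next.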

What is the proper way of recovering from these situations? It does not sound logical to me that I need to restart all nodes of the system every time a network disconnection occurs. What am I missing here?

Thanks in advance for your replies.

Regards,

Dima Gutzeit

ktoso · May 17, 2018, 11:11am

You don’t mention using Cluster which is rather worrying – remoting should not be used stand-alone, it is an implementation detail of the cluster. The cluster provides resilience mechanisms that can survive more issues than plain remoting.

Can you confirm if you’re using Cluster or not?

gutzeit · May 17, 2018, 11:23am

We started using Akka way before Clustering was introduced, so yes, it’s pure Remoting.

patriknw · May 17, 2018, 7:18pm

Hi Dima,

You should investigate what is causing the quarantine; it should not be caused by the connection issues alone. Read the section about Quarantine in the documentation. Even though it is written in the Artery documentation, pretty much the same applies to classic remoting.

The usual suspect is watch or remote deployment when using Remoting without Cluster. As soon as the failure detector triggers it will quarantine the other system. Therefore I recommend against using these features when Cluster isn’t used.

gutzeit · May 23, 2018, 6:47am

Thank you, Patrik.

Looking at the documentation, I can see the following statement:

“Quarantine usually does not happen if remote watch or remote deployment is not used”.

In our code, we don’t use either explicitly: we are not doing remote actor deployments and we are not watching remote actors. Still, the problem happens. Any idea why?

patriknw · May 23, 2018, 5:49pm

Good, thanks for confirming that.

Next would be to try to find more information in the logs, around the time of the first mention of “quarantine”. Note that the log you supplied mentions “The remote system has quarantined this system.”

It is therefore more interesting to look at the log on the other side to find out what caused the quarantine.
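One way to follow this advice on a large log file is to locate the first line that mentions a quarantine and print the lines around it. The sketch below is plain Java with no Akka dependency; `QuarantineLogScan`, the substring `quarantin`, and the context radius of 20 lines are illustrative assumptions, not anything prescribed by Akka.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: finds the first quarantine mention in a log and its surrounding lines.
final class QuarantineLogScan {

    /** Index of the first line containing "quarantin" (case-insensitive), or -1 if none. */
    static int firstMention(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).toLowerCase(Locale.ROOT).contains("quarantin")) {
                return i;
            }
        }
        return -1;
    }

    /** Lines within `radius` lines of the first mention; empty if there is no mention. */
    static List<String> contextAroundFirstMention(List<String> lines, int radius) {
        int hit = firstMention(lines);
        if (hit < 0) {
            return Collections.emptyList();
        }
        int from = Math.max(0, hit - radius);
        int to = Math.min(lines.size(), hit + radius + 1);
        return lines.subList(from, to);
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file into memory; acceptable for a sketch, not for huge logs.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        contextAroundFirstMention(lines, 20).forEach(System.out::println);
    }
}
```

Running it against the log of the system that issued the quarantine (the "other side"), rather than the one that reports being quarantined, is what surfaces the events leading up to it.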
