Unexpected quarantine

Hello,

We’re running Akka 2.5.19 and are not using cluster (but instead remote Artery).
We see odd behavior in which one actor system occasionally quarantines another when the two communicate, due to a GracefulShutdownQuarantinedEvent.

For simplicity, consider the following setup with 3 different remote actor systems:

Server1
Client1
Client2

An ActorRef from Client2 is serialized and sent remotely (as part of a message) to Server1, which then sends that same serialized ActorRef to Client1.
Client1 is then able to use this ActorRef to send messages directly to Client2. Note that we do not use actor selection here; we only use the ActorRef received from Server1.

This solution usually works fine, but sometimes (not consistently), when Client2 is down for e.g. 6-7 minutes, Client1 puts the Client2 actor system in the quarantined state. These are the logs we have gathered from Akka and from subscribing to various lifecycle events:


2019-11-27T13:30:00.924+0100 | Association to [akka://AS-A13601@Client2:13601] having UID [6812985527646718599] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
2019-11-27T13:30:00.926+0100 | Received GracefulShutdownQuarantinedEvent for remote ActorSystem: Association to [akka://AS-A13601@Client2:13601] having UID [6812985527646718599] has been stopped. All messages to this UID will be delivered to dead letters. Reason: ActorSystem terminated
2019-11-27T13:30:01.057+0100 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-123-1#1043978606]
2019-11-27T13:30:01.059+0100 | started (akka.stream.impl.io.TLSActor@2b060802)
2019-11-27T13:30:01.059+0100 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$cb#-370902978]
2019-11-27T13:30:01.060+0100 | now supervising Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]
2019-11-27T13:30:01.060+0100 | started (akka.io.TcpOutgoingConnection@2ca859a6)
2019-11-27T13:30:01.061+0100 | now watched by Actor[akka://AS-A13601/system/IO-TCP/selectors/$a#1664229512]
2019-11-27T13:30:01.061+0100 | Resolving Client2 before connecting
2019-11-27T13:30:01.061+0100 | Resolution request for Client2 from Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]
2019-11-27T13:30:01.061+0100 | Clear system message delivery of [akka://AS-A13601@Client2:13601#6812985527646718599]
2019-11-27T13:30:01.072+0100 | Attempting connection to [Client2/10.61.92.136:13601]
2019-11-27T13:30:01.075+0100 | Could not establish connection to [Client2:13601] due to java.net.ConnectException: Connection refused
2019-11-27T13:30:01.076+0100 | stopped
2019-11-27T13:30:01.076+0100 | received AutoReceiveMessage Envelope(Terminated(Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962]),Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/66#262019962])
2019-11-27T13:30:01.080+0100 | no longer watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$cb#-370902978]
2019-11-27T13:30:01.080+0100 | closing output
2019-11-27T13:30:01.080+0100 | stopped
2019-11-27T13:30:01.080+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], control stream] Upstream failed, cause: StreamTcpException: Tcp command [Connect(Client2:13601,None,List(),Some(5000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused
2019-11-27T13:30:01.080+0100 | Restarting graph due to failure. stack_trace:  (akka.stream.StreamTcpException: Tcp command [Connect(Client2:13601,None,List(),Some(5000 milliseconds),true)] failed because of java.net.ConnectException: Connection refused)
2019-11-27T13:30:01.081+0100 | Restarting graph in 2020836690 nanoseconds
2019-11-27T13:30:03.954+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], message stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:
2019-11-27T13:30:03.959+0100 | Outbound message stream to [akka://AS-A13601@Client2:13601] was quarantined and stopped. It will be restarted if used again.
2019-11-27T13:30:03.959+0100 | stopped
2019-11-27T13:30:03.962+0100 | [outbound connection to [akka://AS-A13601@Client2:13601], control stream] Upstream failed, cause: Association$OutboundStreamStopQuarantinedSignal$:
2019-11-27T13:30:03.963+0100 | Outbound control stream to [akka://AS-A13601@Client2:13601] was quarantined and stopped. It will be restarted if used again.
2019-11-27T13:30:03.963+0100 | stopped
2019-11-27T13:37:42.531+0100 | Dropping message [SomeMessage] from [Actor[akka://AS-A13601/user/$Ne#161182002]] to [Actor[akka://AS-A13601@Client2:13601/user/$b/$a#-640698163]] due to quarantined system [akka://AS-A13601@Client2:13601]
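
A lifecycle subscription of the kind that produces the "Received GracefulShutdownQuarantinedEvent" line above can be sketched roughly as follows. This is only a minimal sketch under assumptions: the class name RemotingEventLogger is hypothetical, and it assumes the quarantine events extend akka.remote.RemotingLifecycleEvent and are published on the actor system's event stream.

```scala
import akka.actor.{Actor, ActorLogging}
import akka.remote.RemotingLifecycleEvent

// Hypothetical listener actor, for illustration only.
class RemotingEventLogger extends Actor with ActorLogging {

  // Subscribe to all remoting lifecycle events when the actor starts.
  override def preStart(): Unit =
    context.system.eventStream.subscribe(self, classOf[RemotingLifecycleEvent])

  // Remove the subscription again when the actor stops.
  override def postStop(): Unit =
    context.system.eventStream.unsubscribe(self)

  def receive: Receive = {
    case event: RemotingLifecycleEvent =>
      // Log the concrete event type together with its own description.
      log.info("Received {} for remote ActorSystem: {}",
        event.getClass.getSimpleName, event)
  }
}
```

Such an actor would be started once per actor system, e.g. with system.actorOf(Props[RemotingEventLogger]).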

The interesting aspect here is that restarting Client2 does not help! The restart causes a new ActorRef to be sent from the restarted Client2 to Server1, which forwards it to Client1, but Client1 continues to think that Client2 is in the quarantined state. After taking a heap dump I see that the Artery AssociationState’s field cachedAssociation seems to be stuck in the quarantined state even though a new ActorRef is used…

Any ideas on whether this is a bug, and whether there is a workaround through configuration etc.?
Is this really the idea behind quarantined actor systems?

Thanks & Best Regards,
Gustav Åkesson

Sounds like it could be a bug. If you could create an isolated reproducer, that would be great.

Thanks!

When my daily workload cools off I will try to reproduce it with a stand-alone application.
FYI - we’re currently able to work around this issue by doing an ActorSelection in Client1, i.e. we receive a Client2 ActorRef from Server1, pick out the ActorPath from that ActorRef, and then do an actor selection every time Client1 sends a message to Client2. This gets rid of the quarantine issue, and we now see this in the logs (where the issue was previously noticed):

2019-11-28T10:44:36.065+0100 | DEBUG | ult-dispatcher-4 | .a.Association(akka://AS-A13601) | j.Slf4jLogger$$anonfun$receive$1 88 | 436 - com.typesafe.akka.slf4j - 2.5.19 | Quarantine piercing attempt with message [SomeMessage] to [Actor[akka://AS-A13601@Client2:13601/]]

It looks like the actor selection then resolves the association’s graceful quarantine…
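
The workaround described above can be sketched roughly like this. It is a minimal sketch, not the actual application code: the actor Client1Sender and the messages Client2Ref and Outgoing are hypothetical names used only for illustration. The only Akka calls it relies on are ActorRef.path and context.actorSelection.

```scala
import akka.actor.{Actor, ActorRef}

// Hypothetical message types, for illustration only.
final case class Client2Ref(ref: ActorRef) // Client2 ActorRef forwarded by Server1
final case class Outgoing(payload: Any)    // something Client1 wants to send to Client2

// Hypothetical actor running in Client1.
class Client1Sender extends Actor {

  // Most recent Client2 ActorRef received from Server1, if any.
  private var client2: Option[ActorRef] = None

  def receive: Receive = {
    case Client2Ref(ref) =>
      client2 = Some(ref)

    case Outgoing(payload) =>
      // Instead of `ref ! payload`, take the ActorPath from the ActorRef
      // and send through an actor selection on every message.
      client2.foreach { ref =>
        context.actorSelection(ref.path) ! payload
      }
  }
}
```

Note that an actor selection addresses the path rather than one specific actor incarnation, so messages sent this way are not tied to the UID carried by the original ActorRef.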

Thanks & Best Regards,
Gustav Åkesson