Occassionally dropped messages

Hi guys,

I have a strange behavior in Akka remote artery that suddenly happened.
I have one client and two servers, communicating through Akka. The two servers have a master-slave architecture, in which the client primarily communicates with one of the servers, for retrieving info and periodic pings to ensure master is still running as the master.

What happens is that when one of the servers is master (server1.net.sw.se), there is a significant amount of dropped messages in client’s send queue.
Connection is established because some messages are sent and received, but most sent are dropped. If I switch master to the other (slave) server, communication is normal - no dropped messages. And if I then restart the problematic server process (server1.net.sw.se) and let that server become master again, the problem reoccurrs on client side!
So it seems something is permanently corrupt in the client’s actor system in how it communicates with actor system on server1.net.sw.se .

Only workaround to this was to restart the client process.
It is not any significant application load from the clients - only occassional ping messages and info exchange messages.

Both client- and server actor systems are subscribing to the ThisActorSystemQuarantinedEvent event, and restarts when received.
However, this doesn’t seem to be a quarantine case since some messages are being transmitted.

Akka version is 2.5.19.
Do you have any thoughts on this? Something that might be solved in 2.5.20 or 2.5.21, or something that we’ve missed in our akka usage?

Unfortunately, the issue is not happening right now so I’m not able to retrieve any more debugging from the observation.
Logs from client, when switching master server to the problematic server1.net.sw.se:

2019-02-26T20:22:39.368+0100 | .a.Association(akka://AS-A13601) | com.typesafe.akka.slf4j - 2.5.19 | Quarantine piercing attempt with message [ActorSelectionMessage(RegisterClientMessage)] to [Actor[akka://AS-A16300@server1.net.sw.se:16300/]]
2019-02-26T20:22:39.368+0100 | .a.Association(akka://AS-A13601) | com.typesafe.akka.slf4j - 2.5.19 | Dropping message [ActorSelectionMessage(RegisterClientMessage)] from [Actor[akka://AS-A13601/user/$a/$a#2021643780]] to [Actor[akka://AS-A16300@server1.net.sw.se:16300/]] due to overflow of send queue, size [3072]
2019-02-26T20:22:39.603+0100 | a.i.TcpListener                  | com.typesafe.akka.slf4j - 2.5.19 | New connection accepted
2019-02-26T20:22:39.603+0100 | a.i.SelectionHandler             | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/3180#884609065]
2019-02-26T20:22:39.603+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | started (akka.io.TcpIncomingConnection@5f600bd2)
2019-02-26T20:22:39.603+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/IO-TCP/selectors/$a#-893472475]
2019-02-26T20:22:39.604+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4315-1#-1482106711]
2019-02-26T20:22:39.604+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4315-0-map#-1262488769]
2019-02-26T20:22:39.604+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$SX#65519072]
2019-02-26T20:22:39.604+0100 | a.s.i.f.ActorGraphInterpreter    | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.fusing.ActorGraphInterpreter@265b3f56)
2019-02-26T20:22:39.604+0100 | a.s.i.i.TLSActor                 | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.io.TLSActor@1af2b499)
2019-02-26T20:22:39.605+0100 | a.i.TcpListener                  | com.typesafe.akka.slf4j - 2.5.19 | New connection accepted
2019-02-26T20:22:39.605+0100 | a.i.SelectionHandler             | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/3181#-1869580023]
2019-02-26T20:22:39.605+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | started (akka.io.TcpIncomingConnection@56745ad9)
2019-02-26T20:22:39.605+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/IO-TCP/selectors/$a#-893472475]
2019-02-26T20:22:39.606+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4316-1#-879100037]
2019-02-26T20:22:39.606+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4316-0-map#-426425576]
2019-02-26T20:22:39.606+0100 | a.i.TcpIncomingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$TX#-584777727]
2019-02-26T20:22:39.606+0100 | a.s.i.f.ActorGraphInterpreter    | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.fusing.ActorGraphInterpreter@4f80d2dc)
2019-02-26T20:22:39.606+0100 | a.s.i.i.TLSActor                 | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.io.TLSActor@758e4643)
2019-02-26T20:22:39.637+0100 | a.r.a.Decoder                    | com.typesafe.akka.slf4j - 2.5.19 | Decoded message but unable to record hits for compression as no remoteAddress known. No association yet?
2019-02-26T20:22:39.639+0100 | a.r.a.Decoder                    | com.typesafe.akka.slf4j - 2.5.19 | Decoded message but unable to record hits for compression as no remoteAddress known. No association yet?
2019-02-26T20:22:39.639+0100 | .a.Association(akka://AS-A13601) | com.typesafe.akka.slf4j - 2.5.19 | Starting outbound control stream to [akka://AS-A16300@server1.net.sw.se:16300]
2019-02-26T20:22:39.639+0100 | .a.Association(akka://AS-A13601) | com.typesafe.akka.slf4j - 2.5.19 | Incarnation 4 of association to [akka://AS-A16300@server1.net.sw.se:16300] with new UID[-4102320532937819035] (old UID [8096406311518835292])
2019-02-26T20:22:39.639+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4317-0-ignoreSink#616861197]
2019-02-26T20:22:39.640+0100 | a.s.i.f.ActorGraphInterpreter    | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.fusing.ActorGraphInterpreter@2fc7eacd)
2019-02-26T20:22:39.640+0100 | a.s.i.StreamSupervisor           | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/StreamSupervisor-0/remote-4319-1#-748879484]
2019-02-26T20:22:39.640+0100 | a.i.TcpManager                   | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$UX#256478128]
2019-02-26T20:22:39.640+0100 | a.i.SelectionHandler             | com.typesafe.akka.slf4j - 2.5.19 | now supervising Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/3182#741439304]
2019-02-26T20:22:39.641+0100 | a.s.i.i.TLSActor                 | com.typesafe.akka.slf4j - 2.5.19 | started (akka.stream.impl.io.TLSActor@5810f4bf)
2019-02-26T20:22:39.641+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | started (akka.io.TcpOutgoingConnection@56ee83b4)
2019-02-26T20:22:39.642+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/IO-TCP/selectors/$a#-893472475]
2019-02-26T20:22:39.643+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | Resolving server1.net.sw.se before connecting
2019-02-26T20:22:39.643+0100 | a.i.SimpleDnsManager             | com.typesafe.akka.slf4j - 2.5.19 | Resolution request for server1.net.sw.se from Actor[akka://AS-A13601/system/IO-TCP/selectors/$a/3182#741439304]
2019-02-26T20:22:39.643+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | Attempting connection to [server1.net.sw.se/10.61.204.42:16300]
2019-02-26T20:22:39.643+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | Connection established to [server1.net.sw.se:16300]
2019-02-26T20:22:39.643+0100 | a.i.TcpManager                   | com.typesafe.akka.slf4j - 2.5.19 | no longer watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$UX#256478128]
2019-02-26T20:22:39.643+0100 | a.i.TcpOutgoingConnection        | com.typesafe.akka.slf4j - 2.5.19 | now watched by Actor[akka://AS-A13601/system/StreamSupervisor-0/$$UX#256478128]
2019-02-26T20:22:39.668+0100 | r.a.c.InboundActorRefCompression | com.typesafe.akka.slf4j - 2.5.19 | Initializing InboundActorRefCompression for originUid [-4102320532937819035]
2019-02-26T20:22:39.668+0100 | r.a.c.InboundManifestCompression | com.typesafe.akka.slf4j - 2.5.19 | Initializing InboundManifestCompression for originUid [-4102320532937819035]
2019-02-26T20:22:39.669+0100 | .a.Association(akka://AS-A13601) | com.typesafe.akka.slf4j - 2.5.19 | Dropping message [PingMessage] from [Actor[akka://AS-A13601/user/$a/$a#2021643780]] to [Actor[akka://AS-A16300@server1.net.sw.se:16300/user/proc/$c#1156750475]] due to overflow of send queue, size [3072]

Best Regards,

I don’t think there has been any changes in artery lately, a quick skim of the issuetracker says the last change was in 2.5.13 (#24973 and #24972) back in May last year.

The send buffer belongs to the association to a specific remote, and the queue will fill up if you feed it more messages than can pass across the network, if that is the case you’ll need to implement some sort of flow control to not do that.

We haven’t seen any performance degradation around artery lately though, that could explain why sending messages would be slower and let the buffer fill up, could there be something in your server infra that has changed? Also, if it isn’t all the time it could simply be some resource starvation perhaps?

As a side note, just so you are aware: we do not recommend using remoting for anything else than as the underlying layer for cluster, and especially for creating client-server solutions - the protocol is designed for peer to peer, see section in the docs here https://doc.akka.io/docs/akka/current/remoting-artery.html#introduction

Thanks!

I couldn’t find anything either in later releases… upgrading is like a shot in the dark, hoping not to encounter this issue again. I saw that artery was upgraded as well.

What puzzles me is that it required a restart of the client to get rid of the issue. After establishing the connection with server, the send queue was instantly full just as if there is retained state from previous incarnation of the remote actor system. Restarting the server processes had no impact.
I don’t think it is network-related since all hosts are within same low-latency network, and it is not many messages being exchanged between these peers (besides, one server worked and the other didn’t). Additionally, the systems were practically idle with 8 CPUs for their disposal.

Thanks for the recommendation pointer! But we implemented this way back before Akka cluster, and has just uppgraded to Akka 2.5.19, thus moving to artery as well.

Ah, yes, there has been Aeron upgrades in a few versions, I didn’t think about that: https://github.com/akka/akka/pulls?q=is%3Apr+aeron+is%3Aclosed

I don’t think any of the upgrades were in response to any known issues though, just keeping on top of their releases. Are you running with the external or the embedded media driver?

If you find out a way to repeat the issue it would be interesting to know more about the circumstances.

… we implemented this way back before Akka cluster …

Whoa, way to go application lifecycle, impressive!

Note that there is also a TCP version of artery that does not use Aeron/UDP but is built on top of Akka IO/Akka Streams.

Many thanks!
I will monitor this case, and if it happens again I will hopefully be able to extract more details on the matter.