Delivery guarantee for local messages in distributed pub sub

It is documented that the message delivery guarantee in distributed pub sub mode is at-most-once delivery; in other words, messages can be lost over the wire.

Does that imply guaranteed delivery if the publisher and subscriber are on the same node?


I haven’t done any testing of this. This is based on my hands-on understanding of the Akka reliability mechanisms, combined with my theoretical knowledge of the distributed pub sub library.


It’s a grey area. Delivery would be very reliable if the publisher and subscriber are both local, but it wouldn’t meet the classical definition of guaranteed delivery.


Let’s start with the actor-to-actor delivery guarantees, as they are the underlying building block of distributed pub sub.

Start by reading the general docs on message reliability and then focus on the section for local message delivery.

That referenced local delivery section says that local messages will not be lost except under the following circumstances: a StackOverflowError, OutOfMemoryError, or other VirtualMachineError; a full (bounded) mailbox; the actor failing while processing the message; or the actor having already terminated (i.e. the destination no longer exists). These exceptions alone probably disqualify local message delivery from being called “guaranteed delivery” under the classical definition.

But that list also relies on a technicality of “delivered”. Just because a message is successfully enqueued into an actor’s mailbox doesn’t necessarily mean it will get processed. Maybe that actor will receive a stop or kill message before our message gets processed. Maybe the node will be shut down. Maybe the datacenter loses power.

Since we are talking “pub sub” and not just point-to-point actors, we must also consider failures in the “relaying” done by the Topic actor. First, every error above now has a second place to happen: the OutOfMemoryError could strike while the message is sitting in the Topic’s mailbox or in the final destination’s mailbox. But we also have to consider the possibility of the Topic actor being thread-starved or otherwise unable to process its mailbox. “Delivery” in a local actor-to-actor send is just a simple enqueue, but for our pub sub message to make it to its final destination, the Topic actor must be healthy enough to actually relay the message.
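To make that extra hop concrete, here is a minimal sketch using the Akka Typed `Topic` API (the `Greeting` message type and actor names are invented for illustration). A publish is just a local tell into the Topic actor’s mailbox; the Topic must then itself be scheduled and healthy in order to enqueue the message into each subscriber’s mailbox — two enqueues, and therefore two places for the failures above to occur:

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.pubsub.Topic
import akka.actor.typed.scaladsl.Behaviors

// Hypothetical message type, for illustration only.
final case class Greeting(text: String)

object PubSubSketch {
  def apply(): Behavior[Nothing] = Behaviors.setup[Nothing] { context =>
    // The Topic is itself an ordinary local actor with its own mailbox.
    val topic = context.spawn(Topic[Greeting]("greetings"), "GreetingsTopic")

    val subscriber = context.spawn(
      Behaviors.receiveMessage[Greeting] { msg =>
        // Processing can still fail here, after "delivery" has succeeded.
        println(s"got: ${msg.text}")
        Behaviors.same
      },
      "subscriber")

    topic ! Topic.Subscribe(subscriber)

    // Hop 1: enqueue into the Topic's mailbox.
    // Hop 2: the Topic, once scheduled, enqueues into the subscriber's mailbox.
    topic ! Topic.Publish(Greeting("hello"))

    Behaviors.empty
  }
}
```

Both hops are plain at-most-once local sends, so any of the failure conditions from the local delivery docs can strike at either one.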

Note that all of the above failures are going to be extremely rare. But such is the nature of guaranteed delivery: “at most once” systems like Akka will deliver 100% of the time under normal conditions. The only thing that separates an “at most once” system from an “at least once” system is how it handles failure conditions such as a node failure or network partition. (This “100% under non-error conditions” claim is evidenced by that same section of the docs, which explains that Akka’s automated tests check for 100% message delivery success.)

Which is why I would assert that what really distinguishes guaranteed message delivery is durability (or pseudo-durability, like replication). Guaranteed delivery means that once we accept a message, we guarantee its delivery even if the entire server explodes a nanosecond after we accept it. And for that to be true, we need to have saved the message durably, so that when the node recovers we can still process the message. (Although that then raises the question of how much redundancy/recovery our storage needs in order to be considered guaranteed. But that’s getting esoteric.)

Akka does offer that kind of guaranteed delivery, but it comes with its own set of tradeoffs (namely the need for persistence, significantly higher overhead, and potential loss of message ordering).
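For a sense of what that looks like, here is a sketch along the lines of the classic Akka Persistence `AtLeastOnceDelivery` pattern (names like `ReliableProducer` and `Msg` are my own; a persistence journal must also be configured, which I've omitted). The message intent is persisted before delivery is attempted, redelivery happens until the destination confirms, and — as noted above — redeliveries mean messages can arrive out of order:

```scala
import akka.actor.ActorSelection
import akka.persistence.{AtLeastOnceDelivery, PersistentActor}

// Hypothetical envelope and confirmation messages, for illustration only.
final case class Msg(deliveryId: Long, payload: String)
final case class Confirm(deliveryId: Long)

sealed trait Evt
final case class MsgSent(payload: String) extends Evt
final case class MsgConfirmed(deliveryId: Long) extends Evt

class ReliableProducer(destination: ActorSelection)
    extends PersistentActor
    with AtLeastOnceDelivery {

  override def persistenceId: String = "reliable-producer"

  override def receiveCommand: Receive = {
    // Persist the intent first; only then attempt delivery.
    case payload: String  => persist(MsgSent(payload))(updateState)
    case Confirm(id)      => persist(MsgConfirmed(id))(updateState)
  }

  // After a crash, replaying the journal resumes any unconfirmed deliveries.
  override def receiveRecover: Receive = {
    case evt: Evt => updateState(evt)
  }

  private def updateState(evt: Evt): Unit = evt match {
    case MsgSent(payload)  => deliver(destination)(id => Msg(id, payload))
    case MsgConfirmed(id)  => confirmDelivery(id)
  }
}
```

Notice how much machinery appears compared to a plain `topic ! Publish(...)`: a journal write per message, an explicit confirmation protocol, and redelivery logic — which is exactly the overhead/ordering tradeoff described above.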

Which brings me full circle: why are you asking? Make sure you read the Akka docs I linked above on delivery reliability. There’s a whole section on why they’ve made the tradeoffs they have, and the docs essentially assert that a delivery guarantee is of little use for most Akka use cases. And the pub sub docs you reference in your question specifically call out that if your use case needs reliable delivery, you should probably be using Kafka.

So why do you need reliable delivery? I think sometimes people just get worried about the message delivery failure conditions without realizing that the kinds of failures that would cause message loss (say, node failure, network partition, or datacenter failure) are the kinds of failures they couldn’t recover from anyway. Once you truly embrace the nature of distributed computing, the old Akka motto of “let it crash” makes a lot of sense: the ability to recover from failure is much more powerful than preventing failure. (Because preventing failure is largely an illusion.)

But, on the other hand, if you do need that kind of reliable delivery, and you are also looking for pub/sub functionality, then you really should be considering Kafka. That’s pretty much the canonical use case for Kafka, and it’s designed to handle it extremely efficiently.